TrackMyHashtag
Type- Corona Virus (Covid-19) Tweet Metadata Compilation 2020
https://www.trackmyhashtag.com/data/COVID-19.zip
The Twitter dataset comprises 60K random tweets from public Twitter accounts related to the search term “Covid-19”. The dataset was amassed using Twitter’s Stream API over a period of 8 weeks (1st Dec, 2019 – 28 Jan 2020). The dataset is available in Excel/CSV format and is segmented into 3 fields, namely tweet data, images, and videos.
TrackMyHashtag
Type- Tweet analysis of top 50 Twitter profiles 2020
https://www.drive.google.com/drive/folders/11w4geFB6p17hFlWseBpHJQbhARINvTOc
This Twitter dataset comprises the most recent 3,200 tweets from each of the top 50 Twitter profiles for the year 2020, in raw Excel/CSV format. The data also includes comprehensive PDF analytical reports for each Twitter profile.
Archive.org
Type- Miscellaneous research data (2013-2018).
https://www.archive.org/details/twitterstream
This is a collection of free Twitter datasets gathered through the stream for sentiment analysis, research, history, testing, and data retention. The archive holds a large volume of data that can be sorted and used as needed, so you can pick out the stream you need. The Twitter datasets available here can be downloaded for free.
Data.world
Type- Twitter accounts of MNCs and influential people.
https://www.data.world/datasets/twitter
Data.world is a free Twitter dataset repository. Users can find datasets ranging from companies to influential individuals. We can simply head over to the website and browse through their collection of Twitter datasets.
GitHub
Type- Russian troll tweets to celebrity accounts.
https://www.github.com/shaypal5/awesome-twitter-data
Like all things on GitHub, this is a free data repository. The datasets range from Elon Musk tweets to Russian troll tweets. Users can simply head over to the mentioned URL and browse through the collection of Twitter datasets.
Kaggle
Type- Scientific research data.
https://www.kaggle.com/datasets?search=twitter
Kaggle is a free online repository for sharing code, scientific data, and Twitter datasets. It hosts a large collection of user-submitted datasets that are free to download. The data ranges from environmental studies to tweets about demonetization in India.
ICWSM
Type- Academic research data.
https://www.icwsm.org/2015/datasets/datasets/
ICWSM is a data-sharing initiative with a vast collection of Twitter datasets. The collection is free to download; users only have to register on the website and sign a disclosure in which they agree not to share the report. These datasets can be extremely useful for academic research.
Figshare
Type- Data related to real-world events.
https://www.figshare.com/articles/Twitter_event_datasets_2012-2016_/5100460
This collection includes 30 different datasets associated with real-world events, collected between 2012 and 2016 using the streaming API with a set of keywords. As per the Twitter TOS, this data is available for non-commercial purposes only.
ISI.edu
Type- Old Twitter data from October 2010.
https://www.isi.edu/lerman/downloads/twitter/twitter2010.html
This dataset contains tweets that were posted on Twitter in October 2010. Although quite old, it might still be relevant to data miners and academicians. Just click on the link to download the dataset.
Trec.nist.gov
Type- Sample of 16 million unfiltered tweets.
trec.nist.gov/data/tweets/
This archive consists of approximately 16 million tweets sampled between January 23rd and February 8th. It is an unfiltered archive, so it contains both important and spam tweets. You just need to sign a disclosure agreeing not to use the data for commercial purposes, after which you can download the archive right away.
Kdnuggets
Type- Miscellaneous.
https://www.kdnuggets.com/datasets/index.html
Kdnuggets is a multi-centric portal that provides information on jobs, relevant courses, webinars, and free downloadable Twitter datasets as well. You can go directly to the link provided and browse through their collection of datasets.
GitHub troll tweets
Type- Russian troll tweets.
https://www.github.com/fivethirtyeight/russian-troll-tweets/
This GitHub archive provides a large dataset of Russian troll tweets. All the datasets are readily downloadable in CSV format.
GitHub scraped public tweets
Type- Miscellaneous public tweets.
archive.org/details/twitter_cikm_2010
This dataset is a collection of scraped public Twitter updates used in an academic project to study the geolocation data related to the tweets.
Mega.NZ Reddit data set
Type- Reddit comments data set.
mega.nz/#!ysBWXRqK!yPXLr25PgJi184pbJU3GtnqUY4wG7YvuPpxJjEmnb9A
This dataset contains all of Reddit’s publicly available comments, which can be used for large-scale analytical research. The file size is about 250 GB compressed and over 1 TB uncompressed. The link provided is for the torrent file, which can be easily downloaded using a torrent client.
Kaggle customer support data sets
Type- Customer support tweets.
https://www.kaggle.com/thoughtvector/customer-support-on-twitter
This dataset consists of over 3 million tweets by the customer support teams of various big brands and companies. It can be used to build and understand conversational models and to study modern customer support practices and their impact.
FollowTheHashtag
Type- Tweets from NASDAQ companies to UK geolocation tweet data.
followthehashtag.com/datasets/
FollowTheHashtag provides a collection of datasets ranging from the top 100 NASDAQ companies to UK geolocation tweet data. Just click on the link to browse the datasets.
Lionbridge
Type- Miscellaneous.
https://www.lionbridge.ai/datasets/top-20-twitter-datasets-for-natural-language-processing-and-machine-learning/
Lionbridge provides a comprehensive list of Twitter datasets which range from everyday news to tweets with the hashtag #Avengersendgame and so on. Just click on the link and browse through the list of their available datasets.
Academic torrents
Type- URLs posted on Twitter in October 2010.
academictorrents.com/details/d8b3a315172c8d804528762f37fa67db14577cdb
This dataset consists of URLs that were posted on Twitter in October 2010. The link will take you to the Torrent file which can be easily downloaded through a Torrent client.
Sentiment140
Type- Tweet sentiment analysis data.
https://www.help.sentiment140.com/for-students
Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter. It filters tweets by judging how negative or positive each tweet or comment is, analyzing emoticons to do so.
Docnow
Type- Miscellaneous.
https://www.docnow.io/catalog/
Docnow provides catalogs of datasets that are publicly available on the web. If you would like to turn this tweet identifier data back into the original JSON format, first download the datasets and then use the Hydrator desktop application, or Twarc if you are comfortable working at the command line.
Harvard dataverse
Type- USA presidential election tweets.
https://www.dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PDI7IN
This dataset contains the tweet ids of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016, and November 10, 2016, from the Twitter API using Social Feed Manager.
Dfreelon
Type- Miscellaneous.
https://www.dfreelon.org/2017/01/03/beyond-the-hashtags-twitter-data/
This Twitter dataset contains all 40,815,975 tweets matching at least one of the following 45 keywords that were posted between June 1, 2014, and May 31, 2015, and had not been deleted or protected as of July 2015. Head over to the link to find the list of the 45 keywords and download the data.
Kaist
Type- Miscellaneous public tweets.
https://www.an.kaist.ac.kr/traces/WWW2010.html
The dataset is a collection of 1.47 billion social relations, 4,262 trending topics, and 106 million tweets obtained from 41.7 million Twitter user profiles. It was used in a study to identify trending topics, identify influencers, rank profiles by number of followers or retweets, and analyze temporal behavior along with user participation.
MPI-SWS
Type- Miscellaneous Twitter Retweets
https://www.twitter.mpi-sws.org
The dataset contains user-to-user links from Twitter and different retweeting variations (RT, via, retweeting, retweet, HT, R/T, and the recycling symbol) per day. The data was accumulated to conduct a study aimed at visualizing media landscape, discovering topic authorities, crowd-sourced opinions, identifying topical content and characterizing information trade on Twitter.
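The retweet markers listed above could be detected with a regular expression along the following lines. This is only a sketch under stated assumptions: the pattern, the function name find_retweet_links, and the sample tweets are all hypothetical and are not taken from the MPI-SWS study.

```python
import re

# Markers named in the description: RT, via, retweeting, retweet, HT, R/T and
# the recycling symbol (U+267B). The pattern itself is an assumption made for
# illustration, not the one used to build the dataset.
RETWEET_MARKER = re.compile(
    r"(?P<marker>\b(?:retweeting|retweet|R/T|RT|HT|via)\b|\u267b)"
    r"\ufe0f?[:\s]*@(?P<user>\w+)",
    re.IGNORECASE,
)

def find_retweet_links(author, text):
    """Return (author, credited_user, marker) tuples found in one tweet."""
    return [
        (author, match.group("user"), match.group("marker"))
        for match in RETWEET_MARKER.finditer(text)
    ]

# Made-up tweets used only to exercise the pattern.
samples = [
    ("alice", "RT @bob: hello world"),
    ("alice", "great read (via @carol)"),
    ("erin", "\u267b @dave nice thread"),
    ("frank", "no credit given here"),
]

links = [link for author, text in samples for link in find_retweet_links(author, text)]
for link in links:
    print(link)
```

Requiring an @mention right after the marker keeps ordinary uses of words such as "via" from being counted as retweets.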
Sentiment140
Type- Miscellaneous public tweets
https://www.help.sentiment140.com/for-students
The data consists of tweets from Twitter users. The file is available in Excel/CSV format and is divided into six fields (tweet id, date of the tweet, polarity of the tweet, the query (lyx), the user that tweeted, and the text of the tweet).
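A six-field CSV file of this kind can be read with Python's standard csv module. The sketch below assumes the column order given in the description and a file with no header row; the field names in FIELDS, the name of the third (score) column, and the sample rows are assumptions for illustration, so check them against the file you actually download.

```python
import csv
import io

# Assumed column order, taken from the field list in the description above.
# The file is assumed to have no header row; adjust FIELDS if yours differs.
FIELDS = ["tweet_id", "date", "polarity", "query", "user", "text"]

def read_tweets(handle):
    """Yield one dict per CSV row, keyed by the assumed field names."""
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip malformed rows rather than guess at their layout
        yield dict(zip(FIELDS, row))

# Made-up sample rows standing in for the real download.
SAMPLE = io.StringIO(
    '"1","Mon Apr 06 22:19:45 2009","0","NO_QUERY","alice","missing my bus, again"\n'
    '"2","Mon Apr 06 22:20:01 2009","4","NO_QUERY","bob","what a great morning"\n'
)

tweets = list(read_tweets(SAMPLE))
for tweet in tweets:
    print(tweet["user"], tweet["polarity"], tweet["text"])
```

For a real file, replace SAMPLE with open(path, newline="", encoding=...) using whatever encoding the download requires.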
Thinknook
Type- Miscellaneous Tweets
https://www.thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
The dataset contains 1,578,627 tweets from public Twitter profiles, amassed to perform a Twitter sentiment analysis by the University of Michigan.
IEEEDATAPORT
Type- Miscellaneous public tweets
https://www.ieee-dataport.org/comment/221#comment-221
The link provides various tweet datasets collected by the LSTM model to perform sentiment analysis. The data is available as .db files, which are SQLite databases. Each .db file contains three columns: the date and time, the tweet, and the sentiment score for the tweet.
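SQLite files like these can be opened with Python's built-in sqlite3 module. The following is a minimal sketch: the table name tweets and the column names created_at, tweet, and sentiment are hypothetical, and an in-memory database with made-up rows stands in for a downloaded file.

```python
import sqlite3

# An in-memory database stands in for a downloaded .db file; with a real file,
# pass its path to sqlite3.connect() and skip the CREATE/INSERT setup below.
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names mirroring the three columns described above.
conn.execute("CREATE TABLE tweets (created_at TEXT, tweet TEXT, sentiment REAL)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [
        ("2020-03-01 10:00:00", "made-up tweet one", 0.5),
        ("2020-03-01 10:05:00", "made-up tweet two", -0.25),
        ("2020-03-01 10:10:00", "made-up tweet three", 0.5),
    ],
)

# Discover the table names actually present before querying an unfamiliar file.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]

# Read every row in timestamp order and compute the mean sentiment score.
rows = conn.execute(
    "SELECT created_at, tweet, sentiment FROM tweets ORDER BY created_at"
).fetchall()
average = conn.execute("SELECT AVG(sentiment) FROM tweets").fetchone()[0]

print(tables, len(rows), average)
conn.close()
```

Listing the tables through sqlite_master first is useful because the real schema may not match the names assumed here.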
Pushshift.io
Type- Kavanaugh Twitter Dataset
https://www.pushshift.io/kavanaugh-twitter-dataset/
This Twitter dataset was collected using the Twitter API over 3 weeks (Sept-22 to Oct 9, 2018). The dataset contains a total of 56 million tweets from 3.2 million unique accounts. The following keywords were tracked for this Twitter data: #Kavanaugh, “Supreme Court”, #KavanaughHearings, #KavanaughNomination.
GitHub
Type- Movie Rating Tweets
https://www.github.com/sidooms/MovieTweetings
This is a Twitter dataset consisting of ratings on movies contained in well-structured Tweets on Twitter. On a daily basis, the Twitter API is queried for the term “I rated #IMDB”. Through a series of regular expressions, relevant information such as user, movie, and rating is extracted. The ratings are then cross-referenced with the IMDB page to provide the genre metadata.
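The extraction step described above could look roughly like this. It is a sketch, not the project's actual code: the tweet layout ("I rated <title> <n>/10 <IMDB link> #IMDb"), the pattern, and the function name parse_rating are assumptions made for illustration.

```python
import re

# Assumed tweet layout: "I rated <title> <n>/10 <imdb.com/title/ttNNNNNNN> #IMDb".
# The real MovieTweetings expressions may differ; this pattern is illustrative.
RATING_PATTERN = re.compile(
    r"I rated (?P<title>.+?) (?P<rating>\d{1,2})/10\b"
    r".*?imdb\.com/title/(?P<imdb_id>tt\d+)",
    re.IGNORECASE,
)

def parse_rating(user, text):
    """Return a dict with user, movie, imdb_id and rating, or None if no match."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    rating = int(match.group("rating"))
    if not 1 <= rating <= 10:
        return None  # discard values outside the 1-10 rating scale
    return {
        "user": user,
        "movie": match.group("title"),
        "imdb_id": match.group("imdb_id"),
        "rating": rating,
    }

# Made-up tweets used only to exercise the pattern.
example = parse_rating(
    "alice", "I rated The Matrix 9/10 http://www.imdb.com/title/tt0133093 #IMDb"
)
print(example)
print(parse_rating("bob", "I loved this film so much"))
```

The extracted imdb_id is what a later step would use to look up genre metadata on the corresponding IMDB page.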
Zenodo.org
Type- 2017 German Elections Raw Tweet Data
https://www.zenodo.org/record/835735#.XnBrPS2B0cS
The Twitter dataset contains Twitter interactions related to German politicians from influential political parties, covering several months in the run-up to the 2017 German election campaigns. The dataset comprises raw data from more than 120,000 active users who generated more than 1,200,000 tweets.
MIR Group
Type- Twitter Event Detection Dataset
https://www.mir.dcs.gla.ac.uk/resources/
This Twitter dataset is a collection of 120 million tweets, with relevance judgments for over 500 events. You can obtain the dataset by agreeing to the dataset agreement. Once you have filled out the form, you will be given instructions on how to download the dataset.
Archive.org
Type- General Twitter Stream
https://www.archive.org/search.php?query=twitterstream&sort=-publicdate
The link provides various collections of Twitter datasets in JSON format accumulated from general Twitter streams. The data was collected for research, history, testing, and memory.
Data.world
Type- Gender Classifier Data
https://www.data.world/crowdflower/gender-classifier-data
The Twitter dataset was used to train a CrowdFlower AI gender predictor. The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.