Twitter Sentiment Analysis (Part 1)

#Emmys2022 — Part 1: Tweet Scraping and Data Preprocessing

Başak Buluz Kömeçoğlu
Heartbeat

--

Photo by Chris J. Davis on Unsplash

Twitter, which allows more than 319 million monthly active users to share their feelings and thoughts with the public, ranks 8th in global web traffic.

Although rule-free abbreviations (a consequence of the character limit) and the widespread use of everyday spoken language make Twitter data difficult to analyze, an average of 200 billion tweets per year [1] keeps the pulse of the public agenda, and Twitter has become an important data source for all segments of society. In fact, Twitter’s official application programming interface is known to receive an average of 15 billion calls per day [2]!

In this blog post, we will dig into this huge pile of data, which contains much important information waiting to be discovered, and build an end-to-end sentiment analysis application.

We will look for practical answers to the following questions: how do we extract data from Twitter, which preprocessing steps are performed before analysis, how do we classify tweets according to their sentiment, and how do we create a word cloud to identify and visualize the highest-frequency words in the classified tweets?

If you are ready, let’s get started!

Known as the Oscars of the television world, the 74th Emmy Award Ceremony was held recently and aroused great interest worldwide. Of course, this interest also found a place on social media.

The shares made on Twitter with the #Emmys2022 hashtag were brought to the forefront as the Global Trend Topic. We rolled up our sleeves to analyze the posts made under this hashtag end-to-end!

Implementation Step 1: Scrape tweets with Snscrape

Snscrape is a data scraper for many social networking services. Today, it offers data collection solutions for Facebook, Instagram, Twitter, Reddit, Telegram, VKontakte, Weibo, and Mastodon.

You can install the Snscrape library [3] with pip: pip install snscrape

Snscrape requires Python 3.8 or higher.

You can use the search_hashtag function defined below to capture tweets containing a given word or hashtag, shared between two dates, in a language determined by the user.

The search_hashtag function takes the following parameters, runs the search, and saves the date (Date) and text (Tweet) of the matching tweets as a pandas DataFrame exported in .csv format:

  • searchterm — searched word or hashtag
  • dt_since — start date in %Y-%m-%d format
  • dt_until — end date in %Y-%m-%d format
  • lang — language code
  • limit — maximum number of tweets
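The original helper is not reproduced in this post, but a minimal sketch of such a function, built on snscrape’s TwitterSearchScraper, could look like the following. The helper build_query, the default limit, and the output filename are our own illustrative assumptions, not the author’s exact code.

```python
import pandas as pd


def build_query(searchterm, dt_since, dt_until, lang):
    # Compose a Twitter advanced-search query string
    return f"{searchterm} since:{dt_since} until:{dt_until} lang:{lang}"


def search_hashtag(searchterm, dt_since, dt_until, lang, limit=100_000):
    # Imported lazily so build_query works even without snscrape installed
    import snscrape.modules.twitter as sntwitter

    query = build_query(searchterm, dt_since, dt_until, lang)
    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        # `content` was renamed `rawContent` in later snscrape releases
        rows.append([tweet.date, tweet.content])
    df = pd.DataFrame(rows, columns=["Date", "Tweet"])
    df.to_csv(f"{searchterm}_tweets.csv", index=False)
    return df
```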

With the help of the function we have now defined, we can scrape tweets posted in English between September 1st and September 14th containing the hashtag #Emmys2022.

search_hashtag('#Emmys2022', '2022-09-01', '2022-09-14', 'en')

As a result of the search, 52,867 tweets were retrieved; 150 duplicate tweets were deleted, leaving a total of 52,717 Twitter posts.
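Deduplication like this is a one-liner in pandas. Here is a toy illustration on a stand-in DataFrame whose columns mirror the Date/Tweet columns saved by the scraper (the sample rows are invented):

```python
import pandas as pd

# Stand-in for the scraped data; the real frame had 52,867 rows
df = pd.DataFrame({
    "Date": ["2022-09-13", "2022-09-13", "2022-09-14"],
    "Tweet": ["Loved the opening #Emmys2022",
              "Loved the opening #Emmys2022",
              "Congrats to the winners #Emmys2022"],
})

# Keep the first occurrence of each tweet text
df = df.drop_duplicates(subset="Tweet").reset_index(drop=True)
print(len(df))  # 2
```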

Implementation Step 2: Data Preprocessing for Sentiment Analysis and Word Cloud

1. Stopword Cleaning

First of all, the words we want to make visible in the word cloud are, of course, the words that play a major role in differentiating each emotional state. Therefore, we do not want to see stopwords: words that are used frequently in sentences but carry little semantic content.

So some preprocessing is still needed here. The stopword list is taken from the nltk (Natural Language Toolkit) library [4], and for word cloud generation the words “emmys,” “emmy,” “&,” and “amp” were added to it.

You can install the nltk library with pip: pip install nltk

nltk requires Python 3.7, 3.8, 3.9 or 3.10.

Using the stopwords corpus under nltk, we can list the English stop words as follows.

from nltk.corpus import stopwords
print(stopwords.words('english'))

English Stop Words List

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]


2. Cleaning Punctuation Marks and Repeated Letters

The presence of punctuation marks in the word cloud makes it difficult to read, so removing punctuation and some special characters is a common preprocessing step. In addition, on a noisy platform such as Twitter, where language rules are often ignored, users tend to stretch words out to reinforce or emphasize an emotion. For this reason, it is also necessary to check for, and collapse, any letter repeated consecutively more than two times in a word.
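These two operations can be sketched with Python’s string and re modules (the function name is ours):

```python
import re
import string


def clean_punct_and_repeats(text):
    # Strip punctuation and common special characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any character repeated more than twice in a row ("soooo" -> "soo")
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


print(clean_punct_and_repeats("Sooooo good!!!"))  # Soo good
```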

3. Cleaning the Emojis

Even though emojis carry information about the emotional state of a post, their presence in the word cloud is not desirable. Also, users do not only use ready-made emojis; they also create their own emoticons out of punctuation marks.

For example: using ‘:)’ instead of 😃.

Either way, removing emojis from text is an important pre-process before creating a word cloud.
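A sketch of this step, covering both Unicode emojis and simple text emoticons (the patterns below cover common ranges only and are our own assumptions, not an exhaustive emoji specification):

```python
import re

# Common emoji code-point ranges
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoji
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+"
)

# Text emoticons built from punctuation, e.g. ':)', ';-(' or '=D'
EMOTICON_PATTERN = re.compile(r"[:;=]-?[)(DPp]")


def remove_emojis(text):
    text = EMOJI_PATTERN.sub("", text)
    return EMOTICON_PATTERN.sub("", text)


print(remove_emojis("So happy 😃 :)").strip())  # So happy
```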

4. Some other preprocessing steps

Twitter posts contain hashtags, mentions, various links, and similar elements, so they have a very high noise level. For this reason, besides the preprocessing steps listed above, many small preprocessing steps are needed.

The function described below includes the following preprocessing steps, respectively:

  • Converting all characters to lowercase
  • Cleaning the numbers
  • Cleaning the URLs
  • Cleaning the Mentions
  • Cleaning Hashtags
  • Cleaning RT expressions
  • Replacing more than 2 dots with a space character
  • Limiting multiple consecutive spaces to a single space
  • Cleaning some special characters
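The original cleaning function is not reproduced here, but a regex-based sketch applying these steps could look like this (the function name and exact patterns are our own; URLs are removed before numbers so that digits inside links are stripped along with the rest of the link):

```python
import re


def clean_tweet(text):
    text = text.lower()                           # lowercase
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)              # mentions
    text = re.sub(r"#\w+", "", text)              # hashtags
    text = re.sub(r"\brt\b", "", text)            # RT expressions
    text = re.sub(r"\d+", "", text)               # numbers
    text = re.sub(r"\.{2,}", " ", text)           # runs of dots -> space
    text = re.sub(r"[^a-z\s]", "", text)          # remaining special characters
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace
```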

Let’s see how these preprocessing steps change a sample tweet!

Original Tweet: ‘With every snub (Better Call Saul having zero Emmys to its name, Only Murders shut out) there was a nice surprise (A+ for Abbott Elementary!) at the #Emmys2022. https://t.co/IS9xBQZd9f’

Post-process Tweet: ‘with every snub better call saul having zero emmys to its name only murders shut out there was a nice surprise a for abbott elementary at the’

We scraped the tweets shared under the #Emmys2022 hashtag chosen as our case study, and after many preprocessing steps our texts are now ready for analysis. We can now classify the tweets by sentiment and create a word cloud for a more comprehensive analysis.

Continue with Part 2 of this blog series to complete the end-to-end implementation!

Happy reading ☕

References:

  1. Sayce, D. (2020). The Number of tweets per day in 2020. Retrieved from Dsayce
  2. Pandya, A., Oussalah, M., Kostakos, P., & Fatima, U. (2020, June). Mated: metadata-assisted twitter event detection system. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 402–414). Springer, Cham.
  3. Snscrape Library
  4. Natural Language Toolkit (nltk Library)

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.


Research Assistant at Information Technologies Institute of Gebze Technical University | PhD Candidate at Gebze Technical University