Introduction

The 2020 US Presidential Election was one of the most closely watched and contentious elections in recent history, and the rise of mass and social media has reshaped how people obtain information and express their opinions. This study explores the relationship between Twitter sentiment and the election result during the 2020 US Presidential Election, and examines how changes in media consumption and expression affect political attitudes and beliefs.

Load Data

The two datasets used in this study contain tweets posted between October 15th and November 4th, 2020: approximately 15,000 tweets mentioning Biden and 36,000 mentioning Trump.

trump_df <- read.csv('/Users/cathy/Documents/Columbia Sem 2/5205_R Framework/Final Project/Final/trump.csv')
trump_df <- tibble::rowid_to_column(trump_df, "id")
biden_df <- read.csv('/Users/cathy/Documents/Columbia Sem 2/5205_R Framework/Final Project/Final/biden.csv')
biden_df <- tibble::rowid_to_column(biden_df, "id")
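Before exploring, it can help to confirm the collection window. A minimal sanity check (a sketch; it assumes the numeric year/month/day columns shown in the structure output below are complete):

```r
# Sketch: verify each dataset falls inside the stated 10/15-11/4 window,
# using the numeric year/month/day columns rather than the raw timestamps
# (whose string formats differ between the two files).
date_range <- function(df) {
  range(as.Date(sprintf("%d-%02d-%02d", df$year, df$month, df$day)))
}
date_range(trump_df)
date_range(biden_df)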

Explore Data

Data Structure

str(trump_df)
## 'data.frame':    36554 obs. of  7 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ created_at: chr  "2020-10-15 00:00:08" "2020-10-15 00:00:26" "2020-10-15 00:01:14" "2020-10-15 00:01:30" ...
##  $ tweet     : chr  "You get a tie! And you get a tie! #Trump ‘s rally #Iowa https://t.co/jJalUUmh5D" "#Trump #PresidentTrump #Trump2020LandslideVictory #Trump2020 #MAGA #KAG #4MoreYears #America #AmericaFirst #All"| __truncated__ "#Trump: Nobody likes to tell you this, but some of the farmers were doing better the way I was doing it than th"| __truncated__ "@karatblood @KazePlays_JC Grab @realDonaldTrump by the balls &amp; chuck the bastard out the door onto #Pennsyl"| __truncated__ ...
##  $ tweetNew  : chr  "['get', 'tie', 'get', 'tie', 'trump', 'rally', 'iowa']" "['trump', 'presidenttrump', 'trump', 'landslidevictory', 'trump', 'maga', 'kag', 'moreyears', 'america', 'ameri"| __truncated__ "['trump', 'nobody', 'likes', 'tell', 'farmers', 'better', 'way', 'working', 'asses', 'check', 'totally', 'mail', 'right']" "['karatblood', 'kazeplays', 'jc', 'grab', 'realdonaldtrump', 'balls', 'amp', 'chuck', 'bastard', 'door', 'onto'"| __truncated__ ...
##  $ month     : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ day       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ year      : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
str(biden_df)
## 'data.frame':    15157 obs. of  7 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ created_at: chr  "10/15/20 0:01" "10/15/20 0:01" "10/15/20 0:03" "10/15/20 0:05" ...
##  $ tweet     : chr  "Comments on this? \"Do Democrats Understand how Ruthless China is?\" https://t.co/QevK00yhs3 #China #HunterBide"| __truncated__ "@RealJamesWoods #BidenCrimeFamily #JoeBiden #HunterBiden #HunterBidenEmails https://t.co/ottX1yP37j" "@realDonaldTrump #TrumpIsALaughingStock @realDonaldTrump at his Iowa cult rally compared #JoeBiden to Putin, XI"| __truncated__ "Laptop computer abandoned at Delaware repair shop contains #emails between #HunterBiden &amp; senior #Burisma #"| __truncated__ ...
##  $ tweetNew  : chr  "['comments', 'democrats', 'understand', 'ruthless', 'china', 'china', 'hunterbiden', 'joebiden', 'bidenharris',"| __truncated__ "['realjameswoods', 'bidencrimefamily', 'joebiden', 'hunterbiden', 'hunterbidenemails']" "['realdonaldtrump', 'trumpisalaughingstock', 'realdonaldtrump', 'iowa', 'cult', 'rally', 'compared', 'joebiden'"| __truncated__ "['laptop', 'computer', 'abandoned', 'delaware', 'repair', 'shop', 'contains', 'emails', 'hunterbiden', 'amp', '"| __truncated__ ...
##  $ month     : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ day       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ year      : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...

Common Words

Let's see which words are used most frequently in these tweets.

To do this, we employ the tidytext library, which uses a tidy data approach. Each tweet is tokenized into words and pivoted to a tall format, and dplyr functions are used to summarize, sort, and filter the top 25. We also need to remove stop words, which is accomplished with an anti_join against a list of stop words, tidytext::stop_words.

Trump

library(tidytext)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
trump_df %>%
  unnest_tokens(input = tweetNew, output = word) %>%
  select(word) %>%
  anti_join(stop_words)%>%
  group_by(word) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  arrange(desc(count)) %>%
  top_n(25)
## Joining, by = "word"
## Selecting by count
## # A tibble: 25 × 2
##    word            count
##    <chr>           <int>
##  1 trump           43989
##  2 election         6458
##  3 amp              5560
##  4 donaldtrump      3770
##  5 realdonaldtrump  3467
##  6 covid            3201
##  7 biden            3005
##  8 vote             2877
##  9 joebiden         2377
## 10 maga             2304
## # … with 15 more rows

Biden

biden_df %>%
  unnest_tokens(input = tweetNew, output = word) %>%
  select(word) %>%
  anti_join(stop_words)%>%
  group_by(word) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  arrange(desc(count)) %>%
  top_n(25)
## Joining, by = "word"
## Selecting by count
## # A tibble: 25 × 2
##    word        count
##    <chr>       <int>
##  1 joebiden    12450
##  2 biden        9034
##  3 bidenharris  4105
##  4 amp          2220
##  5 trump        2161
##  6 vote         2114
##  7 election     1938
##  8 joe          1556
##  9 debates      1468
## 10 hunterbiden   982
## # … with 15 more rows

Categorize

One of the simplest approaches to natural language processing is to categorize words based on their meaning. Words may be categorized by valence (positive or negative) or by emotion (e.g., happy, sad). This can be done conveniently using a relevant lexicon.
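The mechanics are just a dictionary join. As a toy illustration (hypothetical words and scores, not from any real lexicon):

```r
# Toy example of lexicon-based categorization: tokens that appear in the
# lexicon get a score; unmatched tokens ("rally", "vote") drop out, exactly
# as with the inner joins used below.
tokens  <- data.frame(word = c("great", "rally", "terrible", "vote"))
lexicon <- data.frame(word = c("great", "terrible"), value = c(3, -3))
scored  <- merge(tokens, lexicon, by = "word")   # base-R analogue of inner_join
mean(scored$value)                               # tweet-level score: mean of matched words
```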

Afinn Sentiment Lexicons

We begin with a lexicon that scores tokens by valence. The AFINN lexicon assigns each word an integer score between -5 and +5 based on the extent to which it is negative or positive.

afinn =read.csv('/Users/cathy/Documents/Columbia Sem 2/5205_R Framework/Final Project/Final/afinn.csv')
as.data.frame(get_sentiments('afinn'))[1:20,]
##          word value
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## 11  abilities     2
## 12    ability     2
## 13     aboard     1
## 14   absentee    -1
## 15  absentees    -1
## 16    absolve     2
## 17   absolved     2
## 18   absolves     2
## 19  absolving     2
## 20   absorbed     1

Trump

We match the words in the lexicon against the words in the tweets to compute a sentiment score for each tweet.

trump_df_1 <- trump_df %>%
  select(id, tweetNew) %>%
  group_by(id) %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  inner_join(afinn) %>%
  summarize(reviewSentiment = mean(value)) %>%
  ungroup()
## Joining, by = "word"
trump_sentiment <- trump_df_1 %>%
  inner_join(trump_df)
## Joining, by = "id"

We visualize the distribution of the sentiment scores.

library(ggplot2)
trump_sentiment %>%
  select(id,tweetNew)%>%
  group_by(id)%>%
  unnest_tokens(output=word,input=tweetNew)%>%
  inner_join(afinn)%>%
  summarize(reviewSentiment = mean(value))%>%
  ungroup()%>%
  ggplot(aes(x=reviewSentiment,fill=reviewSentiment>0))+
  geom_histogram(binwidth = 0.1)+
  scale_x_continuous(breaks=seq(-5,5,1))+
  scale_fill_manual(values=c('tomato','seagreen'))+
  guides(fill = "none")
## Joining, by = "word"

Biden

We match the words in the lexicon against the words in the tweets to compute a sentiment score for each tweet.

biden_df_1 <- biden_df %>%
  select(id, tweetNew) %>%
  group_by(id) %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  inner_join(afinn) %>%
  summarize(reviewSentiment = mean(value)) %>%
  ungroup()
## Joining, by = "word"
biden_sentiment <- biden_df_1 %>%
  inner_join(biden_df)
## Joining, by = "id"

We visualize the distribution of the sentiment scores.

biden_sentiment %>%
  select(id,tweetNew)%>%
  group_by(id)%>%
  unnest_tokens(output=word,input=tweetNew)%>%
  inner_join(afinn)%>%
  summarize(reviewSentiment = mean(value))%>%
  ungroup()%>%
  ggplot(aes(x=reviewSentiment,fill=reviewSentiment>0))+
  geom_histogram(binwidth = 0.1)+
  scale_x_continuous(breaks=seq(-5,5,1))+
  scale_fill_manual(values=c('tomato','seagreen'))+
  guides(fill = "none")+
  ylim(0,3000)
## Joining, by = "word"

Result

Tweets about Biden and the election were slightly more positive than negative, while tweets about Trump were relatively polarized, with a comparable distribution on both the positive and negative sides of the scale.
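To put numbers behind this visual comparison, one could summarize the per-tweet AFINN scores directly (a sketch, reusing the trump_sentiment and biden_sentiment objects built above):

```r
library(dplyr)

# Compare the two candidates on mean score and share of positive tweets.
bind_rows(
  mutate(trump_sentiment, candidate = "Trump"),
  mutate(biden_sentiment, candidate = "Biden")
) %>%
  group_by(candidate) %>%
  summarize(mean_score     = mean(reviewSentiment),
            share_positive = mean(reviewSentiment > 0))
```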

Binary Sentiment Lexicons

We will begin by examining lexicons that classify tokens into two categories based on valence, usually as positive or negative.

The “bing” lexicon categorizes words as being positive or negative. The lexicon is included with the tidytext library and can be accessed by calling get_sentiments('bing'). Here are the first twenty words.

bing =read.csv('/Users/cathy/Documents/Columbia Sem 2/5205_R Framework/Final Project/Final/bing.csv')
as.data.frame(get_sentiments('bing'))[1:20,]
##             word sentiment
## 1        2-faces  negative
## 2       abnormal  negative
## 3        abolish  negative
## 4     abominable  negative
## 5     abominably  negative
## 6      abominate  negative
## 7    abomination  negative
## 8          abort  negative
## 9        aborted  negative
## 10        aborts  negative
## 11        abound  positive
## 12       abounds  positive
## 13        abrade  negative
## 14      abrasive  negative
## 15        abrupt  negative
## 16      abruptly  negative
## 17       abscond  negative
## 18       absence  negative
## 19 absent-minded  negative
## 20      absentee  negative

Trump

We match the words in the lexicon against the words in the tweets to determine valence.

trump_sentiment%>%
  group_by(id)%>%
  unnest_tokens(output = word, input = tweetNew)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)
## Joining, by = "word"
## # A tibble: 77,675 × 3
## # Groups:   sentiment [2]
##       id word    sentiment
##    <int> <chr>   <chr>    
##  1     2 trump   positive 
##  2     2 trump   positive 
##  3     2 trump   positive 
##  4     2 winning positive 
##  5     3 trump   positive 
##  6     3 likes   positive 
##  7     3 better  positive 
##  8     3 right   positive 
##  9     4 bastard negative 
## 10     5 trump   positive 
## # … with 77,665 more rows
head(trump_sentiment)
## # A tibble: 6 × 8
##      id reviewSentiment created_at          tweet      tweet…¹ month   day  year
##   <int>           <dbl> <chr>               <chr>      <chr>   <int> <int> <int>
## 1     2           4     2020-10-15 00:00:26 "#Trump #… ['trum…    10    15  2020
## 2     3           2     2020-10-15 00:01:14 "#Trump: … ['trum…    10    15  2020
## 3     4          -2.33  2020-10-15 00:01:30 "@karatbl… ['kara…    10    15  2020
## 4     5          -0.714 2020-10-15 00:01:53 "#TheWeek… ['thew…    10    15  2020
## 5     6          -3     2020-10-15 00:02:14 "I have l… ['lost…    10    15  2020
## 6     8           1     2020-10-15 00:03:32 "#Trump: … ['trum…    10    15  2020
## # … with abbreviated variable name ¹​tweetNew

We visualize distribution of the sentiment score.

trump_sentiment%>%
  group_by(id)%>%
  unnest_tokens(output = word, input = tweetNew)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()
## Joining, by = "word"

Biden

biden_sentiment%>%
  group_by(id)%>%
  unnest_tokens(output = word, input = tweetNew)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)
## Joining, by = "word"
## # A tibble: 15,202 × 3
## # Groups:   sentiment [2]
##       id word        sentiment
##    <int> <chr>       <chr>    
##  1     3 sharp       positive 
##  2     6 misleading  negative 
##  3     8 threatening negative 
##  4    10 right       positive 
##  5    10 benefit     positive 
##  6    11 worry       negative 
##  7    16 scandal     negative 
##  8    16 lying       negative 
##  9    18 supported   positive 
## 10    18 upset       negative 
## # … with 15,192 more rows
biden_sentiment%>%
  group_by(id)%>%
  unnest_tokens(output = word, input = tweetNew)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  ylim(0, 45000)
## Joining, by = "word"

Result

The results depict the disparity in tweet volume between Biden and Trump. Lexicon-matched words in Biden tweets split relatively evenly between positive and negative, at about 15,000 matches in total, while Trump tweets yield far more matches, with positive words outnumbering negative ones by a wide margin.
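The positive/negative split can also be stated as shares rather than raw counts, which removes the volume disparity (a sketch, reusing the bing join above):

```r
library(dplyr)
library(tidytext)

# Share of bing-matched words that are positive vs. negative, per candidate.
bing_share <- function(df) {
  df %>%
    unnest_tokens(output = word, input = tweetNew) %>%
    inner_join(get_sentiments('bing'), by = "word") %>%
    count(sentiment) %>%
    mutate(share = n / sum(n))
}
bing_share(trump_sentiment)
bing_share(biden_sentiment)
```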

Binary Sentiment Analysis by Date

In order to examine the trend of tweet sentiment leading up to the election, we plotted line graphs of daily positive and negative word counts over the period from October 15th to November 3rd, 2020.

Trump

#Plot line chart Positive/ Negative sentiment of trump daily
trump_sentiment_daily <- trump_sentiment %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  inner_join(get_sentiments('bing')) %>%
  group_by(date, sentiment) %>%
  count() %>%
  ggplot(aes(x = date, y = n, color = sentiment)) +
  geom_line(linewidth = 1.1) + # set line thickness
  scale_color_manual(values = c("positive" = "#74C476", "negative" = "#F15854")) +
  labs(x = "Date", y = "Count", title = "Positive and Negative Sentiments of Trump by Date") +
  theme_minimal() +
  theme(text = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14))
## Joining, by = "word"
trump_sentiment_daily

Biden

#Plot line chart Positive/ Negative sentiment of Biden daily
biden_sentiment_daily <- biden_sentiment %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  inner_join(get_sentiments('bing')) %>%
  group_by(date, sentiment) %>%
  count() %>%
  ggplot(aes(x = date, y = n, color = sentiment)) +
  geom_line(linewidth = 1.1) + # set line thickness
  scale_color_manual(values = c("positive" = "#74C476", "negative" = "#F15854")) +
  labs(x = "Date", y = "Count", title = "Positive and Negative Sentiments of Biden by Date") +
  theme_minimal() +
  ylim(0, 4000) +
  theme(text = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14))
## Joining, by = "word"
biden_sentiment_daily

Result

We noticed two abrupt spikes in tweet counts for both Biden and Trump, on October 15th and October 22nd. These spikes corresponded to the presidential debate, originally scheduled for October 15th but postponed to October 22nd due to health safety concerns during the pandemic. The fluctuations in positive and negative tweets were dramatic and high in count for Trump, while tweets about Biden were generally lower in count and fluctuated only slightly. Furthermore, tweets about Trump were consistently more positive than negative throughout the entire two-week period leading up to the election. This was rather unexpected, as the higher positive tweet sentiment did not align with the final election result. As the campaign neared election day, November 3rd, both graphs show an increase, demonstrating intensifying sentiment and growing engagement on both sides of the race.
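To make the debate-related dates easier to read off the chart, they could be annotated directly on the saved ggplot object (a sketch; trump_sentiment_daily is the plot built above):

```r
library(ggplot2)

# Mark the originally scheduled (10/15) and rescheduled (10/22) debate dates.
trump_sentiment_daily +
  geom_vline(xintercept = as.Date(c("2020-10-15", "2020-10-22")),
             linetype = "dashed", color = "grey40") +
  annotate("text", x = as.Date("2020-10-22"), y = Inf,
           vjust = 1.5, hjust = -0.1, label = "Debate")
```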

NRC Sentiment Lexicon

The NRC Emotion Lexicon is a list of 5,636 English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

nrc =read.csv("/Users/cathy/Documents/Columbia Sem 2/5205_R Framework/Final Project/Final/nrc.csv")
head(nrc)
##        word sentiment
## 1    abacus     trust
## 2   abandon      fear
## 3   abandon  negative
## 4   abandon   sadness
## 5 abandoned     anger
## 6 abandoned      fear

Trump

trump_sentiment %>%
  group_by(id) %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  left_join(nrc, by = "word") %>%
  group_by(sentiment) %>%
  count() %>%
  filter(!is.na(sentiment)) %>% # Filter out records with no sentiment match in nrc dataset
  arrange(desc(n))
## # A tibble: 10 × 2
## # Groups:   sentiment [10]
##    sentiment        n
##    <chr>        <int>
##  1 surprise     35973
##  2 negative     31595
##  3 positive     28571
##  4 trust        21117
##  5 fear         16903
##  6 sadness      16328
##  7 anger        16113
##  8 anticipation 15507
##  9 joy          11544
## 10 disgust      10849
#graph it
trump_sentiment%>%
  group_by(id)%>%
  unnest_tokens(output = word, input = tweetNew)%>%
  left_join(nrc, by = "word") %>%
  group_by(sentiment)%>%
  count()%>%
  filter(!is.na(sentiment)) %>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()

Biden

biden_sentiment %>%
  group_by(id) %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  left_join(nrc, by = "word") %>%
  group_by(sentiment) %>%
  count() %>%
  filter(!is.na(sentiment)) %>% # Filter out records with no sentiment match in nrc dataset
  arrange(desc(n))
## # A tibble: 10 × 2
## # Groups:   sentiment [10]
##    sentiment        n
##    <chr>        <int>
##  1 positive     10726
##  2 negative      8867
##  3 trust         7914
##  4 anticipation  6020
##  5 sadness       5000
##  6 joy           4828
##  7 fear          4785
##  8 surprise      4719
##  9 anger         4621
## 10 disgust       2408
#graph it
biden_sentiment%>%
  group_by(id)%>%
  unnest_tokens(output = word, input = tweetNew)%>%
  left_join(nrc, by = "word") %>%
  group_by(sentiment)%>%
  count()%>%
  filter(!is.na(sentiment)) %>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  ylim(0,30000)

Result

To analyze tweet sentiment with more nuance, we applied the NRC lexicon, which categorizes words into a range of emotions rather than a binary positive/negative split. The graphs display the overall count of words in each category. Tweets about Biden mostly expressed positive sentiment, with over 10,000 matches, followed by negative and then trust. Tweets about Trump, on the other hand, revealed surprise as the dominant sentiment, with over 35,000 matches, followed by negative and then positive. Once again, we observed a significantly larger tweet volume for Trump than for Biden.
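The outsized surprise count for Trump is worth a closer look: a few high-frequency tokens may dominate the tally (possibly the token "trump" itself, if it appears in the lexicon). A sketch to list the top contributing words:

```r
library(dplyr)
library(tidytext)

# Top words contributing to the "surprise" category in Trump tweets.
trump_sentiment %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  inner_join(filter(nrc, sentiment == "surprise"), by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)
```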

NRC Sentiment Analysis by Date

To further inspect the NRC sentiments on tweets, we plotted the average daily count of each sentiment within two periods, 10/15-10/22 and 10/23-11/2. We also isolated election day, 11/3, as its own unit to focus on the count of tweets in each sentiment category.

Trump

# Create a function to extract the date from month and day columns
get_date <- function(month, day) {
  as.Date(paste("2020", month, day, sep = "-"))
}

# Use mutate to create a new column with date
trump_sentiment <- trump_sentiment %>%
  mutate(date = get_date(month, day))
head(trump_sentiment)
## # A tibble: 6 × 9
##      id reviewSentiment created_at    tweet tweet…¹ month   day  year date      
##   <int>           <dbl> <chr>         <chr> <chr>   <int> <int> <int> <date>    
## 1     2           4     2020-10-15 0… "#Tr… ['trum…    10    15  2020 2020-10-15
## 2     3           2     2020-10-15 0… "#Tr… ['trum…    10    15  2020 2020-10-15
## 3     4          -2.33  2020-10-15 0… "@ka… ['kara…    10    15  2020 2020-10-15
## 4     5          -0.714 2020-10-15 0… "#Th… ['thew…    10    15  2020 2020-10-15
## 5     6          -3     2020-10-15 0… "I h… ['lost…    10    15  2020 2020-10-15
## 6     8           1     2020-10-15 0… "#Tr… ['trum…    10    15  2020 2020-10-15
## # … with abbreviated variable name ¹​tweetNew
# Use case_when to create a new column with period
trump_sentiment <- trump_sentiment %>%
  mutate(period = case_when(
    date >= as.Date("2020-10-15") & date <= as.Date("2020-10-22") ~ "10/15-10/22",
    date >= as.Date("2020-10-23") & date <= as.Date("2020-11-02") ~ "10/23-11/2",
    date == as.Date("2020-11-03") ~ "11/3"
  ))
head(trump_sentiment)
## # A tibble: 6 × 10
##      id reviewSentim…¹ creat…² tweet tweet…³ month   day  year date       period
##   <int>          <dbl> <chr>   <chr> <chr>   <int> <int> <int> <date>     <chr> 
## 1     2          4     2020-1… "#Tr… ['trum…    10    15  2020 2020-10-15 10/15…
## 2     3          2     2020-1… "#Tr… ['trum…    10    15  2020 2020-10-15 10/15…
## 3     4         -2.33  2020-1… "@ka… ['kara…    10    15  2020 2020-10-15 10/15…
## 4     5         -0.714 2020-1… "#Th… ['thew…    10    15  2020 2020-10-15 10/15…
## 5     6         -3     2020-1… "I h… ['lost…    10    15  2020 2020-10-15 10/15…
## 6     8          1     2020-1… "#Tr… ['trum…    10    15  2020 2020-10-15 10/15…
## # … with abbreviated variable names ¹​reviewSentiment, ²​created_at, ³​tweetNew
# Group by period and sentiment

sentiment_by_period_trump <- trump_sentiment %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  left_join(nrc, by = "word") %>%
  group_by(period, sentiment) %>%
  summarize(total_count = n(),
            distinct_dates = n_distinct(date),
            .groups = "drop") %>%
  mutate(avg_count = if_else(distinct_dates > 1, as.double(total_count) / distinct_dates, as.double(total_count))) %>%
  mutate(avg_count = round(avg_count)) %>% 
  filter(!is.na(sentiment), !is.na(period))
head(sentiment_by_period_trump)
## # A tibble: 6 × 5
##   period      sentiment    total_count distinct_dates avg_count
##   <chr>       <chr>              <int>          <int>     <dbl>
## 1 10/15-10/22 anger               5116              8       640
## 2 10/15-10/22 anticipation        4909              8       614
## 3 10/15-10/22 disgust             3783              8       473
## 4 10/15-10/22 fear                5835              8       729
## 5 10/15-10/22 joy                 3535              8       442
## 6 10/15-10/22 negative           10789              8      1349
ggplot(sentiment_by_period_trump, aes(x = period, y = avg_count, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  guides(fill = "none") +
  facet_wrap(~sentiment, scales = "free_y") +
  theme_minimal() +
  labs(title = "Average count of sentiment by Trump by week")

Biden

# Create a function to extract the date from month and day columns
get_date <- function(month, day) {
  as.Date(paste("2020", month, day, sep = "-"))
}

# Use mutate to create a new column with date
biden_sentiment <- biden_sentiment %>%
  mutate(date = get_date(month, day))

# Use case_when to create a new column with period
biden_sentiment <- biden_sentiment %>%
  mutate(period = case_when(
    date >= as.Date("2020-10-15") & date <= as.Date("2020-10-22") ~ "10/15-10/22",
    date >= as.Date("2020-10-23") & date <= as.Date("2020-11-02") ~ "10/23-11/2",
    date == as.Date("2020-11-03") ~ "11/3"
  ))

# Group by period and sentiment

sentiment_by_period_biden <- biden_sentiment %>%
  unnest_tokens(output = word, input = tweetNew) %>%
  left_join(nrc, by = "word") %>%
  group_by(period, sentiment) %>%
  summarize(total_count = n(),
            distinct_dates = n_distinct(date),
            .groups = "drop") %>%
  mutate(avg_count = if_else(distinct_dates > 1, as.double(total_count) / distinct_dates, as.double(total_count))) %>%
  mutate(avg_count = round(avg_count)) %>% 
  filter(!is.na(sentiment), !is.na(period))


# Create a bar chart with sentiment average count by period
ggplot(sentiment_by_period_biden, aes(x = period, y = avg_count, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  ylim(0, 2500)+
  guides(fill = "none") +
  facet_wrap(~sentiment, scales = "free_y") +
  theme_minimal() +
  labs(title = "Average count of sentiment by Biden by week")

Result

For tweets about both Biden and Trump, the average daily count of words in each sentiment category increased from period to period. However, the increase was much more substantial and distinct for tweets about Trump.
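A direct side-by-side view would make this contrast explicit (a sketch combining the two period summaries built above; since raw volumes differ so much between candidates, free y-scales or shares may be preferable to a common axis):

```r
library(dplyr)
library(ggplot2)

# Dodged bars: average daily sentiment counts per period, both candidates.
bind_rows(
  mutate(sentiment_by_period_trump, candidate = "Trump"),
  mutate(sentiment_by_period_biden, candidate = "Biden")
) %>%
  ggplot(aes(x = period, y = avg_count, fill = candidate)) +
  geom_col(position = "dodge") +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Average daily sentiment counts by period: Trump vs. Biden")
```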