Twitter Analysis of the New Omicron Variant

This project examines Twitter discussion of Omicron, the latest variant of the COVID-19 virus. The dataset was collected from Twitter and was last updated on December 16th with recent tweets about the variant. It depicts how users feel about Omicron. The goal of the project is to visualize user sentiment and descriptive trends in Twitter posts on Omicron topics.

Data Import

Question 1: How many tweets are there in the dataset? Is any data missing in the user description, user location, and hashtags columns?

There are (how many?) tweets.
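One way to approach Question 1 is sketched below. It assumes the data is loaded into a pandas DataFrame with one row per tweet; the toy frame and the column names `user_description`, `user_location`, and `hashtags` are assumptions and should be adjusted to match the real file.

```python
import pandas as pd

# Toy stand-in for the Omicron dataset; the column names are assumptions.
omicron = pd.DataFrame({
    "user_description": ["nurse", None, "data nerd", None],
    "user_location": ["Hanoi", "London", None, "Paris"],
    "hashtags": ["['Omicron']", None, None, None],
    "text": ["a", "b", "c", "d"],
})

# One row per tweet, so the row count is the tweet count.
n_tweets = len(omicron)

# Count missing values (NaN/None) in each column of interest.
missing = omicron[["user_description", "user_location", "hashtags"]].isna().sum()

print(f"There are {n_tweets} tweets.")
print(missing)
```

Note that `isna()` only catches true missing values. Empty strings or placeholder text such as "N/A" would need a separate check.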

Question 2: How many tweets were posted per day? Which days saw the highest and lowest number of tweets about Omicron?

What trend do you see in the number of Twitter posts about Omicron per day?
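A minimal sketch of the daily count for Question 2 follows. It assumes each tweet has a timestamp stored in a column named `date`; both the column name and the toy timestamps are assumptions, not values from the real dataset.

```python
import pandas as pd

# Toy timestamps; the "date" column name is an assumption.
tweets = pd.DataFrame({
    "date": [
        "2021-12-14 08:15:00", "2021-12-14 21:40:00",
        "2021-12-15 08:05:00", "2021-12-15 09:30:00", "2021-12-15 23:59:00",
        "2021-12-16 08:45:00",
    ]
})
tweets["date"] = pd.to_datetime(tweets["date"])

# Group by calendar day (dropping the time of day) and count rows.
per_day = tweets.groupby(tweets["date"].dt.date).size()

# Days with the most and the fewest tweets.
busiest_day = per_day.idxmax()
quietest_day = per_day.idxmin()

print(per_day)
print("Highest:", busiest_day, "Lowest:", quietest_day)
```

If the first or last day in the data is only partially covered by the collection window, its count will look artificially low, which is worth keeping in mind when reading the trend.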

Question 3: How many tweets were posted in each hour of the day, across all days? Which hours saw the highest and lowest number of tweets about Omicron?

What trend do you see in the number of Twitter posts about Omicron per hour?
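The hourly count for Question 3 could be sketched as below, pooling all days together. As before, the `date` column name and the toy timestamps are assumptions.

```python
import pandas as pd

# Toy timestamps; the "date" column name is an assumption.
tweets = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-12-14 08:15:00", "2021-12-14 21:40:00",
        "2021-12-15 08:05:00", "2021-12-15 09:30:00", "2021-12-15 23:59:00",
        "2021-12-16 08:45:00",
    ])
})

# Count tweets by hour of day (0-23) across all days.
# reindex fills in hours that never appear, so they count as 0.
per_hour = (
    tweets["date"].dt.hour
    .value_counts()
    .reindex(range(24), fill_value=0)
)

print(per_hour)
print("Busiest hour:", per_hour.idxmax())
```

Filling in the empty hours matters for the "lowest" part of the question: without it, an hour with no tweets at all would simply be absent from the result. Timestamps are usually in UTC, so the hours may not match users' local time.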

Question 4: What is the distribution of text length among Twitter posts about Omicron?

Text Cleaning

In this section, we clean the raw text of the tweets so that we can extract information about users' perceptions of Omicron.

The cleaning steps for our data are: removing emojis, punctuation, links, newline characters, hashtags, and symbols, and filtering out special characters such as & and $ that appear in the text.
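The steps above can be sketched with the standard library alone. The function name `clean_tweet` is hypothetical, and the emoji step is deliberately crude: it drops every non-ASCII character, which also removes accented letters and non-Latin scripts, so treat it as a starting point rather than the project's actual cleaning routine.

```python
import html
import re
import string


def clean_tweet(text: str) -> str:
    """Hypothetical helper: apply the cleaning steps to one tweet."""
    text = html.unescape(text)                          # "&amp;" -> "&"
    text = re.sub(r"[\r\n]+", " ", text)                # newline characters
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # links
    text = re.sub(r"#\w+", "", text)                    # hashtags (symbol and word)
    text = text.encode("ascii", "ignore").decode()      # emojis and all other non-ASCII
    # Punctuation and symbols, including & and $.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover spaces


raw = "Omicron is here!! 😷\nStay safe &amp; get boosted $$ #Omicron https://t.co/abc123"
print(clean_tweet(raw))
```

Links are removed before hashtags and punctuation so that a URL is deleted as a whole instead of being broken into fragments first.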

You can now use the text_len column of the omicron dataset to understand the distribution of text length across posts. Before doing so, please choose an uninformative prior and an informative prior for the length distribution of Twitter posts, and give a reason for each choice.

Which distribution do you use for 1) the uninformative prior and 2) the informative prior? Why?

Now we can plot the distribution of text length of Twitter posts. You may consider using either a kernel density plot or a histogram.

What trend do you observe in the distribution of text length of tweets about Omicron? What can you infer from the graph?

Combining each prior distribution with the data likelihood, what will the posterior distribution be? How do the two posteriors differ? What other characteristics of the posterior distribution can we see with the kdeplot?
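To make the prior-to-posterior step concrete, here is one possible modelling choice, not necessarily the one the project intends: treat text length as a count with a Poisson likelihood and put a Gamma prior on its mean. With that pairing the posterior is again a Gamma, with shape increased by the sum of the observations and rate increased by their number. The prior parameters and the toy lengths below are made up for illustration.

```python
def gamma_poisson_posterior(prior_shape, prior_rate, counts):
    """Return (shape, rate) of the Gamma posterior for a Poisson mean.

    Assumes counts ~ Poisson(mean) and mean ~ Gamma(prior_shape, prior_rate).
    """
    return prior_shape + sum(counts), prior_rate + len(counts)


# Made-up text lengths, not real data.
text_len = [12, 18, 25, 9, 16]

# Nearly flat prior: carries almost no information about the mean.
weak = gamma_poisson_posterior(0.001, 0.001, text_len)

# Informative prior (an assumption): mean 200 / 10 = 20, worth about 10 observations.
informative = gamma_poisson_posterior(200.0, 10.0, text_len)

for name, (shape, rate) in {"weak": weak, "informative": informative}.items():
    print(f"{name}: posterior mean = {shape / rate:.2f}")
```

With the nearly flat prior the posterior mean sits almost exactly at the sample mean, while the informative prior pulls it toward the prior mean of 20. The pull weakens as more tweets are observed, which is one way to describe how the two posteriors differ.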

Question 5: What is the correlation between a user's number of Twitter followers and the number of retweets of their Omicron posts?

What trend do you observe in the relationship between user popularity and the sharing of Omicron posts, as measured by retweets? What can you infer from the graph?

Interpret your observations.
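A short sketch of the Question 5 calculation is given below. The column names `user_followers` and `retweets` and the toy numbers are assumptions. Follower counts are typically very skewed, so the sketch reports a rank-based (Spearman) correlation alongside the usual Pearson coefficient, computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy values; column names are assumptions.
df = pd.DataFrame({
    "user_followers": [10, 200, 3000, 50000, 120],
    "retweets": [0, 1, 5, 40, 2],
})

# Pearson: strength of the linear relationship on the raw values.
pearson = df["user_followers"].corr(df["retweets"])

# Spearman: Pearson correlation of the ranks, less sensitive to outliers.
ranks = df.rank()
spearman = ranks["user_followers"].corr(ranks["retweets"])

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients suggests that a few very popular accounts dominate the linear estimate, which is worth mentioning when interpreting the scatter plot.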

Question 6: Draw a heatmap. What does the correlation between variables tell you about Omicron?

What major associations do you observe in the heatmap?
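The heatmap in Question 6 is usually drawn from a correlation matrix of the numeric columns. The sketch below builds that matrix on toy data; every column name and value is an assumption.

```python
import pandas as pd

# Toy values; column names are assumptions.
df = pd.DataFrame({
    "user_name": ["a", "b", "c", "d", "e"],  # non-numeric, excluded below
    "user_followers": [10, 200, 3000, 50000, 120],
    "retweets": [0, 1, 5, 40, 2],
    "favorites": [1, 3, 20, 90, 2],
    "text_len": [12, 18, 25, 9, 16],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()

print(corr.round(2))
```

The resulting matrix can be passed to seaborn, for example `sns.heatmap(corr, annot=True)`, to draw the heatmap. The matrix is symmetric with ones on the diagonal, so the associations of interest are in the off-diagonal cells.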

Congratulations! You have reached the end of the project. Finally, what are one or two research questions that you think are relevant to Omicron and the dataset we have?