Import dataset & libraries

Drop columns

Update data types

Created is supposed to be a date, and some columns should be int, such as Thankful and Page Likes at Posting. Using a regular expression, I can detect which Created values have the wrong format.
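
A minimal sketch of the format check, assuming Created is stored as text in 'YYYY-MM-DD HH:MM:SS' form (the exact pattern and column label are assumptions):

```python
import pandas as pd

# Assumed format: 'YYYY-MM-DD HH:MM:SS'; adjust the pattern to the actual data.
date_pattern = r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$"

# Flag Created values that do not match the expected format
bad_created = ~df["Created"].astype(str).str.match(date_pattern)
print(df.loc[bad_created, "Created"].head())
```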

For now, set Thankful values that are not numbers to NA. Imputation will come next.
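
One way to do this with pandas, assuming the column labels below:

```python
# Coerce non-numeric entries to NaN; imputation handles them later
df["Thankful"] = pd.to_numeric(df["Thankful"], errors="coerce")
df["Page Likes at Posting"] = pd.to_numeric(df["Page Likes at Posting"], errors="coerce")
```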

Imputation

Use other numerical variables to impute 'Page Likes at Posting' and 'Thankful'.
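
A sketch of one way to do this with scikit-learn's IterativeImputer, which models each column with missing values as a function of the other numeric columns; the column list is an assumption:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Assumed numeric reaction/engagement columns; adjust to the dataset
num_cols = ["Likes", "Comments", "Shares", "Love", "Wow", "Haha",
            "Sad", "Angry", "Thankful", "Page Likes at Posting"]

imputer = IterativeImputer(random_state=0)
df[num_cols] = imputer.fit_transform(df[num_cols])
```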

Remove all posts shared from another page.

Too many values are missing, so drop this variable.

These sponsors are legitimate authorities in healthcare. Perhaps they are flagged as anti-vaccination due to the comments. In fact, Bill Gates is often accused of 'creating the virus'.

Sponsor name is dropped. For all the missing message, link text and description values, replace NA with an empty string.
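
A minimal sketch, assuming these column labels:

```python
df = df.drop(columns=["Sponsor Name"])    # assumed label
text_cols = ["Message", "Link Text", "Description"]
df[text_cols] = df[text_cols].fillna("")  # NA -> empty string
```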

Natural Language Processing

Combine message, link text and description into a new column, text.
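
For example, assuming the column labels from the previous step:

```python
# Concatenate the three text fields into one column
df["text"] = (df["Message"] + " " + df["Link Text"] + " " + df["Description"]).str.strip()
```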

Preprocessing removes web links, numbers and punctuation. After tokenisation and lemmatisation, non-English words and stopwords are removed.
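
A sketch of the pipeline with NLTK; the exact steps and their order here are assumptions:

```python
import re
import nltk
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "words", "wordnet"):
    nltk.download(pkg, quiet=True)

english_vocab = {w.lower() for w in words.words()}
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove web links
    text = re.sub(r"[^a-zA-Z\s]", " ", text)       # remove numbers and punctuation
    tokens = nltk.word_tokenize(text.lower())      # tokenise
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    # keep English words that are not stopwords
    return " ".join(t for t in lemmas if t in english_vocab and t not in stop_words)

df["clean_text"] = df["text"].apply(preprocess)
```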

I had to chunk this section, as it takes forever to run on my computer.

These are the rows with no text remaining at all after pre-processing.

I removed the rows that don't have any valid text. Now, I compute TF-IDF and pairwise distances.
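
A sketch with scikit-learn, assuming cosine distance (the metric isn't stated in the original):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(df["clean_text"])  # sparse document-term matrix
dist = cosine_distances(tfidf)                      # pairwise post-to-post distances
```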

Sentiment analysis can now be done.
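
Polarity and subjectivity show up later in the analysis, so a TextBlob-style sketch seems plausible (assumed, not confirmed by the source):

```python
from textblob import TextBlob

df["polarity"] = df["clean_text"].apply(lambda t: TextBlob(t).sentiment.polarity)
df["subjectivity"] = df["clean_text"].apply(lambda t: TextBlob(t).sentiment.subjectivity)
```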

EDA

For EDA, I will use the z-score to remove outliers, only for the purpose of EDA. I want to analyze the comment sections of the most commented posts, so I will still keep the original dataframe with outliers.
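
A sketch of the z-score filter; the |z| < 3 threshold and the column list are assumptions:

```python
import numpy as np
from scipy import stats

# num_cols as in the imputation sketch above
z = np.abs(stats.zscore(df[num_cols]))
df_eda = df[(z < 3).all(axis=1)]  # EDA copy; df itself keeps the outliers
```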

Angry reactions have the highest average, followed by love and share. Thankful is extremely low, almost 0 on average. It could be because Thankful was the reaction introduced last.

The source of the dataset did not explain what 'score' means; however, it seems to be moderately correlated with likes and shares. Likes and loves share a positive correlation of 0.5, perhaps because the two reactions are quite similar compared to the other reactions.

I create a dashboard to show box plots before and after outlier treatment. I create a class using the param library, and through bd2.param I get the selected metric. There are quite a lot of outliers. I had to use the append_end function to clear my plt figure, because there was always an extra line plot. To get the updated chart, I use gcf.
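
A minimal sketch of that kind of param class; the metric list, layout and figure handling are assumptions:

```python
import param
import panel as pn
import matplotlib.pyplot as plt

pn.extension()

class BoxplotDashboard(param.Parameterized):
    metric = param.Selector(objects=["Likes", "Comments", "Shares", "Angry"])

    @param.depends("metric")
    def view(self):
        # df and df_eda as in the z-score sketch above
        fig, axes = plt.subplots(1, 2, figsize=(8, 3))
        axes[0].boxplot(df[self.metric].dropna())
        axes[0].set_title(f"{self.metric} (with outliers)")
        axes[1].boxplot(df_eda[self.metric].dropna())
        axes[1].set_title(f"{self.metric} (outliers removed)")
        plt.close(fig)  # stop a stray extra figure from rendering
        return fig

bd2 = BoxplotDashboard()
pn.Column(bd2.param, bd2.view)
```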

Clustering TF-IDF

The main reason I am clustering in this section is that I want to analyze the comment sections of the most commented posts, so I am dividing the posts into clusters to sample the comments.

The elbow method does not work all the time; it is not clear cut in this case.

Peak performance is seen at n=6, and the increase in silhouette score diminishes after 6. So, the number of clusters is set to 6 and each post is assigned a cluster.
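
A sketch of the model selection and final fit; the k range and random_state are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare candidate cluster counts via inertia (elbow) and silhouette score
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)
    print(k, round(km.inertia_, 1), round(silhouette_score(tfidf, km.labels_), 3))

# n=6 performed best, so assign each post a cluster label
df["cluster"] = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(tfidf)
```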

There is a class imbalance issue.

I want to see the trend of each cluster over the created date, so I group by cluster and created.
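
For example, grouping by month (the time granularity is an assumption), assuming Created was parsed to datetime earlier:

```python
# Posts per cluster per month, as a wide table for plotting
trend = (df.groupby([df["Created"].dt.to_period("M"), "cluster"])
           .size()
           .unstack(fill_value=0))
```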

I built a dashboard here for my line plot to show cluster vs created date. I'm glad that there's at least a little data from 2020, meaning there are probably a few mentions of the pandemic.

This is similar to the last cell, but more complicated. I separate cluster vs created date into its own dashboard, because this dashboard will use the original dataframe. Through get_data, I subset the dataframe by the selected cluster number. With the subset dataframe, I can compute new TF-IDF, word cloud, average reactions and correlation plots to display in the dashboard.
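
A sketch of what get_data and the cluster-level recomputation might look like; the helper's exact signature is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def get_data(cluster_no):
    """Subset the original dataframe to the selected cluster (sketch)."""
    return df[df["cluster"] == cluster_no]

# num_cols as in the imputation sketch above
sub = get_data(3)
sub_tfidf = TfidfVectorizer().fit_transform(sub["clean_text"])  # cluster-level TF-IDF
avg_reactions = sub[num_cols].mean()                            # average reactions
corr = sub[num_cols + ["polarity", "subjectivity"]].corr()      # correlation plot input
```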

Unfortunately, for all clusters, polarity and subjectivity don't seem to be correlated with anything else. Likes tend to be correlated with comments, shares and love. To a lesser extent, I would say sad and angry are like this too.

In terms of the word clouds, it seems to me that clusters 3 and 4 are mostly words related to vets and animals. The fact that cluster 3 contains only one Facebook page, which is an animal shelter, further supports my point.

I wanted to do further analysis on the comments of the most commented posts, but unfortunately, quite a lot of the most commented posts were deleted or censored, presumably by Facebook.

So yeah, I'm afraid I've hit a roadblock here. I will start another anti-vax analysis project using a different dataset.