#279: Sentiment Analysis in Python
Sentiment analysis is a powerful tool that allows us to understand the emotions and opinions behind written text. Be it reviews, social media posts, or customer feedback – if we know the emotions and how strong they are, we can flag important texts and prioritise them to address problems before they blow up.
In this post we use pre-trained models from Hugging Face and the Natural Language Toolkit (NLTK). That way we can run everything locally and jump directly to the analysis part to get fast feedback.
Why compare different models?
Since I am new to the topic, I do not yet have the knowledge to pick a good model for sentiment analysis. To help me find something suitable, I look at five frequently named models and compare their results. The reviews I want to analyse look a bit like this list:
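The review texts can be rebuilt from the text column of the comparison table at the end of this post. The variable name `reviews` is an assumption for illustration:

```python
# Review texts copied from the comparison table at the end of the post.
# The name `reviews` is an assumption, not necessarily the original one.
reviews = [
    "This product is fantastic! I love it.",
    "Terrible quality, broke after one use.",
    "Absolutely wonderful. Highly recommend!",
    "Worst purchase I have ever made.",
    "Pretty decent, does the job.",
    "Not worth the money.",
    "Exceeded my expectations!",
    "Very disappointing experience.",
    "It's okay, not the best but not the worst.",
    "I am extremely satisfied with this.",
    "It works.",
    "It is fine.",
    "OK for the price.",
    ":-(",
    ":-)",
]
```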
It is important that we use test cases that match the data we want to work with. If we test with such reviews and then work on completely different text snippets, our results may be totally different.
Install all the packages
For the comparison experiment we need to install several packages. It is best to create a virtual environment with uv, activate it and then run this command to install everything we need:
In our script we need these import statements to work with our freshly installed libraries:
VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool in the NLTK library. It is specifically designed to analyse sentiments expressed in social media and uses a dictionary of words and grammatical rules to determine sentiment polarity and intensity.
We first need to download the VADER lexicon; then we can create a SentimentIntensityAnalyzer that checks our text. From the result it is best to use the compound value and split it into a negative, a neutral, and a positive range:
If we run this code against our first review ("This product is fantastic! I love it."), we get this result back from the model:
We can then convert it to return "positive", "neutral" or "negative" for each of our statements.
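A minimal sketch of these steps could look like this. The function names are hypothetical, and the cut-off of ±0.05 is the commonly used VADER convention, not necessarily the range chosen for this post:

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    # The threshold is an assumption; adjust it to widen or narrow "neutral".
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_sentiment(text):
    """Score one text with VADER (needs nltk and a one-time lexicon download)."""
    # Imported lazily so the helper above also works without nltk installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    # `scores` is a dict with the keys "neg", "neu", "pos" and "compound".
    return compound_to_label(scores["compound"])
```

Keeping the threshold logic in its own function makes it easy to try different ranges without touching the NLTK part.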
DistilBERT
DistilBERT is a smaller, distilled version of BERT. The model is described as a raw model for topic classification that is meant to be fine-tuned on a downstream task. For this post we skip the fine-tuning to see how it performs out of the box.
When we run it against our first sentence, we get a label and a score back:
All we need to do is convert the label from the model to lower case before we return it.
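As a rough sketch, the conversion and the surrounding pipeline call might look like this. The checkpoint name is an assumption (the post does not name the exact model), and both function names are hypothetical:

```python
def normalise_label(result):
    """Return the lower-cased label of one pipeline result dict."""
    return result["label"].lower()


def distilbert_sentiment(texts):
    """Classify a list of texts (needs transformers and a model download)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    # Assumed checkpoint; swap in the model you actually want to test.
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    # For a list of texts the pipeline returns one dict per text.
    return [normalise_label(result) for result in classifier(texts)]
```

For a result dict shaped like `{"label": "POSITIVE", "score": ...}`, `normalise_label` returns `"positive"`.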
Twitter roBERTa
Twitter roBERTa is a roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis with the TweetEval benchmark.
We need to build our pipeline where we set a tokenizer and use the AutoModelForSequenceClassification class to build the model based on the pre-trained data:
When we run it, we get LABEL_0, LABEL_1 and LABEL_2 back, which we need to convert to "negative", "neutral" and "positive":
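The pipeline setup and the label conversion can be sketched as follows. The model identifier `cardiffnlp/twitter-roberta-base-sentiment` is an assumption based on the description above, and the helper names are hypothetical:

```python
# Order as described in the text: LABEL_0 is negative, LABEL_2 is positive.
LABELS = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def roberta_label(result):
    """Translate one pipeline result dict into a readable sentiment."""
    return LABELS[result["label"]]


def build_roberta_pipeline(model_name="cardiffnlp/twitter-roberta-base-sentiment"):
    """Build the pipeline (needs transformers, torch and a model download)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
```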
Twitter roBERTa (latest)
Twitter roBERTa (latest) is an updated version of the roBERTa model, trained on ~124M tweets. Changing the model URL alone is not enough: we must adjust our code and run our review statements through a tokenizer before we can hand them to the model:
When we run it, we get a SequenceClassifierOutput back from our model:
SequenceClassifierOutput(loss=None, logits=tensor([[ 2.3379, -0.5528, -2.3123]]), hidden_states=None, attentions=None)
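The logits are raw scores, one per class, so they still need to be turned into a label. A minimal sketch in plain Python, assuming the class order negative, neutral, positive (the same order as for the older roBERTa model) and using the logits printed above:

```python
import math

# Assumed index order of the three output classes.
LABEL_ORDER = ["negative", "neutral", "positive"]


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def logits_to_label(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_ORDER[best], probs[best]


# Logits copied from the SequenceClassifierOutput shown above.
label, probability = logits_to_label([2.3379, -0.5528, -2.3123])
print(label)
```

With PyTorch tensors the same step is usually written as `output.logits.softmax(dim=-1)` followed by `argmax`.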
Bert base
Bert base is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5).
In our code we create a pipeline, hand our reviews to it in one go and then convert the result from something like "5 stars" into "positive":
The model gives us the star rating and a score back:
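The conversion from stars to a sentiment could be sketched like this. Only "5 stars" to "positive" is given in this post; the other cut-offs (1 to 2 stars negative, 3 stars neutral) are an assumption, and the function name is hypothetical:

```python
def stars_to_label(star_label):
    """Convert a label such as "5 stars" or "1 star" into a sentiment."""
    stars = int(star_label.split()[0])
    # Assumed cut-offs: 1-2 negative, 3 neutral, 4-5 positive.
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"
```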
Glue everything together for the comparison
With our methods for the different models in place, we need a bit more glue code to call the methods, build a Pandas data frame and create a nice table with the result:
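One possible shape for that glue code, as a sketch: every model is wrapped in a function that takes a text and returns a label, and a hypothetical `compare` helper runs and times them. The two stand-in classifiers exist only so the example runs without downloading a model:

```python
import time


def compare(reviews, classifiers):
    """Run every classifier over all reviews and time each one.

    `classifiers` maps a column name to a function text -> label.
    Returns (rows, timings): one dict per review, seconds per classifier.
    """
    rows = [{"text": text} for text in reviews]
    timings = {}
    for name, classify in classifiers.items():
        start = time.perf_counter()
        for row in rows:
            row[name] = classify(row["text"])
        timings[name] = time.perf_counter() - start
    return rows, timings


# Stand-in classifiers (not real models) to keep the sketch self-contained.
demo_classifiers = {
    "always positive": lambda text: "positive",
    "sad smiley": lambda text: "negative" if ":-(" in text else "positive",
}
rows, timings = compare(["It works.", ":-("], demo_classifiers)
```

Passing `rows` to `pandas.DataFrame(rows)` gives a data frame with one column per model; `DataFrame.to_markdown()` can then render the table, provided the `tabulate` package is installed.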
If we run this, we end up with a comparison table like this one:
| text | VADER | DistilBERT | roBERTa | roBERTa Latest | Bert |
|---|---|---|---|---|---|
| This product is fantastic! I love it. | positive | positive | positive | positive | positive |
| Terrible quality, broke after one use. | negative | negative | negative | negative | negative |
| Absolutely wonderful. Highly recommend! | positive | positive | positive | positive | positive |
| Worst purchase I have ever made. | negative | negative | negative | negative | negative |
| Pretty decent, does the job. | positive | positive | positive | positive | positive |
| Not worth the money. | negative | negative | negative | negative | negative |
| Exceeded my expectations! | neutral | positive | positive | positive | positive |
| Very disappointing experience. | negative | negative | negative | negative | negative |
| It's okay, not the best but not the worst. | positive | positive | neutral | neutral | neutral |
| I am extremely satisfied with this. | positive | positive | positive | positive | positive |
| It works. | neutral | positive | positive | positive | positive |
| It is fine. | positive | positive | positive | positive | positive |
| OK for the price. | positive | positive | neutral | neutral | neutral |
| :-( | negative | negative | negative | negative | neutral |
| :-) | positive | positive | positive | positive | positive |
| Time | 0.0096 sec | 0.7568 sec | 2.3965 sec | 1.7444 sec | 0.9992 sec |
I manually highlighted the differences to show that the models sometimes yield different results. However, for most reviews, all 5 models detect the same emotion. When it comes to the time this sentiment analysis took, VADER is massively faster than any of the models hosted on Hugging Face. The hard-coded rules of VADER are fast but restricted to the ones it ships with.
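Instead of highlighting by hand, the rows where the models disagree can also be found with a few lines of code. This is a sketch with a hypothetical helper name, using two rows copied from the table above:

```python
def disagreements(rows):
    """Return the rows in which the model columns do not share one label."""
    return [
        row
        for row in rows
        if len({label for column, label in row.items() if column != "text"}) > 1
    ]


# Two rows copied from the comparison table.
rows = [
    {"text": "It is fine.", "VADER": "positive", "DistilBERT": "positive",
     "roBERTa": "positive", "roBERTa Latest": "positive", "Bert": "positive"},
    {"text": ":-(", "VADER": "negative", "DistilBERT": "negative",
     "roBERTa": "negative", "roBERTa Latest": "negative", "Bert": "neutral"},
]
for row in disagreements(rows):
    print(row["text"])
```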
Conclusion
The five models we explored in this post help us find the sentiment of a review. As we can see in the table, they do not always agree. That difference is important to notice, especially when you use real data for the comparison. If you skip this step, you may attribute the wrong sentiment to your reviews. Therefore, take the time to compare the models before you build a whole pipeline around one that is not a good fit for the kind of sentences you have to work with.
Next week we play a bit with the star rating of Bert and compare it to Goodreads reviews.