#279: Sentiment Analysis in Python

Sentiment analysis is a powerful tool that allows us to understand the emotions and opinions behind written text. Be it reviews, social media posts, or customer feedback – if we know the emotions and how strong they are, we can flag important texts and prioritise them to address problems before they blow up.

In this post we use pre-trained models from Hugging Face and the Natural Language Toolkit (NLTK). That way we can run everything locally and jump directly to the analysis part to get fast feedback.

Why compare different models?

Since I am new to the topic, I do not yet have the knowledge to pick a good model for sentiment analysis. To help me find something suitable, I look at five frequently mentioned models and compare their results. The reviews I want to analyse look a bit like this list:

reviews = [
    "This product is fantastic! I love it.",
    "Terrible quality, broke after one use.",
    "Absolutely wonderful. Highly recommend!",
    "Worst purchase I have ever made.",
    "Pretty decent, does the job.",
    "Not worth the money.",
    "Exceeded my expectations!",
    "Very disappointing experience.",
    "It's okay, not the best but not the worst.",
    "I am extremely satisfied with this.",
    "It works.",
    "It is fine.",
    "OK for the price.",
    ":-(",
    ":-)",
]

It is important that we use test cases that match the data we want to work with. If we test with reviews like these and then work on completely different text snippets, our results may be totally different.
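
One way to make such test cases checkable is to store the sentiment we expect next to each review and measure how often a model matches it. The sketch below assumes hand-assigned labels and a made-up prediction list; the `accuracy` helper and the `expected` labels are illustrative and not part of the original script.

```python
# Hand-assigned expectations for a few of the reviews (an assumption for
# illustration; label your own data the way you would judge it yourself).
expected = {
    "This product is fantastic! I love it.": "positive",
    "Terrible quality, broke after one use.": "negative",
    "It's okay, not the best but not the worst.": "neutral",
}


def accuracy(predictions, expected_labels):
    """Return the share of predictions that match the expected labels."""
    if len(predictions) != len(expected_labels):
        raise ValueError("predictions and expected labels differ in length")
    if not expected_labels:
        return 0.0
    hits = sum(p == e for p, e in zip(predictions, expected_labels))
    return hits / len(expected_labels)


# Hypothetical model output for the three reviews above.
predictions = ["positive", "negative", "positive"]
score = accuracy(predictions, list(expected.values()))
print(f"{score:.2f}")
```

With real data, the predictions would come from one of the model functions defined later in this post.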

Install all the packages

For the comparison experiment we need to install several packages. It is best to create a virtual environment with uv, activate it and then run this command to install everything we need:

uv pip install pandas transformers nltk vaderSentiment torch tabulate

In our script we need these import statements to work with our freshly installed libraries:

import time
import pandas as pd
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import torch
import torch.nn.functional as F
from transformers import logging

# Optional: suppress warnings
logging.set_verbosity_error()

VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool. It ships with NLTK (as nltk.sentiment.vader) and as the standalone vaderSentiment package, which is the one we import here. It is specifically designed to analyse sentiments expressed in social media and uses a dictionary of words and grammatical rules to determine sentiment polarity and intensity.

We create a SentimentIntensityAnalyzer that checks our text. The vaderSentiment package bundles its own lexicon; the nltk.download('vader_lexicon') call in the code is only required if you switch to NLTK's implementation. From the result it is best to use the compound value and split it into a negative, neutral, and positive range:

def vader(reviews):
    """
    Model 1: VADER (Valence Aware Dictionary and sEntiment Reasoner) 
    """
    # Download the necessary NLTK data
    nltk.download('vader_lexicon')

    start = time.time()
    # Initialize the Sentiment Intensity Analyzer
    sia = SentimentIntensityAnalyzer()

    pred = []
    for index, sentence in enumerate(reviews):
        sentiment_scores = sia.polarity_scores(sentence)

        if index == 0:
            print(f"VADER: \t\t {sentiment_scores}")

        if sentiment_scores['compound'] >= 0.05:
            pred.append("positive")
        elif sentiment_scores['compound'] <= -0.05:
            pred.append("negative")
        else:
            pred.append("neutral")

    end = time.time()
    duration = round(end - start, 4)

    return pred, duration

If we run this code against our first review ("This product is fantastic! I love it."), we get this result back from the model:

{'neg': 0.0, 'neu': 0.382, 'pos': 0.618, 'compound': 0.8439}

We can then convert it to return "positive", "neutral" or "negative" for each of our statements.
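
The threshold logic can be pulled out into a small helper, which makes it easy to try stricter or looser cut-offs. This is only a sketch: the function name and the `threshold` parameter are my own additions, with the value 0.05 taken from the code above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score (-1.0 to 1.0) to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(0.8439))   # the compound score shown above
print(label_from_compound(0.0))
print(label_from_compound(-0.4767))  # made-up negative score
```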

DistilBERT

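DistilBERT is a smaller, faster distilled version of BERT. The base model is a raw model that is made to be fine-tuned on a downstream task. The checkpoint we use here, distilbert-base-uncased-finetuned-sst-2-english, has already been fine-tuned for sentiment classification on the SST-2 dataset, so we can use it out of the box without any fine-tuning of our own. Note that it only knows two labels, POSITIVE and NEGATIVE, and never returns neutral.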

def distilbert(reviews):
    """
    Model 2: distilbert-base-uncased-finetuned-sst-2-english
    """    
    start = time.time()

    model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
    results = model(reviews)
    print(f"distilbert: \t {results[0]}")

    pred = [r['label'].lower() for r in results]

    end = time.time()
    duration = round(end - start, 4)

    return pred, duration

When we run it against our first sentence, we get a label and a score back:

{'label': 'POSITIVE', 'score': 0.9998844861984253}

All we need to do is convert the label to lower case before we return it.
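
Because this model only returns POSITIVE or NEGATIVE, it can never label a lukewarm review as neutral. A possible workaround, not used in this post, is to treat low-confidence predictions as neutral. The helper name and the 0.75 cut-off below are assumptions and would need tuning against real data.

```python
def to_three_way(result, min_score=0.75):
    """Turn a binary pipeline result into negative/neutral/positive.

    `result` is a dict like {'label': 'POSITIVE', 'score': 0.99}.
    Predictions below `min_score` (a hypothetical cut-off) become neutral.
    """
    if result["score"] < min_score:
        return "neutral"
    return result["label"].lower()


print(to_three_way({"label": "POSITIVE", "score": 0.9998844861984253}))
print(to_three_way({"label": "NEGATIVE", "score": 0.61}))  # made-up score
```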

Twitter roBERTa

Twitter roBERTa is a roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis with the TweetEval benchmark.

We build our pipeline by loading a tokenizer and using the AutoModelForSequenceClassification class to load the model with its pre-trained weights:

def roberta(reviews):
    """
    Model 3: cardiffnlp/twitter-roberta-base-sentiment
    """
    start = time.time()

    model_name = "cardiffnlp/twitter-roberta-base-sentiment"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    results = sentiment_pipeline(reviews)
    print(f"roberta: \t {results[0]}")

    # The model returns generic label ids, so we map all three explicitly
    label_map = {'LABEL_0': 'negative', 'LABEL_1': 'neutral', 'LABEL_2': 'positive'}
    pred = [label_map[r['label']] for r in results]

    end = time.time()
    duration = round(end - start, 4)

    return pred, duration

When we run it, we get LABEL_0, LABEL_1 or LABEL_2 back, which we need to convert to "negative", "neutral" or "positive":

{'label': 'LABEL_2', 'score': 0.9924222826957703}

Twitter roBERTa (latest)

Twitter roBERTa (latest) is an updated version of the roBERTa model, trained on ~124M tweets. As we can see in the code, changing the model name is not all we do here. This time we skip the pipeline helper, run each review through the tokenizer ourselves and hand the encoded input to the model:

def roberta_latest(reviews):
    """
    Model 4: cardiffnlp/twitter-roberta-base-sentiment-latest
    """
    start = time.time()

    model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Sentiment labels
    labels = ['negative', 'neutral', 'positive']

    pred = []
    for index, sentence in enumerate(reviews):
        encoded_input = tokenizer(sentence, return_tensors='pt')
        with torch.no_grad():
            output = model(**encoded_input)

            if index == 1:
                print(f"roberta latest:  {output}")

            scores = F.softmax(output.logits, dim=1)
            predicted_class = torch.argmax(scores).item()
            pred.append(labels[predicted_class])

    end = time.time()
    duration = round(end - start, 4)

    return pred, duration

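When we run it, we get a SequenceClassifierOutput back from our model. The code prints the output for the second review ("Terrible quality, broke after one use."), where the first logit (negative) is the largest: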

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.3379, -0.5528, -2.3123]]), hidden_states=None, attentions=None)
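
The logits are raw scores, not probabilities. The function applies softmax and picks the largest value, and the same step can be reproduced by hand with the numbers printed above. This is a plain-Python sketch for illustration; the script itself uses torch for this.

```python
import math

labels = ["negative", "neutral", "positive"]
logits = [2.3379, -0.5528, -2.3123]  # values from the output above


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)
best = max(range(len(labels)), key=lambda i: probabilities[i])
print(labels[best], round(probabilities[best], 2))
```

For these logits the model puts roughly 94 % of the probability on negative.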

Bert base

Bert base is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5).

In our code we create a pipeline, hand our reviews to it in one go and then convert the result from something like "5 stars" into "positive":

def bert(reviews):
    """
    Model 5: nlptown/bert-base-multilingual-uncased-sentiment    
    """

    start = time.time()

    model = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
    results = model(reviews)
    print(f"bert: \t\t {results[0]}")

    # This model outputs star ratings: 1-2 stars = negative, 4-5 = positive, 3 = neutral
    pred = []
    for r in results:
        label = r['label']
        stars = int(label[0])
        if stars <= 2:
            pred.append('negative')
        elif stars >= 4:
            pred.append('positive')
        else:
            pred.append('neutral')

    end = time.time()
    duration = round(end - start, 4)

    return pred, duration

The model gives us the star rating and a score back:

{'label': '5 stars', 'score': 0.9508458971977234}
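
By now we have seen three different label formats from the pipeline-based models: POSITIVE/NEGATIVE, LABEL_0 to LABEL_2, and star ratings. As a sketch, one helper could normalise all of them. The function name `normalise_label` is my own, and the mapping only covers the formats shown in this post.

```python
ROBERTA_LABELS = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def normalise_label(raw):
    """Map a raw model label to negative, neutral or positive."""
    if raw in ROBERTA_LABELS:
        return ROBERTA_LABELS[raw]
    if raw.lower() in ("positive", "negative", "neutral"):
        return raw.lower()
    if raw.endswith(("star", "stars")):
        stars = int(raw.split()[0])
        if stars <= 2:
            return "negative"
        if stars >= 4:
            return "positive"
        return "neutral"
    # Fail loudly instead of silently guessing a sentiment
    raise ValueError(f"Unknown label format: {raw!r}")


for raw in ["POSITIVE", "LABEL_1", "5 stars", "1 star"]:
    print(raw, "->", normalise_label(raw))
```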

Glue everything together for the comparison

With our functions for the different models in place, we need a bit more glue code to call them, build a pandas DataFrame and create a nice table with the result:

vader_result, vader_time = vader(reviews)
dist_result, dist_time = distilbert(reviews)
roberta_result, roberta_time = roberta(reviews)
roberta_latest_result, roberta_latest_time = roberta_latest(reviews)
bert_result, bert_time = bert(reviews)


# -------- Build Comparison Table --------
df = pd.DataFrame({
    "text": reviews,
    "VADER": vader_result,
    "DistilBERT": dist_result,
    "roBERTa": roberta_result,
    "roBERTa Latest": roberta_latest_result,
    "Bert": bert_result
})

# Add timing row
times = [f"{t:.4f} sec" for t in [vader_time, dist_time, roberta_time, roberta_latest_time, bert_time]]
df.loc["time used"] = [""] + times

# Print table
print(df.to_markdown(index=False))

If we run this, we end up with a comparison table like this one:

| text | VADER | DistilBERT | roBERTa | roBERTa Latest | Bert |
|---|---|---|---|---|---|
| This product is fantastic! I love it. | positive | positive | positive | positive | positive |
| Terrible quality, broke after one use. | negative | negative | negative | negative | negative |
| Absolutely wonderful. Highly recommend! | positive | positive | positive | positive | positive |
| Worst purchase I have ever made. | negative | negative | negative | negative | negative |
| Pretty decent, does the job. | positive | positive | positive | positive | positive |
| Not worth the money. | negative | negative | negative | negative | negative |
| Exceeded my expectations! | neutral | positive | positive | positive | positive |
| Very disappointing experience. | negative | negative | negative | negative | negative |
| It's okay, not the best but not the worst. | positive | positive | neutral | neutral | neutral |
| I am extremely satisfied with this. | positive | positive | positive | positive | positive |
| It works. | neutral | positive | positive | positive | positive |
| It is fine. | positive | positive | positive | positive | positive |
| OK for the price. | positive | positive | neutral | neutral | neutral |
| :-( | negative | negative | negative | negative | neutral |
| :-) | positive | positive | positive | positive | positive |
|  | 0.0096 sec | 0.7568 sec | 2.3965 sec | 1.7444 sec | 0.9992 sec |
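
Rows where the models disagree can also be found programmatically instead of by eye. The sketch below uses four rows copied from the table; the `disagreements` helper is my own naming. In the real script the same check could run over the columns of the data frame.

```python
# Four rows copied from the table above, in the column order
# VADER, DistilBERT, roBERTa, roBERTa Latest, Bert.
rows = {
    "This product is fantastic! I love it.": ["positive"] * 5,
    "Exceeded my expectations!": ["neutral", "positive", "positive", "positive", "positive"],
    "OK for the price.": ["positive", "positive", "neutral", "neutral", "neutral"],
    ":-(": ["negative", "negative", "negative", "negative", "neutral"],
}


def disagreements(predictions_by_text):
    """Return the texts for which not all models predict the same label."""
    return [text for text, preds in predictions_by_text.items() if len(set(preds)) > 1]


for text in disagreements(rows):
    print(text)
```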

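The models sometimes yield different results: in 5 of the 15 rows at least one model disagrees with the others, mostly on lukewarm statements such as "It's okay, not the best but not the worst." or "OK for the price.". For the other 10 reviews, all 5 models detect the same sentiment. When it comes to the time this sentiment analysis took, VADER is massively faster than any of the models hosted on Hugging Face. Keep in mind that the timings of the Hugging Face models include loading the model, because the timer starts before the model is created. The hard-coded rules of VADER are fast but restricted to the ones it ships with.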
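
To see how much of each duration is spent on loading a model and how much on the predictions themselves, the two steps can be timed separately. A minimal sketch, where `timed` and the stand-in `load_model` are hypothetical helpers that are not part of the original script:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-ins so the sketch runs without downloading a model. In the real
# script, load_model would be the pipeline(...) call and the returned
# object would be called with the list of reviews.
def load_model():
    time.sleep(0.05)  # pretend that loading is the slow part
    return lambda texts: ["positive" for _ in texts]


model, load_seconds = timed(load_model)
predictions, inference_seconds = timed(model, ["It works.", "It is fine."])
print(f"load: {load_seconds:.4f} sec, inference: {inference_seconds:.4f} sec")
```

time.perf_counter is used here instead of time.time because it is meant for measuring short durations.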

Conclusion

The 5 different models we explored in this post help us to find the sentiment of a review. As we can see in the table, they do not always agree. That difference is important to notice, especially when you use real data for the comparison. If you skip this step, you risk attributing the wrong sentiment to your reviews. Therefore, take the time to compare the models before you build a whole pipeline around one that is not a good fit for the kind of sentences you have to work with.

Next week we play a bit with the star rating of Bert and compare it to Goodreads reviews.