Sentiment Analysis - Docker / AWS implementation
Project and deployment documentation
The Idea
- Why was sentiment analysis chosen as the project idea?
- Sentiment analysis was chosen because it yields valuable insights into customer opinion of a product, which is a crucial input to business decision-making.
- Implementing sentiment analysis demonstrates my expertise in Natural Language Processing (NLP) and showcases my proficiency in AI.
- The project involves hosting the sentiment analysis system on AWS (Amazon Web Services) OR utilizing Docker for containerization, showcasing my skills across the full pipeline, from development through deployment to production.
The Data
- What was the data and how was it collected?
- The data used for this project was collected from Twitter. Specifically, it consisted of sentiment-labeled tweets obtained from earlier challenges in the ongoing SemEval (Semantic Evaluation) series on computational semantic analysis.
- The data collection process involved gathering and merging sentiment-tagged tweets that had already been labeled with either positive or negative sentiment.
- The project deliberately excluded neutral-sentiment texts to keep the scope feasible; this requirement can be revisited for future use cases.
Positive Wordcloud
Wordcloud showing the most commonly used positive words present in the text
Negative Wordcloud
Wordcloud showing the most commonly used negative words present in the text
Commonly used Positive Words frequency
Barplot showing the frequency of the commonly used positive words present in the text
Commonly used Negative Words frequency
Barplot showing the frequency of the commonly used negative words present in the text
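The plotting code is not included in this write-up; below is a minimal sketch of how such visuals can be produced with the wordcloud and matplotlib libraries (the sample texts are placeholders, not the actual dataset):

from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Placeholder cleaned positive tweets; in the project these come from the labeled data
positive_texts = ["love this phone", "great battery life", "love the great camera"]
words = " ".join(positive_texts).split()

# Word cloud of the most commonly used words
wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(words))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

# Bar plot of the top word frequencies
top = Counter(words).most_common(10)
plt.bar([w for w, _ in top], [c for _, c in top])
plt.xticks(rotation=45)
plt.show()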
Data Processing
- Data pre-processing steps included cleaning the input text by removing HTTP/HTTPS links, converting it to lowercase, eliminating punctuation, numbers, and stopwords, and reducing excessive whitespace.
- The data was balanced across classes before training the model (see the sketch below).
- These steps ensure that the data is standardized and free of irrelevant elements.
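The exact balancing method is not specified above; one common approach is to downsample every class to the size of the smallest one, sketched here assuming the merged tweets live in a pandas DataFrame with "text" and "label" columns (the file name is illustrative):

import pandas as pd

# Hypothetical merged dataset: "label" is 0 (negative) or 1 (positive)
df = pd.read_csv("tweets.csv")

# Downsample every class to the size of the smallest one
min_count = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
      .reset_index(drop=True)
)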
Stemming vs Lemmatization
Stemming vs Lemmatization (image source)
- Stemming simplifies words by removing suffixes, while lemmatization transforms words to their base form.
- Stemming is faster and applies simple rules, but the resulting stems may not be actual words.
- Lemmatization considers word meaning and context, producing accurate base forms.
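A quick illustration of the difference using NLTK (the example words are arbitrary, and nltk.download('wordnet') may be required on first run):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "leaves"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))

# Stemming can produce non-words, e.g. "studies" -> "studi",
# while lemmatization yields the real base form "study"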
Fine-Tuning BERT language model
- Why did I fine-tune a model?
- The decision to fine-tune a model was made because training a model from scratch with the available data was not feasible due to limited data and the time it would have taken. However, various model configurations were explored during the process to reach this decision.
- What model was chosen for fine-tuning?
- The chosen model for fine-tuning was the base BERT (Bidirectional Encoder Representations from Transformers) model, as it was freely available (unlike GPT models) and well suited to the task at hand.
- How was the model fine-tuned?
- The model was fine-tuned by tokenizing the text with a maximum length of 18 tokens (value found experimentally). It was then compiled with Adam optimizer, using sparse categorical cross-entropy loss. The model underwent a total of 5 epochs during the fine-tuning training process.
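A condensed sketch of this fine-tuning setup, using Hugging Face Transformers with TensorFlow; the learning rate, batch size, and sample data below are assumptions, as the write-up only specifies max_length=18, the Adam optimizer, sparse categorical cross-entropy, and 5 epochs:

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Placeholder cleaned, balanced tweets and their labels (0 = negative, 1 = positive)
texts = ["i love this product", "worst purchase ever"]
labels = [1, 0]

encodings = tokenizer(texts, max_length=18, padding="max_length",
                      truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).shuffle(1024).batch(32)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dataset, epochs=5)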
Tokenization in NLP
Tokenization (image source)
Tokenization is a fundamental process in NLP that breaks text into smaller units called tokens. These tokens can be words, characters, or subwords, depending on the tokenization strategy used.
The BERT tokenizer used in this project is specifically designed for the BERT model. It employs a technique called WordPiece tokenization, which splits words into subwords. This approach allows the tokenizer to handle out-of-vocabulary (OOV) words and capture more fine-grained linguistic information present in the text.
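A small demonstration of WordPiece behavior with the BERT tokenizer (the printed output is indicative):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Common words stay whole; rarer ones are split into "##"-prefixed subword pieces
print(tokenizer.tokenize("I have a new GPU!"))
# e.g. ['i', 'have', 'a', 'new', 'gp', '##u', '!']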
Deployment of trained model
The fine-tuned model was finally deployed in two different ways:
Locally Hosted Docker Image:
- The model was hosted locally within a Docker container using TensorFlow Serving (TFServing). A sketch of the serving command is shown below.
- A RESTful API was set up to interact with the model for making predictions.
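For reference, a minimal sketch of serving the exported model with the official TensorFlow Serving Docker image (the host path and the model name "sentiment_model" are illustrative):

docker run -p 8501:8501 \
  --name tfserving_sentiment \
  -v "$(pwd)/sentiment_model:/models/sentiment_model" \
  -e MODEL_NAME=sentiment_model \
  tensorflow/serving
# sentiment_model/ must contain a numeric version subfolder (e.g. 1/) holding the SavedModel;
# the REST endpoint is then http://localhost:8501/v1/models/sentiment_model:predict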
- The SentimentAnalyzer class performs sentiment analysis using an API endpoint.
- It takes the API URL, a tokenizer, and a stoplist as parameters in its constructor.
- The clean_text() method cleans the input text by removing URLs, punctuation, numbers, and stop words.
- The make_prediction_api() method sends a POST request to the API and returns the predictions.
- The predict_sentiment() method cleans the text, tokenizes it, makes the API call, and returns the predicted sentiment, confidence level, and API response as a list.
import json
import re
import string

import numpy as np
import requests
import tensorflow as tf

class SentimentAnalyzer:
    def __init__(self, url, tokenizer, stoplist):
        self.url = url
        self.tokenizer = tokenizer
        self.stoplist = stoplist

    def clean_text(self, text):
        # Remove http / https links
        text = re.sub(r'http\S+|https\S+', '', text)
        # Convert to lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        # Remove stopwords
        text = ' '.join(word for word in text.split() if word not in self.stoplist)
        # Remove extra whitespaces
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def make_prediction_api(self, instances):
        data = json.dumps({"signature_name": "serving_default", "instances": instances})
        headers = {"content-type": "application/json"}
        json_response = requests.post(self.url, data=data, headers=headers)
        predictions = json.loads(json_response.text)
        return predictions

    def predict_sentiment(self, text):
        # Clean input text
        text = self.clean_text(text)
        # Tokenize the cleaned text
        encoded_input_api = self.tokenizer(
            text,
            max_length=18,
            padding="max_length",
            truncation=True,
            return_tensors='tf'
        )
        # Get text tensor values
        input_ids = encoded_input_api["input_ids"]
        attention_mask = encoded_input_api["attention_mask"]
        token_type_ids = encoded_input_api["token_type_ids"]
        # Prepare instance for API call
        instances_api = [{
            "input_ids": input_ids.numpy().tolist()[0],
            "attention_mask": attention_mask.numpy().tolist()[0],
            "token_type_ids": token_type_ids.numpy().tolist()[0]
        }]
        # Make API call
        result_api = self.make_prediction_api(instances_api)
        # Calculate confidence level of the prediction and get the predicted label
        logits = result_api['predictions'][0]
        probabilities = tf.nn.softmax(logits)
        predicted_label = tf.argmax(probabilities).numpy()
        confidence_level = np.max(probabilities)
        # Map the predicted label to its corresponding sentiment category
        sentiment_categories = ["Negative", "Positive"]
        predicted_sentiment = sentiment_categories[predicted_label]
        return [result_api, predicted_sentiment, confidence_level]
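A hypothetical usage example (the endpoint URL, tokenizer, and stoplist shown here are assumptions):

from nltk.corpus import stopwords
from transformers import BertTokenizer

url = "http://localhost:8501/v1/models/sentiment_model:predict"  # assumed TF Serving endpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
stoplist = set(stopwords.words("english"))  # nltk.download('stopwords') may be required

analyzer = SentimentAnalyzer(url, tokenizer, stoplist)
result_api, sentiment, confidence = analyzer.predict_sentiment("I absolutely loved this product!")
print(sentiment, confidence)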
Output for calling this code:
AWS EC2 Deployment (images below):
To deploy the trained model on AWS, the following steps were followed:
Creating a Flask Application:
The Flask application was developed to create an interface for the trained model.
Setting up an EC2 Instance:
An EC2 instance was created on AWS to host the application.
Configuring a Deep Learning Amazon Machine Image:
A Deep Learning Amazon Machine Image (Ubuntu 20.04) was set up on the EC2 instance.
Starting the Application on the EC2 Instance:
Using WinSCP and PuTTY, communication with the operating system on the cloud was established, and the Flask application was started and run on the EC2 instance.
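A minimal sketch of the Flask interface, reusing the SentimentAnalyzer defined above (the route, port, and analyzer configuration are illustrative):

from flask import Flask, request, jsonify

app = Flask(__name__)
# analyzer = SentimentAnalyzer(url, tokenizer, stoplist)  # configured as in the usage example

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json().get("text", "")
    _, sentiment, confidence = analyzer.predict_sentiment(text)
    return jsonify({"sentiment": sentiment, "confidence": float(confidence)})

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the app is reachable from outside the EC2 instance
    app.run(host="0.0.0.0", port=5000)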
In case of larger models or data, I would have used an S3 bucket to store and access the required resources.
Moreover, I attempted to leverage Amazon Comprehend for the task. However, fine-tuning it with a custom dataset incurs costs, which I decided against for the moment. Nonetheless, this remains a feasible option that can be pursued if necessary.
Positive Inference:
Negative Inference: