Predicting sentiment from Arabic text

Introduction

In this article, we explain how to implement a neural network that predicts sentiment from Arabic text, using tweets as input for training and evaluation. We start by downloading and processing the data. Next, we load the tokenizer that transforms our texts into tokens, build our dataset, and load the model used for sentiment analysis. We then define our evaluation metric and perform training. Finally, to test our model, we run it on some examples. The code is based on the Hugging Face Transformers library. This is a link to the corresponding Colab notebook.

Downloading Data

For data, we use the ASTD dataset (https://github.com/mahmoudnabil/ASTD), which contains examples of tweets. This is the code for downloading the data:

!wget -P /content/data https://raw.githubusercontent.com/mahmoudnabil/ASTD/master/data/4class-balanced-train.txt

!wget -P /content/data https://raw.githubusercontent.com/mahmoudnabil/ASTD/master/data/4class-balanced-test.txt

!wget -P /content/data https://raw.githubusercontent.com/mahmoudnabil/ASTD/master/data/Tweets.txt

The files “4class-balanced-train.txt” and “4class-balanced-test.txt” contain the ids of the tweets to use for training and testing. The file “Tweets.txt” contains all tweets: each line holds a different tweet and its corresponding class, one of ‘NEUTRAL’, ‘NEG’, ‘OBJ’, or ‘POS’.

“Balanced” means that the tweets were selected so that each class has the same number of tweets. The tweet ids correspond to line indexes in the “Tweets.txt” file.
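
As a quick sanity check of these two points, the sketch below looks up labels by line index and counts how many selected tweets fall in each class. The function name `class_counts` and the toy data are illustrative assumptions, not part of the original notebook; with the real files, the labels and id lists come from the loading code in the next section.

```python
from collections import Counter

def class_counts(labels, ids):
    """Count how many of the selected line indexes fall in each class."""
    return Counter(labels[i] for i in ids)

# Toy stand-in for the labels in Tweets.txt (one per line) and for an id file.
toy_labels = ["POS", "NEG", "OBJ", "OBJ", "NEUTRAL", "OBJ", "POS", "NEG", "NEUTRAL"]
toy_ids = [0, 1, 2, 4]  # line indexes chosen so that each class appears once

counts = class_counts(toy_labels, toy_ids)
print(counts)
# A selection is balanced when every class has the same count.
print(len(set(counts.values())) == 1)
```

Running the same check on the real train and test ids is a cheap way to confirm that the ids are being interpreted as intended.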

Data Processing

After downloading data, we load it into memory:

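import codecs

# Paths match the download directory used by the wget commands above.
tweets_file = "/content/data/Tweets.txt"
train_ids_file = "/content/data/4class-balanced-train.txt"
test_ids_file = "/content/data/4class-balanced-test.txt"

# One tweet (with its label) per line, stored as UTF-8.
with codecs.open(tweets_file, 'r', 'utf-8') as f:
  data = [d.strip() for d in f.readlines()]

# Each id file holds one integer line index per line.
with open(train_ids_file) as f:
  train_ids = [int(l.strip()) for l in f.readlines()]

with open(test_ids_file) as f:
  test_ids = [int(l.strip()) for l in f.readlines()]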

Then we get the tweets and labels by splitting each line on the tab character:

tweets = [d.split(u"\t")[0] for d in data]
labels = [d.split(u"\t")[1] for d in data]

To get the train data and test data, we retrieve tweets and labels using indexes in “train_ids” and “test_ids”:

train_tweets = [tweets[ind] for ind in train_ids]
train_labels = [labels[ind] for ind in train_ids]
test_tweets = [tweets[ind] for ind in test_ids]
test_labels = [labels[ind] for ind in test_ids]

Tokenizer

In NLP, tokenizers are used to transform texts into numbers. A tokenizer takes a text as input, splits it into tokens, and assigns each token its corresponding vocabulary id. As we are dealing with Arabic text, we need a tokenizer suited to Arabic; we chose “UBC-NLP/MARBERT” from the Hugging Face model hub. This is the code for loading our tokenizer using the transformers library:

from transformers import AutoTokenizer

model_name = "UBC-NLP/MARBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
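
To make the text-to-ids idea concrete, here is a toy sketch that splits a text on whitespace and looks each token up in a small vocabulary. The vocabulary, the `[UNK]` id, and the function name `toy_encode` are invented for illustration; the real MARBERT tokenizer uses a much larger learned vocabulary and can split words into smaller pieces, so its tokens and ids will differ.

```python
# Hypothetical toy vocabulary; real vocabularies hold many thousands of entries.
toy_vocab = {"[UNK]": 0, "أنا": 1, "أحب": 2, "القهوة": 3}

def toy_encode(text, vocab):
    """Split on whitespace and map each token to its vocabulary id."""
    tokens = text.split()
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

print(toy_encode("أنا أحب القهوة", toy_vocab))  # every word is in the vocabulary
print(toy_encode("أنا أحب الشاي", toy_vocab))   # the last word is not
```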

To check that the tokenizer works, we run it on a tweet:

# Length of the longest training tweet, measured in characters (not tokens).
max_len = max(len(t) for t in train_tweets)
encoded_example = tokenizer.encode_plus(
    tweets[0],
    max_length = max_len + 1,
    add_special_tokens = True,
    return_token_type_ids = False,
    padding = 'max_length',  # replaces the deprecated pad_to_max_length = True
    truncation = True,
    return_attention_mask = True,
    return_tensors = 'pt'
)
encoded_example

We first compute the length of the longest training tweet, in characters, and use it to set the maximum length passed to the encoding function “encode_plus”. This function takes the text to encode together with other parameters, such as the maximum length of the output tensor (see the documentation for more information). The tokenization returns a dictionary with two values: the input ids tensor, which contains the ids of the tokens parsed from the original text, and an attention mask, which tells the model which positions hold real tokens and which hold only padding.
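
The relationship between padding and the attention mask can be sketched in a few lines of plain Python. The token ids and the padding id of 0 below are made up, and `pad_with_mask` is a hypothetical helper; the real tokenizer also takes special tokens into account when truncating, which this simplified version ignores.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks a padding position the model should ignore.
    return ids + [pad_id] * padding, mask + [0] * padding

input_ids, attention_mask = pad_with_mask([2, 57, 914, 3], max_length=6)
print(input_ids)       # [2, 57, 914, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Padding every tweet to the same length is what allows several tweets to be stacked into one tensor for a batch.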

Dataset

Before training our model, we must implement our dataset class. First, we create a dictionary which maps each label to a number:

unique_labels = sorted(set(labels))  # sorted so the mapping is the same on every run
label_map = {label: index for index, label in enumerate(unique_labels)}
print(label_map)

{'NEG': 0, 'NEUTRAL': 1, 'OBJ': 2, 'POS': 3}

After that, we create our dataset class, which inherits from “torch.utils.data.Dataset” and wraps the code for retrieving tweets. The training and evaluation code uses this class to get tweets and their labels.

from torch.utils.data import Dataset
from transformers.data.processors.utils import InputFeatures

class TweetDataset(Dataset):
  def __init__(self, tweets, labels, tokenizer, max_len, label_map, device):
    super().__init__()
    self.tweets = tweets
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_len = max_len
    self.label_map = label_map
    self.device = device

  def __len__(self):
    return len(self.tweets)

  def __getitem__(self, index):
    tweet = str(self.tweets[index])
    encoded_tweet = self.tokenizer.encode_plus(
        tweet,
        max_length= self.max_len,
        add_special_tokens = True,
        return_token_type_ids = False,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        truncation='longest_first',
        return_attention_mask=True,
        return_tensors='pt'
    )
    input_ids = encoded_tweet['input_ids'].to(self.device)
    attention_mask = encoded_tweet['attention_mask'].to(self.device)

    return InputFeatures(
        input_ids = input_ids.flatten(),
        attention_mask = attention_mask.flatten(),
        label = self.label_map[self.labels[index]]
    )

The initialization only sets the object attributes. The “__getitem__” method is called to get one item: it retrieves the text from the “tweets” list and encodes it into tokens using the tokenizer. The outputs are passed to the “InputFeatures” constructor. The tokenizer returns tensors with a leading dimension of size one, so we flatten them to get one-dimensional tensors: each item must provide one array of token ids and one array for the mask. Finally, the label mapping dictionary gives us the label id corresponding to the label name.
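
The reason for flattening is easier to see by tracking shapes. The sketch below uses nested Python lists as stand-ins for tensors; `shape` is a hypothetical helper and the token ids are made up. It shows that stacking unflattened items into a batch leaves an extra axis of size one, whereas flattened items give the usual (batch size, sequence length) layout.

```python
def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# One encoded tweet as returned with return_tensors='pt': a leading axis of size 1.
encoded = [[2, 57, 914, 3, 0, 0]]
# Same values without the size-1 axis, like flatten() on a (1, 6) tensor.
flat = encoded[0]

# Batching stacks the per-item outputs along a new first axis.
print(shape([encoded, encoded]))  # (2, 1, 6): extra axis in the middle
print(shape([flat, flat]))        # (2, 6): batch size x sequence length
```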

After that we create the train set and the test set by instantiating the dataset class created previously.

import torch

device = torch.device("cuda")
train_dataset = TweetDataset(train_tweets, train_labels, tokenizer, max_len, label_map, device)
test_dataset = TweetDataset(test_tweets, test_labels, tokenizer, max_len, label_map, device)

Model

After tokenization, we pass the encoded text to the model for prediction. In our case, we use the “UBC-NLP/MARBERT” model from the Hugging Face model hub. This is example code for performing a prediction on a tweet text:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=True, num_labels=len(label_map))
model.to(device)
encoded_example = tokenizer.encode_plus(
  tweets[0],
  max_length= max_len + 1,
  add_special_tokens = True,
  return_token_type_ids = False,
  padding='max_length',  # replaces the deprecated pad_to_max_length=True
  truncation='longest_first',
  return_attention_mask=True,
  return_tensors='pt'
)
input_ids = encoded_example['input_ids'].to(device)
attention_mask = encoded_example['attention_mask'].to(device)
output = model(input_ids, attention_mask)
print(tweets[0])
_, prediction = torch.max(output[0], dim=1)
print(prediction[0].item())

First, we load our model into memory. Then we tokenize our example tweet and pass the tokenization output to the model for prediction. The result is a tensor containing 4 numbers, one score (logit) per class; a higher score means the model considers that class more likely. To retrieve the predicted label id, we take the index of the maximum score. At this stage the classification layer has not been trained yet, so the prediction is not meaningful; the call only checks that the pipeline runs.
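
Strictly speaking, the values returned by a sequence classification model are unnormalized scores (logits); applying a softmax turns them into probabilities. The sketch below, in plain Python with made-up scores, shows the conversion and confirms that taking the maximum of the raw scores selects the same class as taking the maximum of the probabilities. The helper names are illustrative.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [0.2, -1.3, 2.1, 0.4]  # made-up scores for the 4 classes
probs = softmax(logits)

print(argmax(logits))                   # 2
print(argmax(logits) == argmax(probs))  # True
```

This is why the article's code can call `torch.max` directly on the model output without applying a softmax first.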

Evaluation Metric

The neural network model predicts a class for each input text. To evaluate the performance of our model, we use the accuracy metric, which is the fraction of correct predictions among all predictions. This is the code for calculating this metric on example data:

from sklearn.metrics import accuracy_score
import numpy as np

example_predictions = np.array([0,1,2,0,0,2,3,0,1,1,2,1,3])
example_labels = np.array([0,1,1,3,0,2,3,0,1,1,2,0,3])

print(example_predictions)
print(example_labels)
accuracy_score(example_labels, example_predictions)

The “accuracy_score” function takes two inputs: the correct or ground truth labels, and the predicted labels.
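
For reference, the same number can be computed by hand, which makes the definition explicit. This is a minimal plain-Python sketch using the example arrays from above; the function name `accuracy` is ours.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

example_predictions = [0, 1, 2, 0, 0, 2, 3, 0, 1, 1, 2, 1, 3]
example_labels = [0, 1, 1, 3, 0, 2, 3, 0, 1, 1, 2, 0, 3]

# 10 of the 13 predictions match their label, so this prints about 0.769.
print(accuracy(example_labels, example_predictions))
```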

Then we define the function that computes the metric. It is called during training: with the evaluation strategy used below, the accuracy is recomputed on the test set after each epoch.

def compute_metrics(eval_data):
  predictions = np.argmax(eval_data.predictions, axis=1)
  assert len(predictions) == len(eval_data.label_ids)
  acc = accuracy_score(eval_data.label_ids, predictions)
  return {
      'accuracy': acc
  }

Training / Finetuning

The training process consists of updating the weights of the neural network to improve its accuracy. Finetuning is a more specific form of training: all or some layers of a pretrained model are trained further on a specific dataset. In our case, the model has already been pretrained on Arabic text, and we want to adapt it to our sentiment analysis task. This is the code, based on the Hugging Face transformers library, for finetuning our model:

from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import EvaluationStrategy

training_args = TrainingArguments("/content/train")
training_args.lr_scheduler_type = "cosine"
training_args.evaluate_during_training = True
training_args.adam_epsilon = 1e-8
training_args.learning_rate = 1.215e-05
training_args.fp16 = True  # mixed precision, intended for GPU training
training_args.per_device_train_batch_size = 16
training_args.per_device_eval_batch_size = 16
training_args.gradient_accumulation_steps = 2  # weights are updated every 2 batches
training_args.num_train_epochs = 10
training_args.warmup_steps = 0
training_args.evaluation_strategy = EvaluationStrategy.EPOCH  # evaluate after every epoch
training_args.logging_steps = 200
training_args.save_steps = 100000
training_args.seed = 42
training_args.disable_tqdm = False
training_args.dataloader_pin_memory = False  # our dataset already returns tensors on the GPU

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset= test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

These are the main arguments defined in that code:

per_device_train_batch_size: number of tweets processed per device in each training iteration. This number is called the batch size.
per_device_eval_batch_size: number of tweets processed per device in each evaluation iteration.
num_train_epochs: number of training epochs. During each epoch, the training algorithm processes all the training data once.
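
The training code also sets gradient_accumulation_steps to 2, which means the weights are updated once every two batches. The sketch below gives a rough estimate of what these settings imply for a single-device run. The function `training_steps` is a hypothetical helper, the training-set size of 1000 is a placeholder rather than the size of the ASTD split, and the exact rounding of the last partial group of batches may differ between versions of the library.

```python
import math

def training_steps(num_examples, batch_size, grad_accum, epochs):
    """Rough count of weight updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One weight update is made per group of grad_accum batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return {
        "effective_batch_size": batch_size * grad_accum,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }

# 1000 is a placeholder training-set size used only for this illustration.
summary = training_steps(num_examples=1000, batch_size=16, grad_accum=2, epochs=10)
print(summary)
```

With these placeholder numbers, each update is based on 32 tweets, and the run performs 31 updates per epoch, or 310 in total.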

Inference

Finally, to verify our model, we run it on an example. This is the code for performing this inference (similar to the code used to test the model right after loading it):

encoded_review = tokenizer.encode_plus(
    tweets[0],
    max_length= max_len + 1,
    add_special_tokens=True,
    return_token_type_ids=False,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)

input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)

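trainer.model.eval()  # disable dropout for inference
with torch.no_grad():  # gradients are not needed for prediction
  output = trainer.model(input_ids, attention_mask)
_, prediction = torch.max(output[0], dim=1)
print(prediction[0].item())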
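
The printed value is a numeric class id. To display the class name instead, the label mapping can be inverted, as in the sketch below. The function `id_to_label` is a hypothetical helper and the mapping shown is rebuilt here for illustration only; in the notebook, reuse the “label_map” dictionary created for the dataset so that the ids match those seen during training.

```python
def id_to_label(label_map):
    """Invert a {label_name: id} mapping into {id: label_name}."""
    return {index: label for label, index in label_map.items()}

# Illustrative mapping; use the label_map built earlier in a real run.
example_label_map = {"NEG": 0, "NEUTRAL": 1, "OBJ": 2, "POS": 3}
names = id_to_label(example_label_map)

predicted_id = 3  # stand-in for prediction[0].item()
print(names[predicted_id])
```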

Conclusion

In this article, we presented the steps to build a neural network for Arabic sentiment analysis. The code is based on the Hugging Face transformers library. We used the ASTD data and finetuned the Hugging Face model UBC-NLP/MARBERT on the training set. The accuracy is still low: the best result is almost 64%. This is likely due to the small amount of data used for training. One could use a bigger dataset to obtain better results, like this one: https://www.kaggle.com/competitions/arabic-sentiment-analysis-2021-kaust/data