Customized Sentiment Analysis: Transfer Learning Using TensorFlow with Hugging Face

Fine-tune a pretrained transformer model for customized sentiment analysis using TensorFlow Keras with Hugging Face



Transfer learning is also called pretrained model fine-tuning. It refers to training a model with a small dataset while leveraging the stored information from a model trained with a large dataset for another task. In this tutorial, we will talk about how to use a small review dataset to build a sentiment prediction model while leveraging a pretrained transformer model. We will cover:

  • What are the benefits of building a customized sentiment analysis transfer learning model?
  • How to build a transfer learning model for sentiment analysis?
  • How to make predictions for a transfer learning model and evaluate the prediction accuracy?
  • How to save a fine-tuned transfer learning model and reload it for sentiment prediction?
  • How to make the transfer learning training process faster for a large dataset?

Resources for this post:

  • Video tutorial for this post on YouTube
  • Click here for the Colab notebook
  • More video tutorials on NLP
  • More blog posts on NLP
Sentiment Analysis Transfer Learning Using Tensorflow – GrabNGoInfo.com

Let’s get started!


Step 1: Benefits of Transfer Learning for Sentiment Analysis

Firstly, let’s talk about the benefits of transfer learning (aka fine-tuning) for sentiment analysis.

  • Compared with lexicon-based sentiment analysis such as VADER or TextBlob, the transfer learning (aka fine-tuning) model for sentiment analysis is usually more accurate. To learn more about lexicon-based sentiment analysis, please check out my previous tutorial TextBlob vs. VADER for Sentiment Analysis Using Python.
  • Compared with building a customized sentiment analysis model from scratch, the transfer learning (aka fine-tuning) model for sentiment analysis needs less data and fewer computational resources. So it saves the cost of collecting data, labeling data, and computation.
  • Because the transfer learning (aka fine-tuning) model leverages the knowledge learned from the pretrained model, it usually has better prediction accuracy than a customized sentiment analysis model built from scratch.
  • Compared with cloud services for sentiment analysis such as Amazon Comprehend, Azure Cognitive Service for Language, Google Natural Language API, and IBM Watson Natural Language Understanding API, the transfer learning (aka fine-tuning) sentiment analysis models have much lower cost because the pretrained models are mostly open source and free to use. They also tend to perform better because they are custom-trained on the domain data.
  • Compared with the zero-shot text classification models or pretrained sentiment analysis models, the transfer learning (aka fine-tuning) model for sentiment analysis is more customized for the specific domain, so it tends to produce more accurate predictions, especially for the highly specialized domains. If you are curious about the Hugging Face zero-shot model vs. Flair pretrained model performance comparison for sentiment analysis, please check out my previous tutorial.

Step 2: Sentiment Analysis Algorithms

Transfer learning is applied to a pretrained model by replacing its last layer with a randomly initialized new head for the new task. For a sentiment analysis transfer learning (aka fine-tuning) model built on a pretrained BERT model, we will remove the head that predicts masked words and replace it with a new head that classifies the sentiment labels (see the sketch after the steps below). Transfer learning usually has the following steps:

  1. Choose a pretrained model that was trained on a large dataset.
  2. Delete the output layer of the pretrained model along with the weights and biases feeding into it.
  3. Create a set of randomly initialized weights and biases for a new output layer for the sentiment analysis task, which is a classifier with two outputs: positive and negative.
  4. Retrain the weights and biases on the new dataset.
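
As a minimal sketch of this idea (for illustration only; the TFAutoModelForSequenceClassification class used in Step 7 performs the head replacement for us), the pretrained BERT body can be loaded and a new two-output classification head stacked on top:

# A minimal sketch of the head replacement described above (illustration only;
# Step 7 uses TFAutoModelForSequenceClassification, which does this automatically)
import tensorflow as tf
from transformers import TFAutoModel

# 1. Load the pretrained BERT body; its original masked-word prediction head is not loaded
base = TFAutoModel.from_pretrained("bert-base-cased")

# 2. Define the inputs produced by the tokenizer
input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# 3. Feed the embedding of the first ([CLS]) token into a new, randomly initialized
#    output layer with two classes (negative and positive)
last_hidden_state = base(input_ids, attention_mask=attention_mask)[0]
logits = tf.keras.layers.Dense(2, name="sentiment_head")(last_hidden_state[:, 0, :])

# 4. The combined model can then be compiled and retrained on the review data
sketch_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=logits)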

Step 3: Install And Import Python Libraries

In step 3, we will install and import the Python libraries.

Firstly, let's install transformers and datasets.

# Install libraries
!pip install transformers datasets

After installing the Python packages, we will import the Python libraries.

  • pandas and numpy are imported for data processing.
  • train_test_split is imported from sklearn to split the dataset.
  • tensorflow and transformers are imported for modeling.
  • Dataset is imported for the Hugging Face dataset format.
  • accuracy_score is imported for model performance evaluation.

# Data processing
import pandas as pd
import numpy as np

# Train test split
from sklearn.model_selection import train_test_split

# Modeling
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Hugging Face Dataset
from datasets import Dataset

# Import accuracy_score to check performance
from sklearn.metrics import accuracy_score

Step 4: Download And Read Data

The fourth step is to download and read the dataset.

The UCI Machine Learning Repository has the review data from three websites: imdb.com, amazon.com, and yelp.com. We will use the review data from amazon.com for this tutorial. Please follow these steps to download the data (or use the code sketch after the list).

  1. Go to: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
  2. Click “Data Folder”
  3. Download “sentiment labelled sentences.zip”
  4. Unzip “sentiment labelled sentences.zip”
  5. Copy the file “amazon_cells_labelled.txt” to your project folder
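
Alternatively, steps 4 and 5 can be done in code. Below is a minimal sketch using the Python standard library, assuming the zip archive has already been downloaded into the project folder under the file name shown on the UCI page:

# Unzip the downloaded archive in code (assumes the zip file is in the current working directory)
import zipfile

with zipfile.ZipFile("sentiment labelled sentences.zip") as zf:
    zf.extractall(".")  # creates the "sentiment labelled sentences" folder used below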

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

  • drive.mount is used to mount to the Google drive so the colab notebook can access the data on the Google drive.
  • os.chdir is used to change the default directory on Google drive. I set the default directory to the folder where the review dataset is saved.
  • !pwd is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

Now let’s read the data into a pandas dataframe and see what the dataset looks like.

The dataset has two columns. One column contains the reviews and the other column contains the sentiment label for the review.

# Read in data
amz_review = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Take a look at the data
amz_review.head()
Amazon reviews for sentiment analysis — GrabNGoInfo.com

.info() helps us get information about the dataset.

# Get the dataset information
amz_review.info()

From the output, we can see that this dataset has 1000 records and no missing data. The review column is the object type and the label column is the int64 type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   review  1000 non-null   object
 1   label   1000 non-null   int64
dtypes: int64(1), object(1)
memory usage: 15.8+ KB

The label value of 0 represents negative reviews and the label value of 1 represents positive reviews. The dataset has 500 positive reviews and 500 negative reviews. It is well-balanced, so we can use accuracy as the metric to evaluate the model performance.

# Check the label distribution
amz_review['label'].value_counts()

Output:

0    500
1    500
Name: label, dtype: int64

Step 5: Train Test Split

In step 5, we will split the dataset and have 80% as the training dataset and 20% as the testing dataset.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(amz_review['review'],
                                                    amz_review['label'],
                                                    test_size=0.20,
                                                    random_state=42)

# Check the number of records in training and testing dataset.
print(f'The training dataset has {len(X_train)} records.')
print(f'The testing dataset has {len(X_test)} records.')

After the train test split, there are 800 reviews in the training dataset and 200 reviews in the testing dataset.

The training dataset has 800 records.
The testing dataset has 200 records.

Step 6: Tokenize Text

In step 6, we will tokenize the review text using a tokenizer.

A tokenizer converts text into numbers to use as the input of the NLP (Natural Language Processing) models. Each number represents a token, which can be a word, part of a word, punctuation, or special tokens.

  • AutoTokenizer.from_pretrained("bert-base-cased") downloads vocabulary from the pretrained bert-base-cased model.
  • return_tensors="np" indicates that the return format is NumPy array. Besides np, return_tensors can take the value of tf or pt, where tf returns TensorFlow tf.constant object and pt returns PyTorch torch.tensor object. If not set, it returns a list of python integers.
  • padding means adding zeros to shorter reviews in the dataset. The padding argument controls how padding is implemented.
  • padding=True is the same as padding='longest'. It checks the longest sequence in the batch and pads zeros to that length. There is no padding if only one text document is provided.
  • padding='max_length' pads to max_length if it is specified, otherwise, it pads to the maximum acceptable input length for the model.
  • padding=False is the same as padding='do_not_pad'. It is the default, indicating that no padding is applied, so it can output a batch with sequences of different lengths.

The labels for the reviews are converted to one-dimensional numpy arrays.

# Tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize the reviews
tokenized_data_train = tokenizer(X_train.to_list(), return_tensors="np", padding=True)
tokenized_data_test = tokenizer(X_test.to_list(), return_tensors="np", padding=True)

# Labels are one-dimensional numpy or tensorflow array of integers
labels_train = np.array(y_train)
labels_test = np.array(y_test)

# Tokenized ids
print(tokenized_data_train["input_ids"][0])

After printing out the tokenized IDs for the first review, we can see that the tokenized words are converted into integers, and the sentence is padded with zeros at the end of the review.

There are two special tokens in the token IDs, 101 at the beginning of the sentence and 102 at the end of the sentence. The BERT tokenizer uses 101 to encode the special token [CLS] and 102 to encode the special token [SEP], but the other models may use other special tokens.

[  101 17554   112   189  2080  2965   119   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0]
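
To double-check what the special token IDs stand for, we can map the token IDs of the first review back to their string form with the tokenizer's convert_ids_to_tokens method (a quick sanity check, not part of the model training pipeline):

# Map the first 10 token IDs of the first review back to tokens:
# ID 101 shows as [CLS], ID 102 as [SEP], and the zero-padded positions as [PAD]
print(tokenizer.convert_ids_to_tokens(tokenized_data_train["input_ids"][0][:10].tolist()))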

Step 7: Compile Transfer Learning Model for Sentiment Analysis

In step 7, we will build a customized transfer learning model for sentiment analysis.

  • TFAutoModelForSequenceClassification loads the pretrained BERT model without its original language-modeling head and adds a sequence classification head on top.
  • The method from_pretrained() loads the weights from the pretrained model into the new model, so the weights of the BERT body are not randomly initialized. The weights of the new sequence classification head, however, are randomly initialized.
  • bert-base-cased is the name of the pretrained model. We can change it to a different model based on the nature of the project.
  • num_labels indicates the number of classes. Our dataset has two classes, positive and negative, so num_labels=2.

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

After loading the pretrained model, we will compile the model.

  • SparseCategoricalCrossentropy is used as the loss function. The Hugging Face documentation mentions that Hugging Face models automatically choose a loss that is appropriate for their task and model architecture if the loss is not explicitly specified.
  • from_logits=True informs the loss function that the output values are logits before applying softmax, so the values do not represent probabilities.
  • We are using Adam as the optimizer, and the number 5e-6 is the learning rate. A smaller learning rate gives more stable weight updates but a slower training process.
  • accuracy is used as the metric because we have a balanced dataset.

# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile model
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])

Step 8: Train Transfer Learning Model for Sentiment Analysis

In step 8, we will talk about how to fit a transfer learning model using TensorFlow Keras.

When fitting the model, we convert the tokenized dataset into a dictionary for Keras. batch_size=4 means that four reviews are processed for each update of the weights and biases. epochs=2 means that the model fitting process will go through the training dataset 2 times.

# Fit the model
model.fit(dict(tokenized_data_train),
          labels_train,
          validation_data=(dict(tokenized_data_test), labels_test),
          batch_size=4,
          epochs=2)

We can see that the accuracy is above 90 percent in just 2 epochs.

Epoch 1/2
200/200 [==============================] - 745s 4s/step - loss: 0.5634 - accuracy: 0.7225 - val_loss: 0.2766 - val_accuracy: 0.9150
Epoch 2/2
200/200 [==============================] - 796s 4s/step - loss: 0.1955 - accuracy: 0.9287 - val_loss: 0.1933 - val_accuracy: 0.9250
<keras.callbacks.History at 0x7fb8bce7bc10>

All the weights will be updated by default for the transfer learning model.

  • If we would like to keep the pretrained model weights as is and only update the weights and bias of the output layer, we can use model.layers[0].trainable = False to freeze the weights of the BERT model.
  • If we would like to keep the weights of some layers and update others, we can use model.bert.encoder.layer[i].trainable = False to freeze the weights of the corresponding layers.
  • In general, if the dataset for the transfer learning model is large, it is suggested to update all weights, and if the dataset for the transfer learning model is small, it is suggested to freeze the pretrained model weights. But we can always compare model performance by unfreezing the pretrained layers one at a time, as in the sketch below.
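
For example, a hedged sketch that keeps the pretrained BERT weights fixed and only trains the new classification head (reusing the model, loss, and optimizer defined in Step 7) could look like this:

# Freeze the pretrained BERT layer (the first layer of the model) so that only the
# new classification head is trained
model.layers[0].trainable = False

# Alternatively, freeze only some of the encoder layers, e.g. the first 8 of the 12 BERT layers
# for i in range(8):
#     model.bert.encoder.layer[i].trainable = False

# Re-compile so the new trainable settings take effect, then fit as in the step above
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])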

Step 9: Sentiment Analysis Transfer Learning Model Prediction and Evaluation

In step 9, we will talk about model prediction and evaluation for sentiment analysis transfer learning.

Passing the tokenized text to the .predict method, we get the predictions from the customized transfer learning sentiment model. The logits are the raw outputs of the last layer of the neural network, before softmax is applied.

# Predictions
y_test_predict = model.predict(dict(tokenized_data_test))['logits']

# First 5 predictions
y_test_predict[:5]

We can see that the prediction has two columns. The first column is the predicted logit for label 0 and the second column is the predicted logit for label 1. logit values do not sum up to 1.

7/7 [==============================] - 31s 4s/step
array([[-1.8452915 ,  0.8984053 ],
       [-2.20764   ,  0.46542013],
       [-2.1693563 ,  1.2320726 ],
       [ 2.4532032 , -2.0910175 ],
       [-2.0541303 ,  0.98942846]], dtype=float32)

To get the predicted probabilities, we need to apply softmax on the predicted logit values.

# Predicted probabilities
y_test_probabilities = tf.nn.softmax(y_test_predict)

# First 5 predicted probabilities
y_test_probabilities[:5]

After applying softmax, we can see that the predicted probability for each review sums up to 1.

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[0.06044362, 0.93955636],
       [0.06458187, 0.9354181 ],
       [0.03225084, 0.9677492 ],
       [0.9894833 , 0.01051667],
       [0.04549637, 0.9545036 ]], dtype=float32)>

To get the predicted labels, argmax is used to return the index of the maximum probability for each review, which corresponds to the predicted label of 0 or 1.

# Predicted label
y_test_class_preds = np.argmax(y_test_probabilities, axis=1)

# First 5 predicted labels
y_test_class_preds[:5]

Output:

array([1, 1, 1, 0, 1])

accuracy_score is used to evaluate the model performance. We can see that the customized sentiment analysis model with transfer learning gives us 92.5% accuracy, meaning that the predictions are correct 92.5% of the time.

# Accuracy
accuracy_score(y_test_class_preds, y_test)

Output:

0.925

Step 10: Save and Load Model

In step 10, we will talk about how to save the model and reload it for prediction.

tokenizer.save_pretrained saves the tokenizer information to the drive and model.save_pretrained saves the model to the drive.

# Save tokenizer
tokenizer.save_pretrained('./sentiment_transfer_learning_tensorflow/')

# Save model
model.save_pretrained('./sentiment_transfer_learning_tensorflow/')

We can load the saved tokenizer later using AutoTokenizer.from_pretrained() and load the saved model using TFAutoModelForSequenceClassification.from_pretrained().

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./sentiment_transfer_learning_tensorflow/")

# Load model
loaded_model = TFAutoModelForSequenceClassification.from_pretrained('./sentiment_transfer_learning_tensorflow/')

To verify that the customized transfer learning model is loaded correctly, the loaded model is used to make predictions on the testing dataset. We can see that the prediction results are exactly the same as the fine-tuned model, confirming that the model is loaded correctly.

# Predict logit using the loaded model
y_test_predict = loaded_model.predict(dict(tokenized_data_test))['logits']

# Take a look at the first 5 predictions
y_test_predict[:5]

Output:

7/7 [==============================] - 31s 4s/step
array([[-1.8452915 ,  0.8984053 ],
       [-2.20764   ,  0.46542013],
       [-2.1693563 ,  1.2320726 ],
       [ 2.4532032 , -2.0910175 ],
       [-2.0541303 ,  0.98942846]], dtype=float32)
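
To score new, raw review text with the reloaded model instead of the already tokenized testing dataset, one option is to wrap the model and tokenizer in a Hugging Face pipeline. The reviews below are made up for illustration, and the labels appear as LABEL_0 (negative) and LABEL_1 (positive) because we did not set custom label names:

# Wrap the reloaded model and tokenizer in a text-classification pipeline
from transformers import pipeline

sentiment_pipeline = pipeline("text-classification", model=loaded_model, tokenizer=tokenizer)

# Score a couple of made-up reviews
sentiment_pipeline(["The battery life on this phone is amazing.",
                    "The charger stopped working after a week."])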

Step 11: Sentiment Model Using Transfer Learning on Large Dataset

In step 11, we will talk about how to handle large datasets with Hugging Face transfer learning.

The training process can be very slow for large datasets because of the size of the tokenized array and the padding tokens. But we can load the data as tf.data.Dataset to make the process faster.

  • Firstly, the pandas dataframe needs to be converted to the Hugging Face arrow dataset using Dataset.from_pandas().
  • Then a tokenizer needs to be initiated.
  • After that, the tokenizer is applied to the Hugging Face arrow dataset.
  • The pretrained model is loaded using TFAutoModelForSequenceClassification.from_pretrained().
  • Finally, the dataset is loaded using prepare_tf_dataset().

# Convert pandas dataframe to Hugging Face arrow dataset
hg_amz_review = Dataset.from_pandas(amz_review)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Function to tokenize data
def tokenize_dataset(data):
    return tokenizer(data["review"])

# Tokenize the dataset
dataset = hg_amz_review.map(tokenize_dataset)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# TF dataset
tf_dataset = model.prepare_tf_dataset(dataset=dataset,
                                      batch_size=16,
                                      shuffle=True,
                                      tokenizer=tokenizer)

prepare_tf_dataset() is a method to wrap Hugging Face Dataset as a tf.data.Dataset with collation and batching.

  • dataset takes in a Hugging Face Dataset that is to be wrapped as a tf.data.Dataset.
  • batch_size=16 means that in each batch 16 records will be processed. The default batch_size is 8.
  • shuffle=True indicates that the samples from the dataset will be returned in random order. The default value is True. The Hugging Face documentation mentions that it is usually set to True for training datasets and False for validation or testing datasets.
  • tokenizer is a PreTrainedTokenizer for padding samples.

After the dataset is converted to a tf.data.Dataset, the model is compiled and fit on the dataset.

Because the Hugging Face datasets are stored on disk by default, they will not increase memory usage. The batches can be streamed from the dataset and the paddings can be added to each batch, which saves time and memory compared to padding the entire dataset.

# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile model
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])

# Fit the model
model.fit(tf_dataset,
          epochs=2)
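
If we also want to monitor validation performance in this streamed setup, one hedged option is to split the tokenized Hugging Face dataset first with its built-in train_test_split method and wrap each split separately, mirroring the 80/20 split from Step 5:

# Split the tokenized Hugging Face dataset into train and validation sets
split = dataset.train_test_split(test_size=0.2, seed=42)

# Wrap each split as a streamed, batched tf.data.Dataset
tf_train = model.prepare_tf_dataset(dataset=split["train"],
                                    batch_size=16,
                                    shuffle=True,
                                    tokenizer=tokenizer)
tf_valid = model.prepare_tf_dataset(dataset=split["test"],
                                    batch_size=16,
                                    shuffle=False,
                                    tokenizer=tokenizer)

# Fit with validation monitoring
model.fit(tf_train, validation_data=tf_valid, epochs=2)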

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

