Zero-shot Topic Modeling with Deep Learning Using Python

Transformer-based zero-shot text classification model from Hugging Face for predicting NLP topic classes

Zero-shot learning (ZSL) refers to building a model and using it to make predictions on tasks it was not trained for. For example, if we want to classify millions of news articles into different topics, building a traditional multi-class classification model would be very costly because manually labeling the news topics takes a lot of time.

Zero-shot text classification is able to make class predictions without explicitly building a supervised classification model using a labeled dataset. This tutorial will use an Amazon review dataset to illustrate how to build a zero-shot topic model using Hugging Face’s zero-shot text classification model. We will talk about:

  • What is the algorithm behind the zero-shot text classification model?
  • How to install and import the Hugging Face libraries for the zero-shot text classification model
  • How to implement zero-shot topic modeling for single-topic and multi-topic predictions
  • What to do if there is no list of topic labels for the prediction

Resources for this post:

  • Video tutorial for this post on YouTube
  • Click here for the Colab notebook
  • More video tutorials on NLP
  • More blog posts on NLP

Let’s get started!


Step 0: Zero-shot Topic Modeling Algorithm

In step 0, we will talk about the model algorithm behind the zero-shot topic model.

Zero-shot topic modeling is a use case of zero-shot text classification for topic predictions. Zero-shot text classification is built on a Natural Language Inference (NLI) model, where two sequences are compared to see whether they contradict each other, entail each other, or are neutral (neither contradict nor entail).

When using zero-shot topic modeling, we use the text as the premise and each pre-defined candidate label as a hypothesis. If the model predicts that a text document such as a review entails a topic in the candidate labels, then the document is likely to belong to that topic. Otherwise, the document is not likely to belong to the topic.
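To make this concrete, here is a minimal sketch (not the Hugging Face implementation) of how per-label entailment logits can be turned into single-topic scores with a softmax. The logit values below are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits: one per candidate label, produced by
# running the NLI model on (review, "The topic of this review is {label}.")
candidate_labels = ["sound quality", "battery", "price", "comfortable"]
entailment_logits = [0.5, -1.2, -0.8, 2.1]  # made-up numbers

scores = softmax(entailment_logits)
prediction = dict(sorted(zip(candidate_labels, scores),
                         key=lambda kv: kv[1], reverse=True))
print(prediction)  # the highest-scoring label is the predicted topic
```

Because the softmax is taken across all candidate labels at once, the scores compete with each other, which is why single-topic scores sum to one.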

Step 1: Install And Import Python Libraries

In step 1, we will install and import Python libraries.

Firstly, let’s install transformers.

!pip install transformers

After installing the Python packages, we will import the Python libraries.

  • pandas is imported for data processing.
  • Hugging Face pipeline is imported from transformers for the zero-shot classification model.
  1. task describes the task for the pipeline. The task name we use is zero-shot-classification.
  2. model is the model name for the prediction used in the pipeline. You can find the full list of available models for zero-shot classification on the Hugging Face website. At the time this tutorial was created in January 2023, bart-large-mnli by Facebook (Meta) was the model with the highest number of downloads and likes, so we will use it for the pipeline.
  3. device defines the device type. device=0 means that we are using GPU for the pipeline.
# Data processing
import pandas as pd

# Modeling
from transformers import pipeline

classifier = pipeline(task="zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=0)

Step 2: Download And Read Data

The second step is to download and read the dataset.

The UCI Machine Learning Repository has the review data from three websites: imdb.com, amazon.com, and yelp.com. We will use the review data from amazon.com for this tutorial. Please follow these steps to download the data.

  1. Go to: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
  2. Click “Data Folder”
  3. Download “sentiment labelled sentences.zip”
  4. Unzip “sentiment labelled sentences.zip”
  5. Copy the file “amazon_cells_labelled.txt” to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

  • drive.mount is used to mount to the Google drive so the colab notebook can access the data on the Google drive.
  • os.chdir is used to change the default directory on Google drive. I set the default directory to the folder where the review dataset is saved.
  • !pwd is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

Now let’s read the data into a pandas dataframe and see what the dataset looks like.

The dataset has two columns. One column contains the reviews and the other column contains the sentiment label for the review. Since this tutorial is for topic modeling, we will not use the sentiment label column, so we drop it from the dataset.

# Read in data
amz_review = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Drop the label
amz_review = amz_review.drop('label', axis=1)

# Take a look at the data
amz_review.head()

.info helps us to get information about the dataset.

From the output, we can see that this dataset has 1000 records and no missing data. The review column is the object type.

# Get the dataset information
amz_review.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   review  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB

Step 3: Zero-shot Topic Prediction of a Single Topic

In step 3, we will use the zero-shot topic model to predict one topic for each text document.

  • Firstly, the reviews are put into a list for the pipeline.
  • Then, the candidate labels are defined. We set four candidate labels, sound quality, battery, price, and comfortable.
  • After that, the hypothesis template is defined. The default template used by the Hugging Face pipeline is This example is {}. We use a hypothesis template that is more specific to topic modeling, The topic of this review is {}., which helps to improve the results.
  • Finally, the text, the candidate labels, and the hypothesis template are passed into the zero-shot classification pipeline called classifier.

The output is in a list format, and we convert it into a pandas dataframe.

# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels
candidate_labels = ["sound quality", "battery", "price", "comfortable"]

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Prediction results
single_topic_prediction = classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template)

# Save the output as a dataframe
single_topic_prediction = pd.DataFrame(single_topic_prediction)

# Take a look at the data
single_topic_prediction.head()
Zero-shot topic modeling for single topic — GrabNGoInfo.com

It is not uncommon to get an out-of-memory error when running the zero-shot classification model. To resolve the error, we can set a smaller batch_size for the model. batch_size = 4 means that the model will process 4 text documents at a time. Below is a sample code for your reference.

# Tune the batch_size to fit in the memory
batch_size = 4

# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels
candidate_labels = ["sound quality", "battery", "price", "comfortable"]

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Create an empty list to save the prediction results
single_topic_prediction = []

# Loop through the batches
for i in range(0, len(sequences), batch_size):
    # Append the results
    single_topic_prediction += classifier(sequences[i:i+batch_size], candidate_labels, hypothesis_template=hypothesis_template)

By default, the sum of all scores is 1, so the scores represent the relative relevance to each topic.

The first label in the labels list is the predicted topic for each review, and the first score in the scores list is the corresponding score prediction. For example, the review Great for the jawbone. has the predicted topic of comfortable and the predicted score of 0.76, indicating that comfortable is a much more relevant topic than the other three topics. Note that the score values are not absolute predicted probabilities of the topics; they represent only the relative probability among the given candidate labels.

To make the prediction results easy to read and process, two new columns are created, one for the predicted topic and the other for the score of the predicted topic.

# The column for the predicted topic
single_topic_prediction['predicted_topic'] = single_topic_prediction['labels'].apply(lambda x: x[0])

# The column for the score of the predicted topic
single_topic_prediction['predicted_topic_score'] = single_topic_prediction['scores'].apply(lambda x: x[0])

# Take a look at the data
single_topic_prediction.head()
Zero-shot topic modeling for single topic — GrabNGoInfo.com

Step 4: Zero-shot Topic Prediction of Multiple Topics

In step 4, we will use the zero-shot topic model to predict multiple topics. This is useful when one text document belongs to multiple topics, and we would like to assign one or more topics to a document.

The syntax for multiple-topic prediction is similar to the code for single-topic prediction; the only difference is that we set multi_label=True to allow multiple-label predictions.

In multiple-topic prediction, each candidate label is scored independently, so the scores no longer sum to one. Each score is a value between 0 and 1 indicating the probability of the document belonging to the corresponding topic.
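As a sketch of this independent scoring (the mechanics are simplified and the logits are made up for illustration): for each label, a softmax is taken over just that label's entailment and contradiction logits, so every label gets its own probability between 0 and 1.

```python
import math

def label_probability(entail_logit, contradict_logit):
    # With multi_label=True each label is scored on its own: softmax over
    # the entailment and contradiction logits for that single hypothesis
    e, c = math.exp(entail_logit), math.exp(contradict_logit)
    return e / (e + c)

# Made-up (entailment, contradiction) logits for four candidate labels
logits = {"sound quality": (2.0, -1.0),
          "battery": (-0.5, 1.5),
          "price": (0.3, 0.1),
          "comfortable": (1.2, -0.4)}

scores = {label: label_probability(e, c) for label, (e, c) in logits.items()}
print(scores)  # each score is in (0, 1); the scores need not sum to 1
```

Because each label is scored against only its own contradiction alternative, several labels can all score high at once, which is what makes multi-topic assignment possible.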

# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels
candidate_labels = ["sound quality", "battery", "price", "comfortable"]

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Prediction results
multi_topic_prediction = classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template, multi_label=True)

# Save the output in a dataframe
multi_topic_prediction = pd.DataFrame(multi_topic_prediction)

# Take a look at the data
multi_topic_prediction.head()
Zero-shot topic modeling for multiple topics — GrabNGoInfo.com

To assign multiple labels to a review, a threshold probability for the topic predictions is needed. We set threshold = 0.6, meaning that the labels with a predicted probability greater than or equal to 0.6 are assigned to the reviews.

Before applying the threshold, we expanded the label list and the scores list using pd.Series.explode.

After applying the threshold, all the scores in the dataframe are greater than 0.6. The reviews with multiple topics have multiple rows, one row for each topic.

# Threshold probability
threshold = 0.6

# Expand the lists
multi_topic_prediction = multi_topic_prediction.set_index('sequence').apply(pd.Series.explode).reset_index()

# Filter by threshold
multi_topic_prediction = multi_topic_prediction[multi_topic_prediction['scores'] >= threshold]

# Take a look at the data
multi_topic_prediction.head()
Zero-shot topic modeling for multiple topics — GrabNGoInfo.com

Some reviews are not assigned to any topic because none of the candidate labels have a predicted score above 0.6. For those records, we can examine whether common topics are missing from the candidate labels.

  • If there is a common theme that is not listed in candidate_labels, we can add a new topic and rerun the model.
  • If there is not a common theme across the documents, we can create an other topics category.
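The second option can be sketched as follows. This is a minimal illustration on toy data, not the tutorial's actual output: reviews that survived no threshold filter are collected and appended with an explicit fallback label.

```python
import pandas as pd

# Toy stand-in for the exploded multi-topic prediction dataframe
all_reviews = ["Great for the jawbone.", "Works fine.", "Battery died fast."]
assigned = pd.DataFrame({
    "sequence": ["Great for the jawbone.", "Battery died fast."],
    "labels": ["comfortable", "battery"],
    "scores": [0.76, 0.81],
})

# Reviews with no label above the threshold get a fallback topic
unassigned = sorted(set(all_reviews) - set(assigned["sequence"]))
fallback = pd.DataFrame({
    "sequence": unassigned,
    "labels": ["other topics"] * len(unassigned),
    "scores": [None] * len(unassigned),
})

final_topics = pd.concat([assigned, fallback], ignore_index=True)
print(final_topics)
```

This keeps every review in the final dataframe, so downstream reporting can distinguish "no strong topic" from "missing data".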

Step 5: Topic Model with Unknown Candidate Labels

You might have noticed that a pre-defined list of candidate labels is required for the Hugging Face zero-shot text classification model. These candidate labels are usually from business domain knowledge or past experiences. What if there is no prior knowledge about candidate labels?

In step 5, we will talk about how to build a deep-learning topic model with unknown candidate labels.

If there is no business domain knowledge about what are the typical topics for the corpus, we can train an unsupervised topic model and let the model find the topics for us automatically.

BERTopic is a Python topic modeling library that combines transformer embeddings and clustering algorithms to identify topics in NLP (Natural Language Processing). Please check out my previous tutorial Topic Modeling with Deep Learning Using Python BERTopic to learn how to build topic models when the topics are not pre-defined.

The topic predictions from BERTopic can be used in two ways:

  • The first way is to use the topic predictions directly as the final topic assignment of the text documents.
  • The second way is to extract the candidate labels based on the BERTopic predictions, and then apply the candidate labels in the zero-shot topic model to create the final topic prediction.
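The second way can be sketched as below. The topic-word output shown is hypothetical (a real run of BERTopic would produce it): the top words per discovered topic are joined into human-readable candidate labels, which could then be passed to the zero-shot classifier.

```python
# Hypothetical top words per discovered topic, in the shape returned by an
# unsupervised topic model such as BERTopic (-1 is BERTopic's outlier bucket)
topic_words = {
    -1: ["the", "and", "it"],
     0: ["sound", "quality", "audio"],
     1: ["battery", "charge", "life"],
}

# Turn each real topic's top words into a single candidate label string,
# skipping the outlier bucket
candidate_labels = [" ".join(words)
                    for topic_id, words in topic_words.items()
                    if topic_id != -1]
print(candidate_labels)

# These labels could then be passed to the zero-shot pipeline, e.g.:
# classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template)
```

In practice, the raw top words usually benefit from manual cleanup (for example, renaming "sound quality audio" to "sound quality") before being used as candidate labels.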

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

