Hyperparameter tuning is an important optimization step for building a good topic model. BERTopic is a Python topic modeling library that combines transformer embeddings and clustering algorithms to identify topics in NLP (Natural Language Processing). In this tutorial, we will talk about the following:

- What are the hyperparameters for the BERTopic model?
- How to tune the hyperparameters for the topic model?
- How to compare the results with different hyperparameter values?

Please check out my previous tutorial Topic Modeling with Deep Learning Using Python BERTopic for an introduction to BERTopic.

Let’s get started!

### Step 1: Install And Import Python Libraries

In step 1, we will install and import the Python libraries.

First, let’s install `bertopic` and `flair`.

# Install bertopic

!pip install bertopic flair

After the installation, when we tried to import the `BERTopic` class, a `TypeError` about an unexpected keyword argument `cachedir` came up.

This `TypeError` is caused by an incompatibility between `joblib` and `HDBSCAN`. At the time this tutorial was created, `joblib` had a new release that was not supported by `HDBSCAN`. HDBSCAN has a fix for it, but the fix had not been released yet. So if you are watching this tutorial on YouTube or reading it on Medium.com at a later time, you may not encounter this error message.

Until the incompatibility between `joblib` and `HDBSCAN` is fixed, we can work around the issue by installing an older version of `joblib`. In this example, we used `joblib` version 1.1.0. After installing `joblib`, we need to restart the runtime.

# Install older version of joblib

!pip install --upgrade joblib==1.1.0
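
Before pinning, it can help to check whether the installed `joblib` is newer than the version that worked here. The snippet below is a minimal sketch: the helper names `parse_version` and `needs_downgrade` are hypothetical, and the assumption that any release newer than 1.1.0 may trigger the error is mine, since the text only states that 1.1.0 works.

```python
def parse_version(version):
    """Turn a version string such as '1.2.0' into a tuple of integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def needs_downgrade(installed, known_good="1.1.0"):
    """Assumption: any release newer than the known-good one may break the import."""
    return parse_version(installed) > parse_version(known_good)


print(needs_downgrade("1.2.0"))  # True
print(needs_downgrade("1.1.0"))  # False
```

In a notebook, the installed version string can be read with `importlib.metadata.version("joblib")` and passed to `needs_downgrade`.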

After installing the python packages, we will import the python libraries.

- `pandas` and `numpy` are imported for data processing.
- `UMAP` and `PCA` are for dimension reduction.
- `HDBSCAN` and `KMeans` are for clustering models.
- `CountVectorizer` is for term vectorization.
- `sentence_transformers` and `flair` are for pretrained document embeddings.
- `BERTopic` is for the topic modeling.

# Data processing

import pandas as pd

import numpy as np

# Dimension reduction

from umap import UMAP

from sklearn.decomposition import PCA

# Clustering

from hdbscan import HDBSCAN

from sklearn.cluster import KMeans

# Count vectorization

from sklearn.feature_extraction.text import CountVectorizer

# Sentence transformer

from sentence_transformers import SentenceTransformer

# Hugging Face pipeline

from transformers.pipelines import pipeline

# Flair

from flair.embeddings import TransformerDocumentEmbeddings, WordEmbeddings, DocumentPoolEmbeddings, StackedEmbeddings

# Topic model

from bertopic import BERTopic

### Step 2: Download And Read Data

The second step is to download and read the dataset.

The UCI Machine Learning Repository has the review data from three websites: imdb.com, amazon.com, and yelp.com. We will use the review data from amazon.com for this tutorial. Please follow these steps to download the data.

- Go to: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
- Click “Data Folder”
- Download “sentiment labelled sentences.zip”
- Unzip “sentiment labelled sentences.zip”
- Copy the file “amazon_cells_labelled.txt” to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

- `drive.mount` is used to mount Google Drive so the Colab notebook can access the data stored there.
- `os.chdir` is used to change the working directory on Google Drive. I set it to the folder where the review dataset is saved.
- `!pwd` is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive

from google.colab import drive

drive.mount('/content/drive')

# Change directory

import os

os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory

!pwd

Now let’s read the data into a `pandas` dataframe and see what the dataset looks like.

The dataset has two columns. One column contains the reviews and the other contains the sentiment label for each review. Since this tutorial is about topic modeling, we will not use the sentiment label column, so we remove it from the dataset.

# Read in data

amz_review = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Drop the label

amz_review = amz_review.drop('label', axis=1)

# Take a look at the data

amz_review.head()

`.info()` helps us to get information about the dataset.

From the output, we can see that this dataset has 1000 records and no missing data. The `review` column is the `object` type.

# Get the dataset information

amz_review.info()

Output:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1000 entries, 0 to 999

Data columns (total 1 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 review 1000 non-null object

dtypes: object(1)

memory usage: 7.9+ KB

### Step 3: Hyperparameters for Dimensionality Reduction

In step 3, we will talk about the hyperparameters for dimensionality reduction in BERTopic.

Dimensionality reduction is necessary because the clustering model works better for low-dimension data than high-dimension data. The document embeddings usually have hundreds of dimensions, so we need to reduce the dimensionality before passing the embeddings to a clustering model.

The default algorithm for dimension reduction is UMAP (Uniform Manifold Approximation and Projection). Compared with other dimension reduction techniques such as PCA (Principal Component Analysis), UMAP tends to preserve both the local and the global structure of the data when reducing the dimensionality, which is important for representing the semantics of the text data. The UMAP model accepts customized hyperparameters.

- `n_neighbors` is the local neighborhood size for UMAP. This is the parameter that controls the local versus global structure in data.
  - A low value forces UMAP to focus more on the local structure and may lose insights into the big picture.
  - A high value pushes UMAP to look at broader neighborhoods and may lose details on local structure. This usually results in larger clusters.
  - The default `n_neighbors` value for UMAP is 15.
- `n_components` indicates the output dimension for UMAP. This is the dimension of the data that will be passed into the clustering model.
- `min_dist` controls how tightly UMAP is allowed to pack points together. It is the minimum distance between points in the low-dimensional space.
  - Small values of `min_dist` result in clumpier embeddings, which is good for clustering. Since our goal of dimension reduction is to build clustering models, we set `min_dist` to 0.
  - Large values of `min_dist` prevent UMAP from packing points together and preserve the broad structure of the data.
- `metric='cosine'` indicates that we will use cosine distance.
- `random_state` sets a random seed to make the UMAP results reproducible.
  - The BERTopic model by default produces different results each time because of the stochasticity inherited from UMAP.
  - To get reproducible topics, we need to pass a value to the `random_state` parameter when creating the `UMAP` model.
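
To compare several UMAP settings side by side, one option is to build a small grid of parameter dictionaries first. The sketch below uses only the standard library; the candidate values for `n_neighbors` and `min_dist` are illustrative assumptions, not recommendations, while `n_components`, `metric`, and `random_state` reuse the values from this step.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned recommendations.
n_neighbors_values = [5, 15, 50]
min_dist_values = [0.0, 0.1]

umap_grid = [
    {
        "n_neighbors": n_neighbors,
        "n_components": 5,
        "min_dist": min_dist,
        "metric": "cosine",
        "random_state": 100,  # fixed seed so the runs stay comparable
    }
    for n_neighbors, min_dist in product(n_neighbors_values, min_dist_values)
]

for params in umap_grid:
    print(params)
```

Each dictionary can then be unpacked into the constructor, for example `UMAP(**params)`, and the resulting model passed to `BERTopic` through `umap_model`.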

After initiating the UMAP model with the hyperparameters, we pass it to the BERTopic model, and run the model using the review data.

# Initiate UMAP

umap_model = UMAP(n_neighbors=15,

n_components=5,

min_dist=0.0,

metric='cosine',

random_state=100)

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

Calling the `get_topic_info()` method on the topic model gives us a list of topics. We can see that the output gives us 25 rows in total.

- Topic -1 should be ignored. It indicates that the reviews are not assigned to any specific topic. The count for topic -1 is 277, meaning that there are 277 outlier reviews that do not belong to any topic.
- Topic 0 to topic 23 are the 24 topics created for the reviews. They are ordered by the number of reviews in each topic, so topic 0 has the highest number of reviews.
- The Name column lists the top terms for each topic. For example, the top 4 terms for Topic 0 are `sound`, `hear`, `quality`, and `the`, indicating that it is a topic related to sound quality.
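
When comparing runs with different hyperparameters, two numbers are handy to track: how many topics were found and what share of documents ended up as outliers. The helper below is a hypothetical sketch (the name `summarize_topics` is not part of BERTopic) that works on any list of per-document topic ids where -1 marks an outlier. The toy list mimics the counts reported above; the split across topics is made up.

```python
from collections import Counter


def summarize_topics(topics):
    """Summarize a list of per-document topic ids, where -1 marks an outlier."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    n_topics = len([topic for topic in counts if topic != -1])
    return {
        "n_topics": n_topics,
        "n_outliers": n_outliers,
        "outlier_share": n_outliers / len(topics) if topics else 0.0,
    }


# Toy assignments: 1000 reviews, 277 outliers, 24 topics (distribution invented).
toy_topics = [-1] * 277 + [i % 24 for i in range(723)]
print(summarize_topics(toy_topics))
```

With BERTopic, the first value returned by `fit_transform` is such a list of topic ids, so it can be passed to the helper directly.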

BERTopic provides the option of using other dimensionality reduction techniques by changing the `umap_model` value passed to `BERTopic`.

For example, we can use the widely used dimension reduction algorithm PCA to replace UMAP.

# PCA for dimensionality reduction

pca_model = PCA(n_components=15)

# Initiate BERTopic

topic_model = BERTopic(umap_model=pca_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

The output shows that 5 topics are created, which is far fewer than the 24 topics using UMAP.

### Step 4: Hyperparameters for Clustering Model

In step 4, we will talk about the hyperparameters for the clustering model in BERTopic.

After the text documents have been transformed into embeddings, and the embeddings’ dimensionality has been reduced, the next step is to run a clustering model on the embedded documents.

The default algorithm for clustering is HDBSCAN. HDBSCAN is a density-based clustering model. It identifies the number of clusters automatically, so unlike most clustering models, it does not require specifying the number of clusters beforehand.

HDBSCAN has a few important hyperparameters.

- `min_cluster_size` is the minimum number of documents in each cluster.
  - A larger minimum cluster size results in bigger clusters and fewer total clusters.
  - A smaller minimum cluster size results in smaller clusters and a larger number of total clusters.
  - A rule of thumb is to increase this threshold for a large dataset and keep it at 10, the default that BERTopic uses, for a small dataset.
- `min_samples` controls the number of outliers. It defaults to the same value as `min_cluster_size`. Reducing the value helps to reduce the noise in the topics.
- `metric` indicates the distance metric used for the clustering model, such as `euclidean`.
- `prediction_data` is for predicting the topics of new documents. We can set it to `False` if there is no need for new document prediction.

After specifying the hyperparameters for the HDBSCAN model, we pass the model into `BERTopic`. Notice that when initiating the BERTopic model, the `umap_model` from the previous step is passed in as well. This is because we would like to use the same random seed defined in the UMAP model to make the results comparable. We will include `umap_model` for all the topic models going forward.

# Clustering model

hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples = 10, metric='euclidean', prediction_data=True)

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

Besides the HDBSCAN clustering model, BERTopic supports other clustering algorithms.

There are two major differences between HDBSCAN and other clustering algorithms such as K-Means and GMM (Gaussian Mixture Model).

- HDBSCAN creates a separate cluster for outliers, but most other algorithms do not, so the clusters may contain more noise than HDBSCAN clusters.
- HDBSCAN automatically decides the number of clusters, while most other clustering algorithms need to have the number of clusters as input.
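
The first difference can be seen in a minimal sketch of a centroid-based assignment step. This is not the K-Means algorithm itself, only the nearest-centroid rule it relies on, and the points and centroids are made-up values. The function name `nearest_centroid_labels` is hypothetical.

```python
import math


def nearest_centroid_labels(points, centroids):
    """Assign every point to the index of its closest centroid."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels


# Four points near two centers, plus one point far away from everything.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (100.0, -100.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

labels = nearest_centroid_labels(points, centroids)
print(labels)  # [0, 0, 1, 1, 0]
```

The far-away last point still receives a regular cluster label, whereas a density-based model such as HDBSCAN could mark it as an outlier with the label -1.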

To learn more about the clustering model, please check out my previous tutorials 5 Ways for Deciding Number of Clusters and 4 Clustering Model Algorithms in Python.

We provide the sample code to implement a K-Means clustering model with BERTopic, and other clustering model algorithms can follow the same process.

- First, the K-Means model is initiated with the number of clusters.
- Then the initiated K-Means model is passed into the `hdbscan_model` parameter of `BERTopic`.
- After that, the BERTopic model is fit and we get a list of 15 topics.

# Clustering model

kmeans_model = KMeans(n_clusters=15)

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=kmeans_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

### Step 5: Hyperparameter Tuning for Language Embeddings

In step 5, we will talk about how to tune the language embeddings.

Embeddings are the vector representation of the documents. BERTopic uses the English version of the Sentence Transformer by default to get document embeddings.

If the documents contain multiple languages, we can use `BERTopic(language="multilingual")` to support topic modeling in over 50 languages.

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, language="multilingual")

On the backend, the `language` parameter of `BERTopic` selects a sentence-transformer model.

- When `language="english"`, the sentence-transformer model `all-MiniLM-L6-v2` is used.
- When `language="multilingual"`, the sentence-transformer model `paraphrase-multilingual-MiniLM-L12-v2` is used.

Sentence-transformer has different models with different sizes, speeds, and performances. We can go to the sentence transformer website for the latest list of pretrained models.

We can select any model from sentence transformers and pass it to BERTopic with the `embedding_model` parameter.

For example, to use the sentence transformer model `paraphrase-albert-small-v2`, we can load it with `SentenceTransformer` and set it as the `embedding_model`.

# Initiate a sentence transformer model

sentence_model = SentenceTransformer("paraphrase-albert-small-v2")

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, embedding_model=sentence_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

The output gives us 21 topics, and the topics look similar to the topics from the default parameters.

Besides the sentence-transformer models, BERTopic supports pretrained models from other Python packages such as Hugging Face Transformers and Flair.

The Hugging Face model hub has thousands of pretrained models. In this example, we used an English model called `distilroberta-base`, loaded it in a Hugging Face pipeline, and passed the pipeline to the `embedding_model` parameter.

# Initiate a pretrained model

hf_model = pipeline("feature-extraction", model="distilroberta-base")

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, embedding_model=hf_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

We can see that two topics are created, and the terms are not very meaningful, indicating that `distilroberta-base` is not a good choice for our review corpus.

Flair is an NLP (Natural Language Processing) library that allows us to choose almost any embedding model, or combine a few embedding models together.

To use a single embedding model with Flair, we can pass the model name to `TransformerDocumentEmbeddings`, and use it as the input for the `embedding_model` option in `BERTopic`.

# Initiate a pretrained embedding model

roberta_model = TransformerDocumentEmbeddings('roberta-base')

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, embedding_model=roberta_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

We can see that two topics are created, and all the terms are stopwords without much meaning, indicating that this is not a good embedding model choice for our review corpus.

To use multiple embedding models with Flair, we first need to initiate different pretrained embedding models, then use `StackedEmbeddings` to stack the models, and finally pass the stacked embeddings to the `embedding_model` parameter of `BERTopic`.

# Initiate a pretrained embedding model

roberta_model = TransformerDocumentEmbeddings('roberta-base')

# Initiate another pretrained embedding model ('crawl' loads FastText vectors in Flair, despite the variable names below)

glove_embedding = WordEmbeddings('crawl')

document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

# Stack the two pretrained embedding models

stacked_embeddings = StackedEmbeddings(embeddings=[roberta_model, document_glove_embeddings])

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, embedding_model=stacked_embeddings)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

We can see that 14 topics are created. The results look better than just using one model, but still not as good as the sentence transformer embedding models.

### Step 6: Hyperparameter Tuning for Number of Topics

In step 6, we will talk about how to change the number of topics for the topic model.

BERTopic uses the number of clusters created by the HDBSCAN model as the number of topics by default, but we can reduce the number of topics by changing the value of the `nr_topics` parameter.

- `nr_topics=None` indicates that there is no topic reduction.
- `nr_topics="auto"` indicates an automatic topic reduction of the HDBSCAN results by merging topics close to each other.
- `nr_topics=15` indicates that the target number of topics is 15.
- The `nr_topics` value should always be smaller than the number of topics created by `nr_topics=None`.

On the backend, the topic reduction process is executed by merging similar topics based on the feature vector from c-TF-IDF. It starts with low-frequency topics and iteratively reduces the number of topics to the specified number.
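
The merging idea described above can be sketched in plain Python. This is not BERTopic's implementation: it assumes cosine similarity as the similarity measure, uses a size-weighted average as a stand-in for recomputing c-TF-IDF after each merge, and works on invented three-dimensional toy vectors. The function name `reduce_topics_sketch` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_topics_sketch(vectors, sizes, nr_topics):
    """Merge the smallest topic into its most similar topic until nr_topics remain.

    vectors: dict of topic id -> toy feature vector (stand-in for c-TF-IDF)
    sizes:   dict of topic id -> number of documents
    Returns (mapping of original topic id -> final topic id, final sizes).
    """
    vectors = {topic: list(vec) for topic, vec in vectors.items()}
    sizes = dict(sizes)
    mapping = {topic: topic for topic in vectors}
    nr_topics = max(nr_topics, 1)  # at least one topic must remain

    while len(vectors) > nr_topics:
        # Start with the least frequent topic.
        smallest = min(sizes, key=lambda topic: (sizes[topic], topic))
        target = max(
            (topic for topic in vectors if topic != smallest),
            key=lambda topic: cosine_similarity(vectors[smallest], vectors[topic]),
        )
        total = sizes[smallest] + sizes[target]
        # Assumption: a size-weighted average replaces recomputing c-TF-IDF.
        vectors[target] = [
            (a * sizes[target] + b * sizes[smallest]) / total
            for a, b in zip(vectors[target], vectors[smallest])
        ]
        sizes[target] = total
        del vectors[smallest], sizes[smallest]
        for original in list(mapping):
            if mapping[original] == smallest:
                mapping[original] = target

    return mapping, sizes


toy_vectors = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.9, 0.1, 0.0], 3: [0.0, 0.8, 0.2]}
toy_sizes = {0: 50, 1: 30, 2: 10, 3: 5}

mapping, final_sizes = reduce_topics_sketch(toy_vectors, toy_sizes, nr_topics=2)
print(mapping)      # {0: 0, 1: 1, 2: 0, 3: 1}
print(final_sizes)  # {0: 60, 1: 35}
```

In the toy run, the smallest topic (3) is folded into topic 1 first, then topic 2 is folded into topic 0, which leaves the two largest topics.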

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, nr_topics=15)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

After setting `nr_topics=15`, we can see that the BERTopic model produced 15 topics.

When the text corpus is large, training a BERTopic model can take a long time. Rerunning the model each time we change the number of topics can waste a lot of time and resources. The good news is that the BERTopic package has a `reduce_topics` method that uses the existing model information to do a topic reduction.

# Further reduce topics

topic_model.reduce_topics(amz_review['review'], nr_topics=10)

# Get the list of topics

topic_model.get_topic_info()

After passing in the review corpus and the number of topics, we can see that the number of topics is reduced to the specified number of 10.

If we would like to manually pick which topics to merge together based on domain knowledge, we can list the topic numbers and pass them into the `merge_topics` method.

In this example, we merged topic 0 and topic 3 together because they both talk about headphone quality, and merged topic 2 and topic 6 together because they both talk about product satisfaction. The number of topics is reduced by two, and we have 8 topics now.

topics_to_merge = [[0, 3],

[2, 6]]

topic_model.merge_topics(amz_review['review'], topics_to_merge)

# Get the list of topics

topic_model.get_topic_info()

Another way of adjusting the number of topics is to control the minimum number of documents in a topic. We can set this value with the `min_topic_size` parameter.

- A low value for `min_topic_size` allows fewer documents to form a topic, so the topic model produces more topics.
- A high value for `min_topic_size` requires a lot of documents to form a topic, so the topic model produces fewer topics.
- The default value for `min_topic_size` is 10. A general guideline is to set a low value for a smaller dataset and a high value for a larger dataset.

Setting `min_topic_size` is the same as setting `min_cluster_size` in HDBSCAN.

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, min_topic_size=25)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

We can see that after setting the minimum topic size to 25, we get fewer topics, and each topic has at least 25 documents.

### Step 7: Hyperparameter for Top Words

In step 7, we will tune the hyperparameters for the top words representing the topics. The parameters that we will talk about are `n_gram_range` and `top_n_words`.

- `n_gram_range` is used to specify the range of n-grams included in the topic model.
- `top_n_words` controls how many words are used to describe the topic.

Let’s take a look at `n_gram_range` first.

An n-gram is a contiguous sequence of n words. The `n_gram_range` value determines the terms used in `CountVectorizer` and the top words representing the topics.

- Unigram refers to one word. Unigram is the default for BERTopic.
- Bigram refers to two consecutive words. For example, “ice cream” is considered one term as a bigram, but is separated into two terms, “ice” and “cream”, as unigrams.
- Trigram refers to three consecutive words.
- Cardinal numbers are used for more than three consecutive words, for example, four-gram and five-gram.

`n_gram_range=(1, 3)` means that unigrams, bigrams, and trigrams are included in the model.
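
To make the range concrete, here is a small sketch that lists the n-grams a range would produce for one sentence. It splits on whitespace only, which is a simplification compared with the tokenization in `CountVectorizer`, and the function name `extract_ngrams` is hypothetical.

```python
def extract_ngrams(text, n_gram_range=(1, 1)):
    """Return all word n-grams of text for every n in the inclusive range."""
    tokens = text.lower().split()
    low, high = n_gram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(extract_ngrams("ice cream is great", (1, 1)))
# ['ice', 'cream', 'is', 'great']
print(extract_ngrams("ice cream is great", (1, 3)))
```

With the range `(1, 3)`, the four-word sentence yields four unigrams, three bigrams such as "ice cream", and two trigrams.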

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, n_gram_range=(1, 3))

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

From the output, we can see that the keywords representing the topics include both single words and multi-word phrases.

Next, let’s talk about `top_n_words`. `top_n_words` has the default value of 10, meaning that the top 10 words will be used to represent each topic. If we change the value to 5, only the five most representative words are included.

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, top_n_words=5)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get top topic words

topic_model.get_topic(1)

Output:

[('phone', 0.1329223535845747),

('this', 0.06647473203777743),

('have', 0.03391477256848962),

('had', 0.033855540482838516),

('great', 0.02960535516059868)]

### Step 8: Hyperparameters for Words Universe

In step 8, we will talk about how to control the number of words for the topic model. Limiting the number of words helps to reduce the noise in the topics.

There are two ways to control how many words are used in `CountVectorizer` and `c-TF-IDF`.

- `min_df` sets a minimum document frequency. For example, `min_df=10` tells `CountVectorizer` to ignore any word that appears in fewer than 10 of the documents it is fitted on, so those words are not included in the c-TF-IDF calculation. A general guideline is to set a high `min_df` value for a large corpus and a low value for a small corpus.
- `max_features` indicates the maximum number of words to include in the c-TF-IDF calculation. `max_features=1_000` means that the top 1000 words with the highest frequency in the corpus will be included.

Both `min_df` and `max_features` are hyperparameters of `CountVectorizer`.
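
The effect of the two settings can be illustrated with a toy vocabulary builder. This is a simplified sketch, not scikit-learn's code: it assumes `min_df` counts the documents a word appears in (which is how scikit-learn documents it), splits on whitespace, and breaks frequency ties alphabetically. The function name `build_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1, max_features=None):
    """Keep terms seen in at least min_df documents, then cap by total frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    term_freq = Counter(token for doc in tokenized for token in doc)
    doc_freq = Counter(token for doc in tokenized for token in set(doc))

    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent terms first; ties are broken alphabetically in this sketch.
    kept.sort(key=lambda term: (-term_freq[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


toy_docs = ["great phone", "great battery", "phone case broke", "great great sound"]

print(build_vocabulary(toy_docs, min_df=2))        # ['great', 'phone']
print(build_vocabulary(toy_docs, max_features=1))  # ['great']
```

Raising `min_df` drops the rare words, while `max_features` keeps only the most frequent ones regardless of how many documents they appear in.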

To use `min_df`, we set the value when initiating `CountVectorizer`, then pass it to the `vectorizer_model` argument of `BERTopic`.

# Count vectorizer

vectorizer_model = CountVectorizer(min_df=10)

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

To use `max_features`, we set the value when initiating `CountVectorizer`, then pass it to the `vectorizer_model` argument of `BERTopic`.

# Count vectorizer

vectorizer_model = CountVectorizer(max_features=1_000)

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

### Step 9: Hyperparameter for Diversifying Topic Representation

In step 9, we will tune the hyperparameter to achieve a more diversified topic representation.

The top n words that represent the topic may include variations of the same word or words that are synonyms.

The hyperparameter `diversity` helps to remove words with the same or similar meanings. It has a range of 0 to 1, where 0 means least diversity and 1 means most diversity.

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, diversity=0.8)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

After setting `diversity=0.8`, we can see that the top words describing the topics are more diversified. For example, topic 13 previously had the top words `disappointed_very_disappointment_disappointing`, and has the top words `disappointment_aggravating_youd_displeased` after diversification, which are much more varied.
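
One common way to trade relevance against redundancy is maximal marginal relevance (MMR), and my understanding is that the `diversity` setting builds on this idea. The sketch below is an assumption-laden illustration rather than BERTopic's code: the two-dimensional word vectors and the topic vector are invented, and the function name `mmr_select` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(topic_vector, word_vectors, top_n, diversity):
    """Greedily pick words that are relevant to the topic but unlike earlier picks."""
    selected = []
    remaining = list(word_vectors)
    while remaining and len(selected) < top_n:

        def score(word):
            relevance = cosine_similarity(word_vectors[word], topic_vector)
            redundancy = max(
                (cosine_similarity(word_vectors[word], word_vectors[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Invented toy vectors: three near-duplicates and two more distinct words.
toy_topic = [1.0, 0.0]
toy_words = {
    "disappointed": [1.0, 0.10],
    "disappointing": [1.0, 0.12],
    "disappointment": [1.0, 0.15],
    "aggravating": [0.7, 0.7],
    "displeased": [0.8, -0.6],
}

print(mmr_select(toy_topic, toy_words, top_n=3, diversity=0.0))
# ['disappointed', 'disappointing', 'disappointment']
print(mmr_select(toy_topic, toy_words, top_n=3, diversity=0.8))
# ['disappointed', 'displeased', 'aggravating']
```

With `diversity=0.0` the three near-duplicate words win on relevance alone, while a high value penalizes words that resemble ones already chosen.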

### Step 10: Hyperparameter for Stopwords

In step 10, we will talk about how to remove the stopwords from the list of the top words.

After creating the topics, if the top words representing the topics contain stopwords, we can remove the stopwords using `stop_words="english"` with `CountVectorizer`.

# Count vectorizer

vectorizer_model = CountVectorizer(stop_words="english")

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model)

# Run BERTopic model

topics, probs = topic_model.fit_transform(amz_review['review'])

# Get the list of topics

topic_model.get_topic_info()

We can see that after using `stop_words="english"`, all the stopwords are removed from the top keywords.
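
Conceptually, stopword removal happens while the text is split into tokens, before any word counts are computed. The sketch below illustrates that step with a tiny, made-up stopword set; the real English list in scikit-learn is much longer, and the function name `tokenize` is hypothetical.

```python
import re

# A tiny illustrative stopword set; scikit-learn's built-in English list is much longer.
TOY_STOP_WORDS = {"the", "is", "this", "have", "had", "it", "and"}


def tokenize(text, stop_words=None):
    """Lowercase, keep tokens of two or more word characters, drop stopwords."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    if stop_words is None:
        return tokens
    return [token for token in tokens if token not in stop_words]


print(tokenize("The sound quality is great"))
# ['the', 'sound', 'quality', 'is', 'great']
print(tokenize("The sound quality is great", TOY_STOP_WORDS))
# ['sound', 'quality', 'great']
```

Because the stopwords never reach the count matrix, they cannot show up among the top words of a topic.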

### Step 11: Hyperparameter for Topic Probability Output

In step 11, we will talk about the boolean parameter that decides whether or not to produce topic probabilities.

- When `calculate_probabilities = True`, the probabilities of each document belonging to each topic are calculated. The topic with the highest probability is the predicted topic for a new document. This probability represents how confident we are about finding the topic in the document.
- When `calculate_probabilities = False`, the probabilities of each document belonging to each topic are not calculated. This saves computation time and cost. If there is no new document to predict, we do not need to calculate the probabilities.

We can visualize the probabilities using `visualize_distribution`, passing in the probabilities for the document at a given index. `visualize_distribution` has the default probability threshold of 0.015, so only the topics with a probability greater than 0.015 will be included.

# Initiate BERTopic

topic_model = BERTopic(umap_model=umap_model, calculate_probabilities=True)

# Run BERTopic model

topics, probabilities = topic_model.fit_transform(amz_review['review'])

# Visualize probability distribution

topic_model.visualize_distribution(topic_model.probabilities_[0], min_probability=0.015)

The topic probability distribution for the first review in the dataset shows that topic 7 has the highest probability, so topic 7 is the predicted topic.

The first review is “So there is no way for me to plug it in here in the US unless I go by a converter.”, and the topic of plugging a charger is pretty relevant.

# Check the content for the first review

amz_review['review'][0]

Output:

So there is no way for me to plug it in here in the US unless I go by a converter.

We can also get the predicted probability for all topics using the code below.

# Get probabilities for all topics

topic_model.probabilities_[0]

Output:

array([0.0126781 , 0.00997078, 0.00806085, 0.01186496, 0.01055103,

0.02589788, 0.01059081, 0.09212873, 0.01008473, 0.00990964,

0.00826178, 0.00974247, 0.02565346, 0.00914894, 0.01153246,

0.0117639 , 0.01009115, 0.01135396, 0.01329444, 0.01567501,

0.01403915, 0.01352111, 0.01451199, 0.01091581])

We can see that there are 24 probability values, one for each topic. Index 7 has the highest value, indicating that topic 7 is the predicted topic.
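
The same reading can be done programmatically. The short sketch below copies the 24 values from the output above into a plain list, then finds the index with the highest probability and the topics that exceed the 0.015 threshold mentioned earlier. The variable names are hypothetical.

```python
# Probabilities copied from the output above (first review, 24 topics).
first_review_probabilities = [
    0.0126781, 0.00997078, 0.00806085, 0.01186496, 0.01055103,
    0.02589788, 0.01059081, 0.09212873, 0.01008473, 0.00990964,
    0.00826178, 0.00974247, 0.02565346, 0.00914894, 0.01153246,
    0.0117639, 0.01009115, 0.01135396, 0.01329444, 0.01567501,
    0.01403915, 0.01352111, 0.01451199, 0.01091581,
]

# Index of the largest probability is the predicted topic.
predicted_topic = max(
    range(len(first_review_probabilities)),
    key=lambda i: first_review_probabilities[i],
)

threshold = 0.015  # threshold used in the visualization step
topics_above_threshold = [
    i for i, p in enumerate(first_review_probabilities) if p > threshold
]

print(predicted_topic)         # 7
print(topics_above_threshold)  # [5, 7, 12, 19]
```

Only four of the 24 topics clear the threshold for this review, which is why the distribution chart shows just a handful of bars.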
