Topic Modeling by Group Using Deep Learning in Python

Topics by category using the Python package BERTopic on Airbnb reviews

Building one general topic model is not enough in some cases, especially when there are different categories with various properties and characteristics. For example, a commercial bank may be interested in topic models built for different lines of products such as credit cards, checking accounts, or student loans. A hotel chain may be interested in the topics in reviews for different locations.

In this tutorial, we will talk about how to analyze topics by group using Airbnb data in Python. We will cover:

  • How to build multiple topic models by category
  • How to extract topics by group from one general topic model

The Python package used for the topic model is BERTopic. For more details about using this package, please check out my previous tutorials Topic Modeling with Deep Learning Using Python BERTopic and Hyperparameter Tuning for BERTopic Model in Python.

Resources for this post:

  • Click here for the Colab notebook.
  • More video tutorials on NLP
  • More blog posts on NLP
  • Video tutorial for this post on YouTube
Topic Modeling by Group Using Deep Learning in Python – GrabNGoInfo.com

Let’s get started!

Step 1: Install And Import Python Libraries

In step 1, we will install and import the Python libraries.

Firstly, let’s install bertopic.

# Install bertopic
!pip install bertopic

After installing the Python package, we will import the Python libraries.

  • pandas and numpy are imported for data processing. We set the pandas dataframe display options to show more rows and columns and a wider column width.
  • nltk is for removing stopwords.
  • UMAP is for dimensionality reduction.
  • HDBSCAN is for clustering models.
  • CountVectorizer is for term vectorization.
  • BERTopic is for topic modeling.
# Data processing
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import numpy as np

# Text preprocessiong
import nltk
nltk.download('stopwords')

# Dimension reduction
from umap import UMAP

# Clustering
from hdbscan import HDBSCAN

# Count vectorization
from sklearn.feature_extraction.text import CountVectorizer

# Topic model
from bertopic import BERTopic

Step 2: Download And Read Airbnb Review Data

The second step is to download and read the dataset.

A website called Inside Airbnb makes Airbnb data publicly available for research. We will use the review data for Washington D.C. for this analysis, but the website provides data for other locations around the world.

Please follow these steps to download the data.

  • Go to: http://insideairbnb.com/get-the-data
  • Scroll down the page until you see the section called Washington, D.C., District of Columbia, United States.
  • Click the blue file name “reviews.csv.gz” to download the review data and click the blue file name “listings.csv” to download the listings data.
  • Copy the downloaded files “reviews.csv.gz” and “listings.csv” to your project folder.

Note that Inside Airbnb generally provides quarterly data for the past 12 months, but users can make a data request for historical data of a longer time range if needed.
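
If you prefer to skip the manual download, pandas can also read the compressed review file directly from a URL. Below is a minimal sketch; the URL is a placeholder, so replace it with the actual link to “reviews.csv.gz” copied from the Inside Airbnb page.

# Read the gzipped review file directly from a URL
# (placeholder URL -- copy the real link from the Inside Airbnb page)
import pandas as pd

url = 'http://data.insideairbnb.com/path/to/reviews.csv.gz'  # hypothetical link
reviews = pd.read_csv(url, nrows=10000,
                      usecols=['listing_id', 'comments'],
                      compression='gzip')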

Download Airbnb data — Inside Airbnb website

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

  • drive.mount is used to mount Google Drive so the Colab notebook can access the data stored there.
  • os.chdir is used to change the default directory on Google Drive. I suggest setting the default directory to the project folder.
  • !pwd is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

Now let’s read the data into a pandas dataframe and see what the datasets look like.

The review dataset has multiple columns, but we only read the listing_id and comments columns because the other information is not needed for this tutorial. The dataset has over three hundred thousand reviews, so we read only ten thousand of them to keep the computation manageable and save time for each iteration.

# Read data
reviews = pd.read_csv('airbnb/airbnb_reviews_dc_20220914.csv.gz',
                      nrows=10000,
                      usecols=['listing_id', 'comments'],
                      compression='gzip')

# Take a look at the data
reviews.head()
Airbnb reviews data – GrabNGoInfo.com

We would like to use neighborhoods as groups for the topic model, so the columns neighbourhood and id are read from the listings dataset.

# Read data
listings = pd.read_csv('airbnb/airbnb_listings_dc_20220914.csv', usecols=['id', 'neighbourhood'])

# Take a look at the data
listings.head()

The column id in the listings dataset is the listing ID, so we can use it as the matching key to append the neighbourhood information to the review dataset.

# Append neighbourhood to the review data
df = reviews.merge(listings, left_on='listing_id', right_on='id').drop('id', axis=1)

# Take a look at the data
df.head()
Airbnb reviews by neighbourhood — GrabNGoInfo.com

The .info() method helps us get information about the dataset.

From the output, we can see that the review dataset has 10,000 records and no missing data. The columns comments and neighbourhood are of the object type, and the data type of listing_id is int64.

# Get the dataset information
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   listing_id     10000 non-null  int64
 1   comments       10000 non-null  object
 2   neighbourhood  10000 non-null  object
dtypes: int64(1), object(2)
memory usage: 312.5+ KB

Step 3: Remove Noise from Topic Top Words

In step 3, we will remove noise from the top words of the topic model.

There are three types of noise that impact the topic modeling results and their interpretation: stop words, persons’ names, and domain-specific words.

Stop words are words that appear frequently in sentences but carry little meaning, such as the and for. There are 179 default stop words in the Python package NLTK.

# NLTK English stopwords
stopwords = nltk.corpus.stopwords.words('english')

# Print out the NLTK default stopwords
print(f'There are {len(stopwords)} default stopwords. They are {stopwords}')
There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Persons’ names are high-frequency words in Airbnb reviews because reviewers like to mention hosts’ names in their reviews. Therefore, names are likely to become top words representing topics, making the topics difficult to interpret.

To remove the hosts’ names from the top keywords representing the topics, I downloaded the frequently occurring surname list from the US Census Bureau. It contains the surnames with a frequency of more than 100.

  • There are two columns in the US Census Bureau surname frequency dataset. We only read the column name because only the names are needed for this tutorial.
  • After reading the names into a pandas dataframe, they are transformed from upper case to lower case because the topic model uses lower case.
  • Two name lists are created using lowercase names, one list has the names, and the other has the name plus the letter s as the element. This is because a lot of reviewers mention the host’s apartment such as “Don’s apartment”. The word “Don’s” becomes “Dons” after removing punctuation. So we need to remove the name plus the letter s from the top words as well.
  • Each name list has 151,671 names and the top three names with the highest frequency are Smith, Johnson, and Williams.
# Read in names
names = pd.read_csv('airbnb/surnames.csv', usecols=['name'])

# Host name list
name_list = names['name'].str.lower().tolist()

# Host's name list
names_list = list(map(( lambda x: str(x)+'s'), name_list))

# Print out the number of names
print(f'There are {len(name_list)} names in the surname list, and the top three names are {name_list[:3]}.')
There are 151671 names in the surname list, and the top three names are ['smith', 'johnson', 'williams'].

Domain-specific words are high-frequency words related to the business. For Airbnb reviews, reviewers frequently mention the words airbnb, apartment, time, would, and stay. Because I am using the data for Washington D.C., the word dc is a frequent word too. These domain-specific words need to be removed because they are likely to appear as top words but do not represent the topics.

# Domain specific words to remove
airbnb_related_words = ['stay', 'airbnb', 'dc', 'would', 'time', 'apartment']

Removing noise is an iterative process, and we can add new words to the list if they appear in the top topic words but do not provide valuable meaning. For example, some less common names such as natasha and vladi can appear as top words. The word “also” shows up among the top words too but does not carry valuable information about the topic, so we remove such words as well.

# Other words to remove
other_words_to_remove = ['natasha', 'also', 'vladi']

To remove the noise from the top words representing the topics, we extend the stopwords with all the noise lists. After the extension, 303,530 words are excluded from the top words.

# Expand stopwords
stopwords.extend(name_list + names_list + airbnb_related_words + other_words_to_remove)
print(f'There are {len(stopwords)} stopwords.')
There are 303530 stopwords.
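
As an optional sanity check, we can confirm that a few of the noise words are now included in the extended stopword list:

# Sanity check: confirm a few noise words are in the extended stopword list
for word in ['the', 'smith', 'dons', 'airbnb', 'also']:
    print(word, word in stopwords)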

Step 4: Group Data Preprocessing

In step 4, we will talk about how to process the group or category data.

Firstly, let’s check the number of reviews for each neighbourhood using groupby. Note that we are using a subset of the reviews for this tutorial, so the counts do not represent the real Airbnb review distribution in Washington, D.C.

# Count the number of reviews by neighbourhood
neighborhood_agg = df.groupby(['neighbourhood']).size().reset_index(name='counts')

# Take a look at the data
neighborhood_agg
Count of reviews by group — GrabNGoInfo.com

We can see that there are a total of 25 neighbourhoods in the D.C. area. The neighbourhood with the most reviews is Union Station, Stanton Park, Kingman Park, which has 1587 reviews. The neighbourhood with the least reviews is Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights, which has 6 reviews.
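
To verify these counts, we can sort the aggregated dataframe by the number of reviews:

# Sort neighbourhoods from most to fewest reviews in our sample
neighborhood_agg.sort_values('counts', ascending=False)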

The topic modeling requires a reasonably large number of reviews to find meaningful topics, so we will remove the neighbourhoods with fewer than 100 reviews.

# Remove neighbourhoods with a small number of reviews
neighborhood_list = neighborhood_agg[neighborhood_agg['counts']>=100]['neighbourhood'].tolist()

# Check the number of neighbourhoods
print(f'There are {len(neighborhood_list)} neighbourhoods with at least 100 reviews.\n')

# Take a look at the list
neighborhood_list
There are 18 neighbourhoods with at least 100 reviews.

['Brightwood Park, Crestwood, Petworth',
'Capitol Hill, Lincoln Park',
'Capitol View, Marshall Heights, Benning Heights',
'Cathedral Heights, McLean Gardens, Glover Park',
'Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View',
'Congress Heights, Bellevue, Washington Highlands',
'Dupont Circle, Connecticut Avenue/K Street',
'Edgewood, Bloomingdale, Truxton Circle, Eckington',
'Friendship Heights, American University Park, Tenleytown',
'Georgetown, Burleith/Hillandale',
'Historic Anacostia',
'Howard University, Le Droit Park, Cardozo/Shaw',
'Kalorama Heights, Adams Morgan, Lanier Heights',
'Shaw, Logan Circle',
'Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point',
'Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir',
'Takoma, Brightwood, Manor Park',
'Union Station, Stanton Park, Kingman Park']

After removing the neighbourhoods with fewer than 100 reviews, we have 18 neighbourhoods left, and they are saved in a list called neighborhood_list. We will build one topic model for each of the 18 neighbourhoods using BERTopic with a for loop.

Step 5: Multiple Topic Models by Group

In step 5, we will talk about how to build one BERTopic model for each group using a for loop.

The BERTopic model uses UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction. By default, BERTopic produces different results each time because of the stochasticity inherited from UMAP.

To get reproducible topics, we need to pass a value to the random_state parameter in the UMAP method.

  • n_neighbors=15 means that the local neighborhood size for UMAP is 15. This is the parameter that controls the local versus global structure in data.
  1. A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.
  2. A high value pushes UMAP to look at the broader neighborhood, and may lose details on local structure.
  3. The default n_neighbors value for UMAP is 15.
  • n_components=5 indicates that the target dimension from UMAP is 5. This is the dimension of data that will be passed into the clustering model.
  • min_dist controls how tightly UMAP is allowed to pack points together. It’s the minimum distance between points in the low-dimensional space.
  1. Small values of min_dist result in clumpier embeddings, which is good for clustering. Since our goal of dimension reduction is to build clustering models, we set min_dist to 0.
  2. Large values of min_dist prevent UMAP from packing points together and preserve the broad structure of the data.
  • metric='cosine' indicates that we will use cosine to measure the distance.
  • random_state sets a random seed to make the UMAP results reproducible.

CountVectorizer is for counting word frequencies. Passing the extended stop words list helps us remove noise from the top words representing each topic.

# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)

# Count vectorizer
vectorizer_model = CountVectorizer(stop_words=stopwords)

In the BERTopic function, we tuned a few hyperparameters. To learn more about hyperparameter tuning, please check out my previous tutorial Hyperparameter Tuning for BERTopic Model in Python.

  • umap_model takes the model for dimensionality reduction. We are using the UMAP model for this tutorial, but it can be another dimensionality reduction model such as PCA (Principal Component Analysis).
  • vectorizer_model takes the term vectorization model. The extended stop words list is passed into the BERTopic model through CountVectorizer.
  • diversity helps to remove the words with the same or similar meanings. It has a range of 0 to 1, where 0 means least diversity and 1 means most diversity.
  • min_topic_size is the minimum number of documents in a topic. min_topic_size=20 means that a topic needs to have at least 20 reviews.
  • top_n_words=4 indicates that we will use the top 4 words to represent the topic.
  • language has English as the default. We set it to multilingual because there are multiple languages in the Airbnb reviews.
  • calculate_probabilities=False means that the probabilities of each document belonging to each topic are not calculated.
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model,
                       vectorizer_model=vectorizer_model,
                       diversity=0.8,
                       min_topic_size=20,
                       top_n_words=4,
                       language="multilingual",
                       calculate_probabilities=False)

Next, let’s build a topic model for each neighbourhood using a for loop.

  • Firstly, an empty dataframe called topics_by_group is created using pd.DataFrame(). This dataframe will be used to save the topic model output.
  • Then, we loop through each neighbourhood in the neighborhood_list.
  • Using the neighbourhood name as the filter, a new dataframe called group is created. This dataframe contains only the reviews for one neighbourhood. Note that we need to reset the index of the dataframe group, otherwise the BERTopic model raises an error.
  • After that, we train the topic model for one neighbourhood and save the topic model results in a dataframe called topics_df.
  • A new column called neighbourhood is created indicating the neighbourhood name for the topics.
  • Finally, the topic modeling results for each neighbourhood are appended to the dataframe called topics_by_group.
# Create an empty dataframe
topics_by_group = pd.DataFrame()

# Loop through each neighbourhood
for i in neighborhood_list:
    # A dataframe that only contains reviews from neighbourhood i
    group = df[df['neighbourhood'] == i]
    # Reset index, otherwise there will be an error from the BERTopic model
    group.reset_index(drop=True, inplace=True)
    # Train the topic model
    topics = topic_model.fit_transform(group['comments'])
    # Get the topic list for the neighbourhood i
    topics_df = topic_model.get_topic_info()
    # Create a column with the neighbourhood name
    topics_df['neighbourhood'] = i
    # Append the topic list to the dataframe
    topics_by_group = pd.concat((topics_by_group, topics_df))

topics_by_group
Topic models by group results — GrabNGoInfo.com

We can see that the dataframe topics_by_group has four columns.

  • The column Topic has topic numbers.
  1. -1 should be ignored. It indicates that the reviews are not assigned to any specific topic.
  2. All the non-negative numbers are the topics created for the reviews.
  • The column Count has the number of reviews for each topic. The topics are ordered by the number of reviews in each topic, so topic 0 has the highest number of reviews for each neighbourhood.
  • The column Name lists the top words representing each topic.
  • The column neighbourhood indicates the neighbourhood group for the topic model.
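
To inspect the results for a single group, we can filter the combined dataframe. Below is a minimal sketch that pulls out the topics for one neighbourhood and drops the unassigned topic -1:

# Topics for one neighbourhood, excluding the unassigned topic -1
one_neighbourhood = 'Capitol Hill, Lincoln Park'
topics_by_group[(topics_by_group['neighbourhood'] == one_neighbourhood) &
                (topics_by_group['Topic'] != -1)]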

Step 6: One Topic Model for Multiple Groups

In step 6, we will talk about how to create one topic model for multiple groups. This helps us to understand how a topic is distributed across different groups.

Firstly, an overall topic model will be created using the reviews from all the neighbourhoods. Notice that min_topic_size=200 for the global model, while min_topic_size=20 for the group-level models. This is because the global topic model has more reviews available.

# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)

# Count vectorizer
vectorizer_model = CountVectorizer(stop_words=stopwords)

# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model,
                       vectorizer_model=vectorizer_model,
                       diversity=0.8,
                       min_topic_size=200,
                       top_n_words=4,
                       language="multilingual",
                       calculate_probabilities=True)

# Run BERTopic model
topics = topic_model.fit_transform(df['comments'])

# Get the list of topics
topic_model.get_topic_info()
Global topic model results — GrabNGoInfo.com

The top-word representations for the global topic model are extracted and saved in a column called Name in the model results.
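
Besides the Name column, the top words of a topic can also be retrieved together with their c-TF-IDF scores using get_topic. For example, for the largest global topic:

# Top words and their c-TF-IDF scores for the global topic 0
for word, score in topic_model.get_topic(0):
    print(f'{word}: {score:.4f}')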

For each neighbourhood, the top-word representation of a topic is created based on the reviews for that neighbourhood under the topic. This is achieved by calling the topics_per_class method on the topic_model. The topics_per_class method takes in the documents and the classes, where classes is the list of group labels.

We can visualize topics by group using visualize_topics_per_class. top_n_topics=9 means that we will only visualize the top 9 topics.

# Get topics by group
topics_per_class = topic_model.topics_per_class(df['comments'], classes=df['neighbourhood'])

# Visualize topics by group
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=9)
Topics per class BERTopic — GrabNGoInfo.com

The visualization chart is interactive and we can add or remove global topics by clicking the topics under Global Topic Representation.

  • The x-axis is the review frequency and the y-axis is the list of neighbourhoods.
  • The length of the bar represents the number of reviews for the selected global topic for each group.
  • We can see the group-level topic representation by hovering the mouse over each bar.
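
The chart returned by visualize_topics_per_class is a Plotly figure, so we can also save an interactive copy as a standalone HTML file to share outside the notebook (the file name below is just an example):

# Save the interactive topics-per-class chart as a standalone HTML file
fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=9)
fig.write_html('topics_per_class.html')  # example file name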

For the global topic 0, we can see that most reviews are from the neighbourhood Union Station, Stanton Park, Kingman Park and the top representation for this group is restaurants, parking, bike and loved.

We can also compare the distributions of different global topics. For example, comparing topic 4 and topic 8 shows that topic 4 reviews are mainly from the Cathedral Heights, McLean Gardens, Glover Park neighbourhood, while topic 8 reviews are mainly from the Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point neighbourhood. This is as expected because topic 4 is about the cathedral and topic 8 is about the stadium.

Compare the topics of two groups — GrabNGoInfo.com

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
