Hierarchical Topic Model for Airbnb Reviews

Extracting the hierarchical structure of topics and sub-topics in Airbnb reviews using the Python package BERTopic


Hierarchical topic models utilize the semantic hierarchy in a collection of text to identify topics and sub-topics. In this tutorial, we will use Airbnb review data to illustrate the following:

  • How to use a transformer-based deep learning model to build a hierarchical topic model in Python
  • How to process the data to remove noise from the topics
  • How to extract topics and sub-topics from the model outputs
  • How to make predictions for a new document

The Python package used for the hierarchical model in this tutorial is BERTopic. For more details about using this package, please check out my previous tutorials Topic Modeling with Deep Learning Using Python BERTopic and Hyperparameter Tuning for BERTopic Model in Python.

Resources for this post:

  • Video tutorial for this post on YouTube
  • Click here for the Colab notebook.
  • More video tutorials on NLP
  • More blog posts on NLP

Let’s get started!


Step 1: Install And Import Python Libraries

In step 1, we will install and import the Python libraries.

First, let's install bertopic.

# Install bertopic
!pip install bertopic

After installing the package, we will import the Python libraries.

  • pandas and numpy are imported for data processing.
  • nltk is for removing stopwords.
  • UMAP is for dimension reduction.
  • HDBSCAN is for clustering models.
  • CountVectorizer is for term vectorization.
  • BERTopic is for topic modeling.
# Data processing
import pandas as pd
import numpy as np
# Text preprocessing
import nltk
nltk.download('stopwords')
# Dimension reduction
from umap import UMAP
# Clustering
from hdbscan import HDBSCAN
# Count vectorization
from sklearn.feature_extraction.text import CountVectorizer
# Topic model
from bertopic import BERTopic

Step 2: Download And Read Airbnb Review Data

The second step is to download and read the dataset.

A website called Inside Airbnb makes Airbnb data publicly available for research. We use the review data for Washington D.C. in this analysis, but the website also provides listing and review data for other locations around the world.

Please follow these steps to download the data.

  1. Go to: http://insideairbnb.com/get-the-data
  2. Scroll down the page until you see the section called Washington, D.C., District of Columbia, United States.
  3. Click the blue file name “reviews.csv.gz” to download the review data.
  4. Copy the downloaded file “reviews.csv.gz” to your project folder.

Note that Inside Airbnb generally provides quarterly data for the past 12 months, but users can make a data request for historical data of a longer time range if needed.

Download Airbnb data — Inside Airbnb website

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

  • drive.mount is used to mount Google Drive so the Colab notebook can access the data on it.
  • os.chdir is used to change the default directory on Google Drive. I suggest setting the default directory to the project folder.
  • !pwd is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")
# Print out the current directory
!pwd

Now let’s read the data into a pandas dataframe and see what the dataset looks like.

The dataset has multiple columns, but we only read the comments column because the reviews are the only information needed for this tutorial. The dataset has over three hundred thousand reviews, so we read only the first ten thousand to keep the computation manageable and save time on each iteration.

# Read in data
df = pd.read_csv('airbnb/airbnb_reviews_dc_20220914.csv.gz', nrows=10000, usecols=['comments'], compression='gzip')
# Take a look at the data
df.head()

.info() helps us get information about the dataset.

From the output, we can see that this dataset has 10,000 records and no missing data. The comments column is of the object type.

# Get the dataset information
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   comments  10000 non-null  object
dtypes: object(1)
memory usage: 78.2+ KB

Step 3: Remove Noise from Topic Top Words

In step 3, we will remove the noise from the top words of the topic model.

There are three types of noise that impact the topic modeling results and their interpretation: stop words, persons' names, and domain-specific words.

Stop words are words that commonly appear in sentences but carry no real meaning, such as "the" and "for". There are 179 stop words in the Python package NLTK.

# NLTK English stopwords
stopwords = nltk.corpus.stopwords.words('english')
# Print out the NLTK default stopwords
print(f'There are {len(stopwords)} default stopwords. They are {stopwords}')

Output:

There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Persons’ names are high-frequency words in Airbnb reviews because reviewers like to mention hosts’ names in the review. Therefore, names are likely to become top words representing the topics, making the topics difficult to interpret.

To remove the hosts’ names from the top keywords representing the topics, I downloaded the frequently occurring surname list from the US Census Bureau. It contains the surnames with a frequency of 100 or more.
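If you want to build the surnames.csv file yourself, the sketch below shows one possible way. The file name Names_2010Census.csv, the name column, and the "ALL OTHER NAMES" catch-all row are assumptions about the Census 2010 surnames release, so verify them against the file you actually download.

# A minimal sketch for preparing airbnb/surnames.csv from the Census download.
# The file name, column name, and catch-all row below are assumptions about
# the 2010 surnames release; verify them against the actual file.
census = pd.read_csv('Names_2010Census.csv', usecols=['name'])
# Drop the catch-all row that aggregates rare surnames
census = census[census['name'] != 'ALL OTHER NAMES']
# Save the single-column file used in this tutorial
census.to_csv('airbnb/surnames.csv', index=False)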

  • There are two columns in the US Census Bureau surname frequency dataset. We only read the name column because it is the only one needed for this tutorial.
  • After reading the names into a pandas dataframe, they are transformed from uppercase to lowercase because the topic model uses lowercase.
  • Two name lists are created using the lowercase names: one list has the names, and the other has each name plus the letter s. This is because many reviewers mention the host’s apartment, such as “Don’s apartment”. The word “Don’s” becomes “Dons” after removing punctuation, so we need to remove the name plus the letter s from the top words as well.
  • Each name list has 151,671 names, and the top three names with the highest frequency are Smith, Johnson, and Williams.
# Read in names
names = pd.read_csv('airbnb/surnames.csv', usecols=['name'])
# Host name list
name_list = names['name'].str.lower().tolist()
# Host's name list (each name plus the letter s)
names_list = list(map(lambda x: str(x) + 's', name_list))
# Print out the number of names
print(f'There are {len(name_list)} names in the surname list, and the top three names are {name_list[:3]}.')

Output:

There are 151671 names in the surname list, and the top three names are ['smith', 'johnson', 'williams'].

Domain-specific words are high-frequency words related to the business. In Airbnb reviews, reviewers frequently mention the words airbnb, time, would, and stay. Because I am using the data for Washington D.C., the word dc is a frequent word too.

# Domain specific words to remove
airbnb_related_words = ['stay', 'airbnb', 'dc', 'would', 'time']

Removing noise is an iterative process, and we can add new words to the list if they appear in the top topic words without providing valuable meaning. For example, some less common names such as natasha can appear as top words. The word "also" shows up in the top words too, but does not provide valuable information about the topic, so we remove such words.

# Other words to remove
other_words_to_remove = ['natasha', 'also', 'vladi']

To remove the noise from the top words representing the topics, we extend the stopwords with the name lists and the Airbnb-specific words. After the extension, 303,529 words are excluded from the top words.

# Expand stopwords
stopwords.extend(name_list + names_list + airbnb_related_words + other_words_to_remove)
print(f'There are {len(stopwords)} stopwords.')

Output:

There are 303529 stopwords.

Step 4: Build a Basic BERTopic Model

In step 4, we will talk about how to build a hierarchical topic model.

The hierarchical topic model produces results about the similarities among topics and helps us to understand the topic-subtopic structure.

We will start by building a basic BERTopic model.

The BERTopic model uses UMAP (Uniform Manifold Approximation & Projection) for dimensionality reduction. By default, BERTopic produces different results each time because of the stochastic nature of UMAP.

To get reproducible topics, we need to pass a value to the random_state parameter in the UMAP method.

  • n_neighbors=15 means that the local neighborhood size for UMAP is 15. This is the parameter that controls the local versus global structure in data.
  1. A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.
  2. A high value pushes UMAP to look at the broader neighborhood, and may lose details on local structure.
  3. The default n_neighbors value for UMAP is 15.
  • n_components=5 indicates that the target dimension from UMAP is 5. This is the dimension of data that will be passed into the clustering model.
  • min_dist controls how tightly UMAP is allowed to pack points together. It’s the minimum distance between points in the low-dimensional space.
  1. Small values of min_dist result in clumpier embeddings, which is good for clustering. Since our goal of dimension reduction is to build clustering models, we set min_dist to 0.
  2. Large values of min_dist prevent UMAP from packing points together and preserves the broad structure of data.
  • metric='cosine' indicates that we will use cosine distance.
  • random_state sets a random seed to make the UMAP results reproducible.

CountVectorizer counts word frequencies. Passing the extended stop words list helps us remove the noise from the top words representing each topic.

# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)
# Count vectorizer
vectorizer_model = CountVectorizer(stop_words=stopwords)

In the BERTopic function, we tuned a few hyperparameters. To learn more about hyperparameter tuning, please check out my previous tutorial Hyperparameter Tuning for BERTopic Model in Python.

  • umap_model takes the model for dimensionality reduction. We are using the UMAP model for this tutorial, but it can be other dimensionality reduction models such as PCA (Principal Component Analysis).
  • vectorizer_model takes the term vectorization model. The extended stop words list is passed into the BERTopic model through CountVectorizer.
  • diversity helps to remove the words with the same or similar meanings. It has a range of 0 to 1, where 0 means least diversity and 1 means most diversity.
  • min_topic_size is the minimum number of documents in a topic. min_topic_size=200 means that a topic needs to have at least 200 reviews.
  • top_n_words=4 indicates that we will use the top 4 words to represent the topic.
  • language has English as the default. We set it to multilingual because there are multiple languages in the Airbnb reviews.
  • calculate_probabilities=True means that the probabilities of each document belonging to each topic are calculated. The topic with the highest probability is the predicted topic for a new document. This probability represents how confident we are about finding the topic in the document.
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model,
                       vectorizer_model=vectorizer_model,
                       diversity=0.8,
                       min_topic_size=200,
                       top_n_words=4,
                       language="multilingual",
                       calculate_probabilities=True)
# Run BERTopic model (returns topics and probabilities when calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(df['comments'])
# Get the list of topics
topic_model.get_topic_info()
BERTopic model list of topics — GrabNGoInfo.com

Calling get_topic_info() on the topic model gives us a list of topics. We can see that the output has 10 rows in total.

  • Topic -1 should be ignored. It indicates that the reviews are not assigned to any specific topic. The count for topic -1 is 2442, meaning that there are 2442 outlier reviews that do not belong to any topic.
  • Topics 0 to 8 are the 9 topics created for the reviews. They are ordered by the number of reviews in each topic, so topic 0 has the highest number of reviews.
  • The Name column lists the top terms for each topic. For example, the top 4 terms for Topic 0 are clean, neighborhood, restaurants, and bed, indicating that it is a topic related to a convenient neighborhood.
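To see the weighted words behind a topic label, we can call get_topic with a topic number. It returns a list of (word, c-TF-IDF score) tuples; the words and scores shown in the comment below are only illustrative, since the exact values depend on your run.

# Inspect the top words and their c-TF-IDF weights for topic 0
topic_model.get_topic(0)
# Illustrative output (actual values depend on your run):
# [('clean', 0.06), ('neighborhood', 0.05), ('restaurants', 0.04), ('bed', 0.03)]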

Step 5: Merge Topics Manually

In step 5, we will talk about how to merge topics manually.

If we would like to manually pick which topics to merge together based on domain knowledge, we can list the topic numbers and pass them into the merge_topics function.

In this example, we merged topic 1 and topic 5 together because they both talk about breakfast, and merged topic 0 and topic 7 together because they both talk about neighborhood restaurants.

# Topics to merge
topics_to_merge = [[1, 5],
                   [0, 7]]
# Merge topics
topic_model.merge_topics(df['comments'], topics_to_merge)
# Get the list of topics
topic_model.get_topic_info()
List of topics after manual topic merge — GrabNGoInfo.com

We can see that the number of topics is reduced by two, and we now have 7 topics.

Step 6: Extract Topic Hierarchy

In step 6, we will talk about how to extract topic hierarchy from the topic model.

After building the basic BERTopic model, we can extract the hierarchical structure using hierarchical_topics. The output is a dataframe with 8 columns.

  • Parent_ID is a new topic ID created for the parent topics.
  • Parent_Name is a list of top words describing the parent topic.
  • Topics is a list of child topic numbers included in the parent topic. All the child topic numbers in this column are from the basic BERTopic model.
  • Child_Left_ID is the left child topic number. This child topic number can be from the basic BERTopic model or an existing parent topic number.
  • Child_Left_Name has the top words describing the left child topic.
  • Child_Right_ID is the right child topic number. Similar to Child_Left_ID, this child topic number can be from the basic BERTopic model or an existing parent topic number.
  • Child_Right_Name has the top words describing the right child topic.
  • Distance shows the distance between the left and right child topics.
# Hierarchical topics
hierarchical_topics = topic_model.hierarchical_topics(df['comments'])
# Take a look at the data
hierarchical_topics
Topic hierarchy from the topic model — GrabNGoInfo.com

Step 7: Create Mapping Between Parent and Child Topics

In step 7, we will take a close look at the topic hierarchy, and create a mutually exclusive mapping between the parent topics and the child topics.

Using topic_model.visualize_hierarchy, we can visualize the hierarchical structure of the topics. Hovering the mouse over the black dots shows the top words for that level of the hierarchy.

# Visualize hierarchical topics
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
Visualize hierarchy from the topic model — GrabNGoInfo.com

Another way to visualize the topic hierarchy is to create a topic tree using topic_model.get_topic_tree. This view lists the topic representations for all hierarchy levels.

# Topic tree
tree = topic_model.get_topic_tree(hierarchical_topics)
# Print out the tree
print(tree)

Output:

.
├─■──stadium_parking_staying_bathroom ── Topic: 6
└─neighborhood_recommend_restaurants_bed
     ├─neighborhood_recommend_restaurants_bed
     │    ├─neighborhood_recommend_restaurants_bed
     │    │    ├─clean_neighborhood_restaurants_bed
     │    │    │    ├─■──clean_bed_bike_neighbourhood ── Topic: 2
     │    │    │    └─■──neighborhood_recommend_restaurants_bed ── Topic: 0
     │    │    └─■──comfortable_neighborhood_meet_subway ── Topic: 3
     │    └─breakfast_recommend_feel_apartment
     │         ├─■──breakfast_neighborhood_recommend_bathroom ── Topic: 1
     │         └─■──cathedral_recommend_basement_feel ── Topic: 4
     └─■──apartment_recommend_bathroom_feel ── Topic: 5

After examining the topic hierarchy, we decide to create four parent topics.

  • Parent topic 1 is about breakfast. It includes child topic 1 and child topic 4.
  • Parent topic 2 is about the neighborhood with a focus on restaurants. It includes child topics 0, 2, and 3.
  • Parent topic 3 is about the apartment itself with comments on the bathroom. It includes child topic 5.
  • Parent topic 4 is about the stadium. It includes child topic 6.

A function called topic_mapping is created for the mapping between the parent topics and the child topics.

# A function to map parent and child topics
def topic_mapping(child_topic_number):
    if child_topic_number in [1, 4]:
        return 'P1_breakfast_recommend_feel_apartment'
    elif child_topic_number in [0, 2, 3]:
        return 'P2_neighborhood_recommend_restaurants_bed'
    elif child_topic_number in [5]:
        return 'P3_apartment_recommend_bathroom_feel'
    elif child_topic_number in [6]:
        return 'P4_stadium_parking_staying_bathroom'
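As a quick sanity check, we can call the function with a child topic number and confirm it returns the expected parent label.

# Quick check of the parent-child mapping
print(topic_mapping(4))
# P1_breakfast_recommend_feel_apartment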

Step 8: Topic Model In-sample Hierarchical Predictions

In step 8, we will talk about topic model in-sample hierarchical predictions.

First, we get the predicted child topic numbers from the BERTopic model. Then, a column child_topic is created in the dataframe. After that, the topic_mapping function is applied to each child topic number to create the parent_topic column.

# Get the child topic predictions from the basic BERTopic model
child_topic_prediction = topic_model.topics_[:]
# Save the child predictions in the dataframe
df['child_topic'] = child_topic_prediction
# Create the parent topics
df['parent_topic'] = df['child_topic'].apply(topic_mapping)
# Take a look at the data
df.head()
Hierarchical topics with parent and child topic mapping — GrabNGoInfo.com

The first review in the dataset says “Don’s apartment is comfortable, clean and well equipped. Three minutes walk from the metro, wonderful terrace view of the river and Washington Monument. All in all a great experience. I would certainly recommend Don’s apartment to friends and would be very happy to stay there again.” It is a good match for parent topic 2, which is about recommending the apartment for its neighborhood.

# Take a look at the first review
df['comments'][0]

Output:

Don's apartment is comfortable, clean and well equipped. Three minutes walk from the metro, wonderful terrace view of the river and Washington Monument. All in all a great experience. I would certainly recommend Don's apartment to friends and would be very happy to stay there again.
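We can also verify the stored predictions for this review directly from the dataframe:

# Check the predicted child and parent topics for the first review
df.loc[0, ['child_topic', 'parent_topic']]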

Using value_counts, we get the frequencies of the parent topics. We can see that most reviews are about the neighborhood and a lot of people care about breakfast.

# Review frequency by parent topic
df['parent_topic'].value_counts()

Output:

P2_neighborhood_recommend_restaurants_bed    5806
P1_breakfast_recommend_feel_apartment        1244
P3_apartment_recommend_bathroom_feel          280
P4_stadium_parking_staying_bathroom           228
Name: parent_topic, dtype: int64
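If shares are easier to read than raw counts, value_counts accepts normalize=True. Note that reviews assigned to the outlier topic -1 have no parent topic and are excluded from the counts.

# Review share by parent topic
df['parent_topic'].value_counts(normalize=True)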

Step 9: New Document Topic Predictions

In step 9, we will talk about making predictions for new documents.

Suppose there is a new review “I like the apartment. The bathroom is spacious and clean.”, we can find the topic that is most similar to the review using find_topics.

# New data for the review
new_review = "I like the apartment. The bathroom is spacious and clean."
# Find topics
similar_topics, similarity = topic_model.find_topics(new_review, top_n=1)
# Print results
print(f'The most similar child topic is {similar_topics[0]}, and the similarity is {np.round(similarity,2)[0]}')

The result shows that child topic 5 is the topic that is most similar to the review. The corresponding parent topic is 3, which is about the apartment itself.

The most similar child topic is 5, and the similarity is 0.56
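Because the parent-child mapping lives in the topic_mapping function, we can reuse it to turn the child topic prediction into a parent topic prediction for the new review:

# Map the predicted child topic to its parent topic
print(f'The predicted parent topic is {topic_mapping(similar_topics[0])}')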

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

