Categorical Entity Embedding Using Python TensorFlow Keras

Entity embedding for high-cardinality categorical variables using Airbnb data

Categorical entity embedding extracts the embedding layer of a categorical variable from a neural network model and uses the learned numeric vectors to represent the properties of the categorical values. It is usually applied to categorical variables with high cardinality.

For example, a marketing company can create categorical entity embeddings for different campaigns to represent their characteristics as vectors, then use those vectors to understand the similarities between campaigns, or feed them as features into other machine learning models to improve model performance.
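
Conceptually, an entity embedding is just a learned lookup table: each category gets an integer index, and the index points to a row of a numeric matrix whose values are learned while the network trains. Below is a minimal sketch with made-up numbers (the real vectors for the Airbnb neighbourhoods are learned later in this tutorial):

import numpy as np

# Hypothetical embedding matrix: 3 campaigns embedded into 2-dimensional vectors.
# In a real model these numbers are weights learned by the neural network.
embedding_matrix = np.array([
    [ 0.12, -0.40],   # index 0: 'campaign_a'
    [ 0.08, -0.35],   # index 1: 'campaign_b'
    [-0.90,  0.55],   # index 2: 'campaign_c'
])

# The label encoder maps each category to its row index
category_index = {'campaign_a': 0, 'campaign_b': 1, 'campaign_c': 2}

# Looking up a category returns its numeric vector representation
print(embedding_matrix[category_index['campaign_b']])  # [ 0.08 -0.35]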

In this tutorial, Python TensorFlow Keras is used to create categorical entity embeddings of Airbnb neighbourhood data. We will talk about:

  • How to do data processing for categorical entity embedding?
  • How to build a neural network model with entity embedding?
  • How to extract the categorical embedding layers?
  • How to use categorical entity embedding in other machine learning models?

Resources for this post:

Categorical Entity Embedding Using Python Tensorflow Keras – GrabNGoInfo.com

Let’s get started!


Step 1: Install And Import Python Libraries

In step 1, we will install and import python libraries.

Firstly, let's install matplotlib version 3.4.2. We pin this version because matplotlib 3.4.0 and later support adding count labels to bars in seaborn plots (the bar_label function used later in this tutorial).

# Change matplotlib to a version later than 3.4.0 for the countplot visualization
!pip install matplotlib==3.4.2

After the installation and restarting of the runtime, we can import the libraries.

  • pandas and numpy are imported for data processing.
  • train_test_split is for train test splitting.
  • RandomForestRegressor is for building the random forest model.
  • seaborn and matplotlib are for visualization.
  • plot_model and Image are for visualizing neural network model structures.
  • The TensorFlow Keras layers, models, and the EarlyStopping callback are for the neural network model.
# Data processing
import pandas as pd
import numpy as np

# Train test split
from sklearn.model_selection import train_test_split

# Model
from sklearn.ensemble import RandomForestRegressor

# Visualization
import seaborn as sns
sns.set(rc={'figure.figsize':(12,8)}) # Set figure size
import matplotlib.pyplot as plt

# Visualize neural network model structure
from tensorflow.keras.utils import plot_model
from IPython.display import Image

# Deep learning model
from tensorflow.keras.layers import Input, Dense, Reshape, Concatenate, Embedding
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping

Step 2: Download And Read Airbnb Review Data

The second step is to download and read the dataset.

A website called Inside Airbnb makes Airbnb data publicly available for research. We use the listing data for Washington D.C. in this analysis, but the website also provides data for many other locations around the world.

Please follow these steps to download the data.

  • Go to: http://insideairbnb.com/get-the-data
  • Scroll down the page until you see the section called Washington, D.C., District of Columbia, United States.
  • Click the blue file name “listings.csv” to download the data.
  • Copy the downloaded file “listings.csv” to your project folder.

Note that Inside Airbnb generally provides quarterly data for the past 12 months, but users can make a data request for historical data of a longer time range if needed.

Inside Airbnb Data — insideairbnb.com

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

  • drive.mount is used to mount Google Drive so the Colab notebook can access the data stored on it.
  • os.chdir is used to change the default directory on Google drive. I suggest setting the default directory to the project folder.
  • !pwd is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

The listing data has the property information aggregated at the listing ID level. We will build a simple model to predict the listing price, and the following columns will be read from the dataset.

  • id is the unique ID for the Airbnb listing.
  • neighbourhood is the neighbourhood name where the listing is located.
  • room_type is the type of room. It can be the entire house, a room in a house, etc.
  • price is the daily price in local currency. Washington D.C. is in the United States, so the currency for the price is US dollars.
  • minimum_nights is the listing’s minimum number of nights for the stay.
  • number_of_reviews is the total number of reviews for the listing.
  • reviews_per_month is the average number of reviews per month.
  • calculated_host_listings_count is the total number of listings that the host has in Washington D.C.
  • availability_365 is the number of days the listing is available in the next 365 days. The listing can be unavailable because it is booked by a guest or blocked by the host.
  • number_of_reviews_ltm is the number of reviews in the last 12 months.

More details can be found in the Inside Airbnb data dictionary.

# List of columns to read
cols_to_keep = ['id',
                'neighbourhood',
                'room_type',
                'price',
                'minimum_nights',
                'number_of_reviews',
                'reviews_per_month',
                'calculated_host_listings_count',
                'availability_365',
                'number_of_reviews_ltm']

# Read data
df = pd.read_csv('airbnb/airbnb_listings_dc_20020914.csv', usecols=cols_to_keep)

# Take a look at the data
df.head()
Airbnb listing data — GrabNGoInfo.com

Using .info(), we can see that the dataset has 6473 records and 10 columns. Two of the 10 columns are categorical: neighbourhood and room_type. Most of the columns have no missing data; only the variable reviews_per_month has missing values and needs imputation.

# Check the dataframe information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6473 entries, 0 to 6472
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6473 non-null int64
1 neighbourhood 6473 non-null object
2 room_type 6473 non-null object
3 price 6473 non-null int64
4 minimum_nights 6473 non-null int64
5 number_of_reviews 6473 non-null int64
6 reviews_per_month 5295 non-null float64
7 calculated_host_listings_count 6473 non-null int64
8 availability_365 6473 non-null int64
9 number_of_reviews_ltm 6473 non-null int64
dtypes: float64(1), int64(7), object(2)
memory usage: 505.8+ KB
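
As a quick aside, the missing counts can also be checked directly with isnull() instead of reading them off .info(); only reviews_per_month should show a non-zero count.

# Count missing values per column
df.isnull().sum()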

Step 3: Data Processing

In step 3, we will work on data processing.

For the missing values in the variable reviews_per_month, I suspect that the values are missing because the listing has no reviews. To confirm this assumption, I filtered the dataframe to keep only the records with missing values, and then checked the values of the variable number_of_reviews.

# Check the records with missing data
df[df['reviews_per_month'].isnull()]['number_of_reviews'].value_counts()
0    1178
Name: number_of_reviews, dtype: int64

We can see that all the listings with missing values have zero reviews. Therefore, we impute the missing values with zero.

# Impute the missing values for reviews_per_month to 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

The price range shows that the minimum price is 0 and the maximum price is 10,000. We removed the outlier prices and only kept the listings with a daily price greater than 20 dollars and less than 1000 dollars.

# Check the min and max price
print(f'The minimum price is {df.price.min()} and the maximum price is {df.price.max()}.')

# Remove outliers
df = df[(df['price']>20) & (df['price']<1000)]

Output

The minimum price is 0 and the maximum price is 10000.

The visualization for the price data shows that all the outliers are removed.

# Visualization
sns.displot(df['price'])
Price distribution — GrabNGoInfo.com

The variable room_type has four values, Private room, Entire home/apt, Shared room, and Hotel room. Entire home/apt is the most popular room_type and Private room is the second most popular room_type.

# Distribution of multiple treatments
ax = sns.countplot(df['room_type'])

# Add labels
ax.bar_label(ax.containers[0])
Room type distribution — GrabNGoInfo.com

Because the number of categories is small for room_type, we will not do entity embeddings. Instead, we will use get_dummies to create a dummy variable with zero and one values for each category. After the dummy variables are created, we append them to the dataframe df and drop the column room_type.

# Create dummy variables
room_type_dummies = pd.get_dummies(df['room_type'])

# Concat dummy variables to df and drop the original category
df = pd.concat([df, room_type_dummies], axis=1).drop('room_type', axis=1)

# Take a look at the data
df.head()

After the data processing, we have 6411 records and 13 columns. There are no missing values in the dataset and neighbourhood is the only categorical variable.

# Get data information
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6411 entries, 0 to 6472
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6411 non-null int64
1 neighbourhood 6411 non-null object
2 price 6411 non-null int64
3 minimum_nights 6411 non-null int64
4 number_of_reviews 6411 non-null int64
5 reviews_per_month 6411 non-null float64
6 calculated_host_listings_count 6411 non-null int64
7 availability_365 6411 non-null int64
8 number_of_reviews_ltm 6411 non-null int64
9 Entire home/apt 6411 non-null uint8
10 Hotel room 6411 non-null uint8
11 Private room 6411 non-null uint8
12 Shared room 6411 non-null uint8
dtypes: float64(1), int64(7), object(1), uint8(4)
memory usage: 783.9+ KB

Using .nunique(), we can confirm that the dataset is unique at the listing id level.

# Number of unique IDs
df['id'].nunique()
6411

The countplot of neighbourhood shows that the number of listings in a neighbourhood ranges from 12 to 557. The Capitol Hill, Lincoln Park neighbourhood has the highest number of listings.

# Countplot
ax = sns.countplot(df['neighbourhood'])
# Add labels
ax.bar_label(ax.containers[0])
# Rotate x labels
ax.tick_params(axis='x', rotation=90)
Listing counts by neighbourhood — GrabNGoInfo.com

Step 4: Train Test Split

In step 4, we will do the train test split for the model.

  • X has all the features for the model prediction. .iloc is used to select the columns starting from the 2nd column, which excludes the id variable. Because .iloc can return a dataframe that shares memory with the original dataframe df, changing the new dataframe could alter the original one. Therefore, we use .copy() to create an independent copy so that X and df do not affect each other (see the small illustration at the end of this step).
  • y is the target variable. We are predicting the daily listing prices, so the column price is used as the target.
  • X and y are passed in the train_test_split to create the training and testing datasets. test_size = 0.2 means that 80% of the data are used for training and 20% of the data are used for testing. random_state makes the train test split results reproducible.
# Features
X = df.iloc[:, 1:].copy().drop('price', axis=1)

# Target
y = df['price']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

# Check the number of records in training and testing dataset.
print(f'The training dataset has {X_train.shape[0]} records and {X_train.shape[1]} columns.')
print(f'The testing dataset has {len(X_test)} records.')
The training dataset has 5128 records and 11 columns.
The testing dataset has 1283 records.

After the train test split, the training dataset has 5128 records and 11 columns, and the testing dataset has 1283 records.
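
As a small aside on the .copy() call above, here is a tiny illustration (with hypothetical toy data, not the Airbnb dataframe) of why the explicit copy is safer:

# Toy dataframe, just to illustrate the .copy() point
toy = pd.DataFrame({'id': [1, 2], 'price': [100, 200], 'nights': [2, 3]})

sliced = toy.iloc[:, 1:]        # may share memory with toy
safe = toy.iloc[:, 1:].copy()   # guaranteed to own its own memory

safe.loc[0, 'price'] = 999      # editing the copy never touches toy
print(toy.loc[0, 'price'])      # still 100

# Editing `sliced` the same way would typically raise pandas' SettingWithCopyWarning,
# because pandas cannot guarantee whether the change would reach toy or not.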

Step 5: Categorical Data Label Encoding

In step 5, we will do the categorical label encoding for the neighbourhood variable.

  • Firstly, two empty lists are created, one for the training data, and the other for the test data.
  • Next, an empty dictionary called cat_encoder is created. This dictionary will be used to save the label encodings for the categorical variable neighbourhood.
  • After that, all the unique values for neighbourhood are extracted and saved in a variable called unique_cat. There are 39 unique neighbourhoods in the training dataset.
  • Then, we loop through each neighbourhood and assign an integer to it.
  • Finally, the cat_encoder is printed out and we can see that each neighbourhood is a key and each key has an integer as a value. There are 39 elements in the dictionary, corresponding to the 39 unique neighbourhoods.
# Input list for the training data
input_list_train = []

# Input list for the testing data
input_list_test = []

# Categorical encoder is in dictionary format
cat_encoder = {}

# Unique values for the categorical variable
unique_cat = np.unique(X_train['neighbourhood'])

# Print out the number of unique values in the categorical variable
print(f'There are {len(unique_cat)} unique neighbourhoods in the training dataset.\n')

# Encode the categorical variable
for i in range(len(unique_cat)):
    cat_encoder[unique_cat[i]] = i

# Take a look at the encoder
cat_encoder

Output

There are 39 unique neighbourhoods in the training dataset.

{'Brightwood Park, Crestwood, Petworth': 0,
'Brookland, Brentwood, Langdon': 1,
'Capitol Hill, Lincoln Park': 2,
'Capitol View, Marshall Heights, Benning Heights': 3,
'Cathedral Heights, McLean Gardens, Glover Park': 4,
'Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace': 5,
'Colonial Village, Shepherd Park, North Portal Estates': 6,
'Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View': 7,
'Congress Heights, Bellevue, Washington Highlands': 8,
'Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights': 9,
'Douglas, Shipley Terrace': 10,
'Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street': 11,
'Dupont Circle, Connecticut Avenue/K Street': 12,
'Eastland Gardens, Kenilworth': 13,
'Edgewood, Bloomingdale, Truxton Circle, Eckington': 14,
'Fairfax Village, Naylor Gardens, Hillcrest, Summit Park': 15,
'Friendship Heights, American University Park, Tenleytown': 16,
'Georgetown, Burleith/Hillandale': 17,
'Hawthorne, Barnaby Woods, Chevy Chase': 18,
'Historic Anacostia': 19,
'Howard University, Le Droit Park, Cardozo/Shaw': 20,
'Ivy City, Arboretum, Trinidad, Carver Langston': 21,
'Kalorama Heights, Adams Morgan, Lanier Heights': 22,
'Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill': 23,
'Mayfair, Hillbrook, Mahaning Heights': 24,
'Near Southeast, Navy Yard': 25,
'North Cleveland Park, Forest Hills, Van Ness': 26,
'North Michigan Park, Michigan Park, University Heights': 27,
'River Terrace, Benning, Greenway, Dupont Park': 28,
'Shaw, Logan Circle': 29,
'Sheridan, Barry Farm, Buena Vista': 30,
'Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point': 31,
'Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir': 32,
'Takoma, Brightwood, Manor Park': 33,
'Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont': 34,
'Union Station, Stanton Park, Kingman Park': 35,
'West End, Foggy Bottom, GWU': 36,
'Woodland/Fort Stanton, Garfield Heights, Knox Hill': 37,
'Woodridge, Fort Lincoln, Gateway': 38}
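
As an aside, scikit-learn's LabelEncoder can produce an equivalent mapping, because it also sorts the unique values alphabetically like np.unique does above. This is just an alternative sketch, not what the rest of this tutorial uses:

from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the training neighbourhoods only, mirroring the manual dictionary above
le = LabelEncoder().fit(X_train['neighbourhood'])

# Integer codes for the training and testing neighbourhoods
train_codes = le.transform(X_train['neighbourhood'])
test_codes = le.transform(X_test['neighbourhood'])

# First few entries of the equivalent mapping
print(dict(zip(le.classes_[:3], range(3))))

# Note: transform raises an error for categories unseen during fit,
# while the .map() call below would produce NaN for them instead.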

The neighbourhood column is mapped to the integers using the cat_encoder, and the values are appended to the list for the training and testing dataset separately.

# Append the values to the input list
input_list_train.append(X_train['neighbourhood'].map(cat_encoder).values)
input_list_test.append(X_test['neighbourhood'].map(cat_encoder).values)

# Take a look at the data
print('input_list_train:', input_list_train)
print('input_list_test:', input_list_test)

Output:

input_list_train: [array([17, 12,  7, ..., 16, 14, 27])]
input_list_test: [array([ 2, 17, 27, ..., 13, 16, 7])]

We can see that the list for the training and testing datasets now has arrays of numbers representing the neighbourhoods.

Step 6: Categorical Entity Embedding Model

In step 6, we will build a model with categorical entity embedding.

Firstly, let’s create the embedding layer using the Embedding function.

  • input_dim is the number of unique values for the categorical column. In this example, it is the unique number of neighbourhood.
  • output_dim is the dimension of the embedding output. How do we decide this number? The authors of the entity embedding paper treat it as a hyperparameter to tune, with a range from 1 to the number of categories minus 1, and propose two general guidelines:
  1. If the number of aspects needed to describe the entities can be estimated, we can use that as the output_dim. More complex entities usually need more output dimensions. For example, a neighbourhood can be described by population density, distance to major tourist attractions, convenience level, number of Airbnb listings, and safety index, so we set 5 as the number of output dimensions.
  2. If the number of aspects cannot be estimated, start the hyperparameter tuning with the highest possible number of dimensions, which is the number of categories minus 1.
  • name gives a name for the layer.
  • The input dimension of the categorical variable is defined by the Input function. Input() is used to instantiate a Keras tensor. shape=(1,) indicates that each input sample is a single value (the encoded neighbourhood index).
  • Reshape changes the embedding output from 3-dimensional (batch, 1, 5) to 2-dimensional (batch, 5).
# Number of unique values in the categorical col
n_unique_cat = len(unique_cat)

# Input dimension of the categorical variable
input_cat = Input(shape=(1,))

# Output dimension of the categorical entity embedding
cat_emb_dim = 5

# Embedding layer
emb_cat = Embedding(input_dim=n_unique_cat, output_dim=cat_emb_dim, name="embedding_cat")(input_cat)
# Check the output shape
print(emb_cat)

# Reshape
emb_cat = Reshape(target_shape=(cat_emb_dim, ))(emb_cat)
# Check the output shape
print(emb_cat)

Output:

KerasTensor(type_spec=TensorSpec(shape=(None, 1, 5), dtype=tf.float32, name=None), name='embedding_cat/embedding_lookup/Identity_1:0', description="created by layer 'embedding_cat'")
KerasTensor(type_spec=TensorSpec(shape=(None, 5), dtype=tf.float32, name=None), name='reshape/Reshape:0', description="created by layer 'reshape'")
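
As an aside, if the number of describing aspects is hard to estimate and a full search over 1 to n-1 dimensions is too expensive, a commonly used rule of thumb (my own addition here, not from the entity embedding paper) is to use roughly half the cardinality, capped at 50:

# Rule-of-thumb embedding size (hypothetical helper, not part of the original tutorial)
def suggest_embedding_dim(n_categories):
    return min(50, (n_categories + 1) // 2)

print(suggest_embedding_dim(39))  # 20 for the 39 neighbourhoods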

Recall that X_train has 11 columns, 1 categorical column neighbourhood and 10 numerical columns.

# Feature columns
X_train.columns

Output:

Index(['neighbourhood', 'minimum_nights', 'number_of_reviews',
'reviews_per_month', 'calculated_host_listings_count',
'availability_365', 'number_of_reviews_ltm', 'Entire home/apt',
'Hotel room', 'Private room', 'Shared room'],
dtype='object')

Next, let's append the numerical values to the training and testing lists.

# List of numerical columns
numeric_cols = ['minimum_nights', 'number_of_reviews',
                'reviews_per_month', 'calculated_host_listings_count',
                'availability_365', 'number_of_reviews_ltm', 'Entire home/apt',
                'Hotel room', 'Private room', 'Shared room']

# Append numerical values to the training and testing list
input_list_train.append(X_train[numeric_cols].values)
input_list_test.append(X_test[numeric_cols].values)

# Take a look at the data
print('input_list_train:', input_list_train)
print('input_list_test:', input_list_test)

We can see that the training and the testing input lists now have two arrays each. The first array contains the encoded categorical values, and the second array contains the numerical variable values.

input_list_train: [array([17, 12,  7, ..., 16, 14, 27]), array([[31.  , 34.  ,  0.52, ...,  0.  ,  0.  ,  0.  ],
[31. , 2. , 0.15, ..., 0. , 0. , 0. ],
[31. , 28. , 0.65, ..., 0. , 0. , 0. ],
...,
[ 4. , 8. , 3.12, ..., 0. , 0. , 0. ],
[ 2. , 14. , 4.83, ..., 0. , 0. , 0. ],
[ 4. , 45. , 0.62, ..., 0. , 0. , 0. ]])]
input_list_test: [array([ 2, 17, 27, ..., 13, 16, 7]), array([[ 2. , 57. , 9.34, ..., 0. , 0. , 0. ],
[ 1. , 10. , 0.15, ..., 0. , 1. , 0. ],
[31. , 0. , 0. , ..., 0. , 1. , 0. ],
...,
[31. , 2. , 1.3 , ..., 0. , 1. , 0. ],
[ 3. , 68. , 1.43, ..., 0. , 0. , 0. ],
[ 1. , 7. , 1.52, ..., 0. , 0. , 0. ]])]

The input dimension of the numeric variables is defined by instantiating a Keras tensor using the Input() function. shape=(len(numeric_cols),) indicates that the expected input dimension is the same as the number of numeric columns. There is no embedding layer for the numeric values, so emb_numeric is the same as input_numeric.

# Input dimension of the numeric variables
input_numeric = Input(shape=(len(numeric_cols),))

# Output dimension of the numeric variables
emb_numeric = input_numeric

# Take a look at the output dimension
emb_numeric

We can see that the output dimension for the numeric variables is 10.

<KerasTensor: shape=(None, 10) dtype=float32 (created by layer 'input_2')>

The input Keras tensors for both the categorical and the numeric variables are put in a list called input_data. input_data shows that the first element has an input dimension of 1 and the second element has an input dimension of 10.

# Input data dimensions
input_data = [input_cat, input_numeric]

# Take a look at the data
input_data

Output:

[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'input_1')>,
<KerasTensor: shape=(None, 10) dtype=float32 (created by layer 'input_2')>]

Similarly, the output Keras tensors for both the categorical and the numeric variables are put in a list called emb_data. emb_data shows that the first element has an output dimension of 5 and the second element has an output dimension of 10.

  • The categorical variable neighbourhood has a dimension of 5 because we specified the embedding dimension to be 5 in the embedding layer.
  • The numeric variables have a dimension of 10 because there are 10 numeric columns.
# Embedding dimensions
emb_data = [emb_cat, emb_numeric]

# Take a look at the data
emb_data

Output:

[<KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'reshape')>,
<KerasTensor: shape=(None, 10) dtype=float32 (created by layer 'input_2')>]

Using the Concatenate() function, the Keras tensors in the list emb_data are concatenated together. The output Keras tensor has a dimension of 15, which is the sum of the dimension of the two tensors in the list.

# Concatenate layer concatenates a list of inputs
model_data = Concatenate()(emb_data)
model_data

Output:

<KerasTensor: shape=(None, 15) dtype=float32 (created by layer 'concatenate')>

The concatenated data is passed in the Dense layer of the model.

  • The first three Dense layers have 10 neurons, 5 neurons, and 2 neurons respectively. They all use relu as the activation function.
  • The output layer has one neuron, and the activation function is linear because the target variable price is continuous.

Then the input_data and outputs are grouped into an object using the Model function.

  • inputs takes in the inputs of the model. It can be a Keras Input object or a list of Keras Input objects. As the Keras documentation points out, only dictionaries, lists, and tuples of input tensors are supported; nested inputs such as a list of lists or a dictionary of dictionaries are not.
  • outputs takes in the outputs of the model.
  • name is the name of the model. We gave our model the name of Entity_embedding_model_keras.

We can print out the model details using the summary function.

# Dense layer with 10 neurons and relu activation function
model = Dense(10, activation = 'relu')(model_data)
# Dense layer with 5 neurons and relu activation function
model = Dense(5, activation = 'relu')(model)
# Dense layer with 2 neurons and relu activation function
model = Dense(2, activation = 'relu')(model)
# Output is linear
outputs = Dense(1, activation = 'linear')(model)

# Use Model to group layers into an object with training and inference features
nn = Model(inputs=input_data, outputs=outputs, name ='Entity_embedding_model_keras')

# Print out the model summary
nn.summary()

The summary of the model Entity_embedding_model_keras has four columns.

  • The first column is Layer (type). It has the layer names with the layer type in the parenthesis.
  1. The system automatically gives each layer a name if the user did not specify the layer name. Out of the 9 layer names, only embedding_cat is user-defined. All the other names are auto-assigned.
  2. The type in the parentheses describes the type of the corresponding layer. We can see that our model has the InputLayer, the Embedding layer, the Reshape layer, the Concatenate layer, and the Dense layer.
  • The second column is Output Shape.
  1. None means the batch size has not been fixed. It can be any number of samples.
  2. The numbers are the number of neurons for the Dense layers and the number of columns for the other layers.
  • The third column is Param #. It is the number of parameters that need to be estimated.
  1. The embedding_cat layer has 195 parameters, calculated as 39 unique neighbourhoods multiplied by the embedding dimension of 5.
  2. The layer dense has 160 parameters, calculated as the 15 input values from the Concatenate layer multiplied by the 10 neurons in dense, plus 10 bias values.
  3. The layer dense_1 has 55 parameters, calculated as the 10 input values from the dense layer multiplied by the 5 neurons in dense_1, plus 5 bias values.
  4. The layer dense_2 has 12 parameters, calculated as the 5 input values from the dense_1 layer multiplied by the 2 neurons in dense_2, plus 2 bias values.
  5. The last layer dense_3 has 3 parameters, calculated as the 2 input values from the dense_2 layer multiplied by the 1 neuron in dense_3, plus 1 bias value.
  6. The InputLayer, Reshape layer, and Concatenate layer do not have parameters to estimate.
  • The fourth column is Connected to, indicating the input layer for the current layer.
  1. The two input layers, input_1 and input_2 do not have any value for this column.
  2. The embedding_cat layer is connected to input_1.
  3. The Reshape layer is connected to the embedding_cat layer.
  4. The Concatenate layer is connected to both the reshape layer and the input_2 layer. The reshape layer carries the categorical entity embeddings, and the input_2 layer has the numeric inputs.
  5. The layer dense is connected to the Concatenate layer, and each of the following Dense layers is connected to the previous Dense layer.

At the bottom of the summary table, the numbers of Total params, Trainable params, and Non-trainable params are listed.

  • Total params is the total number of parameters for the model, and it is the sum of all the parameter numbers from each layer.
  • Trainable params is the number of trainable parameters that are trained using backpropagation. All the parameters in this model are trainable parameters.
  • Non-trainable params is the number of parameters that are not trained using backpropagation. Our model does not have any non-trainable params. We can set trainable=False on a layer to change it from trainable to non-trainable if needed.
Model: "Entity_embedding_model_keras"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 1)] 0 []

embedding_cat (Embedding) (None, 1, 5) 195 ['input_1[0][0]']

reshape (Reshape) (None, 5) 0 ['embedding_cat[0][0]']

input_2 (InputLayer) [(None, 10)] 0 []

concatenate (Concatenate) (None, 15) 0 ['reshape[0][0]',
'input_2[0][0]']

dense (Dense) (None, 10) 160 ['concatenate[0][0]']

dense_1 (Dense) (None, 5) 55 ['dense[0][0]']

dense_2 (Dense) (None, 2) 12 ['dense_1[0][0]']

dense_3 (Dense) (None, 1) 3 ['dense_2[0][0]']

==================================================================================================
Total params: 425
Trainable params: 425
Non-trainable params: 0
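
As a quick sanity check (a small sketch added here, not part of the original notebook), the parameter counts from the summary can be reproduced by hand:

# Embedding layer: one weight per (category, embedding dimension) pair
emb_params = 39 * 5                    # 195

# Dense layers: (number of inputs * number of neurons) + one bias per neuron
dense_params   = 15 * 10 + 10          # 160
dense_1_params = 10 * 5 + 5            # 55
dense_2_params = 5 * 2 + 2             # 12
dense_3_params = 2 * 1 + 1             # 3

print(emb_params + dense_params + dense_1_params + dense_2_params + dense_3_params)  # 425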

The neural network model structure can be represented by a graph and saved as a picture.

  • From the graph, we can see that one categorical variable is the input for the embedding layer, and the embedding outputs 5 variables.
  • The five variables are concatenated with the 10 numeric variables and a total of 15 variables are the inputs for the dense layers.
# Print model structure
plot_model(nn, show_shapes=True, show_layer_names=True, to_file='Entity_embedding_model_keras.png')
Image(retina=True, filename='Entity_embedding_model_keras.png')
Neural network model structure — GrabNGoInfo.com

When compiling the model, we set mean_squared_error as loss, adam as the optimizer, and mae as metrics.

EarlyStopping is used to prevent overfitting and save computing resources.

  • monitor='val_loss' means that the validation loss will be monitored.
  • mode can take the value of 'min', 'max' or 'auto'.
  1. 'min' means the training stops if the metric stops decreasing.
  2. 'max' means the training stops if the metric stops increasing.
  3. 'auto' means the stop criterion is inferred from the name of the monitored metric.
  • verbose controls whether to print out messages. It has two values, 0 and 1. 0 means do not print out any message, and 1 means that a message will be printed when the early stopping happens.
  • patience is the threshold number of epochs without improvement.
  • restore_best_weights=True means that the model weights from the epoch with the best monitored value will be restored at the end of training.

When fitting the model, we pass in the training dataset, and use the testing dataset as the validation data.

  • One epoch is when the whole training dataset has been used once. epochs=1000 means the model goes through the training dataset at most 1000 times.
  • batch_size=64 means that 64 samples will be used to update the weights and biases each time. Because we have 5128 records in the training dataset, it will be divided into 80 batches with 64 samples and 1 batch with 8 samples.
  • verbose=1 means that we will print out some messages during the model training process.
  • We pass in the EarlyStopping variable to callbacks.
# Compile model
nn.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])

# Set up early stopping
es = EarlyStopping(monitor='val_loss',
                   mode='min',
                   verbose=1,
                   patience=50,
                   restore_best_weights=True)

# Fit the model
history = nn.fit(input_list_train,
                 y_train,
                 validation_data=(input_list_test, y_test),
                 epochs=1000,
                 batch_size=64,
                 verbose=1,
                 callbacks=[es])

Early stopping happened at epoch 883, with a training mean absolute error (MAE) of about 68 and a validation MAE of about 69, after restoring the best weights from epoch 833.

Epoch 882/1000
81/81 [==============================] - 0s 4ms/step - loss: 11156.1582 - mae: 67.8472 - val_loss: 13683.8223 - val_mae: 73.7888
Epoch 883/1000
64/81 [======================>.......] - ETA: 0s - loss: 11323.4639 - mae: 68.3551Restoring model weights from the end of the best epoch: 833.
81/81 [==============================] - 0s 4ms/step - loss: 11120.8857 - mae: 67.8158 - val_loss: 13986.5391 - val_mae: 68.9497
Epoch 883: early stopping

Step 7: Model Performance

In step 7, we will check the model performance.

The first visualization plots the loss change over epochs. Because we set the loss to mean squared error when compiling the model, loss in the chart represents the mean squared error. Both the training and the validation loss decrease with the epochs, but the validation loss is higher than the training loss, which may suggest model overfitting. We will leave the topic of correcting overfitting for a neural network model to a separate tutorial.

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper right')
History for loss — GrabNGoInfo.com

The second visualization plots the mean absolute error (MAE) over epochs. We can see that both the training and the validation MAE decrease with the epochs, and the validation MAE is slightly higher than the training MAE.

# summarize history for mae
plt.plot(history.history['mae'])
plt.plot(history.history['val_mae'])
plt.title('Mean Absolute Error (MAE)')
plt.ylabel('Mean Absolute Error (MAE)')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='lower right')
plt.show()
History for MAE — GrabNGoInfo.com

Using .predict, we make predictions on the testing dataset. The output is two-dimensional, so flatten is used to change the prediction to one-dimensional.

The scatter plot of the predicted price vs. the actual price shows that a higher predicted price corresponds to a higher actual price in general, but there are some listings with much higher actual prices than the predicted prices.

# Make prediction
y_test_predict = nn.predict(input_list_test)

# Change the predictions from 2-d to 1-d
y_test_predict = y_test_predict.flatten()

# Visualization
ax = sns.scatterplot(y_test, y_test_predict)
Actual vs. predicted price — GrabNGoInfo.com

To quantify the difference between the actual price and the predicted price, we calculated a variable called model_error, and used it to calculate the mean squared error (MSE), the root mean squared error (RMSE), the mean absolute error (MAE), R squared, and the mean absolute percentage error (MAPE).

# Calculate model error
model_error = y_test - y_test_predict

# Mean squared error
MSE = np.mean(model_error**2)
# Root mean squared error
RMSE = np.sqrt(MSE)
# Mean absolute error
MAE = np.mean(abs(model_error))
# R squared
R2 = 1- sum(model_error**2)/sum((y_test-np.mean(y_test))**2)
# Mean absolute percentage error
MAPE = np.mean(abs(model_error/y_test))

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')

Output:

The MSE for the model is 13663.52
The RMSE for the model is 116.89.
The MAE for the model is 72.51.
The R-squared for the model is 0.24.
The MAPE for the model is 0.49.

The same metrics can be calculated using the functions from the sklearn.metrics library. We can see that the sklearn library and the manual calculation produce the same results.

# Import library
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

# Mean squared error
MSE = mean_squared_error(y_test, y_test_predict)
# Root mean squared error
RMSE = np.sqrt(MSE)
# Mean absolute error
MAE = mean_absolute_error(y_test, y_test_predict)

R2 = r2_score(y_test, y_test_predict)
MAPE = mean_absolute_percentage_error(y_test, y_test_predict)

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')

Output:

The MSE for the model is 13663.52
The RMSE for the model is 116.89.
The MAE for the model is 72.51.
The R-squared for the model is 0.24.
The MAPE for the model is 0.49.

Step 8: Extract Categorical Embeddings

In step 8, we will extract the categorical embeddings.

The weights are extracted from the embedding layer and saved in a dataframe. We can see that the dataframe has 39 rows, each row representing one unique neighbourhood. There are 6 columns in the dataframe: the first column is the encoded categorical index, and the other five columns are the embeddings.

# Get weights from the embedding layer
cat_emb_df = pd.DataFrame(nn.get_layer('embedding_cat').get_weights()[0]).reset_index()

# Add prefix to the embedding names
cat_emb_df = cat_emb_df.add_prefix('cat_')

# Take a look at the data
cat_emb_df

To append the neighbourhood name to the embedding dataframe, we put the categorical encoder dictionary into a dataframe, and merge the dataframe with the embedding dataframe on the categorical index.

# Put the categorical encoder dictionary into a dataframe
cat_encoder_df = pd.DataFrame(cat_encoder.items(), columns=['cat', 'cat_index'])

# Merge data to append the category name
cat_emb_df = pd.merge(cat_encoder_df, cat_emb_df, how = 'inner', on='cat_index')

# Take a look at the data
cat_emb_df.head()
Sample of embedding output — GrabNGoInfo.com
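
As mentioned at the beginning of this tutorial, the embedding vectors can also be used to measure similarity between categories. Below is a minimal sketch (assuming the cat_emb_df produced above) that uses cosine similarity between the neighbourhood vectors:

from sklearn.metrics.pairwise import cosine_similarity

# The five embedding columns produced above
emb_cols = ['cat_0', 'cat_1', 'cat_2', 'cat_3', 'cat_4']

# Pairwise cosine similarity between all 39 neighbourhood vectors
sim = cosine_similarity(cat_emb_df[emb_cols])
sim_df = pd.DataFrame(sim, index=cat_emb_df['cat'], columns=cat_emb_df['cat'])

# Neighbourhoods most similar to Georgetown (the first entry is Georgetown itself)
print(sim_df['Georgetown, Burleith/Hillandale'].sort_values(ascending=False).head(6))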

Step 9: Save Entity Embedding Results

In step 9, we will save the entity embedding results. We can use to_csv to save the entity embedding results to a csv file. index = False means that the index of the dataframe will not be saved.

The embedding model can be saved in hdf5 format using .save.

# Save embedding results
cat_emb_df.to_csv('cat_embedding_keras.csv', index = False)

# Save model
nn.save("cat_embedding_keras.hdf5")

In the future, the model can be loaded using the load_model function.

# Load model
loaded_nn = load_model("cat_embedding_keras.hdf5")
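
To verify the round trip, we can check that the loaded model carries the same embedding weights as the original model:

# Compare the embedding weights of the original and the loaded model
original_weights = nn.get_layer('embedding_cat').get_weights()[0]
loaded_weights = loaded_nn.get_layer('embedding_cat').get_weights()[0]
print(np.allclose(original_weights, loaded_weights))  # expected: True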

Step 10: Baseline Random Forest Model

In step 10, we will build a random forest model as a baseline model to compare with the neural network model.

The model uses all 10 numeric features, and the scatterplot of actual vs. predicted values shows that the two generally align. Similar to the neural network model, some listings have much higher actual prices than predicted prices.

# Feature list
base_cols = ['minimum_nights',
             'number_of_reviews',
             'reviews_per_month',
             'calculated_host_listings_count',
             'availability_365',
             'number_of_reviews_ltm',
             'Entire home/apt',
             'Hotel room',
             'Private room',
             'Shared room']

# Initiate the model
base_rf = RandomForestRegressor()

# Fit the model
base_rf.fit(X_train[base_cols], y_train)

# Make predictions
base_y_test_prediction = base_rf.predict(X_test[base_cols])

# Visualization
ax = sns.scatterplot(y_test, base_y_test_prediction)
Actual vs. predicted price — GrabNGoInfo.com

The random forest model performance metrics are similar to the neural network model metrics.

# Prediction error
base_model_error = y_test - base_y_test_prediction

# Mean squared error
MSE = np.mean(base_model_error**2)
# Root mean squared error
RMSE = np.sqrt(MSE)
# Mean absolute error
MAE = np.mean(abs(base_model_error))
# R squared
R2 = 1- sum(base_model_error**2)/sum((y_test-np.mean(y_test))**2)
# Mean absolute percentage error
MAPE = np.mean(abs(base_model_error/y_test))

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')

Output:

The MSE for the model is 13551.77
The RMSE for the model is 116.41.
The MAE for the model is 71.29.
The R-squared for the model is 0.25.
The MAPE for the model is 0.51.

Step 11: Use Categorical Entity Embedding in Other Machine Learning Models

In step 11, we will talk about how to use the categorical entity embeddings in other machine learning models. The random forest model will be used as an example, and other models can follow the same process.

To use the entity embeddings in a model, we treat the embeddings as five additional features for each neighbourhood and append them to the training and testing datasets separately.

# Append categorical embeddings to the training dataset
X_train_emb = pd.merge(X_train, cat_emb_df, left_on='neighbourhood', right_on='cat', how='inner').drop(['neighbourhood','cat', 'cat_index'], axis=1)

# Append categorical embeddings to the testing dataset
X_test_emb = pd.merge(X_test, cat_emb_df, left_on='neighbourhood', right_on='cat', how='inner').drop(['neighbourhood','cat', 'cat_index'], axis=1)

# Check info for the training dataset
X_train_emb.info()

# Check info for the testing dataset
X_test_emb.info()

We can see that after adding the embedding features, both the training dataset and the testing dataset have 15 columns.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5128 entries, 0 to 5127
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 minimum_nights 5128 non-null int64
1 number_of_reviews 5128 non-null int64
2 reviews_per_month 5128 non-null float64
3 calculated_host_listings_count 5128 non-null int64
4 availability_365 5128 non-null int64
5 number_of_reviews_ltm 5128 non-null int64
6 Entire home/apt 5128 non-null uint8
7 Hotel room 5128 non-null uint8
8 Private room 5128 non-null uint8
9 Shared room 5128 non-null uint8
10 cat_0 5128 non-null float32
11 cat_1 5128 non-null float32
12 cat_2 5128 non-null float32
13 cat_3 5128 non-null float32
14 cat_4 5128 non-null float32
dtypes: float32(5), float64(1), int64(5), uint8(4)
memory usage: 400.6 KB

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1283 entries, 0 to 1282
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 minimum_nights 1283 non-null int64
1 number_of_reviews 1283 non-null int64
2 reviews_per_month 1283 non-null float64
3 calculated_host_listings_count 1283 non-null int64
4 availability_365 1283 non-null int64
5 number_of_reviews_ltm 1283 non-null int64
6 Entire home/apt 1283 non-null uint8
7 Hotel room 1283 non-null uint8
8 Private room 1283 non-null uint8
9 Shared room 1283 non-null uint8
10 cat_0 1283 non-null float32
11 cat_1 1283 non-null float32
12 cat_2 1283 non-null float32
13 cat_3 1283 non-null float32
14 cat_4 1283 non-null float32
dtypes: float32(5), float64(1), int64(5), uint8(4)
memory usage: 100.2 KB

Using the new feature list with embeddings to train the same random forest model, we can get the new predicted results for the testing dataset.

The dots in the scatter plot are more scattered than in the base model's plot, indicating that this model may perform worse.

# Initiate the model
emb_rf = RandomForestRegressor()

# Fit the model
emb_rf.fit(X_train_emb, y_train)

# Make predictions
emb_y_test_prediction = emb_rf.predict(X_test_emb)

# Visualization
ax = sns.scatterplot(y_test, emb_y_test_prediction)
Actual vs. predicted price — GrabNGoInfo.com

The model performance metrics confirmed that the random forest model with embeddings performs worse than the baseline model. The values are worse across all the metrics.

This tells us that adding categorical entity embeddings does not always improve the model’s performance. Sometimes it just adds noise to the model and makes the model performance worse. So it is always a good idea to have a baseline model before adding entity embeddings.

# Model error
emb_model_error = y_test - emb_y_test_prediction

# Mean squared error
MSE = np.mean(emb_model_error**2)
# Root mean squared error
RMSE = np.sqrt(MSE)
# Mean absolute error
MAE = np.mean(abs(emb_model_error))
# R squared
R2 = 1- sum(emb_model_error**2)/sum((y_test-np.mean(y_test))**2)
# Mean absolute percentage error
MAPE = np.mean(abs(emb_model_error/y_test))

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')

Output:

The MSE for the model is 19586.73
The RMSE for the model is 139.95.
The MAE for the model is 95.32.
The R-squared for the model is -0.09.
The MAPE for the model is 0.84.
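
Putting the three models side by side (all numbers copied from the outputs above) makes the comparison clearer:

Metric       Neural network   Baseline random forest   Random forest with embeddings
MSE          13663.52         13551.77                  19586.73
RMSE         116.89           116.41                    139.95
MAE          72.51            71.29                     95.32
R-squared    0.24             0.25                      -0.09
MAPE         0.49             0.51                      0.84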

One possible reason for the worse performance of the random forest model with embeddings is the small dataset used to train the embeddings. There is not enough information in the dataset for the model to learn meaningful embeddings.

In my own work, when millions of records were used to train the embeddings, the baseline XGBoost model's performance improved slightly after adding them.

The entity embedding paper has two tables showing the model performance improvement across four different machine learning algorithms.

Guo and Berkhahn paper — Entity Embeddings of Categorical Variables

What’s your experience with using entity embedding? Did it improve the model’s performance? Please let me know in the comments.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

