Categorical entity embedding extracts the embedding layers of categorical variables from a neural network model and uses numeric vectors to represent the properties of the categorical values. It is typically applied to categorical variables with high cardinality.

For example, a marketing company can create categorical entity embeddings for different campaigns to represent their characteristics as vectors, use those vectors to understand the similarities between campaigns, or feed the vectors as features into other machine learning models to improve model performance.

In this tutorial, Python TensorFlow Keras is used to create categorical entity embeddings of Airbnb neighbourhood data. We will cover:

- How to do data processing for categorical entity embedding?
- How to build a neural network model with entity embedding?
- How to extract the categorical embedding layers?
- How to use categorical entity embedding in other machine learning models?

**Resources for this post:**

- Video tutorial for this post on YouTube
- Click here for the Colab notebook
- More video tutorials on Deep Learning
- More blog posts on Deep Learning

Let’s get started!

### Step 1: Install And Import Python Libraries

In step 1, we will install and import python libraries.

Firstly, let’s install `matplotlib` version 3.4.2. We install version 3.4.2 because `matplotlib` versions later than 3.4.0 have a new functionality for adding labels to seaborn visualizations.

```python
# Change matplotlib to a version later than 3.4.0 for the countplot visualization
!pip install matplotlib==3.4.2
```

After the installation and restarting of the runtime, we can import the libraries.

`pandas` and `numpy` are imported for data processing. `train_test_split` is for the train test split. `RandomForestRegressor` is for building the random forest model. `seaborn` and `matplotlib` are for visualization. `plot_model` and `Image` are for visualizing the neural network model structure. `tensorflow` and `EarlyStopping` are for the neural network model.

```python
# Data processing
import pandas as pd
import numpy as np

# Train test split
from sklearn.model_selection import train_test_split

# Model
from sklearn.ensemble import RandomForestRegressor

# Visualization
import seaborn as sns
sns.set(rc={'figure.figsize':(12,8)}) # Set figure size
import matplotlib.pyplot as plt

# Visualize neural network model structure
from keras.utils import plot_model
from IPython.display import Image

# Deep learning model
from tensorflow.keras.layers import Input, Dense, Reshape, Concatenate, Embedding
from tensorflow.keras.models import Model, load_model
from keras.callbacks import EarlyStopping
```

### Step 2: Download And Read Airbnb Review Data

The second step is to download and read the dataset.

A website called Inside Airbnb makes Airbnb data publicly available for research. We use the listing data for Washington D.C. for this analysis, but the website also provides data for other locations around the world.

Please follow these steps to download the data.

- Go to: http://insideairbnb.com/get-the-data
- Scroll down the page until you see the section called **Washington, D.C., District of Columbia, United States**.
- Click the blue file name “listings.csv” to download the data.
- Copy the downloaded file “listings.csv” to your project folder.

Note that Inside Airbnb generally provides quarterly data for the past 12 months, but users can make a data request for historical data of a longer time range if needed.

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

`drive.mount` is used to mount Google Drive so the Colab notebook can access the data on Google Drive. `os.chdir` is used to change the default directory on Google Drive. I suggest setting the default directory to the project folder. `!pwd` is used to print the current working directory.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

```python
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd
```

The listing data has the property information aggregated at the listing ID level. We will build a simple model to predict the listing price, and the following columns will be read from the dataset.

- `id` is the unique ID for the Airbnb listing.
- `neighbourhood` is the neighbourhood name where the listing is located.
- `room_type` is the type of room. It can be the entire house, a room in a house, etc.
- `price` is the daily price in local currency. Washington D.C. is in the United States, so the currency for the price is US dollars.
- `minimum_nights` is the listing’s minimum number of nights for a stay.
- `number_of_reviews` is the total number of reviews for the listing.
- `reviews_per_month` is the average number of reviews per month.
- `calculated_host_listings_count` is the total number of listings that the host has in Washington D.C.
- `availability_365` is the availability of the listing in the next 365 days. The listing can be unavailable because of guest bookings or the host blocking dates.
- `number_of_reviews_ltm` is the number of reviews in the last 12 months.

More details can be found in the Inside Airbnb data dictionary.

```python
# List of columns to read
cols_to_keep = ['id',
                'neighbourhood',
                'room_type',
                'price',
                'minimum_nights',
                'number_of_reviews',
                'reviews_per_month',
                'calculated_host_listings_count',
                'availability_365',
                'number_of_reviews_ltm']

# Read data
df = pd.read_csv('airbnb/airbnb_listings_dc_20020914.csv', usecols=cols_to_keep)

# Take a look at the data
df.head()
```

Using `.info()`, we can see that the dataset has 6473 records and 10 columns. 2 of the 10 columns are categorical: `neighbourhood` and `room_type`. Most of the columns do not have missing data. Only the variable `reviews_per_month` has missing values and needs imputation.

```python
# Check the dataframe information
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6473 entries, 0 to 6472
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              6473 non-null   int64
 1   neighbourhood                   6473 non-null   object
 2   room_type                       6473 non-null   object
 3   price                           6473 non-null   int64
 4   minimum_nights                  6473 non-null   int64
 5   number_of_reviews               6473 non-null   int64
 6   reviews_per_month               5295 non-null   float64
 7   calculated_host_listings_count  6473 non-null   int64
 8   availability_365                6473 non-null   int64
 9   number_of_reviews_ltm           6473 non-null   int64
dtypes: float64(1), int64(7), object(2)
memory usage: 505.8+ KB
```

### Step 3: Data Processing

In step 3, we will work on data processing.

For the missing values in the variable `reviews_per_month`, I suspect that the values are missing because those listings have no reviews. To confirm this assumption, I filtered the dataframe to keep only the records with missing values, then checked the values of the variable `number_of_reviews`.

```python
# Check the records with missing data
df[df['reviews_per_month'].isnull()]['number_of_reviews'].value_counts()
```

```
0    1178
Name: number_of_reviews, dtype: int64
```

We can see that all the listings with missing values have zero reviews. Therefore, we impute the missing values with zero.

```python
# Impute the missing values for reviews_per_month to 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
```

The price range shows that the minimum price is 0 and the maximum price is 10,000. We removed the outlier prices and only kept the listings with a daily price greater than 20 dollars and less than 1000 dollars.

```python
# Check the min and max price
print(f'The minimum price is {df.price.min()} and the maximum price is {df.price.max()}.')

# Remove outliers
df = df[(df['price']>20) & (df['price']<1000)]
```

Output:

```
The minimum price is 0 and the maximum price is 10000.
```

The visualization for the price data shows that all the outliers are removed.

```python
# Visualization
sns.displot(df['price'])
```

The variable `room_type` has four values: `Private room`, `Entire home/apt`, `Shared room`, and `Hotel room`. `Entire home/apt` is the most popular `room_type` and `Private room` is the second most popular.

```python
# Countplot of room_type
ax = sns.countplot(df['room_type'])

# Add labels
ax.bar_label(ax.containers[0])
```

Because the number of categories for `room_type` is small, we will not create entity embeddings for it. Instead, we will use `get_dummies` to create a dummy variable with zero and one values for each category. After the dummy variables are created, we append them to the dataframe `df` and drop the column `room_type`.

```python
# Create dummy variables
room_type_dummies = pd.get_dummies(df['room_type'])

# Concat dummy variables to df and drop the original category
df = pd.concat([df, room_type_dummies], axis=1).drop('room_type', axis=1)

# Take a look at the data
df.head()
```

After the data processing, we have 6411 records and 13 columns. There are no missing values in the dataset, and `neighbourhood` is the only categorical variable.

```python
# Get data information
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6411 entries, 0 to 6472
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              6411 non-null   int64
 1   neighbourhood                   6411 non-null   object
 2   price                           6411 non-null   int64
 3   minimum_nights                  6411 non-null   int64
 4   number_of_reviews               6411 non-null   int64
 5   reviews_per_month               6411 non-null   float64
 6   calculated_host_listings_count  6411 non-null   int64
 7   availability_365                6411 non-null   int64
 8   number_of_reviews_ltm           6411 non-null   int64
 9   Entire home/apt                 6411 non-null   uint8
 10  Hotel room                      6411 non-null   uint8
 11  Private room                    6411 non-null   uint8
 12  Shared room                     6411 non-null   uint8
dtypes: float64(1), int64(7), object(1), uint8(4)
memory usage: 783.9+ KB
```

Using `.nunique()`, we can confirm that the dataset is unique at the listing ID level.

```python
# Number of unique IDs
df['id'].nunique()
```

```
6411
```

The `countplot` of `neighbourhood` shows that the number of listings in a neighbourhood ranges from 12 to 557. The `Capitol Hill, Lincoln Park` neighbourhood has the highest number of listings.

```python
# Countplot
ax = sns.countplot(df['neighbourhood'])

# Add labels
ax.bar_label(ax.containers[0])

# Rotate x labels
ax.tick_params(axis='x', rotation=90)
```

### Step 4: Train Test Split

In step 4, we will do the train test split for the model.

- `X` has all the features for the model prediction. `.iloc` is used to select the columns starting from the 2nd column, which excludes the `id` variable, and the target column `price` is then dropped. `.iloc` creates a new variable that refers to the same memory as the original dataframe `df`, so changing the new dataframe can alter the original dataframe. Therefore, we use `.copy()` to create a new copy so `X` and `df` use separate memory and do not affect each other.
- `y` is the target variable. We are predicting the daily listing prices, so the column `price` is used as the target.
- `X` and `y` are passed into `train_test_split` to create the training and testing datasets. `test_size=0.2` means that 80% of the data is used for training and 20% for testing. `random_state` makes the train test split results reproducible.

```python
# Features
X = df.iloc[:, 1:].copy().drop('price', axis=1)

# Target
y = df['price']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records in the training and testing datasets
print(f'The training dataset has {X_train.shape[0]} records and {X_train.shape[1]} columns.')
print(f'The testing dataset has {len(X_test)} records.')
```

```
The training dataset has 5128 records and 11 columns.
The testing dataset has 1283 records.
```

After the train test split, the training dataset has 5128 records and 11 columns, and the testing dataset has 1283 records.

### Step 5: Categorical Data Label Encoding

In step 5, we will do the categorical label encoding for the `neighbourhood` variable.

- Firstly, two empty lists are created, one for the training data and the other for the testing data.
- Next, an empty dictionary called `cat_encoder` is created. This dictionary will be used to save the label encodings for the categorical variable `neighbourhood`.
- After that, all the unique values of `neighbourhood` are extracted and saved in a variable called `unique_cat`. There are 39 unique neighbourhoods in the training dataset.
- Then, we loop through the neighbourhoods and assign an integer to each one.
- Finally, `cat_encoder` is printed out, and we can see that each neighbourhood is a key and each key has an integer as its value. There are 39 elements in the dictionary, corresponding to the 39 unique neighbourhoods.

```python
# Input list for the training data
input_list_train = []

# Input list for the testing data
input_list_test = []

# Categorical encoder is in dictionary format
cat_encoder = {}

# Unique values for the categorical variable
unique_cat = np.unique(X_train['neighbourhood'])

# Print out the number of unique values in the categorical variable
print(f'There are {len(unique_cat)} unique neighbourhoods in the training dataset.\n')

# Encode the categorical variable
for i in range(len(unique_cat)):
    cat_encoder[unique_cat[i]] = i

# Take a look at the encoder
cat_encoder
```

Output:

```
There are 39 unique neighbourhoods in the training dataset.

{'Brightwood Park, Crestwood, Petworth': 0,
 'Brookland, Brentwood, Langdon': 1,
 'Capitol Hill, Lincoln Park': 2,
 'Capitol View, Marshall Heights, Benning Heights': 3,
 'Cathedral Heights, McLean Gardens, Glover Park': 4,
 'Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace': 5,
 'Colonial Village, Shepherd Park, North Portal Estates': 6,
 'Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View': 7,
 'Congress Heights, Bellevue, Washington Highlands': 8,
 'Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights': 9,
 'Douglas, Shipley Terrace': 10,
 'Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street': 11,
 'Dupont Circle, Connecticut Avenue/K Street': 12,
 'Eastland Gardens, Kenilworth': 13,
 'Edgewood, Bloomingdale, Truxton Circle, Eckington': 14,
 'Fairfax Village, Naylor Gardens, Hillcrest, Summit Park': 15,
 'Friendship Heights, American University Park, Tenleytown': 16,
 'Georgetown, Burleith/Hillandale': 17,
 'Hawthorne, Barnaby Woods, Chevy Chase': 18,
 'Historic Anacostia': 19,
 'Howard University, Le Droit Park, Cardozo/Shaw': 20,
 'Ivy City, Arboretum, Trinidad, Carver Langston': 21,
 'Kalorama Heights, Adams Morgan, Lanier Heights': 22,
 'Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill': 23,
 'Mayfair, Hillbrook, Mahaning Heights': 24,
 'Near Southeast, Navy Yard': 25,
 'North Cleveland Park, Forest Hills, Van Ness': 26,
 'North Michigan Park, Michigan Park, University Heights': 27,
 'River Terrace, Benning, Greenway, Dupont Park': 28,
 'Shaw, Logan Circle': 29,
 'Sheridan, Barry Farm, Buena Vista': 30,
 'Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point': 31,
 'Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir': 32,
 'Takoma, Brightwood, Manor Park': 33,
 'Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont': 34,
 'Union Station, Stanton Park, Kingman Park': 35,
 'West End, Foggy Bottom, GWU': 36,
 'Woodland/Fort Stanton, Garfield Heights, Knox Hill': 37,
 'Woodridge, Fort Lincoln, Gateway': 38}
```

The `neighbourhood` column is mapped to the integers using `cat_encoder`, and the values are appended to the input lists for the training and testing datasets separately.

```python
# Append the values to the input list
input_list_train.append(X_train['neighbourhood'].map(cat_encoder).values)
input_list_test.append(X_test['neighbourhood'].map(cat_encoder).values)

# Take a look at the data
print('input_list_train:', input_list_train)
print('input_list_test:', input_list_test)
```

Output:

```
input_list_train: [array([17, 12,  7, ..., 16, 14, 27])]
input_list_test: [array([ 2, 17, 27, ..., 13, 16,  7])]
```

We can see that the training and testing lists now each hold an array of integers representing the neighbourhoods.
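As an optional aside, the same kind of integer encoding can also be produced with scikit-learn. Here is a minimal sketch using `LabelEncoder`, assuming the `X_train` and `X_test` dataframes created above; it is an alternative for reference, not part of the pipeline below.

```python
# Optional alternative: scikit-learn's LabelEncoder produces the same style of
# integer codes as the manual cat_encoder dictionary (a sketch, not used below)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit on the training neighbourhoods only, then transform both datasets
train_codes = le.fit_transform(X_train['neighbourhood'])

# Note: transform() raises an error for neighbourhoods unseen in training,
# while the manual .map(cat_encoder) would silently produce NaN instead
test_codes = le.transform(X_test['neighbourhood'])

print(train_codes[:5])
print(dict(zip(le.classes_[:3], range(3))))  # first few category-to-integer mappings
```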

### Step 6: Categorical Entity Embedding Model

In step 6, we will build a model with categorical entity embedding.

Firstly, let’s create the embedding layer using the `Embedding` function.

- `input_dim` is the number of unique values of the categorical column. In this example, it is the number of unique neighbourhoods.
- `output_dim` is the dimension of the embedding output. How do we decide this number? The authors of the entity embedding paper mentioned that it is a hyperparameter to tune, with a range from 1 to the number of categories minus 1. The authors proposed two general guidelines:
  - If the number of aspects needed to describe the entities can be estimated, we can use that as the `output_dim`. More complex entities usually need more output dimensions. For example, `neighbourhood` can be described by population density, distance to major tourist locations, convenience level, number of Airbnb listings, and safety index, so we set 5 as the number of output dimensions.
  - If the number of aspects cannot be estimated, then start the hyperparameter tuning with the highest possible number of dimensions, which is the number of categories minus 1.
- `name` gives a name to the layer.
- The input dimension of the categorical variable is defined by the `Input` function. `Input()` is used to instantiate a Keras tensor, and `shape=(1,)` indicates that the expected input is a one-dimensional vector.
- `Reshape` changes the output from 3-dimensional to 2-dimensional.

```python
# Number of unique values in the categorical column
n_unique_cat = len(unique_cat)

# Input dimension of the categorical variable
input_cat = Input(shape=(1,))

# Output dimension of the categorical entity embedding
cat_emb_dim = 5

# Embedding layer
emb_cat = Embedding(input_dim=n_unique_cat, output_dim=cat_emb_dim, name="embedding_cat")(input_cat)

# Check the output shape
print(emb_cat)

# Reshape
emb_cat = Reshape(target_shape=(cat_emb_dim, ))(emb_cat)

# Check the output shape
print(emb_cat)
```

Output:

```
KerasTensor(type_spec=TensorSpec(shape=(None, 1, 5), dtype=tf.float32, name=None), name='embedding_cat/embedding_lookup/Identity_1:0', description="created by layer 'embedding_cat'")
KerasTensor(type_spec=TensorSpec(shape=(None, 5), dtype=tf.float32, name=None), name='reshape/Reshape:0', description="created by layer 'reshape'")
```

Recall that `X_train` has 11 columns: 1 categorical column, `neighbourhood`, and 10 numeric columns.

```python
# Feature columns
X_train.columns
```

Output:

```
Index(['neighbourhood', 'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm', 'Entire home/apt',
       'Hotel room', 'Private room', 'Shared room'],
      dtype='object')
```

Next, let’s append the numeric values to the training and testing lists.

```python
# List of numerical columns
numeric_cols = ['minimum_nights', 'number_of_reviews',
                'reviews_per_month', 'calculated_host_listings_count',
                'availability_365', 'number_of_reviews_ltm', 'Entire home/apt',
                'Hotel room', 'Private room', 'Shared room']

# Append numerical values to the training and testing list
input_list_train.append(X_train[numeric_cols].values)
input_list_test.append(X_test[numeric_cols].values)

# Take a look at the data
print('input_list_train:', input_list_train)
print('input_list_test:', input_list_test)
```

We can see that the training and testing input lists now have two arrays each. The first array contains the encoded categorical values, and the second array contains the numeric variable values.

```
input_list_train: [array([17, 12,  7, ..., 16, 14, 27]), array([[31.  , 34.  ,  0.52, ...,  0.  ,  0.  ,  0.  ],
       [31.  ,  2.  ,  0.15, ...,  0.  ,  0.  ,  0.  ],
       [31.  , 28.  ,  0.65, ...,  0.  ,  0.  ,  0.  ],
       ...,
       [ 4.  ,  8.  ,  3.12, ...,  0.  ,  0.  ,  0.  ],
       [ 2.  , 14.  ,  4.83, ...,  0.  ,  0.  ,  0.  ],
       [ 4.  , 45.  ,  0.62, ...,  0.  ,  0.  ,  0.  ]])]
input_list_test: [array([ 2, 17, 27, ..., 13, 16,  7]), array([[ 2.  , 57.  ,  9.34, ...,  0.  ,  0.  ,  0.  ],
       [ 1.  , 10.  ,  0.15, ...,  0.  ,  1.  ,  0.  ],
       [31.  ,  0.  ,  0.  , ...,  0.  ,  1.  ,  0.  ],
       ...,
       [31.  ,  2.  ,  1.3 , ...,  0.  ,  1.  ,  0.  ],
       [ 3.  , 68.  ,  1.43, ...,  0.  ,  0.  ,  0.  ],
       [ 1.  ,  7.  ,  1.52, ...,  0.  ,  0.  ,  0.  ]])]
```

The input dimension of the numeric variables is defined by instantiating a Keras tensor using the `Input()` function. `shape=(len(numeric_cols),)` indicates that the expected input dimension is the same as the number of numeric columns. There is no embedding layer for the numeric values, so `emb_numeric` is the same as `input_numeric`.

```python
# Input dimension of the numeric variables
input_numeric = Input(shape=(len(numeric_cols),))

# Output dimension of the numeric variables
emb_numeric = input_numeric

# Take a look at the output dimension
emb_numeric
```

We can see that the output dimension for the numeric variables is 10.

```
<KerasTensor: shape=(None, 10) dtype=float32 (created by layer 'input_2')>
```

The input Keras tensors for both the categorical and the numeric variables are put in a list called `input_data`. `input_data` shows that the first element has an input dimension of 1 and the second element has an input dimension of 10.

```python
# Input data dimensions
input_data = [input_cat, input_numeric]

# Take a look at the data
input_data
```

Output:

```
[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'input_1')>,
 <KerasTensor: shape=(None, 10) dtype=float32 (created by layer 'input_2')>]
```

Similarly, the output Keras tensors for both the categorical and the numeric variables are put in a list called `emb_data`. `emb_data` shows that the first element has an output dimension of 5 and the second element has an output dimension of 10.

- The categorical variable `neighbourhood` has a dimension of 5 because we specified the embedding dimension to be 5 in the embedding layer.
- The numeric variables have a dimension of 10 because there are 10 numeric columns.

```python
# Embedding dimensions
emb_data = [emb_cat, emb_numeric]

# Take a look at the data
emb_data
```

Output:

```
[<KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'reshape')>,
 <KerasTensor: shape=(None, 10) dtype=float32 (created by layer 'input_2')>]
```

Using the `Concatenate()` function, the Keras tensors in the list `emb_data` are concatenated together. The output Keras tensor has a dimension of 15, which is the sum of the dimensions of the two tensors in the list.

```python
# Concatenate layer concatenates a list of inputs
model_data = Concatenate()(emb_data)
model_data
```

Output:

```
<KerasTensor: shape=(None, 15) dtype=float32 (created by layer 'concatenate')>
```

The concatenated data is passed into the `Dense` layers of the model.

- The first three `Dense` layers have 10 neurons, 5 neurons, and 2 neurons respectively. They all use `relu` as the activation function.
- The output layer has one neuron, and its activation function is `linear` because the target variable `price` is continuous.

Then the `input_data` and `outputs` are grouped into an object using the `Model` function.

- `inputs` takes the inputs of the model. It can be a Keras `Input` object or a list of Keras `Input` objects. As the Keras documentation points out, only dictionaries, lists, and tuples of input tensors are supported; nested inputs such as a list of lists or a dictionary of dictionaries are not supported.
- `outputs` takes the outputs of the model.
- `name` is the name of the model. We gave our model the name `Entity_embedding_model_keras`.

We can print out the model details using the `summary` function.

```python
# Dense layer with 10 neurons and relu activation function
model = Dense(10, activation='relu')(model_data)

# Dense layer with 5 neurons and relu activation function
model = Dense(5, activation='relu')(model)

# Dense layer with 2 neurons and relu activation function
model = Dense(2, activation='relu')(model)

# Output is linear
outputs = Dense(1, activation='linear')(model)

# Use Model to group layers into an object with training and inference features
nn = Model(inputs=input_data, outputs=outputs, name='Entity_embedding_model_keras')

# Print out the model summary
nn.summary()
```

The summary of the model `Entity_embedding_model_keras` has four columns.

- The first column is `Layer (type)`. It shows the layer names with the layer type in parentheses.
  - The system automatically gives each layer a name if the user does not specify one. Out of the 9 layer names, only `embedding_cat` is user-defined; all the other names are auto-assigned.
  - The type in parentheses describes the type of the corresponding layer. We can see that our model has the `InputLayer`, the `Embedding` layer, the `Reshape` layer, the `Concatenate` layer, and the `Dense` layers.

- The second column is `Output Shape`.
  - `None` means the batch size has not been fixed; it can be any number of samples.
  - The numbers are the number of neurons for the `Dense` layers and the number of columns for the other layers.

- The third column is `Param #`. It is the number of parameters that need to be estimated.
  - The `embedding_cat` layer has 195 parameters. It is calculated as the 39 unique neighbourhoods multiplied by the embedding dimension of 5.
  - The layer `dense` has 160 parameters. It is calculated as the 15 input values from the `Concatenate` layer multiplied by the 10 neurons in `dense`, plus 10 bias values from `dense`.
  - The layer `dense_1` has 55 parameters. It is calculated as the 10 input values from the `dense` layer multiplied by the 5 neurons in `dense_1`, plus 5 bias values from `dense_1`.
  - The layer `dense_2` has 12 parameters. It is calculated as the 5 input values from the `dense_1` layer multiplied by the 2 neurons in `dense_2`, plus 2 bias values from `dense_2`.
  - The last layer `dense_3` has 3 parameters. It is calculated as the 2 input values from the `dense_2` layer multiplied by the 1 neuron in `dense_3`, plus 1 bias value from `dense_3`.
  - The `InputLayer`, `Reshape` layer, and `Concatenate` layer do not have parameters to estimate.

- The fourth column is `Connected to`, indicating the input layer(s) for the current layer.
  - The two input layers, `input_1` and `input_2`, do not have any value in this column.
  - The `embedding_cat` layer is connected to `input_1`.
  - The `Reshape` layer is connected to the `embedding_cat` layer.
  - The `Concatenate` layer is connected to both the `Reshape` layer and the `input_2` layer. The `Reshape` layer carries the categorical entity embeddings, and the `input_2` layer has the numeric inputs.
  - The layer `dense` is connected to the `Concatenate` layer, and each of the following `Dense` layers is connected to the previous `Dense` layer.

At the bottom of the summary table, the numbers of `Total params`, `Trainable params`, and `Non-trainable params` are listed.

- `Total params` is the total number of parameters for the model, and it is the sum of the parameter counts from all the layers.
- `Trainable params` is the number of trainable parameters that are trained using backpropagation. All the parameters in this model are trainable.
- `Non-trainable params` is the number of parameters that are not trained using backpropagation. Our model does not have any non-trainable params. We can add `trainable=False` to a dense layer to change it from trainable to non-trainable if needed.

Model: "Entity_embedding_model_keras"

__________________________________________________________________________________________________

Layer (type) Output Shape Param # Connected to

==================================================================================================

input_1 (InputLayer) [(None, 1)] 0 []

embedding_cat (Embedding) (None, 1, 5) 195 ['input_1[0][0]']

reshape (Reshape) (None, 5) 0 ['embedding_cat[0][0]']

input_2 (InputLayer) [(None, 10)] 0 []

concatenate (Concatenate) (None, 15) 0 ['reshape[0][0]',

'input_2[0][0]']

dense (Dense) (None, 10) 160 ['concatenate[0][0]']

dense_1 (Dense) (None, 5) 55 ['dense[0][0]']

dense_2 (Dense) (None, 2) 12 ['dense_1[0][0]']

dense_3 (Dense) (None, 1) 3 ['dense_2[0][0]']

==================================================================================================

Total params: 425

Trainable params: 425

Non-trainable params: 0
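As a quick sanity check, the parameter counts in the summary can be reproduced with a few lines of arithmetic. This is a minimal sketch using the numbers from this tutorial, not part of the original workflow.

```python
# Reproduce the parameter counts reported by nn.summary()
n_neighbourhoods = 39   # unique neighbourhoods in the training data
emb_dim = 5             # embedding output dimension

embedding_params = n_neighbourhoods * emb_dim   # 39 * 5 = 195
dense_params = 15 * 10 + 10                     # concatenate (15) -> dense (10 neurons) = 160
dense_1_params = 10 * 5 + 5                     # dense (10) -> dense_1 (5 neurons) = 55
dense_2_params = 5 * 2 + 2                      # dense_1 (5) -> dense_2 (2 neurons) = 12
dense_3_params = 2 * 1 + 1                      # dense_2 (2) -> dense_3 (1 neuron) = 3

total_params = (embedding_params + dense_params + dense_1_params
                + dense_2_params + dense_3_params)
print(total_params)  # 425, matching Total params in the summary
```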

The neural network model structure can be represented by a graph and saved as a picture.

- From the graph, we can see that the one categorical variable is the input to the embedding layer, and the embedding outputs 5 variables.
- The five embedding variables are concatenated with the 10 numeric variables, and the resulting 15 variables are the inputs to the dense layers.

```python
# Print model structure
plot_model(nn, show_shapes=True, show_layer_names=True, to_file='Entity_embedding_model_keras.png')
Image(retina=True, filename='Entity_embedding_model_keras.png')
```

When compiling the model, we set `mean_squared_error` as the loss, `adam` as the optimizer, and `mae` as the metric.

`EarlyStopping` is used to prevent overfitting and to save computing resources.

- `monitor='val_loss'` means that the validation loss will be monitored.
- `mode` can take the value `'min'`, `'max'`, or `'auto'`.
  - `'min'` means the training stops when the monitored metric stops decreasing.
  - `'max'` means the training stops when the monitored metric stops increasing.
  - `'auto'` means the stopping criterion is inferred from the name of the monitored metric.
- `verbose` controls whether to print out messages. It takes two values, 0 and 1. 0 means do not print any message, and 1 means a message will be printed when early stopping happens.
- `patience` is the threshold number of epochs without improvement.
- `restore_best_weights=True` means that the model weights from the epoch with the best monitored value will be restored.

When fitting the model, we pass in the training dataset and use the testing dataset as the validation data.

- One `epoch` is one pass through the entire training dataset. `epochs=1000` means the model goes through the training dataset a maximum of 1000 times.
- `batch_size=64` means that 64 samples are used to update the weights and biases each time. Because we have 5128 records in the training dataset, it is divided into 80 batches of 64 samples and 1 batch of 8 samples.
- `verbose=1` means that messages are printed during the model training process.
- We pass the `EarlyStopping` variable to `callbacks`.

```python
# Compile model
nn.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])

# Set up early stopping
es = EarlyStopping(monitor='val_loss',
                   mode='min',
                   verbose=1,
                   patience=50,
                   restore_best_weights=True)

# Fit the model
history = nn.fit(input_list_train,
                 y_train,
                 validation_data=(input_list_test, y_test),
                 epochs=1000,
                 batch_size=64,
                 verbose=1,
                 callbacks=[es])
```

Early stopping happened at epoch 883, with a training mean absolute error (MAE) of about 68 and a validation MAE of about 69. The model weights were restored from the best epoch, epoch 833.

```
Epoch 882/1000
81/81 [==============================] - 0s 4ms/step - loss: 11156.1582 - mae: 67.8472 - val_loss: 13683.8223 - val_mae: 73.7888
Epoch 883/1000
64/81 [======================>.......] - ETA: 0s - loss: 11323.4639 - mae: 68.3551Restoring model weights from the end of the best epoch: 833.
81/81 [==============================] - 0s 4ms/step - loss: 11120.8857 - mae: 67.8158 - val_loss: 13986.5391 - val_mae: 68.9497
Epoch 883: early stopping
```

### Step 7: Model Performance

In step 7, we will check the model performance.

The first visualization plots the loss over epochs. Because we set the loss to the mean squared error when compiling the model, the loss in the chart represents the mean squared error. We can see that both the training and the validation loss decrease with the epochs, and that the validation loss is higher than the training loss. This may suggest model overfitting. We will leave the topic of correcting overfitting for a neural network model to a separate tutorial.

```python
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper right')
```

The second visualization plots the mean absolute error (MAE) over epochs. We can see that both the training and the validation MAE decrease with the epochs, and that the validation MAE is slightly higher than the training MAE.

```python
# summarize history for mae
plt.plot(history.history['mae'])
plt.plot(history.history['val_mae'])
plt.title('Mean Absolute Error (MAE)')
plt.ylabel('Mean Absolute Error (MAE)')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='lower right')
plt.show()
```

Using `.predict`, we make predictions on the testing dataset. The output is two-dimensional, so `flatten` is used to change the predictions to one-dimensional.

The scatter plot of the predicted price vs. the actual price shows that a higher predicted price generally corresponds to a higher actual price, but some listings have much higher actual prices than predicted prices.

```python
# Make prediction
y_test_predict = nn.predict(input_list_test)

# Change the predictions from 2-d to 1-d
y_test_predict = y_test_predict.flatten()

# Visualization
ax = sns.scatterplot(y_test, y_test_predict)
```

To quantify the difference between the actual price and the predicted price, we calculate a variable called `model_error` and use it to compute the mean squared error (MSE), the root mean squared error (RMSE), the mean absolute error (MAE), the R-squared, and the mean absolute percentage error (MAPE).
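For reference, these are the standard definitions that the code below implements, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean actual price, and $n$ the number of testing records:

$$
\begin{aligned}
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\text{RMSE} = \sqrt{\text{MSE}}, \qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad
\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert
\end{aligned}
$$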

```python
# Calculate model error
model_error = y_test - y_test_predict

# Mean squared error
MSE = np.mean(model_error**2)

# Root mean squared error
RMSE = np.sqrt(MSE)

# Mean absolute error
MAE = np.mean(abs(model_error))

# R squared
R2 = 1 - sum(model_error**2)/sum((y_test-np.mean(y_test))**2)

# Mean absolute percentage error
MAPE = np.mean(abs(model_error/y_test))

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')
```

Output:

```
The MSE for the model is 13663.52
The RMSE for the model is 116.89.
The MAE for the model is 72.51.
The R-squared for the model is 0.24.
The MAPE for the model is 0.49.
```

The same metrics can be calculated using the functions from the `sklearn.metrics` library. We can see that the `sklearn` library and the manual calculation produce the same results.

```python
# Import library
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

# Mean squared error
MSE = mean_squared_error(y_test, y_test_predict)

# Root mean squared error
RMSE = np.sqrt(MSE)

# Mean absolute error
MAE = mean_absolute_error(y_test, y_test_predict)

# R squared
R2 = r2_score(y_test, y_test_predict)

# Mean absolute percentage error
MAPE = mean_absolute_percentage_error(y_test, y_test_predict)

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')
```

Output:

```
The MSE for the model is 13663.52
The RMSE for the model is 116.89.
The MAE for the model is 72.51.
The R-squared for the model is 0.24.
The MAPE for the model is 0.49.
```

### Step 8: Extract Categorical Embeddings

In step 8, we will extract the categorical embeddings.

The weights are extracted from the embedding layer and saved in a dataframe. We can see that the dataframe has 39 rows, each row representing one unique neighbourhood. There are 6 columns in the dataframe: the first column is the encoded categorical index, and the other five columns are the embeddings.

```python
# Get weights from the embedding layer
cat_emb_df = pd.DataFrame(nn.get_layer('embedding_cat').get_weights()[0]).reset_index()

# Add prefix to the embedding names
cat_emb_df = cat_emb_df.add_prefix('cat_')

# Take a look at the data
cat_emb_df
```

To append the neighbourhood name to the embedding dataframe, we put the categorical encoder dictionary into a dataframe, and merge the dataframe with the embedding dataframe on the categorical index.

```python
# Put the categorical encoder dictionary into a dataframe
cat_encoder_df = pd.DataFrame(cat_encoder.items(), columns=['cat', 'cat_index'])

# Merge data to append the category name
cat_emb_df = pd.merge(cat_encoder_df, cat_emb_df, how='inner', on='cat_index')

# Take a look at the data
cat_emb_df.head()
```
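As mentioned in the introduction, the embedding vectors can also be used to understand similarities between categories. Here is a minimal optional sketch, assuming the `cat_emb_df` created above, that computes cosine similarities between neighbourhood vectors and looks up the neighbourhoods most similar to a given one.

```python
# Optional: use the learned embeddings to find similar neighbourhoods (a sketch)
from sklearn.metrics.pairwise import cosine_similarity

# Embedding columns only (cat_0 ... cat_4)
emb_cols = ['cat_0', 'cat_1', 'cat_2', 'cat_3', 'cat_4']
emb_matrix = cat_emb_df[emb_cols].values

# Pairwise cosine similarity between all 39 neighbourhood vectors
sim = cosine_similarity(emb_matrix)

# Neighbourhoods most similar to the first neighbourhood in the dataframe
target = 0
most_similar = sim[target].argsort()[::-1][1:4]  # skip the neighbourhood itself
print('Target:', cat_emb_df.loc[target, 'cat'])
print('Most similar:', cat_emb_df.loc[most_similar, 'cat'].tolist())
```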

### Step 9: Save Entity Embedding Results

In step 9, we will save the entity embedding results. We can use `to_csv` to save the entity embedding results to a csv file. `index=False` means that the index of the dataframe will not be saved.

The embedding model can be saved in `hdf5` format using `.save`.

```python
# Save embedding results
cat_emb_df.to_csv('cat_embedding_keras.csv', index=False)

# Save model
nn.save("cat_embedding_keras.hdf5")
```

In the future, the model can be loaded using the `load_model` function.

```python
# Load model
loaded_nn = load_model("cat_embedding_keras.hdf5")
```

### Step 10: Baseline Random Forest Model

In step 10, we will build a random forest model as a baseline model to compare with the neural network model.

The model uses all 10 numeric features, and the scatter plot of actual vs. predicted values generally aligns. Similar to the neural network model, some listings have much higher actual prices than the predicted prices.

```python
# Feature list
base_cols = ['minimum_nights',
             'number_of_reviews',
             'reviews_per_month',
             'calculated_host_listings_count',
             'availability_365',
             'number_of_reviews_ltm',
             'Entire home/apt',
             'Hotel room',
             'Private room',
             'Shared room']

# Initiate the model
base_rf = RandomForestRegressor()

# Fit the model
base_rf.fit(X_train[base_cols], y_train)

# Make predictions
base_y_test_prediction = base_rf.predict(X_test[base_cols])

# Visualization
ax = sns.scatterplot(y_test, base_y_test_prediction)
```

The random forest model performance metrics are similar to the neural network model metrics.

```python
# Prediction error
base_model_error = y_test - base_y_test_prediction

# Mean squared error
MSE = np.mean(base_model_error**2)

# Root mean squared error
RMSE = np.sqrt(MSE)

# Mean absolute error
MAE = np.mean(abs(base_model_error))

# R squared
R2 = 1 - sum(base_model_error**2)/sum((y_test-np.mean(y_test))**2)

# Mean absolute percentage error
MAPE = np.mean(abs(base_model_error/y_test))

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')
```

Output:

```
The MSE for the model is 13551.77
The RMSE for the model is 116.41.
The MAE for the model is 71.29.
The R-squared for the model is 0.25.
The MAPE for the model is 0.51.
```

### Step 11: Use Categorical Entity Embedding in Other Machine Learning Models

In step 11, we will talk about how to use the categorical entity embeddings in other machine learning models. The random forest model will be used as an example, and other models can follow the same process.

To use the entity embeddings in another model, we treat the embeddings as five additional features for each neighbourhood and append them to the training and testing datasets separately.

```python
# Append categorical embeddings to the training dataset
X_train_emb = pd.merge(X_train, cat_emb_df, left_on='neighbourhood', right_on='cat', how='inner').drop(['neighbourhood', 'cat', 'cat_index'], axis=1)

# Append categorical embeddings to the testing dataset
X_test_emb = pd.merge(X_test, cat_emb_df, left_on='neighbourhood', right_on='cat', how='inner').drop(['neighbourhood', 'cat', 'cat_index'], axis=1)

# Check info for the training dataset
X_train_emb.info()

# Check info for the testing dataset
X_test_emb.info()
```

We can see that after adding the embedding features, both the training dataset and the testing dataset have 15 columns.

<class 'pandas.core.frame.DataFrame'>

Int64Index: 5128 entries, 0 to 5127

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 minimum_nights 5128 non-null int64

1 number_of_reviews 5128 non-null int64

2 reviews_per_month 5128 non-null float64

3 calculated_host_listings_count 5128 non-null int64

4 availability_365 5128 non-null int64

5 number_of_reviews_ltm 5128 non-null int64

6 Entire home/apt 5128 non-null uint8

7 Hotel room 5128 non-null uint8

8 Private room 5128 non-null uint8

9 Shared room 5128 non-null uint8

10 cat_0 5128 non-null float32

11 cat_1 5128 non-null float32

12 cat_2 5128 non-null float32

13 cat_3 5128 non-null float32

14 cat_4 5128 non-null float32

dtypes: float32(5), float64(1), int64(5), uint8(4)

memory usage: 400.6 KB

<class 'pandas.core.frame.DataFrame'>

Int64Index: 1283 entries, 0 to 1282

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 minimum_nights 1283 non-null int64

1 number_of_reviews 1283 non-null int64

2 reviews_per_month 1283 non-null float64

3 calculated_host_listings_count 1283 non-null int64

4 availability_365 1283 non-null int64

5 number_of_reviews_ltm 1283 non-null int64

6 Entire home/apt 1283 non-null uint8

7 Hotel room 1283 non-null uint8

8 Private room 1283 non-null uint8

9 Shared room 1283 non-null uint8

10 cat_0 1283 non-null float32

11 cat_1 1283 non-null float32

12 cat_2 1283 non-null float32

13 cat_3 1283 non-null float32

14 cat_4 1283 non-null float32

dtypes: float32(5), float64(1), int64(5), uint8(4)

memory usage: 100.2 KB
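As an aside, an equivalent way to attach the embedding features without a merge, which keeps the original dataframe index intact, is to map each embedding column onto the `neighbourhood` column. This is a minimal sketch, assuming the `cat_emb_df` from Step 8; it produces the same feature values as the merge above.

```python
# Alternative (a sketch): add the embedding columns with .map(), preserving the
# original index of X_train / X_test
emb_lookup = cat_emb_df.set_index('cat')  # index by neighbourhood name

X_train_emb_alt = X_train.copy()
X_test_emb_alt = X_test.copy()

for col in ['cat_0', 'cat_1', 'cat_2', 'cat_3', 'cat_4']:
    X_train_emb_alt[col] = X_train['neighbourhood'].map(emb_lookup[col])
    X_test_emb_alt[col] = X_test['neighbourhood'].map(emb_lookup[col])

# Drop the original categorical column, as in the merge-based approach
X_train_emb_alt = X_train_emb_alt.drop('neighbourhood', axis=1)
X_test_emb_alt = X_test_emb_alt.drop('neighbourhood', axis=1)
```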

Using the new feature list with embeddings to train the same random forest model, we can get the new predicted results for the testing dataset.

The dots in the scatter plot are more scattered than the base model, indicating that this model may have a worse performance.

```python
# Initiate the model
emb_rf = RandomForestRegressor()

# Fit the model
emb_rf.fit(X_train_emb, y_train)

# Make predictions
emb_y_test_prediction = emb_rf.predict(X_test_emb)

# Visualization
ax = sns.scatterplot(y_test, emb_y_test_prediction)
```

The model performance metrics confirm that the random forest model with embeddings performs worse than the baseline model. The values are worse across all the metrics.

This tells us that adding categorical entity embeddings does not always improve the model’s performance. Sometimes it just adds noise to the model and makes the model performance worse. So it is always a good idea to have a baseline model before adding entity embeddings.

```python
# Model error
emb_model_error = y_test - emb_y_test_prediction

# Mean squared error
MSE = np.mean(emb_model_error**2)

# Root mean squared error
RMSE = np.sqrt(MSE)

# Mean absolute error
MAE = np.mean(abs(emb_model_error))

# R squared
R2 = 1 - sum(emb_model_error**2)/sum((y_test-np.mean(y_test))**2)

# Mean absolute percentage error
MAPE = np.mean(abs(emb_model_error/y_test))

print(f'The MSE for the model is {MSE:.2f}')
print(f'The RMSE for the model is {RMSE:.2f}.')
print(f'The MAE for the model is {MAE:.2f}.')
print(f'The R-squared for the model is {R2:.2f}.')
print(f'The MAPE for the model is {MAPE:.2f}.')
```

Output:

```
The MSE for the model is 19586.73
The RMSE for the model is 139.95.
The MAE for the model is 95.32.
The R-squared for the model is -0.09.
The MAPE for the model is 0.84.
```

One possible reason for the worse performance of the random forest model with embeddings is the small dataset used to train the embeddings. There is not enough information in the dataset for the model to learn meaningful embeddings.

In my work, when millions of records were used to train the embeddings, the baseline XGBoost model performance improved slightly.

The entity embedding paper has two tables showing the model performance improvement across four different machine learning algorithms.

What’s your experience with using entity embedding? Did it improve the model’s performance? Please let me know in the comments.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- Hierarchical Topic Model for Airbnb Reviews
- 3 Ways for Multiple Time Series Forecasting Using Prophet in Python
- Time Series Anomaly Detection Using Prophet in Python
- Time Series Causal Impact Analysis in Python
- Hyperparameter Tuning For XGBoost
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- Five Ways To Create Tables In Databricks
- Explainable S-Learner Uplift Model Using Python Package CausalML
- One-Class SVM For Anomaly Detection
- Recommendation System: Item-Based Collaborative Filtering