Item-based collaborative filtering, also called item-item collaborative filtering, is a type of recommendation system algorithm that uses similarities between items to make product recommendations.

In this tutorial, we will cover:

- What is item-based (item-item) collaborative filtering?
- How to create a user-product matrix?
- How to identify similar items?
- How to rank items for the recommendation?

**Resources for this post:**

- Python code is at the end of the post. Click here for the notebook.
- More video tutorials on recommendation system
- More blog posts on recommendation system

If you prefer the video version of the tutorial, please check out this video on YouTube.

### Step 0: Item-Based Collaborative Filtering Recommendation Algorithm

Firstly, let’s understand how item-based collaborative filtering works.

Item-based collaborative filtering makes recommendations based on user-product interactions in the past. The assumption behind the algorithm is that users like similar products and dislike similar products, so they give similar ratings to similar products.

The item-based collaborative filtering algorithm usually has the following steps:

- Calculate item similarity scores based on all the user ratings.
- Identify the top n items that are most similar to the item of interest.
- Calculate the weighted average score for the most similar items using the user's ratings.
- Rank items by score and pick the top n items to recommend.

This graph illustrates how item-based collaborative filtering works using a simplified example.

- Ms. Blond likes apples, watermelons, and pineapples. Ms. Black likes watermelons and pineapples. Ms. Purple likes watermelons and grapes.
- Because watermelons and pineapples are liked by the same people, they are considered similar items.
- Since Ms. Purple likes watermelons but has not been exposed to pineapples yet, the recommendation system recommends pineapples to Ms. Purple.
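To make the toy example concrete, here is a hedged sketch in code. The 1/0 "likes" values are made up to match the story above, and cosine similarity stands in for whatever similarity measure one might use:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 1/0 "likes" matrix from the example above (rows = users, columns = fruits)
likes = pd.DataFrame(
    {'apple': [1, 0, 0], 'watermelon': [1, 1, 1], 'pineapple': [1, 1, 0], 'grape': [0, 0, 1]},
    index=['Ms. Blond', 'Ms. Black', 'Ms. Purple'])

# Item-item similarity: compare fruits by who liked them (transpose so rows = items)
item_sim = pd.DataFrame(cosine_similarity(likes.T), index=likes.columns, columns=likes.columns)

# Watermelon and pineapple share two fans, so they score as more similar
# than, say, watermelon and grape, which share only Ms. Purple
print(item_sim.loc['watermelon', ['pineapple', 'grape']])
```

Because pineapple is the item most similar to watermelon, and Ms. Purple has rated watermelon but not pineapple, pineapple is the natural recommendation for her.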

### Step 1: Import Python Libraries

In the first step, we will import the Python libraries `pandas`, `numpy`, and `scipy.stats`. These three libraries are for data processing and calculations.

We also import `seaborn` for visualization and `cosine_similarity` for calculating similarity scores.

```python
# Data processing
import pandas as pd
import numpy as np
import scipy.stats

# Visualization
import seaborn as sns

# Similarity
from sklearn.metrics.pairwise import cosine_similarity
```

### Step 2: Download And Read In Data

This tutorial uses the MovieLens dataset, which contains actual user ratings of movies.

In step 2, we will follow the steps below to get the datasets:

- Go to https://grouplens.org/datasets/movielens/
- Download the 100k dataset with the file name “ml-latest-small.zip”
- Unzip “ml-latest-small.zip”
- Copy the “ml-latest-small” folder to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

```python
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/recommendation_system")

# Print out the current directory
!pwd
```

There are multiple datasets in the 100k MovieLens folder. For this tutorial, we will use two of them: ratings and movies.

Now let’s read in the rating data.

```python
# Read in data
ratings = pd.read_csv('ml-latest-small/ratings.csv')

# Take a look at the data
ratings.head()
```

There are four columns in the ratings dataset: `userId`, `movieId`, `rating`, and `timestamp`.

```python
# Get the dataset information
ratings.info()
```

The dataset has over 100k records, and there is no missing data.

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userId     100836 non-null  int64
 1   movieId    100836 non-null  int64
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
```

The 100k ratings are from 610 users on 9724 movies. The ratings take ten unique values, from 0.5 to 5.

```python
# Number of users
print('The ratings dataset has', ratings['userId'].nunique(), 'unique users')

# Number of movies
print('The ratings dataset has', ratings['movieId'].nunique(), 'unique movies')

# Number of ratings
print('The ratings dataset has', ratings['rating'].nunique(), 'unique ratings')

# List of unique ratings
print('The unique ratings are', sorted(ratings['rating'].unique()))
```

```
The ratings dataset has 610 unique users
The ratings dataset has 9724 unique movies
The ratings dataset has 10 unique ratings
The unique ratings are [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
```

Next, let’s read in the movies data to get the movie names.

The movies dataset has `movieId`, `title`, and `genres`.

```python
# Read in data
movies = pd.read_csv('ml-latest-small/movies.csv')

# Take a look at the data
movies.head()
```

Using `movieId` as the matching key, we append the movie information to the ratings dataset and name the merged dataset `df`. Now we have the movie title and movie rating in the same dataset!

```python
# Merge ratings and movies datasets
df = pd.merge(ratings, movies, on='movieId', how='inner')

# Take a look at the data
df.head()
```

### Step 3: Exploratory Data Analysis (EDA)

In step 3, we filter the movies and keep only those with over 100 ratings for the analysis. This keeps the calculation manageable within Google Colab's memory limits.

To do that, we first group the movies by title, count the number of ratings, and keep only the movies with more than 100 ratings.

The average ratings for the movies are calculated as well.

From the `.info()` output, we can see that there are 134 movies left.

```python
# Aggregate by movie
agg_ratings = df.groupby('title').agg(mean_rating = ('rating', 'mean'),
                                      number_of_ratings = ('rating', 'count')).reset_index()

# Keep the movies with over 100 ratings
agg_ratings_GT100 = agg_ratings[agg_ratings['number_of_ratings']>100]

# Check the information of the dataframe
agg_ratings_GT100.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 74 to 9615
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   title              134 non-null    object
 1   mean_rating        134 non-null    float64
 2   number_of_ratings  134 non-null    int64
dtypes: float64(1), int64(1), object(1)
memory usage: 4.2+ KB
```

Let’s check what the most popular movies and their ratings are.

```python
# Check popular movies
agg_ratings_GT100.sort_values(by='number_of_ratings', ascending=False).head()
```

Next, let’s use a `jointplot` to check the correlation between the average rating and the number of ratings.

We can see an upward trend in the scatter plot, suggesting that popular movies tend to get higher ratings.

The average rating distribution shows that most movies in the dataset have an average rating of around 4.

The number-of-ratings distribution shows that most movies have fewer than 150 ratings.

```python
# Visualization
sns.jointplot(x='mean_rating', y='number_of_ratings', data=agg_ratings_GT100)
```

To keep only the 134 movies with more than 100 ratings, we join the aggregated movie list with the user-rating-level dataframe. `how='inner'` and `on='title'` ensure that only the movies with more than 100 ratings are included.

```python
# Merge data
df_GT100 = pd.merge(df, agg_ratings_GT100[['title']], on='title', how='inner')
df_GT100.info()
```

After filtering the movies with over 100 ratings, we have 597 users that rated 134 movies.

```python
# Number of users
print('The ratings dataset has', df_GT100['userId'].nunique(), 'unique users')

# Number of movies
print('The ratings dataset has', df_GT100['movieId'].nunique(), 'unique movies')

# Number of ratings
print('The ratings dataset has', df_GT100['rating'].nunique(), 'unique ratings')

# List of unique ratings
print('The unique ratings are', sorted(df_GT100['rating'].unique()))
```

### Step 4: Create User-Movie Matrix

In step 4, we will transform the dataset into a matrix format. The rows of the matrix are movies, and the columns are users. The value of each cell is the user's rating of the movie if there is one; otherwise, it is `NaN`.

```python
# Create user-item matrix
matrix = df_GT100.pivot_table(index='title', columns='userId', values='rating')
matrix.head()
```

### Step 5: Data Normalization

In Step 5, we will normalize the data by subtracting the average rating of each movie. The cosine similarity calculated based on the normalized data is called mean-centered cosine similarity.

After normalization, the ratings less than the movie’s average rating get a negative value, and the ratings more than the movie’s average rating get a positive value.

```python
# Normalize user-item matrix
matrix_norm = matrix.subtract(matrix.mean(axis=1), axis=0)
matrix_norm.head()
```
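As a quick sanity check on what mean-centering does, here is a toy row with made-up ratings (the movie name and values are purely illustrative):

```python
import pandas as pd

# One hypothetical movie rated by three users: 2, 4, and 3 stars (mean = 3)
row = pd.DataFrame({'user_a': [2.0], 'user_b': [4.0], 'user_c': [3.0]}, index=['Toy Movie'])

# Subtract each movie's own mean rating, exactly as matrix_norm is built above
row_norm = row.subtract(row.mean(axis=1), axis=0)
print(row_norm)  # -1.0, 1.0, 0.0
```

Ratings below the movie's average become negative and ratings above it become positive, which is the pattern described above.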

If you find the contents helpful so far, please sign up for the email list below.

### Step 6: Calculate Similarity Score

There are different ways to measure similarities. Pearson correlation and cosine similarity are two widely used methods.

In this tutorial, we will calculate the item similarity matrix using Pearson correlation.

```python
# Item similarity matrix using Pearson correlation
item_similarity = matrix_norm.T.corr()
item_similarity.head()
```

Those who are interested in using cosine similarity can refer to the code below. Since `cosine_similarity` does not accept missing values, we need to impute the missing values with 0s before the calculation.

```python
# Item similarity matrix using cosine similarity
item_similarity_cosine = cosine_similarity(matrix_norm.fillna(0))
item_similarity_cosine
```

In the movie similarity matrix, the values range from -1 to 1, where -1 means two movies are rated in opposite ways and 1 means they are rated very similarly.
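The two measures are closely related: on fully observed data, Pearson correlation equals cosine similarity computed on mean-centered vectors. A small check with made-up rating vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical rating vectors with no missing values
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.0, 5.0, 3.0])

# Pearson correlation between the raw vectors
pearson = np.corrcoef(a, b)[0, 1]

# Cosine similarity after subtracting each vector's own mean
cos_centered = cosine_similarity((a - a.mean()).reshape(1, -1),
                                 (b - b.mean()).reshape(1, -1))[0, 0]

print(pearson, cos_centered)  # the two values match
```

With missing ratings, the two approaches can differ slightly (`corr()` handles `NaN`s pairwise, while the cosine route imputes 0s), which is why the tutorial shows both.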

### Step 7: Predict User’s Rating For One Movie

In step 7, we will predict a user’s rating for one movie. Let’s use user 1 and the movie American Pie as an example.

The prediction follows the process below:

- Create a list of the movies that user 1 has watched and rated.
- Rank the similarities between the movies user 1 rated and American Pie.
- Select the top n movies with the highest similarity scores.
- Calculate the predicted rating using the weighted average of the similarity scores and user 1's ratings.
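Written as a formula, the last step is the similarity-weighted average of user 1's (mean-centered) ratings over the neighbor set $N$, the top n rated movies most similar to the target movie $i$:

```latex
\hat{r}_{u,i} = \frac{\sum_{j \in N} \mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in N} \mathrm{sim}(i,j)}
```

Because the ratings here are mean-centered, the predicted value can be negative; a negative prediction simply means a below-average rating is expected.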

Now let’s implement the process using Python.

First, we remove all the movies that have a missing rating from user 1 and sort the remaining movies by user 1's ratings.

```python
# Pick a user ID
picked_userid = 1

# Pick a movie
picked_movie = 'American Pie (1999)'

# Movies that the target user has watched
picked_userid_watched = pd.DataFrame(matrix_norm[picked_userid].dropna(axis=0, how='all')\
                                     .sort_values(ascending=False))\
                                     .reset_index()\
                                     .rename(columns={1: 'rating'})
picked_userid_watched.head()
```

We can see that user 1’s favorite movie is Dumb and Dumber, followed by Indiana Jones and the Temple of Doom.

Next, we get the similarity scores between American Pie and the movies user 1 has watched, and pick the top 5 movies with the highest similarity scores.

```python
# Similarity score of the movie American Pie with all the other movies
picked_movie_similarity_score = item_similarity[[picked_movie]].reset_index().rename(columns={'American Pie (1999)': 'similarity_score'})

# Rank the similarities between the movies user 1 rated and American Pie
n = 5
picked_userid_watched_similarity = pd.merge(left=picked_userid_watched,
                                            right=picked_movie_similarity_score,
                                            on='title',
                                            how='inner')\
                                     .sort_values('similarity_score', ascending=False)[:n]

# Take a look at the user 1 watched movies with the highest similarity
picked_userid_watched_similarity
```

After that, we calculate the weighted average of the ratings, using the similarity scores as weights, so the movies with higher similarity scores get more weight. This weighted average is the predicted rating for American Pie by user 1.

```python
# Calculate the predicted rating using weighted average of similarity scores and the ratings from user 1
predicted_rating = round(np.average(picked_userid_watched_similarity['rating'],
                                    weights=picked_userid_watched_similarity['similarity_score']), 6)
print(f'The predicted rating for {picked_movie} by user {picked_userid} is {predicted_rating}')
```

```
The predicted rating for American Pie (1999) by user 1 is 0.338739
```

### Step 8: Movie Recommendation

In step 8, we will create an item-item movie recommendation system following four steps:

- Create a list of the movies that the target user has not watched before.
- Loop through the unwatched movies and create a predicted score for each one.
- Rank the predicted scores of the unwatched movies from high to low.
- Select the top k movies as the recommendations for the target user.

The Python function below implements the four steps. With the inputs `picked_userid`, `number_of_similar_items`, and `number_of_recommendations`, we can get the top movies for the user and their corresponding predicted ratings. Note that the ratings are normalized by subtracting the average rating of each movie, so we need to add the average value back to the predicted ratings if we want them on the same scale as the original ratings.

```python
# Item-based recommendation function
def item_based_rec(picked_userid=1, number_of_similar_items=5, number_of_recommendations=3):
  import operator
  # Movies that the target user has not watched
  picked_userid_unwatched = pd.DataFrame(matrix_norm[picked_userid].isna()).reset_index()
  picked_userid_unwatched = picked_userid_unwatched[picked_userid_unwatched[1]==True]['title'].values.tolist()

  # Movies that the target user has watched
  picked_userid_watched = pd.DataFrame(matrix_norm[picked_userid].dropna(axis=0, how='all')\
                                       .sort_values(ascending=False))\
                                       .reset_index()\
                                       .rename(columns={1: 'rating'})

  # Dictionary to save the unwatched movie and predicted rating pair
  rating_prediction = {}

  # Loop through unwatched movies
  for picked_movie in picked_userid_unwatched:
    # Calculate the similarity score of the picked movie with other movies
    picked_movie_similarity_score = item_similarity[[picked_movie]].reset_index().rename(columns={picked_movie: 'similarity_score'})
    # Rank the similarities between the picked user's watched movies and the picked unwatched movie
    picked_userid_watched_similarity = pd.merge(left=picked_userid_watched,
                                                right=picked_movie_similarity_score,
                                                on='title',
                                                how='inner')\
                                         .sort_values('similarity_score', ascending=False)[:number_of_similar_items]
    # Calculate the predicted rating using weighted average of similarity scores and the user's ratings
    predicted_rating = round(np.average(picked_userid_watched_similarity['rating'],
                                        weights=picked_userid_watched_similarity['similarity_score']), 6)
    # Save the predicted rating in the dictionary
    rating_prediction[picked_movie] = predicted_rating

  # Return the top recommended movies
  return sorted(rating_prediction.items(), key=operator.itemgetter(1), reverse=True)[:number_of_recommendations]

# Get recommendations
recommended_movie = item_based_rec(picked_userid=1, number_of_similar_items=5, number_of_recommendations=3)
recommended_movie
```

```
[('Austin Powers: The Spy Who Shagged Me (1999)', 1.096288),
 ('Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)', 0.92924),
 ('Lord of the Rings: The Return of the King, The (2003)', 0.926824)]
```
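The scores above are still on the mean-centered scale. Here is a minimal sketch of putting one back on the original 0.5-5 rating scale; the 3.2 average below is a made-up value for illustration, and in the tutorial's variables you would use `matrix.mean(axis=1)[title]` instead:

```python
import pandas as pd

# Hypothetical average rating for the recommended movie (made-up value for illustration)
avg_rating = pd.Series({'Austin Powers: The Spy Who Shagged Me (1999)': 3.2})

# Normalized predicted score from the recommendation output above
normalized_score = 1.096288

# Add the movie's average rating back to recover the original rating scale
original_scale_score = normalized_score + avg_rating['Austin Powers: The Spy Who Shagged Me (1999)']
print(original_scale_score)  # 4.296288
```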

### Step 9: Put All Code Together

```python
###### Step 1: Import Python Libraries

# Data processing
import pandas as pd
import numpy as np
import scipy.stats

# Visualization
import seaborn as sns

# Similarity
from sklearn.metrics.pairwise import cosine_similarity

###### Step 2: Download And Read In Data

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/recommendation_system")

# Print out the current directory
!pwd

# Read in data
ratings = pd.read_csv('ml-latest-small/ratings.csv')

# Take a look at the data
ratings.head()

# Get the dataset information
ratings.info()

# Number of users
print('The ratings dataset has', ratings['userId'].nunique(), 'unique users')
# Number of movies
print('The ratings dataset has', ratings['movieId'].nunique(), 'unique movies')
# Number of ratings
print('The ratings dataset has', ratings['rating'].nunique(), 'unique ratings')
# List of unique ratings
print('The unique ratings are', sorted(ratings['rating'].unique()))

# Read in data
movies = pd.read_csv('ml-latest-small/movies.csv')

# Take a look at the data
movies.head()

# Merge ratings and movies datasets
df = pd.merge(ratings, movies, on='movieId', how='inner')

# Take a look at the data
df.head()

###### Step 3: Exploratory Data Analysis (EDA)

# Aggregate by movie
agg_ratings = df.groupby('title').agg(mean_rating = ('rating', 'mean'),
                                      number_of_ratings = ('rating', 'count')).reset_index()

# Keep the movies with over 100 ratings
agg_ratings_GT100 = agg_ratings[agg_ratings['number_of_ratings']>100]
agg_ratings_GT100.info()

# Check popular movies
agg_ratings_GT100.sort_values(by='number_of_ratings', ascending=False).head()

# Visualization
sns.jointplot(x='mean_rating', y='number_of_ratings', data=agg_ratings_GT100)

# Merge data
df_GT100 = pd.merge(df, agg_ratings_GT100[['title']], on='title', how='inner')
df_GT100.info()

# Number of users
print('The ratings dataset has', df_GT100['userId'].nunique(), 'unique users')
# Number of movies
print('The ratings dataset has', df_GT100['movieId'].nunique(), 'unique movies')
# Number of ratings
print('The ratings dataset has', df_GT100['rating'].nunique(), 'unique ratings')
# List of unique ratings
print('The unique ratings are', sorted(df_GT100['rating'].unique()))

###### Step 4: Create User-Movie Matrix

# Create user-item matrix
matrix = df_GT100.pivot_table(index='title', columns='userId', values='rating')
matrix.head()

###### Step 5: Data Normalization

# Normalize user-item matrix
matrix_norm = matrix.subtract(matrix.mean(axis=1), axis=0)
matrix_norm.head()

###### Step 6: Calculate Similarity Score

# Item similarity matrix using Pearson correlation
item_similarity = matrix_norm.T.corr()
item_similarity.head()

# Item similarity matrix using cosine similarity
item_similarity_cosine = cosine_similarity(matrix_norm.fillna(0))
item_similarity_cosine

###### Step 7: Predict User's Rating For One Movie

# Pick a user ID
picked_userid = 1

# Pick a movie
picked_movie = 'American Pie (1999)'

# Movies that the target user has watched
picked_userid_watched = pd.DataFrame(matrix_norm[picked_userid].dropna(axis=0, how='all')\
                                     .sort_values(ascending=False))\
                                     .reset_index()\
                                     .rename(columns={1: 'rating'})
picked_userid_watched.head()

# Similarity score of the movie American Pie with all the other movies
picked_movie_similarity_score = item_similarity[[picked_movie]].reset_index().rename(columns={'American Pie (1999)': 'similarity_score'})

# Rank the similarities between the movies user 1 rated and American Pie
n = 5
picked_userid_watched_similarity = pd.merge(left=picked_userid_watched,
                                            right=picked_movie_similarity_score,
                                            on='title',
                                            how='inner')\
                                     .sort_values('similarity_score', ascending=False)[:n]

# Take a look at the user 1 watched movies with the highest similarity
picked_userid_watched_similarity

# Calculate the predicted rating using weighted average of similarity scores and the ratings from user 1
predicted_rating = round(np.average(picked_userid_watched_similarity['rating'],
                                    weights=picked_userid_watched_similarity['similarity_score']), 6)
print(f'The predicted rating for {picked_movie} by user {picked_userid} is {predicted_rating}')

###### Step 8: Movie Recommendation

# Item-based recommendation function
def item_based_rec(picked_userid=1, number_of_similar_items=5, number_of_recommendations=3):
  import operator
  # Movies that the target user has not watched
  picked_userid_unwatched = pd.DataFrame(matrix_norm[picked_userid].isna()).reset_index()
  picked_userid_unwatched = picked_userid_unwatched[picked_userid_unwatched[1]==True]['title'].values.tolist()

  # Movies that the target user has watched
  picked_userid_watched = pd.DataFrame(matrix_norm[picked_userid].dropna(axis=0, how='all')\
                                       .sort_values(ascending=False))\
                                       .reset_index()\
                                       .rename(columns={1: 'rating'})

  # Dictionary to save the unwatched movie and predicted rating pair
  rating_prediction = {}

  # Loop through unwatched movies
  for picked_movie in picked_userid_unwatched:
    # Calculate the similarity score of the picked movie with other movies
    picked_movie_similarity_score = item_similarity[[picked_movie]].reset_index().rename(columns={picked_movie: 'similarity_score'})
    # Rank the similarities between the picked user's watched movies and the picked unwatched movie
    picked_userid_watched_similarity = pd.merge(left=picked_userid_watched,
                                                right=picked_movie_similarity_score,
                                                on='title',
                                                how='inner')\
                                         .sort_values('similarity_score', ascending=False)[:number_of_similar_items]
    # Calculate the predicted rating using weighted average of similarity scores and the user's ratings
    predicted_rating = round(np.average(picked_userid_watched_similarity['rating'],
                                        weights=picked_userid_watched_similarity['similarity_score']), 6)
    # Save the predicted rating in the dictionary
    rating_prediction[picked_movie] = predicted_rating

  # Return the top recommended movies
  return sorted(rating_prediction.items(), key=operator.itemgetter(1), reverse=True)[:number_of_recommendations]

# Get recommendations
recommended_movie = item_based_rec(picked_userid=1, number_of_similar_items=5, number_of_recommendations=3)
recommended_movie
```

### Summary

In this tutorial, we went over how to build an item-based collaborative filtering recommendation system. You learned:

- What is item-based (item-item) collaborative filtering?
- How to create a user-product matrix?
- How to identify similar items?
- How to rank items for the recommendation?

For more information about data science and machine learning, please check out my YouTube channel and Medium page, or follow me on LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- Recommendation System: User-Based Collaborative Filtering
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- Ensemble Oversampling And Under-Sampling For Imbalanced Classification Using Python
- Balanced Weights For Imbalanced Classification
- Neural Network Model Balanced Weight For Imbalanced Classification In Keras