The Ultimate Guide to Evaluating Your Recommendation System

Understand the key metrics to measure the performance of your recommender engine


Recommendation systems have become an integral part of our daily lives, shaping our experiences on e-commerce platforms, content streaming services, and social media networks. They help users navigate vast catalogs, find relevant items, and discover new products or content that they might enjoy. But how do we know if a recommendation system is doing a good job? This tutorial will guide you through the essential metrics for evaluating the performance of your recommendation system, ensuring it effectively meets user needs and enhances their experience.

Please check out my previous tutorials on user-based collaborative filtering and item-based collaborative filtering.

Resources for this post:

Recommendation System Evaluation – GrabNGoInfo.com

Let’s get started!


Group 1. Accuracy Metrics: Predicting User Preferences

Accuracy metrics measure how well a recommendation system can predict user preferences. They are generally applicable to collaborative filtering methods, which leverage similarities between users or items to make recommendations.

1.1 Mean Absolute Error (MAE): Calculates the average absolute difference between predicted and actual ratings.

Here’s a Python function to calculate the Mean Absolute Error (MAE) between predicted and actual ratings:

def mean_absolute_error(actual_ratings, predicted_ratings):
    # Check if the lengths of actual_ratings and predicted_ratings are equal
    if len(actual_ratings) != len(predicted_ratings):
        raise ValueError("The length of actual_ratings and predicted_ratings must be the same.")

    n = len(actual_ratings)
    total_error = 0

    # Iterate through the lists and calculate the absolute difference between actual and predicted ratings
    for i in range(n):
        total_error += abs(actual_ratings[i] - predicted_ratings[i])

    # Calculate the average of absolute differences
    mae = total_error / n
    return mae

# Example usage
actual_ratings = [3.5, 4.0, 2.0, 5.0, 3.0]
predicted_ratings = [3.0, 4.5, 1.5, 4.5, 2.5]

# Call the mean_absolute_error function with actual_ratings and predicted_ratings
mae = mean_absolute_error(actual_ratings, predicted_ratings)
print(f"Mean Absolute Error: {mae}")

Output:

Mean Absolute Error: 0.5

This function takes two lists as input, actual_ratings and predicted_ratings, and calculates the average absolute difference between them. If the lengths of the lists don’t match, a ValueError is raised. The function returns the calculated MAE as a floating-point number.

In the example usage, actual_ratings and predicted_ratings are two lists containing example ratings. The function is called with these lists, and the result is printed.
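
If scikit-learn is available in your environment, you can cross-check the result against its built-in implementation. Below is a minimal sketch, assuming scikit-learn is installed; the library function is imported under an alias so it does not shadow the mean_absolute_error function defined above.

# Optional cross-check with scikit-learn (assumes scikit-learn is installed)
from sklearn.metrics import mean_absolute_error as sk_mae

actual_ratings = [3.5, 4.0, 2.0, 5.0, 3.0]
predicted_ratings = [3.0, 4.5, 1.5, 4.5, 2.5]

# Should print 0.5, matching the hand-rolled function above
print(f"Mean Absolute Error (scikit-learn): {sk_mae(actual_ratings, predicted_ratings)}")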

1.2 Root Mean Square Error (RMSE): Calculates the square root of the average squared differences between predicted and actual ratings. RMSE is more sensitive to large errors than MAE.

Here’s a Python function to calculate the Root Mean Square Error (RMSE) between predicted and actual ratings:

import math

def root_mean_square_error(actual_ratings, predicted_ratings):
    # Check if the lengths of actual_ratings and predicted_ratings are equal
    if len(actual_ratings) != len(predicted_ratings):
        raise ValueError("The length of actual_ratings and predicted_ratings must be the same.")

    n = len(actual_ratings)
    total_error = 0

    # Iterate through the lists and calculate the squared difference between actual and predicted ratings
    for i in range(n):
        total_error += (actual_ratings[i] - predicted_ratings[i]) ** 2

    # Calculate the square root of the average of squared differences
    rmse = math.sqrt(total_error / n)
    return rmse

# Example usage
actual_ratings = [3.5, 4.0, 2.0, 5.0, 3.0]
predicted_ratings = [3.0, 4.5, 1.5, 4.5, 2.5]

# Call the root_mean_square_error function with actual_ratings and predicted_ratings
rmse = root_mean_square_error(actual_ratings, predicted_ratings)
print(f"Root Mean Square Error: {rmse}")

Output:

Root Mean Square Error: 0.5

This function takes two lists as input, actual_ratings and predicted_ratings, and calculates the square root of the average squared differences between them. If the lengths of the lists don’t match, a ValueError is raised. The function returns the calculated RMSE as a floating-point number.

In the example usage, actual_ratings and predicted_ratings are two lists containing example ratings. The function is called with these lists, and the result is printed.
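
As with MAE, you can cross-check the RMSE against scikit-learn. This is a minimal sketch, assuming scikit-learn is installed; it takes the square root of the library's mean squared error to obtain the RMSE.

# Optional cross-check with scikit-learn (assumes scikit-learn is installed)
import math
from sklearn.metrics import mean_squared_error

actual_ratings = [3.5, 4.0, 2.0, 5.0, 3.0]
predicted_ratings = [3.0, 4.5, 1.5, 4.5, 2.5]

# Square root of the mean squared error gives the RMSE; should print 0.5
rmse = math.sqrt(mean_squared_error(actual_ratings, predicted_ratings))
print(f"Root Mean Square Error (scikit-learn): {rmse}")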

1.3 Precision: Measures the proportion of relevant recommendations out of all the recommended items.

Here’s a Python function to calculate Precision.

def precision(recommended_items, relevant_items):
    # Calculate the intersection of recommended_items and relevant_items
    true_positive = len(set(recommended_items).intersection(set(relevant_items)))

    # Calculate the total number of recommended items
    total_recommended_items = len(recommended_items)

    # Calculate precision
    precision_value = true_positive / total_recommended_items if total_recommended_items > 0 else 0
    return precision_value

# Example usage
recommended_items = [1, 3, 5, 7, 9]
relevant_items = [2, 3, 5, 7, 11, 15, 20]

precision_value = precision(recommended_items, relevant_items)
print(f"Precision: {precision_value:.2f}")

Output:

Precision: 0.60

This function takes two lists as input, recommended_items and relevant_items, and calculates the proportion of relevant recommendations by counting the number of items in the intersection of the two lists and dividing that by the total number of recommended items.

In the example usage, recommended_items and relevant_items are two lists containing example item IDs.

1.4 Recall: Measures the proportion of relevant recommendations out of all the relevant items.

Here is a Python function for calculating recall:

def recall(recommended_items, relevant_items):
    # Calculate the intersection of recommended_items and relevant_items
    true_positive = len(set(recommended_items).intersection(set(relevant_items)))

    # Calculate the total number of relevant items
    total_relevant_items = len(relevant_items)

    # Calculate recall
    recall_value = true_positive / total_relevant_items if total_relevant_items > 0 else 0
    return recall_value

# Example usage
recommended_items = [1, 3, 5, 7, 9]
relevant_items = [2, 3, 5, 7, 11, 15, 20]

recall_value = recall(recommended_items, relevant_items)
print(f"Recall: {recall_value:.2f}")

Output:

Recall: 0.43

This function takes two lists as input, recommended_items and relevant_items, and calculates the proportion of relevant recommendations by counting the number of items in the intersection of the two lists and dividing that by the total number of relevant items.

In the example usage, recommended_items and relevant_items are two lists containing example item IDs.
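
Precision and recall are often summarized by the F1-score, which is referenced later in this guide. Below is a minimal sketch of how you could compute it by reusing the precision and recall functions defined above; the f1_score helper is an illustrative addition rather than part of the original tutorial code.

def f1_score(recommended_items, relevant_items):
    # Harmonic mean of precision and recall, reusing the functions defined above
    p = precision(recommended_items, relevant_items)
    r = recall(recommended_items, relevant_items)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0

# Example usage with the same lists as above
recommended_items = [1, 3, 5, 7, 9]
relevant_items = [2, 3, 5, 7, 11, 15, 20]

print(f"F1-score: {f1_score(recommended_items, relevant_items):.2f}")  # prints 0.50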

Group 2. Ranking Metrics: Quality Over Quantity

Ranking metrics assess the quality of the ranking of recommended items, ensuring that the most relevant items appear at the top of the list.

2.1 Mean Reciprocal Rank (MRR): Calculates the average of the reciprocal ranks of the first relevant recommendation for each user.

The pros of Mean Reciprocal Rank (MRR) include:

  1. Position-aware: MRR takes into account the position of the first relevant item in the recommendation list, rewarding systems that rank relevant items higher. This makes it suitable for evaluating ranking-based recommendation systems where the order of items matters.
  2. Focus on the top-ranked item: MRR emphasizes the importance of the first relevant item in the list, making it a useful metric for scenarios where users are likely to focus on the top recommendations and ignore the rest.
  3. Average performance: MRR calculates the mean reciprocal rank across all users, providing an overall measure of the recommendation system’s performance. This allows for a fair comparison of different algorithms or models, as it considers the average performance rather than specific user cases.
  4. Intuitive interpretation: MRR scores range from 0 to 1, with higher values indicating better performance. This makes it easy to interpret and compare the performance of different recommendation algorithms.

The cons of Mean Reciprocal Rank (MRR) include:

  1. Limited scope: MRR focuses exclusively on the first relevant item in the list and does not take into account other relevant items or their positions. This can be a limitation in scenarios where multiple relevant items or the overall ranking quality are important.
  2. Binary relevance assumption: MRR assumes a binary relevance scale (either relevant or not relevant) and does not account for varying degrees of relevance on a continuous scale. This can be a limitation in situations where the relevance of items is not binary and needs to be quantified more granularly.
  3. Lack of personalization: While MRR provides an average performance measure across all users, it may not fully capture the personalized aspect of recommendation systems. A high MRR score does not necessarily guarantee that the recommendation system is providing good recommendations for each individual user.
  4. Sensitivity to outliers: MRR can be sensitive to outliers, as it calculates the reciprocal rank of the first relevant item for each user. A few users with very low reciprocal ranks can significantly impact the overall MRR score, potentially making it less reliable for evaluating the general performance of a recommendation system.

Here is a Python function for calculating Mean Reciprocal Rank (MRR):

def mean_reciprocal_rank(recommended_items_list, relevant_items_list):
    if len(recommended_items_list) != len(relevant_items_list):
        raise ValueError("The length of recommended_items_list and relevant_items_list must be the same.")

    reciprocal_ranks = []

    # Iterate through the lists of recommended items and relevant items for each user
    for recommended_items, relevant_items in zip(recommended_items_list, relevant_items_list):
        # Find the reciprocal rank for each user
        for rank, item in enumerate(recommended_items, start=1):
            if item in relevant_items:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)

    # Calculate the mean reciprocal rank
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return mrr

# Example usage
recommended_items_list = [
    [1, 3, 5, 7, 9],
    [2, 4, 6, 8],
    [11, 12, 13, 14, 15, 16, 17]
]

relevant_items_list = [
    [2, 3, 5, 7, 11],
    [1, 4, 6, 8, 9],
    [16, 17, 18, 19, 20]
]

mrr = mean_reciprocal_rank(recommended_items_list, relevant_items_list)
print(f"Mean Reciprocal Rank: {mrr:.2f}")

Output:

Mean Reciprocal Rank: 0.39

This function takes two lists of lists as input, recommended_items_list and relevant_items_list, where each inner list represents the recommendations and relevant items for a specific user. It calculates the reciprocal rank for each user by finding the rank of the first relevant item in the recommended items list, then takes the average of these reciprocal ranks. The function returns the calculated MRR as a floating-point number.

In the example usage, recommended_items_list and relevant_items_list are two lists of lists containing example item IDs for three users. The function is called with these lists, and the result is printed.

2.2 Mean Average Precision (MAP): Calculates the average precision for each user and takes the mean across all users. It takes into account both the order and the relevance of recommended items.

The pros of Mean Average Precision (MAP) include:

  1. Position-aware: MAP takes into account the position of relevant items in the recommendation list, rewarding systems that rank relevant items higher. This makes it suitable for evaluating ranking-based recommendation systems where the order of items matters.
  2. Relevance-aware: MAP considers the relevance of items by calculating the average precision for each user, which is the average of the precision scores at each relevant item’s position. This means that it can distinguish between different levels of relevance when evaluating recommendations.
  3. Average performance: MAP calculates the mean average precision across all users, providing an overall measure of the recommendation system’s performance. This allows for a fair comparison of different algorithms or models, as it considers the average performance rather than specific user cases.
  4. Robustness: MAP is less sensitive to outliers compared to some other metrics, as it calculates the average precision across multiple positions in the recommendation list for each user. This robustness makes it more reliable for evaluating the general performance of a recommendation system.

The cons of Mean Average Precision (MAP) include:

  1. Binary relevance assumption: MAP assumes a binary relevance scale (either relevant or not relevant) and does not account for varying degrees of relevance on a continuous scale. This can be a limitation in situations where the relevance of items is not binary and needs to be quantified more granularly.
  2. Lack of personalization: While MAP provides an average performance measure across all users, it may not fully capture the personalized aspect of recommendation systems. A high MAP score does not necessarily guarantee that the recommendation system is providing good recommendations for each individual user.
  3. Not suitable for all scenarios: MAP is most appropriate when users are presented with a ranked list of items. It is less suitable for tasks that do not involve an explicit ranking, such as pure rating prediction.
  4. Complexity: The calculation of MAP can be more complex than other metrics like precision, recall, or F1-score, making it more difficult to interpret and explain to non-experts.

Here is a Python code for calculating Mean Average Precision (MAP).

def average_precision(recommended_items, relevant_items):
    true_positives = 0
    sum_precisions = 0

    for rank, item in enumerate(recommended_items, start=1):
        if item in relevant_items:
            true_positives += 1
            precision_at_rank = true_positives / rank
            sum_precisions += precision_at_rank

    return sum_precisions / len(relevant_items) if len(relevant_items) > 0 else 0


def mean_average_precision(recommended_items_list, relevant_items_list):
    if len(recommended_items_list) != len(relevant_items_list):
        raise ValueError("The length of recommended_items_list and relevant_items_list must be the same.")

    average_precisions = []

    # Calculate the average precision for each user
    for recommended_items, relevant_items in zip(recommended_items_list, relevant_items_list):
        ap = average_precision(recommended_items, relevant_items)
        average_precisions.append(ap)

    # Calculate the mean average precision across all users
    map_value = sum(average_precisions) / len(average_precisions)
    return round(map_value, 2)

# Example usage
recommended_items_list = [
    [1, 3, 5, 7, 9],
    [2, 4, 6, 8],
    [11, 12, 13, 14, 15, 16, 17]
]

relevant_items_list = [
    [2, 3, 5, 7, 11],
    [1, 4, 6, 8, 9],
    [16, 17, 18, 19, 20]
]

map_value = mean_average_precision(recommended_items_list, relevant_items_list)
print(f"Mean Average Precision: {map_value}")

Output:

Mean Average Precision: 0.29

In this code, the average_precision function calculates the average precision for a given set of recommended items and relevant items. The mean_average_precision function calculates the mean average precision across all users by calling the average_precision function for each user and then averaging the results. The final MAP value is rounded to 2 decimal places.

2.3 Normalized Discounted Cumulative Gain (nDCG): Evaluates the ranking quality by assigning higher importance to relevant items appearing at the top of the recommendation list. It is normalized to ensure comparability across different users and queries.

Normalized Discounted Cumulative Gain (nDCG) is a popular metric used to evaluate the quality of ranking in recommendation systems. It has several benefits:

  1. Position-aware: Unlike some other metrics, nDCG takes into account the position of the relevant items in the recommendation list. Items that are ranked higher (closer to the top) contribute more to the nDCG score, reflecting the fact that users are more likely to interact with items at the top of the list.
  2. Relevance-weighted: nDCG incorporates the relevance of each recommended item, allowing it to differentiate between items with varying degrees of relevance. This makes it suitable for situations where the relevance of items is not binary (e.g., partially relevant, highly relevant) and can be quantified on a continuous scale.
  3. Normalized: nDCG is normalized against the ideal ranking, which means it can be compared across different queries or users. This allows for a fair evaluation of the recommendation system’s performance, even when the number of relevant items varies between users or queries.
  4. Suitable for diverse recommendation scenarios: nDCG is applicable to various recommendation scenarios, including search engine result ranking, collaborative filtering, and content-based recommendation. This makes it a versatile metric for evaluating different types of recommendation systems.
  5. Intuitive interpretation: nDCG scores range from 0 to 1, with higher values indicating better ranking quality. This makes it easy to interpret and compare the performance of different recommendation algorithms.

The cons of Normalized Discounted Cumulative Gain (nDCG) include:

  1. Complexity: The calculation of nDCG can be more complex than other metrics like precision, recall, or F1-score, making it more difficult to interpret and explain to non-experts.
  2. Lack of personalization: While nDCG provides a measure of ranking quality, it may not fully capture the personalized aspect of recommendation systems. A high nDCG score does not necessarily guarantee that the recommendation system is providing good recommendations for each individual user.
  3. Binary relevance assumption: Although nDCG can handle varying degrees of relevance, it is often used with binary relevance judgments in practice. This can be a limitation in situations where the relevance of items is not binary and needs to be quantified more granularly.
  4. Sensitive to the choice of the ideal ranking: The normalization factor in nDCG is based on the ideal ranking, which can sometimes be subjective or difficult to determine. The choice of the ideal ranking can influence the nDCG score, potentially affecting its consistency and reliability.

Here’s a Python function to calculate the Normalized Discounted Cumulative Gain (nDCG).

import math

def discounted_cumulative_gain(recommended_items, relevant_items):
    dcg = 0
    for i, item in enumerate(recommended_items, start=1):
        if item in relevant_items:
            dcg += 1 / (math.log2(i + 1))
    return dcg

def ideal_discounted_cumulative_gain(recommended_items, relevant_items):
    sorted_relevant_items = sorted(relevant_items, key=lambda x: recommended_items.index(x) if x in recommended_items else float('inf'))
    return discounted_cumulative_gain(sorted_relevant_items, relevant_items)

def normalized_discounted_cumulative_gain(recommended_items, relevant_items):
    dcg = discounted_cumulative_gain(recommended_items, relevant_items)
    idcg = ideal_discounted_cumulative_gain(recommended_items, relevant_items)

    if idcg == 0:
        return 0
    else:
        return round(dcg / idcg, 2)

# Example usage
recommended_items_list = [
    [1, 3, 5, 7, 9],
    [2, 4, 6, 8],
    [11, 12, 13, 14, 15, 16, 17]
]

relevant_items_list = [
    [2, 3, 5, 7, 11],
    [1, 4, 6, 8, 9],
    [16, 17, 18, 19, 20]
]

ndcg_values = [normalized_discounted_cumulative_gain(recommended, relevant)
               for recommended, relevant in zip(recommended_items_list, relevant_items_list)]

print(f"nDCG values: {ndcg_values}")

Output:

nDCG values: [0.53, 0.53, 0.23]

In this code, the discounted_cumulative_gain function calculates the DCG for a given set of recommended items and relevant items. The ideal_discounted_cumulative_gain function calculates the ideal DCG, i.e., the DCG that would be achieved if all the relevant items appeared at the top of the recommendation list. The normalized_discounted_cumulative_gain function calculates the nDCG by dividing the DCG value by the ideal DCG value.

The example usage calculates the nDCG values for three users with different recommended and relevant item lists. The nDCG values are printed for each user.
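
The functions above assume binary relevance: an item is either relevant or not. As noted in the pros list, nDCG can also incorporate graded relevance. The sketch below is an illustrative extension (not part of the original code) that assumes relevance is supplied as a dictionary mapping item IDs to graded scores, such as 0 to 3.

import math

def dcg_graded(recommended_items, relevance_scores):
    # relevance_scores maps item IDs to graded relevance; items not in the dictionary count as 0
    return sum(relevance_scores.get(item, 0) / math.log2(rank + 1)
               for rank, item in enumerate(recommended_items, start=1))

def ndcg_graded(recommended_items, relevance_scores):
    dcg = dcg_graded(recommended_items, relevance_scores)
    # Ideal ordering: items sorted by relevance grade, highest first
    ideal_order = sorted(relevance_scores, key=relevance_scores.get, reverse=True)
    idcg = dcg_graded(ideal_order, relevance_scores)
    return dcg / idcg if idcg > 0 else 0

# Example usage with graded relevance scores
relevance_scores = {3: 3, 5: 2, 7: 1, 2: 3, 11: 1}
print(f"Graded nDCG: {ndcg_graded([1, 3, 5, 7, 9], relevance_scores):.2f}")  # ≈ 0.50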

Group 3. Coverage and Diversity Metrics: A Taste for Variety

Coverage and diversity metrics measure the extent to which a recommendation system can provide a wide range of relevant and novel items, promoting exploration and discovery.

3.1 Catalog Coverage: Measures the proportion of items in the catalog that are recommended at least once.

Catalog coverage is an important metric in recommendation systems for several reasons:

  • Item exposure: Catalog coverage measures the proportion of items in the catalog that are recommended at least once. This provides insight into how well the recommendation system exposes different items in the catalog to users, ensuring that a wide variety of items have a chance to be recommended and discovered by users.
  • Long-tail items: A high catalog coverage indicates that the recommendation system is capable of suggesting not only popular items but also less popular or long-tail items. This can help promote niche items that cater to specific user interests, potentially increasing user satisfaction and overall revenue.
  • Diversification: Catalog coverage can serve as a proxy for the diversity of recommendations. A higher catalog coverage implies that the system is recommending a broader range of items, which can lead to a more diverse and engaging user experience.
  • Cold-start problem: Catalog coverage can help identify the cold-start problem, where the recommendation system struggles to recommend new or less popular items due to a lack of data. A low catalog coverage might indicate that the system is not well-suited for handling such situations, and alternative approaches or additional data sources should be considered.
  • Business goals: For businesses with a large and diverse catalog, it is essential to ensure that users are exposed to a wide variety of items. A high catalog coverage can contribute to achieving business goals such as increasing sales, user satisfaction, and user retention.
  • Evaluation and model comparison: Catalog coverage is a useful metric for comparing different recommendation models or evaluating improvements in the model over time. It can help assess the model’s capability to recommend a broad range of items, which can be an important factor in selecting the best model for a particular application.

Here’s a Python function to calculate the Catalog Coverage.

def catalog_coverage(recommended_items_list, catalog_items):
    # Flatten the list of recommended items and convert it to a set
    unique_recommended_items = set(item for sublist in recommended_items_list for item in sublist)

    # Calculate the intersection of unique recommended items and catalog items
    covered_items = unique_recommended_items.intersection(catalog_items)

    # Calculate the catalog coverage
    coverage = len(covered_items) / len(catalog_items)
    return coverage

# Example usage
recommended_items_list = [
    [1, 3, 5, 7, 9],
    [2, 4, 6, 8],
    [11, 12, 13, 14, 15, 16, 17]
]

catalog_items = set(range(1, 21))

coverage = catalog_coverage(recommended_items_list, catalog_items)
print(f"Catalog Coverage: {coverage}")

Output:

Catalog Coverage: 0.8

This function takes a list of lists, recommended_items_list, where each inner list represents the recommended items for a specific user, and a set catalog_items representing all the items in the catalog. It calculates the catalog coverage by counting the unique recommended items that appear in the catalog and dividing that count by the total number of items in the catalog. The function returns the calculated catalog coverage as a floating-point number.

In the example usage, recommended_items_list contains example recommended items for three users, and catalog_items represents a catalog of 20 items.

3.2 Prediction Coverage: Measures the proportion of possible user-item pairs for which the recommendation system can make predictions.

Prediction coverage provides valuable information about the recommendation model’s ability to make predictions across the user-item space. It helps identify potential limitations, evaluate model performance, and ensure that the model is well-suited for the intended application.

  • Model limitations: Prediction coverage provides insight into the limitations of the recommendation model. A low prediction coverage indicates that the model is only able to make predictions for a small proportion of user-item pairs, which may lead to less diverse or less accurate recommendations.
  • Cold-start problem: Prediction coverage can help identify the cold-start problem, where the model struggles to make recommendations for new users or items due to a lack of data. A low prediction coverage might indicate that the model is not well-suited for handling such situations, and alternative approaches or additional data sources should be considered.
  • Diversity and personalization: A high prediction coverage indicates that the model can make predictions for a wide range of user-item pairs, which is desirable in a recommendation system to ensure that users receive diverse and personalized recommendations. This is particularly important when the user base and item catalog are large and varied.
  • Evaluation and model comparison: Prediction coverage is a useful metric to compare different recommendation models or to evaluate improvements in the model over time. It helps in understanding the model’s capability to generate predictions across the entire user-item space, which can be a crucial factor in selecting the best model for a particular application.
  • Scalability: Prediction coverage can also be an indicator of the model’s scalability. If a model can make predictions for a large number of user-item pairs, it may be better suited for handling larger datasets or growing catalogs.

Here’s a Python function to calculate the Prediction Coverage.

def prediction_coverage(predicted_ratings, total_users, total_items):
    # Count the number of user-item pairs for which the recommendation system can make predictions
    predicted_pairs = sum(len(ratings) for ratings in predicted_ratings)

    # Calculate the total number of possible user-item pairs
    total_possible_pairs = total_users * total_items

    # Calculate the prediction coverage
    coverage = predicted_pairs / total_possible_pairs
    return coverage

# Example usage
predicted_ratings = [
    {1: 3.5, 3: 4.0, 5: 2.5, 7: 3.0, 9: 4.5},
    {2: 4.5, 4: 3.0, 6: 2.0, 8: 3.5},
    {11: 3.5, 12: 4.0, 13: 2.5, 14: 3.0, 15: 4.5, 16: 3.5, 17: 2.0}
]

total_users = 3
total_items = 20

coverage = prediction_coverage(predicted_ratings, total_users, total_items)
print(f"Prediction Coverage: {coverage:.2f}")

Output:

Prediction Coverage: 0.27

This function takes a list of dictionaries predicted_ratings, where each dictionary represents the predicted item ratings for a specific user, and two integers total_users and total_items representing the total number of users and items, respectively. It calculates the prediction coverage by counting the number of user-item pairs for which the recommendation system can make predictions and dividing that by the total number of possible user-item pairs. The function returns the calculated prediction coverage as a floating-point number.

In the example usage, predicted_ratings contains example predicted item ratings for three users, and there are 3 total users and 20 total items.

3.3 Diversity: Evaluates the dissimilarity between recommended items, ensuring that the recommendation list contains a good mix of different types of items.

Diversity is important in recommendation systems because it helps ensure personalized and engaging experiences for users, supports exploration, reduces filter bubbles, promotes long-tail items, and enhances the robustness of the system.

  • Personalization: A diverse set of recommendations helps ensure that different users with varying interests and preferences receive personalized suggestions that cater to their unique tastes. This can lead to higher user satisfaction and better engagement.
  • Exploration: Diversity in recommendations allows users to explore and discover new items or content they may not have known about or considered previously. This can enhance the user experience by providing users with fresh and novel recommendations.
  • Reducing filter bubbles: Over-personalization can result in “filter bubbles” where users are only exposed to items that are similar to their previous choices. This can limit users’ exposure to new ideas, perspectives, or experiences. Diverse recommendations help prevent filter bubbles by ensuring that users are exposed to a broader range of items.
  • Long-tail items: Diversity in recommendations can help promote long-tail items, which are items that may not be as popular but are still relevant to specific users. By recommending diverse items, the system can drive user engagement with a broader range of items, potentially increasing revenue and user satisfaction.
  • Robustness: A diverse set of recommendations is less susceptible to manipulation or bias. By ensuring that a wide variety of items are recommended, the system is more resistant to external factors like spam or targeted promotion of certain items.

Here’s a Python function to calculate the Diversity of the recommended items.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def diversity(recommended_items_list, item_features):
    total_similarity = 0
    total_pairs = 0

    for recommended_items in recommended_items_list:
        # Calculate pairwise cosine similarities for the recommended items
        similarities = cosine_similarity([item_features[item] for item in recommended_items])
        # Sum the similarities for all item pairs
        total_similarity += np.sum(similarities) - np.trace(similarities)  # Exclude the diagonal
        # Count the total number of item pairs
        total_pairs += len(recommended_items) * (len(recommended_items) - 1)

    # Calculate the average similarity between item pairs
    avg_similarity = total_similarity / total_pairs if total_pairs > 0 else 0
    # Calculate the diversity by subtracting the average similarity from 1
    diversity = 1 - avg_similarity
    return diversity

# Example usage
recommended_items_list = [
    [1, 3, 5, 7, 9],
    [2, 4, 6, 8],
    [11, 12, 13, 14, 15, 16, 17]
]

# Example item features dictionary (using random feature vectors)
item_features = {item: np.random.rand(5) for item in range(1, 21)}

diversity_value = diversity(recommended_items_list, item_features)
print(f"Diversity: {diversity_value:.2f}")

Output:

Diversity: 0.23

In this code, the diversity function calculates the diversity of the recommended items using cosine similarity. It takes a list of lists, recommended_items_list, where each inner list represents the recommended items for a specific user, and a dictionary, item_features, that maps each catalog item to its feature vector.

The function computes pairwise cosine similarities among the recommended items and calculates diversity as 1 minus the average pairwise similarity.

In the example usage, recommended_items_list contains example recommended items for three users, and item_features is a dictionary of random 5-dimensional feature vectors for 20 items. The function is called with these inputs, and the result is printed. Because the feature vectors are generated randomly without a fixed seed, the printed value will differ from run to run.
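
If you want a reproducible figure, you can draw the feature vectors from a seeded NumPy generator instead. The sketch below reuses the diversity function and recommended_items_list defined above; the seed value 42 is arbitrary.

# Reproducible variant of the example above (the seed value is arbitrary)
rng = np.random.default_rng(42)
item_features = {item: rng.random(5) for item in range(1, 21)}

diversity_value = diversity(recommended_items_list, item_features)
print(f"Diversity: {diversity_value:.2f}")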

3.4 Serendipity: Measures the degree to which the recommended items are both relevant and unexpected, promoting the discovery of novel and interesting items.

A recommendation list with serendipity can introduce users to items they may not have expected to enjoy or find relevant, leading to serendipitous discoveries. This can create a more enjoyable and engaging user experience.

Measuring serendipity is a challenging task as it involves the combination of relevance, surprise, and novelty. Here’s a Python function to calculate a basic version of Serendipity, which measures the degree to which the recommended items are both relevant and unexpected:

def serendipity(recommended_items_list, relevant_items_list, popular_items, k=10):
    serendipity_score = 0
    total_users = len(recommended_items_list)

    for recommended_items, relevant_items in zip(recommended_items_list, relevant_items_list):
        # Select the top-k recommended items
        top_k_recommended = recommended_items[:k]

        # Find the serendipitous items by removing popular items from relevant items
        serendipitous_items = set(relevant_items) - set(popular_items)

        # Count the number of serendipitous items in the top-k recommendations
        serendipitous_recommendations = len(set(top_k_recommended) & serendipitous_items)

        # Calculate the proportion of serendipitous items in the top-k recommendations
        serendipity_score += serendipitous_recommendations / k

    # Calculate the average serendipity score across all users
    avg_serendipity = serendipity_score / total_users
    return avg_serendipity

# Example usage
recommended_items_list = [
    [1, 3, 5, 7, 9, 2, 4, 6, 8, 11],
    [2, 4, 6, 8, 1, 3, 5, 7, 9, 12],
    [11, 12, 13, 14, 15, 16, 17, 1, 3, 5]
]

relevant_items_list = [
    [2, 3, 5, 7, 11, 13, 15, 17],
    [1, 4, 6, 8, 9, 11, 14, 16],
    [1, 3, 5, 7, 9, 11, 12, 13, 15, 17]
]

popular_items = [1, 2, 3, 4, 5, 6, 7, 8, 9]

serendipity_value = serendipity(recommended_items_list, relevant_items_list, popular_items)
print(f"Serendipity: {serendipity_value:.2f}")

Output:

Serendipity: 0.20

In this code, the serendipity function calculates serendipity by first identifying serendipitous items (items that are relevant but not popular) and then computing the proportion of these items in the top-k recommendations. It takes two lists of lists, recommended_items_list and relevant_items_list, plus a flat list of popular items, popular_items. The function also accepts an optional parameter k, the number of top recommendations to consider (default is 10).

In the example usage, recommended_items_list contains example recommended items for three users, relevant_items_list contains relevant items for the users, and popular_items is a list of popular items. The function is called with these inputs, and the result is printed. Note that this is a simple example, and other more sophisticated methods can be used to measure serendipity in recommendation systems.

Summary

Evaluating the performance of a recommendation system is crucial for its success, as it helps identify areas for improvement and ensures that it meets user needs. By understanding and applying these key metrics, you can optimize your recommender engine, enhance user satisfaction, and ultimately drive engagement and revenue for your platform. Remember, the most effective evaluation approach will depend on your specific use case and business goals, so always consider a combination of these metrics to get a comprehensive understanding of your system’s performance. Happy recommending!

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

