Machine learning model performance evaluation is one of the most commonly asked topics in data science and machine learning interviews. In this tutorial, you will learn:

- What are the most widely used metrics for a binary classification model performance evaluation?
- How to interpret each metric?

**Resources for this post:**

- More video tutorials on Data Science interviews and Statistics
- More blog posts on Data Science interviews and Statistics
- Video tutorial for this post on YouTube. Click here for the slides.

Let's get started!

**7 Metrics for Binary Classification Model Performance Evaluation**

There are different metrics to evaluate a binary classification model's performance. The 7 most commonly used metrics are ROC/AUC, log loss, accuracy, precision, recall, F1 score, and the Matthews correlation coefficient.

**ROC/AUC**

- The ROC curve is plotted with the x-axis being the False Positive Rate (FPR) and the y-axis being the True Positive Rate (TPR). It plots the FPR and TPR combinations at different classification thresholds.
- The False Positive Rate (FPR) is calculated by the number of False Positives (FP) divided by the total of True Negatives (TN) and False Positives (FP).
- The equation is: FPR = FP/(TN+FP)
- The True Positive Rate (TPR) is calculated by the number of True Positives (TP) divided by the total of True Positives (TP) and False Negatives (FN).
- The equation is: TPR = TP/(TP+FN)
- The AUC (Area Under the ROC Curve) ranges from 0 to 1, where 1 represents a perfect model and 0.5 represents a random guess. The higher the value, the better the model.
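To make this concrete, here is a minimal sketch using scikit-learn's `roc_auc_score` (the library choice and the labels/probabilities are assumptions for illustration; the post does not name a library):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels and predicted probabilities
# for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

# AUC summarizes the ROC curve across all classification thresholds
auc = roc_auc_score(y_true, y_prob)
print(auc)
```

Here 15 of the 16 positive/negative pairs are ranked correctly (the positive scored 0.35 is ranked below the negative scored 0.4), so the AUC is 15/16 = 0.9375.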

**Log Loss**

- Log loss tells us how close the predicted probability is to the true label.
- The value of log loss ranges from 0 to infinity. A smaller log loss indicates a better model.
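As a quick sketch (assuming scikit-learn and made-up labels and probabilities), `log_loss` penalizes predicted probabilities that are far from the true label:

```python
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]

# log loss = -mean(y*log(p) + (1-y)*log(1-p))
ll = log_loss(y_true, y_prob)
print(ll)
```

The closer each probability is to its true label, the smaller the resulting loss.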

**Accuracy**

- Accuracy is the percentage of correct predictions.
- Accuracy is calculated by using the total of True Positive (TP) and True Negatives (TN) divided by the total number of records. The total number of records is the sum of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
- The equation is: (TP+TN)/(TP+TN+FP+FN)

- The value for accuracy ranges from 0 to 1, where 1 means perfect prediction. The higher the accuracy, the better the model, provided the data is balanced.
- Accuracy is not a good model evaluation metric for a highly imbalanced dataset.
- For example, if we predict all 0s for a dataset with 1% of the labels being 1s, the accuracy would be 99%, but this is clearly not a good model.
- In this case, we can either use other metrics such as precision and recall, or use oversampling or under-sampling techniques to change the ratio of the modeling dataset.
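The imbalanced-data pitfall above can be sketched in a few lines (assuming scikit-learn; the 1%-positive dataset is synthetic):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 1 positive label out of 100
y_true = np.array([1] * 1 + [0] * 99)
y_pred = np.zeros(100, dtype=int)  # a "model" that predicts all 0s

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.99, despite the model missing every positive
```

The 99% accuracy looks impressive but the model captures none of the positives, which is why precision and recall are needed here.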

**Precision**

- Precision is also called the positive predictive value (PPV). It is the percentage of correctly predicted positive events out of all the predicted positive events.
- Precision is calculated using the True Positives (TP) divided by the total of True Positives (TP) and False Positives (FP).
- The equation is: TP/(TP+FP)
- The value for precision ranges from 0 to 1. The higher the precision, the better the model.
- A precision of 1 means all the predicted positives are actual positives.
- A precision of 0 means that none of the predicted positives are actual positives.

- Precision should be used for model performance evaluation when the cost of false positives is high.
- For example, for a model predicting whether an email is spam, the cost of misclassifying an important email as spam is high, so we need to maximize precision for the model.
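A minimal sketch of the spam example, assuming scikit-learn and hypothetical predictions:

```python
from sklearn.metrics import precision_score

# Hypothetical labels for a spam classifier: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# TP = 3, FP = 1 -> precision = 3 / (3 + 1) = 0.75
prec = precision_score(y_true, y_pred)
print(prec)
```

The single false positive (a legitimate email flagged as spam) is exactly the costly error that precision penalizes.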

**Recall**

- Recall is also called sensitivity or the true positive rate (TPR). It is the percentage of positive events captured out of all the actual positive events.
- Recall is calculated using the True Positives (TP) divided by the total of True Positives (TP) and False Negatives (FN).
- The equation is: TP/(TP+FN)
- The value for recall ranges from 0 to 1. The higher the recall, the better the model.
- A recall of 1 means all the actual positives are captured by the model prediction.
- A recall of 0 means that none of the actual positives are captured by the model prediction.

- Recall should be used for model performance evaluation when the cost of a false positive is low, but the reward for a true positive is high.
- For example, for a model predicting the response propensity of a marketing campaign, the cost of misclassifying a non-responder as a responder is low, but the reward for capturing a true responder is high, so we need to maximize recall for the model.
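A sketch of the campaign example (scikit-learn and the labels are assumptions for illustration):

```python
from sklearn.metrics import recall_score

# Hypothetical labels for a campaign-response model: 1 = responder
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# TP = 2, FN = 2 -> recall = 2 / (2 + 2) = 0.5
rec = recall_score(y_true, y_pred)
print(rec)
```

This model captures only half of the actual responders; a campaign optimizing for recall would want that fraction higher, even at the cost of contacting a few extra non-responders.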

**F1 Score**

- F1 score is also called the F score or the F measure. It is calculated using both precision and recall.
- The F1 score is 2 times the product of precision and recall divided by the sum of precision and recall.
- The equation is: 2*Precision*Recall/(Precision+Recall)
- The F1 score ranges from 0 to 1, with the best value being 1 and the worst value being 0.
- F1 is a metric that balances precision and recall, and it should be used when there is no clear preference between precision and recall.
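The relationship between the three metrics can be verified directly (assuming scikit-learn; the labels are hypothetical):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

prec = precision_score(y_true, y_pred)  # TP=2, FP=1 -> 2/3
rec = recall_score(y_true, y_pred)      # TP=2, FN=2 -> 1/2
f1 = f1_score(y_true, y_pred)           # 2*prec*rec/(prec+rec) = 4/7
print(f1)
```

Because F1 is a harmonic mean, it sits closer to the lower of precision and recall, so a model cannot score well on F1 by maximizing one at the expense of the other.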

**Matthews Correlation Coefficient**

- The Matthews correlation coefficient (MCC) is also known as the phi coefficient. It is used in machine learning as a measure of the quality of binary and multiclass classifications.
- It considers true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.
- The MCC is in essence a correlation coefficient value between -1 and +1.
- A coefficient of +1 represents a perfect prediction
- A coefficient of 0 represents an average random prediction
- A coefficient of -1 represents an inverse prediction
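As a sketch, scikit-learn exposes the MCC as `matthews_corrcoef` (the library and labels are assumptions; the formula in the comment is the standard MCC definition):

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels: TP=2, TN=3, FP=1, FN=2
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
#     = (2*3 - 1*2) / sqrt(3*4*4*5)
mcc = matthews_corrcoef(y_true, y_pred)
print(mcc)
```

Because all four confusion-matrix cells appear in the formula, the MCC stays informative even when the two classes are of very different sizes.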

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- One-Class SVM For Anomaly Detection
- Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python
- Hyperparameter Tuning For XGBoost
- Recommendation System: User-Based Collaborative Filtering
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- How to detect outliers | Data Science Interview Questions and Answers
- Causal Inference One-to-one Matching on Confounders Using R for Python Users