10 Metrics to Evaluate Supervised Machine Learning Models

Evaluating a model is as important as building it

Soner Yıldırım
Heartbeat

Supervised machine learning algorithms try to model the relationship between features (independent variables) and a label (target variable) given a set of observations. The two main types of supervised learning tasks are classification and regression.

Classification is a supervised learning technique in which the target variable is discrete (or categorical).

Consider a bank collecting a dataset that contains several features about their customers such as the number of products, transactions within the last month, expected monthly income, and so on. The bank is trying to predict whether a customer will churn based on this dataset. This task is an example of a binary classification task because the target variable can only take one of two possible values.

An example of a multi-class classification task could be the famous iris dataset. It contains four features (length and width of sepals and petals) of samples that belong to one of three species of iris. The task is to predict the type of iris knowing the given four features.

In a regression task, the target variable is continuous. Some typical examples of regression are predicting the price of a house knowing its features and forecasting the future sales of a retail store.

When working on a machine learning problem, we are not done until we establish a robust and versatile evaluation process. The models deployed into production are expected to work on new, previously unseen data. Hence, it is of vital importance to have a comprehensive understanding of the model performance, which can only be done with a thorough evaluation.

Evaluation is usually an iterative process with a feedback loop between the results and the model. The most critical part of this process is the selection of an appropriate metric. There are many different metrics, and the optimal one depends on the problem, the data, and what is expected from the model. In this article, we will go over frequently used metrics for evaluating both classification and regression models.

Classification metrics

1. Classification Accuracy

Classification models try to predict the value of the target variable for a given observation. Classification accuracy tells us what fraction of those predictions is correct.

Classification accuracy is a simple yet effective metric. However, it can be misleading in some cases, especially when the target variable has an imbalanced class distribution.

Let’s say we are working on a binary classification task where 95% of the values in the target variable are 0 and the other 5% are 1. A model that always predicts 0 will have an accuracy of 95% in this case. Although 95% sounds like a reasonable value, it is hard to even call this a machine learning model because the output is always the same.
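As a minimal sketch of this pitfall (the 95/5 class split and the constant-zero predictor below are made up for illustration), we can confirm the accuracy with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Hypothetical imbalanced target: 95 zeros and 5 ones
actual_values = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
predictions = [0] * 100

accuracy_score(actual_values, predictions)
# output
0.95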

What if it is crucial to correctly detect the observations with a target value of 1 (e.g. cancer prediction)? In such cases, we need a different approach and a metric that focuses on the positive class.

2. Log Loss

The output of an algorithm can be a value that indicates the probability of an observation belonging to the positive class. These values can be converted to discrete outputs (i.e. 0 and 1) by setting a threshold value.

Classification accuracy is calculated from these discrete outputs. Log loss, on the other hand, compares the actual class labels with the predicted probabilities.

The comparison is quantified using cross-entropy based on the probability distributions of the actual labels and predicted probabilities.

When calculating the log loss, we take the negative of the natural log of the probability assigned to the actual class. The more confident the model is in a correct prediction, the lower the log loss.

Log loss offers a more thorough evaluation than classification accuracy. Consider two different models that output 0.9 and 0.8 for a data point that belongs to the positive class. If we set the threshold value to 0.5, both models predict the positive class, which is correct. Thus, there is no difference in model performance in terms of classification accuracy.

However, one of the models is more certain of the prediction which should make it a better choice. Log loss takes the difference in these probability values into consideration. The log loss associated with 0.9 is 0.10536 (-log(0.9) = 0.10536) whereas it is 0.22314 with 0.8 (-log(0.8)=0.22314). Thus, being 90% sure results in a lower log loss than being 80% sure.
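As a quick sketch of these numbers (using scikit-learn's log_loss on a single hypothetical observation from the positive class):

from sklearn.metrics import log_loss

# Probability of 0.9 assigned to an actual positive
log_loss([1], [0.9], labels=[0, 1])
# output
0.10536

# Probability of 0.8 assigned to an actual positive
log_loss([1], [0.8], labels=[0, 1])
# output
0.22314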

3. Confusion Matrix

The confusion matrix is not a metric itself, but it provides further insight into the predictions. With classification accuracy, a prediction is simply correct or incorrect. The confusion matrix goes one step further and categorizes predictions as true positives, true negatives, false positives, and false negatives.

  • True positive (TP): Predicting positive class as positive (ok)
  • False positive (FP): Predicting negative class as positive (not ok)
  • False negative (FN): Predicting positive class as negative (not ok)
  • True negative (TN): Predicting negative class as negative (ok)

The desired outcome is that the prediction and actual class are the same.
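A minimal sketch with scikit-learn's confusion_matrix, using the same hypothetical labels and predictions that appear in the precision and recall examples below:

from sklearn.metrics import confusion_matrix

actual_values = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predictions: [[TN, FP], [FN, TP]]
confusion_matrix(actual_values, predictions)
# output
array([[7, 2],
       [1, 5]])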

4. Precision

Precision measures how good our model is when the prediction is positive. The focus of precision is positive predictions, so it is the metric to emphasize when false positives are costly, as in the spam detection example discussed below. Precision is calculated as the ratio of correct positive predictions to all positive predictions: TP / (TP + FP).

Let’s go over a simple example. We have a binary target variable with 15 values and the associated predictions. We can calculate precision manually or just use the precision_score function in the metrics module of scikit-learn.

from sklearn.metrics import precision_score

actual_values = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]

# 5 of the 7 positive predictions are correct: 5 / 7
precision_score(actual_values, predictions)
# output
0.714

The focus of precision is the positive predictions. In this example, there are 7 positive predictions and 5 of them are correct. Thus, the precision score is 5/7 which is 0.714.

5. Recall

Recall measures how good our model is at correctly predicting positive classes. The focus of recall is actual positive classes. It indicates how many of the positive classes the model is able to predict correctly.

Recall is calculated by taking the ratio of the correct positive predictions to all actual positive values. The number of actual positives is the sum of the true positives and false negatives, so recall is TP / (TP + FN).

Let’s calculate the recall score of the same predictions used for the precision.

from sklearn.metrics import recall_score

actual_values = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]

# 5 of the 6 actual positives are predicted correctly: 5 / 6
recall_score(actual_values, predictions)
# output
0.833

The focus of recall is the actual positive classes. In this example, there are 6 observations that belong to the positive class and 5 of them are predicted correctly. Thus, the recall score is 5/6, which is about 0.833.

There is typically a trade-off between precision and recall: increasing one tends to decrease the other. Whether to prioritize precision or recall depends on the task.

For an email spam detection task, we try to maximize precision because we want to be correct when an email is detected as spam. We definitely do not want to label a normal email as spam (i.e. false positive). However, it does not hurt us much if the model fails to catch a few spam emails.

On the other hand, for a tumor detection task, we need to maximize recall because it is of vital importance to detect all positive values (i.e. has tumor). If the model misclassifies a negative (i.e. no tumor) example and predicts it as positive, it is still important but not as critical as missing positive values. Better safe than sorry!

6. F1 Score

F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). It is a more useful measure than classification accuracy for problems with an uneven class distribution because it takes both false positives and false negatives into account. F1 score takes a value between 0 and 1.
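Continuing with the same hypothetical labels and predictions used in the precision and recall examples, a minimal sketch with scikit-learn's f1_score:

from sklearn.metrics import f1_score

actual_values = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]

# Harmonic mean of precision (5/7) and recall (5/6)
f1_score(actual_values, predictions)
# output
0.769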

7. ROC Curve and AUC

The output of an algorithm can be a value that indicates the probability of an observation belonging to the positive class. These values can be converted to discrete outputs (i.e. 0 and 1) by setting a threshold value. Depending on this threshold value, the number of positive and negative class predictions change, which also changes the precision and recall values.

The ROC curve provides a summary of the performance by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold values. The true positive rate is the same as recall. The false positive rate is calculated as FP / (FP + TN).

The ideal case would be to maximize the true positive rate while keeping the false positive rate at a minimum. In practice, however, the two rates move in the same direction: lowering the threshold increases both, and raising it decreases both.

AUC (Area Under the Curve) is a metric that gives an aggregated evaluation of a model across all threshold values. It is the area under the ROC curve between the points (0,0) and (1,1). In general, the closer the AUC is to 1, the better the classification model.
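As a minimal sketch (the labels and predicted probabilities below are made up for illustration), scikit-learn's roc_curve returns the rates at each threshold and roc_auc_score returns the aggregated area:

from sklearn.metrics import roc_curve, roc_auc_score

actual_values = [0, 0, 1, 1, 0, 1, 0, 1]
# Hypothetical predicted probabilities of the positive class
predicted_probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

# False positive rate, true positive rate, and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(actual_values, predicted_probs)

roc_auc_score(actual_values, predicted_probs)
# output
0.875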

There is an important point to keep in mind, though: AUC is classification-threshold invariant, so for some specific tasks it is not the best metric to use.

Consider an email spam detection task where we do not want to have any false positives (i.e. a normal email is marked as spam). In this case, we try to maximize precision.

On the other hand, when working on a tumor detection problem, we cannot afford to have any false negatives (i.e. a tumor case is classified as normal). The optimal metric to use here is recall.

We should be free to adjust the classification threshold in such cases. Therefore, AUC is not a good choice since it is not affected by the threshold value.

Regression metrics

1. R-squared

R-squared is a metric to use with regression models. It measures how much of the variation in the target variable is explained by the model. For instance, an R-squared score of 0.9 means that 90% of the variability in the target variable is explained by the regression model. In general, the closer the R-squared score is to 1, the more accurate the model is.

Let’s do a simple example. We will use the r2_score function available in the metrics module of scikit-learn to calculate the R-squared of a list of price predictions.

from sklearn.metrics import r2_score

actual_prices = [24.6, 26.2, 18.5, 22.4, 35.1, 42.2, 28.1]
predictions = [23.8, 25.9, 19.1, 21.1, 36.4, 39.5, 28.9]

r2_score(actual_prices, predictions)
# output
0.968
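As a sketch of what r2_score computes under the hood (R-squared = 1 - SS_res / SS_tot), the value can be reproduced with NumPy on the same hypothetical prices:

import numpy as np

actual = np.array([24.6, 26.2, 18.5, 22.4, 35.1, 42.2, 28.1])
preds = np.array([23.8, 25.9, 19.1, 21.1, 36.4, 39.5, 28.9])

ss_res = np.sum((actual - preds) ** 2)          # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares

1 - ss_res / ss_tot
# output
0.968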

2. Mean Squared Error (MSE)

MSE is calculated by taking the average of the squared differences between the actual values of the target variable and the predicted values. For all predictions, we take the difference between the prediction and the actual value, square the difference, and then find the average.

Let’s calculate the MSE of the same price predictions that we used for the r-squared metric. We can use the mean_squared_error function in the metrics module.

from sklearn.metrics import mean_squared_error

actual_prices = [24.6, 26.2, 18.5, 22.4, 35.1, 42.2, 28.1]
predictions = [23.8, 25.9, 19.1, 21.1, 36.4, 39.5, 28.9]

mean_squared_error(actual_prices, predictions)
# output
1.771
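Equivalently, as a quick sanity check with NumPy on the same hypothetical prices:

import numpy as np

actual = np.array([24.6, 26.2, 18.5, 22.4, 35.1, 42.2, 28.1])
preds = np.array([23.8, 25.9, 19.1, 21.1, 36.4, 39.5, 28.9])

# Average of the squared differences
np.mean((actual - preds) ** 2)
# output
1.771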

3. Mean absolute percentage error (MAPE)

The value of MSE depends on the scale of the target variable, so MSE is not a great choice when it comes to comparing results across target variables with very different value ranges. For instance, an MSE of 2.5 for a target variable that ranges between 50 and 100 indicates a completely different performance than an MSE of 2.5 for a target variable that ranges between 500 and 1000.

Mean absolute percentage error (MAPE) is a better choice for such cases. It is calculated by averaging the absolute percentage errors: for each observation, we take the absolute difference between the actual value and the prediction, divide it by the actual value, and then average these ratios.

Let’s calculate the MAPE of the same price predictions as in the previous examples. We can use the mean_absolute_percentage_error function in the metrics module.

from sklearn.metrics import mean_absolute_percentage_error

actual_prices = [24.6, 26.2, 18.5, 22.4, 35.1, 42.2, 28.1]
predictions = [23.8, 25.9, 19.1, 21.1, 36.4, 39.5, 28.9]

mean_absolute_percentage_error(actual_prices, predictions)
# output
0.038
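The definition above can also be checked directly with NumPy (same hypothetical prices as before):

import numpy as np

actual = np.array([24.6, 26.2, 18.5, 22.4, 35.1, 42.2, 28.1])
preds = np.array([23.8, 25.9, 19.1, 21.1, 36.4, 39.5, 28.9])

# Average of |actual - predicted| / actual
np.mean(np.abs(actual - preds) / actual)
# output
0.038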

Conclusion

We have learned about 10 commonly used metrics for evaluating supervised learning models. Which one to use largely depends on the task at hand. Keep in mind that evaluating a machine learning model is an iterative process rather than a one-time action.

Thank you for reading. Please let me know if you have any feedback.
