Classification Model Evaluation

Published in Heartbeat

12 min read · May 15, 2018

What is Model Evaluation?

Model evaluation is the process of choosing between models, different model types, tuning parameters, and features. Better evaluation processes lead to better, more accurate models in your applications.

In this article, we’ll be discussing model evaluation for supervised classification models. We’ll cover evaluation procedures, evaluation metrics, and where to apply them.

Prerequisites

Python 3.x
Anaconda (Scikit Learn, Numpy, Pandas, Matplotlib, Seaborn)
Jupyter Notebook.
Basic understanding of supervised machine learning methods — specifically classification.

Recap

In my previous article series, I talked about how machine learning workflow can be performed for a classification task. We’ll take the same example and discuss in detail how the model evaluation can be applied to that.

You can find my previous articles below:

In this section we’ll recap the model selection process.

The complete workflow is explained in detail in the above posts.

First we’ll import the necessary libraries and then read the dataset using the read_csv function of pandas.

# Import pandas and read the data
import pandas as pd

diabetes = pd.read_csv('datasets/diabetes.csv')
diabetes.head(2)

Next we go through a quick data cleaning process to remove rows with unusual values (zeros in BloodPressure, BMI, or Glucose). The complete data cleaning process is described in part 1 of the series mentioned above.

# Remove unusual rows of data
diabetes_mod = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]

# Dimensions of data set after cleansing
print(diabetes_mod.shape)

Next we select the features that best represent the model. This step is explained in detail in part 2 of the series. We’ve selected the following features: ‘Pregnancies’, ‘Glucose’, ‘BMI’, ‘DiabetesPedigreeFunction’.

# Features/Response
feature_names = ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction']
X = diabetes_mod[feature_names]
y = diabetes_mod.Outcome

Finally, in part 2 of the series, after the hyper-parameter tuning phase, we’ve selected the logistic regression model with the given hyper-parameters.

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1, multi_class='ovr', penalty='l2', solver='liblinear')


Model Evaluation Procedures

Generally, we avoid training and testing a model on the same data because doing so rewards overfitting: a model can score well simply by memorizing the training data. Models that overfit the training data tend to perform poorly on out-of-sample data. To avoid this, we can use one of the following evaluation procedures.

Train/Test Split
K-Fold Cross Validation

Train/Test Split

This method splits the data set into two portions: a training set and a testing set. The training set is used to train the model. We can also measure the model's accuracy on the training set, but we shouldn't evaluate models based on this metric alone.

The testing set is only used to test the model and evaluate the accuracy after training. Data samples in the test set are never shown to the model during training. Accuracy on the test set provides a better indication of how models will perform on new data.

Pros: Flexible and fast.
Cons: Provides a high-variance estimate of out-of-sample accuracy, since the result depends on which samples happen to land in the test set.

The scikit-learn library provides a function called train_test_split, in the model_selection module, to divide the data into train and test sets. First we'll split the data into train and test sets. Then we'll use the train set to train the logistic regression model. Finally we'll predict on the test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

Finally, we calculate the performance of the model using the evaluation metric Classification Accuracy (which we’ll discuss in detail in an upcoming section). We get an accuracy score of 0.795 or 79.5%.

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy {}".format(accuracy))

OUTPUT :
Accuracy 0.7955801104972375
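
The stratify=y argument in the call above asks train_test_split to keep the class proportions roughly the same in the training and testing sets. The plain-Python sketch below illustrates only that idea; it is not scikit-learn's implementation. The function name stratified_split, the toy labels, and the 25% test fraction (chosen to mirror scikit-learn's default test size) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        # Shuffle within each class, then carve off the test share.
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 samples of class 0 and 40 samples of class 1.
labels = [0] * 80 + [1] * 40
train_idx, test_idx = stratified_split(labels)

# Share of class 1 in each portion; both match the full data set (1/3).
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

Without stratification, a random split of a small or imbalanced data set can leave the test set with noticeably more or fewer positives than the training set, which makes the accuracy estimate even noisier.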

K-Fold Cross Validation

This method splits the data set into K equal partitions (“folds”), then uses 1 fold as the testing set and the union of the other folds as the training set.

These steps are repeated K times, using a different fold as the testing set each time. The average of the K testing accuracies is then used as the estimate of out-of-sample accuracy.

Pros: More accurate estimate of out-of-sample accuracy. More "efficient" use of data (every observation is used for both training and testing).
Cons: Much slower than train/test split.
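
To make the fold rotation concrete, here is a minimal plain-Python sketch of how K-fold index sets could be generated. It assumes contiguous, unshuffled folds, and the function name k_fold_indices is hypothetical; scikit-learn's own splitters offer shuffling and stratification that this sketch leaves out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration. When n_samples is not divisible
    by k, the first folds each receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # The current fold is the testing set; everything else is training.
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the testing set exactly once and in the training set K - 1 times, which is why every observation is used for both training and testing.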

For cross validation, scikit-learn provides the function cross_val_score, which is also in the model_selection module. We pass the logistic regression model along with the features X and responses y as arguments, and the function performs a 10-fold cross validation using classification accuracy as the scoring method. We get a mean accuracy of 78%.

from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()
print("Accuracy {}".format(accuracy))

OUTPUT :
Accuracy 0.7805877119643279

Model Evaluation Metrics

A model evaluation metric is a criterion by which the performance or the accuracy of a model is measured.

In the upcoming sections we will discuss evaluation metrics in detail.

Classification Accuracy

Classification accuracy is by far the most common model evaluation metric used for classification problems. Classification accuracy is the percentage of correct predictions.

Even though classification accuracy is a good metric, it can give a false sense of high performance when the class distribution is imbalanced.

Scikit-learn provides a separate function to evaluate accuracy, accuracy_score, in the metrics module. Accuracy is also available through cross_val_score: passing scoring='accuracy' selects classification accuracy as the metric.

The classification accuracy metric works better if there is an equal number of samples in each class.
For example, if 90% of the samples belong to class A and 10% to class B, a model would reach 90% training accuracy just by predicting every sample as class A.
However, if the same model is applied to a dataset with a different class distribution (60% of samples in class A and 40% in class B), the test accuracy would drop to 60%.
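
The example above can be checked with a few lines of plain Python. This is a sketch on made-up labels; the helper names accuracy and always_a are hypothetical and stand in for a "model" that ignores its input.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def always_a(samples):
    """A degenerate 'model' that predicts class A for every sample."""
    return ["A"] * len(samples)


# 90% class A / 10% class B, as in the training scenario above.
imbalanced = ["A"] * 90 + ["B"] * 10
# 60% class A / 40% class B, as in the shifted test scenario.
shifted = ["A"] * 60 + ["B"] * 40

print(accuracy(imbalanced, always_a(imbalanced)))  # 0.9
print(accuracy(shifted, always_a(shifted)))        # 0.6
```

The model has learned nothing about class B in either case; only the class distribution changed.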

We already looked into classification accuracy using Scikit Learn in the Model Evaluation Procedures section.

Confusion Matrix

A confusion matrix can be defined loosely as a table that describes the performance of a classification model on a set of test data for which the true values are known. A confusion matrix is highly interpretable and can be used to derive a number of other metrics.

Scikit-learn provides a function to compute the confusion matrix on the testing data set. The confusion_matrix function takes the actual response class values and the predicted values.

from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(y_test, y_pred)
print(confusion)

Fig — Confusion Matrix

Since our problem has only two response classes, it is a binary classification problem, so the confusion matrix is a 2 x 2 grid. Different implementations lay out the confusion matrix differently. In scikit-learn, rows correspond to the actual classes and columns to the predicted classes; see the confusion_matrix documentation for details.

The raw matrix above is hard to read at a glance. Therefore we'll plot it using the sample plot_confusion_matrix helper from the scikit-learn examples.

plot_confusion_matrix(confusion, classes=['Non Diabetic', 'Diabetic'], title='Confusion matrix')

The basic terminology related to the confusion matrix is as follows. We’ll interpret with regards to our problem.

True Positives (TP) : Correct prediction as Diabetic
True Negatives (TN) : Correct prediction as Non-diabetic
False Positives (FP) : Incorrect prediction as Diabetic (‘Type I error’)
False Negatives (FN) : Incorrect prediction as Non-diabetic (‘Type II error’)
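
To make the four terms concrete, the following plain-Python sketch counts them directly from a pair of label lists. The function name confusion_counts and the toy labels are assumptions for illustration (1 stands for diabetic, 0 for non-diabetic); this is not how scikit-learn computes its matrix internally.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem.

    Hypothetical helper for illustration.
    """
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted diabetic, actually diabetic
            else:
                fp += 1  # predicted diabetic, actually non-diabetic
        else:
            if actual == positive:
                fn += 1  # predicted non-diabetic, actually diabetic
            else:
                tn += 1  # predicted non-diabetic, actually non-diabetic
    return tp, tn, fp, fn


# Toy labels, made up for this sketch.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true_toy, y_pred_toy)
print(tp, tn, fp, fn)  # 3 3 1 1
```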

Metrics computed from the confusion matrix

First we'll parse the obtained confusion matrix into True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

# True Positives
TP = confusion[1, 1]
# True Negatives
TN = confusion[0, 0]
# False Positives
FP = confusion[0, 1]
# False Negatives
FN = confusion[1, 0]

We can calculate the following metrics from the confusion matrix.

Classification accuracy

Classification accuracy is the ratio of correct predictions to the total no. of predictions. Or more simply, how often is the classifier correct.

Fig — Accuracy

We can calculate the accuracy using the confusion matrix. Following is the equation to calculate the accuracy using the confusion matrix:

Fig — Accuracy using Confusion Matrix

Accuracy can also be calculated using the method accuracy_score. We can observe that the accuracy is 0.795.

print((TP + TN) / float(TP + TN + FP + FN))
print(accuracy_score(y_test, y_pred))

OUTPUT :
0.795580110497
0.795580110497

Sensitivity/Recall

Sensitivity or recall is the ratio of correct positive predictions to the total number of actual positive instances. Or more simply, how sensitive the classifier is for detecting positive instances. This is also called the True Positive Rate.

Fig — Recall

Using the confusion matrix recall can be calculated as follows:

Fig — Recall using Confusion Matrix

Also, Scikit-learn provides a method called recall_score to find the recall score. We can observe that the classifier has a recall score of 0.58.

from sklearn.metrics import recall_score

print(TP / float(TP + FN))
print(recall_score(y_test, y_pred))

OUTPUT :
0.58064516129
0.58064516129

Specificity

Specificity is the ratio of correct negative predictions to the total number of actual negative instances. This measures how well the classifier identifies negative instances. It is also called the True Negative Rate.

Fig — Specificity

We can calculate specificity using the confusion matrix as follows.

Fig — Specificity using Confusion Matrix

print(TN / float(TN + FP))

OUTPUT :
0.90756302521

False Positive Rate

The false positive rate is the ratio of actual negative instances that were incorrectly predicted as positive to the total number of actual negative instances. Or, when the actual value is negative, how often is the prediction incorrect. It equals 1 minus the specificity.

Fig — False Positive Rate

This can be calculated using the confusion matrix as follows:

Fig — False Positive Rate using Confusion Matrix

print(FP / float(TN + FP))

OUTPUT :
0.0924369747899

Precision

Precision is the ratio of correct positive predictions to the total number of positive predictions. This measures how precise the classifier is when predicting positive instances.

Fig — Precision

This can be calculated from the confusion matrix as follows:

Fig — Precision using Confusion Matrix

Scikit-learn provides the function precision_score to calculate precision. We can observe that the precision is approximately 0.77.

from sklearn.metrics import precision_score

print(TP / float(TP + FP))
print(precision_score(y_test, y_pred))

OUTPUT :
0.765957446809
0.765957446809

Confusion matrix advantages:

A variety of metrics can be derived from it.
Useful for multi-class problems as well.
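
The metrics from this section can be gathered into one small helper. The sketch below is plain Python; the function name confusion_metrics is hypothetical. The counts passed in (TP = 36, TN = 108, FP = 11, FN = 26) are an inference: they are the counts that reproduce the outputs printed above, since the confusion matrix figure itself is not reproduced in this text.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the metrics discussed above from the four confusion-matrix counts.

    Hypothetical helper for illustration.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),               # sensitivity, true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (tn + fp),  # equals 1 - specificity
        "precision": tp / (tp + fp),
    }


# Counts inferred from the printed outputs above (an assumption, see lead-in).
metrics = confusion_metrics(tp=36, tn=108, fp=11, fn=26)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Keeping the formulas side by side makes the relationships easy to see, for example that the false positive rate and specificity always sum to 1.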

NOTE : Choosing which metric to use depends on the business objective or the nature of the problem.

Adjusting Classification Threshold

It’s possible to adjust the logistic regression model’s classification threshold to increase the model’s sensitivity.

After training, the model exposes a method called predict_proba, which returns the probability of each test sample belonging to each response class. From this, we'll take the predicted probabilities of the diabetic class.

# store the predicted probabilities for class 1 (diabetic)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

Next we’ll plot the probability of becoming diabetic in a histogram.

plt.hist(y_pred_prob, bins=8, linewidth=1.2)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')

Since this is a binary classification problem, the default classification probability threshold is 0.5: if the predicted probability is less than 0.5, the sample is classified as "0 (non-diabetic)", and if it is more than 0.5, it is classified as "1 (diabetic)".

We can use scikit-learn's binarize function to set the threshold to 0.3, so that a sample is classified as "1 (diabetic)" if its probability is greater than 0.3 and as "0 (non-diabetic)" otherwise.

# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize

# threshold is passed by keyword, as newer scikit-learn versions require
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]

Next we’ll print the confusion matrix for the new threshold predictions, and compare with the original.

# new confusion matrix (threshold of 0.3)
confusion_new = confusion_matrix(y_test, y_pred_class)
print(confusion_new)

Fig — New Confusion Matrix

TP = confusion_new[1, 1]
TN = confusion_new[0, 0]
FP = confusion_new[0, 1]
FN = confusion_new[1, 0]

Next we’ll calculate sensitivity and specificity to observe the changes from the previous confusion matrix calculations.

Previously the sensitivity calculated was 0.58. We can observe that the sensitivity has increased, which means the classifier now detects more of the "positive (diabetic)" instances.

# sensitivity has increased
print(TP / float(TP + FN))
print(recall_score(y_test, y_pred_class))

OUTPUT :
0.870967741935
0.870967741935

Using the same process, we can calculate the specificity for the new confusion matrix. Previously it was 0.91. We observe that it has decreased.

# specificity has decreased
print(TN / float(TN + FP))

OUTPUT :
0.689075630252

We adjust the threshold of a classifier in order to suit the problem we’re trying to solve.

In the case of a spam filter (positive class is "spam"), the classifier should be optimized for precision. This means it's more acceptable to have false negatives (spam goes to the inbox) than false positives (non-spam is caught by the spam filter).
In the case of a fraudulent transaction detector (positive class is "fraud"), the classifier should be optimized for sensitivity, which means it's more acceptable to have false positives (normal transactions that are flagged as possible fraud) than false negatives (fraudulent transactions that are not detected).
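
The trade-off described in this section can be reproduced on a handful of made-up probabilities. The sketch below is plain Python; the helper names and the toy data are assumptions for illustration, and the strict "greater than" comparison mirrors the thresholding rule described above.

```python
def predict_with_threshold(probs, threshold):
    """Classify as 1 when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data, made up for this sketch: 4 positives and 6 negatives.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.3):
    preds = predict_with_threshold(toy_probs, threshold)
    sens, spec = sensitivity_specificity(toy_true, preds)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold from 0.5 to 0.3 catches one more positive (sensitivity rises from 0.50 to 0.75) at the cost of one more false alarm (specificity falls from 0.83 to 0.67), which is the same pattern seen with the diabetes classifier.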

ROC curve

An ROC curve is a commonly used way to visualize the performance of a binary classifier, meaning a classifier with two possible output classes. The curve plots the True Positive Rate (Recall) against the False Positive Rate (also interpreted as 1-Specificity).

Scikit-learn provides a method called roc_curve to find the false positive and true positive rates across various thresholds, which we can use to draw the ROC curve. We can plot the curve as follows.

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

The ROC curve itself does not show the thresholds that were used to generate it. But we can use the following function to find the sensitivity and specificity at a given threshold.

def evaluate_threshold(threshold):
    # Take the last ROC point whose threshold is still above the requested one
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

The following is an example to show how the sensitivity and specificity behave with several thresholds.

evaluate_threshold(0.3)

OUTPUT :
Sensitivity: 0.870967741935
Specificity: 0.705882352941

evaluate_threshold(0.5)

OUTPUT :
Sensitivity: 0.58064516129
Specificity: 0.90756302521
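
To show what the points on an ROC curve represent, here is a minimal plain-Python sketch that sweeps the threshold over every distinct score and records one (FPR, TPR) point per threshold. The function name roc_points and the toy data are assumptions for illustration; scikit-learn's roc_curve is more efficient and may drop redundant points, so its output will not match this sketch point for point.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per distinct threshold, high to low.

    Hypothetical helper for illustration. A sample is predicted positive
    when its score is greater than or equal to the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]  # start at the origin: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


# Toy data, made up for this sketch: 3 positives and 3 negatives.
roc_true = [1, 1, 0, 1, 0, 0]
roc_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr_toy, tpr_toy = roc_points(roc_true, roc_scores)
for x, y in zip(fpr_toy, tpr_toy):
    print(f"FPR={x:.2f}  TPR={y:.2f}")
```

As the threshold is lowered, both rates can only stay the same or increase, so the curve always runs from (0, 0) to (1, 1).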

The ROC curve is a reliable indicator of classifier performance. It can also be extended to classification problems with three or more classes using the "one versus all" approach.

AUC (Area Under the Curve)

AUC or Area Under the Curve is the percentage of the ROC plot that is underneath the curve. AUC is useful as a single number summary of classifier performance.

In Scikit-learn, we can find the AUC score using the method roc_auc_score.

print(roc_auc_score(y_test, y_pred_prob))

OUTPUT :
0.858769314177
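
One way to build intuition for this number: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as half. The plain-Python sketch below computes it that way on made-up data. The function name auc_by_pairs is hypothetical, and the pairwise loop is meant for illustration rather than for large data sets.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration. Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, made up for this sketch: 3 positives and 3 negatives.
auc_true = [1, 1, 0, 1, 0, 0]
auc_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(auc_by_pairs(auc_true, auc_scores))
```

An AUC of 0.5 corresponds to random ranking, and 1.0 to a classifier that scores every positive above every negative.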

Also, the cross_val_score method, which is used to perform the K-fold cross validation method, comes with the option to pass roc_auc as the scoring parameter. Therefore, we can measure the AUC score using the cross validation procedure as well.

cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

OUTPUT :
0.83743085106382975

ROC/AUC advantages:

Setting a classification threshold is not required.
Useful even when there is a high class imbalance.

Summary

In this article, we explored the evaluation of classification models. We discussed the need for an evaluation of a model, and main model evaluation procedures that are used such as “train/test split” and “k-fold cross validation”.

Next we talked about model evaluation metrics in detail along with code samples using Scikit-learn. We discussed, in detail: "classification accuracy", "confusion matrix", "ROC curve" and "area under the curve".

Now you should be able to confidently evaluate a classification model and choose the best performing model for a given dataset using the knowledge gained from this article.

Source code that created this post can be found in the GitHub repository below.

LahiruTjay/Machine-Learning-With-Python (github.com): this repository contains various machine learning examples done with Python.

If you have any problems or questions regarding this article, please don’t hesitate to leave a comment below or drop me an email: lahiru.tjay@gmail.com

Hope you enjoyed the article. Cheers!

