Classification Model Evaluation
What is Model Evaluation?
Model evaluation is the process of choosing between models, different model types, tuning parameters, and features. Better evaluation processes lead to better, more accurate models in your applications.
In this article, we’ll be discussing model evaluation for supervised classification models. We’ll cover evaluation procedures, evaluation metrics, and where to apply them.
Prerequisites
- Python 3.+
- Anaconda (Scikit Learn, Numpy, Pandas, Matplotlib, Seaborn)
- Jupyter Notebook.
- Basic understanding of supervised machine learning methods — specifically classification.
Recap
In my previous article series, I walked through a machine learning workflow for a classification task. We’ll take the same example and discuss in detail how model evaluation can be applied to it.
You can find my previous articles below:
- Machine Learning Workflow on Diabetes Data : Part 01
- Machine Learning Workflow on Diabetes Data : Part 02
In this section we’ll recap the model selection process.
The complete workflow is explained in detail in the above posts.
First we’ll import the necessary libraries and then read the dataset using the read_csv function of pandas.
# Import libraries
import pandas as pd
# Read data
diabetes = pd.read_csv('datasets/diabetes.csv')
diabetes.head(2)
Next we go through a quick data cleaning step to remove rows with implausible values (a zero for blood pressure, BMI, or glucose). The complete data cleaning process is described in part 1 of the series above.
# Remove unusual rows of data
diabetes_mod = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]
# Dimensions of data set after cleansing
print(diabetes_mod.shape)
Next we select the features that best represent the model. This step is explained in detail in part 2 of the series. We’ve selected the following features: ‘Pregnancies’, ‘Glucose’, ‘BMI’, ‘DiabetesPedigreeFunction’.
# Features/Response
feature_names = ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction']
X = diabetes_mod[feature_names]
y = diabetes_mod.Outcome
Finally, in part 2 of the series, after the hyper-parameter tuning phase, we’ve selected the logistic regression model with the given hyper-parameters.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1, multi_class='ovr', penalty='l2', solver='liblinear')
Model Evaluation Procedures
Generally, we avoid training and testing a model on the same data, because a score measured on the training data rewards overfitting. Models that overfit the training data tend to perform poorly on out-of-sample data. To get a more realistic estimate, we can use one of the following evaluation procedures.
- Train/Test Split
- K-Fold Cross Validation
Train/Test Split
This method splits the data set into two portions : a training set and a testing set. The training set is used to train the model. We can also measure the model’s accuracy on the training set, but we shouldn’t evaluate models based on this metric alone.
The testing set is only used to test the model and evaluate the accuracy after training. Data samples in the test set are never shown to the model during training. Accuracy on the test set provides a better indication of how models will perform on new data.
Pros : Train/test split is still useful because of its flexibility and speed
Cons : Provides a high-variance estimate of out-of-sample accuracy
The scikit-learn library provides a function called train_test_split in the model_selection module to divide the data into train and test sets. First we’ll split the data into train and test sets. Then we’ll use the train set to train the logistic regression model. Finally, we’ll predict on the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
Finally, we calculate the performance of the model using the evaluation metric classification accuracy (which we’ll discuss in detail in an upcoming section). We get an accuracy score of about 0.796, or 79.6%.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy {}".format(accuracy))
OUTPUT :
Accuracy 0.7955801104972375
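To make the idea of a stratified split concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function name stratified_split, the toy labels, and the 25% test fraction (chosen to mirror train_test_split's default) are assumptions for illustration only.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 12 negatives and 8 positives.
toy_labels = [0] * 12 + [1] * 8
train_idx, test_idx = stratified_split(toy_labels)

test_labels = [toy_labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))               # 15 5
print(test_labels.count(0), test_labels.count(1))  # 3 2
```

Because each class is split on its own, the test set keeps the same 60/40 class ratio as the full toy data, which is the effect the stratify argument is meant to have.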
K-Fold Cross Validation
This method splits the data set into K equal partitions (“folds”), then uses 1 fold as the testing set and the union of the other folds as the training set.
The process repeats these steps K times, using a different fold as the testing set each time. The average of the K testing accuracies is reported as the estimate of out-of-sample accuracy.
Pros : More accurate estimate of out-of-sample accuracy. More “efficient” use of data (every observation is used for both training and testing)
Cons : Much slower than Train/Test split.
For cross validation, Scikit-learn provides the function cross_val_score, which is also in the model_selection module. We pass the logistic regression model with the features X and responses y as parameters. The function then performs a 10-fold cross validation, using classification accuracy as the scoring method. We get a mean accuracy of 78%.
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()
print("Accuracy {}".format(accuracy))
OUTPUT :
Accuracy 0.7805877119643279
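The K-fold procedure can be sketched by hand in a few lines. This is only an illustration under stated assumptions: the folds are contiguous and unshuffled, the "model" is a stand-in that predicts the majority class of the training folds, and the labels are made up. cross_val_score itself uses stratified folds for classifiers when cv is an integer, so it will not partition the data exactly like this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def majority_class(labels):
    """Stand-in 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)

# Toy labels (made up); a real run would fit the classifier on each training fold.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(toy_labels), k=5):
    prediction = majority_class([toy_labels[i] for i in train_idx])
    correct = sum(toy_labels[i] == prediction for i in test_idx)
    fold_scores.append(correct / len(test_idx))

mean_accuracy = sum(fold_scores) / len(fold_scores)
print(fold_scores)    # [1.0, 0.5, 0.5, 0.5, 1.0]
print(mean_accuracy)  # 0.7
```

Every sample lands in a test fold exactly once, and the spread of the per-fold scores shows why a single train/test split can give a noisy estimate.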
Model Evaluation Metrics
A model evaluation metric is a criterion by which the performance or the accuracy of a model is measured.
In the upcoming sections we will discuss evaluation metrics in detail.
Classification Accuracy
Classification accuracy is by far the most common model evaluation metric used for classification problems. Classification accuracy is the percentage of correct predictions.
Even though classification accuracy is a good metric, it can give a false sense of high performance when the class distribution is imbalanced.
Scikit-learn provides a separate function to evaluate the accuracy, which is accuracy_score in the metrics module. Accuracy is also available in cross_val_score: passing scoring='accuracy' selects classification accuracy as the metric.
The classification accuracy metric works better if there is an equal number of samples in each class.
For example, if 90% of the samples belong to class A and 10% to class B, a model can reach 90% training accuracy just by predicting every sample as class A.
However, if the same model is applied to a dataset with a different class distribution (60% of samples in class A and 40% in class B), the test accuracy would drop to 60%.
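The arithmetic in this example can be checked with a short sketch. The always_a function below is a hypothetical degenerate "model" that ignores its input, and both label lists are made up to match the 90/10 and 60/40 distributions described above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def always_a(samples):
    """Degenerate 'model' that predicts class A for every sample."""
    return ["A"] * len(samples)

imbalanced = ["A"] * 90 + ["B"] * 10  # 90% class A, 10% class B
shifted = ["A"] * 60 + ["B"] * 40     # 60% class A, 40% class B

acc_imbalanced = accuracy(imbalanced, always_a(imbalanced))
acc_shifted = accuracy(shifted, always_a(shifted))
print(acc_imbalanced)  # 0.9
print(acc_shifted)     # 0.6
```

The model has learned nothing about class B in either case; only the class distribution changed, and the accuracy changed with it.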
We already looked into classification accuracy using Scikit Learn in the Model Evaluation Procedures section.
Confusion Matrix
A confusion matrix can be defined loosely as a table that describes the performance of a classification model on a set of test data for which the true values are known. A confusion matrix is highly interpretative and can be used to estimate a number of other metrics.
Scikit-learn provides a function to compute the confusion matrix on the testing data set. The confusion_matrix function requires the actual response class values and the predicted values to determine the matrix.
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_test, y_pred)
print(confusion)
Since our problem has only two response classes, it is a binary classification problem, so the confusion matrix is a 2 x 2 grid. Different implementations lay the matrix out differently; in Scikit-learn, rows correspond to the actual classes and columns to the predicted classes, as described in the confusion_matrix documentation.
The raw matrix printed above is hard to read at a glance. Therefore we’ll plot it using the plot_confusion_matrix helper function from the Scikit-learn examples.
# plot_confusion_matrix is the helper defined in the Scikit-learn example, not a library import
plot_confusion_matrix(confusion, classes=['Non Diabetic', 'Diabetic'], title='Confusion matrix')
The basic terminology related to the confusion matrix is as follows. We’ll interpret with regards to our problem.
- True Positives (TP) : Correct prediction as Diabetic
- True Negatives (TN) : Correct prediction as Non-diabetic
- False Positives (FP) : Incorrect prediction as Diabetic (‘Type I error’)
- False Negatives (FN) : Incorrect prediction as Non-diabetic (‘Type II error’)
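To see how the four cells are filled, here is a minimal sketch that counts them by hand. The function name confusion_counts and the toy labels are made up for illustration; the layout (rows are actual classes, columns are predicted classes) follows the convention Scikit-learn uses.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual class, column = predicted class
    return matrix

# Toy labels (made up): 1 = diabetic, 0 = non-diabetic.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

toy_matrix = confusion_counts(y_true_toy, y_pred_toy)
(tn, fp), (fn, tp) = toy_matrix
print(toy_matrix)  # [[5, 1], [2, 2]]
```

Here one non-diabetic sample is wrongly flagged as diabetic (a false positive) and two diabetic samples are missed (false negatives).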
Metrics computed from the confusion matrix
First we’ll parse the obtained confusion matrix into True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
# True Positives
TP = confusion[1, 1]
# True Negatives
TN = confusion[0, 0]
# False Positives
FP = confusion[0, 1]
# False Negatives
FN = confusion[1, 0]
We can calculate the following metrics from the confusion matrix.
Classification accuracy
Classification accuracy is the ratio of correct predictions to the total number of predictions. Or more simply, how often the classifier is correct.
We can calculate the accuracy from the confusion matrix using the following equation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy can also be calculated using the function accuracy_score. We can observe that the accuracy is about 0.796.
print((TP + TN) / float(TP + TN + FP + FN))
print(accuracy_score(y_test, y_pred))
OUTPUT :
0.795580110497
0.795580110497
Sensitivity/Recall
Sensitivity or recall is the ratio of correct positive predictions to the total number of actual positive instances. Or more simply, how sensitive the classifier is in detecting positive instances. This is also called the True Positive Rate.
Using the confusion matrix, recall can be calculated as follows:
Recall = TP / (TP + FN)
Also, Scikit-learn provides a function called recall_score to find the recall score. We can observe that the classifier has a recall score of 0.58.
from sklearn.metrics import recall_score
print(TP / float(TP + FN))
print(recall_score(y_test, y_pred))
OUTPUT :
0.58064516129
0.58064516129
Specificity
Specificity is the ratio of correct negative predictions to the total number of actual negative instances. This measures how well the classifier recognizes negative instances, that is, how rarely it raises a false alarm.
We can calculate specificity using the confusion matrix as follows.
print(TN / float(TN + FP))
OUTPUT :
0.90756302521
False Positive Rate
The false positive rate is the ratio of actual negative instances that were incorrectly predicted as positive to the total number of actual negative instances. Or, when the actual value is negative, how often the prediction is incorrect.
This can be calculated using the confusion matrix as follows:
print(FP / float(TN + FP))
OUTPUT :
0.0924369747899
Precision
Precision is the ratio of correct positive predictions to the total number of positive predictions. This measures how precise the classifier is when predicting positive instances.
This can be calculated from the confusion matrix as follows:
Precision = TP / (TP + FP)
Scikit-learn provides the function precision_score to calculate precision. We can observe that the precision is 0.77.
from sklearn.metrics import precision_score
print(TP / float(TP + FP))
print(precision_score(y_test, y_pred))
OUTPUT :
0.765957446809
0.765957446809
Confusion matrix advantages:
- Variety of metrics can be derived.
- Useful for multi-class problems as well.
NOTE : Choosing which metric to use depends on the business objective or the nature of the problem.
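As a recap, all of the metrics above can be computed from the four counts in one place. The counts below are not printed in the text of this article; they are inferred from the reported ratios, assuming a test set of 181 samples, so treat them as a reconstruction. The helper safe_ratio is a made-up name.

```python
# Counts inferred from the ratios reported in this article (assumed 181 test samples).
TN, FP, FN, TP = 108, 11, 26, 36

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

metrics = {
    "accuracy": safe_ratio(TP + TN, TP + TN + FP + FN),
    "sensitivity": safe_ratio(TP, TP + FN),          # recall / true positive rate
    "specificity": safe_ratio(TN, TN + FP),
    "false_positive_rate": safe_ratio(FP, TN + FP),  # equals 1 - specificity
    "precision": safe_ratio(TP, TP + FP),
}

for name, value in metrics.items():
    print("{}: {:.4f}".format(name, value))
```

The guard against a zero denominator matters in practice: a classifier that never predicts the positive class has TP + FP = 0, and precision is then undefined.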
Adjusting Classification Threshold
It’s possible to adjust the logistic regression model’s classification threshold to increase the model’s sensitivity.
After training, the model exposes a method called predict_proba, which returns the predicted probability of each response class for every sample in the test data. From this, we’ll take the probabilities of the diabetic class.
# store the predicted probabilities for class 1 (diabetic)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
Next we’ll plot the predicted probabilities of the diabetic class in a histogram.
import matplotlib.pyplot as plt
plt.hist(y_pred_prob, bins=8, linewidth=1.2)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
For a binary classification problem, the default classification threshold is 0.5, which means if the predicted probability is less than 0.5, the sample is classified as “0 (non-diabetic)”. If the probability is more than 0.5, it’s classified as “1 (diabetic)”.
We can use Scikit-learn’s binarize function to set the threshold to 0.3, which classifies a sample as ‘0 (non-diabetic)’ if the probability is 0.3 or less, and as ‘1 (diabetic)’ if it’s greater.
# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]
Next we’ll print the confusion matrix for the new threshold predictions, and compare with the original.
# new confusion matrix (threshold of 0.3)
confusion_new = confusion_matrix(y_test, y_pred_class)
print(confusion_new)
TP = confusion_new[1, 1]
TN = confusion_new[0, 0]
FP = confusion_new[0, 1]
FN = confusion_new[1, 0]
Next we’ll calculate sensitivity and specificity to observe the changes from the previous confusion matrix calculations.
Previously the sensitivity calculated was 0.58. We can observe that the sensitivity has increased, which means the classifier now detects more of the “positive (diabetic)” instances.
# sensitivity has increased
print(TP / float(TP + FN))
print(recall_score(y_test, y_pred_class))
OUTPUT :
0.870967741935
0.870967741935
Using the same process, we can calculate the specificity for the new confusion matrix. Previously it was 0.91. We observe that it has decreased.
# specificity has decreased
print(TN / float(TN + FP))
OUTPUT :
0.689075630252
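The trade-off can be reproduced on a tiny example without any library. This is a sketch on made-up labels and probabilities, not the diabetes data; the function names are hypothetical, and the strict "greater than" comparison is chosen to match how binarize treats its threshold.

```python
def classify(probabilities, threshold):
    """Label a sample 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 0 and 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Toy data (made up): true labels and predicted probabilities of class 1.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_prob_toy = [0.10, 0.20, 0.35, 0.40, 0.55, 0.32, 0.45, 0.60, 0.80, 0.05]

results = {}
for threshold in (0.5, 0.3):
    y_pred_toy = classify(y_prob_toy, threshold)
    results[threshold] = sensitivity_specificity(y_true_toy, y_pred_toy)
    print(threshold, results[threshold])
```

Lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.5 to 1.0 on this toy data, while specificity falls from about 0.83 to 0.5, the same direction of change we saw on the diabetes test set.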
We adjust the threshold of a classifier in order to suit the problem we’re trying to solve.
In the case of a spam filter (positive class is spam), optimization needs to be done for precision. This means it’s more acceptable to have false negatives (spam goes to the inbox) than false positives (non-spam is caught by the spam filter).
In the case of a fraudulent transaction detector (positive class is “fraud”), optimization is to be done for sensitivity, which means it’s more acceptable to have false positives (normal transactions that are flagged as possible fraud) than false negatives (fraudulent transactions that are not detected).
ROC curve
An ROC curve is a commonly used way to visualize the performance of a binary classifier, meaning a classifier with two possible output classes. The curve plots the True Positive Rate (Recall) against the False Positive Rate (also interpreted as 1-Specificity).
Scikit-learn provides a function called roc_curve to find the false positive and true positive rates across various thresholds, which we can use to draw the ROC curve. We can plot the curve as follows.
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
The ROC curve itself does not show which threshold produced each point. But we can use the following function to look up the sensitivity and specificity at a given threshold.
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])
The following is an example to show how the sensitivity and specificity behave with several thresholds.
evaluate_threshold(0.3)
OUTPUT :
Sensitivity: 0.870967741935
Specificity: 0.705882352941
evaluate_threshold(0.5)
OUTPUT :
Sensitivity: 0.58064516129
Specificity: 0.90756302521
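To show where the points on an ROC curve come from, here is a minimal sketch that sweeps the threshold over every distinct score and records one (false positive rate, true positive rate) pair per threshold. It is a simplified illustration on four made-up samples, not the roc_curve implementation; the function name roc_points is hypothetical.

```python
def roc_points(y_true, y_prob):
    """Return (fpr, tpr) pairs, one per distinct threshold, strictest first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(y_prob), reverse=True):
        # A sample counts as predicted positive when its score is >= threshold.
        tp = sum(t == 1 and p >= threshold for t, p in zip(y_true, y_prob))
        fp = sum(t == 0 and p >= threshold for t, p in zip(y_true, y_prob))
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy data (made up): two negatives and two positives.
y_true_toy = [0, 0, 1, 1]
y_prob_toy = [0.1, 0.4, 0.35, 0.8]

toy_points = roc_points(y_true_toy, y_prob_toy)
print(toy_points)
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Each step to a lower threshold can only move the curve up (another positive detected) or to the right (another negative misclassified), which is why the curve runs from the bottom-left to the top-right corner.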
The ROC curve is a reliable indicator of a classifier’s performance. It can also be extended to classification problems with three or more classes using the “one versus all” approach.
AUC (Area Under the Curve)
AUC or Area Under the Curve is the percentage of the ROC plot that is underneath the curve. AUC is useful as a single number summary of classifier performance.
In Scikit-learn, we can find the AUC score using the function roc_auc_score.
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_prob))
OUTPUT :
0.858769314177
Also, the cross_val_score function, which is used to perform K-fold cross validation, comes with the option to pass roc_auc as the scoring parameter. Therefore, we can measure the AUC score using the cross validation procedure as well.
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
OUTPUT :
0.83743085106382975
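One way to build intuition for the AUC number is to compute it two ways on a toy example: as the area under the ROC points using the trapezoid rule, and as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. This is a sketch on made-up data with hypothetical function names; the ROC points are written out by hand for these four samples.

```python
def auc_by_pairs(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def auc_by_trapezoids(points):
    """Area under a piecewise-linear curve given as (fpr, tpr) pairs sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data (made up): two negatives and two positives.
y_true_toy = [0, 0, 1, 1]
y_prob_toy = [0.1, 0.4, 0.35, 0.8]
# ROC points for the toy data, listed by hand from the strictest to the loosest threshold.
toy_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

pair_auc = auc_by_pairs(y_true_toy, y_prob_toy)
trapezoid_auc = auc_by_trapezoids(toy_points)
print(pair_auc, trapezoid_auc)  # 0.75 0.75
```

Both views give 0.75 here: three of the four positive/negative pairs are ranked correctly. Neither computation needs a fixed classification threshold.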
ROC/AUC advantages:
- Setting a classification threshold is not required.
- Useful even when there is a high class imbalance.
Summary
In this article, we explored the evaluation of classification models. We discussed the need for an evaluation of a model, and main model evaluation procedures that are used such as “train/test split” and “k-fold cross validation”.
Next we talked about model evaluation metrics in detail along with code samples using Scikit-learn. We discussed, in detail: “classification accuracy”, “confusion matrix”, “ROC curve” and “area under the curve”.
Now you should be able to confidently evaluate a classification model and choose the best performing model for a given dataset using the knowledge gained from this article.
Source code that created this post can be found below.
If you have any problems or questions regarding this article, please don’t hesitate to leave a comment below or drop me an email: lahiru.tjay@gmail.com
Hope you enjoyed the article. Cheers!
Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.
Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.