Imbalanced Classification Demystified

How to solve 99% of all imbalanced classification problems

Harpreet Sahota
Published in Heartbeat · Nov 4, 2021

Imbalanced data is a pain to work with.

“But, Harpreet, why do you say that?”

Because machine learning techniques typically fail in these scenarios, and if they don’t fail…you’ll likely observe misleadingly optimistic performance with your classification model.

“Hold on, what? Why is that?”

Photo by Tim Mossholder on Unsplash

Because many classification algorithms are designed for situations where you have an equal number of observations for each class, which causes the algorithm to behave in…strange ways.

Especially when it encounters an example from the minority class.

“What do you mean by strange ways?”

If you train an algorithm on imbalanced data, you’ll end up with a model that is essentially blind to the minority class. Those few minority class examples aren’t considered important and are “glossed over” in order to achieve good performance.

“Okay, so…what’s your point?”

The majority class often reflects the normal case: it’s what you’d expect to happen most of the time. Oftentimes, these cases are not what you’re interested in.

Photo by Roy Muz on Unsplash

The real interesting cases are in the minority class.

Those infrequent, rare, extreme, severe, and highly consequential cases.

They could be diagnostic faults, fraudulent transactions, or other types of black swan events.

“So…what am I to do?”

Have no fear, dear reader, that’s what we’ll discuss in this post.

Here’s an overview of three of the most important questions to ask yourself when you’re working on a classification task with imbalanced data:

1) What do I do about the class imbalance?
2) How do I know which algorithm to use?
3) What’s the right evaluation metric to use?

What do I do about the class imbalance?

A popular solution to the problem of class imbalance is to change the training data set by augmenting it in such a way that the classes are more balanced.

Instead of banging our head against the wall and trying to build a model to deal with the imbalance, we can balance the class frequencies.

There are a number of sampling methods available and I’ll discuss just a few of them for you here.

Oversampling

Oversampling methods basically create “fake” (maybe synthetic is a better word?) examples of the minority class using actual examples of the minority class from your training data.

Some of the more widely used and implemented oversampling methods include:

  • Random Oversampling
  • Synthetic Minority Oversampling Technique (SMOTE)
  • Borderline-SMOTE
  • Borderline Oversampling with SVM
  • Adaptive Synthetic Sampling (ADASYN)

One of the most widely used oversampling methods is called SMOTE (Synthetic Minority Oversampling Technique).

That’s a technique that’s widely written about pretty much everywhere, so instead I’ll briefly describe Borderline-SMOTE.

Borderline SMOTE

Unlike standard SMOTE, where synthetic examples are created between randomly chosen minority-class examples, Borderline-SMOTE only creates synthetic data along the decision boundary between the two classes.

Borderline-SMOTE selects the minority-class examples that are misclassified, for example by a k-nearest neighbors classifier.

We can then oversample just those difficult instances, providing more resolution only where it may be required.

Here are the steps at a high-level and you can find more detail on page five of the original paper:

  1. For every point in the minority class, find its k nearest neighbors, and count how many of those neighbors belong to the majority class (the paper calls this number k').
  2. If all of a point's nearest neighbors belong to the majority class, treat that point as noise and stop. If the majority-class neighbors outnumber the minority-class neighbors, put that point aside into a set called the Danger set. If the minority-class neighbors outnumber the majority-class neighbors, call that point safe and stop.
  3. All the points in the Danger set are borderline data of the minority class. For each example in the Danger set, find its k nearest neighbors from the minority class.
  4. Generate new synthetic data along the line between each minority borderline example and its nearest neighbors of the same class.

Basically: this works by examining examples that are close to each other in the feature space, drawing a line between them, and creating a new sample at a point along that line.
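
To make this concrete, here’s a minimal sketch of Borderline-SMOTE using the imbalanced-learn library. The make_classification toy dataset and the parameter values are purely illustrative, not a recommendation.

```python
# A minimal sketch of Borderline-SMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

# Toy dataset with roughly a 99:1 class ratio (illustrative only).
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42
)
print("Before:", Counter(y))

# k_neighbors controls how synthetic points are interpolated;
# m_neighbors controls how borderline ("danger") points are identified.
sampler = BorderlineSMOTE(k_neighbors=5, m_neighbors=10, random_state=42)
X_res, y_res = sampler.fit_resample(X, y)
print("After:", Counter(y_res))  # classes are now roughly balanced
```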

Under-sampling

Under-sampling methods pretty much do the opposite of oversampling: they delete or select a subset of examples from the majority class.

Some widely used under-sampling methods include:

  • Random Under-sampling
  • Condensed Nearest Neighbor Rule (CNN)
  • Near Miss Under-sampling
  • Tomek Links Under-sampling
  • Edited Nearest Neighbors Rule (ENN)
  • One-Sided Selection (OSS)
  • Neighborhood Cleaning Rule (NCR)

Let’s briefly discuss one of the more popular deletion methods, called Tomek Links.

A Tomek Link is a pair of examples in the training dataset that belong to different classes and are each other's nearest neighbors (i.e., they have the minimum distance to each other in feature space).

Tomek Links tend to sit along the class boundary, where examples are most likely to be misclassified; for under-sampling, the majority-class example in each link is deleted.
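
Here’s a quick sketch of Tomek Links under-sampling with imbalanced-learn; again, the toy dataset and the sampling_strategy choice are illustrative assumptions.

```python
# A minimal sketch of Tomek Links under-sampling with imbalanced-learn.
# sampling_strategy="majority" removes only the majority-class member of each
# link, matching the behaviour described above.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("Before:", Counter(y))

tl = TomekLinks(sampling_strategy="majority")
X_res, y_res = tl.fit_resample(X, y)
print("After:", Counter(y_res))  # only a handful of boundary examples are removed
```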

Combining techniques

Used individually, oversampling or under-sampling methods are pretty effective. But combining them together can often result in better overall model performance.

Let’s consider what happens when we combine SMOTE + Tomek Links.

SMOTE works by synthesizing new plausible examples from the minority class.

Tomek Links identifies pairs of nearest neighbors in a data set that have different classes.

Removing one or both of the examples in these pairs — such as the examples in the majority class — has the effect of creating a less noisy or ambiguous decision boundary.


You can use the imbalanced-learn library in Python to perform this combined sampling technique, specifically via the SMOTETomek method.
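
Here’s a rough sketch of what that might look like. The dataset and the random forest are illustrative choices; the important detail is wrapping the sampler and the model in imbalanced-learn’s Pipeline so resampling happens only on the training folds during cross-validation.

```python
# Combining SMOTE + Tomek Links with SMOTETomek, wrapped in an imblearn
# Pipeline so resampling never touches the evaluation fold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

pipeline = Pipeline(
    steps=[
        ("resample", SMOTETomek(random_state=42)),
        ("model", RandomForestClassifier(random_state=42)),
    ]
)

# Score with a metric suited to imbalance (more on metric choice below).
scores = cross_val_score(pipeline, X, y, scoring="f1", cv=5)
print(f"Mean F1: {scores.mean():.3f}")
```

Keeping the resampling step inside the pipeline avoids leaking synthetic examples into the data you evaluate on.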

Note that this is a point of experimentation.

You can experiment with different combinations of sampling methods yourself to see how your model performance fares.

For example, another popular combination is SMOTE + ENN.

Our results show that the over-sampling methods in general, and SMOTE + Tomek and SMOTE + ENN (two of the methods proposed in this work) in particular for data sets with few positive (minority) examples, provided very good results in practice.

- A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data, 2004.

Once you’ve balanced your data, the next task is to identify which algorithm to use.

How do I know which algorithm to use?

Data will often point with almost equal emphasis on several possible models, and it is important that the [data scientist] recognize and accept this.
— McCullagh, P. and Nelder, J., Generalized Linear Models, 1989

Which of the many algorithms out there am I to use? This is always a challenge when you start any modeling process. After all, different models, all of them equally good, may give different pictures of the relation between the features and the target.

You can spend endless hours researching which one to use, but at the end of the day your paycheck isn’t free…you need to be biased towards action.

So do this instead:

  1. Select an evaluation metric so you can compare candidate models.
  2. Determine a baseline level of performance.
  3. Spot check several algorithms with their out-of-the-box hyperparameters.
  4. Select the best-performing models and evaluate the results.
  5. Iterate.


Selecting an evaluation metric

In any binary classification problem there are two types of misclassifications we can make: False positives and false negatives.

Let’s explore this a bit deeper.

False negatives

These are instances of the minority class that our model incorrectly assigns to the majority class.

False positives

These are examples of the majority class that our model incorrectly assigns to the minority class.


The trade-off

A trade-off exists between false negatives and false positives.

Optimize for fewer false negatives, and you must be willing to tolerate a greater occurrence of false positives (and vice versa).

And it’s your job, dear reader, to decide how to trade-off the two.

So here are a series of heuristics you can use to help identify the appropriate evaluation metric for your use case:

[Flowchart: heuristics for choosing an evaluation metric for imbalanced classification]

I’d also recommend this in-depth post which is a tour of various metrics for imbalanced classification.
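
As a concrete starting point, here’s a small sketch of computing a few of the usual candidate metrics with scikit-learn. The y_true, y_pred, and y_proba arrays are made-up placeholders standing in for your own labels, hard predictions, and positive-class probabilities.

```python
# Computing common candidate metrics for a binary, imbalanced problem.
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)

# Placeholder data: class 1 is the minority (positive) class.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 0])
y_proba = np.array([0.1, 0.2, 0.05, 0.6, 0.3, 0.1, 0.9, 0.4, 0.8, 0.2])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"FP={fp}, FN={fn}")                            # the two error types above
print("Precision:", precision_score(y_true, y_pred))  # penalises false positives
print("Recall:   ", recall_score(y_true, y_pred))     # penalises false negatives
print("F1:       ", f1_score(y_true, y_pred))         # balances the two
print("ROC AUC:  ", roc_auc_score(y_true, y_proba))   # threshold-independent
```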

Once we’ve figured out the appropriate metric for our use case, we can begin to build a model.

Determine a baseline level of performance

But how will we know if our results are any good or if machine learning is even appropriate for the problem?

We need a meaningful reference point for comparison.

We call this a baseline, which is the simplest possible model that will still yield a decent result.

In some cases it can be a random result, and in others the most common prediction. But in all cases it serves the purpose of providing a point of comparison for any advanced methods that we test out later on in our process. It’s a simple, yet powerful idea.

Here are two of my go-to methods for obtaining a baseline result:

  1. The DummyClassifier, which makes predictions using simple rules. It’s useful as a simple baseline to compare against other (real, more complex) classifiers.
  2. Logistic Regression
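
Here’s a minimal sketch of both baselines on a toy imbalanced dataset; the dataset and the choice of F1 as the metric are illustrative assumptions.

```python
# Two baselines: a DummyClassifier that always predicts the majority class,
# and a plain logistic regression.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

# Accuracy would look great for the dummy model (~0.99) on this data,
# which is exactly why we score with F1 instead.
for name, model in [
    ("dummy (most frequent)", DummyClassifier(strategy="most_frequent")),
    ("logistic regression", LogisticRegression(max_iter=1000)),
]:
    scores = cross_val_score(model, X, y, scoring="f1", cv=5)
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```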

After we’ve established a baseline model we can add or change the features, test various algorithms, experiment with the parameters of the algorithms, and through this process determine whether our efforts are getting us any closer to an improved solution to our problem.

This experimental part of machine learning is, in my opinion, the most fun and creative aspect of it all.

Photo by Girl with red hat on Unsplash

And it’s where a tool like Comet becomes invaluable, because it helps you keep track of all the experiments you’re running so you can focus on building the best model for your use case.

Spot checking algorithms

The reason we spot check a suite of algorithms for our problem is primarily to determine whether we can even solve this problem using machine learning.

Your objective here is to quickly test a variety of techniques and discover which ones show promise, so that you can focus on them later during hyperparameter tuning.

Whatever results we obtain here will end up serving as a basis for comparison for any more complex model we build.

You can try any number of linear, nonlinear, or ensemble algorithms.

Here are a few you can choose from (a spot-check sketch follows the list):

  • Linear discriminant analysis
  • Support vector machine
  • Naive Bayes
  • XGBoost
  • Random forest
  • AdaBoost with a depth-2 decision tree as the base classifier
  • AdaBoost with LinearSVC as the base classifier
  • Extremely randomized trees classifier
  • Histogram-based gradient boosting classifier
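
Here’s a sketch of what that spot check might look like for a handful of the algorithms above, using stratified cross-validation and a single agreed-upon metric. The dataset, the F1 metric, and the exact model list are illustrative.

```python
# Spot-checking several algorithms with out-of-the-box hyperparameters.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(),
    "NaiveBayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=42),
    # Note: scikit-learn < 1.2 calls this parameter base_estimator.
    "AdaBoost(tree, depth=2)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=2), random_state=42
    ),
    "ExtraTrees": ExtraTreesClassifier(random_state=42),
    "HistGradientBoosting": HistGradientBoostingClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="f1", cv=cv, n_jobs=-1)
    print(f"{name:28s} mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```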

Optimizing the best performer

The simplest approach to hyperparameter tuning is to select the top three or five algorithms — or algorithm combinations — that performed the best in the spot check phase and tune the hyperparameters for each.

The three most popular hyperparameter tuning algorithms are:

  1. Random Search
  2. Grid Search
  3. Bayesian Optimization

Bayesian search is the one I recommend you use, but it’s difficult to set up.

Luckily, the Comet Optimizer makes it easy to use and you can see it in action in this notebook:
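
Separate from that notebook, here’s a rough sketch of how the Comet Optimizer can drive a Bayesian search over a random forest’s hyperparameters. It assumes your Comet API key is already configured; the project name, dataset, and parameter ranges below are placeholders, not prescriptions.

```python
# A rough sketch of a Bayesian hyperparameter search with the Comet Optimizer.
from comet_ml import Optimizer

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

config = {
    "algorithm": "bayes",
    "parameters": {
        "n_estimators": {"type": "integer", "min": 100, "max": 1000},
        "max_depth": {"type": "integer", "min": 2, "max": 20},
    },
    # The optimizer maximizes the metric we log under this name below.
    "spec": {"metric": "mean_f1", "objective": "maximize"},
}

opt = Optimizer(config)
for experiment in opt.get_experiments(project_name="imbalanced-classification"):
    model = RandomForestClassifier(
        n_estimators=experiment.get_parameter("n_estimators"),
        max_depth=experiment.get_parameter("max_depth"),
        random_state=42,
    )
    scores = cross_val_score(model, X, y, scoring="f1", cv=5)
    experiment.log_metric("mean_f1", scores.mean())
    experiment.end()
```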

If the results from this first round of experimentation are to your liking, great!

If not, then you can always iterate through the process by testing different combinations of sampling techniques plus algorithms, creating new features, or combining the results from two or more algorithms.

You’re only limited by your creativity here!

Enough talk, let’s see this in action

I’ve got two hands-on projects for you to see all of this in action.

Oil spill classification

The first one is a project where we use machine learning to help save the environment!

Kind of.

We’ll apply this framework to the oil spill data set.

In this notebook you’ll see the following:

  • Creating a baseline with the dummy classifier.
  • Discussing the selection of appropriate evaluation metrics.
  • Spot checking logistic regression, linear discriminant analysis, and Gaussian naive Bayes.
  • Data sampling using SMOTE + ENN.
  • Testing the best performing algorithm from the spot check phase with various sampling methods.
  • Assessing performance of the fitted model.

See it in action here:

Phoneme classification

In this project we perform a binary classification of vowel sounds from European languages.

You’ll see the following in action:

  • Data profiling with sweetviz.
  • Discussing the selection of appropriate evaluation metrics.
  • An in-depth discussion of: Random Oversampling, Synthetic Minority Oversampling Technique (SMOTE), Borderline SMOTE, SVM-SMOTE, Adaptive synthetic algorithm.
  • Spot checking of logistic regression, support vector machines, bagged decision trees, random forest, and extremely randomized trees.
  • Using the Comet Optimizer to perform a randomized search to find the best hyperparameters for the winning algorithm.

See it in action here:

Conclusion

Imbalanced classification doesn’t have to be a pain.

We’ve got methods to deal with this issue, and you’ll see them in action in the notebooks mentioned above.

Experiment and play around with these notebooks, track your experiments with Comet, and if you have any questions or comments swing by Comet’s open Slack community.

And remember my friends: You’ve got one life on this planet, why not try to do something big?

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
