An imbalanced dataset is hard to deal with for most ML algorithms, as the model has a hard time learning the decision boundaries between the classes.

The imbalanced-learn Python library provides implementations of different approaches to deal with imbalanced datasets. The library can be installed with pip as follows:

$ pip install imbalanced-learn

All of the techniques described in the following sections accept a parameter called sampling_strategy that controls the sampling strategy. By default it is set to 'auto', but it can take one of the following values (see the sketch after this list):

  • minority: resample only the minority class
  • not majority: resample all classes except the majority class (equivalent to auto for over-samplers)
  • not minority: resample all classes except the minority class (equivalent to auto for under-samplers)
  • all: resample all classes
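
For illustration, sampling_strategy is simply passed to a sampler's constructor. The following is a minimal sketch using the RandomOverSampler and RandomUnderSampler classes from the library; the float form is an additional option the library supports for binary problems:

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Over-sample only the minority class
ros = RandomOverSampler(sampling_strategy='minority')

# Under-sample every class except the minority class
rus = RandomUnderSampler(sampling_strategy='not minority')

# For binary problems a float can also be passed: the desired ratio of
# minority to majority samples after resampling
ros_half = RandomOverSampler(sampling_strategy=0.5)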

The following sections describe some of the most common techniques, using the synthetic imbalanced dataset below.

from collections import Counter
from sklearn.datasets import make_classification

# Generate some data for a classification problem
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=42
)

# Count of samples per class -> {1: 900, 0: 100}
print('Original dataset shape %s' % Counter(y))

Undersampling

This technique removes samples from the class with more data until it matches the class with the fewest samples. Suppose class A has 900 samples and class B has 100 samples; the imbalance ratio is then 9:1. With undersampling we keep all 100 samples of class B and randomly select 100 of the 900 samples of class A. The ratio becomes 1:1 and the dataset is balanced.
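
The random selection described above is what imblearn's RandomUnderSampler does. Here is a minimal sketch, reusing the X and y generated earlier:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop samples from the majority class (1) until it matches class 0
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)

# Expected count after balancing -> {0: 100, 1: 100}
print('Resampled dataset shape {}'.format(Counter(y_rus)))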

Besides RandomUnderSampler, the under_sampling module of imblearn contains several other classes that implement undersampling. The example below uses NearMiss.

from imblearn.under_sampling import NearMiss

# Apply NearMiss to balance the dataset
nm = NearMiss()
X_res, y_res = nm.fit_resample(X, y)

# New count after balancing -> {0: 100, 1: 100}
print('Resampled dataset shape {}'.format(Counter(y_res)))

Oversampling

Oversampling (also called upsampling) is the opposite of undersampling: the class with fewer samples is grown by adding more data until it matches the class with more samples. Taking the same example as before, class A stays at 900 samples and class B is increased from 100 to 900 samples. The ratio becomes 1:1 and the dataset is balanced.

RandomOverSampler

The over_sampling module of imblearn contains several classes that implement oversampling. RandomOverSampler is the simplest approach: it balances the classes by randomly duplicating samples from the minority class.

from imblearn.over_sampling import RandomOverSampler

# Apply RandomOverSampler to balance the dataset
ros = RandomOverSampler()
X_res, y_res = ros.fit_resample(X, y)

# New count after balancing -> {0: 900, 1: 900}
print('Resampled dataset shape {}'.format(Counter(y_res)))

SMOTE

One way to address class imbalance is to oversample the minority class, for instance by simply duplicating its examples as RandomOverSampler does. Such an approach does not provide any additional information to the model, so a better approach is to generate synthetic examples.

SMOTE, which stands for Synthetic Minority Oversampling Technique, is a widely used approach for generating synthetic examples for the minority class. It works by (see the sketch after this list):

  • selecting a random example from the minority class
  • finding the k (typically k=5) nearest neighbors of that example
  • selecting a random example from those neighbors
  • drawing a line between those two examples
  • generating a synthetic example by choosing a random point from that line
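
To make the interpolation step concrete, here is a minimal, illustrative NumPy sketch of how a single synthetic point could be generated; it is not the library's actual implementation, and smote_one_point is a hypothetical helper:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one_point(minority_X, k=5, seed=None):
    """Generate one synthetic sample from an array of minority-class samples (illustrative only)."""
    rng = np.random.default_rng(seed)
    # 1. pick a random example from the minority class
    x = minority_X[rng.integers(len(minority_X))]
    # 2. find its k nearest neighbors (k + 1 because the point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    # 3. pick one of those neighbors at random (skipping the point itself at position 0)
    neighbor = minority_X[rng.choice(idx[0][1:])]
    # 4.-5. choose a random point on the line between the two examples
    lam = rng.random()
    return x + lam * (neighbor - x)

# Example: one synthetic sample from the minority class (label 0) of the dataset above
new_point = smote_one_point(X[y == 0], seed=42)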

The following is an example of using the SMOTE technique to balance a classification dataset:

from imblearn.over_sampling import SMOTE

# Apply SMOTE to balance the dataset
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# New count after balancing -> {0: 900, 1: 900}
print('Resampled dataset shape %s' % Counter(y_res))

Combining under-sampling and over-sampling

SMOTE allows us to generate synthetic samples. However, this method of over-sampling has no knowledge of the underlying distribution, so some noisy samples can be generated, e.g. when the classes are not well separated.

It can therefore be beneficial to apply an under-sampling algorithm to clean up these noisy samples. Imbalanced-learn provides two ready-to-use combined samplers, SMOTETomek and SMOTEENN.

SMOTETomek

SMOTETomek is a hybrid method that sits between upsampling and downsampling: it combines an over-sampling method (SMOTE) with an under-sampling cleaning step (Tomek links). It is available in the imblearn.combine module.

from imblearn.combine import SMOTETomek

# Apply SMOTETomek to balance the dataset
smk = SMOTETomek()
X_res, y_res = smk.fit_resample(X, y)

# New count after balancing -> {0: 900, 1: 900}
print('Resampled dataset shape {}'.format(Counter(y_res)))

SMOTEENN

SMOTEENN is another combined sampler available in the imblearn.combine module; it pairs SMOTE with Edited Nearest Neighbours (ENN) cleaning. In general, SMOTEENN removes more noisy samples than SMOTETomek.

from imblearn.combine import SMOTEENN

# Apply SMOTEENN to balance the dataset
sme = SMOTEENN()
X_res, y_res = sme.fit_resample(X, y)

# New count after balancing -> {0: 900, 1: 895}
print('Resampled dataset shape {}'.format(Counter(y_res)))

Notice how class 0 has been upsampled from 100 to 900, while the ENN cleaning step reduced class 1 slightly from 900 to 895. Unlike SMOTETomek, the resulting ratio is not exactly 1:1, but the difference between the classes is very small.

You can learn more about how to use the imbalanced-learn library in this article - link.