Four Oversampling and Under-sampling Methods for Imbalanced Classification Using Python

Oversampling and under-sampling are techniques for changing the ratio of the classes in an imbalanced modeling dataset. This step-by-step tutorial explains how to use both in the Python imbalanced-learn (imblearn) library to rebalance the classes for machine learning models. We will compare the following four methods against a baseline random forest model:

  • Random Oversampling
  • SMOTE (Synthetic Minority Oversampling Technique)
  • Random Under-Sampling
  • Near Miss Under-Sampling

First off, what is imbalanced classification? Imbalanced classification is also called rare event modeling. When the target label of a classification dataset is highly imbalanced, we call the minority event a rare event. In this case, the model learns mostly from the majority class, and predicting the minority class can be challenging. For example, if only 0.01% of the dataset belongs to the minority event, the model sees too few examples to learn the pattern of that event well.
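To make the 0.01% example concrete, here is a minimal sketch that measures how rare the minority class is in a list of labels. The labels and the helper name minority_share are made up for illustration.

```python
from collections import Counter

def minority_share(labels):
    """Return (minority_label, share of records carrying that label)."""
    counts = Counter(labels)
    minority_label, minority_count = min(counts.items(), key=lambda kv: kv[1])
    return minority_label, minority_count / len(labels)

# Hypothetical labels: 1 rare event in every 10,000 records (0.01%)
labels = [0] * 9999 + [1]
label, share = minority_share(labels)
print(f"Minority class {label} makes up {share:.4%} of the data")
```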

Rare event modeling for imbalanced datasets has many use cases. Fraud event detection, severe disease diagnosis, and credit card invitation response are some examples.

Resources for this post:

  • Video version of this tutorial on YouTube (Four oversampling and under-sampling methods, GrabNGoInfo.com): https://www.youtube.com/watch?v=kZNkaNATmd8&list=PLVppujud2yJo0qnXjWVAa8h7fxbFJHtfJ&index=1

Step 1: Install and Import Python Libraries

We will use a Python library called imbalanced-learn to handle imbalanced datasets, so let’s install the library first.

pip install -U imbalanced-learn

The following output confirms that the imbalanced-learn library was installed successfully. Your package versions may differ from the ones shown here.

Successfully installed imbalanced-learn-0.8.0 scikit-learn-0.24.2 threadpoolctl-2.2.0
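If you would like to double-check the versions from inside Python, the sketch below is one way to do it using only the standard library. The helper name installed_version is our own, and it simply reports "not installed" when a package is missing.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Package names as they appear in the pip output above
for name in ("imbalanced-learn", "scikit-learn"):
    print(name, installed_version(name) or "not installed")
```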

Now let’s import the Python libraries.

# Creating the modeling dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model and performance
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Oversampling and under sampling
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from collections import Counter

Step 2: Create Imbalanced Dataset For Classification Model

Using make_classification from the sklearn library, we create an imbalanced dataset with two classes. The weights argument asks for a minority class of 0.5% of the dataset, and two features are generated to predict which class each data point belongs to.

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize = True)

About 1% of the data points end up in the minority class. This is higher than the specified weight of 0.5%, but it still works for demonstrating the rare event modeling process.

0    0.9897
1    0.0103
Name: target, dtype: float64
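The most likely reason for the gap between 0.5% and roughly 1% is the flip_y parameter of make_classification, which defaults to 0.01 and randomly reassigns the class of that fraction of the samples. The sketch below is a simplified pure-Python simulation of that label noise under our own assumptions (it is not sklearn's code), together with the back-of-the-envelope arithmetic.

```python
import random

def simulate_label_noise(n_samples=100_000, minority_weight=0.005,
                         flip_fraction=0.01, seed=0):
    """Simplified stand-in for label noise: reassign a fraction of labels at random."""
    rng = random.Random(seed)
    n_minority = int(n_samples * minority_weight)
    labels = [0] * (n_samples - n_minority) + [1] * n_minority
    for i in range(n_samples):
        if rng.random() < flip_fraction:
            labels[i] = rng.randint(0, 1)  # new class drawn uniformly from {0, 1}
    return sum(labels) / n_samples

# Expected minority share: untouched minority labels + labels reassigned to class 1
expected = 0.005 * (1 - 0.01) + 0.01 * 0.5
print(f"expected share:  {expected:.4f}")
print(f"simulated share: {simulate_label_noise():.4f}")
```

If you want the class ratio to match the weights more closely, passing flip_y=0 to make_classification should remove this source of noise.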

Let’s visualize the data using a scatter plot.

# Visualize the data
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df)

We can see that the majority of the dataset belongs to class 0, and a small portion of the dataset belongs to class 1.

Figure: Imbalanced data for classification model

Step 3: Train Test Split For Imbalanced Data

In this step, we split the dataset into 80% training data and 20% test data. Setting random_state makes the split reproducible, so we get the same train test split every time. The seed does not have to be 42; it can be any integer.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")

The train test split gives us 80,000 records for the training dataset and 20,000 for the test dataset. The training dataset contains 79,183 data points from the majority class and 817 from the minority class.

The number of records in the training dataset is 80000
The number of records in the test dataset is 20000
The training dataset has 79183 records for the majority class and 817 records for the minority class.
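To illustrate what a fixed seed buys us, here is a small standard-library sketch of a seeded shuffle-and-cut split. It is only an illustration of the idea, not how train_test_split is implemented, and the function name simple_train_test_split is hypothetical.

```python
import random

def simple_train_test_split(records, test_size=0.2, seed=42):
    """Shuffle a copy of the records with a fixed seed, then cut off the test share."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))
train_a, test_a = simple_train_test_split(records, seed=42)
train_b, test_b = simple_train_test_split(records, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)  # the same seed reproduces the same split
```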

Step 4: Decide the Performance Metric for Classification Model

Before building the model, we need to decide which performance metric to optimize.

For rare event modeling, the most critical performance metric is usually the recall or the precision of the minority class. For example, in fraud detection, we would like to maximize the true positive rate and capture as many fraud cases as possible, so recall for the minority class is the metric to optimize.

In spam email classification, by contrast, we would like to avoid flagging important emails as spam, that is, to keep false positives low, so precision for the minority class is the metric to optimize.

In this tutorial, we use fraud detection as an example and choose recall for the minority class as the metric to optimize.
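As a quick refresher on the two metrics, the sketch below computes precision and recall for the minority class by counting true positives, false positives, and false negatives. The labels are hypothetical and the helper precision_recall is our own.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the chosen positive (minority) label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 4 actual fraud cases (1) among 10 transactions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here the model catches 3 of the 4 fraud cases (recall) but 2 of its 5 fraud alerts are false alarms (precision).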

Step 5: Baseline Random Forest Model for Imbalanced Data

We first check the model performance using the imbalanced data directly. A model trained without oversampling or under-sampling gives us a baseline against which to compare the other models. A random forest model is used as an example here.

# Train the random forest model
rf = RandomForestClassifier()
baseline_model = rf.fit(X_train, y_train)
baseline_prediction = baseline_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, baseline_prediction))

Recall ranges from 0 to 1, where 0 means that none of the actual minority cases are captured and 1 means that all of them are. The output shows that the minority class has a recall of 0.03, which means the model captures only about 3% of the minority class.

                precision    recall  f1-score   support

           0       0.99      1.00      0.99     19787
           1       0.50      0.03      0.06       213

    accuracy                           0.99     20000
   macro avg       0.74      0.52      0.53     20000
weighted avg       0.98      0.99      0.98     20000
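The report also shows why accuracy alone is misleading for rare events. As a worked example using the support values from the report above, a hypothetical "model" that always predicts class 0 would already reach about 99% accuracy while capturing none of the minority class.

```python
# Support values taken from the classification report above
n_majority, n_minority = 19787, 213

# A hypothetical "model" that always predicts the majority class:
# every class-0 record is right, every class-1 record is wrong
correct = n_majority
accuracy = correct / (n_majority + n_minority)
minority_recall = 0 / n_minority

print(f"accuracy={accuracy:.4f}, minority recall={minority_recall:.2f}")
```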

Step 6: Random Oversampling for Imbalanced Dataset

One way to oversample is to generate new samples for the minority class by sampling the existing ones with replacement. The RandomOverSampler from the imblearn library provides this functionality.
Note that we apply the oversampling technique to the training dataset only. The test dataset must remain untouched so that it still reflects the original class distribution.

# Randomly over sample the minority class
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros= ros.fit_resample(X_train, y_train)

# Check the number of records after over sampling
print(sorted(Counter(y_train_ros).items()))

After random oversampling, the minority class grew from 817 to 79,183 records, the same as the majority class.

[(0, 79183), (1, 79183)]
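To see what sampling with replacement means here, the following is a minimal standard-library sketch of the idea. It is a simplified illustration on made-up toy data, not the RandomOverSampler implementation, and the function name random_oversample is our own.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        extra = rng.choices(rows, k=target - count)  # sampling with replacement
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Hypothetical toy data: 6 majority rows and 2 minority rows
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.5], [0.4, 0.3], [0.6, 0.2],
     [2.0, 2.1], [2.2, 1.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(sorted(Counter(y_res).items()))
```

Every added row is an exact copy of an existing minority row, so no new information enters the training data.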

From the visualization, we can see more orange minority data points after random oversampling.

# Convert the data from numpy array to a pandas dataframe
df_ros = pd.DataFrame({'feature1': X_train_ros[:, 0], 'feature2': X_train_ros[:, 1], 'target': y_train_ros})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_ros)
plt.title('Random Over Sampling')

Figure: Random oversampling for imbalanced data

Now let’s run the same random forest model and check the performance after random oversampling.

# Train the random forest model
# Reuses rf from Step 5: fit() retrains it from scratch on the resampled data
ros_model = rf.fit(X_train_ros, y_train_ros)
ros_prediction = ros_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, ros_prediction))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     19787
           1       0.17      0.03      0.05       213

    accuracy                           0.99     20000
   macro avg       0.58      0.52      0.52     20000
weighted avg       0.98      0.99      0.98     20000

We can see that random oversampling did not improve the result. Recall stayed at 0.03, the f1-score is similar (0.05 versus 0.06), and precision dropped from 0.50 to 0.17. In this example, random oversampling therefore performs worse than the baseline model with no class ratio adjustment.

Step 7: SMOTE Oversampling for Imbalanced Dataset

Let’s try SMOTE (Synthetic Minority Oversampling Technique), published in 2002 by Chawla, Bowyer, Hall & Kegelmeyer [2]. Instead of randomly oversampling with replacement, SMOTE takes each minority sample and creates synthetic data points on the line segments connecting that sample to its nearest minority neighbors. For each synthetic point, one of the k nearest neighbors is chosen at random.
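The interpolation idea can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions (the base sample is picked at random, the data is made up, and the name smote_sketch is hypothetical); it is not the imblearn implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical minority points
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 2.0], [1.2, 1.8]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each new point sits between two existing minority points, the synthetic data fills in the region the minority class already occupies rather than duplicating rows.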

# Over sample the minority class using SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote= smote.fit_resample(X_train, y_train)

# Check the number of records after over sampling
print(sorted(Counter(y_train_smote).items()))

Similar to random oversampling, the minority category increased from 817 to 79183 after SMOTE oversampling.

[(0, 79183), (1, 79183)]

Comparing the random oversampling and SMOTE graphs, we can see that the synthetic data points created by SMOTE lie along lines between existing minority points.

# Convert the data from numpy array to a pandas dataframe
df_smote = pd.DataFrame({'feature1': X_train_smote[:, 0], 'feature2': X_train_smote[:, 1], 'target': y_train_smote})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_smote)
plt.title('SMOTE Over Sampling')

Figure: SMOTE oversampling for imbalanced data

Now let’s run the same random forest model on the SMOTE dataset and check its performance.

# Train the random forest model
# rf = RandomForestClassifier()
smote_model = rf.fit(X_train_smote, y_train_smote)
smote_prediction = smote_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, smote_prediction))

              precision    recall  f1-score   support

           0       0.99      0.84      0.91     19787
           1       0.02      0.24      0.03       213

    accuracy                           0.83     20000
   macro avg       0.50      0.54      0.47     20000
weighted avg       0.98      0.83      0.90     20000

We can see that the model using SMOTE increased minority recall from 0.03 to 0.24, so it captures considerably more of the minority class. The trade-off is that minority precision fell to 0.02 and overall accuracy to 0.83.

Step 8: Random Under-Sampling for Imbalanced Dataset

Random under-sampling keeps a random subset of the data points from the majority class and discards the rest. After the sampling, the majority class has the same number of data points as the minority class.

# Randomly under sample the majority class
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus= rus.fit_resample(X_train, y_train)

# Check the number of records after under sampling
print(sorted(Counter(y_train_rus).items()))

After random under-sampling, the majority category decreased from 79183 to 817, the same as the minority category.

[(0, 817), (1, 817)]
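The idea behind random under-sampling can be sketched with the standard library as follows. This is a simplified illustration on made-up toy data, not the RandomUnderSampler implementation, and the function name random_undersample is our own.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=42):
    """Keep a random subset of each class so all classes match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in sorted(counts):
        rows = [x for x, lab in zip(X, y) if lab == label]
        kept = rng.sample(rows, k=target)  # sampling without replacement
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out

# Hypothetical toy data: 6 majority rows and 2 minority rows
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.5], [0.4, 0.3], [0.6, 0.2],
     [2.0, 2.1], [2.2, 1.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_undersample(X, y)
print(sorted(Counter(y_res).items()))
```

Unlike oversampling, this throws information away: four of the six majority rows never reach the model.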

The visualization shows that we have fewer data points for the model.

# Convert the data from numpy array to a pandas dataframe
df_rus = pd.DataFrame({'feature1': X_train_rus[:, 0], 'feature2': X_train_rus[:, 1], 'target': y_train_rus})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_rus)
plt.title('Random Under Sampling')

Figure: Random under-sampling for imbalanced data

After random under-sampling, the minority recall increased to 0.50, which is much better than the oversampling results, although precision dropped to 0.01 and accuracy to 0.60.

# Train the random forest model
# rf = RandomForestClassifier()
rus_model = rf.fit(X_train_rus, y_train_rus)
rus_prediction = rus_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, rus_prediction))

              precision    recall  f1-score   support

           0       0.99      0.60      0.75     19787
           1       0.01      0.50      0.03       213

    accuracy                           0.60     20000
   macro avg       0.50      0.55      0.39     20000
weighted avg       0.98      0.60      0.74     20000

Step 9: Under-Sampling Using NearMiss for Imbalanced Dataset

NearMiss from the imblearn library uses K Nearest Neighbors (KNN) to decide which majority samples to keep when under-sampling.

There are three versions of NearMiss algorithms. Based on the documentation of the imblearn library, here are the differences between the three versions:

  • “NearMiss-1 selects the positive samples for which the average distance to the N closest samples of the negative class is the smallest.”
  • “NearMiss-2 selects the positive samples for which the average distance to the N farthest samples of the negative class is the smallest.”
  • “NearMiss-3 is a 2-steps algorithm. First, for each negative sample, their M nearest-neighbors will be kept. Then, the positive samples selected are the one for which the average distance to the N nearest-neighbors is the largest.”

This tutorial uses version 3, but you are encouraged to try other versions and compare the performances.

Version 3 works in two steps. First, for each data point in the minority class, its M nearest majority neighbors are shortlisted. Then, for each shortlisted majority data point, we calculate the average distance to its N nearest minority neighbors. The shortlisted data points with the largest average distance are kept.
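The two steps can be sketched in plain Python as follows. This is a rough illustration of our reading of the description above, on made-up points, and not the imblearn implementation; the function name nearmiss3_sketch and its parameter names are hypothetical.

```python
import math

def nearmiss3_sketch(majority, minority, n_keep, m_neighbors=3, n_neighbors=3):
    """Two-step selection of majority points, loosely following NearMiss-3."""
    # Step 1: shortlist the m nearest majority points of every minority point
    shortlist = set()
    for mp in minority:
        ranked = sorted(range(len(majority)),
                        key=lambda i: math.dist(majority[i], mp))
        shortlist.update(ranked[:m_neighbors])

    # Step 2: score each shortlisted point by its average distance to its
    # n nearest minority points, then keep the points with the largest scores
    def avg_dist(i):
        dists = sorted(math.dist(majority[i], mp) for mp in minority)[:n_neighbors]
        return sum(dists) / len(dists)

    chosen = sorted(shortlist, key=avg_dist, reverse=True)[:n_keep]
    return [majority[i] for i in sorted(chosen)]

# Hypothetical toy data: 8 majority points and 2 minority points
majority = [[0.0, 0.0], [0.5, 0.2], [1.0, 1.0], [1.5, 0.5],
            [3.0, 3.0], [0.2, 1.4], [2.5, 0.1], [4.0, 4.0]]
minority = [[1.0, 0.5], [1.2, 0.8]]

kept = nearmiss3_sketch(majority, minority, n_keep=len(minority))
print(kept)
```

Note how far-away majority points such as [4.0, 4.0] never make the shortlist, so the kept majority points all sit close to the minority class.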

# Under sample the majority class
nearmiss = NearMiss(version=3)
X_train_nearmiss, y_train_nearmiss= nearmiss.fit_resample(X_train, y_train)

# Check the number of records after under sampling
print(sorted(Counter(y_train_nearmiss).items()))

The visualization shows the pattern of the NearMiss under-sampling.

# Convert the data from numpy array to a pandas dataframe
df_nearmiss = pd.DataFrame({'feature1': X_train_nearmiss[:, 0], 'feature2': X_train_nearmiss[:, 1], 'target': y_train_nearmiss})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_nearmiss)
plt.title('NearMiss Under Sampling')

Figure: Under-sampling using NearMiss for imbalanced data

NearMiss gives us a minority recall of 0.57, which is the highest among the four methods, but precision stays at 0.01 and accuracy falls to 0.38.

# Train the random forest model
# rf = RandomForestClassifier()
nearmiss_model = rf.fit(X_train_nearmiss, y_train_nearmiss)
nearmiss_prediction = nearmiss_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, nearmiss_prediction))

              precision    recall  f1-score   support

           0       0.99      0.38      0.55     19787
           1       0.01      0.57      0.02       213

    accuracy                           0.38     20000
   macro avg       0.50      0.48      0.28     20000
weighted avg       0.98      0.38      0.54     20000
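To compare the five models side by side, the short sketch below collects the minority-class (class 1) scores reported in the classification reports above and ranks them by recall, the metric we chose in Step 4. The dictionary layout is just one convenient way to organize the numbers.

```python
# Minority-class (class 1) scores copied from the classification reports above
results = {
    "Baseline":              {"precision": 0.50, "recall": 0.03, "f1": 0.06},
    "Random oversampling":   {"precision": 0.17, "recall": 0.03, "f1": 0.05},
    "SMOTE":                 {"precision": 0.02, "recall": 0.24, "f1": 0.03},
    "Random under-sampling": {"precision": 0.01, "recall": 0.50, "f1": 0.03},
    "NearMiss (version 3)":  {"precision": 0.01, "recall": 0.57, "f1": 0.02},
}

# Rank the methods by minority recall, highest first
ranked = sorted(results.items(), key=lambda kv: kv[1]["recall"], reverse=True)

print(f"{'method':<24}{'recall':>8}{'precision':>11}{'f1':>6}")
for name, scores in ranked:
    print(f"{name:<24}{scores['recall']:>8.2f}"
          f"{scores['precision']:>11.2f}{scores['f1']:>6.2f}")
```

The ranking makes the trade-off visible: the methods with the highest recall also have the lowest precision, so the right choice depends on how costly false alarms are for your use case.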

Step 10: Put All Code Together

###### Step 1: Import Libraries

# Creating the modeling dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model and performance
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Over sampling and under sampling
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from collections import Counter


###### Step 2: Create Imbalanced Dataset

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize = True)

# Check the count of each class
df['target'].value_counts()

# Visualize the data
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df)


###### Step 3: Train Test Split

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")


###### Step 4: Decide the Performance Metric (No code)


###### Step 5: Baseline Model

# Train the random forest model
rf = RandomForestClassifier()
baseline_model = rf.fit(X_train, y_train)
baseline_prediction = baseline_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, baseline_prediction))


###### Step 6: Random Over Sampling

# Randomly over sample the minority class
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros= ros.fit_resample(X_train, y_train)

# Check the number of records after over sampling
print(sorted(Counter(y_train_ros).items()))

# Convert the data from numpy array to a pandas dataframe
df_ros = pd.DataFrame({'feature1': X_train_ros[:, 0], 'feature2': X_train_ros[:, 1], 'target': y_train_ros})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_ros)
plt.title('Random Over Sampling')

# Train the random forest model
# rf = RandomForestClassifier()
ros_model = rf.fit(X_train_ros, y_train_ros)
ros_prediction = ros_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, ros_prediction))


###### Step 7: SMOTE Over Sampling

# Over sample the minority class using SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote= smote.fit_resample(X_train, y_train)

# Check the number of records after over sampling
print(sorted(Counter(y_train_smote).items()))

# Convert the data from numpy array to a pandas dataframe
df_smote = pd.DataFrame({'feature1': X_train_smote[:, 0], 'feature2': X_train_smote[:, 1], 'target': y_train_smote})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_smote)
plt.title('SMOTE Over Sampling')

# Train the random forest model
# rf = RandomForestClassifier()
smote_model = rf.fit(X_train_smote, y_train_smote)
smote_prediction = smote_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, smote_prediction))


###### Step 8: Random Under Sampling

# Randomly under sample the majority class
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus= rus.fit_resample(X_train, y_train)

# Check the number of records after under sampling
print(sorted(Counter(y_train_rus).items()))

# Convert the data from numpy array to a pandas dataframe
df_rus = pd.DataFrame({'feature1': X_train_rus[:, 0], 'feature2': X_train_rus[:, 1], 'target': y_train_rus})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_rus)
plt.title('Random Under Sampling')

# Train the random forest model
# rf = RandomForestClassifier()
rus_model = rf.fit(X_train_rus, y_train_rus)
rus_prediction = rus_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, rus_prediction))


###### Step 9: Under Sampling Using NearMiss

# Under sample the majority class
nearmiss = NearMiss(version=3)
X_train_nearmiss, y_train_nearmiss= nearmiss.fit_resample(X_train, y_train)

# Check the number of records after under sampling
print(sorted(Counter(y_train_nearmiss).items()))

# Convert the data from numpy array to a pandas dataframe
df_nearmiss = pd.DataFrame({'feature1': X_train_nearmiss[:, 0], 'feature2': X_train_nearmiss[:, 1], 'target': y_train_nearmiss})

# Plot the chart
plt.figure(figsize=(12, 8))
sns.scatterplot(x = 'feature1', y = 'feature2', hue = 'target', data = df_nearmiss)
plt.title('NearMiss Under Sampling')

# Train the random forest model
# rf = RandomForestClassifier()
nearmiss_model = rf.fit(X_train_nearmiss, y_train_nearmiss)
nearmiss_prediction = nearmiss_model.predict(X_test)

# Check the model performance
print(classification_report(y_test, nearmiss_prediction))

Summary

In this tutorial, we covered how to use oversampling and under-sampling techniques for imbalanced classification models. You have learned:

  • What imbalanced classification is
  • How to decide the model performance metrics
  • How to do oversampling using random oversampling and SMOTE
  • How to do under-sampling using random under-sampling and Near Miss
  • How to compare the performance of oversampling and under-sampling

References