Neural Network Model Balanced Weight For Imbalanced Classification In Keras

When using a neural network model to classify imbalanced data, we can adjust the balanced weight for the cost function to give more attention to the minority class. Python’s Keras library has a built-in option called class_weight to help us achieve this quickly.

One benefit of the balanced weight adjustment is that we can build the model directly on the imbalanced data, without oversampling or under-sampling before training. To learn about oversampling and under-sampling techniques, please check my previous posts here and here.

In this tutorial, we will go over the following topics:

  • Baseline neural network model for imbalanced classification
  • Calculate class weight using sklearn
  • Apply class weight on a neural network model
  • Apply manual class weight on a neural network model

Resources for this post:

Class Weight Control for Neural Network Model – GrabNGoInfo

Step 1: Import Libraries

# Synthetic dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np
from collections import Counter

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model and performance
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from keras.layers import Dense
from keras.models import Sequential
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils import class_weight

Step 2: Create Imbalanced Dataset

Using make_classification from the sklearn library, we created two classes with the ratio between the majority class and the minority class being 0.995:0.005. Both features are informative predictors; we did not include any redundant or repeated features in this dataset.

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize=True)

The output shows that we have about 1% of the data in the minority class and 99% in the majority class. The minority share is slightly above the 0.5% we requested because make_classification flips 1% of the labels at random by default (flip_y=0.01).

0    0.9897
1    0.0103
Name: target, dtype: float64

Step 3: Train Test Split

In this step, we split the dataset into 80% training data and 20% test data. Setting random_state ensures that we get the same train test split every time. The seed for random_state does not have to be 42; it can be any fixed number.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")

The train test split gives us 80,000 records for the training dataset and 20,000 for the test dataset. Within the training dataset, 79,183 data points belong to the majority class and 817 to the minority class.

The number of records in the training dataset is 80000
The number of records in the test dataset is 20000
The training dataset has 79183 records for the majority class and 817 records for the minority class.
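
With such a rare minority class, a plain random split can shift the class ratio slightly between the two datasets. As an optional variation that we do not use in this tutorial, train_test_split accepts a stratify argument that preserves the class proportions exactly in both splits:

# Optional stratified split (not used in this tutorial): keeps the class
# ratio identical in the training and test datasets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)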

Step 4: Baseline Neural Network Model

This step creates a neural network model on the imbalanced training dataset as the baseline model.

We created the neural network model with one input layer, one hidden layer, and one output layer. Since we have two features, input_dim is 2. The input and hidden layers have two neurons each, and the output layer has one neuron.

The activation function for the input and hidden layers is 'relu', a popular choice with good performance. The output activation function is 'sigmoid', which squashes the output into the (0, 1) range so it can be read as the probability of the positive class, making it suitable for binary classification.

# Train the neural network model using the imbalanced dataset
# Create model
nn_model = Sequential()
nn_model.add(Dense(2, input_dim=2, activation='relu'))
nn_model.add(Dense(2, activation='relu'))
nn_model.add(Dense(1, activation='sigmoid'))

We set the loss to be 'binary_crossentropy' when compiling the model because we are building a binary classification model. For a multi-class classification model, the loss is usually 'categorical_crossentropy', and for a linear regression model, the loss is usually 'mean_squared_error'.

The optimizer is responsible for changing the weights and the learning rate to reduce the loss. 'adam' is a widely used optimizer.
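
For reference, here is an illustration (my own sketch, not part of the original code; the make_model helper is hypothetical) of how the same small architecture would be compiled for each of the three task types mentioned above:

# Illustrative compile calls; only binary_crossentropy is used in this tutorial
def make_model(output_units, output_activation):
    m = Sequential()
    m.add(Dense(2, input_dim=2, activation='relu'))
    m.add(Dense(output_units, activation=output_activation))
    return m

binary_clf = make_model(1, 'sigmoid')
binary_clf.compile(loss='binary_crossentropy', optimizer='adam')

multiclass_clf = make_model(3, 'softmax')  # e.g., 3 classes with one-hot targets
multiclass_clf.compile(loss='categorical_crossentropy', optimizer='adam')

regressor = make_model(1, 'linear')
regressor.compile(loss='mean_squared_error', optimizer='adam')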

# Compile model
nn_model.compile(loss='binary_crossentropy', optimizer='adam')

After compiling the model, we fit the neural network model on the training dataset. Setting epochs to 50 means that the model will go through the training dataset 50 times. The batch_size of 100 means that the weights are updated after every 100 data points, so each epoch on our 80,000 training records performs 800 weight updates.

# Fit the model
nn_model.fit(X_train, y_train, epochs=50, batch_size=100)

Now let’s make predictions on the testing dataset and check the model performance.

# Prediction
nn_model_prediction = nn_model.predict(X_test)
nn_model_classes = [1 if i > 0.5 else 0 for i in nn_model_prediction]

# Check the model performance
print(classification_report(y_test, nn_model_classes))

We got a minority-class recall of 0, which means that the neural network model did not predict any minority data points correctly. Because the majority class dominates the loss, the model learns to predict class 0 for every record and still achieves 99% accuracy.

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     19787
           1       0.00      0.00      0.00       213

    accuracy                           0.99     20000
   macro avg       0.49      0.50      0.50     20000
weighted avg       0.98      0.99      0.98     20000
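
Accuracy alone hides this failure. Since roc_auc_score is already imported, we can add a threshold-free check of the predicted probabilities (my own addition, so the exact value is not shown in the original output):

# ROC AUC scores the predicted probabilities directly, without the 0.5
# threshold, so it can still rank models that never predict the minority class
print('Baseline ROC AUC:', roc_auc_score(y_test, nn_model_prediction.ravel()))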

Let’s see if the balanced weight can help us.

Step 5: Calculate Class Weight Using Sklearn

sklearn has a built-in utility function compute_class_weight to calculate the class weights. The weights are the inverse of the class frequencies: each class c gets the weight n_samples / (n_classes * n_c).

# Calculate weights using sklearn
sklearn_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                    classes=np.unique(y_train),
                                                    y=y_train)
sklearn_weights
array([ 0.50515894, 48.95960832])
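
As a sanity check (my own addition), we can reproduce these numbers by hand from the class counts in the training dataset:

# Reproduce the 'balanced' heuristic manually: weight_c = n_samples / (n_classes * n_c)
counts = np.bincount(y_train)       # array([79183, 817])
print(len(y_train) / (2 * counts))  # approximately [0.5052, 48.9596]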

The computed weights from sklearn come back as an array. We need to transform them into a dictionary because Keras expects a dictionary that maps each class label to its weight.

# Transform array to dictionary
sklearn_weights = dict(enumerate(sklearn_weights))
sklearn_weights
{0: 0.5051589356301226, 1: 48.959608323133416}

Step 6: Neural Network Model With Balanced Weight

In this step, we keep all the hyperparameters the same as in the baseline model. The only difference is that we pass the sklearn_weights dictionary to the class_weight argument when fitting the model. Keras then multiplies each training sample's contribution to the loss by the weight of its class.
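
For intuition, the same effect can be written out per sample (an illustrative sketch, not used in the code below): building a weight array and passing it to the sample_weight argument of fit is equivalent to passing the class_weight dictionary.

# Per-sample formulation equivalent to class_weight (illustrative only)
sample_weights = np.where(y_train == 1,
                          sklearn_weights[1],  # minority class weight
                          sklearn_weights[0])  # majority class weight
# nn_model_balanced.fit(X_train, y_train, epochs=50, batch_size=100,
#                       sample_weight=sample_weights)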

# Train the neural network model using the balanced class weights
# Create model
nn_model_balanced = Sequential()
nn_model_balanced.add(Dense(2, input_dim=2, activation='relu'))
nn_model_balanced.add(Dense(2, activation='relu'))
nn_model_balanced.add(Dense(1, activation='sigmoid'))

# Compile model
nn_model_balanced.compile(loss='binary_crossentropy', optimizer='adam')

# Fit the model
nn_model_balanced.fit(X_train, y_train, epochs=50, batch_size=100,
                      class_weight=sklearn_weights)

# Prediction
nn_model_balanced_prediction = nn_model_balanced.predict(X_test)
nn_model_balanced_classes = [1 if i > 0.5 else 0 for i in nn_model_balanced_prediction]

# Check the model performance
print(classification_report(y_test, nn_model_balanced_classes))

We can see that the minority recall increased from 0 to 56%, which is a significant improvement. Note that your results may differ from mine because of the randomness in neural network training (see the reproducibility sketch after the report), but the difference should be small.

              precision    recall  f1-score   support

           0       0.99      0.62      0.76     19787
           1       0.02      0.56      0.03       213

    accuracy                           0.62     20000
   macro avg       0.50      0.59      0.40     20000
weighted avg       0.98      0.62      0.75     20000
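
If you want repeated runs to match exactly, a reproducibility sketch (my own addition, assuming a TensorFlow backend) is to fix the random seeds before building and fitting the model:

# Fix the random seeds so repeated runs produce the same weights and report
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)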

Step 7: Manual Balanced Weight On Neural Network Model

Although balanced weights are commonly calculated as the inverse of the class frequencies, we can also set our own weights and tune them as a hyperparameter (see the search sketch at the end of this step). For example, we can set the majority-to-minority cost penalty ratio to 1:200.

manual_weights = {0: 1, 1: 200}

# Train the neural network model using the manual class weights
# Create model
nn_model_mbalanced = Sequential()
nn_model_mbalanced.add(Dense(2, input_dim=2, activation='relu'))
nn_model_mbalanced.add(Dense(2, activation='relu'))
nn_model_mbalanced.add(Dense(1, activation='sigmoid'))

# Compile model
nn_model_mbalanced.compile(loss='binary_crossentropy', optimizer='adam')

# Fit the model
nn_model_mbalanced.fit(X_train, y_train, epochs=50, batch_size=100,
                       class_weight=manual_weights)

# Prediction
nn_model_mbalanced_prediction = nn_model_mbalanced.predict(X_test)
nn_model_mbalanced_classes = [1 if i > 0.5 else 0 for i in nn_model_mbalanced_prediction]

# Check the model performance
print(classification_report(y_test, nn_model_mbalanced_classes))

We are able to capture 98% of the minority class after increasing the cost penalty for the minority class. The trade-off is visible in the report below: the majority-class recall drops to 8% and overall accuracy to 9%, so such an extreme weight only makes sense when missing the minority class is far more costly than a false alarm.

              precision    recall  f1-score   support

           0       1.00      0.08      0.16     19787
           1       0.01      0.98      0.02       213

    accuracy                           0.09     20000
   macro avg       0.50      0.53      0.09     20000
weighted avg       0.99      0.09      0.15     20000
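
Because the manual weight is a hyperparameter, we can search over candidate values. The sketch below is my own illustration (the candidate list and loop are assumptions, and a separate validation set should ideally be used instead of the test set for tuning):

# Compare minority recall and precision across several candidate weights
from sklearn.metrics import precision_score, recall_score

for w in [10, 50, 100, 200]:
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu'))
    model.add(Dense(2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    model.fit(X_train, y_train, epochs=50, batch_size=100,
              class_weight={0: 1, 1: w}, verbose=0)
    pred = (model.predict(X_test) > 0.5).astype(int).ravel()
    print(f"weight {w}: recall={recall_score(y_test, pred):.2f}, "
          f"precision={precision_score(y_test, pred, zero_division=0):.2f}")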

Step 8: Put All Code Together

###### Step 1: Import Libraries

# Synthetic dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np
from collections import Counter

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model and performance
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from keras.layers import Dense
from keras.models import Sequential
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils import class_weight


###### Step 2: Create Imbalanced Dataset

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize=True)


###### Step 3: Train Test Split

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")


###### Step 4: Baseline Neural Network Model

# Train the neural network model using the imbalanced dataset
# Create model
nn_model = Sequential()
nn_model.add(Dense(2, input_dim=2, activation='relu'))
nn_model.add(Dense(2, activation='relu'))
nn_model.add(Dense(1, activation='sigmoid'))

# Compile model
nn_model.compile(loss='binary_crossentropy', optimizer='adam')

# Fit the model
nn_model.fit(X_train, y_train, epochs=50, batch_size=100)

# Prediction
nn_model_prediction = nn_model.predict(X_test)
nn_model_classes = [1 if i > 0.5 else 0 for i in nn_model_prediction]

# Check the model performance
print(classification_report(y_test, nn_model_classes))


###### Step 5: Calculate Class Weight Using Sklearn

# Calculate weights using sklearn
sklearn_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                    classes=np.unique(y_train),
                                                    y=y_train)
sklearn_weights

# Transform array to dictionary
sklearn_weights = dict(enumerate(sklearn_weights))
sklearn_weights


###### Step 6: Neural Network Model With Balanced Weight

# Train the neural network model using the balanced class weights
# Create model
nn_model_balanced = Sequential()
nn_model_balanced.add(Dense(2, input_dim=2, activation='relu'))
nn_model_balanced.add(Dense(2, activation='relu'))
nn_model_balanced.add(Dense(1, activation='sigmoid'))

# Compile model
nn_model_balanced.compile(loss='binary_crossentropy', optimizer='adam')

# Fit the model
nn_model_balanced.fit(X_train, y_train, epochs=50, batch_size=100,
                      class_weight=sklearn_weights)

# Prediction
nn_model_balanced_prediction = nn_model_balanced.predict(X_test)
nn_model_balanced_classes = [1 if i > 0.5 else 0 for i in nn_model_balanced_prediction]

# Check the model performance
print(classification_report(y_test, nn_model_balanced_classes))


###### Step 7: Manual Balanced Weight On Neural Network Model

manual_weights = {0: 1, 1: 200}

# Train the neural network model using the manual class weights
# Create model
nn_model_mbalanced = Sequential()
nn_model_mbalanced.add(Dense(2, input_dim=2, activation='relu'))
nn_model_mbalanced.add(Dense(2, activation='relu'))
nn_model_mbalanced.add(Dense(1, activation='sigmoid'))

# Compile model
nn_model_mbalanced.compile(loss='binary_crossentropy', optimizer='adam')

# Fit the model
nn_model_mbalanced.fit(X_train, y_train, epochs=50, batch_size=100,
                       class_weight=manual_weights)

# Prediction
nn_model_mbalanced_prediction = nn_model_mbalanced.predict(X_test)
nn_model_mbalanced_classes = [1 if i > 0.5 else 0 for i in nn_model_mbalanced_prediction]

# Check the model performance
print(classification_report(y_test, nn_model_mbalanced_classes))

Summary

In this tutorial, we built neural network models with and without balanced class weights for imbalanced classification. The results show that the balanced weight significantly improved the model's ability to capture the minority class. Python's sklearn library can compute the balanced weights from the class frequencies, but we can also supply our own weights and tune them as a hyperparameter.

References

Keras documentation
