Isolation Forest For Anomaly Detection And Imbalanced Classification

Isolation Forest For Anomaly Detection

Isolation forest uses the number of tree splits needed to isolate a data point to identify anomalies or minority classes in an imbalanced dataset. The idea is that anomalous data points can be isolated with fewer splits because the density around them is low. Python’s sklearn library has an implementation of the isolation forest model.

Isolation forest is an unsupervised algorithm, where the actual labels of normal vs. anomaly data points are not used in model training.

To learn how to use supervised models to identify abnormal data points, please refer to Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python, and Neural Network Model Balanced Weight For Imbalanced Classification In Keras.

In this article, you will learn

  • What is the isolation forest model
  • How to build an isolation forest model using Python
  • How to use an isolation forest model to do anomaly detection
  • How to continue training an isolation forest model using new data
  • How to continue training an isolation forest model using more trees

Resources for this post:

Anomaly Detection with Isolation Forest | GrabNGoInfo

Step 1: Import Libraries

The first step is to import libraries. We need make_classification from sklearn to create the modeling dataset, pandas and numpy for data processing, and Counter to count the number of records in each class.

We also need train_test_split to create the training and validation datasets, IsolationForest for modeling, and classification_report for model performance evaluation.

# Synthetic dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np
from collections import Counter

# Model and performance
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Step 2: Create Imbalanced Dataset

Using make_classification from the sklearn library, we created two classes with the ratio between the majority class and the minority class being 0.995:0.005. Both features were made informative predictors; we did not include any redundant or repeated features in this dataset.

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize=True)

The output shows that about 1% of the data is in the minority class and 99% in the majority class. The minority share is roughly double the 0.5% weight we specified because make_classification randomly flips a small fraction of labels by default (flip_y=0.01).

0    0.9897
1    0.0103
Name: target, dtype: float64

Step 3: Train Test Split

In this step, we split the dataset into 80% training data and 20% validation data. Setting random_state ensures that we get the same train test split every time; the seed does not have to be 42 and can be any fixed number.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")

The train test split gives us 80,000 records for the training dataset and 20,000 for the validation dataset. Within the training dataset, 79,183 data points come from the majority class and 817 from the minority class.

The number of records in the training dataset is 80000
The number of records in the test dataset is 20000
The training dataset has 79183 records for the majority class and 817 records for the minority class.
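
One side note: a plain random split preserves the class ratio only approximately. If an exact 99:1 split matters, train_test_split accepts a stratify argument; a minimal variant (the _s suffixes are just illustrative names, and this option is not used in the rest of the article):

# Stratified variant: preserves the majority/minority ratio in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)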

Step 4: Train Isolation Forest Model

Isolation forest identifies anomalies by isolating outliers using randomly built trees. The steps, followed by a reference formula below, are:

  1. For a tree, repeatedly select a feature at random and split it at a random value between that feature's minimum and maximum.
  2. For each data point, there is a splitting path from the root node to the leaf node that isolates it. Calculate the path length for each data point.
  3. Repeat step 1 and step 2 for each tree.
  4. Get the average path length across all trees.
  5. The anomalies have a shorter average path length than normal data points.
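
For reference, the original isolation forest paper by Liu, Ting, and Zhou converts the average path length into an anomaly score (sklearn computes this internally; it is shown here only for intuition):

s(x, n) = 2^(-E(h(x)) / c(n)),  where  c(n) = 2H(n-1) - 2(n-1)/n  and  H(i) ≈ ln(i) + 0.5772

Here h(x) is the path length of point x, E(h(x)) is its average over all trees, and c(n) normalizes by the average path length of an unsuccessful search in a binary search tree built on n samples. Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points.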
# Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i==-1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))

We train the isolation forest model on the training dataset and make predictions on the testing dataset. By default, isolation forest labels normal data points as 1 and anomalies as -1. To compare the predictions with the ground truth in the testing dataset, we change the anomalies' labels from -1 to 1 and the normal labels from 1 to 0.
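
As a side note, the list comprehension can be replaced by a vectorized numpy call applied to the raw predict() output, which is faster on large arrays (an equivalent alternative, not used elsewhere in this article):

import numpy as np

# Relabel the raw predictions: -1 (anomaly) becomes 1, and 1 (normal) becomes 0
raw_prediction = if_model.predict(X_test)
if_prediction = np.where(raw_prediction == -1, 1, 0)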

              precision    recall  f1-score   support

           0       0.99      0.79      0.88     19787
           1       0.02      0.38      0.04       213

    accuracy                           0.79     20000
   macro avg       0.51      0.58      0.46     20000
weighted avg       0.98      0.79      0.87     20000

The model has a recall of 38% for the minority class, meaning that it captures 38% of the anomaly data points.
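
The 1/-1 labels come from thresholding an internal anomaly score, and sklearn also exposes the scores directly. A minimal sketch for ranking points or choosing a custom cutoff, assuming the model fitted above (the 1% cutoff is an illustrative choice):

import numpy as np

# score_samples returns the negated anomaly score, so lower means more abnormal;
# decision_function is the same score shifted by the fitted offset_, and
# predict() labels points with negative decision values as anomalies (-1)
scores = if_model.score_samples(X_test)
decisions = if_model.decision_function(X_test)

# Flag the 1% most abnormal points with a custom cutoff instead of predict()
threshold = np.quantile(scores, 0.01)
custom_flags = (scores <= threshold).astype(int)  # 1 = anomaly, 0 = normal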

Step 5: Isolation Forest With Warm Start On New Data

Isolation forest supports a warm start, which lets us train additional trees on top of an existing model instead of refitting from scratch.

Suppose we collected more data after training the model. Then we can use the newly collected data to train additional trees on top of the existing model.

Let’s create more data using make_classification.

# Create more imbalanced data
X_more, y_more = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

We set the option of warm_start=True for the original isolation forest model, then added 50 trees trained on the new dataset.

# Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0, warm_start=True).fit(X_train)

# Use new data to train 50 trees on top of existing model 
if_model.n_estimators += 50
if_model.fit(X_more)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i==-1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))

Recall for the minority class increases by one percentage point (from 38% to 39%), although overall accuracy and the minority-class f1-score drop slightly.

              precision    recall  f1-score   support

           0       0.99      0.74      0.85     19787
           1       0.02      0.39      0.03       213

    accuracy                           0.74     20000
   macro avg       0.50      0.57      0.44     20000
weighted avg       0.98      0.74      0.84     20000
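
To confirm that the warm start grew the existing forest instead of refitting it, we can inspect the fitted estimators_ attribute; a quick sanity check, assuming the model from the code above:

# 100 trees from the initial fit plus 50 warm-started trees
print(len(if_model.estimators_))
# Expected output: 150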

Step 6: Isolation Forest With Warm Start On New Trees

Even when no new data is available, we can still warm-start the isolation forest and add more trees trained on the existing data, which may improve model performance.

The code below shows how to train additional trees using the same modeling dataset.

# Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0, warm_start=True).fit(X_train)

# Use the existing data to train 20 trees on top of existing model 
if_model.n_estimators += 20
if_model.fit(X_train)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i==-1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))

              precision    recall  f1-score   support

           0       0.99      0.79      0.88     19787
           1       0.02      0.38      0.04       213

    accuracy                           0.79     20000
   macro avg       0.51      0.58      0.46     20000
weighted avg       0.98      0.79      0.87     20000

In this example, the 20 extra trees leave the metrics essentially unchanged, but the options compose: we can keep training the isolation forest with more data, with more trees on the existing data, or with both, as sketched below.
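
A short sketch combining both options, assuming X_train and the new batch X_more from Step 5 (the tree counts are illustrative):

# Fit an initial warm-startable forest on the original data
if_model = IsolationForest(n_estimators=100, random_state=0, warm_start=True).fit(X_train)

# Add 50 trees trained on the newly collected data
if_model.n_estimators += 50
if_model.fit(X_more)

# Add another 20 trees trained on the original data
if_model.n_estimators += 20
if_model.fit(X_train)

# The ensemble now holds 100 + 50 + 20 = 170 trees
print(len(if_model.estimators_))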

Step 7: Put All Code Together

###### Step 1: Import Libraries

# Synthetic dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np
from collections import Counter

# Model and performance
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


###### Step 2: Create Imbalanced Dataset

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize=True)


###### Step 3: Train Test Split

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")


###### Step 4: Train Isolation Forest Model

# Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i==-1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))


###### Step 5: Isolation Forest With Warm Start On New Data

# Create more imbalanced data
X_more, y_more = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0, warm_start=True).fit(X_train)

# Use new data to train 50 trees on top of existing model 
if_model.n_estimators += 50
if_model.fit(X_more)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i==-1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))


###### Step 6: Isolation Forest With Warm Start On New Trees

# Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0, warm_start=True).fit(X_train)

# Use the existing data to train 20 trees on top of existing model 
if_model.n_estimators += 20
if_model.fit(X_train)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i==-1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))

Step 8: Summary

In this article, we created a synthetic dataset with anomalies and used it to demonstrate anomaly detection with an isolation forest.

Using the sklearn library in Python, we covered

  • What is the isolation forest model
  • How to build an isolation forest model using Python
  • How to use an isolation forest model to do anomaly detection
  • How to continue training an isolation forest model using new data
  • How to continue training an isolation forest model using more trees

To learn about detecting anomalies using a supervised model, please refer to the articles linked at the beginning of this post.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
