Local Outlier Factor (LOF) For Anomaly Detection

Local Outlier Factor (LOF) is an unsupervised model for outlier detection. It compares the local density of each data point with its neighbors and identifies the data points with a lower density as anomalies or outliers.
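To make that idea concrete, here is a minimal sketch on a toy dataset made up for this illustration: four points form a dense cluster and one point sits far away, so LOF flags the far point as an outlier.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A tiny toy dataset: a dense cluster of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # -1 marks outliers, 1 marks inliers
print(labels)

# negative_outlier_factor_ stores the negated LOF score of each point; values
# far below -1 mean the point's local density is much lower than its neighbors'
print(lof.negative_outlier_factor_)
```

The cluster points get LOF scores close to 1 (density similar to their neighbors), while the isolated point gets a much larger LOF score and is labeled -1.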

In this tutorial, we will talk about

  • What’s the difference between novelty detection and outlier detection?
  • When to use novelty detection vs. outlier detection?
  • How to use Local Outlier Factor (LOF) for novelty detection?
  • How to use Local Outlier Factor (LOF) for anomaly or outlier detection?

Resources for this post:

  • Python code is at the end of the post. Click here for the notebook.
  • More video tutorials on anomaly detection
  • More blog posts on anomaly detection
  • If you prefer the video version of the tutorial, please check out the video on YouTube

Step 1: Import Libraries

The first step is to import libraries. We need make_classification from sklearn to create the modeling dataset, pandas and numpy for data processing, and Counter to count the number of records in each class.

Matplotlib is for visualization. We also need train_test_split to create the training and validation datasets, LocalOutlierFactor for modeling, and classification_report for model performance evaluation.

# Synthetic dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np
from collections import Counter

# Visualization
import matplotlib.pyplot as plt

# Model and performance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report

Step 2: Create Dataset With Anomalies

Using make_classification from the sklearn library, we created two classes with a 0.995:0.005 ratio between the majority and the minority class. Two informative features were made as predictors, and no redundant or repeated features were included in this dataset.

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize = True)

The output shows that we have about 1% of the data in the minority class and 99% in the majority class, which means we have around 1% anomalies.

Step 3: Train Test Split

In this step, we split the dataset into 80% training data and 20% validation data. Setting random_state ensures that we get the same train test split every time; the seed does not have to be 42 and can be any number.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")

The train test split gives us 80,000 records for the training dataset and 20,000 for the validation dataset. Within the training dataset, 79,183 data points come from the majority class and 817 from the minority class.

The number of records in the training dataset is 80000
The number of records in the test dataset is 20000
The training dataset has 79183 records for the majority class and 817 records for the minority class.

Step 4: Outlier / Anomaly Detection vs. Novelty Detection

The Local Outlier Factor (LOF) algorithm can be used for both outlier/anomaly detection and novelty detection. The difference between the two lies in the training dataset.

Outlier/anomaly detection includes outliers in the training dataset. The algorithm fits to the high-density regions of the data and ignores the outliers and anomalies.

Novelty detection only includes the normal data points when training the model. Then the model will take a new dataset with outliers/anomalies for prediction. The outliers in novelty detection are also called novelties.

When to use novelty detection vs. outlier detection? That depends on what data is available. If we have a dataset with outlier labels, we can use either approach. Otherwise, we can only use outlier detection, because we cannot construct a training dataset that contains only normal data.
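This difference shows up directly in the sklearn API. Below is a minimal sketch on synthetic data made up for this example: with novelty=False we call fit_predict on a single dataset, while with novelty=True we fit on clean data and then call predict on unseen data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_normal = rng.normal(size=(100, 2))        # normal points only
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])  # unseen data with one obvious novelty

# Outlier detection mode: fit and predict on the same dataset
lof_out = LocalOutlierFactor(novelty=False)
labels_out = lof_out.fit_predict(X_normal)

# Novelty detection mode: fit on clean data, then label unseen data
lof_nov = LocalOutlierFactor(novelty=True).fit(X_normal)
labels_nov = lof_nov.predict(X_new)
print(labels_nov)  # the far-away point is labeled -1
```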

Step 5: Novelty Detection Using Local Outlier Factor (LOF)

Python’s sklearn library has an implementation of Local Outlier Factor (LOF). To use it for novelty detection, we need to set the hyperparameter novelty to True. fit_predict is not available in this mode because the algorithm fits and predicts on different datasets: we fit on a training dataset containing only normal data and predict on a testing dataset that includes outliers.

# Keep only the normal data for the training dataset
X_train_normal = X_train[np.where(y_train == 0)]

# Train the local outlier factor (LOF) model for novelty detection
lof_novelty = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(X_train_normal)

# Predict novelties
prediction_novelty = lof_novelty.predict(X_test)

# Change the anomalies' values to make it consistent with the true values
prediction_novelty = [1 if i==-1 else 0 for i in prediction_novelty]

# Check the model performance
print(classification_report(y_test, prediction_novelty))

We can see that the Local Outlier Factor (LOF) novelty detection captured 2% of the outliers/anomalies.

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     19787
           1       0.05      0.02      0.03       213

    accuracy                           0.99     20000
   macro avg       0.52      0.51      0.51     20000
weighted avg       0.98      0.99      0.98     20000
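The labels above come from the built-in threshold (controlled in sklearn by the contamination hyperparameter). With novelty=True, we can also read the raw scores through score_samples and pick our own cutoff. The sketch below uses synthetic data and an illustrative 5% cutoff, neither of which comes from the tutorial dataset.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_clean = rng.normal(size=(500, 2))                  # clean training data
X_eval = np.vstack([rng.normal(size=(95, 2)),        # mostly normal points
                    rng.uniform(4, 6, size=(5, 2))]) # five injected anomalies

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_clean)

# score_samples returns the negated LOF score: lower means more abnormal
scores = lof.score_samples(X_eval)

# Flag the lowest 5% of scores instead of relying on the default threshold
threshold = np.percentile(scores, 5)
prediction = (scores <= threshold).astype(int)
print(prediction.sum())
```

A percentile-based cutoff like this is handy when we have a rough idea of the expected anomaly rate.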

Step 6: Outlier Detection Using Local Outlier Factor (LOF)

Local Outlier Factor (LOF) for outlier detection trains and predicts on the same dataset. So if we would like to compare the model performance between novelty detection and outlier detection, we need to fit and predict on the testing dataset. We also need to set novelty to False to enable the outlier detection algorithm.

# The local outlier factor (LOF) model for outlier detection
lof_outlier = LocalOutlierFactor(n_neighbors=5, novelty=False)

# Predict outliers
prediction_outlier = lof_outlier.fit_predict(X_test)

# Change the anomalies' values to make it consistent with the true values
prediction_outlier = [1 if i==-1 else 0 for i in prediction_outlier]

# Check the model performance
print(classification_report(y_test, prediction_outlier))

We can see that the Local Outlier Factor (LOF) outlier/anomaly detection captured 3% of the outliers/anomalies, which is slightly better than the novelty detection result.

              precision    recall  f1-score   support

           0       0.99      0.99      0.99     19787
           1       0.06      0.03      0.04       213

    accuracy                           0.98     20000
   macro avg       0.53      0.51      0.52     20000
weighted avg       0.98      0.98      0.98     20000
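In outlier detection mode, the fitted model also exposes the negative_outlier_factor_ attribute, which lets us rank points from most to least anomalous instead of relying only on the hard labels. A minimal sketch on synthetic data made up for this example:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(size=(99, 2)), [[8.0, 8.0]]])  # planted anomaly at index 99

lof = LocalOutlierFactor(n_neighbors=20, novelty=False)
lof.fit_predict(X)

# Sort indices by the negated LOF score: the most anomalous point comes first
ranked = np.argsort(lof.negative_outlier_factor_)
print(ranked[0])  # index of the most anomalous point
```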

Step 7: Visualization

In this step, we plot the data points to compare the ground truth, the LOF novelty detection results, and the LOF outlier detection results.

# Put the testing dataset and predictions in the same dataframe
df_test = pd.DataFrame(X_test, columns=['feature1', 'feature2'])
df_test['y_test'] = y_test
df_test['prediction_novelty'] = prediction_novelty
df_test['prediction_outlier'] = prediction_outlier

# Visualize the actual and predicted anomalies
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, sharey=True, figsize=(20, 6))

# Ground truth
ax0.set_title('Original')
ax0.scatter(df_test['feature1'], df_test['feature2'], c=df_test['y_test'], cmap='rainbow')

# Local Outlier Factor (LOF) Novelty Detection
ax1.set_title('LOF Novelty Detection')
ax1.scatter(df_test['feature1'], df_test['feature2'], c=df_test['prediction_novelty'], cmap='rainbow')

# Local Outlier Factor (LOF) Outlier / Anomaly Detection
ax2.set_title('LOF Outlier / Anomaly Detection')
ax2.scatter(df_test['feature1'], df_test['feature2'], c=df_test['prediction_outlier'], cmap='rainbow')

We can see that in this example, the outlier detection identified more outliers than the novelty detection.

LOF Novelty Detection vs. Anomaly Detection – GrabNGoInfo.com
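One caveat worth noting: both models above use n_neighbors=5, and LOF results can be sensitive to that choice. The sketch below (synthetic data and neighborhood sizes made up for this example) compares how many points get flagged as we vary n_neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(300, 2)),          # dense normal cluster
               rng.uniform(-5, 5, size=(15, 2))])  # sparse scattered points

# Count how many points each neighborhood size flags as outliers
counts = {k: int((LocalOutlierFactor(n_neighbors=k).fit_predict(X) == -1).sum())
          for k in (5, 20, 50)}
print(counts)
```

When labeled validation data is available, n_neighbors can be tuned against a metric such as recall on the minority class.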

Step 8: Put All Code Together

###### Step 1: Import Libraries

# Synthetic dataset
from sklearn.datasets import make_classification

# Data processing
import pandas as pd
import numpy as np
from collections import Counter

# Visualization
import matplotlib.pyplot as plt

# Model and performance
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


###### Step 2: Create Dataset With Anomalies

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
df['target'].value_counts(normalize = True)


###### Step 3: Train Test Split

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")


###### Step 4: Outlier / Anomaly Detection vs. Novelty Detection

# No code in this step


###### Step 5: Novelty Detection Using Local Outlier Factor (LOF)

# Keep only the normal data for the training dataset
X_train_normal = X_train[np.where(y_train == 0)]

# Train the local outlier factor (LOF) model for novelty detection
lof_novelty = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(X_train_normal)

# Predict novelties
prediction_novelty = lof_novelty.predict(X_test)

# Change the anomalies' values to make it consistent with the true values
prediction_novelty = [1 if i==-1 else 0 for i in prediction_novelty]

# Check the model performance
print(classification_report(y_test, prediction_novelty))


###### Step 6: Outlier Detection Using Local Outlier Factor (LOF)

# The local outlier factor (LOF) model for outlier detection
lof_outlier = LocalOutlierFactor(n_neighbors=5, novelty=False)

# Predict outliers
prediction_outlier = lof_outlier.fit_predict(X_test)

# Change the anomalies' values to make it consistent with the true values
prediction_outlier = [1 if i==-1 else 0 for i in prediction_outlier]

# Check the model performance
print(classification_report(y_test, prediction_outlier))


###### Step 7: Visualization

# Put the testing dataset and predictions in the same dataframe
df_test = pd.DataFrame(X_test, columns=['feature1', 'feature2'])
df_test['y_test'] = y_test
df_test['prediction_novelty'] = prediction_novelty
df_test['prediction_outlier'] = prediction_outlier

# Visualize the actual and predicted anomalies
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, sharey=True, figsize=(20, 6))

# Ground truth
ax0.set_title('Original')
ax0.scatter(df_test['feature1'], df_test['feature2'], c=df_test['y_test'], cmap='rainbow')

# Local Outlier Factor (LOF) Novelty Detection
ax1.set_title('LOF Novelty Detection')
ax1.scatter(df_test['feature1'], df_test['feature2'], c=df_test['prediction_novelty'], cmap='rainbow')

# Local Outlier Factor (LOF) Outlier / Anomaly Detection
ax2.set_title('LOF Outlier / Anomaly Detection')
ax2.scatter(df_test['feature1'], df_test['feature2'], c=df_test['prediction_outlier'], cmap='rainbow')

Summary

This tutorial demonstrated how to use Local Outlier Factor (LOF) for outlier and novelty detection.

Using the sklearn library in Python, we covered

  • What’s the difference between novelty detection and outlier detection?
  • When to use novelty detection vs. outlier detection?
  • How to use Local Outlier Factor (LOF) for novelty detection?
  • How to use Local Outlier Factor (LOF) for anomaly or outlier detection?

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
