
4 Clustering Model Algorithms in Python and Which is the Best

Welcome to GrabNGoInfo! In this tutorial, we will talk about four clustering model algorithms, compare their results, and discuss how to choose a clustering algorithm for a project. You will learn:

  • What are the different types of clustering model algorithms?
  • How to run K-means, Gaussian Mixture Model (GMM), Hierarchical model, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) model in Python?
  • How to use PCA (Principal Component Analysis) and t-SNE (t-distributed stochastic neighbor embedding) for dimensionality reduction and visualization?
  • How to utilize clustering model results for the business?
  • How to select a clustering model algorithm for your project?

Resources for this post:

  • If you prefer the video version of the tutorial, watch the video below on YouTube.
4 clustering model algorithms and which is the best – GrabNGoInfo.com

Step 0: Clustering Model Algorithms

Based on the underlying algorithm for grouping the data, clustering models can be divided into different types.

The following four are the most widely used types of clustering models.

  • Centroid Model uses the distance between a data point and the centroid of the cluster to group data. K-means clustering is an example of a centroid model.
  • Distribution Model segments data based on their probability of belonging to the same distribution. Gaussian Mixture Model (GMM) is a popular distribution model.
  • Connectivity Model uses the closeness of the data points to decide the clusters. Hierarchical Clustering Model is a widely used connectivity model.
  • Density Model scans the data space and assigns clusters based on the density of data points. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density model.

In this tutorial, we will create clustering models for the same dataset using the four algorithms and compare their results.

Step 1: Import Libraries

In the first step, we will import the Python libraries.

  • pandas and numpy are for data processing.
  • matplotlib and seaborn are for visualization.
  • datasets from the sklearn library contains some toy datasets. We will use the iris dataset to illustrate and compare the different clustering algorithms.
  • PCA and TSNE are for dimensionality reduction.
  • KMeans, AgglomerativeClustering, GaussianMixture, and DBSCAN are for the clustering models.
# Data processing 
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
from sklearn import datasets

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Modeling
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

Step 2: Read Data

Step 2 reads the data. We first load the data using load_iris(), which returns the data in a Python dictionary-like format.

# Load data
iris = datasets.load_iris()

# Show data information
iris.keys()

The keys of the dictionary show that the iris dataset includes the data, target, frame, target names, description of the dataset, feature names, filename, and data module.

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

Next, let’s print out the feature_names, target_names, and target.

# Print feature and target information
print('The feature names are:', iris['feature_names'])
print('The target names are:', iris['target_names'])
print('The target values are:', iris['target'])

We can see that there are four features, sepal length, sepal width, petal length, and petal width. The targets are three different flowers, Setosa, Versicolor, and Virginica, which are encoded as the numbers 0, 1, and 2.

The feature names are: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
The target names are: ['setosa' 'versicolor' 'virginica']
The target values are: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

In order to use the data for the clustering model, we need to convert the data into a dataframe format. Using .info(), we can see that the dataset has 150 records, and there are no missing values.

# Put features data into a dataframe
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add target to the dataframe 
df['target'] = iris.target

# Data information
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

Using value_counts(), we can see that there are 50 records for each type of flower.

# Check counts of each category
df['target'].value_counts()

Output

0    50
1    50
2    50
Name: target, dtype: int64

A clustering model is an unsupervised model, so we will not use the target information in the model. Only the four features will be utilized, and the goal is to group the same type of flowers together. Therefore, a new dataframe called X is created that includes only the four features.

# Remove target for the clustering model
X = df[df.columns.difference(['target'])]

Step 3: Decide the Number of Clusters

After creating the modeling dataset and before running the model, we need to decide the number of clusters.

Deciding the optimal number of clusters is a critical step in building a good unsupervised clustering model. In my previous tutorial, I talked about 5 Ways for Deciding Number of Clusters in a Clustering Model. It covered the elbow method, the Silhouette score, the hierarchical graph, AIC, BIC, and gap statistics for deciding the number of clusters.

In this tutorial, we will not repeat the content and will use 3 as the number of clusters directly.
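
As a quick illustration only (the details are in that tutorial), a minimal silhouette-score loop might look like the sketch below. It assumes silhouette_score from sklearn.metrics and uses the X dataframe created in Step 2.

# A minimal sketch: compare silhouette scores for candidate numbers of clusters
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(f'k={k}, silhouette score={silhouette_score(X, labels):.3f}')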

Step 4: Kmeans Clustering (Model 1)

KMeans clustering algorithm works as follows for a dataset with n data points and k clusters:

  1. k data points are randomly selected as the centroids. In the Python sklearn implementation, this step corresponds to a hyperparameter called init. The default value is k-means++, an improved initialization that spreads the initial centroids far away from one another.
  2. Assign all the data points to the closest centroid, and we get k clusters.
  3. Calculate the new centroid for each cluster based on the data points in the cluster.
  4. Repeat step 2 and step 3 until the centroids do not change anymore.
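
To make these steps concrete, here is a minimal from-scratch sketch of the loop in NumPy. The kmeans_sketch function is for illustration only and is not part of the tutorial's code; the tutorial itself uses the sklearn implementation below.

# A from-scratch sketch of the KMeans loop (illustration only)
def kmeans_sketch(points, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid from its assigned points
        # (this sketch ignores the empty-cluster edge case)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical usage: labels, centers = kmeans_sketch(X.to_numpy(), k=3)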

When using the sklearn implementation of KMeans:

  • Firstly, we need to initiate the model using the method KMeans, and specify the number of clusters. The random_state is for results reproducibility.
  • Secondly, we fit and predict on the modeling data, and save the prediction output in a variable called y_kmeans.
  • Next, the prediction results are saved in the dataframe as a column.
  • Finally, we check the counts for each cluster and see that there are 62, 50, and 38 data points in the three clusters respectively.
# Kmeans model
kmeans = KMeans(n_clusters = 3,  random_state = 42)

# Fit and predict on the data
y_kmeans = kmeans.fit_predict(X)

# Save the predictions as a column
df['y_kmeans']=y_kmeans

# Check the distribution
df['y_kmeans'].value_counts()

Output

0    62
1    50
2    38
Name: y_kmeans, dtype: int64

KMeans Pros

  • KMeans is fast and scalable

KMeans Cons

  • The model performance is highly impacted by the initial centroids. Some centroid initializations can produce sub-optimal results.
  • The KMeans model does not perform well when the clusters vary a lot in size, have different densities, or have non-spherical shapes [1].

Step 5: Hierarchical Clustering (Model 2)

AgglomerativeClustering is a type of hierarchical clustering algorithm.

  • It uses a bottom-up approach and starts each data point as an individual cluster.
  • Then the clusters that are closest to each other are connected until all the clusters are connected into one.
  • The hierarchical clustering algorithm produces a binary tree, where the root of the tree includes all the data points and the leaves of the tree are the individual data points.
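
If we want to look at this binary tree directly, scipy's dendrogram function can plot it. This is an optional extra: scipy is not imported in Step 1, and the ward linkage here is an assumption for illustration.

# Optional: plot the hierarchical tree (dendrogram) with scipy
from scipy.cluster.hierarchy import linkage, dendrogram

linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()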

The Python code implementation of the hierarchical clustering model is similar to the KMeans clustering model; we just need to change the method from KMeans to AgglomerativeClustering.

# Hierarchical clustering model
hc = AgglomerativeClustering(n_clusters = 3)

# Fit and predict on the data
y_hc = hc.fit_predict(X)

# Save the predictions as a column
df['y_hc']=y_hc

# Check the distribution
df['y_hc'].value_counts()

Output

0    64
1    50
2    36
Name: y_hc, dtype: int64

Hierarchical Model Pros

  • The hierarchical model can scale to large datasets or a large number of clusters [1].
  • It is flexible about the shape of the data.
  • It can use any pairwise distance, such as euclidean, manhattan, and cosine distance [3].

Hierarchical Model Cons

  • AgglomerativeClustering needs a connectivity matrix in order to scale to large datasets. Without connectivity constraints, the algorithm is computationally expensive because it considers all possible merges at each step [4].

Step 6: Gaussian Mixture Model (GMM) (Model 3)

Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data points are generated from a mixture of Gaussian distributions, and each data point belongs to one of them. It is fitted with the expectation-maximization (EM) algorithm.

  • In the expectation step, the algorithm estimates the probability of each data point belonging to each cluster.
  • In the maximization step, each cluster's parameters are updated using all the data points, weighted by their estimated probabilities of belonging to that cluster.
  • The updates of a cluster are therefore driven mostly by the data points with high probabilities of belonging to that cluster.

The Python code implementation of the Gaussian Mixture Model (GMM) is similar to the KMeans clustering model; we just need to change the method from KMeans to GaussianMixture.

One difference is that we changed n_init from its default value of 1 to 5. n_init is the number of initializations to perform. Setting it to 5 means that 5 initializations of the model will be run, and the one with the best result is kept.

# Fit the GMM model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)

# Fit and predict on the data
y_gmm = gmm.fit_predict(X)

# Save the prediction as a column
df['y_gmm']=y_gmm

# Check the distribution
df['y_gmm'].value_counts()

Output

2    55
1    50
0    45
Name: y_gmm, dtype: int64
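
Because GMM produces soft (probabilistic) assignments, we can also inspect each record's probability of belonging to each cluster with predict_proba on the fitted model, for example:

# Inspect the soft cluster assignments (probabilities) for the first 5 records
probs = gmm.predict_proba(X)
print(probs[:5].round(3))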

Gaussian Mixture Model (GMM) Pros

  • It’s a generative model, so we can generate new data for the clusters based on the distributions.
  • It’s a probabilistic model, and works very well on ellipsoidal-shaped data [1].
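
For example, because GMM is generative, we can draw new synthetic data points from the fitted mixture with the sample method (a small optional sketch):

# Generate 5 new synthetic data points from the fitted mixture
new_points, new_clusters = gmm.sample(n_samples=5)
print(new_points)
print(new_clusters)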

Gaussian Mixture Model (GMM) Cons

  • It may converge to a suboptimal solution because of the initial values, so we set n_init to run the model multiple times and keep the best result [1].
  • It does not work well on datasets whose clusters are not ellipsoidal in shape.

Step 7: Density-based spatial clustering of applications with noise (DBSCAN) (Model 4)

DBSCAN defines clusters using data density. It has two important hyperparameters to tune, eps and min_samples.

  • eps is the epsilon distance that defines the neighborhood of a data point. It is the most important parameter for DBSCAN [6].
  • min_samples is the minimum number of data points in the neighborhood for a data point to be considered a core data point. This number includes the data point itself [6].
  • All data points in the neighborhood of the core data points belong to the same cluster.
  • The data points that are not core data points and do not have a core data point in their neighborhood are considered outliers. The label -1 in the prediction results represents outliers. To learn more about anomaly detection, please check out my previous tutorials on One-Class SVM For Anomaly Detection, Isolation Forest For Anomaly Detection, and Time Series Anomaly Detection Using Prophet in Python.

DBSCAN does not take a pre-defined number of clusters; it identifies the number of clusters based on the density distribution of the dataset. In this example, DBSCAN identified two clusters, but it could not separate the two flower types that overlap with each other.

# Fit the DBSCAN model
dbscan = DBSCAN(eps=0.8, min_samples=5) 

# Fit and predict on the data
y_dbscan = dbscan.fit_predict(X)

# Save the prediction as a column
df['y_dbscan'] = y_dbscan

# Check the distribution
df['y_dbscan'].value_counts()

Output

 1    98
 0    50
-1     2
Name: y_dbscan, dtype: int64
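
To see how many core points DBSCAN found and which records were flagged as outliers, we can optionally inspect the fitted model and the dataframe:

# Number of core data points found by DBSCAN
print('Number of core points:', len(dbscan.core_sample_indices_))

# Records labeled -1 are treated as outliers
print(df[df['y_dbscan'] == -1])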

DBSCAN Pros

  • It works on datasets of any shape.
  • It identifies anomalies automatically.

DBSCAN Cons

  • It does not work well for identifying the clusters that are not well separated.
  • Different clusters in the dataset need to have similar densities; otherwise, DBSCAN does not perform well.

Step 8: Dimensionality Reduction

In step 8, we will use two popular algorithms, PCA (Principal Component Analysis) and t-SNE (t-distributed stochastic neighbor embedding), to reduce the dimensionality of the dataset for visualization. There are four features in the dataset, so we need to project them from a 4-dimensional space onto a 2-dimensional space. The outputs from PCA and t-SNE are saved in the dataframe as columns.

# PCA with 2 components
pca=PCA(n_components=2).fit_transform(X)

# Create columns for the 2 PCA components
df['PCA1'] = pca[:, 0]
df['PCA2'] = pca[:, 1]

# TSNE with 2 components
tsne=TSNE(n_components=2).fit_transform(X)

# Create columns for the 2 TSNE components
df['TSNE1'] = tsne[:, 0]
df['TSNE2'] = tsne[:, 1]

# Take a look at the data
df.head()
Dimensionality Reduction using PCA and t-SNE – GrabNGoInfo.com
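
To check how much of the original variance the two PCA components retain, we can optionally fit the PCA object separately and inspect explained_variance_ratio_ (a small extra check, not part of the original walkthrough):

# Optional check: how much variance do the 2 PCA components capture?
pca_model = PCA(n_components=2).fit(X)
print('Explained variance ratio:', pca_model.explained_variance_ratio_)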

Step 9: Visual Comparison of Models

After dimensionality reduction, in step 9, we will visualize the clustering results of each model and compare them with the ground truth. Note that in real-world projects, the ground truth is usually not available for clustering models. The comparison here is only for illustrating the differences between the algorithms.

Before visualization, we need to align the labels of the model outputs. Since cluster labels are assigned arbitrarily, the same label does not represent the same flower type across models, so we rename the labels so that the same label refers to the same flower type in every model.

# Check label mapping
df.groupby(['target', 'y_kmeans']).size().reset_index(name='counts')
Clustering model label mapping – GrabNGoInfo.com

For example, we can see that the kmeans predicted label 1 corresponds to the true label 0, the kmeans predicted label 0 corresponds to the true label 1, and the kmeans predicted label 2 corresponds to the true label 2. So the labels are renamed to be consistent with the true label.

The same process is applied to the outputs of the hierarchical model, GMM, and DBSCAN as well.
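
If we prefer not to read the mapping off the tables manually, the relabeling could also be automated, for example by matching each predicted label to the true label it overlaps with most. The auto_map helper below is a hypothetical sketch and not part of the original tutorial; the manual mapping that follows is what we actually use.

# Optional alternative: derive the label mapping automatically by matching each
# predicted label to the true label it overlaps with most
def auto_map(pred, truth):
    # Note: this simple majority match can collide when clusters overlap heavily
    return {label: truth[pred == label].mode()[0] for label in pred.unique()}

# Hypothetical usage, equivalent to the manual maps below:
# df['y_kmeans'] = df['y_kmeans'].map(auto_map(df['y_kmeans'], df['target']))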

# Rename labels
df['y_kmeans'] = df['y_kmeans'].map({1: 0, 0: 1, 2: 2})

# Check label mapping
df.groupby(['target', 'y_hc']).size().reset_index(name='counts')

# Rename labels
df['y_hc'] = df['y_hc'].map({1: 0, 0: 1, 2: 2})

# Check label mapping
df.groupby(['target', 'y_gmm']).size().reset_index(name='counts')

# Rename labels
df['y_gmm'] = df['y_gmm'].map({1: 0, 0: 1, 2: 2})

# Check label mapping
df.groupby(['target', 'y_dbscan']).size().reset_index(name='counts')

# Rename labels
df['y_dbscan'] = df['y_dbscan'].map({0: 0, -1: 2, 1: 1})

After relabeling the model predictions, let’s visualize the data using PCA first.

In the visualization, there are five charts. The first chart is the ground truth, the second chart is the KMeans prediction, the third chart is the hierarchical model prediction, the fourth chart is the GMM prediction, and the fifth chart is the DBSCAN prediction.

We can see that most of the models are able to accurately predict label 0, because it is well separated from the other data points, but GMM did the best job separating label 1 and label 2.

DBSCAN identified two clusters and two data points as outliers.
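
Besides the visual comparison, we can optionally quantify how closely each model's clusters match the ground truth with the adjusted Rand index from sklearn.metrics (an extra check beyond the original walkthrough; the metric is insensitive to how labels are numbered, so it can be computed before or after the relabeling):

# Optional: quantify agreement with the ground truth using the adjusted Rand index
from sklearn.metrics import adjusted_rand_score

for col in ['y_kmeans', 'y_hc', 'y_gmm', 'y_dbscan']:
    print(col, round(adjusted_rand_score(df['target'], df[col]), 3))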

# Visualization using PCA
fig, axs = plt.subplots(ncols=5, sharey=True, figsize=(20,12))
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='target', ax=axs[0]).set(title='Ground Truth')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_kmeans', ax=axs[1]).set(title='KMeans')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_hc', ax=axs[2]).set(title='Hierarchical')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_gmm', ax=axs[3]).set(title='GMM')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_dbscan', ax=axs[4]).set(title='DBSCAN')
Clustering algorithms comparison PCA – GrabNGoInfo.com

t-SNE shows similar results with more condensed cluster visualization.

# Visualization using t-SNE
fig, axs = plt.subplots(ncols=5, sharey=True, figsize=(20,12))
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='target', ax=axs[0]).set(title='Ground Truth')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_kmeans', ax=axs[1]).set(title='KMeans')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_hc', ax=axs[2]).set(title='Hierarchical')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_gmm', ax=axs[3]).set(title='GMM')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_dbscan', ax=axs[4]).set(title='DBSCAN')
Clustering algorithms comparison t-SNE – GrabNGoInfo.com

Step 10: Utilize Clustering Model Results for Business

In step 10, we will talk about how to utilize the clustering model predictions in a business environment.

One of the most common use cases for clustering models is customer segmentation. In this example, we will talk about customer segmentation using the clustering model results.

Let’s imagine each flower is a customer and the four features are customer age, number of children in a household, tenure with the brand, and distance to the nearest store. We can take the three steps below to do customer segmentation:

  • The first step is to do customer profiling for each cluster.
  • The second step is to understand the persona of each cluster based on the profiling results.
  • The third step is to create personalized strategies for different customer segments.

From the visualization, we can see that two of the four features clearly differentiate the three clusters, so we can build personas around those two features and create personalized marketing strategies.

# Feature list
varList = ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
# Calculate average values by predicted cluster
avg = pd.DataFrame(df.groupby(['y_gmm'])[varList].mean().reset_index())

# Visualize the average values by cluster for each feature
fig, axs = plt.subplots(nrows=2, ncols=2, sharey=False, figsize=(12,12))
sns.barplot(x='y_gmm', y=varList[0], data=avg, ax=axs[0,0])
sns.barplot(x='y_gmm', y=varList[1], data=avg, ax=axs[0,1])
sns.barplot(x='y_gmm', y=varList[2], data=avg, ax=axs[1,0])
sns.barplot(x='y_gmm', y=varList[3], data=avg, ax=axs[1,1])
Utilize Clustering Model Results for Business – GrabNGoInfo.com

Step 11: Which Model to Use?

Now that we have learned how to build different clustering models, you may wonder which one to use for your specific project. I created a diagram to illustrate how to choose clustering model algorithms based on cluster shapes and densities. Please cite this tutorial when using the diagram.

  • If the clusters in the dataset are ellipsoidal in shape but have different densities, we can use GMM or the hierarchical model.
  • Any of the four clustering algorithms works well on a dataset with ellipsoidal-shaped clusters of similar density.
  • For non-ellipsoidal-shaped clusters, we can only choose between DBSCAN and the hierarchical model, and DBSCAN does not work well when the clusters have different densities.
  • The hierarchical clustering model is the most flexible; it can be used for datasets of any shape and density.
  • In addition to data shape and density, if there is a need to generate new data points for clusters, we need to use GMM because GMM is a generative model.
Choosing clustering model algorithms – GrabNGoInfo.com

Put All Code Together

#---------------------------------------------
# Step 1: Import Libraries
#---------------------------------------------

# Data processing 
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
from sklearn import datasets

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Modeling
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

#---------------------------------------------
# Step 2: Read Data
#---------------------------------------------

# Load data
iris = datasets.load_iris()

# Show data information
iris.keys()

# Print feature and target information
print('The feature names are:', iris['feature_names'])
print('The target names are:', iris['target_names'])
print('The target values are:', iris['target'])

# Put features data into a dataframe
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add target to the dataframe 
df['target'] = iris.target

# Data information
df.info()

# Check counts of each category
df['target'].value_counts()

# Remove target for the clustering model
X = df[df.columns.difference(['target'])]

#---------------------------------------------
# Step 3: Decide the Number of Clusters
#---------------------------------------------

# please check out https://medium.com/grabngoinfo/5-ways-for-deciding-number-of-clusters-in-a-clustering-model-5db993ea5e09

#---------------------------------------------
# Step 4: Kmeans Clustering (Model 1)
#---------------------------------------------

# Kmeans model
kmeans = KMeans(n_clusters = 3,  random_state = 42)

# Fit and predict on the data
y_kmeans = kmeans.fit_predict(X)

# Save the predictions as a column
df['y_kmeans']=y_kmeans

# Check the distribution
df['y_kmeans'].value_counts()

#---------------------------------------------
# Step 5: Hierarchical Clustering (Model 2)
#---------------------------------------------

# Hierarchical clustering model
hc = AgglomerativeClustering(n_clusters = 3)

# Fit and predict on the data
y_hc = hc.fit_predict(X)

# Save the predictions as a column
df['y_hc']=y_hc

# Check the distribution
df['y_hc'].value_counts()

#---------------------------------------------
# Step 6: Gaussian Mixture Model (GMM) (Model 3)
#---------------------------------------------

# Fit the GMM model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)

# Fit and predict on the data
y_gmm = gmm.fit_predict(X)

# Save the prediction as a column
df['y_gmm']=y_gmm

# Check the distribution
df['y_gmm'].value_counts()

#---------------------------------------------
# Step 7: Density-based spatial clustering of applications with noise (DBSCAN) (Model 4)
#---------------------------------------------

# Fit the DBSCAN model
dbscan = DBSCAN(eps=0.8, min_samples=5) 

# Fit and predict on the data
y_dbscan = dbscan.fit_predict(X)

# Save the prediction as a column
df['y_dbscan'] = y_dbscan

# Check the distribution
df['y_dbscan'].value_counts()

#---------------------------------------------
# Step 8: Dimensionality Reduction
#---------------------------------------------

# PCA with 2 components
pca=PCA(n_components=2).fit_transform(X)

# Create columns for the 2 PCA components
df['PCA1'] = pca[:, 0]
df['PCA2'] = pca[:, 1]

# TSNE with 2 components
tsne=TSNE(n_components=2).fit_transform(X)

# Create columns for the 2 TSNE components
df['TSNE1'] = tsne[:, 0]
df['TSNE2'] = tsne[:, 1]

# Take a look at the data
df.head()

#---------------------------------------------
# Step 9: Visual Comparison of Models
#---------------------------------------------

# Check label mapping
df.groupby(['target', 'y_kmeans']).size().reset_index(name='counts')

# Rename labels
df['y_kmeans'] = df['y_kmeans'].map({1: 0, 0: 1, 2: 2})

# Check label mapping
df.groupby(['target', 'y_hc']).size().reset_index(name='counts')

# Rename labels
df['y_hc'] = df['y_hc'].map({1: 0, 0: 1, 2: 2})

# Check label mapping
df.groupby(['target', 'y_gmm']).size().reset_index(name='counts')

# Rename labels
df['y_gmm'] = df['y_gmm'].map({1: 0, 0: 1, 2: 2})

# Check label mapping
df.groupby(['target', 'y_dbscan']).size().reset_index(name='counts')

# Rename labels
df['y_dbscan'] = df['y_dbscan'].map({0: 0, -1: 2, 1: 1})

# Visualization using PCA
fig, axs = plt.subplots(ncols=5, sharey=True, figsize=(20,12))
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='target', ax=axs[0]).set(title='Ground Truth')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_kmeans', ax=axs[1]).set(title='KMeans')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_hc', ax=axs[2]).set(title='Hierarchical')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_gmm', ax=axs[3]).set(title='GMM')
sns.scatterplot(x='PCA1', y='PCA2', data=df, hue='y_dbscan', ax=axs[4]).set(title='DBSCAN')

# Visualization using t-SNE
fig, axs = plt.subplots(ncols=5, sharey=True, figsize=(20,12))
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='target', ax=axs[0]).set(title='Ground Truth')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_kmeans', ax=axs[1]).set(title='KMeans')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_hc', ax=axs[2]).set(title='Hierarchical')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_gmm', ax=axs[3]).set(title='GMM')
sns.scatterplot(x='TSNE1', y='TSNE2', data=df, hue='y_dbscan', ax=axs[4]).set(title='DBSCAN')

#---------------------------------------------
# Step 10: Utilize Clustering Model Results for Business
#---------------------------------------------

# Feature list
varList = ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
# Calculate average values by predicted cluster
avg = pd.DataFrame(df.groupby(['y_gmm'])[varList].mean().reset_index())

# Visualize the average values by cluster for each feature
fig, axs = plt.subplots(nrows=2, ncols=2, sharey=False, figsize=(12,12))
sns.barplot(x='y_gmm', y=varList[0], data=avg, ax=axs[0,0])
sns.barplot(x='y_gmm', y=varList[1], data=avg, ax=axs[0,1])
sns.barplot(x='y_gmm', y=varList[2], data=avg, ax=axs[1,0])
sns.barplot(x='y_gmm', y=varList[3], data=avg, ax=axs[1,1])

Summary

In this tutorial, we discussed four clustering model algorithms, compared their results, and talked about how to choose a clustering algorithm for a project. You learned:

  • What are the different types of clustering model algorithms?
  • How to run K-means, Gaussian Mixture Model (GMM), Hierarchical model, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) model in Python?
  • How to use PCA (Principal Component Analysis) and t-SNE (t-distributed stochastic neighbor embedding) for dimensionality reduction and visualization?
  • How to utilize clustering model results for businesses?
  • How to select a clustering model algorithm for your project?

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

References

[1] Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 2nd Edition

[2] sklearn documentation on KMeans

[3] sklearn documentation on AgglomerativeClustering

[4] sklearn user guide on hierarchical clustering

[5] sklearn user guide on Gaussian Mixture Model

[6] sklearn documentation on DBSCAN
