5 Ways for Deciding Number of Clusters in a Clustering Model

Welcome to GrabNGoInfo! Deciding the optimal number of clusters is a critical step in building an unsupervised clustering model. In this tutorial, we will talk about five ways to decide the number of clusters for a clustering model in Python. You will learn:

  • How to use the elbow method on the within cluster sum of squares to decide the number of clusters
  • How to use the Silhouette score to decide the number of clusters
  • How to use the hierarchical graph to find the optimal number of clusters visually
  • How to use AIC and BIC to choose the number of clusters
  • How to use gap statistics to decide the number of clusters

Resources for this post:

  • Python code is at the end of the post. Click here for the Colab notebook.
  • More video tutorials on unsupervised models
  • More blog posts on unsupervised models
  • If you prefer the video version of the tutorial, watch the video below on YouTube.
5 Ways for Deciding Number of Clusters in a Clustering Model – GrabNGoInfo.com

Let’s get started!

Step 1: Import Libraries

In the first step, we will import the Python libraries.

  • pandas and numpy are for data processing.
  • matplotlib and seaborn are for visualization.
  • datasets from the sklearn library contains some toy datasets. We will use the iris dataset to illustrate the different ways of deciding the number of clusters.
  • PCA is for dimensionality reduction.
  • KMeans, scipy's hierarchy module, and GaussianMixture cover three widely used clustering approaches.
  • silhouette_score is a metric to evaluate clustering model performance.
# Data processing 
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
from sklearn import datasets

# Dimensionality reduction
from sklearn.decomposition import PCA

# Modeling
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as hier
from sklearn.mixture import GaussianMixture

# Number of clusters
from sklearn.metrics import silhouette_score

Step 2: Read Data

Step 2 reads the data. We first load the data using load_iris(), which returns the dataset in a dictionary-like format. The keys show that the iris dataset includes the data, target, frame, target names, description of the dataset, feature names, filename, and data module.

# Load data
iris = datasets.load_iris()

# Show data information
iris.keys()

Output

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

Next, let’s print out feature_names, target_names, and target. We can see that there are four features: sepal length, sepal width, petal length, and petal width. The target contains three different flowers, setosa, versicolor, and virginica, which are encoded as 0, 1, and 2.

# Print feature and target information
print('The feature names are:', iris['feature_names'])
print('The target names are:', iris['target_names'])
print('The target values are:', iris['target'])

Output

The feature names are: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
The target names are: ['setosa' 'versicolor' 'virginica']
The target values are: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

In order to use the data for the clustering model, we need to convert the data into a dataframe format. Using .info(), we can see that the dataset has 150 records, and there are no missing values.

# Put features data into a dataframe
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add target to the dataframe 
df['target'] = iris.target

# Data information
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

Using value_counts(), we can see that there are 50 records for each type of flower.

# Check counts of each category
df['target'].value_counts()

Output

0    50
1    50
2    50
Name: target, dtype: int64

A clustering model is an unsupervised model, so we will not use the target information. Only the four features will be used, and the goal is to group the same type of flowers together. Therefore, a new dataframe called X is created that includes only the four features.

# Remove target for the clustering model
X = df[df.columns.difference(['target'])]

Step 3: Dimensionality Reduction and Visualization

In step 3, we will visualize the data to understand the ground truth clusters.

Since there are four features, we need to use a dimensionality reduction technique to convert the features from a 4-dimensional space to a 2-dimensional space for visualization.

Principal Component Analysis (PCA) is applied and the three types of flowers are visualized in a scatter plot. We can see that one type is quite different from the other two, while the other two types are quite similar to each other.

# Fit PCA with 2 components
pca = PCA(n_components=2).fit_transform(X)

# Add the two components as columns in the dataframe
df['PCA1'] = pca[:, 0]
df['PCA2'] = pca[:, 1]

# Visualize the data
sns.set(rc={'figure.figsize':(12,8)})
sns.scatterplot(data=df, x="PCA1", y="PCA2", hue="target")
Number of clusters for clustering model – GrabNGoInfo.com
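Before trusting the 2-dimensional picture, it can be worth checking how much of the original variance the two components retain. Below is a minimal sketch that refits PCA and prints the explained variance ratio; the variable name pca_model is illustrative.

# Check how much variance the two principal components capture
pca_model = PCA(n_components=2).fit(X)
print('Explained variance ratio per component:', pca_model.explained_variance_ratio_)
print('Total variance explained by 2 components:', pca_model.explained_variance_ratio_.sum())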

We know the ground truth for the number of clusters is three, but two can be a reasonable number as well because of the similarity of the two types of flowers.

Note that in a real-world project, there is usually no ground truth label available for the clustering model. We pick a dataset with labels for this tutorial mainly to evaluate different ways of deciding the number of clusters.

Step 4: Deciding Number of Clusters Using Elbow Method (Method 1)

In step 4, we will show how to use an elbow method to decide the number of optimal clusters.

The elbow method is the most widely used method for choosing the number of clusters. It runs clustering models for a range of cluster numbers and plots the within cluster sum of squares. The elbow on the plot marks the point where adding more clusters starts to give diminishing returns.

# Create an empty dictionary to save the within cluster sum of square values
wcss = {} 
# Loop through the number of clusters
for i in range(1,11):
  # Run kmeans model
  kmeans = KMeans(n_clusters=i, random_state=0).fit(X)
  #Sum of squared distances of samples to their closest cluster center.
  wcss[i] = (kmeans.inertia_)
  # Print the within cluster sum of squares for each cluster number
  print(f'The within cluster sum of squares for {i} clusters is {wcss[i]:.2f}')

Output

The within cluster sum of squares for 1 clusters is 681.37
The within cluster sum of squares for 2 clusters is 152.35
The within cluster sum of squares for 3 clusters is 78.85
The within cluster sum of squares for 4 clusters is 57.23
The within cluster sum of squares for 5 clusters is 46.47
The within cluster sum of squares for 6 clusters is 39.04
The within cluster sum of squares for 7 clusters is 34.30
The within cluster sum of squares for 8 clusters is 30.06
The within cluster sum of squares for 9 clusters is 28.27
The within cluster sum of squares for 10 clusters is 26.09
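The inertia_ attribute used above is the within cluster sum of squares: the sum of squared distances from each sample to the center of its assigned cluster. As a quick sanity check, we can recompute it by hand for one model (a minimal sketch; the variable names are illustrative).

# Recompute the within cluster sum of squares by hand for the 3-cluster model
kmeans3 = KMeans(n_clusters=3, random_state=0).fit(X)
labels = kmeans3.labels_
centers = kmeans3.cluster_centers_
# Sum of squared distances from each sample to its assigned cluster center
manual_wcss = ((X.to_numpy() - centers[labels]) ** 2).sum()
print('inertia_ from sklearn:', kmeans3.inertia_)
print('manual WCSS:', manual_wcss)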

From the visualization, we can see that the biggest decrease in the within cluster sum of squares happens when going from 1 to 2 clusters, and there is another relatively large decrease from 2 to 3 clusters. The decreases are small beyond 3 clusters. Therefore, the optimal number of clusters is probably 2 or 3.

# Visualization
plt.figure(figsize=(12,8))
plt.plot(list(wcss.keys()),list(wcss.values()))
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum Of Squares')
plt.show()
Number of clusters using the elbow method – GrabNGoInfo.com
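Eyeballing the elbow can be subjective. If we want a simple programmatic rule, one option (an illustration, not a standard criterion) is to keep adding clusters while each additional cluster still removes a sizable fraction of the remaining within cluster sum of squares, and stop at the first k where the relative drop falls below a threshold. Below is a minimal sketch using the wcss dictionary computed above, with an arbitrary 30% threshold.

# Pick the smallest k where adding one more cluster reduces WCSS by less than 30%
threshold = 0.30  # illustrative threshold, tune per project
elbow_k = None
for k in range(1, 10):
  relative_drop = (wcss[k] - wcss[k + 1]) / wcss[k]
  if relative_drop < threshold:
    elbow_k = k
    break
print('Suggested number of clusters from the relative-drop rule:', elbow_k)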

Step 5: Deciding Number of Clusters Using Silhouette Score (Method 2)

In step 5, we will talk about how to use the Silhouette score to decide the number of clusters.

The silhouette score measures how similar a data point is to its own cluster compared to the closest neighbor cluster. The silhouette ranges from −1 to +1.

  • The best value is 1, and a value close to 1 means that the sample’s distance to the nearest neighbor cluster is much larger than the intra-cluster distance.
  • A value close to 0 indicates overlapping clusters [2].
  • The worst value is -1, and a value close to -1 means that the sample is likely assigned to the wrong cluster [2].

Note that the silhouette coefficient is only defined if the number of clusters is at least 2 and at most the number of samples minus 1.

The Silhouette score of the model is the average Silhouette value of all samples.

# Create an empty dictionary for the Silhouette score
s_score = {} 
# Loop through the number of clusters
for i in range(2,11): # Note that the minimum number of clusters is 2
  # Fit kmeans clustering model for each cluster number
  kmeans = KMeans(n_clusters=i, random_state=0).fit(X)
  # Make prediction
  classes = kmeans.predict(X)
  # Calculate Silhouette score
  s_score[i] = (silhouette_score(X, classes)) 
  # Print the Silhouette score for each cluster number
  print(f'The silhouette score for {i} clusters is {s_score[i]:.3f}') 

Output

The silhouette score for 2 clusters is 0.681
The silhouette score for 3 clusters is 0.553
The silhouette score for 4 clusters is 0.498
The silhouette score for 5 clusters is 0.493
The silhouette score for 6 clusters is 0.365
The silhouette score for 7 clusters is 0.357
The silhouette score for 8 clusters is 0.362
The silhouette score for 9 clusters is 0.349
The silhouette score for 10 clusters is 0.331

From the visualization, we can see that the model with 2 clusters has the highest Silhouette score and the model with 3 clusters has the second highest, so we get a result consistent with the elbow method: 2 or 3 clusters.

# Visualization
plt.figure(figsize=(12,8))
plt.plot(list(s_score.keys()),list(s_score.values()))
plt.title('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Value')
plt.show()
Number of clusters using the silhouette score – GrabNGoInfo.com
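Besides the average score, it can be informative to look at the per-sample silhouette values for a candidate model, since samples with negative values are likely assigned to the wrong cluster. Below is a minimal sketch for the 3-cluster model using sklearn's silhouette_samples; the variable names are illustrative.

from sklearn.metrics import silhouette_samples

# Per-sample silhouette values for the 3-cluster model
kmeans3 = KMeans(n_clusters=3, random_state=0).fit(X)
sample_values = silhouette_samples(X, kmeans3.labels_)

# Average silhouette and count of poorly assigned samples per cluster
for cluster in range(3):
  cluster_values = sample_values[kmeans3.labels_ == cluster]
  print(f'Cluster {cluster}: mean silhouette {cluster_values.mean():.3f}, '
        f'{(cluster_values < 0).sum()} samples with negative values')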

Step 6: Deciding Number of Clusters Using Hierarchical Graph (Method 3)

In step 6, we will use the hierarchical graph to find the number of clusters.

The y-axis is the Euclidean distance, so the longer a vertical line is, the larger the distance between the clusters it joins.

From the graph, we can see that the two clusters connected by the blue line have the largest distance, and the two sub-clusters in red have a relatively large distance too. Therefore, the hierarchical graph suggests 2 or 3 clusters as well.

# Change figure size
plt.figure(figsize=(16,12))

# Fit the hierarchical clustering and plot the dendrogram
hierarchical_graph=hier.dendrogram(hier.linkage(X, method='ward')) #method='ward' uses the Ward variance minimization algorithm

# Add titles and labels
plt.title('Hierarchical Clustering Graph')
plt.xlabel('Flowers')
plt.ylabel('Euclidean Distance')
Number of clusters using the hierarchical graph – GrabNGoInfo.com
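Once the dendrogram suggests a number of clusters, we can get the actual cluster labels by cutting the linkage tree at that number. Below is a minimal sketch using scipy's fcluster; the 3-cluster cut is an illustration based on the graph above.

# Cut the hierarchical tree into at most 3 clusters and get a label per flower
linkage_matrix = hier.linkage(X, method='ward')
hier_labels = hier.fcluster(linkage_matrix, t=3, criterion='maxclust')
print('Cluster sizes:', pd.Series(hier_labels).value_counts().to_dict())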

Step 7: Deciding Number of Clusters Using AIC and BIC from GMM (Method 4)

In step 7, we will talk about how to decide the number of clusters using AIC and BIC values.

Both AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are metrics that trade off how well a model fits the data against model complexity, and they can be compared across models fit to the same data. The lower the value, the better the model.

The Gaussian Mixture Model AIC and BIC scores can help us decide the optimal number of clusters.

# Create empty dictionary for AIC and BIC values
aic_score = {} 
bic_score = {}

# Loop through different number of clusters
for i in range(1,11): 
  # Create Gaussian Mixture Model
  gmm = GaussianMixture(n_components=i, random_state=0).fit(X)
  # Get AIC score for the model
  aic_score[i] = gmm.aic(X)
  # Get BIC score for the model
  bic_score[i] = gmm.bic(X)

# Visualization
plt.figure(figsize=(12,8))
plt.plot(list(aic_score.keys()),list(aic_score.values()), label='AIC')
plt.plot(list(bic_score.keys()),list(bic_score.values()), label='BIC')
plt.legend(loc='best')
plt.title('AIC and BIC from GMM')
plt.xlabel('Number of Clusters')
plt.ylabel('AIC and BIC values')
plt.show()
Number of clusters using AIC and BIC – GrabNGoInfo.com

From the visualization, we can see that the AIC curve flattens out around 3 clusters. Although the AIC value for 10 clusters is slightly smaller, the difference between 3 and 10 clusters is quite small, so we choose the simpler model with 3 clusters.

BIC has the smallest value at the 2-cluster model, and the 3-cluster model has a similar value, suggesting that the optimal number of clusters is 2 or 3.
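To read the suggested number of clusters directly off the dictionaries instead of the plot, we can take the number of components with the lowest score. Keep in mind that the raw minimum can favor a more complex model (here AIC keeps decreasing slowly), so it should be read together with the plot. Below is a minimal sketch using the aic_score and bic_score dictionaries from above.

# Number of components with the lowest AIC and BIC values
best_aic_k = min(aic_score, key=aic_score.get)
best_bic_k = min(bic_score, key=bic_score.get)
print('Number of clusters with the lowest AIC:', best_aic_k)
print('Number of clusters with the lowest BIC:', best_bic_k)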

Step 8: Deciding Number of Clusters Using Gap Statistics (Method 5)

In step 8, we will talk about how to decide the optimal number of clusters using gap statistics. The gap statistic compares the within-cluster dispersion of the clustering with the dispersion expected under a uniform reference distribution [3]: for each k, the gap is the difference between the log of the average reference dispersion and the log of the observed dispersion. A large gap value means that the clustering structure is very different from what we would see in uniformly distributed data.

Anaconda.org has a notebook with an implementation of the gap statistic [1]. The code in this section is adapted from that notebook.

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates KMeans optimal K using Gap Statistic from Tibshirani, Walther, Hastie
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of sample reference datasets to create
        maxClusters: Maximum number of clusters to test for
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount':[], 'gap':[]})
    for gap_index, k in enumerate(range(1, maxClusters)):

        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)

        # For n references, generate random sample and perform kmeans getting resulting dispersion of each loop
        for i in range(nrefs):
            
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            
            refDisp = km.inertia_
            refDisps[i] = refDisp

        # Fit cluster to original data and create dispersion
        km = KMeans(k)
        km.fit(data)
        
        origDisp = km.inertia_

        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)

        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        
        resultsdf = pd.concat([resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})], ignore_index=True)

    return (gaps.argmax() + 1, resultsdf)  # Plus 1 because index of 0 means 1 cluster is optimal, index 2 = 3 clusters are optimal

The function outputs the optimal number of clusters based on the largest gap value, but we can see from the visualization that after 3 clusters the curve is relatively flat, showing diminishing gap increases. Therefore, the optimal number of clusters is 3 if we apply the elbow idea to the gap curve.

# Automatically output the number of clusters
k, gapdf = optimalK(X, nrefs=3, maxClusters=11)
print('Optimal k is: ', k)

# Visualization
plt.plot(gapdf.clusterCount, gapdf.gap, linewidth=3)
plt.scatter(gapdf[gapdf.clusterCount == k].clusterCount, gapdf[gapdf.clusterCount == k].gap, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()
Number of clusters using the gap statistics – GrabNGoInfo.com
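Instead of simply taking the k with the largest gap value, which can favor the largest k tested, we can apply the same elbow idea to the gap curve and look for the point where the gap stops increasing meaningfully. Below is a minimal sketch using the gapdf dataframe returned above; the 0.05 cutoff is an arbitrary illustration.

# Increase in gap value when going from k to k+1 clusters
gap_increase = gapdf['gap'].diff().shift(-1)

# First k where the next cluster adds less than 0.05 to the gap value
flat_points = gapdf.loc[gap_increase < 0.05, 'clusterCount']
if not flat_points.empty:
  print('Suggested number of clusters from the gap curve:', int(flat_points.iloc[0]))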

Step 9: Which method to use for the optimal number of clusters?

Now that we have learned five different ways of choosing the number of clusters, you might wonder which method to use for your project. Here is a general guideline:

  • Start with the elbow method for the within cluster sum of squares and see if there is a clear elbow.
  • If the elbow is not very clear, use the hierarchical graph to help decide the optimal number.
  • If the optimal number of clusters is still not obvious after using the elbow method and the hierarchical graph, use all five methods and choose the number of clusters that the majority of methods suggest, as sketched below.
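When the methods disagree, a simple way to combine them is a majority vote over the number of clusters each method suggests. Below is a minimal sketch; the suggested values in the dictionary are placeholders to fill in from your own results.

from collections import Counter

# Number of clusters suggested by each method (placeholder values)
suggestions = {'elbow': 3, 'silhouette': 2, 'hierarchical': 2,
               'aic_bic': 3, 'gap_statistic': 3}

# Most common suggestion across the five methods
vote_k, vote_count = Counter(suggestions.values()).most_common(1)[0]
print(f'{vote_count} of {len(suggestions)} methods suggest {vote_k} clusters')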

Put All Code Together

#-----------------------------------------------
# Step 1: Import Libraries
#-----------------------------------------------

# Data processing 
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
from sklearn import datasets

# Dimensionality reduction
from sklearn.decomposition import PCA

# Modeling
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as hier
from sklearn.mixture import GaussianMixture

# Number of clusters
from sklearn.metrics import silhouette_score

#-----------------------------------------------
# Step 2: Read Data
#-----------------------------------------------

# Load data
iris = datasets.load_iris()

# Show data information
iris.keys()

# Print feature and target information
print('The feature names are:', iris['feature_names'])
print('The target names are:', iris['target_names'])
print('The target values are:', iris['target'])

# Put features data into a dataframe
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add target to the dataframe 
df['target'] = iris.target

# Data information
df.info()

# Check counts of each category
df['target'].value_counts()

# Remove target for the clustering model
X = df[df.columns.difference(['target'])]

#-----------------------------------------------
# Step 3: Dimensionality Reduction and Visualization
#-----------------------------------------------

# Fit PCA with 2 components
pca = PCA(n_components=2).fit_transform(X)

# Add the two components as columns in the dataframe
df['PCA1'] = pca[:, 0]
df['PCA2'] = pca[:, 1]

# Visualize the data
sns.set(rc={'figure.figsize':(12,8)})
sns.scatterplot(data=df, x="PCA1", y="PCA2", hue="target")

#-----------------------------------------------
# Step 4: Deciding Number of Clusters Using Elbow Method (Method 1)
#-----------------------------------------------

# Create an empty dictionary to save the within cluster sum of square values
wcss = {} 
# Loop through the number of clusters
for i in range(1,11):
  # Run kmeans model
  kmeans = KMeans(n_clusters=i, random_state=0).fit(X)
  #Sum of squared distances of samples to their closest cluster center.
  wcss[i] = (kmeans.inertia_)
  # Print the within cluster sum of squares for each cluster number
  print(f'The within cluster sum of squares for {i} clusters is {wcss[i]:.2f}') 

# Visualization
plt.figure(figsize=(12,8))
plt.plot(list(wcss.keys()),list(wcss.values()))
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum Of Squares')
plt.show()

#-----------------------------------------------
# Step 5: Deciding Number of Clusters Using Silhouette Score (Method 2)
#-----------------------------------------------

# Create an empty dictionary for the Silhouette score
s_score = {} 
# Loop through the number of clusters
for i in range(2,11): # Note that the minimum number of clusters is 2
  # Fit kmeans clustering model for each cluster number
  kmeans = KMeans(n_clusters=i, random_state=0).fit(X)
  # Make prediction
  classes = kmeans.predict(X)
  # Calculate Silhouette score
  s_score[i] = (silhouette_score(X, classes)) 
  # Print the Silhouette score for each cluster number
  print(f'The silhouette score for {i} clusters is {s_score[i]:.3f}') 

# Visualization
plt.figure(figsize=(12,8))
plt.plot(list(s_score.keys()),list(s_score.values()))
plt.title('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Value')
plt.show()

#-----------------------------------------------
# Step 6: Deciding Number of Clusters Using Hierarchical Graph (Method 3)
#-----------------------------------------------

# Change figure size
plt.figure(figsize=(16,12))

# Fit the hierarchical clustering and plot the dendrogram
hierarchical_graph=hier.dendrogram(hier.linkage(X, method='ward')) #method='ward' uses the Ward variance minimization algorithm

# Add titles and labels
plt.title('Hierarchical Clustering Graph')
plt.xlabel('Flowers')
plt.ylabel('Euclidean Distance')

#-----------------------------------------------
# Step 7: Deciding Number of Clusters Using AIC and BIC from GMM (Method 4)
#-----------------------------------------------

# Create empty dictionary for AIC and BIC values
aic_score = {} 
bic_score = {}

# Loop through different number of clusters
for i in range(1,11): 
  # Create Gaussian Mixture Model
  gmm = GaussianMixture(n_components=i, random_state=0).fit(X)
  # Get AIC score for the model
  aic_score[i] = gmm.aic(X)
  # Get BIC score for the model
  bic_score[i] = gmm.bic(X)

# Visualization
plt.figure(figsize=(12,8))
plt.plot(list(aic_score.keys()),list(aic_score.values()), label='AIC')
plt.plot(list(bic_score.keys()),list(bic_score.values()), label='BIC')
plt.legend(loc='best')
plt.title('AIC and BIC from GMM')
plt.xlabel('Number of Clusters')
plt.ylabel('AIC and BIC values')
plt.show()

#-----------------------------------------------
# Step 8: Deciding Number of Clusters Using Gap Statistics (Method 5)
#-----------------------------------------------

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates KMeans optimal K using Gap Statistic from Tibshirani, Walther, Hastie
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of sample reference datasets to create
        maxClusters: Maximum number of clusters to test for
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount':[], 'gap':[]})
    for gap_index, k in enumerate(range(1, maxClusters)):

        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)

        # For n references, generate random sample and perform kmeans getting resulting dispersion of each loop
        for i in range(nrefs):
            
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            
            refDisp = km.inertia_
            refDisps[i] = refDisp

        # Fit cluster to original data and create dispersion
        km = KMeans(k)
        km.fit(data)
        
        origDisp = km.inertia_

        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)

        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        
        resultsdf = pd.concat([resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})], ignore_index=True)

    return (gaps.argmax() + 1, resultsdf)  # Plus 1 because index of 0 means 1 cluster is optimal, index 2 = 3 clusters are optimal

# Automatically output the number of clusters
k, gapdf = optimalK(X, nrefs=3, maxClusters=11)
print('Optimal k is: ', k)

# Visualization
plt.plot(gapdf.clusterCount, gapdf.gap, linewidth=3)
plt.scatter(gapdf[gapdf.clusterCount == k].clusterCount, gapdf[gapdf.clusterCount == k].gap, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

Summary

In this tutorial, we talked about five ways to decide the number of clusters for a clustering model in Python. You learned:

  • How to use the elbow method on the within cluster sum of squares to decide the number of clusters
  • How to use the Silhouette score to decide the number of clusters
  • How to use the hierarchical graph to find the optimal number of clusters visually
  • How to use AIC and BIC to choose the number of clusters
  • How to use gap statistics to decide the number of clusters

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

References

[1] Python implementation of the Gap Statistic

[2] sklearn silhouette score documentation

[3] Gap Statistics paper

2 thoughts on “5 Ways for Deciding Number of Clusters in a Clustering Model”

  1. Michael Schreiber

    Thanks a lot for your enriching and informative article. I really appreciate it. I’ve tested your code and it works well. But the result for the optimal number of clusters k is 8. Isn’t that a bit too much? Thanks for your answer in advance.

    1. Hi Michael,

      Thank you for your comments. I am glad that the code worked for your project. Whether the cluster number k = 8 is too large depends on the specific project. If the dataset is large and it makes sense to have 8 clusters based on business knowledge, then 8 can be the optimal number of clusters. For example, if we would like to cluster 100k Amazon reviews into topics, 8 can be a reasonable cluster number.

      I hope it helps.

      Best regards,
      Amy
