
K-Means Clustering Example Code Using Python Scikit Learn

K-means is a widely used unsupervised learning algorithm that groups similar data points into clusters. This article walks through a step-by-step example of building a k-means clustering model with the Python scikit-learn library.

Step 1: Import Libraries

# Import libraries for data processing
import numpy as np
import pandas as pd

# Import the library that contains the dataset for the example
from sklearn import datasets

# Import library for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import library for dimensionality reduction
from sklearn.decomposition import PCA

# Import library for the k-means model
from sklearn.cluster import KMeans

Step 2: Read In Data

We are using the iris dataset for this tutorial. This dataset contains 150 records and 4 features for 3 types of iris flowers. Sepal length, sepal width, petal length, and petal width are the features.

# Load the iris data from sklearn
iris = datasets.load_iris()

# Check the information that comes with the dataset
iris.keys()

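The code outputs the keys for the information contained in this dataset. The exact list depends on the scikit-learn version; newer releases include additional keys such as 'frame'.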

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Now let’s check which features are available. Other information can be checked in the same way.

print(iris['feature_names'])

The output shows that we have four features for this dataset.

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
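
As a small sketch of checking other fields in the same way, the snippet below prints the shape of the feature matrix and the names of the flower types. It assumes the standard iris dataset bundled with scikit-learn; the variable names n_records, n_features, and species are illustrative and not part of the tutorial code.

```python
from sklearn import datasets

# Load the iris data bundled with scikit-learn
iris = datasets.load_iris()

# Shape of the feature matrix: (number of records, number of features)
n_records, n_features = iris['data'].shape
print(n_records, n_features)

# Names of the flower types stored with the dataset
species = list(iris['target_names'])
print(species)
```

This confirms the 150 records, 4 features, and 3 flower types described above.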

Now let’s read the dataset into a pandas DataFrame and set the four feature names as the column headers. Note that we do not need to read in the target because k-means is an unsupervised model.

# Read the data into pandas DataFrame format
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Check the first 5 records
df.head()

To get more information about the dataset, we can use df.info() to check the total number of records, the data type of each variable, and the number of non-missing values.

# Check the information of the DataFrame
df.info()

The output shows that this DataFrame has 150 records, no missing values, and float64 values in every column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB

Step 3: K-Means Clustering Model

After reading the data, we run the k-means clustering model with 3 clusters. Why did we choose 3 clusters for this model? I will write a separate article to explain that.
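
Until then, one common heuristic for choosing the number of clusters is the elbow method: fit the model for several values of k and look for the point where the within-cluster sum of squares (exposed by scikit-learn as inertia_) stops dropping sharply. The sketch below illustrates that heuristic only; it is not necessarily the reasoning used for this tutorial. The range of k values and n_init=10 are arbitrary choices made for the example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Feature matrix of the iris dataset (150 records, 4 features)
X_iris = datasets.load_iris().data

# Fit k-means for k = 1..6 and record the inertia of each fit
k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_iris)
    inertias.append(model.inertia_)

# Print k next to its inertia; the "elbow" is where the drop levels off
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))
```

Plotting k_values against inertias with matplotlib makes the elbow easier to spot than reading the printed numbers.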

The random_state can be any integer. It fixes the seed of the random number generator used to initialize the centroids, so the model produces the same output each time it is run.

# Modeling dataset X
X = df.copy()

# Run k-means model
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Save the prediction as a column in the DataFrame
df['y_kmeans'] = y_kmeans

# Check the distribution of the predictions
df['y_kmeans'].value_counts()

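The output shows 62 records in group 0, 50 records in group 1, and 38 records in group 2. The group numbers are arbitrary labels, so the numbering can differ with another seed or library version.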

0    62
1    50
2    38
Name: y_kmeans, dtype: int64
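
The counts alone do not say which flowers ended up together. As an optional sanity check, the sketch below cross-tabulates the k-means labels against the known species. It uses the target only for evaluation, not for fitting. The names labels, species, and comparison are illustrative, and n_init=10 is an assumption made so the result does not depend on the library's default.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()

# Fit k-means on the four features only (the target is not used here)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Map the numeric target to species names for readability
species = pd.Series(iris.target_names[iris.target], name='species')

# Rows: true species, columns: k-means cluster label
comparison = pd.crosstab(species, pd.Series(labels, name='cluster'))
print(comparison)
```

Each row sums to 50, the number of records per species. A row spread across two columns points to a species that k-means did not separate cleanly from another.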

Step 4: Visualize K-Means Predictions

Since we have four features, we need to reduce the data to two dimensions to plot the results. We use Principal Component Analysis (PCA) for the dimensionality reduction in this example.

# Reduce the dimensions to 2
X_pca = PCA(n_components=2).fit_transform(X)
df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]

# Visualize the k-means clustering results
sns.set(rc={'figure.figsize':(12,8)})
sns.scatterplot(data=df, x="PCA1", y="PCA2", hue="y_kmeans")
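
A two-dimensional plot is only as faithful as the projection behind it. The sketch below checks how much of the total variance the two principal components retain, using the explained_variance_ratio_ attribute of a fitted PCA object, and projects the k-means cluster centers into the same 2-D space. The variable names are illustrative, and n_init=10 is an assumption rather than part of the tutorial code.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_iris = datasets.load_iris().data

# Keep the fitted PCA object so its attributes can be inspected
pca_model = PCA(n_components=2).fit(X_iris)

# Fraction of total variance captured by each component
variance_ratio = pca_model.explained_variance_ratio_
print(variance_ratio, variance_ratio.sum())

# Project the k-means cluster centers into the same 2-D PCA space
kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_iris)
centers_2d = pca_model.transform(kmeans_model.cluster_centers_)
print(centers_2d)
```

For the iris data the two components retain most of the variance, so the scatter plot gives a reasonable picture of the clusters. The rows of centers_2d can be overlaid on the plot as the cluster centers.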

