K-Means Algorithm Python Implementation

In our previous post, we’ve discussed about Clustering algorithms and implementation of KNN in python. In this post, we’ll be discussing about K-means algorithm and it’s implementation in python.

K-Means Algorithm

K-Means algorithm is one of the simplest and popular unsupervised learning algorithm. The main objective of this algorithm is to find clusters or groups in the data where the number of groups is specified by using a hyperparameter “k“. The algorithm iteratively assigns each data point to one of the K clusters. Data points are assigned to the class based on the similarity measure. As discussed in the previous post, similarity can be measured by using Minkowoski distance , Euclidean distance or any other measure.In this blog, we use Euclidean distance to find the similarity

How K-Means work?

K-Means is the one of the simplest clustering algorithm. Let’s assume that we have data points x1,x2,……,xn and “k” the number of clusters.

Select K points as initial centroids from the dataset. It can be done either randomly or first k data points.
Find the similarity measure between the data point and the identified k-cluster centroids.
Assign each data point to a cluster based on the similarity computed in the previous step.
Calculate the new centroid by taking the average of all data point belonging to that particular cluster.
Repeat the above steps until the centroids remain constant

Choosing the right K

When we are building a model and when we know how many cluster are required, then it’s fine. But, it’s really problamatic when we don’t know about the number of clusters. So, at that time we want to figure out the approximate “k” value. In order to achieve that we use a method called “Elbow method“.

Elbow Method

The method plots the various value of costs with varying value of “k“. As the value of “k” increases the elements in the clusters decrease gradually. The lesser the number of elements means closer to the centroids. The point at which the distortion declines is the optimal “k” value.

We can see in the above plot, 3 is the optimal number of clusters for the dataset.

Implementation of K-Means in Python

Now, we’ve entered into the most interesting part of our blog. Let’s build our own K-Means classifier using Python.

Before writing all the math functions required let’s import every module that we use in this post.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

Creation of K-Means class

Let’s first create a class that helps us to perform all the operations required. Let’s first use __init()__ constructor to initialize the model with a default value of “k” cluster to 8 and the max_iter i.e, maximum iterations to take when the centroids do not coincide to 100.

class KMeans:
    
    def __init__(self ,k = 8, max_iter = 100 ):
        self.k = k
        self.max_iter = max_iter
        print("Initalized k with :",k)

Implementing the similarity function

As discussed earlier we can use any distance metric in order to find the similarity. For now, we’ll build our classifier with Euclidean distance metric.

def euclidDistance(self , x1, x2):
    return (np.sqrt(np.sum(np.square(x1 - x2) , axis = 1)))

The above code performs the following :

Takes the centriod vector and data point vector performs subtraction between the vectors
Squares the resultant vector
As it is a k*n dimensional vector, we sum all the squares of the columns along row and apply square root which forms a k*1 dimensional vector.

Fitting the data

The fit function takes the data and forms clusters based on the similarity. The fit function performs the following tasks

Initialize the “k” centroids with first “k” points from the dataset.

def fit(self, data):
    self.centroids = []
        
    #initialize the centroids, the first 'k' elements in the dataset will be our initial centroids

    for i in range(self.k):
        self.centroids.append(data.iloc[i].to_numpy())

Finds the similarity measure between the data point and the identified k-cluster centroids. Assigns each data point to a cluster based on the similarity.

self.classes = {}
for cluster in range(self.k):
      self.classes[cluster] = []

      #find the distance between the point and cluster; choose the nearest centroid
      for point in range(len(data)):
           distances = self.euclidDistance(self.centroids, data.iloc[point].to_numpy())
           classification = np.argmin(distances)
           self.classes[classification].append(data.iloc[point])

Calculate the new centroid by taking the average of all data point belonging to that particular cluster and compare the new centroid with new centroid and check for the co-incidence.

previous = np.array(self.centroids)
#average the cluster datapoints to re-calculate the centroids
for classification in self.classes:
       self.centroids[classification] = np.average(self.classes[classification], axis = 0)
            
optimal = True
curr = np.array(self.centroids)
            
#difference in the cluster centers of two consecutive iterations to declare convergence.
if np.sum((curr - previous)/previous * 100.0) > 0.0001:
      optimal = False

          
#break out of the main loop if the results are optimal, ie. the centroids don't change their positions much(more than our tolerance)
if optimal:
      break

Since we need to run the algorithm until the centroids converge or until the maximum iterations we define the fit function as follows

def fit(self, data):
    self.centroids = []
    
    #initialize the centroids, the first 'k' elements in the dataset will be our initial centroids
    for i in range(self.k):
        self.centroids.append(data.iloc[i].to_numpy())
        
    for itr in range(self.max_iter):
        self.classes = {}
        for cluster in range(self.k):
            self.classes[cluster] = []

        #find the distance between the point and cluster; choose the nearest centroid
        for point in range(len(data)):
            distances = self.euclidDistance(self.centroids, data.iloc[point].to_numpy())
            classification = np.argmin(distances)
            self.classes[classification].append(data.iloc[point])

        previous = np.array(self.centroids)
        #average the cluster datapoints to re-calculate the centroids
        for classification in self.classes:
            self.centroids[classification] = np.average(self.classes[classification], axis = 0)
        
        optimal = True
        curr = np.array(self.centroids)
        
        #difference in the cluster centers of two consecutive iterations to declare convergence.
        if np.sum((curr - previous)/previous * 100.0) > 0.0001:
            optimal = False

Loading the data

In this example, we will use the iris dataset to perform the predictions using our K-Means classifier.

file = './iris.csv'
data = pd.read_csv(file)
data.head()

Now, let’s use our K-Means classifier to perform some predictions.

clf = KMeans(3)

Now, let’s visualize our dataset before we fit to the model. Since we have 4 features it is difficult to plot all the features. So, we consider first 2 features for plotting.

sb.scatterplot(X.iloc[:,0] , X.iloc[:,1] , hue = data.iloc[:,-1])

Let’s now fit our data to the model. Before fitting, we need to extract the features and fit to the model.

X = data.iloc[:,:-1]
clf.fit(X)

Now, let’s plot the cluster centroids obtained by fitting the data.

Now, let’s visualize the cluster centroids including with data points.

Plot 3 : Cluster centroids with data points

Note : For plot 2, the scales of the axis range from (4.75,2.7) to (7,3.5). Whereas, the scales for the plot 3 range from (4.0,1.75) to (8.25, 4.55).

Woooow! Our classifier forms cluster very similar to the actual plot.Yeah, we have built our own K-Means algorithm. You can download the whole code from our github repo.

Conclusion

In this post, we’ve discussed about K-means algorithm and it’s implementation in python. In the next post, we’ll be discussing about Hierarchical clustering.Until then, Stay home , Stay safe. Cheers✌️. Follow our blog’s Facebook page at fb.com/HelloWorldFB.

One response to “K-Means Algorithm Python Implementation”

Daisey Prejsnar says:

June 30, 2020 at 11:50 am

I’m amazed, I must say. Seldom do I come across a blog that’s equally educative and entertaining, and without a doubt, you have hit the nail on the head. The problem is something which not enough folks are speaking intelligently about. I am very happy I found this in my search for something regarding this.

LikeLike

Reply

Hello World!