K-Nearest Neighbors
In our previous post, we discussed the Support Vector Machine and its implementation in Python. In this post, we’ll discuss clustering algorithms and the implementation of the KNN algorithm.
Clustering
Clustering is the task of grouping data into several groups such that the objects or data points in one group are more similar to each other than to those in other groups. It can be understood simply as collecting objects on the basis of the similarity and dissimilarity between them.

Clustering Algorithms
There are many clustering algorithms. Many of them use similarity measures among the data points in order to form clusters. Some clustering algorithms require you to guess the number of clusters to discover in the data, whereas others require a minimum distance between observations within which examples may be considered “close” or “connected.”
Some commonly used clustering algorithms are listed below
- K-Nearest Neighbors
- K-means
- Hierarchical clustering
- DBSCAN
We’ll discuss each of these algorithms in detail. So, let’s begin.
K-Nearest Neighbors Algorithm

K-Nearest Neighbors (KNN) is one of the simplest and most essential classification algorithms. Although KNN is a supervised method rather than a clustering algorithm in the strict sense, it is built on the same idea: similar things exist close to each other.
How is similarity measured?
In order to find the similarity between two data points, we use a distance metric. The distance metric can be tuned to get an optimal result. Some of the distance metrics commonly used in KNN are the following.
Euclidean distance
Euclidean distance is the most common distance metric used for calculating the distance between numeric data points. The formula for Euclidean distance is as follows:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
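As a quick, minimal numpy sketch of the same computation (the two points are made-up example values):

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # example point
b = np.array([4.0, 6.0, 3.0])  # example point

# Square the element-wise differences, sum them, and take the square root
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
# Sanity check: agrees with numpy's built-in norm
assert np.isclose(dist, np.linalg.norm(a - b))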
Minkowski distance
Minkowski distance is the generalized distance metric. The formula for Minkowski distance is as follows:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
We can change the value of “p” to get different distance metrics (see the sketch after this list).
- If p = 1, then we get “Manhattan distance”
- If p = 2, then we get “Euclidean distance”
- If p = ∞, then we get “Chebyshev distance”
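Here’s a minimal numpy sketch of the Minkowski distance for the three cases above (the points are made-up example values; p = ∞ is treated as the limiting case, i.e. the maximum absolute difference):

import numpy as np

def minkowski(a, b, p):
    # p = np.inf yields the Chebyshev distance (the limit as p grows)
    if np.isinf(p):
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, 1))       # 7.0 -> Manhattan distance
print(minkowski(a, b, 2))       # 5.0 -> Euclidean distance
print(minkowski(a, b, np.inf))  # 4.0 -> Chebyshev distance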
How does KNN work?
- Load the data
- Initialize k, where k is the number of neighbors to consider when finding proximity.
- For each point in the data, calculate the distance between that point and the query point.
- Sort the distances and select the first k points.
- Get the labels of the first k points and return the most frequent label.
Implementation of KNN in Python
Now we’ve reached the most interesting part of this post. Let’s build our own KNN classifier using Python.

Before writing all the math functions required, let’s import every module we’re going to use in this post.
import numpy as np
import pandas as pd
from collections import Counter
Creation of KNN class
Let’s first create a class that helps us perform all the required operations. We’ll use the __init__() constructor to initialize the model with a default value of k = 10.
class KNN:
    # Initializing k with a default value
    def __init__(self, k=10):
        self.k = k
        print("Initialized k with:", k)
Implementing Euclidean distance function
As discussed earlier, we can use any distance metric to measure similarity. For now, we’ll build our classifier with the Euclidean distance metric.
    # Finding the distance between 2 data points to measure their similarity
    def euclidDistance(self, x1, x2):
        return np.sqrt(np.sum(np.square(x1 - x2)))
The above code performs the following:
- Takes both vectors and subtracts one from the other
- Squares the resulting vector element-wise
- Since the result is an n-columned vector, sums the squared columns and applies the square root to get the distance
Prediction using the model
The predict function takes the training data, the labels, and a test point, and finds the class to which the test point belongs. It performs the following operations:
- For each point in the training data, calculate the distance between that point and the query point.
    # Finding the nearest neighbors
    def predict(self, trainData, labels, testPoint):
        distances = {}
        # Calculating the distance from the test point to each point in the dataset
        for i in range(len(trainData)):
            distances[i] = self.euclidDistance(trainData.iloc[i], testPoint)
- Sort the distances and select the first k points. After getting the indexes of the k nearest points, we get the labels of those points.
        # Selecting the "k" nearest points
        neighbors = sorted(distances.items(), key=lambda kv: (kv[1], kv[0]))[:self.k]
        neighborLabels = []
        # Getting the labels of the "k" nearest points
        for neigh in neighbors:
            neighborLabels.append(labels[neigh[0]])
- Get the labels of the first k points and return the most frequent label.
        # Counting the label occurrences and returning the most frequent one
        outputClass = Counter(neighborLabels)
        print("The given data point belongs to:", outputClass.most_common(1)[0][0])
Loading the data
In this example, we’ll use the iris dataset to make predictions with our KNN classifier.
file = './iris.csv'
trainData = pd.read_csv(file)
trainData.head()

Now, let’s use our KNN classifier to perform some predictions.
model = KNN(15)

Now, let’s split our data into features and target and make a prediction for a test data point.
dataPoint = [1, 5, 2, 4]
features = trainData.iloc[:, :-1]
target = trainData.iloc[:, -1]
model.predict(features, target, dataPoint)
Our model takes the data point, calculates the distances, and outputs the most frequent label observed among the selected “k” nearest points.
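If you want a sanity check, you could compare the result with scikit-learn’s KNeighborsClassifier (this assumes scikit-learn is installed; it isn’t used anywhere else in this post):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=15)  # same k as our model
clf.fit(features, target)
print(clf.predict([dataPoint]))  # should agree with our classifier (ties aside)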

Yeah, we have built a KNN classifier with the Euclidean distance metric. You can download the whole code from our github repo.
Conclusion
In this post, we discussed clustering algorithms and the implementation of KNN in Python. In the next post, we’ll discuss the K-means algorithm. Until then, stay home, stay safe. Cheers✌️. Follow our blog’s Facebook page at fb.com/HelloWorldFB.