K-Nearest Neighbors
In our previous post, we discussed the Support Vector Machine and its implementation in Python. In this post, we’ll discuss clustering algorithms and the implementation of the KNN algorithm.
Clustering
Clustering is the task of grouping data into several groups such that the objects or data points in one group are more similar to each other than to those in other groups. It can be understood simply as collecting objects on the basis of the similarity and dissimilarity between them.

Clustering Algorithms
There are many clustering algorithms. Many of them use similarity measures among the data points in order to form clusters. Some clustering algorithms require you to guess the number of clusters to discover in the data, whereas others require a minimum distance between observations within which examples may be considered “close” or “connected.”
Some commonly used clustering algorithms are listed below
- K-Nearest Neighbors
- K-means
- Hierarchical clustering
- DBSCAN
We’ll discuss each of these algorithms in detail. So, let’s begin.
K-Nearest Neighbors Algorithm

K-Nearest Neighbors (KNN) is one of the simplest and most essential classification algorithms. Although KNN is a supervised method rather than a clustering algorithm in the strict sense, it is built on the same idea: similar things exist close to each other.
How is similarity measured?
In order to find the similarity between two data points, we use a distance metric. The distance metric can be tuned to get an optimal result. Some of the distance metrics commonly used in KNN are the following.
Euclidean distance
Euclidean distance is the most common distance metric used for calculating the distance between numeric data points. The formula for Euclidean distance is as follows:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
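As a quick, minimal numpy sketch of the same computation (the two points are made-up example values):

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # example point
b = np.array([4.0, 6.0, 3.0])  # example point

# Square the element-wise differences, sum them, and take the square root
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
# Sanity check: agrees with numpy's built-in norm
assert np.isclose(dist, np.linalg.norm(a - b))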
Minkowski distance
Minkowski distance is the generalized distance metric. The formula for Minkowski distance is as follows:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
We can change the value of “p” to get different distance metrics (see the sketch after this list).
- If p = 1, then we get “Manhattan distance”
- If p = 2, then we get “Euclidean distance”
- If p = ∞, then we get “Chebyshev distance”
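Here’s a minimal numpy sketch of the Minkowski distance for the three cases above (the points are made-up example values; p = ∞ is treated as the limiting case, i.e. the maximum absolute difference):

import numpy as np

def minkowski(a, b, p):
    # p = np.inf yields the Chebyshev distance (the limit as p grows)
    if np.isinf(p):
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, 1))       # 7.0 -> Manhattan distance
print(minkowski(a, b, 2))       # 5.0 -> Euclidean distance
print(minkowski(a, b, np.inf))  # 4.0 -> Chebyshev distance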
How does KNN work?
- Load the data
- Initialize k, where k is the number of neighbors to consider when finding proximity.
- For each point in the data, calculate the distance between that point and the query point.
- Sort the distances and select the first k points.
- Get the labels of the first k points and return the most frequent label.
Implementation of KNN in Python
Now we’ve reached the most interesting part of this post. Let’s build our own KNN classifier using Python.

Before writing all the math functions required, let’s import every module we’re going to use in this post.
import numpy as np
import pandas as pd
from collections import Counter
Creation of KNN class
Let’s first create a class that helps us perform all the required operations. We’ll use the __init__() constructor to initialize the model with a default value of k = 10.
class KNN:
    # Initializing k with a default value
    def __init__(self, k=10):
        self.k = k
        print("Initialized k with:", k)
Implementing Euclidean distance function
As discussed earlier, we can use any distance metric to measure similarity. For now, we’ll build our classifier with the Euclidean distance metric.
    # Finding the distance between 2 data points to measure their similarity
    def euclidDistance(self, x1, x2):
        return np.sqrt(np.sum(np.square(x1 - x2)))
The above code performs the following:
- Takes both vectors and subtracts one from the other
- Squares the resulting vector element-wise
- Since the result is an n-columned vector, sums the squared columns and applies the square root to get the distance
Prediction using the model
The predict function takes the training data, the labels, and a test point, and finds the class to which the test point belongs. It performs the following operations:
- For each point in the training data, calculate the distance between that point and the query point.
    # Finding the nearest neighbors
    def predict(self, trainData, labels, testPoint):
        distances = {}
        # Calculating the distance from the test point to each point in the dataset
        for i in range(len(trainData)):
            distances[i] = self.euclidDistance(trainData.iloc[i], testPoint)
- Sort the distances and select the first k points. After getting the indexes of the k nearest points, we get the labels of those points.
        # Selecting the "k" nearest points
        neighbors = sorted(distances.items(), key=lambda kv: (kv[1], kv[0]))[:self.k]
        neighborLabels = []
        # Getting the labels of the "k" nearest points
        for neigh in neighbors:
            neighborLabels.append(labels[neigh[0]])
- Get the labels of the first k points and return the most frequent label.
        # Counting the label occurrences and returning the most frequent one
        outputClass = Counter(neighborLabels)
        print("The given data point belongs to:", outputClass.most_common(1)[0][0])
Loading the data
In this example, we’ll use the iris dataset to make predictions with our KNN classifier.
file = './iris.csv'
trainData = pd.read_csv(file)
trainData.head()

Now, let’s use our KNN classifier to perform some predictions.
model = KNN(15)

Now, let’s split our data into features and target and make a prediction for a test data point.
dataPoint = [1, 5, 2, 4]
features = trainData.iloc[:, :-1]
target = trainData.iloc[:, -1]
model.predict(features, target, dataPoint)
Our model takes the data point, calculates the distances, and outputs the most frequent label observed among the selected “k” nearest points.
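If you want a sanity check, you could compare the result with scikit-learn’s KNeighborsClassifier (this assumes scikit-learn is installed; it isn’t used anywhere else in this post):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=15)  # same k as our model
clf.fit(features, target)
print(clf.predict([dataPoint]))  # should agree with our classifier (ties aside)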

Yeah, we have built a KNN classifier with the Euclidean distance metric. You can download the whole code from our github repo.
Conclusion
In this post, we discussed clustering algorithms and the implementation of KNN in Python. In the next post, we’ll discuss the K-means algorithm. Until then, stay home, stay safe. Cheers✌️. Follow our blog’s Facebook page at fb.com/HelloWorldFB.