In our previous post, we discussed Logistic Regression and its implementation in Python. In this post, let's understand one of the most popular binary classification algorithms: the Support Vector Machine.
Support Vector Machine
Support Vector Machine, commonly abbreviated as SVM, is a popular supervised machine learning algorithm that can be used for both regression and classification, although it is mostly used for classification problems. In the SVM algorithm, we plot each data item as a point in an n-dimensional space, where n is the number of features and each coordinate represents one feature. To perform the classification task, the SVM algorithm finds an (n-1)-dimensional hyper-plane that separates the classes well.
Generally, when we try to find such a hyper-plane, we get many planes that perform the classification task. How can we identify the right hyper-plane?
How to select the right hyper-plane?
As we get many planes that perform the task, we select the hyper-plane with the maximum margin. The margin is the distance between the hyper-plane and the nearest data point of either class. Maximizing the margin helps us decide on the right hyper-plane. Let's understand this with a simple example.

Here, we have three hyper-planes A, B, and C, and all of them segregate the classes well.

Above, we can see that B has a large margin from the red circle class but is very close to the blue star class. Similarly, A is distant from the blue class but very close to the red class. C is the hyper-plane that is distant from both classes. So, we select C as the right hyper-plane for this data.
We always want a large margin, because the larger the margin, the lower the chance of misclassification.
Cost function
In the SVM algorithm, we try to maximize the margin between the data points and the hyper-plane. The "Hinge Loss" helps us do this.

If the predicted value and the actual value are of the same sign (and the prediction lies on or beyond the margin), the cost is 0; otherwise we calculate it as (1 − y · f(x)). The hinge loss function can therefore be written as

c(x, y, f(x)) = max(0, 1 − y · f(x))
We add a regularization term that balances margin maximization against the loss, so the cost function we minimize over w becomes

λ‖w‖² + Σᵢ max(0, 1 − yᵢ (w · xᵢ))

where λ controls the trade-off between a large margin and a small total hinge loss.
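As a quick, self-contained sketch (not from the original post; the toy data, weight vector, and variable names are made up for illustration), here is how the hinge loss and the regularized objective above can be computed for a linear decision function f(x) = w · x:

import numpy as np

# Toy data: labels must be -1 or +1 for the hinge loss
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.5])   # example weight vector
lam = 0.01                 # regularization strength (lambda)

f = X @ w                                       # predicted scores f(x) = w . x
hinge = np.maximum(0, 1 - y * f)                # hinge loss per sample: max(0, 1 - y * f(x))
objective = lam * np.dot(w, w) + hinge.sum()    # lambda * ||w||^2 + sum of hinge losses

print(hinge)
print(objective)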
Tuning Parameters
Kernel
If the data is linearly separable, the hyper-plane can be learned directly using linear algebra. However, if the data is not linearly separable, kernels are used to map the data non-linearly into a higher-dimensional feature space in which it becomes linearly separable.
A kernel accepts the original inputs in the lower-dimensional space and returns the dot product of the transformed vectors in the higher-dimensional space, without ever computing the transformation explicitly.
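As a small illustration of this idea (a sketch not taken from the original post), the degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs gives exactly the ordinary dot product of the explicit feature map φ(x) = (x₁², x₂², √2·x₁x₂), so the higher-dimensional mapping never has to be computed:

import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel on 2-D inputs
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = np.dot(x, z) ** 2          # kernel computed in the original 2-D space
explicit_dot = np.dot(phi(x), phi(z))     # dot product after mapping to 3-D space

print(kernel_value, explicit_dot)         # both print the same value (16.0)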

Gamma
Gamma defines how far the influence of a single training example reaches. With a low gamma, points far away from the plausible separation line are still considered when computing the separation line, whereas with a high gamma only the points close to the plausible line are considered.
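To make this concrete, here is a small sketch (an assumed example, using the RBF kernel K(x, z) = exp(-gamma · ‖x − z‖²), the form scikit-learn uses) showing how a far-away point's influence shrinks towards zero as gamma grows:

import numpy as np

def rbf_kernel(x, z, gamma):
    # RBF (Gaussian) kernel: similarity decays with squared distance, scaled by gamma
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
near = np.array([0.5, 0.5])
far = np.array([3.0, 3.0])

for gamma in [0.1, 1.0, 10.0]:
    # With high gamma the far point's similarity (influence) drops to almost zero
    print(gamma, rbf_kernel(x, near, gamma), rbf_kernel(x, far, gamma))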

SVM implementation in Python
Now we've reached the most interesting part of this post. Let's build a model by following the steps of the Machine Learning Pipeline.
Data Collection
In this notebook we use the Breast Cancer dataset, which can be downloaded from the UCI Machine Learning Repository; it is also available directly in the sklearn library.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset and combine features and target into a single DataFrame
cancer = load_breast_cancer()
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
df_cancer.head()

Selecting features using a correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the correlation matrix of all columns as a heatmap
plt.figure(figsize=(20,12))
corrMat = df_cancer.corr()
sns.heatmap(corrMat, annot=True)

We compare the correlations between features and, for any pair with an absolute correlation higher than 0.5, drop one of the two features.
# Start by keeping every column, then drop one column from each highly correlated pair
columns = np.full((corrMat.shape[0],), True, dtype=bool)
for i in range(corrMat.shape[0]):
    for j in range(i+1, corrMat.shape[0]):
        if corrMat.iloc[i,j] >= 0.5 or corrMat.iloc[i,j] <= -0.5:
            if columns[j]:
                columns[j] = False
selected_columns = df_cancer.columns[columns]
selected_columns
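The scaling and splitting steps below use features and y, which are not defined in the snippets shown here; a minimal sketch of how they can be built from the selected columns (the names features and y are chosen to match the later code) is:

# Build the feature matrix from the selected columns and separate the target label
features = df_cancer[selected_columns].drop(columns=['target'], errors='ignore')  # drop 'target' if it survived the filter
y = df_cancer['target']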

Data pre-processing
Since the features are on different scales, we use MinMaxScaler to scale them.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(features)
Splitting the dataset
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing (the default split is 75% train / 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y)
Building the model
Let's fit our data to an SVM classifier using the SVC class from sklearn.svm.
from sklearn.svm import SVC

# Train an SVM classifier with scikit-learn's default parameters (RBF kernel)
svc_model = SVC()
svc_model.fit(X_train, y_train)
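The kernel and gamma parameters discussed in the tuning section (along with the regularization strength C) can also be set explicitly; the values and the name svc_tuned below are illustrative assumptions, not tuned results:

# Example of setting the tuning parameters explicitly (illustrative values only)
svc_tuned = SVC(kernel='rbf', C=1.0, gamma=0.1)
svc_tuned.fit(X_train, y_train)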

Validating the model
Now, let's test our model on the unseen test data as follows.
# Predict labels for the held-out test set
y_predict = svc_model.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix

# In this dataset, target 0 = malignant (cancer) and 1 = benign (healthy)
cm = np.array(confusion_matrix(y_test, y_predict, labels=[0,1]))
confusion = pd.DataFrame(cm, index=['is_cancer', 'is_healthy'], columns = ['predicted_cancer','predicted_healthy'])
sns.heatmap(confusion, annot=True)

print(classification_report(y_test, y_predict))


So, our model achieves an accuracy of nearly 97%. You can find the complete code and dataset on our GitHub.
Conclusion
In this post, we discussed the Support Vector Machine and its implementation in Python. In the next post, we'll discuss clustering algorithms. Until then, stay home, stay safe. Cheers ✌️. Follow our blog's Facebook page at fb.com/HelloWorldFB.