<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Statistically Speaking]]></title><description><![CDATA[Exploring data science, AI, and real-world analytics through clear, insightful stories and hands-on guides. Welcome to Statistically Speaking.]]></description><link>https://statisticallyspeaking.tuhindutta.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1746943159393/a43556da-e26d-48e3-bf35-d91cc74b0a1e.png</url><title>Statistically Speaking</title><link>https://statisticallyspeaking.tuhindutta.com</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 10:45:30 GMT</lastBuildDate><atom:link href="https://statisticallyspeaking.tuhindutta.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Unveiling the Power of Support Vector Machines in Classification Tasks]]></title><description><![CDATA[Support Vector Machines (SVM) are among the most effective and widely used classification algorithms in machine learning. 
At its core, the SVM algorithm works by finding the optimal boundary—called a hyperplane—that best separates different classes i...]]></description><link>https://statisticallyspeaking.tuhindutta.com/unveiling-the-power-of-support-vector-machines-in-classification-tasks</link><guid isPermaLink="true">https://statisticallyspeaking.tuhindutta.com/unveiling-the-power-of-support-vector-machines-in-classification-tasks</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[support vector machines]]></category><category><![CDATA[SVM]]></category><category><![CDATA[classification]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[kernel trick]]></category><category><![CDATA[#MLAlgorithms]]></category><category><![CDATA[Python]]></category><category><![CDATA[sklearn]]></category><dc:creator><![CDATA[Tuhin Kumar Dutta]]></dc:creator><pubDate>Sat, 22 Jan 2022 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747245209095/b3111a92-cb2e-4136-a549-8c47d7a05272.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Support Vector Machines (SVM) are among the most effective and widely used classification algorithms in machine learning. At its core, the SVM algorithm works by finding the optimal boundary—called a <em>hyperplane</em>—that best separates different classes in a dataset.</p>
<ul>
<li><p>In the case of <strong>2-dimensional data</strong>, this boundary is a <strong>line</strong>.</p>
</li>
<li><p>For <strong>3-dimensional data</strong>, it becomes a <strong>plane</strong>.</p>
</li>
<li><p>And for <strong>higher dimensions</strong>, it is referred to as a <strong>hyperplane</strong>.</p>
</li>
</ul>
<p>To understand this concept intuitively, let's begin by exploring how SVM behaves with a simple 2D dataset. We'll start by generating an arbitrary dataset for visualization and demonstration purposes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd, numpy <span class="hljs-keyword">as</span> np, matplotlib.pyplot <span class="hljs-keyword">as</span> plt, seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">making_df</span>(<span class="hljs-params">slope_type=<span class="hljs-number">1</span>,n=<span class="hljs-number">100</span>, p1=<span class="hljs-number">100</span>, p2=<span class="hljs-number">30</span>, p3=<span class="hljs-number">2</span></span>):</span>
    np.random.seed(n)

    p=np.random.randn(p1)*p2
    q=np.random.randn(p1)*p2

    <span class="hljs-keyword">if</span> slope_type == <span class="hljs-number">1</span>:        
        r=p+p3*p.max()
        s=q-p3*q.max()
    <span class="hljs-keyword">else</span>:
        r=p+p3*p.max()
        s=q+p3*q.max()        


    df1 = pd.DataFrame({<span class="hljs-string">'ones'</span>:np.ones(len(p)),<span class="hljs-string">'feature1'</span>:p,
                       <span class="hljs-string">'feature2'</span>:q, <span class="hljs-string">'label'</span>:np.ones(len(p))})

    df2 = pd.DataFrame({<span class="hljs-string">'ones'</span>:np.ones(len(r)),<span class="hljs-string">'feature1'</span>:r,
                       <span class="hljs-string">'feature2'</span>:s, <span class="hljs-string">'label'</span>:[<span class="hljs-number">-1</span>]*len(r)})

    df = pd.concat([df1,df2], axis=<span class="hljs-number">0</span>)
    df = df.sample(n=len(df))
    <span class="hljs-keyword">return</span> [df,p,q,r,s]
</code></pre>
<p>To avoid diving too deeply into the implementation details, let’s assume that the above-defined function returns a dataframe consisting of two visually separable clusters of data points. These clusters represent two distinct classes:</p>
<ul>
<li><p><strong>(p, q)</strong> for <em>negative class</em> data points</p>
</li>
<li><p><strong>(r, s)</strong> for <em>positive class</em> data points</p>
</li>
</ul>
<p>The function also accepts a few input parameters, such as the desired <strong>slope direction</strong> (positive or negative) and other customization options for generating the dataset. Since we’ll be generating dataframes multiple times throughout our practical exploration, having a reusable function makes the process more efficient.</p>
<p>Now, to understand how SVM finds a boundary between these classes, we refer to the general equation of a straight line:</p>
<h3 id="heading-axbyc0"><strong><em>ax+by+c=0</em></strong></h3>
<p>Where:</p>
<ul>
<li><p><strong><em>x</em></strong> and <strong><em>y</em></strong> are the coordinate variables</p>
</li>
<li><p><strong><em>a</em></strong>, <strong><em>b</em></strong>, and <strong><em>c</em></strong> are constants</p>
</li>
</ul>
<p>From this, we can derive:</p>
<ul>
<li><p><strong>Slope</strong> of the line: <strong><em>−a/b</em></strong></p>
</li>
<li><p><strong>Y-intercept</strong>: <strong><em>−c/b</em></strong></p>
</li>
</ul>
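<p>As a quick numeric check, the slope and intercept follow directly from <em>a</em>, <em>b</em>, and <em>c</em> (illustrative values only):</p>

```python
# Slope and intercept of the line a*x + b*y + c = 0
a, b, c = -1, 1, 110
slope = -a / b        # 1.0
intercept = -c / b    # -110.0
```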
<p>Without delving further into theory just yet, let’s generate an arbitrary dataframe and plot it. This visual representation will help us clearly see the separation between classes and better understand what we are working with.</p>
<pre><code class="lang-python">obt = making_df(slope_type=<span class="hljs-number">1</span>,n=<span class="hljs-number">40</span>)

df = obt[<span class="hljs-number">0</span>]
p = obt[<span class="hljs-number">1</span>]
q = obt[<span class="hljs-number">2</span>]
r = obt[<span class="hljs-number">3</span>]
s = obt[<span class="hljs-number">4</span>]

plt.figure(figsize=(<span class="hljs-number">20</span>,<span class="hljs-number">8</span>))
plt.scatter(p, q, facecolors=<span class="hljs-string">'purple'</span>, marker=<span class="hljs-string">'_'</span>, s=<span class="hljs-number">100</span>)
plt.scatter(r,s, facecolors=<span class="hljs-string">'red'</span>,marker=<span class="hljs-string">'+'</span>, s=<span class="hljs-number">100</span>)
plt.xlabel(<span class="hljs-string">'Features'</span>, fontsize=<span class="hljs-number">16</span>)
plt.ylabel(<span class="hljs-string">'Label'</span>, fontsize=<span class="hljs-number">16</span>)

plt.show()
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/data2.png?w=1024" alt /></p>
<p>Now we have a clear dataset with distinct data points, conveniently labeled as ‘+1’ and ‘-1’.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">primary_fit</span>(<span class="hljs-params">weight</span>):</span> 
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">nor</span>(<span class="hljs-params">lst</span>):</span>
        factor = (<span class="hljs-number">1</span>/sum([i**<span class="hljs-number">2</span> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> lst[:<span class="hljs-number">2</span>]]))**<span class="hljs-number">0.5</span>
        <span class="hljs-keyword">return</span> [i*factor <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> lst[:<span class="hljs-number">2</span>]]
    a = nor(weight)[<span class="hljs-number">0</span>]
    b = nor(weight)[<span class="hljs-number">1</span>]
    c = weight[<span class="hljs-number">2</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">st</span>(<span class="hljs-params">a,b,c</span>):</span>
        slope = -(a/b)
        intercept = -(c/b)
        x = np.concatenate([p,r])
        y = slope*x + intercept
        plt.plot(x,y) 
        <span class="hljs-keyword">return</span> [slope,intercept]

    plt.figure(figsize=(<span class="hljs-number">20</span>,<span class="hljs-number">8</span>))
    plt.scatter(p, q, facecolors=<span class="hljs-string">'purple'</span>, marker=<span class="hljs-string">'_'</span>, s=<span class="hljs-number">100</span>)
    plt.scatter(r,s, facecolors=<span class="hljs-string">'red'</span>,marker=<span class="hljs-string">'+'</span>, s=<span class="hljs-number">100</span>)
    plt.xlabel(<span class="hljs-string">'Features'</span>, fontsize=<span class="hljs-number">16</span>)
    plt.ylabel(<span class="hljs-string">'Label'</span>, fontsize=<span class="hljs-number">16</span>)
    st(a,b,c)
    plt.show()


primary_fit([<span class="hljs-number">-1</span>,<span class="hljs-number">1</span>,<span class="hljs-number">110</span>])
</code></pre>
<p>We can try to fit a line using a = -1, b = 1, and c = 110.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/data-1.png?w=1024" alt /></p>
<p>With the selected values of <strong><em>a</em></strong>, <strong><em>b</em></strong>, and <strong><em>c</em></strong>, we were able to fit a line in the form of <strong><em>ax+by+c=0</em></strong></p>
<p>This line successfully separates the data points into two distinct clusters or classes. In the context of Support Vector Machines (SVM), this separating boundary is referred to as a <strong>hyperplane</strong>.</p>
<ul>
<li><p>In <strong>2D</strong>, the hyperplane is simply a line.</p>
</li>
<li><p>In <strong>3D</strong>, it becomes a 2D <strong>plane</strong>.</p>
</li>
<li><p>In higher dimensions (4D, 5D, etc.), the hyperplane is a flat subspace with one dimension fewer than the ambient space—something we can no longer visualize directly.</p>
</li>
</ul>
<p>From the plotted graph, we can logically infer:</p>
<ul>
<li><p>If <strong><em>ax+by+c &gt;</em> 0</strong>, the point can be classified as belonging to one class (say, <strong>–1</strong>)</p>
</li>
<li><p>If <strong><em>ax+by+c</em> &lt; 0</strong>, the point belongs to the other class (say, <strong>+1</strong>)</p>
</li>
</ul>
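<p>The sign rule above can be sketched as a tiny decision function (a toy illustration of the idea, not the scikit-learn API):</p>

```python
import numpy as np

def classify(points, a, b, c):
    """Toy sign-rule classifier: ax + by + c > 0 -> class -1, otherwise +1."""
    vals = a * points[:, 0] + b * points[:, 1] + c
    return np.where(vals > 0, -1, 1)

# With a = -1, b = 1, c = 110 the line is y = x - 110;
# the origin lies above it, so it falls on the -1 side of the rule
classify(np.array([[0.0, 0.0]]), -1, 1, 110)   # array([-1])
```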
<p>This behavior forms the basis for classification using SVM. However, one interesting observation is that different values of <strong><em>a</em></strong>, <strong><em>b</em></strong>, and <strong><em>c</em></strong> can generate <strong>different hyperplanes</strong>—each potentially separating the points in a slightly different manner.</p>
<p>Let’s explore this further by fitting another line using a new set of parameters:</p>
<p><strong>a=−1, b=3, c=80</strong></p>
<p>We’ll visualize how this new line affects the classification boundary and compare it with the previous one.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/data3.png?w=1024" alt /></p>
<p>We observe that with every new combination of values for <strong><em>a</em></strong>, <strong><em>b</em></strong>, and <strong><em>c</em></strong>, an entirely different line is produced. This illustrates a key challenge in Support Vector Machines:</p>
<ol>
<li><p><strong>The first issue</strong> is that the parameters <strong><em>a</em></strong>, <strong><em>b</em></strong>, and <strong><em>c</em></strong> need to be carefully optimized in order to find the <em>best possible separating line</em>—one that generalizes well to unseen data.</p>
</li>
<li><p><strong>The second issue</strong> arises when, during real-world application, new data points appear that are either very close to the decision boundary or even on the opposite side of it. These edge cases can lead to misclassifications and reduce the model’s reliability.</p>
</li>
</ol>
<p>To address both of these concerns, <strong>SVM introduces the concept of margins</strong>.</p>
<h3 id="heading-what-are-margins">What are Margins?</h3>
<p>Simply put, margins are the regions on either side of the classifier (or hyperplane) that define the boundaries of each class. More formally:</p>
<blockquote>
<p>A <strong>margin</strong> is the sum of the perpendicular distances from the classifier to the closest data point in each class.</p>
</blockquote>
<p>The goal of an SVM is not just to find <em>any</em> separating line, but to find the one that maximizes this margin. This optimal boundary is known as the <strong>maximum margin classifier</strong> and is considered to be the most robust in terms of generalization.</p>
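<p>Under that definition, the margin of a candidate line can be computed directly from the perpendicular-distance formula \(|ax+by+c|/\sqrt{a^2+b^2}\) (a minimal sketch, separate from the <code>sup</code> helper used below):</p>

```python
import numpy as np

def margin(cls_a, cls_b, a, b, c):
    """Sum of perpendicular distances from the line ax+by+c=0 to the
    closest point of each class; cls_a, cls_b are arrays of shape (n, 2)."""
    denom = np.hypot(a, b)
    d_a = np.min(np.abs(a * cls_a[:, 0] + b * cls_a[:, 1] + c)) / denom
    d_b = np.min(np.abs(a * cls_b[:, 0] + b * cls_b[:, 1] + c)) / denom
    return d_a + d_b

# For the line y = 0, one class above it and one below:
margin(np.array([[0.0, 1.0], [0.0, 3.0]]),
       np.array([[0.0, -2.0]]), 0, 1, 0)   # 1.0 + 2.0 = 3.0
```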
<p>In the next section, we’ll visualize these margins and understand how they make SVM one of the most powerful classification algorithms available.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/dis.png?w=656" alt class="image--center mx-auto" /></p>
<h2 id="heading-svm-formulation">🧠 SVM Formulation</h2>
<p>The fundamental goal of a Support Vector Machine (SVM) is to find a <strong>hyperplane</strong> that not only separates the data but does so with the <strong>maximum possible margin</strong>. This margin is the buffer zone that helps ensure better generalization when the model is applied to unseen data.</p>
<p>Mathematically, this objective can be expressed as:</p>
<p>$$l_i\cdot (W \cdot y_i) \geq M$$</p><p>Where:</p>
<ul>
<li><p>\(l_i\) is the label of the \(i^{th}\) training data point (either +1 or -1),</p>
</li>
<li><p>\(y_i\) is the feature vector of the \(i^{th}\) point,</p>
</li>
<li><p>\(W\) is the weight vector perpendicular to the hyperplane,</p>
</li>
<li><p>\(M\) is the <strong>margin</strong> we want to maximize.</p>
</li>
</ul>
<h3 id="heading-understanding-the-expression">🧾 Understanding the Expression</h3>
<p>The term \(W \cdot y_i\)​ represents the <strong>projected distance</strong> of the point from the hyperplane. Since both \(l_i\) (the class label) and the projection share the same sign, their product will always be <strong>positive</strong> when the point is on the correct side of the margin.</p>
<p>Thus, the constraint \( l_i\cdot (W \cdot y_i) \geq M\) ensures that all data points lie <strong>outside or on the margin</strong>, reinforcing the separation.</p>
<hr />
<h3 id="heading-distance-from-a-point-to-a-hyperplane">📐 Distance from a Point to a Hyperplane</h3>
<p>For any point \(P\), the perpendicular distance \(d\) from a line (or hyperplane) defined by \(ax+by+c=0\) is given by:</p>
<p>$$d=\frac{|a x + b y + c|}{\sqrt{a^2 + b^2}}$$</p><p>This formula helps quantify how close each point is to the classifier. In the context of SVM, we are particularly interested in the <strong>closest points</strong> from both classes—these are the <strong>support vectors</strong>.</p>
<hr />
<h3 id="heading-goal-of-svm">🎯 Goal of SVM</h3>
<p>The classifier that we are searching for must be the one that is <strong>farthest from the closest points</strong> (support vectors) of both classes. That is, the SVM algorithm <strong>maximizes the minimum distance</strong> to any data point, ensuring the most robust separation.</p>
<p>This leads to the optimization problem at the heart of SVM:</p>
<p>\(\min \frac{1}{2}\|W\|^2\) subject to \(l_i(W \cdot y_i + b) \geq 1\)</p>
<p>In simple terms, this optimization tries to:</p>
<ul>
<li><p>Minimize the norm of the weight vector (which is equivalent to maximizing margin),</p>
</li>
<li><p>While making sure that all data points are correctly classified and lie outside the margin.</p>
</li>
</ul>
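<p>To make the optimization concrete, here is a hedged subgradient-descent sketch of the primal objective (a Pegasos-style toy with made-up hyperparameters, not how library solvers such as scikit-learn's <code>SVC</code> actually solve it—those work on the dual problem):</p>

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500, seed=0):
    """Toy subgradient descent on (lam/2)||w||^2 + hinge loss.
    X: (n, d) feature matrix, y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:      # margin violated: hinge term active
                w -= lr * (lam * w - y[i] * X[i])
                b += lr * y[i]
            else:                              # only the regularizer pulls on w
                w -= lr * lam * w
    return w, b

# Two tiny separable blobs along the diagonal
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
np.sign(X @ w + b)   # recovers the labels [1, 1, -1, -1]
```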
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sup</span>(<span class="hljs-params">weight, epsilon=<span class="hljs-number">0</span></span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">nor</span>(<span class="hljs-params">lst</span>):</span>
        factor = (<span class="hljs-number">1</span>/sum([i**<span class="hljs-number">2</span> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> lst[:<span class="hljs-number">2</span>]]))**<span class="hljs-number">0.5</span>
        <span class="hljs-keyword">return</span> [i*factor <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> lst[:<span class="hljs-number">2</span>]]

    a = nor(weight)[<span class="hljs-number">0</span>]
    b = nor(weight)[<span class="hljs-number">1</span>]
    c = weight[<span class="hljs-number">2</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">st</span>(<span class="hljs-params">a,b,c</span>):</span>
        slope = -(a/b)
        intercept = -(c/b)

        x = np.concatenate([p,r])
        y = slope*x + intercept
        <span class="hljs-keyword">return</span> [x,y]

    df[<span class="hljs-string">'distances'</span>] = np.matmul(np.array(df.iloc[:,:<span class="hljs-number">3</span>]),np.array([c,a,b]))

    df[<span class="hljs-string">'formulation'</span>] = df.label * df.distances

    <span class="hljs-keyword">if</span> -(a/b)&gt;<span class="hljs-number">0</span>:
        df_neg = df[df.label&lt;<span class="hljs-number">0</span>].sort_values(<span class="hljs-string">'formulation'</span>)
        df_pos = df[df.label&gt;<span class="hljs-number">0</span>].sort_values(<span class="hljs-string">'formulation'</span>)
    <span class="hljs-keyword">elif</span> -(a/b)&lt;<span class="hljs-number">0</span>:
        df_neg = df[df.label&lt;<span class="hljs-number">0</span>].sort_values(<span class="hljs-string">'formulation'</span>, ascending=<span class="hljs-literal">False</span>)
        df_pos = df[df.label&gt;<span class="hljs-number">0</span>].sort_values(<span class="hljs-string">'formulation'</span>, ascending=<span class="hljs-literal">False</span>)



    neg_dist = df_neg.formulation.values[<span class="hljs-number">0</span>]
    pos_dist = df_pos.formulation.values[<span class="hljs-number">0</span>]


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">vert_dist</span>(<span class="hljs-params">distance</span>):</span>
        <span class="hljs-keyword">import</span> math
        slope = -(a/b)
        angle = abs(math.degrees(math.atan(slope)))
        vertical = distance/math.cos(math.radians(angle))
        <span class="hljs-keyword">return</span> vertical

    neg_vert_dist = vert_dist(neg_dist)
    pos_vert_dist = vert_dist(pos_dist)

    intercept = -(c/b)
    pos_intercept = intercept + pos_vert_dist
    neg_intercept = intercept - neg_vert_dist
    c_intercept = min([neg_intercept, pos_intercept]) + abs((neg_intercept - pos_intercept)/<span class="hljs-number">2</span>)

    c_pos = -(pos_intercept*b)
    c_neg = -(neg_intercept*b)
    c_new = -(c_intercept*b)

    <span class="hljs-keyword">if</span> epsilon != <span class="hljs-number">0</span>:

        margin = (neg_dist+pos_dist)*(<span class="hljs-number">1</span>-epsilon)
        distance = margin/<span class="hljs-number">2</span>
        dist_vert_dist = vert_dist(distance)

        pos_intercept = c_intercept + dist_vert_dist
        neg_intercept = c_intercept - dist_vert_dist
        c_pos = -(pos_intercept*b)
        c_neg = -(neg_intercept*b) 


    plt.figure(figsize=(<span class="hljs-number">20</span>,<span class="hljs-number">8</span>))
    plt.scatter(p, q, facecolors=<span class="hljs-string">'purple'</span>, marker=<span class="hljs-string">'_'</span>, s=<span class="hljs-number">100</span>)
    plt.scatter(r,s, facecolors=<span class="hljs-string">'red'</span>, marker=<span class="hljs-string">'+'</span>, s=<span class="hljs-number">100</span>)
    plt.xlabel(<span class="hljs-string">'Features'</span>, fontsize=<span class="hljs-number">16</span>)
    plt.ylabel(<span class="hljs-string">'Label'</span>, fontsize=<span class="hljs-number">16</span>)

    plt.plot(st(a,b,c_new)[<span class="hljs-number">0</span>], st(a,b,c_new)[<span class="hljs-number">1</span>], linewidth=<span class="hljs-number">3.5</span>, label=<span class="hljs-string">'classifier'</span>)
    plt.plot(st(a,b,c_pos)[<span class="hljs-number">0</span>], st(a,b,c_pos)[<span class="hljs-number">1</span>], <span class="hljs-string">'--'</span>, linewidth=<span class="hljs-number">1</span>, alpha=<span class="hljs-number">0.8</span>, label=<span class="hljs-string">'negative margin'</span>)
    plt.plot(st(a,b,c_neg)[<span class="hljs-number">0</span>], st(a,b,c_neg)[<span class="hljs-number">1</span>], <span class="hljs-string">'--'</span>, linewidth=<span class="hljs-number">1</span>, alpha=<span class="hljs-number">0.8</span>, label=<span class="hljs-string">'positive margin'</span>)
    plt.legend()
    plt.show()


weights = [<span class="hljs-number">-1</span>,<span class="hljs-number">1</span>,<span class="hljs-number">110</span>]
sup(weights)
</code></pre>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Note:</strong> The above function is solely intended for visualization and demonstration purposes to aid conceptual understanding. It does not represent the actual model-building process in practice.</div>
</div>

<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/pro.png?w=1024" alt /></p>
<p>The negative and positive margin lines pass through the nearest negative and positive data points to the classifier, respectively. However, there can be instances where data points lie very close to the classifier. In such scenarios, the model may attempt to overfit the data by shrinking the margin, as it strives to achieve the ideal condition of zero classification errors. Let’s visualize how this behavior manifests.</p>
<pre><code class="lang-python">sup(weights, overfit=<span class="hljs-literal">True</span>)  <span class="hljs-comment"># 'overfit' is a flag from an extended version of sup not shown above</span>
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/overfit.png?w=1024" alt /></p>
<p>While this overfitted model may perform exceptionally well during training, it suffers a significant drawback. Due to the narrow margin—i.e., a highly constrained decision boundary—the model is more likely to misclassify data during evaluation or testing. Let’s understand this visually. Suppose we receive a new data point that truly belongs to the positive class but falls just above the negative margin. The model, relying on the tight boundary, would incorrectly classify it as ‘-1’. This misclassification leads to a drop in overall accuracy.</p>
<p>To address this issue, Support Vector Machines introduce a <strong>slack variable (ϵ)</strong> into the formulation. This allows the model to tolerate some degree of misclassification, thereby increasing its generalization ability.</p>
<p>The updated SVM constraint becomes:</p>
<p>$$l_i\cdot (W \cdot y_i) \geq M(1 - \epsilon_i)$$</p><p>From this formulation, we can clearly infer that the margin is inversely proportional to the slack variable. The slack variable <strong>ϵᵢ</strong> represents the degree of allowable error for each data point. In other words, it measures how much a data point is permitted to violate the margin.</p>
<p>A higher value of ϵ indicates more flexibility, meaning the model is allowed to misclassify certain points, thus preventing overfitting. Let’s visualize the impact of this by setting ϵ = 0.6 and examining how the decision boundary adapts.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/sof.png?w=1024" alt /></p>
<p>Now, the model becomes <strong>generalized</strong> by allowing a few training data points to cross the margin boundaries. These relaxed thresholds are known as <strong>soft margins</strong>, which define the extent to which the model can tolerate errors in the training set without significantly compromising its generalization capabilities.</p>
<p>By incorporating this flexibility using the slack variable <em>ε</em>, the model avoids overfitting and becomes more robust to unseen data. Let’s now revisit the same scenario illustrated previously, where the overfitted model misclassified a positive point due to overly narrow margins, and observe how the introduction of soft margins improves the outcome.</p>
<pre><code class="lang-python">sup(weights, epsilon=<span class="hljs-number">0.6</span>, test_pos_pos=<span class="hljs-number">48</span>)  <span class="hljs-comment"># 'test_pos_pos' belongs to the extended demo version of sup</span>
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/conseq.png?w=1024" alt /></p>
<p>Unlike the earlier overfitted model, this generalized version will no longer misclassify the data point (represented by the red dot), as it now falls within the permissible margin. This demonstrates how the introduction of soft margins helps the model tolerate minor violations and improves overall robustness.</p>
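<p>In scikit-learn, this hardness/softness trade-off is exposed through the <code>C</code> parameter of <code>SVC</code>: roughly, a large <code>C</code> behaves like a hard margin, while a small <code>C</code> tolerates more slack. A small sketch on made-up blob data (the numbers are purely illustrative):</p>

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy blobs, just to show the C knob on a linear SVC
X = np.array([[0, 0], [1, 1], [1, 0], [8, 8], [9, 9], [8, 9]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # wide, forgiving margin
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow, strict margin
soft.predict([[2.0, 2.0], [7.0, 7.0]])          # array([-1, 1])
```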
<p>I hope this demonstration has provided a clear understanding of the foundational concepts behind Support Vector Machines. Now, let’s explore an additional insight: as the value of <em>ε</em> increases, the soft margin becomes narrower. Conversely, when <em>ε=1</em>, the soft margins coincide exactly with the classifier, as the condition reduces to \(l_i(W \cdot y_i) \geq 0\). Let’s visualize how the fit changes with different values of <em>ε</em>.</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> np.arange(<span class="hljs-number">0.2</span>,<span class="hljs-number">1.1</span>,<span class="hljs-number">0.2</span>):
    sup(weights, epsilon=i, show=<span class="hljs-literal">True</span>)  <span class="hljs-comment"># 'show' belongs to the extended demo version of sup</span>
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/1.png?w=1024" alt /></p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/3.png?w=1024" alt /></p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/5.png?w=1024" alt /></p>
<p>The greater the value of <em>ε</em>, the simpler the model becomes. A larger <em>ε</em> allows the model to be more flexible, reducing its tendency to fit the classifier too aggressively and thereby avoiding overfitting.</p>
<p>However, all these formulations assume that the classes are linearly separable—an ideal condition that rarely holds true in real-world scenarios. To better reflect practical situations, let’s now construct another dataset that introduces some overlap between classes, mimicking the complexity of real-world data.</p>
<pre><code class="lang-python">x = np.linspace(<span class="hljs-number">-5.0</span>, <span class="hljs-number">2.0</span>, <span class="hljs-number">100</span>)
y = np.sqrt(<span class="hljs-number">10</span>**<span class="hljs-number">2</span> + x**<span class="hljs-number">2</span>)
y=np.hstack([y,-y])
x=np.hstack([x,-x])

x1 = np.linspace(<span class="hljs-number">-5.0</span>, <span class="hljs-number">2.0</span>, <span class="hljs-number">100</span>)
y1 = np.sqrt(<span class="hljs-number">5</span>**<span class="hljs-number">2</span> - x1**<span class="hljs-number">2</span>)
y1=np.hstack([y1,-y1])
x1=np.hstack([x1,-x1])

df1 =pd.DataFrame(np.vstack([y,x]).T,columns=[<span class="hljs-string">'X1'</span>,<span class="hljs-string">'X2'</span>])
df1[<span class="hljs-string">'Y'</span>]=<span class="hljs-number">0</span>
df2 =pd.DataFrame(np.vstack([y1,x1]).T,columns=[<span class="hljs-string">'X1'</span>,<span class="hljs-string">'X2'</span>])
df2[<span class="hljs-string">'Y'</span>]=<span class="hljs-number">1</span>
df = pd.concat([df1, df2])  <span class="hljs-comment"># DataFrame.append was removed in pandas 2.0</span>

df.head()
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/tab.png?w=181" alt /></p>
<p>Let's take a look at the graphical representation of the data.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/gr.png?w=370" alt /></p>
<p>As we can see, these data points are not linearly separable. To handle such cases, the concept of <strong>Kernels</strong> was introduced. Kernels are mathematical functions that, when applied to the data points from different features, transform them into a higher-dimensional space, allowing us to better classify the data. Essentially, a kernel function allows us to find a decision boundary in a transformed feature space, even when the original space is not linearly separable.</p>
<p>There are several types of kernels, each of which performs a different transformation. While the mathematical expressions for these kernels vary, their main goal is the same: to map the data points into a higher-dimensional space where they become linearly separable.</p>
<p>For simplicity, let’s consider one of the most commonly used kernels: <strong>Polynomial Kernel</strong>.</p>
<p>$$K(x,y)=(x^Ty+c)^d$$</p><p>This is the mathematical expression for the Polynomial kernel. Let’s now calculate the transformation and observe the new dimensions formulated by the kernel. In our dataset, \(x\) and \(y\) are represented by \(X_1\) and \(X_2\) respectively.</p>
<p>When applying the Polynomial kernel function, we are essentially mapping the input features (from a lower-dimensional space) into a higher-dimensional space. The transformation process allows us to achieve linear separability in situations where the original data is not linearly separable.</p>
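<p>The equivalence can be verified numerically: a degree-2 polynomial kernel with \(c=0\) matches an explicit dot product in the feature space \((x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)\). (The \(\sqrt{2}\) factor makes the inner products match exactly; the \(X_1 X_2\) column computed below drops that constant, which only rescales one axis.) A small sketch with arbitrary example vectors:</p>

```python
import numpy as np

def poly_kernel(u, v, c=0.0, d=2):
    return (u @ v + c) ** d

def phi(u):
    # explicit degree-2 feature map (for c = 0): x1^2, x2^2, sqrt(2)*x1*x2
    return np.array([u[0]**2, u[1]**2, np.sqrt(2) * u[0] * u[1]])

u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
poly_kernel(u, v)     # (1*3 + 2*4)^2 = 121.0
phi(u) @ phi(v)       # 121.0 as well -- same value, without forming the map
```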
<p>We can compute the transformed feature space for our dataset by applying the kernel to pairs of data points, where each pair is processed through the following:</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/mt.png?w=1024" alt /></p>
<p>From the above derivation, we obtained three new dimensions, namely:</p>
<ol>
<li><p>\(X_1^2\)</p>
</li>
<li><p>\(X_2^2\)</p>
</li>
<li><p>\(X_1 \cdot X_2\)</p>
</li>
</ol>
<p>Let's calculate these new dimensions and integrate them into the original dataframe.</p>
<pre><code class="lang-python">df[<span class="hljs-string">'X1_Square'</span>]= df[<span class="hljs-string">'X1'</span>]**<span class="hljs-number">2</span>
df[<span class="hljs-string">'X2_Square'</span>]= df[<span class="hljs-string">'X2'</span>]**<span class="hljs-number">2</span>
df[<span class="hljs-string">'X1X2'</span>] = (df[<span class="hljs-string">'X1'</span>] *df[<span class="hljs-string">'X2'</span>])
df.head()
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/inc-1.png?w=401" alt /></p>
<p>Now, let’s plot these 3 newly formed dimensions in a 3D graph.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Creating dataset</span>
df1 = df[df.Y==<span class="hljs-number">0</span>]
a = np.array(df1.X1_Square)
b = np.array(df1.X2_Square)
c = np.array(df1.X1X2)

<span class="hljs-comment"># Creating figure</span>
plt.figure(figsize = (<span class="hljs-number">13</span>,<span class="hljs-number">13</span>))
ax = plt.axes(projection =<span class="hljs-string">"3d"</span>)

<span class="hljs-comment"># Creating plot</span>
ax.scatter3D(a, b, c, s=<span class="hljs-number">8</span>, label=<span class="hljs-string">'class 0'</span>)

<span class="hljs-comment"># Creating dataset</span>
df1 = df[df.Y==<span class="hljs-number">1</span>]
a = np.array(df1.X1_Square)
b = np.array(df1.X2_Square)
c = np.array(df1.X1X2)

<span class="hljs-comment"># Creating plot</span>
ax.scatter3D(a, b, c, s=<span class="hljs-number">8</span>, label=<span class="hljs-string">'class 1'</span>)

ax.set_xlabel(<span class="hljs-string">'X1_Square'</span>, fontsize=<span class="hljs-number">13</span>)
ax.set_ylabel(<span class="hljs-string">'X2_Square'</span>, fontsize=<span class="hljs-number">13</span>)
ax.set_zlabel(<span class="hljs-string">'X1X2'</span>, fontsize=<span class="hljs-number">13</span>)
plt.legend(fontsize=<span class="hljs-number">12</span>)

ax.view_init(<span class="hljs-number">30</span>,<span class="hljs-number">250</span>)

<span class="hljs-comment"># show plot</span>
plt.show()
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/08/plt-1.png?w=721" alt /></p>
<p>After applying the kernel, we can easily imagine fitting a 2D plane between the two classes, using all the theoretical concepts discussed earlier. Before the kernel transformation the data was not linearly separable; now a hyperplane exists that cleanly separates the two classes. This becomes possible only because of the additional dimensions introduced over the initial 2D Cartesian plane, which allow the model to find a suitable separating boundary in the transformed feature space.</p>
<p>I hope this explanation provided a clear understanding of the Support Vector Machine (SVM) classifier algorithm and the internal process it follows. Although the practical implementation using automated processes with libraries (such as <code>scikit-learn</code>) has not been discussed here, I can assure you that once the working of the algorithm is clear, model building becomes pretty straightforward.</p>
<p>For further details and practical implementation, you can refer to the scikit-learn SVC documentation.</p>
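<p>In practice, the entire pipeline above is handled internally by scikit-learn's <code>SVC</code>, which applies the kernel trick without ever materializing the extra dimensions. A minimal sketch on synthetic ring-shaped data (the dataset and hyperparameters here are illustrative assumptions, not the dataset used in this article):</p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two rings: class 0 inside the unit circle, class 1 outside --
# not linearly separable in the original 2D space.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# A degree-2 polynomial kernel reproduces exactly the X1^2, X2^2, X1*X2
# feature space derived above.
clf = SVC(kernel="poly", degree=2, coef0=0.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

<p>The circular boundary is linear in the squared features, so the polynomial kernel separates the rings where a linear kernel cannot.</p>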
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>For more insights, projects, and articles, visit my portfolio at </em><a target="_blank" class="autolinkedURL autolinkedURL-url" href="http://www.tuhindutta.com/"><em>tuhindutta.com</em></a><em>.</em></div>
</div>]]></content:encoded></item><item><title><![CDATA[Boosting Machine Learning Adaboost Guide]]></title><description><![CDATA[Boosting is one of the most widely used classes of algorithms in machine learning, applied globally to tackle a variety of complex problems. It involves combining multiple weak learners—typically simple models that perform just slightly better than r...]]></description><link>https://statisticallyspeaking.tuhindutta.com/boosting-machine-learning-adaboost-guide</link><guid isPermaLink="true">https://statisticallyspeaking.tuhindutta.com/boosting-machine-learning-adaboost-guide</guid><category><![CDATA[ML Algorithms]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[boosting]]></category><category><![CDATA[#MachineLearning #AdaBoost #EnsembleLearning #AI #Python #DataScience #Boosting #TechInnovation #MLAlgorithms #Hashnode #AIModels #TechBlog #DataAnalytics]]></category><category><![CDATA[#MLAlgorithms]]></category><category><![CDATA[Python]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[ensemblelearning]]></category><category><![CDATA[adaboost]]></category><dc:creator><![CDATA[Tuhin Kumar Dutta]]></dc:creator><pubDate>Wed, 17 Nov 2021 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746973654279/d34434a2-f98d-4825-ab07-291cb37bf043.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Boosting</strong> is one of the most widely used classes of algorithms in machine learning, applied globally to tackle a variety of complex problems. It involves combining multiple <strong>weak learners</strong>—typically simple models that perform just slightly better than random guessing—to collectively form a strong predictive model. Each learner is trained sequentially, with a focus on correcting the mistakes made by its predecessors. Misclassified or hard-to-learn data points are given more importance in subsequent rounds. 
In this blog, we’ll explore one of the simplest and most well-known boosting algorithms, <strong>AdaBoost</strong> (Adaptive Boosting). We'll also implement it from scratch and demonstrate how it can significantly improve model accuracy.</p>
<p>Before diving into AdaBoost, let’s first understand the core principle of boosting through the following illustration.</p>
<p><a target="_blank" href="https://criticalmind.tech.blog/wp-content/uploads/2021/09/9ca3e-1m2uhkzwwj0kfqyl5tbfnsq.png"><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/boost.png?w=544" alt /></a></p>
<p>In the illustration above, Boxes 1, 2, and 3 represent classifications made by individual models—D1, D2, and D3—each of which is a weak learner and performs poorly when used alone. However, when these models are combined, as shown in Box 4, they work together to make highly accurate predictions.</p>
<p>This is the core idea behind <strong>AdaBoost</strong> as well. In AdaBoost, each model’s prediction is weighted, and these weights are updated during training based on whether the model classifies each instance correctly or not. Incorrectly classified instances are given higher weights, making them more influential in the next iteration.</p>
<p>We’ll walk through each step of the AdaBoost algorithm and its implementation. For practical understanding, we’ll use the <strong>Breast Cancer</strong> dataset from the <code>sklearn</code> library. Let's start by loading and visualizing the dataset.</p>
<pre><code class="lang-python">df = pd.DataFrame(load_breast_cancer().data, columns=load_breast_cancer().feature_names)
df[<span class="hljs-string">'label'</span>] = load_breast_cancer().target

copy = df.copy()

df.head()
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/da1.png?w=951" alt /></p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/da2.png?w=399" alt /></p>
<p>Since the dataset contains a large number of features, we’ll primarily focus on the <code>label</code> column along with the new columns that will be appended next to it during the boosting process.</p>
<p>To begin, let’s use a <strong>Decision Tree</strong> as our initial weak learner to make a first-round prediction and observe the accuracy achieved.</p>
<pre><code class="lang-python">dt = DecisionTreeClassifier(random_state=<span class="hljs-number">50</span>, min_samples_split=<span class="hljs-number">100</span>)
dt.fit(df.iloc[:,:<span class="hljs-number">-1</span>],df.iloc[:,<span class="hljs-number">-1</span>])
df[<span class="hljs-string">'prediction'</span>] = dt.predict(df.iloc[:,:<span class="hljs-number">-1</span>])
df.iloc[:<span class="hljs-number">5</span>,<span class="hljs-number">-2</span>:]
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/pr-1.png?w=139" alt /></p>
<pre><code class="lang-bash">Accuracy = 0.945518453427065
We are getting an accuracy of around 95%.
</code></pre>
<p>Let’s now apply AdaBoost and see whether it improves the model's accuracy.</p>
<h2 id="heading-step-1">Step 1</h2>
<p>Assign an initial weight of \(1/m\) to each data point, where \(m\) is the total number of data points in the dataset.</p>
<pre><code class="lang-python">df[<span class="hljs-string">'weight'</span>] = <span class="hljs-number">1</span>/len(df)
df.iloc[:<span class="hljs-number">5</span>,<span class="hljs-number">-3</span>:]
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/wei.png?w=199" alt /></p>
<h2 id="heading-step-2">Step 2</h2>
<p>Calculate the number of incorrectly classified data points.</p>
<pre><code class="lang-python">
no_of_errors = len(df[df.label != df.prediction1])
no_of_errors
</code></pre>
<p><strong>There are 31 incorrectly classified data points.</strong></p>
<p>Next, let's calculate the total error, where the total error is given by:</p>
<p><strong>total error = no_of_errors/total number of data points</strong></p>
<pre><code class="lang-python">total_errors = no_of_errors/len(df)
total_errors
</code></pre>
<p>The total error is calculated to be <strong>0.0545</strong>.</p>
<h3 id="heading-step-3">Step 3</h3>
<p>Calculate the amount of say (α).</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/say.png?w=976" alt /></p>
<pre><code class="lang-python">alpha = <span class="hljs-number">0.5</span> * np.log((<span class="hljs-number">1</span>-total_errors)/total_errors)
alpha
</code></pre>
<p>We get 1.423 as the value for the amount of say.</p>
<h3 id="heading-step-4">Step 4</h3>
<p><strong>The weight update is performed using the following rules:</strong></p>
<ul>
<li><p>For a correct prediction:</p>
<p>  <strong>updated weight = old weight X e<sup>–α</sup></strong></p>
</li>
<li><p>For an incorrect prediction:</p>
<p>  <strong>updated weight = old weight X e<sup>α</sup></strong></p>
</li>
</ul>
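<p>Plugging in the numbers from this walkthrough (initial weight \(1/569\), since the breast cancer dataset has 569 rows, and \(\alpha \approx 1.423\) from Step 3), the two rules can be evaluated directly for one correctly and one incorrectly classified point; this sketch just computes the formulas:</p>

```python
import numpy as np

alpha = 1.423        # amount of say from Step 3
m = 569              # rows in the breast cancer dataset
w = 1 / m            # initial weight from Step 1

w_correct = w * np.exp(-alpha)    # correct prediction: weight shrinks
w_incorrect = w * np.exp(alpha)   # incorrect prediction: weight grows

print(w_correct, w_incorrect)     # roughly 0.00042 and 0.0073
```

<p>Misclassified points thus end up with a weight roughly 17 times larger than correctly classified ones before normalization.</p>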
<pre><code class="lang-python">df[<span class="hljs-string">'weight_updated'</span>] = df.loc[df.label != df.prediction].weight * np.exp(alpha)
df.weight_updated = df[<span class="hljs-string">'weight_updated'</span>].fillna(df[df.label == df.prediction].weight * np.exp(-alpha))
df.iloc[:<span class="hljs-number">5</span>,<span class="hljs-number">-4</span>:]
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/upda.png?w=300" alt /></p>
<p>The updated weights are then normalized, ensuring that the sum of all the weights in the column equals 1.</p>
<pre><code class="lang-python">df.weight_updated = df.weight_updated/df.weight_updated.sum()
df.iloc[:<span class="hljs-number">5</span>,<span class="hljs-number">-4</span>:]
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/nor.png?w=304" alt /></p>
<p>Now, we can observe that in the updated weights column, the values for the incorrectly classified data points are significantly higher compared to those for the correctly classified ones.</p>
<h2 id="heading-step-5">Step 5</h2>
<p>Next, ranges are created for each updated weight, representing the cumulative sum of the values in the column. For example, the range for index 0 is '0 to 0.016129', the range for index 1 is '0.016129 to (0.016129 + 0.000929)', the range for index 2 is '(0.016129 + 0.000929) to (0.016129 + 0.000929 + 0.000929)', and so on.</p>
<p>In this manner, the range values for the last index will sum up to 1, as the weights have been normalized, ensuring their total is equal to 1.</p>
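<p>Because each range's upper bound is just a running total of the normalized weights, the whole column can be built with a single <code>cumsum</code> call. A small sketch with made-up weights, showing that heavier points own wider intervals:</p>

```python
import numpy as np
import pandas as pd

# Toy normalized weights standing in for df.weight_updated (made-up values)
weights = pd.Series([0.5, 0.125, 0.25, 0.125])

# Upper bound of each interval is the cumulative sum; the last value is 1.0
# because the weights are normalized.
ranges = weights.cumsum()
print(ranges.tolist())   # [0.5, 0.625, 0.875, 1.0]

# Sampling: a uniform random number lands in exactly one interval, and the
# heaviest point (index 0, interval width 0.5) is picked most often.
r = np.random.rand()
picked = int((ranges > r).idxmax())
```
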
<h2 id="heading-step-6">Step 6</h2>
<p>Resampling of data is performed by selecting a random number between 0 and 1. The range in which this number falls determines which index from the 'df' dataframe is included in the new resampled dataframe. Since the weights of the incorrectly predicted data points are higher, the corresponding ranges will also be larger. As a result, many of the randomly chosen numbers will fall within the ranges of the incorrectly predicted data points. Consequently, these data points will be repeated more frequently in the resampled dataframe, giving them more priority in subsequent iterations.</p>
<pre><code class="lang-python">resampled = pd.DataFrame(columns=df.columns[:<span class="hljs-number">31</span>])
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(df)):
    index = df[df.ranges == df[np.random.rand()&lt;df.ranges].ranges.min()].index
    resampled.loc[i] = list(df.iloc[index,:<span class="hljs-number">31</span>].values[<span class="hljs-number">0</span>])

resampled.head()
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/res.png?w=944" alt /></p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/res2.png?w=394" alt /></p>
<p>The resampled data is then processed in the same way as in step 1. These 6 steps are repeated iteratively until the total error becomes zero or a preset maximum number of iterations is reached. To automate this process, let's build a function that accumulates all the above steps to perform the iterations.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">adaboost</span>(<span class="hljs-params">df</span>):</span>
    dt = DecisionTreeClassifier(random_state=<span class="hljs-number">50</span>, min_samples_split=<span class="hljs-number">100</span>)
    dt.fit(df.iloc[:,:<span class="hljs-number">30</span>],df.iloc[:,<span class="hljs-number">30</span>])
    df[<span class="hljs-string">'prediction'</span>] = dt.predict(df.iloc[:,:<span class="hljs-number">30</span>])

    df[<span class="hljs-string">'weight'</span>] = <span class="hljs-number">1</span>/len(df)

    no_of_errors = len(df[df.label != df.prediction])

    total_errors = no_of_errors/len(df)

    alpha = <span class="hljs-number">0.5</span> * np.log((<span class="hljs-number">1</span>-total_errors)/total_errors)

    df[<span class="hljs-string">'weight_updated'</span>] = df.loc[df.label != df.prediction].weight * np.exp(alpha)
    df.weight_updated = df[<span class="hljs-string">'weight_updated'</span>].fillna(df[df.label == df.prediction].weight * np.exp(-alpha))

    df.weight_updated = df.weight_updated/df.weight_updated.sum()

    p = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(df)):
        df.loc[i,<span class="hljs-string">'ranges'</span>] = df.loc[i,<span class="hljs-string">'weight_updated'</span>] + p
        p = df.loc[i,<span class="hljs-string">'ranges'</span>]

    resampled = pd.DataFrame(columns=df.columns[:<span class="hljs-number">31</span>])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(df)):
        index = df[df.ranges == df[np.random.rand()&lt;df.ranges].ranges.min()].index
        resampled.loc[i] = list(df.iloc[index,:<span class="hljs-number">31</span>].values[<span class="hljs-number">0</span>])  

    df = resampled

    <span class="hljs-keyword">return</span> [df, dt]
</code></pre>
<p>The above function returns the resampled DataFrame and the trained model from each iteration. Upon execution, it stores the final resampled DataFrame along with the list of trained models across all iterations.</p>
<pre><code class="lang-python">df = copy.copy()

models = []    

<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">for</span> iter <span class="hljs-keyword">in</span> range(<span class="hljs-number">20</span>):        
        ada = adaboost(df)
        df = ada[<span class="hljs-number">0</span>]    
        models.append(ada[<span class="hljs-number">1</span>])
        print(<span class="hljs-string">'Decision stamp {0}'</span>.format(iter+<span class="hljs-number">1</span>))

<span class="hljs-keyword">except</span> Exception:
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/dec-1.png?w=154" alt /></p>
<p>We have obtained 10 decision stumps (weak learners), which will now be used collectively to make future predictions. These models will be applied to the same dataset on which we initially observed an accuracy of 95%. Let's aggregate the outputs from all these models to evaluate the performance of the boosted ensemble.</p>
<pre><code class="lang-python">pred = np.zeros(len(df))
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(models)):    
    pred += models[i].predict(copy.iloc[:,:<span class="hljs-number">-1</span>])

pred
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/out-1.png?w=550" alt /></p>
<p>These values represent the aggregated predictions from all the models. Since each model outputs either a 0 or 1, a value like 2 in the array indicates that 2 models classified the instance as class 1, while a value of 6 means 6 out of the 10 models predicted class 1 for that particular instance.</p>
<p>Based on the number of models used, a threshold is set at half that number. If the aggregate output for any data point exceeds this threshold, it is classified as 1; otherwise, it is classified as 0. In our case, since we used 10 models, any output value greater than 5 is considered class 1, and the rest are classified as class 0.</p>
<pre><code class="lang-python">threshold = len(models)/<span class="hljs-number">2</span>
vec = np.vectorize(<span class="hljs-keyword">lambda</span> x: <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> x&gt;threshold <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>)
final_prediction = vec(pred)
final_prediction
</code></pre>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/out2.png?w=669" alt /></p>
<p>Now, using the output above, we calculate the accuracy.</p>
<pre><code class="lang-python">copy[<span class="hljs-string">'final_prediction'</span>] = final_prediction

print(<span class="hljs-string">'Accuracy ='</span>,accuracy_score(copy.label, copy.final_prediction))
</code></pre>
<pre><code class="lang-bash">Accuracy = 0.9753954305799648
This time we obtained an accuracy of 98%.
</code></pre>
<p>Thus, we can see that we have successfully improved the accuracy using Boosting.</p>
<p><em>Point to note:</em> This entire process is purely for demonstration and conceptual understanding. It is <strong>not recommended</strong> to use this manual implementation for solving real-world problems. For practical purposes, you should use the automated and optimized <code>AdaBoostClassifier</code> provided by the <code>sklearn</code> library.</p>
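<p>For reference, here is a minimal sketch of the library route on the same dataset (the train/test split and <code>n_estimators</code> value are illustrative choices, mirroring the 10 stumps obtained above):</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50)

# AdaBoostClassifier uses decision stumps as weak learners by default
clf = AdaBoostClassifier(n_estimators=10, random_state=50)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
```

<p>Unlike the manual walkthrough, this evaluates on held-out data, which is the fairer measure of the ensemble's performance.</p>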
<p>I hope I was able to explain the algorithm clearly enough for you to understand and experiment with.<br />Your valuable feedback is always appreciated!</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>For more insights, projects, and articles, visit my portfolio at </em><a target="_new" href="https://www.tuhindutta.com"><em>www.tuhindutta.com</em></a><em>.</em></div>
</div>]]></content:encoded></item><item><title><![CDATA[Understanding the Sigmoid Function and Its Applications in Machine Learning]]></title><description><![CDATA[Let’s start with the basics—what exactly is the sigmoid function?
The sigmoid function is a mathematical function commonly used in machine learning, especially in binary classification problems and neural networks. It’s defined by the formula:

When ...]]></description><link>https://statisticallyspeaking.tuhindutta.com/understanding-sigmoid-function-applications-machine-learning</link><guid isPermaLink="true">https://statisticallyspeaking.tuhindutta.com/understanding-sigmoid-function-applications-machine-learning</guid><category><![CDATA[sigmoid function]]></category><category><![CDATA[logistic regression]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[binary classification]]></category><category><![CDATA[Statistical Methods]]></category><dc:creator><![CDATA[Tuhin Kumar Dutta]]></dc:creator><pubDate>Sat, 11 Sep 2021 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746950828123/9f496a26-8dd3-40f2-808d-6bf02ba87f32.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let’s start with the basics—what exactly is the sigmoid function?</p>
<p>The sigmoid function is a mathematical function commonly used in machine learning, especially in binary classification problems and neural networks. It’s defined by the formula:</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/sig.png?w=292" alt class="image--center mx-auto" /></p>
<p>When plotted on a graph, the sigmoid function creates a smooth, S-shaped curve that asymptotically approaches 0 and 1 at the extremes.</p>
<p><a target="_blank" href="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png"><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/sig2.png?w=1024" alt /></a></p>
<p>In the realm of machine learning, the sigmoid function proves to be incredibly useful—particularly in classification tasks where outcomes are binary, with one class represented as 0 and the other as 1.</p>
<p>A prominent example of such an application is <strong>Logistic Regression</strong>, a fundamental algorithm used for binary classification. In this context, the sigmoid function is employed to convert the model's linear output into a probability between 0 and 1, allowing us to interpret the result as the likelihood of belonging to a particular class.</p>
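<p>The conversion is easy to compute directly. A minimal sketch (the weight, intercept, and inputs below are arbitrary illustrations, not fitted values):</p>

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)); squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and intercept for a one-feature linear model
w, b = 2.0, -1.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

probs = sigmoid(w * x + b)
print(probs)  # approx [0.0067, 0.2689, 0.5, 0.9933]
```

<p>Whatever the linear model outputs, the transformed values stay strictly between 0 and 1 and increase monotonically with the raw score, which is what lets us read them as class probabilities.</p>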
<p>To understand this more concretely, let’s walk through a simple example using an arbitrary dataset with one independent (input) variable and one dependent (output) variable. This will help illustrate the role sigmoid plays in logistic regression.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/df-1.png?w=108" alt /></p>
<p>Let’s plot the data to gain a visual understanding.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/pla.png?w=974" alt /></p>
<p>We clearly can't fit a straight line through the data points, as there is no apparent linear relationship.<br />However, for illustrative purposes, let’s assume a weight and intercept to fit an arbitrary linear model.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/arb.png?w=975" alt /></p>
<p>It’s evident that the linear model doesn’t help much in this case.<br />To address this, we apply the sigmoid function to transform the linear equation—essentially replacing \(x\) in the sigmoid expression with the linear combination of features (as defined in our earlier equation).</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/scr.png?w=560" alt /></p>
<p>Now, let’s plot the transformed values against the independent variables to visualize the relationship.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/fit-1.png?w=972" alt /></p>
<p>From the graph above, we observe that the straight line has been transformed into an S-shaped sigmoid curve, with values ranging between 0 and 1.<br />Now, each data point can be projected onto the sigmoid curve, and we can define a threshold above which the predicted class is 1.<br />For this example, let’s set the threshold at 0.23.</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/thr.png?w=977" alt /></p>
<p>For any new data, after applying the sigmoid transformation, if the resulting value exceeds the threshold (0.23), we classify it as belonging to class 1.<br />Note that, in the <strong>Logistic Regression</strong> model from the <strong>scikit-learn</strong> library, the default threshold is 0.5. Any data point with a sigmoid output greater than 0.5 is classified as class 1.</p>
<p>To fit the best sigmoid curve, we need to choose the optimal values for the weights and intercept. This process is similar to linear regression, where gradient descent is used to minimize the loss function. For logistic regression, the loss function is referred to as <strong>Log Loss</strong>, which is defined by the following expression:</p>
<p><img src="https://criticalmind.tech.blog/wp-content/uploads/2021/09/log.png?w=1024" alt /></p>
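<p>The Log Loss expression can be evaluated directly. A minimal sketch (the clipping constant is a common numerical safeguard against \(\log(0)\), not part of the formula itself):</p>

```python
import numpy as np

def log_loss(y_true, p):
    # Log Loss = -(1/m) * sum( y*log(p) + (1-y)*log(1-p) )
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Confident, correct predictions incur a small loss...
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))
# ...while confident, wrong predictions are penalized heavily.
print(log_loss([1, 0], [0.1, 0.9]))
```

<p>Gradient descent on this loss with respect to the weights and intercept yields the best-fitting sigmoid curve, just as minimizing squared error does for linear regression.</p>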
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>For more insights, projects, and articles, visit my portfolio at </em><a target="_new" href="https://www.tuhindutta.com"><em>www.tuhindutta.com</em></a><em>.</em></div>
</div>]]></content:encoded></item></channel></rss>