## Machine Learning Note: Support Vector Machine (1)

18 November 2019

### 1. Intuition of SVM

Assume that we have some data points of two different classes which are linear separable. In the case of two dimensional data, we can use a line $y = w^Tx + b$ to separate it. If the data are high dimensional, we can use a hyperplane to separate it.

For this linear classification problem, different algorithms will find different optimal solutions.

For Support Vector Machine, the goal is to find a decision boundary (line/hyperplane) such that the distance from the decision boundary to each support vectors are equal.

The SVM is also called Max-Margin Classfier, because its objective is to maximize the distance between support vectors and decision boundary.

Let $x_1, x_2$ be the support vectors of their corresponding class, their distance to the boundary $w^Tx+b=0$ is $d_1$, $d_2$.

$w$ is a vector which is perpendicular to the decision line. The support vectors are the data point in a class who are closest to the boundary. It is obvious that we have:

The 1 and -1 is the label of the class, the vanilla SVM handles the binary classification problem so the class label is either 1 or -1.

Now we do some transformation on the above equation sets:

From the figure illustration, we can see that this is equals to the distance between two support vector, i.e.:

### 2. Mathematical Model of SVM

#### 2.1 The original SVM and its dual problem

The goal of SVM is to maximize the margin, i.e. maximize $d_1+d_2 = \frac{2}{|w|_2}$. This can be converted to a minimization problem where we want to minimize the demoninator$|w|_2$. Thus, the Support Vector Machine is formulated as:

The constraint requires that all data points should be classify to their corresponding classes.

Now we will try to solve this optimization problem using Lagrangian. We fisrt rewrite constraint as:

Based on Lagrangian, the problem can be formulated as:

The object function becomes:

This objective is not easy to solve as we are dealing with two parameters and $\alpha_i$ are inequality constraint. So we consider its dual form:

According to the weak duality, $d^*\le p^*$. Also, once the support vector is given, any data point above/below the support vector is a strictly feasible solution. According to the Slater’s condition, the strong duality holds, i.e. $p^*=d^*$. Hence, solving the dual problem leads to the same optimal solution of the original problem.

#### 2.2 Solving the dual problem

We first solve the inner minimization problem. Based on the fact that the derivative of $\mathcal{L}$ w.r.t. to $w$ and $b$ is 0 at the optimal point. We start by calculating the derivative of $\mathcal{L}$ w.r.t. $w$ and $b$.

Since $\nabla_w\mathcal{L}(w,b, \alpha) = 0$, we have:

The derivative w.r.t. to $b$ is:

By substituting $w$ to the Lagrangian we have:

The deduction is a little bit complicated, click below to see step-by-step deduction.

## Step-by-step Deduction

$$ \begin{align} \mathcal{L}(w, b, \alpha) & = \frac{1}{2}\|w\|^2 - \sum^n_{i=1}\alpha_i [y_i(w^Tx_i + b) - 1] \\ & = \frac{1}{2}w^Tw - \sum^n_{i=1}\alpha_iy_iw^Tx_i - \sum^n_{i=1}\alpha_i y_i b + \sum^n_{i=1}\alpha_i \\ & = \frac{1}{2}w^T\sum^n_{i=1}\alpha_i y_i x_i - w^T\sum^n_{i=1}\alpha_i y_i x_i - \sum^n_{i=1}\alpha_i y_i b + \sum^n_{i=1}\alpha_i \\ & = -\frac{1}{2}w^T\sum^n_{i=1}\alpha_i y_i x_i - \sum^n_{i=1}\alpha_i y_i b + \sum^n_{i=1}\alpha_i \\ & = -\frac{1}{2}\sum^n_{i=1}(\alpha_i y_i x_i)^T\sum^n_{i=1}\alpha_i y_i x_i - \sum^n_{i=1}\alpha_i y_i b + \sum^n_{i=1}\alpha_i \\ & = -\frac{1}{2}\sum^n_{i=1}\alpha_i y_i x_i^T\sum^n_{i=1}\alpha_i y_i x_i - \sum^n_{i=1}\alpha_i y_i b + \sum^n_{i=1}\alpha_i \\ & = -\frac{1}{2}\sum^n_{i, j=1}\alpha_i y_i x_i^T\alpha_j y_j x_j - \sum^n_{i=1}\alpha_i y_i b + \sum^n_{i=1}\alpha_i \\ & = \sum^n_{i=1}\alpha_i - \frac{1}{2}\sum^n_{i, j=1}\alpha_i \alpha_j y_i y_j x_i^T x_j \end{align} $$

Now the Lagrangian only includes a single parameters $\alpha$. Solving the optimal $\alpha$ reveals the optimal $w$ and $b$. Thus, the optimization problem becomes:

Once we get the optimal $\alpha^*$, we can get $w^*$ by the fact that $w = \sum^n_{i=1}\alpha_i y_i x_i$.

Then we can get $b^*$ by

The Lagrange multiplier $\alpha$ can be solved by SMO algorithm.

#### 2.3 Classification with SVM

Once we solved the optimization problem and obtained the SVM parameters, we can use it to classify new data point. Given a data point $x$, the decision is made by the sign of the following:

where $<a,b>$ is the inner product between vector $a$ and vector $b$.

One may think that this computation is expensive, as given a new data point, we need to calculate the inner product between $x$ and every training data. If we have 1 billions of training data, this process will take a while to finish. However, the design of SVM actually utilizes a very beautiful property: remember that the strong duality holds for SVM? This implies that KKT condition also holds as it is a necessary condition for strong duality.

The KKT condition includes **Complementary Slackness** ($\alpha^*g_i(w^*)=0$) which says:

We can see that, when $g_i(w) = -y_i(w^Tx_i + b) + 1 \le 0$, the Lagrange multiplier $\alpha=0$. This means that only the **support vectors** has the actual influence in the final decision process. Thus, we only need to calculate the inner products between new data point and support vectors.

### 3. SVM With Slack Variable

#### 3.1 Outlier

In the real-world application scenario, sometimes we have some noisy data point that locates in an unusual position which we call them **outliers**.

Solving the original problem for all data point including these outliers may have negative influence to the learned model. In the figure above, there is a blue point that locates in an unusual location, and the original SVM will shift its decision boundary to the left because of this outlier.

A worst case is that if this blue point is located in the center of white points, then we will not be able to find a hyperplane to separate these two classes.

#### 3.2 Slack variable

To handle this problem, we will modify the original problem such that we allow the data point can be off the center within a certain distance.

The dual problem:

We can see that for the dual problem, the only difference is that the constraint for $\alpha_i$ now has an upper bound $C$.

Also the Complementary Slackness in the KKT condition becomes:

#### 3.2.1 Extension: Hinge Loss and SVM

Consider a Support Vector Machine with slack variables, it’s constraints is:

Which can be summarized as:

Hinge loss is wildly used in neural networks as it can be optimized by gradient descent. Optimizing Hinge loss has the same effect as SVM.

- Hinge Loss is convex.
- The gradient when $x<0$ is small, so it has fewer penalty to wrong classification.
- The loss is 0 when $x\ge 1$, i.e. if a classifiction is correct with enough confidence, then it will stop optimizing for this data.
- Not differentiable when $x=0$, need to do piecewise derivation.