The Pure Theory Behind Support Vector Machines

I’m extremely agitated today. I dunno why. Maybe because there was some convulsion in the peaceful tidings of the house I live in, or the fact that I’m kinda hungry at the moment. Anyways, I don’t have time for chitchat. Let’s get to the studying.

The following is taken from Foundations of Machine Learning by Mohri, Rostamizadeh, and Talwalkar.

Support Vector Machines are among the most theoretically well-motivated and practically effective classification algorithms in modern machine learning.

Consider an input space $\mathcal{X}$ that is a subset of $\mathbb{R}^N$ with $N \geq 1$, and the output or target space $\mathcal{Y}=\{-1, +1\}$, and let $f : \mathcal{X} \rightarrow \mathcal{Y}$ be the target function. Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the binary classification task is formulated as follows:

The learner receives a training sample $S$ of size $m$ drawn independently and identically distributed (i.i.d.) from $\mathcal{X}$ according to some unknown distribution $\mathcal{D}$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X}\times\mathcal{Y})^m$, with $y_i = f(x_i)$ for all $i \in [m]$. The problem consists of determining a hypothesis $h \in \mathcal{H}$, a binary classifier, with small generalization error, that is, a small probability that $h$ disagrees with the target function $f$ on a point drawn from $\mathcal{D}$:

$R_{\mathcal{D}}(h) = \underset{x\sim\mathcal{D}}{\mathbb{P}} [h(x) \neq f(x)].$

Different hypothesis sets $\mathcal{H}$ can be selected for this task. Hypothesis sets with smaller complexity provide better learning guarantees, everything else being equal. A natural hypothesis set with relatively small complexity is that of linear classifiers, or hyperplanes, which can be defined as follows:

$\mathcal{H}= \{x \rightarrow sign(w.x+b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}$

The learning problem is then referred to as a linear classification problem. The general equation of a hyperplane in $\mathbb{R}^N$ is $w.x+b=0$, where $w\in\mathbb{R}^N$ is a non-zero vector normal to the hyperplane and $b\in\mathbb{R}$ is a scalar. A hypothesis of the form $x\rightarrow sign(w.x+b)$ thus labels positively all points falling on one side of the hyperplane $w.x+b=0$ and negatively all others.
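Here is a minimal sketch of such a hypothesis in plain Python. The function names `dot` and `predict` and the example hyperplane are my own illustration, not from the book, and labeling points that sit exactly on the hyperplane as +1 is an arbitrary convention I picked.

```python
def dot(u, v):
    """Inner product of two equal-length sequences of numbers."""
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(w, b, x):
    """Label x with sign(w.x + b).

    Assumption: points exactly on the hyperplane (w.x + b == 0) get +1.
    """
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical hyperplane in R^2: the line where the coordinates sum to 1.
w, b = [1.0, 1.0], -1.0
print(predict(w, b, [2.0, 2.0]))  # 1, the point is on the positive side
print(predict(w, b, [0.0, 0.0]))  # -1, the point is on the negative side
```

Everything the classifier knows is packed into the pair $(w, b)$, which is why the rest of the post is about how to choose that pair.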

From now until we say so, we’ll assume that the training sample $S$ can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training samples into two populations of positively and negatively labeled points, as illustrated by the left panel of the figure below. This is equivalent to the existence of $(\boldsymbol{w}, b) \in (\mathbb{R}^N \setminus \{\boldsymbol{0}\}) \times \mathbb{R}$ such that:

$\forall i \in [m], \quad y_i(\boldsymbol{w}.x_i + b) \geq 0$

But, as you can see in the figure, there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The definition of the SVM solution is based on the notion of geometric margin.

Let’s define the notion we just introduced: the geometric margin $\rho_h(x)$ of a linear classifier $h: x \rightarrow \boldsymbol{w.x} + b$ at a point $x$ is its Euclidean distance to the hyperplane $\boldsymbol{w.x}+b=0$:

$\rho_h(x) = \frac{|w.x+b|}{||w||_2}$

The geometric margin $\rho_h$ of a linear classifier $h$ for a sample $S = (x_1, \ldots, x_m)$ is the minimum geometric margin over the points in the sample, $\rho_h = \min_{i\in[m]} \rho_h(x_i)$, that is, the distance of the hyperplane defining $h$ to the closest sample points.
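The two definitions above can be sketched in a few lines of Python. The helper names and the sample points are hypothetical, chosen so that $||w||_2 = 5$ and the arithmetic is easy to check by hand.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences of numbers."""
    return sum(ui * vi for ui, vi in zip(u, v))


def geometric_margin(w, b, x):
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def sample_margin(w, b, xs):
    """Geometric margin of the classifier on a sample: the minimum over its points."""
    return min(geometric_margin(w, b, x) for x in xs)


# Hypothetical hyperplane and sample; ||w|| = 5 keeps the numbers round.
w, b = [3.0, 4.0], -5.0
xs = [[3.0, 4.0], [0.0, 0.0], [1.0, 3.0]]
print([geometric_margin(w, b, x) for x in xs])  # [4.0, 1.0, 2.0]
print(sample_margin(w, b, xs))                  # 1.0, set by the closest point
```

Note that only the closest point determines the sample margin, which already hints at why a few points end up mattering more than the rest.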

So what is the solution? The SVM solution is the separating hyperplane with the maximum geometric margin, known as the maximum-margin hyperplane. The right panel of the figure above illustrates the maximum-margin hyperplane returned by the SVM algorithm in the separable case. We will present later in this chapter a theory that provides a strong justification for the solution. We can observe already, however, that the SVM solution can also be viewed as the safest choice in the following sense: a test point is classified correctly by a separating hyperplane with geometric margin $\rho$ even when it falls within a distance $\rho$ of the training samples sharing the same label; for the SVM solution, $\rho$ is the maximum geometric margin and thus the safest value.

We now derive the equations and optimization problem that define the SVM solution. By definition of the geometric margin, the maximum margin $\rho$ of a separating hyperplane is given by:

$\rho = \underset{w,b : y_i(w.x_i+b) \geq 0}{\max}\ \underset{i\in[m]}{\min}\frac{|w.x_i+b|}{||w||} = \underset{w,b}{\max}\ \underset{i\in[m]}{\min}\frac{y_i(w.x_i+b)}{||w||}$

The second equality follows from the fact that, since the sample is linearly separable, for the maximizing pair $(w, b)$, $y_i(w.x_i+b)$ must be non-negative for all $i\in[m]$. Now, observe that the last expression is invariant to multiplication of $(w, b)$ by a positive scalar. Thus, we can restrict ourselves to pairs $(\boldsymbol{w},b)$ scaled such that $\min_{i\in[m]} y_i(\boldsymbol{w}.x_i+b) = 1$:

$\rho = \underset{\min_{i\in[m]}y_i(w.x_i+b)=1}{\max}\frac{1}{||w||} = \underset{\forall i \in[m],\ y_i(w.x_i+b) \geq 1}{\max}\frac{1}{||w||}$

The figure below illustrates the solution $(w, b)$ of the maximization we just formalized. In addition to the maximum-margin hyperplane, it also shows the marginal hyperplanes, which are the hyperplanes parallel to the separating hyperplane and passing through the closest points on the negative or positive sides.
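The rescaling argument is easy to check numerically. The sketch below, with hypothetical helper names and a made-up separable sample, rescales a separating pair $(w, b)$ so that the smallest value of $y_i(w.x_i+b)$ is 1, then confirms that the geometric margin did not move and now equals $1/||w||$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences of numbers."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margins(w, b, xs, ys):
    """The values y_i (w.x_i + b) for every point of the sample."""
    return [y * (dot(w, x) + b) for x, y in zip(xs, ys)]


def canonicalize(w, b, xs, ys):
    """Rescale (w, b) by a positive scalar so that min_i y_i (w.x_i + b) == 1.

    Assumption: (w, b) strictly separates the sample, otherwise we raise.
    """
    smallest = min(functional_margins(w, b, xs, ys))
    if smallest <= 0:
        raise ValueError("(w, b) does not strictly separate the sample")
    return [wi / smallest for wi in w], b / smallest


# Hypothetical separable sample in R^2 and a separating pair (w, b).
xs = [[2.0, 2.0], [4.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
ys = [1, 1, -1, -1]
w, b = [2.0, 2.0], -4.0

margin_before = min(functional_margins(w, b, xs, ys)) / math.sqrt(dot(w, w))
w_c, b_c = canonicalize(w, b, xs, ys)
margin_after = 1.0 / math.sqrt(dot(w_c, w_c))

print(w_c, b_c)  # [0.5, 0.5] -1.0
print(abs(margin_before - margin_after) < 1e-12)  # True
```

The hyperplane itself is the same before and after; only its description $(w, b)$ changed, which is exactly the freedom the derivation uses.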

Since maximizing $1/||w||$ is equivalent to minimizing $\frac{1}{2}||w||^2$, in view of the equation above, the pair $(\boldsymbol{w}, b)$ returned by the SVM in the separable case is the solution of the following convex optimization problem:

$\underset{w, b}{\min}\ \frac{1}{2}||w||^2 \quad \text{subject to: } y_i(\boldsymbol{w}.x_i+b) \geq 1, \ \forall i \in[m]$

Since the objective function is quadratic and the constraints are affine (each one is linear in $(\boldsymbol{w}, b)$ up to a constant), the optimization problem above is in fact a specific instance of quadratic programming (QP), a family of problems extensively studied in optimization. A variety of commercial and open-source solvers are available for solving convex QP problems. Additionally, motivated by the empirical success of SVMs along with their rich theoretical underpinnings, specialized methods have been developed to solve this particular convex QP problem more efficiently, notably block coordinate descent algorithms with blocks of just two coordinates.
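To make the optimization problem concrete, here is a small worked example on a hypothetical two-point sample in $\mathbb{R}^2$ (my own toy numbers, not from the book): $x_1 = (1, 1)$ with $y_1 = +1$ and $x_2 = (-1, -1)$ with $y_2 = -1$. Writing $\boldsymbol{w} = (w_1, w_2)$, the two constraints $y_i(w.x_i+b) \geq 1$ read:

$w_1 + w_2 + b \geq 1, \qquad w_1 + w_2 - b \geq 1$

Adding them gives $w_1 + w_2 \geq 1$, and the smallest value of $\frac{1}{2}||w||^2$ under that condition is attained at $w_1 = w_2 = \frac{1}{2}$. With $w_1 + w_2 = 1$, the two constraints become $b \geq 0$ and $b \leq 0$, so:

$\boldsymbol{w} = (\tfrac{1}{2}, \tfrac{1}{2}), \quad b = 0, \quad \rho = \frac{1}{||w||} = \sqrt{2}$

As a sanity check, this agrees with the geometric margin of $x_1$ computed directly: $|\frac{1}{2} + \frac{1}{2}| / \frac{1}{\sqrt{2}} = \sqrt{2}$.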

So what are support vectors? Looking at the optimization problem above, we note that the constraints are affine and thus qualified. The objective function as well as the affine constraints are convex and differentiable. Thus, the KKT conditions apply at the optimum.

We introduce Lagrange variables $\alpha_i \geq 0, i\in[m]$, associated with the $m$ constraints and denote by $\boldsymbol{\alpha}$ the vector $(\alpha_1, \ldots, \alpha_m)^T$. The Lagrangian can then be defined for all $\boldsymbol{w}\in\mathbb{R}^N,b\in\mathbb{R}$, and $\boldsymbol{\alpha}\in\mathbb{R}_+^m$ by:

$\mathcal{L}(\boldsymbol{w},b,\boldsymbol{\alpha}) = \frac{1}{2}||w||^2 - \sum_{i = 1}^{m}\alpha_i[y_i(w.x_i+b) -1]$
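The step between this Lagrangian and the notion of support vectors can be sketched as follows. For a convex, differentiable problem with affine constraints like this one, the standard KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables $\boldsymbol{w}$ and $b$ to zero and adding the complementarity conditions:

$\nabla_{\boldsymbol{w}}\mathcal{L} = \boldsymbol{w} - \sum_{i=1}^{m}\alpha_i y_i x_i = 0 \quad \Longrightarrow \quad \boldsymbol{w} = \sum_{i=1}^{m}\alpha_i y_i x_i$

$\nabla_{b}\mathcal{L} = -\sum_{i=1}^{m}\alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^{m}\alpha_i y_i = 0$

$\forall i \in [m], \quad \alpha_i[y_i(w.x_i+b) - 1] = 0 \quad \Longrightarrow \quad \alpha_i = 0 \ \vee \ y_i(w.x_i+b) = 1$

So the weight vector $\boldsymbol{w}$ is a linear combination of the training vectors, and a vector $x_i$ appears in that combination only if $\alpha_i \neq 0$, in which case $y_i(w.x_i+b) = 1$ and $x_i$ sits exactly on one of the marginal hyperplanes $w.x+b = \pm 1$.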

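The training points $x_i$ whose Lagrange variable $\alpha_i$ is non-zero are called support vectors; by the KKT complementarity condition $\alpha_i[y_i(\boldsymbol{w}.x_i+b)-1]=0$, they lie on the marginal hyperplanes $\boldsymbol{w}.x+b=\pm 1$. Support vectors fully define the maximum-margin hyperplane or SVM solution, which justifies the name of the algorithm. By definition, vectors not lying on the marginal hyperplanes do not affect the definition of these hyperplanes: in their absence, the solution to the SVM problem remains unchanged. Note that while the solution $\boldsymbol{w}$ of the SVM problem is unique, the support vectors are not. In dimension $N$, $N+1$ points are sufficient to define a hyperplane. Thus, when more than $N+1$ points lie on a marginal hyperplane, different choices are possible for the $N+1$ support vectors.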

But the points in the space are not always separable. In most practical settings, the training data is not linearly separable, which implies that for any hyperplane $\boldsymbol{w.x}+b=0$, there exists $x_i \in S$ such that:

$y_i[\boldsymbol{w.x_i}+b] \ngeq 1$

Thus, the constraints imposed in the linearly separable case cannot all hold simultaneously. However, a relaxed version of these constraints can indeed hold, that is, for each $i\in[m]$, there exists $\xi_i \geq 0$ such that:

$y_i[\boldsymbol{w.x_i}+b] \geq 1-\xi_i$

The variables $\xi_i$ are known as slack variables and are commonly used in optimization to define relaxed versions of constraints. Here, a slack variable $\xi_i$ measures the distance by which vector $x_i$ violates the desired inequality, $y_i(\boldsymbol{w.x_i} + b) \geq 1$. This figure illustrates this situation: for a hyperplane $y_i(w.x_i+b) = 1$, a vector $x_i$ with $\xi_i > 0$ can be viewed as an outlier. Each $x_i$ must be positioned on the correct side of the appropriate marginal hyperplane to not be considered an outlier. Here’s the formula we use to optimize the non-separable case, where $C \geq 0$ sets the trade-off between margin maximization and the total slack penalty, and $p \geq 1$:

$\underset{w, b, \xi}{\min}\ \frac{1}{2}||w||^2 + C\sum_{i=1}^{m}\xi_i^p \quad \text{subject to} \quad y_i(w.x_i+b) \geq 1-\xi_i \wedge \xi_i \geq 0, \ i\in[m]$
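To see what the slack variables look like on actual numbers, here is a small sketch. It does not solve the optimization problem. It only evaluates the objective for a fixed pair $(w, b)$, using the fact that for fixed $(w, b)$ the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(w.x_i+b))$. The function names, the sample, and the default values of `C` and `p` are my own assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences of numbers."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest xi_i >= 0 with y_i (w.x_i + b) >= 1 - xi_i, for a fixed (w, b)."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, C=1.0, p=1):
    """Value of (1/2)||w||^2 + C * sum_i xi_i^p at the smallest feasible slacks."""
    return 0.5 * dot(w, w) + C * sum(xi ** p for xi in slacks(w, b, xs, ys))


# Hypothetical non-separable sample: the last point is on the wrong side.
xs = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]]
ys = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

print(slacks(w, b, xs, ys))                      # [0.0, 0.5, 0.0, 2.0]
print(soft_margin_objective(w, b, xs, ys))       # 3.0
print(soft_margin_objective(w, b, xs, ys, p=2))  # 4.75
```

The second point is correctly classified but sits inside the margin, so it gets a slack of 0.5, while the misclassified fourth point gets a slack larger than 1. Raising `C` makes those violations more expensive relative to the margin term.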

Okay! Alright! I think I understand it now. That’s enough classification for today. I’m going to study something FUN next. Although I’m a bit drowsy… No matter! I have some energy drinks at home. Plus I have some methamphetamine which I have acquired to boost my energy… Nah, kidding. I’m a cocaine man!