SVM Introduction
Support vector machines (SVMs for short) are a binary classification machine learning algorithm. An SVM tries to find a hyperplane in $n$-dimensional feature space that separates the two classes from each other. The hyperplane should maximize the margin between both classes; this margin maximization is what gives SVMs their good generalization compared to many similar classification algorithms and makes them a popular choice for a wide range of use cases.
SVMs use a linear function $f(x) = w \cdot x + b$ that should separate the two classes ($-1$ and $1$). With this we can formulate two conditions the function should fulfill:
- $f(x_i) \leq -1$ for $y_i = -1$ (first class)
- $f(x_i) \geq 1$ for $y_i = 1$ (second class)
We can combine both conditions into $y_i (w \cdot x_i + b) \geq 1$. This formulation is called the Primal Form.
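To make the decision function and the primal constraint concrete, here is a minimal sketch; the weights, bias and sample values are made up purely for illustration:

```python
import numpy as np

# Hypothetical weights and bias of an already trained SVM.
w = np.array([0.5, -1.0])
b = 0.25

# One sample per class, labeled -1 and 1.
X = np.array([[-2.0, 1.0], [3.0, -1.5]])
y = np.array([-1, 1])

f = X @ w + b      # decision function f(x) = w*x + b for every sample
print(f)           # the sign of f(x) is the predicted class
print(y * f >= 1)  # primal constraint y_i * (w*x_i + b) >= 1 holds for both samples
```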
Mathematical Concepts
There are multiple different types of support vector machine implementations, with varying degrees of complexity. Because this is an introduction, we will implement a hard-margin linear support vector machine in its primal form, a simple type of SVM.
Hard- & Soft-Margin SVMs
In most cases classes are not linearly separable, which is a problem for hard-margin SVMs. And even if the data is linearly separable, a model that allows for some errors during training would achieve better generalization, as demonstrated in the example below.
{:.image-caption} In this case a single red outlier determines the decision boundary for the whole model. The model would perform much better if it ignored this data-point during training.
Loss
We need to figure out how to formulate our loss function $J$. If a data-point is correctly classified, $J$ should be $0$; if the point is incorrectly classified, the loss should grow with its distance from the decision boundary $f(x)=0$. We can write this in the form $J = 1 - y_i f(x_i)$.
We need to keep in mind that the goal of SVMs is not only to find an arbitrary function that separates the two classes, but the one that maximizes $\frac{2}{||w||}$ (the width of the margin between boundary and points).
We want to maximize $\frac{2}{||w||}$, but $J$ gets minimized, so we need to minimize $||w||$ in order to get our desired maximization. The resulting cost function $J$ is a combination of both terms: $J=||w||^2 + \max(0, 1 - y_i f(x_i))$.
Hyperparameter $\lambda$
We can now introduce the hyperparameter $\lambda$, a scalar weight on the regularization term $||w||^2$. $\lambda$ controls how much importance is given to the width of the margin during training. Applied to our loss function this gives $J=\lambda ||w||^2 + \max(0, 1 - y_i f(x_i))$.
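As a quick sanity check, the regularized hinge loss can be written down directly in numpy. This is only a sketch of the formula above; the values for $w$, $b$ and $\lambda$ are made up:

```python
import numpy as np

def loss(w, b, x_i, y_i, lam):
    """Regularized hinge loss for one sample: lam * ||w||^2 + max(0, 1 - y_i * f(x_i))."""
    f = np.dot(w, x_i) + b
    return lam * np.dot(w, w) + max(0.0, 1.0 - y_i * f)

w = np.array([0.5, -1.0])
print(loss(w, b=0.25, x_i=np.array([3.0, -1.5]), y_i=1, lam=0.01))   # correct side: only the regularization term remains
print(loss(w, b=0.25, x_i=np.array([3.0, -1.5]), y_i=-1, lam=0.01))  # wrong side: the hinge term dominates
```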
Gradient
We need the partial derivatives of $J$ with respect to $w$ and $b$ for the cases $y_i f(x_i) \geq 1$ (correct classification) and $y_i f(x_i) < 1$ (incorrect classification).
Case $y_i f(x_i) < 1$:
$J=\lambda||w||^2 + 1 - y_i (w \cdot x_i + b)$
- $\frac{\partial J}{\partial w} = 2 \lambda w - y_i x_i$
- $\frac{\partial J}{\partial b} = -y_i$
Case $y_i f(x_i) \geq 1$:
$J=\lambda||w||^2$
- $\frac{\partial J}{\partial w} = 2 \lambda w$
- $\frac{\partial J}{\partial b} = 0$
Optimization
Now we can use these gradients to perform gradient descent and minimize $J$.
We define $\alpha$ to be our learning rate:
- $w'= w - \alpha dw$
- $b'= b - \alpha db$
We need to choose the derivatives $dw$ and $db$ based on the condition $y_i f(x_i) \geq 1$.
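A single update step could then look like the following sketch; the function name and variable names are my own, not taken from any library:

```python
import numpy as np

def sgd_step(w, b, x_i, y_i, lam, lr):
    """One gradient descent update for a single sample, picking dw and db based on y_i * f(x_i) >= 1."""
    if y_i * (np.dot(w, x_i) + b) >= 1:
        dw = 2 * lam * w               # only the regularization term contributes
        db = 0.0
    else:
        dw = 2 * lam * w - y_i * x_i   # hinge term is active
        db = -y_i
    return w - lr * dw, b - lr * db
```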
Implementation
We will implement the algorithm in a very simple way, so we can focus on the principles.
We are going to use Python and the math library `numpy`.
Note
In combination with `numpy` arrays the `@` operator represents matrix multiplication, so the expressions `np.dot(x, y)` and `x @ y` are equivalent for 1- and 2-dimensional arrays.
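A stripped-down implementation could look like the sketch below. The class name `LinearSVM` and its parameters (`learning_rate`, `lam`, `epochs`) are my own choices for illustration, not a fixed API:

```python
import numpy as np

class LinearSVM:
    """Minimal linear SVM trained with (sub)gradient descent on the primal objective."""

    def __init__(self, learning_rate=0.001, lam=0.01, epochs=1000):
        self.lr = learning_rate
        self.lam = lam
        self.epochs = epochs
        self.w = None
        self.b = 0.0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (x_i @ self.w + self.b) >= 1:
                    # correctly classified outside the margin: only shrink w
                    self.w -= self.lr * (2 * self.lam * self.w)
                else:
                    # margin violation: the hinge term contributes to the gradient
                    self.w -= self.lr * (2 * self.lam * self.w - y_i * x_i)
                    self.b -= self.lr * (-y_i)

    def predict(self, X):
        return np.sign(X @ self.w + self.b).astype(int)
```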
Trying it out
We are going to use the SciKit-Learn framework, because it provides some useful utility functions.
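The experiment could be set up roughly as follows, reusing the `LinearSVM` sketch from above. The `make_blobs` parameters are assumptions on my part, so the exact labels and predictions will differ from run to run:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Generate a random (and hopefully linearly separable) 2-class dataset.
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)
y = np.where(y == 0, -1, 1)   # relabel the classes from {0, 1} to {-1, 1}

clf = LinearSVM()             # the sketch class from the Implementation section
clf.fit(X, y)
pred = clf.predict(X)

print("Labels:", y)
print("Predictions:", pred)
print("Score:", np.mean(pred == y))
```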
Labels: [-1 1 1 1 -1 1 1 -1 -1 -1 1 -1 1 -1 -1 1 -1 1 1 -1 1 1 -1 -1
-1 1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 1 1 -1 1 -1 1 1 1 1 1 -1
-1 1 -1 1 -1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 -1 1 1
1 -1 -1 1 1 -1 1 1 -1 -1 -1 -1 1 -1 -1 -1 1 1 1 -1 -1 1 -1 -1
1 -1 1 1]
Predictions: [-1 1 1 1 -1 1 1 -1 -1 -1 1 -1 1 -1 -1 1 -1 1
1 -1 1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 -1 1 -1 -1
1 1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 1 -1 1
-1 -1 -1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 -1 1 1
1 -1 -1 1 1 -1 1 1 -1 -1 -1 -1 1 -1 -1 -1 1 1
1 -1 -1 1 -1 -1 1 -1 1 1]
Score: 1.0
As we can see, the model classifies all data-points correctly. Keep in mind that not all randomly generated datasets are linearly separable, so you may not get perfect results.
Visualizing
We can also visualize all data-points and draw the corresponding decision boundary to get a feel for how our model performs. The plot shown below confirms our results of the chapter above.
Both linearly separable classes are perfectly separated by the decision boundary shown here ($f(x)=0$).
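One possible way to produce such a plot, continuing from the training sketch above; the plotting details are illustrative and not part of the algorithm:

```python
import matplotlib.pyplot as plt
import numpy as np

# Scatter both classes and draw the decision boundary w[0]*x0 + w[1]*x1 + b = 0.
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr")
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x1 = -(clf.w[0] * x0 + clf.b) / clf.w[1]   # solve f(x) = 0 for the second coordinate
plt.plot(x0, x1, "k-")
plt.show()
```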
Kernel Methods
So what about datasets with non-linearly separable classes?
Suppose we work with a circle dataset (`make_circles()`), the worst case for this algorithm.
The linear SVM reaches an accuracy of $\sim 0.5$, so the algorithm is no better than random guessing.
We can augment the dataset with an additional feature; let's take the dot product of each data-point $x$ with itself. The dot product $x \cdot x$ returns a single number that we can use as our additional feature. If we plot all $3$ features, we see that the data-points are perfectly linearly separable with respect to the new feature ($z$-axis).
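A quick sketch of this augmentation, again using the `LinearSVM` class from the Implementation section; the dataset parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)
y = np.where(y == 0, -1, 1)

# New feature: the dot product of each data-point with itself (its squared norm).
z = np.sum(X * X, axis=1)
X_aug = np.column_stack([X, z])   # inner-circle points get a small z, outer-circle points a large z

clf = LinearSVM()
clf.fit(X_aug, y)
print(np.mean(clf.predict(X_aug) == y))   # accuracy should now be close to 1
```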
But in most cases taking just the dot product doesn't yield an improvement right away. So what about squaring the data-point, $x^2$? Suppose we have $2$ features per data-point; $[x_1, x_2]^2$ generates a whole set of new features ($x_1^2$, $x_1 x_2$, $x_2^2$), and as you can imagine the number of additional features grows quickly with the size of $x$. Augmenting a dataset into an extremely high-dimensional space is not feasible because of memory and computational limitations. Luckily there is a workaround for this problem, the so-called Kernel Trick.
Suppose we use a function $k(x, x')$ that takes two vectors and returns a scalar, $k: (R^n, R^n) \rightarrow R$. The function $\phi(x)$ maps low-dimensional data into the high-dimensional feature space. The kernel function then simply returns the dot product of the feature-space vectors, $k(x, y)= \phi (x) \cdot \phi (y)$, without ever computing $\phi(x)$ explicitly. Let's rewrite our classification and loss functions so that they utilize the feature space; a small worked example follows below.
$w$ must have the same dimensions as the feature space
- $f(x)=w \cdot \phi (x)+b$
- $J=||w||^2 + \max(0, 1-y_i(w \cdot \phi (x_i) + b))$
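A classic worked example of the Kernel Trick, sketched below: for two $2$-dimensional points the polynomial kernel $k(x, y) = (x \cdot y)^2$ returns the same value as explicitly mapping both points with $\phi([x_1, x_2]) = [x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]$ and taking the dot product in feature space, without ever building the feature vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel (2D input -> 3D feature space)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Polynomial kernel: equals phi(x) . phi(y) without ever constructing phi."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)), k(x, y))   # the same value, up to floating point rounding
```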
Lagrange Dual Formulation
In order to utilize kernel functions we need to use the dual form of the SVM algorithm. With the dual form it is not necessary to train $w$ or $b$ directly; instead we only need to optimize the Lagrange multipliers $\alpha_i$ (not to be confused with the learning rate $\alpha$ from before).
$f(x)= \sum^N_i \alpha_i y_i k(x_i,x) + b$
Conditions
- $0 \leq \alpha_i \leq C\ \forall i$
- $\sum_i \alpha_i y_i=0$
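Once the multipliers are trained, the dual decision function can be evaluated as in this sketch; the RBF kernel and all variable names are illustrative assumptions, not part of a specific library:

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """Radial basis function kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def decision_function(x, X_train, y_train, alphas, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b; only support vectors have alpha_i > 0."""
    return sum(a * y_i * kernel(x_i, x) for a, y_i, x_i in zip(alphas, y_train, X_train)) + b
```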
LIBSVM
Usually we don't want to implement the SVM algorithm on our own, because there are many good SVM/SVR libraries available. We already used the package SciKit-Learn in the chapter Implementation; it provides a convenient Python interface to LIBSVM.
Caution
You need to make sure that all input data to your SVM model is scaled properly.
import sklearn.preprocessing

scaler = sklearn.preprocessing.StandardScaler()
X = scaler.fit_transform(X)  # scale each feature to zero mean and unit variance
SciKit-Learn provides the Python classes `sklearn.svm.SVC` and `sklearn.svm.SVR`. If you are working on a regression problem you should use the latter.
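A typical usage of `sklearn.svm.SVC` could look like the following sketch; the dataset and parameter values are only an example:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the inputs as noted above, fitting the scaler on the training data only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = SVC(C=1.0, kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```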
Important Hyperparameters:
- $C$: penalty for classification violations during training for soft-margin SVMs
- `kernel`: the kernel type $\phi$ used for generating the feature space
- `learning_rate`: optimization step size
- `gamma`: (rbf, poly, sigmoid) kernel coefficient