Saturday, August 1, 2015

Machine Learning and Statistics: a unified perspective | 3 Support Vector Machine

In the previous article, we showed that generalized linear models (GLM) can predict not only continuous variables but also categorical variables. Prediction of a categorical variable is usually referred to as classification. In this article we are going to look into a different method of classification: the support vector machine (SVM).

Primal form

Methods in GLM are all based on maximizing likelihood. SVM, however, tackles the problem from a different perspective: maximizing the margin. Let’s start with the simplest case: a linearly separable classification problem. In this case, the margin is defined as the distance from the closest point to the separating plane, and this is what SVM tries to maximize. First, let’s put the distance from a point to a plane into mathematical form:
A plane can be written as \(\theta_{+1}^T\mathbf{x}_{+1} = 0\) in the augmented notation, with the constant absorbed into the vectors. If we separate the constant part out, we have \(\theta^T\mathbf{x} + \theta_0 = 0\). Given any two points \(\mathbf{x}\) and \(\mathbf{x'}\) on the plane, we can see that \(\theta^T(\mathbf{x - x'}) = 0\), and thus the vector \(\theta\) is orthogonal to the plane \(\theta^T\mathbf{x} + \theta_0 = 0\). Given another data point \(\mathbf{x}^{(i)}\), its distance to the plane is the projection of \(\mathbf{x}^{(i)} - \mathbf{x}\) onto the orthogonal direction of the plane, which is \(\theta\). Thus we have: \[distance = \frac{1}{||\theta||}|\theta^T(\mathbf{x}^{(i)} - \mathbf{x})|\] \[ = \frac{1}{||\theta||}|\theta^T\mathbf{x}^{(i)} + \theta_0 - \theta^T\mathbf{x} - \theta_0|\] \[ = \frac{1}{||\theta||}|\theta^T\mathbf{x}^{(i)} + \theta_0|\]
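To make the formula concrete, here is a tiny numeric check in R (the values of \(\theta\), \(\theta_0\) and the point are made up purely for illustration):

theta  <- c(2, -1)
theta0 <- 0.5
xi     <- c(1.5, 3.0)
abs(sum(theta * xi) + theta0) / sqrt(sum(theta^2))   # distance from xi to the plane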
From the definition of the plane \(\theta^T\mathbf{x} + \theta_0 = 0\), we can see that if we scale \(\theta\) and \(\theta_0\) at the same time, the plane does not change. Thus we can normalize \(\theta\) and \(\theta_0\) such that: \[ \min_i |\theta^T\mathbf{x}^{(i)} + \theta_0| = 1\] where \(i = 1...N\) for N data points, so that the margin of the closest point is exactly \(\frac{1}{||\theta||}\). The problem then becomes: \[ Max \frac{1}{||\theta||} : \min_i |\theta^T\mathbf{x}^{(i)} + \theta_0| = 1\] The colon is read as “subject to”, and the expression after the colon is the constraint. In the completely linearly separable case, the label \(y^{(i)}\) has the same sign as \(\theta^T\mathbf{x}^{(i)} + \theta_0\), and thus \(|\theta^T\mathbf{x}^{(i)} + \theta_0| = y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0)\), so the problem becomes: \[ Min \frac{1}{2}\theta^T\theta : y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0) \geq 1, \forall i=1...N\]
This form is called the primal form of the SVM, and it is a quadratic programming problem because it minimizes a quadratic term under linear constraints. We can use the standard quadratic programming package quadprog to solve it. The optimization variable has K+1 dimensions, where K is the dimension of \(\theta\), and there are N constraints, where N is the number of data points.
library("quadprog"); library(ggplot2)
data(iris); train <- iris
train$y <-ifelse(train[,5]=="setosa", 1, -1)
X <- as.matrix(train[,c("Petal.Length", "Petal.Width")])
N <- dim(X)[1]
X1 = cbind(rep(1, N), X)
K = dim(X1)[2]
y <- as.matrix(train$y)

# prepare the matrices for quadratic programming: minimize 1/2 u'Du - d'u subject to A'u >= b0,
# where u = (theta_0, theta_1, theta_2)
D <- diag(c(1e-6, rep(1, K - 1)))   # 1e-6 keeps D positive definite without penalizing theta_0
A <- t(diag(y[,1]) %*% X1)          # one column per constraint y_i * (theta_0 + theta' x_i) >= 1
d <- matrix(rep(0, K), nrow = K)
b0 <- matrix(rep(1, N), nrow = N)
sol <- solve.QP(D, d, A, b0, factorized = FALSE)
# decision boundary theta_0 + theta_1*x1 + theta_2*x2 = 0, expressed as intercept and slope
qpline <- -c(sol$solution[1], sol$solution[2]) / sol$solution[3]

# plot the results
plt <- ggplot(train, aes(x=Petal.Length, y=Petal.Width)) + 
  ggtitle("Solving the SVM QP") +
  geom_point(aes(fill=factor(y)), size=3, pch=21) +
  geom_abline(intercept=qpline[1], slope=qpline[2], size=1) +
  scale_fill_manual(values=c("red","blue"), guide='none')+
  scale_color_manual(values=c("green", "yellow", "black"))
print(plt)

Dual form

However, the primal form is only good for the linearly separable case. If we want a curved separation boundary, we have to introduce basis functions to expand the variable \(\theta\) from K dimensions to higher dimensions, and depending on the complexity of the basis functions, the dimension K can become very large, making the problem difficult to solve in primal form. Here we introduce the dual form of SVM, which has the nice property that the problem complexity only depends on N, regardless of the dimension of \(\theta\); thus \(\theta\) can be expanded to higher dimensions with complex basis functions.
To obtain the dual form of the problem, we reformulate it as a Lagrangian: \[ Min_{\theta, \theta_0} L(\theta, \theta_0, \alpha) = \frac{1}{2}\theta^T\theta - \sum_i^N\alpha^{(i)}(y^{(i)}(\theta^T \mathbf{x}^{(i)} + \theta_0)-1) : \alpha^{(i)} \geq 0\] Note that the minimization is w.r.t. \(\theta\) and \(\theta_0\). Setting the gradients to zero, we have: \[ \nabla_{\theta} L = \theta - \sum_i^N\alpha^{(i)}y^{(i)}\mathbf{x}^{(i)} = 0\] \[ \frac{\partial L}{\partial \theta_0} = \sum_i^N\alpha^{(i)}y^{(i)} = 0\] Substituting these two equations back into the Lagrangian, we have: \[ Max_{\alpha} L(\alpha) = \vec{1}^T\alpha - \frac{1}{2}[diag(\mathbf{y})\alpha]^TQ[diag(\mathbf{y})\alpha]\] where \(Q_{i,j} = \mathbf{x}_i^T\mathbf{x_j}\)
This is the dual form of the primal problem: we have converted optimizing a function w.r.t. \(\theta\) and \(\theta_0\) into optimizing a function w.r.t. \(\alpha\). Note that the primal problem minimizes the Lagrangian \(L(\theta, \theta_0, \alpha)\) w.r.t. \(\theta\) and \(\theta_0\), but the dual form maximizes \(L(\alpha)\). Why is that? The original target function is \(f(\theta) = \frac{1}{2}\theta^T\theta\), and the Lagrangian \(L(\theta, \theta_0, \alpha)\) is \(f(\theta)\) minus a non-negative term, since \(\alpha^{(i)} \geq 0\) and \(y^{(i)}(\theta^T \mathbf{x}^{(i)} + \theta_0)-1 \geq 0\) for any feasible \(\theta, \theta_0\). Thus \(L(\theta, \theta_0, \alpha) \leq f(\theta)\). When we minimize \(L(\theta, \theta_0, \alpha)\), we only find a lower bound on the minimum of \(f(\theta)\); substituting the minimizing values of \(\theta\) and \(\theta_0\) back into \(L(\theta, \theta_0, \alpha)\) gives \(L(\alpha)\), and we then maximize \(L(\alpha)\) in the hope that \(Max_{\alpha} L(\alpha) = Min_{\theta} f(\theta)\). Fortunately, this is true for convex problems such as this one, and thus if we maximize \(L(\alpha)\), we are guaranteed to obtain the minimum of \(f(\theta)\). This is the so-called strong duality.
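To make the weak duality step explicit (writing \(\theta^*, \theta_0^*\) for the primal optimum, a notation introduced here only for clarity): \[ L(\alpha) = \min_{\theta, \theta_0} L(\theta, \theta_0, \alpha) \leq L(\theta^*, \theta_0^*, \alpha) \leq f(\theta^*) \quad \Rightarrow \quad \max_{\alpha \geq 0} L(\alpha) \leq \min_{\theta, \theta_0} f(\theta)\] Strong duality says that, for this convex quadratic problem, the last inequality holds with equality.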
Now that we have the dual form of the problem, we need to look at the constraints. Two obvious constraints are \(\frac{\partial L}{\partial \theta_0} = \sum_i^N\alpha^{(i)}y^{(i)} = 0\), or \(\alpha^T\mathbf{y} = 0\), and \(\alpha \geq 0\). The other constraint is a very important one, the Karush-Kuhn-Tucker (KKT) condition: \[ \alpha^{(i)}(y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0)-1) = 0, \forall i = 1...N\] The intuition of the KKT condition is quite straightforward: if the constraint of the primal problem is not actually binding, i.e., \(y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0)-1 > 0\), then the corresponding Lagrange multiplier \(\alpha^{(i)}\) must be 0, because there is no need to enforce the constraint through \(\alpha^{(i)}\). But if \(\alpha^{(i)} > 0\), the constraint is binding, and this point must lie exactly on the margin, i.e., \(y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0)-1 = 0\). This property is called complementary slackness, and the points with \(\alpha^{(i)} > 0\) are the support vectors.
We summarize the dual form of the SVM optimization problem as follows: \[ Min_{\alpha} L(\alpha) = -\vec{1}^T\alpha + \frac{1}{2}\alpha^T[diag(\mathbf{y})Qdiag(\mathbf{y})]\alpha: \alpha^{(i)}(y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0)-1) = 0, \alpha^T\mathbf{y} = 0, \alpha \geq 0\] where \(Q_{i,j} = \mathbf{x}_i^T\mathbf{x_j}\). After solving for \(\alpha\), the final solution is \(\theta = \sum_i^N\alpha^{(i)}y^{(i)}\mathbf{x}^{(i)}\), and \(\theta_0\) can be recovered from any support vector via \(\theta_0 = y^{(i)} - \theta^T\mathbf{x}^{(i)}\) (in practice, averaged over all support vectors).
With the optimization problem and the constraints defined, we can feed the dual form SVM into a quadratic programming solver, and we should get the same result as with the primal form.
# build the system matrices: D_ij = y_i y_j x_i' x_j, i.e. diag(y) Q diag(y)
Q <- sapply(1:N, function(i) y[i]*t(X)[,i])   # column i is y_i * x_i
D <- t(Q)%*%Q
d <- matrix(1, nrow = N)
# constraints: y' alpha = 0 (equality, meq = 1) and alpha >= 0
b0 <- rbind( matrix(0, nrow=1, ncol=1) , matrix(0, nrow = N, ncol=1) )
A <- t(rbind(matrix(y, nrow=1, ncol=N), diag(nrow=N)))

# call the QP solver; the small ridge 5e-4*diag(N) makes D positive definite,
# since t(Q) %*% Q is only positive semi-definite
sol <- solve.QP(D + 5e-4*diag(N), d, A, b0, meq=1, factorized=FALSE)
qpsol <- matrix(sol$solution, nrow=N)         # the N Lagrange multipliers alpha

findLine <- function(a, y, X){
  nonzero <-  abs(a) > 1e-5   # support vectors have alpha > 0
  W <- rowSums(sapply(which(nonzero), function(i) a[i]*y[i]*X[i,]))   # theta = sum_i alpha_i y_i x_i
  b <- mean(sapply(which(nonzero), function(i) X[i,]%*%W- y[i]))      # b = -theta_0, averaged over the support vectors
  # decision boundary theta' x + theta_0 = 0, expressed as intercept and slope
  slope <- -W[1]/W[2]
  intercept <- b/W[2]
  return(c(intercept,slope))
}
qpline <- findLine( qpsol, y, X)
# plot the results
plt <- ggplot(train, aes(x=Petal.Length, y=Petal.Width)) + 
  ggtitle("Solving the SVM QP") +
  geom_point(aes(fill=factor(y)), size=3, pch=21) +
  geom_abline(intercept=qpline[1], slope=qpline[2], size=1) +
  scale_fill_manual(values=c("red","blue"), guide='none')+
  scale_color_manual(values=c("green", "yellow", "black"))
print(plt)
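As a sanity check of the KKT intuition above, we can verify complementary slackness numerically from the dual solution (recomputing \(\theta\) and \(\theta_0\) the same way findLine does):

sv <- which(abs(qpsol) > 1e-5)                         # indices of the support vectors
W  <- colSums(qpsol[sv] * y[sv] * X[sv, , drop = FALSE])
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% W)        # theta_0
margins <- y * (X %*% W + b)                           # y_i * (theta' x_i + theta_0)
range(margins[sv])                                     # support vectors sit (approximately) on the margin
max(abs(qpsol * (margins - 1)))                        # complementary slackness: alpha_i * (margin_i - 1) ~ 0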

Generic quadratic programming algorithms usually have trouble with large-scale problems: when N > 100,000, many quadratic programming packages have difficulty solving the problem. Many algorithms have been developed specifically for the SVM quadratic program, such as sequential minimal optimization (SMO), which breaks the problem into a series of smallest-possible sub-problems that are then solved analytically, with various heuristics used to choose the pair of multipliers so as to accelerate convergence. SMO is widely used for training support vector machines and is implemented by the popular LIBSVM tool.
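As a quick cross-check against our hand-rolled QP solution, here is a minimal sketch using the e1071 R package, which wraps LIBSVM (the large cost value is an assumption used to approximate the hard margin; the coefficient extraction follows e1071's decision-function convention):

library(e1071)
fit <- svm(factor(y) ~ Petal.Length + Petal.Width, data = train,
           kernel = "linear", cost = 1e3, scale = FALSE)
W <- t(fit$coefs) %*% fit$SV          # theta = sum_i alpha_i y_i x_i, from the support vectors
b <- -fit$rho                         # theta_0 (LIBSVM's decision function is W'x - rho)
c(intercept = -b/W[2], slope = -W[1]/W[2])   # should be close to qpline obtained above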

The kernel trick

The primary benefit of using the dual form instead of the primal form is that it is easy to extend the decision boundary from linear to nonlinear using kernel functions, because in the dual form the quadratic matrix \(Q_{i,j} = \mathbf{x}_i^T\mathbf{x}_j\) only depends on the inner products of the data points. If we want to make the decision boundary nonlinear, i.e., we expand \(\mathbf{x}\) by basis functions into an n-dimensional feature space \(\mathbf{z} = \Phi(\mathbf{x}) = (\phi_1(\mathbf{x}), ..., \phi_n(\mathbf{x}))\), then all we need is the so-called kernel function \(K(\mathbf{x}, \mathbf{x'}) = \Phi(\mathbf{x})^T\Phi(\mathbf{x'}) = \mathbf{z}^T\mathbf{z'}\): we only ever deal with inner products through \(K\), and never have to compute the basis expansion \(\Phi(\mathbf{x})\) itself, which could have very high dimension and be hard to compute. The most common kernel functions are the polynomial kernel \(K(\mathbf{x}, \mathbf{x'}) = (1 + \mathbf{x}^T\mathbf{x'})^n\), which corresponds to the feature space of all monomials up to degree n, and the radial (RBF) kernel \(K(\mathbf{x}, \mathbf{x'}) = exp(-\gamma||\mathbf{x - x'}||^2)\), which corresponds to an infinite-dimensional feature space. Let’s take a closer look at the radial kernel using a Taylor expansion (for simplicity, take \(\gamma = 1\) and scalar \(x\)): \[K(x, x') = exp(-(x - x')^2)\] \[ = exp(-x^2)exp(-x'^2)\sum_{k=0}^\infty\frac{2^k x^k (x')^k}{k!}\] If we define the basis functions as \(\phi_k(x) = \sqrt{\frac{2^k}{k!}}exp(-x^2)x^k\), we can write the radial kernel as: \[K(x, x') = \sum_{k=0}^\infty\phi_k(x)\phi_k(x')\] Thus the radial basis kernel corresponds to an infinite-dimensional feature space. Finally, the hypothesis can be expressed purely in terms of the kernel: \[g(\mathbf{x}) = sign(\theta^T\mathbf{z} + \theta_0) = sign(\sum_{i=1}^N \alpha^{(i)}y^{(i)}(\mathbf{z}^{(i)})^T\mathbf{z} + \theta_0) = sign(\sum_{i=1}^N \alpha^{(i)}y^{(i)}K(\mathbf{x}^{(i)}, \mathbf{x}) + \theta_0)\] which only involves the kernel function, not \(\mathbf{z} = \Phi(\mathbf{x})\).
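A small numeric check of this idea, using the degree-2 polynomial kernel on 2-D inputs (the explicit feature map phi2 below is one possible construction, written out here only for illustration):

phi2 <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2])
x  <- c(1.4, 0.2)
xp <- c(4.7, 1.4)
(1 + sum(x * xp))^2          # kernel computed in the original 2-D space
sum(phi2(x) * phi2(xp))      # the same value via the explicit 6-D feature map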
So we can clearly see the advantage of the dual form: regardless of how many dimensions \(\mathbf{z}\) or the original \(\mathbf{x}\) may have, the dual representation only has as many parameters as there are data points N. This also makes SVM suitable for data with a large number of features, such as in text processing.

Soft Margin

We have shown that the primal form of the SVM problem is the following: \[ Min \frac{1}{2}\theta^T\theta : y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0) \geq 1, \forall i=1...N\]
This formulation does not allow any violation of the margin, which only works for separable cases. If the problem is inherently inseparable, there must be violations of the margin. To deal with inseparable problems, we modify the primal form to add a term representing the margin violations: \[ Min \frac{1}{2}\theta^T\theta + C\sum_{i=1}^N\xi^{(i)}: y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0) \geq 1 - \xi^{(i)}, \xi^{(i)} \geq 0, \forall i=1...N\] Here \(\xi^{(i)} = max(0, 1 - y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0))\) is the loss of the soft margin SVM, usually called the hinge loss. So in the new primal problem we are maximizing the margin and at the same time minimizing the violations. We choose the hinge loss for the soft margin SVM because 1) we only penalize violations of the margin, but give no credit for not violating it; 2) the linear form of the hinge loss results in something very interpretable in the dual form.
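A one-line R illustration of the hinge loss as a function of the signed margin \(m = y(\theta^T\mathbf{x} + \theta_0)\) (the sample margin values below are made up):

hinge <- function(m) pmax(0, 1 - m)
hinge(c(-0.5, 0, 0.5, 1, 2))   # violations are penalized linearly; margins >= 1 get zero loss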
Putting the primal into the dual form, and skipping the algebra, the dual form of the soft margin SVM is: \[ Min_{\alpha} L(\alpha) = -\vec{1}^T\alpha + \frac{1}{2}\alpha^T[diag(\mathbf{y})Qdiag(\mathbf{y})]\alpha: \alpha^T\mathbf{y} = 0, 0 \leq \alpha \leq C\] where \(Q_{i,j} = \mathbf{x}_i^T\mathbf{x_j}\), and the KKT conditions become \(\alpha^{(i)}(y^{(i)}(\theta^T\mathbf{x}^{(i)} + \theta_0) - 1 + \xi^{(i)}) = 0\) and \((C - \alpha^{(i)})\xi^{(i)} = 0\). The only difference from the hard margin dual form is that \(\alpha\) has an upper limit C. Because of this upper limit, there are now two types of support vectors in the soft margin SVM: 1) those with \(0 < \alpha^{(i)} < C\), which lie exactly on the margin as in the hard margin SVM; 2) those with \(\alpha^{(i)} = C\), which correspond to data points violating the margin. The optimal parameter C can be determined by cross-validation.
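Putting the kernel trick and the soft margin together, here is a sketch using the same quadprog machinery on a split that is not linearly separable (the versicolor-vs-virginica split and the choices gamma = 1, C = 1 are illustrative assumptions, not tuned values):

Xk <- as.matrix(iris[iris$Species != "setosa", c("Petal.Length", "Petal.Width")])
yk <- ifelse(iris$Species[iris$Species != "setosa"] == "versicolor", 1, -1)
Nk <- nrow(Xk); C <- 1; gam <- 1

K  <- exp(-gam * as.matrix(dist(Xk))^2)     # Q_ij = K(x_i, x_j) replaces x_i' x_j
Dk <- (yk %o% yk) * K                       # diag(y) Q diag(y)

# constraints: y' alpha = 0 (equality), alpha >= 0, and alpha <= C (written as -alpha >= -C)
Amat <- cbind(yk, diag(Nk), -diag(Nk))
bvec <- c(0, rep(0, Nk), rep(-C, Nk))
solk <- solve.QP(Dk + 1e-6*diag(Nk), rep(1, Nk), Amat, bvec, meq = 1, factorized = FALSE)
alpha <- solk$solution

# theta_0 from the margin support vectors (0 < alpha < C), then the kernelized decision rule
svk  <- which(alpha > 1e-5 & alpha < C - 1e-5)
b0k  <- mean(yk[svk] - K[svk, , drop = FALSE] %*% (alpha * yk))
pred <- sign(K %*% (alpha * yk) + b0k)
mean(pred == yk)                            # training accuracy with a curved decision boundary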

Support Vectors and VC dimension

In statistical learning theory (sometimes called computational learning theory), the [VC dimension](https://en.wikipedia.org/wiki/VC_dimension) (for Vapnik-Chervonenkis dimension) is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter. Informally, the capacity of a classification model is related to how complicated it can be, and the number of support vectors is closely related to the VC dimension. If you find that a large portion of your data points are support vectors, say 30%, it usually means that your model complexity is very high and the decision boundary can be very wiggly.
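For the hard margin model fitted earlier, the fraction of training points that are support vectors can be read directly off the dual solution:

mean(abs(qpsol) > 1e-5)   # fraction of data points with alpha > 0, i.e., support vectors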

Multiple classes

The algorithms above only separate y into two classes. There are many ways to extend the basic algorithm to classify multiple classes; the most popular ones are the “one-against-one” and “one-against-rest” methods. The popular Python package scikit-learn implements sklearn.svm.SVC, which is based on the C++ library libsvm and uses the “one-against-one” method. The advantages of the “one-against-one” method are discussed in: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13 (2002), 415-425.
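Since e1071 wraps libsvm, it applies the one-against-one scheme automatically when the response has more than two classes. A minimal sketch on the full iris data, using the same two petal features as above (the linear kernel here is just an illustrative choice):

fit3 <- svm(Species ~ Petal.Length + Petal.Width, data = iris, kernel = "linear")
table(predicted = predict(fit3, iris), actual = iris$Species)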

Resources: www.robots.ox.ac.uk/~az/lectures/ml/
