Saturday, July 25, 2015

Machine Learning and Statistics: a unified perspective | 1.3 Linear Regression Regularization

In the previous article, we discussed how to find the parameters \(\theta\) that minimize the least square loss function, but we have not yet discussed the constraints on \(\theta\). As discussed in the introduction, the \(Constraint(\theta)\) term is usually used to limit the complexity of the model, especially when the number of features K is large. A large number of features can lead to high variance of the model. In the extreme case where the number of parameters K equals the number of data points N, an exact solution can be found and the loss function becomes 0, but such a model does not offer any useful predictive power. In this article, we will evaluate what methods can be used to limit the complexity of the linear regression model.

Ridge regression

We can consider adding the following \(Constraint(\theta)\): \(\theta^T\theta = \sum_{j=1}^K \theta_j^2 \leq t\), which sets an upper limit on the norm of \(\theta\). Using a Lagrange multiplier \(\lambda\), minimizing the loss function under this constraint is equivalent to minimizing the penalized loss \[\hat{\theta} = argmin_{\theta} \{\mathbf{(y - X\theta)^T(y - X\theta) + \lambda \theta^T\theta}\}\]
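Expanding the gradient of this penalized loss makes the next step explicit: \[\nabla_{\theta}\{\mathbf{(y - X\theta)^T(y - X\theta)} + \lambda \theta^T\theta\} = -2\mathbf{X^T(y - X\theta)} + 2\lambda\theta\]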

Setting this gradient to zero, we can easily solve for \(\theta\): \[ \hat{\theta} = \mathbf{(X^TX + \lambda I)^{-1}X^Ty}\] Compared with ordinary least squares linear regression, the only difference is that the term \(\mathbf{\lambda I}_{K \times K}\) is added to \(\mathbf{X^TX}\).
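As a quick numerical check, the closed form reproduces ordinary least squares at \(\lambda = 0\) and shrinks the coefficients for \(\lambda > 0\). Below is a minimal sketch on simulated data (ridge_theta, X_sim and y_sim are illustrative names, not part of the prostate example used later):

# Hypothetical helper: ridge coefficients via the closed form (no intercept)
ridge_theta <- function(X, y, lambda) solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y

set.seed(1)
X_sim <- matrix(rnorm(100 * 3), 100, 3)
y_sim <- drop(X_sim %*% c(1, -2, 0.5) + rnorm(100))

ridge_theta(X_sim, y_sim, 0)      # lambda = 0 reproduces least squares
coef(lm(y_sim ~ X_sim - 1))       # ordinary least squares without an intercept
ridge_theta(X_sim, y_sim, 10)     # a positive lambda shrinks the coefficients toward 0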

Using the Singular Value Decomposition \(\mathbf{X = UDV^T}\), we can better understand the essence of ridge regression. \[ \mathbf{X\hat{\theta} = X(X^TX + \lambda I)^{-1}X^Ty}\] \[ = \mathbf{UD(D^2 + \lambda I)^{-1}DU^Ty}\] \[ = \sum_{j=1}^K \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda}\mathbf{u}_j^T\mathbf{y}\] where \(d_j\) are the singular values of \(\mathbf{X}\). From this result, we can see that, like linear regression, ridge regression computes the coordinates of \(\mathbf{y}\) with respect to the orthonormal basis \(\mathbf{U}\), but it shrinks these coordinates by the factors \(\frac{d_j^2}{d_j^2 + \lambda}\). This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller \(d_j^2\). Note that this shrinkage is on the orthonormal basis \(\mathbf{U}\), not on \(\mathbf{X}\): the regression coefficients \(\theta\) of \(\mathbf{X}\) will not necessarily shrink more on the variables with smaller \(\theta_j\).
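The last identity can be verified numerically; here is a minimal sketch on simulated data (X_sim and y_sim are again illustrative names):

set.seed(1)
X_sim <- matrix(rnorm(50 * 4), 50, 4)
y_sim <- rnorm(50)
lambda <- 2

# Fitted values from the closed-form ridge solution
fit_direct <- X_sim %*% solve(t(X_sim) %*% X_sim + lambda * diag(4)) %*% t(X_sim) %*% y_sim

# Fitted values from the SVD form: U D (D^2 + lambda I)^-1 D U^T y
s <- svd(X_sim)
shrink <- s$d^2 / (s$d^2 + lambda)       # shrinkage factors d_j^2 / (d_j^2 + lambda)
fit_svd <- s$u %*% (shrink * (t(s$u) %*% y_sim))

max(abs(fit_direct - fit_svd))           # essentially 0, up to numerical error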

Ridge regression gives us one more parameter to tune: \(\lambda\). A larger \(\lambda\) corresponds to smaller parameter coefficients, and thus less model complexity, but the corresponding estimation error may increase due to the increase in bias. To trade off between model complexity and estimation error when choosing \(\lambda\), we usually use an information criterion such as AIC or BIC: \[ AIC = -2\ln(likelihood) + 2df = n \ln(RSS/n) + 2df\] \[ BIC = -2\ln(likelihood) + df \ln(n) = n \ln(RSS/n) + df \ln(n)\] where RSS is the residual sum of squares, df is the degrees of freedom of the model, and the \(n\ln(RSS/n)\) term comes from the Gaussian log-likelihood with constants dropped. BIC can be viewed as an approximation to \(-2\) times the log marginal likelihood of the data. We can see from the definitions that, whenever \(\ln(n) > 2\) (i.e. \(n > e^2 \approx 7.4\)), BIC penalizes model complexity more than AIC.
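For the Gaussian case both criteria are easy to compute directly from the residuals; here is a minimal sketch (aic_rss and bic_rss are hypothetical helper names):

# Information criteria from the residual sum of squares, sample size n and degrees of freedom df
aic_rss <- function(rss, n, df) n * log(rss / n) + 2 * df
bic_rss <- function(rss, n, df) n * log(rss / n) + df * log(n)

The ridge example below computes AIC inline in exactly this way.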

In order to use AIC or BIC to determine the optimum value of \(\lambda\), we need a definition of the degrees of freedom. For least squares linear regression, the degrees of freedom is simply the number of parameters K, which is also the trace of the hat matrix \(\mathbf{H = X(X^TX)^{-1}X^T = UU^T}\): \[\mathbf{tr(H) = tr(UU^T) = tr(U^TU) = tr(I)} = K\]

By analogy, using the corresponding matrix \(\mathbf{H_{ridge} = X(X^TX + \lambda I)^{-1}X^T}\), we can define the degrees of freedom for ridge regression to be: \[df = \mathbf{tr(H_{ridge})} = \sum_j \frac{d_j^2}{d_j^2 + \lambda}\] We can see that a larger \(\lambda\) leads to a smaller df and thus less model complexity.
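A quick sanity check of this formula's two limits (df_ridge is a hypothetical helper, and the simulated matrix is only for illustration):

df_ridge <- function(lambda, d) sum(d^2 / (d^2 + lambda))   # d: singular values of X

d_sim <- svd(matrix(rnorm(50 * 4), 50, 4))$d
df_ridge(0, d_sim)       # equals K (here 4), the least squares case
df_ridge(1e6, d_sim)     # approaches 0 as lambda grows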

In the following example I will use the prostate cancer data from the textbook Elements of Statistical Learning.

# Prostate data: 8 predictors, lpsa (column 9) as response, plus a train/test indicator
pcancer <- read.table("prostate_data.txt", header=TRUE)
train <- pcancer[which(pcancer$train), 1:9]       # training subset
calibrate <- pcancer[-which(pcancer$train), 1:9]  # held-out subset
plot(train)

Now let’s use the AIC criterion to choose the best \(\lambda\):

lambda_seq = seq(0, 5, 0.1)
X = as.matrix(train[, 1:8]); y = train[, 9]; N = nrow(X); K = ncol(X)
s = svd(X); d = s$d; colors = rainbow(8)
AIC_seq = c(); RSS_seq = c(); theta_seq = c(); df_seq = c()
for (lambda in lambda_seq) {
  # Closed-form ridge solution for this lambda
  theta = solve(t(X) %*% X + lambda * diag(K)) %*% t(X) %*% y
  theta_seq = cbind(theta_seq, theta)
  # Residual sum of squares, effective degrees of freedom and AIC
  RSS = sum((y - X %*% theta)^2); RSS_seq = c(RSS_seq, RSS)
  df = sum(d^2/(d^2 + lambda)); df_seq = c(df_seq, df)
  AIC_seq = c(AIC_seq, N * log(RSS/N) + 2*df)
}
par(mfrow = c(1, 2))
plot(lambda_seq, df_seq, xlab = expression(lambda), ylab = "Effective DF")
plot(lambda_seq, RSS_seq, xlab = expression(lambda), ylab = "RSS")

plot(lambda_seq, AIC_seq, xlab = expression(lambda), ylab = "AIC")
matplot(lambda_seq, t(theta_seq), xlim=c(0,6), type="l",xlab=expression(lambda), 
    ylab=expression(hat(theta)), col=rainbow(8), lty=1, lwd=2, main="Ridge coefficients")
lambda.ridge <- lambda_seq[which.min(AIC_seq)]
abline(v=lambda.ridge, lty=2); abline(h=0, lty=2)
text(rep(5, 8), theta_seq[,ncol(theta_seq)], rownames(theta_seq), pos=4, col=rainbow(8))

We can see that as \(\lambda\) increases, the effective degrees of freedom decreases while the RSS increases. Trading off between the two using the AIC criterion, we find that the optimum \(\lambda\) is around 1.7.

Lasso Regression and more general constraints

Ridge regression uses the constraint \(Constraint(\theta): \sum_{j=1}^K \theta_j^2 \leq t\). We can make this constraint more general: \(Constraint(\theta): \sum_{j=1}^K |\theta_j|^q \leq t\), and when q = 1 this constraint is given a special name, the Lasso. q = 1 plays a special role because it is the smallest value of q for which the constraint region is still convex, so the problem can still be solved efficiently, with a time complexity of about \(O(NK^2)\), comparable to solving least squares linear regression with the QR decomposition. However, when q = 1 there is no closed-form solution like that of ridge regression; we have to use optimization techniques such as quadratic programming, which we will discuss in more detail later.
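One simple way to solve the Lagrangian form \(\frac{1}{2}\mathbf{(y - X\theta)^T(y - X\theta)} + \lambda\sum_j|\theta_j|\) is coordinate descent with soft-thresholding. The sketch below is purely illustrative (it is not the algorithm used by the lars package later in this post) and assumes a centered response and standardized predictors with no intercept:

soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimal coordinate-descent sketch for the Lagrangian form of the Lasso
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  K <- ncol(X)
  theta <- rep(0, K)
  for (iter in 1:n_iter) {
    for (j in 1:K) {
      # Partial residual excluding the contribution of variable j
      r_j <- y - X[, -j, drop = FALSE] %*% theta[-j]
      theta[j] <- soft_threshold(sum(X[, j] * r_j), lambda) / sum(X[, j]^2)
    }
  }
  theta
}

# Example call on the (standardized) training predictors defined above;
# the larger lambda is, the more coefficients are pushed to exactly 0
lasso_cd(scale(X), y - mean(y), lambda = 20)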

An interesting property of the Lasso is that when the regularization parameter \(\lambda\) becomes large enough, some coefficients in \(\theta\) become exactly 0. Thus the Lasso can serve as a model selection tool. It can further be shown that any \(q \leq 1\) constraint can push certain parameters to exactly 0 and therefore serve as a model selection tool.

Since there is no closed-form solution for the Lasso, we will use the R package “lars” to demonstrate the results.

library(lars); par(mfrow = c(1,1))
## Loaded lars 1.2
model.lasso <- lars(X, y, type="lasso")
lambda.lasso <- c(model.lasso$lambda, 0)  # lambda at each step, plus 0 for the full least-squares fit
theta <- coef(model.lasso)                # one row of coefficients per step
matplot(lambda.lasso, theta, xlim=c(8,-2), type="o", pch=20, xlab=expression(lambda), 
    ylab=expression(hat(theta)), col=colors)
text(rep(-0, 9), theta[9,], colnames(X), pos=4, col=colors)
abline(v=lambda.lasso[4], lty=2); abline(h=0, lty=2)

theta.lasso <- theta[4,]                  # coefficients at step 4 of the path
resid.lasso <- train$lpsa - predict(model.lasso, as.matrix(train[,1:8]), s=4, type="fit")$fit
rss.lasso <- sum(resid.lasso^2)/(67-4)    # residual variance estimate on the 67 training points

Notice how each parameter shrinks to 0 as \(\lambda\) increases.

Stepwise Selection

Stepwise selection is a greedy algorithm for selecting a good subset of variables using an information criterion such as AIC or BIC. It proceeds by repeatedly adding or dropping the single variable that most improves the criterion, stopping when no step improves it further. It can be viewed as a greedy approximation to the q = 0 (best subset) case. This algorithm is quite straightforward, and we will simply use the step function in R to demonstrate it:

model.ls <- lm(lpsa ~ ., data=train)
model.backward <- step(model.ls, direction="backward")
## Start:  AIC=-37.13
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
##     pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - gleason  1    0.0109 29.437 -39.103
## <none>                 29.426 -37.128
## - age      1    0.9886 30.415 -36.914
## - pgg45    1    1.5322 30.959 -35.727
## - lcp      1    1.7683 31.195 -35.218
## - lbph     1    2.1443 31.571 -34.415
## - svi      1    3.0934 32.520 -32.430
## - lweight  1    3.8390 33.265 -30.912
## - lcavol   1   14.6102 44.037 -12.118
## 
## Step:  AIC=-39.1
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
## 
##           Df Sum of Sq    RSS     AIC
## <none>                 29.437 -39.103
## - age      1    1.1025 30.540 -38.639
## - lcp      1    1.7583 31.196 -37.216
## - lbph     1    2.1354 31.573 -36.411
## - pgg45    1    2.3755 31.813 -35.903
## - svi      1    3.1665 32.604 -34.258
## - lweight  1    4.0048 33.442 -32.557
## - lcavol   1   14.8873 44.325 -13.681
scope <- list(upper=~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45, lower=~.)
model.forward <- step(lm(lpsa ~ 1, data=train), scope, direction="forward")
## Start:  AIC=26.29
## lpsa ~ 1
## 
##           Df Sum of Sq    RSS      AIC
## + lcavol   1    51.753 44.529 -23.3736
## + svi      1    29.859 66.422   3.4199
## + lcp      1    23.042 73.239   9.9657
## + lweight  1    22.668 73.614  10.3071
## + pgg45    1    19.328 76.953  13.2799
## + gleason  1    11.290 84.992  19.9368
## + lbph     1     6.657 89.625  23.4930
## + age      1     4.989 91.292  24.7279
## <none>                 96.281  26.2931
## 
## Step:  AIC=-23.37
## lpsa ~ lcavol
## 
##           Df Sum of Sq    RSS     AIC
## + lweight  1    7.4367 37.092 -33.617
## + lbph     1    4.5363 39.992 -28.572
## + svi      1    2.2160 42.313 -24.794
## <none>                 44.529 -23.374
## + pgg45    1    1.1055 43.423 -23.058
## + gleason  1    0.1045 44.424 -21.531
## + lcp      1    0.0610 44.468 -21.465
## + age      1    0.0329 44.496 -21.423
## 
## Step:  AIC=-33.62
## lpsa ~ lcavol + lweight
## 
##           Df Sum of Sq    RSS     AIC
## + svi      1   2.18410 34.908 -35.683
## + pgg45    1   1.65781 35.434 -34.680
## <none>                 37.092 -33.617
## + lbph     1   1.07668 36.015 -33.590
## + gleason  1   0.43335 36.658 -32.404
## + age      1   0.27462 36.817 -32.115
## + lcp      1   0.00206 37.090 -31.621
## 
## Step:  AIC=-35.68
## lpsa ~ lcavol + lweight + svi
## 
##           Df Sum of Sq    RSS     AIC
## + lbph     1   2.09275 32.815 -37.825
## <none>                 34.908 -35.683
## + pgg45    1   0.83360 34.074 -35.302
## + lcp      1   0.63405 34.274 -34.911
## + gleason  1   0.30106 34.607 -34.263
## + age      1   0.19562 34.712 -34.059
## 
## Step:  AIC=-37.83
## lpsa ~ lcavol + lweight + svi + lbph
## 
##           Df Sum of Sq    RSS     AIC
## <none>                 32.815 -37.825
## + pgg45    1   0.74555 32.069 -37.365
## + age      1   0.53056 32.284 -36.917
## + lcp      1   0.49216 32.323 -36.838
## + gleason  1   0.17815 32.637 -36.190

Elastic net

For q > 1, \(|\theta_j|^q\) is differentiable at 0, and thus it is easy to optimize for the minimum. However, when q > 1, the constraint can no longer set coefficients exactly to 0. Zou and Hastie (2005) introduced the elastic net penalty \(\lambda \sum_{j=1}^K(\alpha \theta_j^2 + (1-\alpha)|\theta_j|)\), which is a different compromise between ridge and the Lasso. We will discuss the elastic net in more detail later.
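As a preview, the glmnet package fits penalties of this family. A minimal sketch on the training data from above (note that glmnet's alpha argument weights the absolute-value term, so it corresponds roughly to \(1-\alpha\) in the formula above, and that glmnet standardizes the predictors and fits an intercept by default):

library(glmnet)
# alpha = 0 gives ridge, alpha = 1 gives the Lasso, values in between give the elastic net
model.enet <- glmnet(X, y, alpha = 0.5)
coef(model.enet, s = 0.1)   # coefficients at lambda = 0.1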
