Sunday, July 19, 2015

Machine Learning and Statistics: a unified perspective | 1.1 Linear Regression

Among all learning models, linear regression is one of the simplest, yet most popular. Because of its simplicity, many people don’t realize that linear regression is actually a type of machine learning. In this article and the following few, we are going to look at linear regression from a few different angles, and we will try to extend machine learning models based on linear regression.

Optimization point of view

Let’s start with the simplest case: one-variable linear regression. As defined in the previous article, learning means: given a data set (\(\mathbf{y}, \mathbf{X}\)), find a function \(\hat{y} = f(\mathbf{\theta}, \mathbf{x})\) such that, by choosing an appropriate value of \(\mathbf{\theta}\), some loss function \(L(\mathbf{y}, f(\mathbf{\theta}, \mathbf{X}))\) is minimized. In the case of one-variable linear regression, \(\mathbf{y}\) is \(\{y^{(i)}\}\) and \(\mathbf{X}\) is \(\{x^{(i)}\}\), where \(i = 1...N\). The estimation function is \(\hat{y} = f(\theta, \mathbf{x}) = \theta^T\mathbf{x} = \beta x + \alpha\), where \(\theta = (\beta, \alpha)\) and \(\mathbf{x} = (x, 1)\). The loss function is defined as \[L(\mathbf{y}, f(\mathbf{\theta}, \mathbf{X})) = (\mathbf{y}-\mathbf{X\theta})^T(\mathbf{y}-\mathbf{X\theta}) = \sum_{i=1}^N (y^{(i)} - \beta x^{(i)} - \alpha)^2,\] and the goal of linear regression is to find the \(\beta\) and \(\alpha\) that minimize it.
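
To make the definition concrete, here is a minimal R sketch of this squared-error loss (the function name loss and the example numbers are mine, just for illustration):

# squared-error loss for one-variable linear regression
loss <- function(beta, alpha, x, y) sum((y - beta * x - alpha)^2)
# e.g., loss(2, 0, c(1, 2, 3), c(2.1, 3.9, 6.2)) evaluates the fit of y = 2x + 0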

There are many ways to find the optimal \(\theta = (\beta, \alpha)\) that minimizes the loss function, but because this loss function is particularly simple, taking the form of an inner product of two vectors, an analytical solution can be found relatively easily. We know that when a function reaches its minimum, its derivative must equal zero, and thus we have: \[\frac{\partial{L}}{\partial{\theta}}=0\] \[\Rightarrow -2\mathbf{X^Ty} + 2\mathbf{X^TX\theta} = 0\] \[\Rightarrow \theta = \mathbf{(X^TX)^{-1}X^Ty}\]

Substituting \(\mathbf{X} = (\mathbf{x}, \mathbf{1})\) into the equation above, we can solve for \(\theta = (\beta, \alpha)\). In this particular case, because the second column of \(\mathbf{X}\) is a vector of ones, it can be proven that \(\beta\) can be written as a covariance/variance ratio: \(\beta = cov(\mathbf{y, x})/var(\mathbf{x})\).
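
As a quick sanity check of this identity, here is a minimal R sketch on synthetic data (the variable names and the true slope of 2 are mine, chosen arbitrarily):

set.seed(42)
x <- rnorm(100)                        # synthetic predictor
y <- 2 * x + 1 + rnorm(100, sd = 0.5)  # synthetic response with slope 2, intercept 1
beta_lm <- coef(lm(y ~ x))["x"]        # slope from the normal equations, via lm
beta_cov <- cov(y, x) / var(x)         # slope from the covariance/variance identity
print(c(beta_lm, beta_cov))            # the two values should agree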

Geometric point of view

Besides minimizing the squared sum of the error vector \(\mathbf{y-X\theta = y - \hat{y}}\), we can view the linear regression result from another angle: the geometric point of view. From the expression for \(\theta\), we know that \(\mathbf{\hat{y} = X\theta = X(X^TX)^{-1}X^Ty = Hy}\). Using this expression for \(\mathbf{H}\), it is straightforward to prove that \(\mathbf{H^TH = H = H^T}\), and thus we can further prove that: \[\mathbf{\hat{y}^T(y - \hat{y}) = (Hy)^T(y - Hy) = 0}\] If we view \(\mathbf{\hat{y}}\) and \(\mathbf{y}\) as vectors, this result has a geometric interpretation: the role of the matrix \(\mathbf{H}\) is to project the vector \(\mathbf{y}\) onto the column space of \(\mathbf{X}\), and the projected vector is \(\mathbf{\hat{y}}\). This interpretation makes good sense. We cannot find a linear combination of the column vectors of \(\mathbf{X}\) that represents \(\mathbf{y}\) exactly, because \(\mathbf{X}\) has more rows than columns. If the column vectors of \(\mathbf{X}\) could represent \(\mathbf{y}\) exactly, the problem would be a matrix inversion problem, not a regression problem. So the goal of regression is to find a vector in the column space of \(\mathbf{X}\) that mimics the vector \(\mathbf{y}\) as closely as possible, and if closeness to \(\mathbf{y}\) is measured by Euclidean distance, this is exactly the least squares linear regression problem, whose solution is simply to project the vector \(\mathbf{y}\) onto the column space of \(\mathbf{X}\) to minimize the distance.
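
Here is a minimal R sketch of this projection view on synthetic data (the design matrix X, the hat matrix H and the other names below are mine, not from any package):

set.seed(1)
X <- cbind(1, rnorm(20))               # N x K design matrix (K = 2, with an intercept column)
y <- X %*% c(1, 2) + rnorm(20)         # synthetic response
H <- X %*% solve(t(X) %*% X) %*% t(X)  # the hat / projection matrix
y_hat <- H %*% y                       # projection of y onto the column space of X
max(abs(H %*% H - H))                  # H is idempotent: ~ 0 up to rounding error
t(y_hat) %*% (y - y_hat)               # fitted values are orthogonal to residuals: ~ 0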

Based on the geometric point of view, we can simplify least squares linear regression by orthogonalizing the column space of \(\mathbf{X}\) using the Gram-Schmidt process. The Gram-Schmidt process, or QR decomposition, orthogonalizes the column vectors of \(\mathbf{X}\) into K unit-length vectors that are orthogonal to each other. In mathematical terms, we say that \(\mathbf{X}\) can be decomposed into: \[\mathbf{X = QR}\] where \(\mathbf{Q}\) is an \(N \times K\) orthogonal matrix such that \(\mathbf{Q^TQ} = I_{K \times K}\), and \(\mathbf{R}\) is a \(K \times K\) upper triangular matrix. We can prove that: \[\mathbf{\hat{\theta} = R^{-1}Q^Ty}\] \[\mathbf{\hat{y} = QQ^Ty}\]

The interpretation is straightforward from the geometric point of view: \(\mathbf{Q^Ty}\) is a vector whose coefficients are the projected lengths of \(\mathbf{y}\) onto the orthogonal basis vectors in \(\mathbf{Q}\), and multiplying this vector by \(\mathbf{Q}\) gives the representation / projection of \(\mathbf{y}\) in the column space of \(\mathbf{Q}\).
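
Continuing the synthetic X, y and H from the sketch above, a quick check of these two identities (again, all names are mine):

qr_X <- qr(X)
Q <- qr.Q(qr_X); R <- qr.R(qr_X)
theta_qr <- solve(R) %*% t(Q) %*% y    # coefficients via the QR identity
max(abs(X %*% theta_qr - H %*% y))     # X theta_qr reproduces the hat-matrix fit: ~ 0
max(abs(Q %*% t(Q) %*% y - H %*% y))   # and QQ'y is the same projection: ~ 0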

Matrix inversion point of view

The data of a linear regression is given in the form \(\mathbf{X\theta = y}\). If \(\mathbf{X}\) were an N x N full rank matrix, \(\theta\) could be solved for by simply inverting the matrix \(\mathbf{X}\): \(\mathbf{\theta = X^{-1}y}\). In linear regression problems, \(\mathbf{X}\) is an N x K matrix, and usually the number of data points N is larger than the number of features K. So \(\mathbf{X}\) is not directly invertible, but if we can find a pseudo-inverse \(\mathbf{X^+}\) such that \(\mathbf{X^+X = I}_{K \times K}\), then \(\theta\) can be solved for by \(\mathbf{\theta = X^+y}\). How do we find such a pseudo-inverse of \(\mathbf{X}\)?

Singular Value Decomposition (SVD) states that any matrix can be decomposed into a product of 3 matrices: \[\mathbf{X = U\Sigma V^T}\] where \(\mathbf{U}_{N \times N}\) and \(\mathbf{V}_{K \times K}\) are unitary matrices and \(\mathbf{\Sigma}_{N \times K}\) is a diagonal matrix with non-negative real numbers on the diagonal, which are the square roots of the non-zero eigenvalues of both \(\mathbf{X^TX}\) and \(\mathbf{XX^T}\). A proof of SVD can be found here. Without going into the mathematical details, the interpretation of SVD is very useful: any matrix operation on a vector can be interpreted as the result of 3 steps: a rotation in the K-dimensional space (\(\mathbf{V^T}\)), a scaling and embedding of the vector from the K-dimensional space into the N-dimensional space (\(\mathbf{\Sigma}\)), and a rotation in the N-dimensional space (\(\mathbf{U}\)).

With SVD, it is straightforward to show that if \(\mathbf{X}\)’s column space is full rank, then \(\mathbf{X^+X = I}_{K \times K}\), where \(\mathbf{X^+ = V\Sigma^+ U^T}\) and \(\mathbf{\Sigma^+}\) is the \(K \times N\) diagonal matrix whose diagonal elements are the reciprocals of the non-zero diagonal elements of \(\mathbf{\Sigma}\).

Using the definition of \(\mathbf{\Sigma^+}\), we can easily prove that \(\mathbf{X^TXX^+ = X^T}\), and thus: \[\mathbf{X^+ = (X^TX)^{-1}X^T}\] which is equivalent to the previous two views. In this case: \[\mathbf{\hat{y} = X(X^TX)^{-1}X^Ty = UU^Ty}\] where \(\mathbf{U}\) here denotes the \(N \times K\) matrix of the thin SVD (the first K columns of the full \(\mathbf{U}\), which is what R’s svd function returns). This is quite similar to the result we got in the geometric point of view, although \(\mathbf{U}\) is a different orthogonal basis from \(\mathbf{Q}\).
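
Continuing the synthetic example from the previous sketches, a quick check that the SVD basis gives the same projection as the QR basis and the hat matrix (names are mine):

s <- svd(X)                             # thin SVD: s$u is N x K here
U <- s$u
y_hat_svd <- U %*% t(U) %*% y           # projection of y onto the column space via U
max(abs(y_hat_svd - Q %*% t(Q) %*% y))  # same as the QR projection: ~ 0
max(abs(y_hat_svd - H %*% y))           # same as the hat-matrix projection: ~ 0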

Goodness of fit

Usually the goodness of fit is measured by \(R^2\), which is defined as: \[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^N(y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^N(y^{(i)} - \bar{y})^2}\] In other words, \(R^2\) measures how much of the variance in \(\mathbf{y}\) is explained by the fitted value \(\mathbf{\hat{y}}\). However, I prefer the “squared correlation” definition, which states that \(R^2 = cor(\mathbf{\hat{y}, y})^2\), because it gets directly at what is usually my primary concern: prediction. But why would \(R^2 = cor(\mathbf{\hat{y}, y})^2\) hold? Let’s try to derive the squared correlation from the basic definition of \(R^2\). Without loss of generality, I will normalize the data so that the averages of both \(\mathbf{\hat{y}}\) and \(\mathbf{y}\) are 0, and thus we have: \[R^2 = 1 - \frac{\sum_{i=1}^N(y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^N(y^{(i)})^2}\] \[ = 1 - \frac{\mathbf{(y - X\theta)^T(y - X\theta)}}{\mathbf{y^Ty}}\] \[ = 1 - \frac{\mathbf{y^Ty -2(X\theta)^Ty + \theta^TX^TX\theta}}{\mathbf{y^Ty}}\] From the geometric interpretation of least squares linear regression, we know that: \[\mathbf{\hat{y}^T\hat{y} = \theta^TX^TX\theta = \hat{y}^Ty = (X\theta)^Ty}\] Substituting this into the equation above, we have: \[R^2 = 1 - \mathbf{\frac{y^Ty - (X\theta)^T(X\theta)}{y^Ty}}\] \[ = \mathbf{\frac{(X\theta)^T(X\theta)}{y^Ty}}\] At the same time, we have: \[cor(\mathbf{\hat{y}, y})^2 = \mathbf{\frac{((X\theta)^Ty)^2}{((X\theta)^TX\theta)\,y^Ty}}\] \[ = \mathbf{\frac{((X\theta)^TX\theta)^2}{((X\theta)^TX\theta)\,y^Ty}}\] \[ = \mathbf{\frac{(X\theta)^T(X\theta)}{y^Ty}} = R^2\]
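
A quick numerical check of the key identity and of the two \(R^2\) forms, still on the synthetic X, y and y_hat from the sketches above (names are mine):

sum(y_hat * y_hat) - sum(y_hat * y)                       # ~ 0: the identity y_hat' y_hat = y_hat' y
R2_def <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)   # 1 - SS_res / SS_tot
R2_cor <- cor(y_hat, y)^2                                 # squared-correlation definition
print(c(R2_def, R2_cor))                                  # the two values should agree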

Practice

Once we understand what’s underneath linear regression, we can do our own calculations to obtain the results of linear regression problems and compare them to the results of standard linear regression packages. Here we will use the cats data as an example. Cat heart weight is related to cat body weight, and we will try to regress heart weight against body weight. First, let’s use R’s standard linear regression function lm to obtain the linear regression results.

library(MASS); data(cats)
fit = lm(Hwt ~ Bwt, data = cats)
summary(fit)
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = cats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## Bwt           4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16
plot(cats$Bwt, cats$Hwt, xlab = "Body Weight", ylab = "Heart Weight")
abline(fit, col = 'red')

Now let’s use the equation for \(\theta\) we derived above to calculate the regression coefficients, and compare them to the lm results:

y = cats$Hwt
X = cbind(rep(1, nrow(cats)), cats$Bwt)
theta = solve(t(X) %*% X) %*% t(X) %*% y
print(theta)
##            [,1]
## [1,] -0.3566624
## [2,]  4.0340627

We can see that the obtained coefficients for body weight and intercept are exactly the same as the lm results. Now let’s use the geometric point of view, or QR decomposition to solve for the coefficients:

qr.result = qr(X)
Q = qr.Q(qr.result); R = qr.R(qr.result)
theta = solve(R) %*% t(Q) %*% y
print(theta)
##            [,1]
## [1,] -0.3566624
## [2,]  4.0340627

Again we got exactly the same results as lm. Next, we will use SVD to solve the regression problem:

s = svd(X)
U = s$u; D_plus = diag(1/s$d); V = s$v;
theta = V %*% D_plus %*% t(U) %*% y
print(theta)
##            [,1]
## [1,] -0.3566624
## [2,]  4.0340627

Finally, let’s calculate \(R^2\) from the correlation between the fitted heart weight and the observed heart weight in the data set.

y_hat = X %*% theta
print(paste("Calculated R^2 from correlation:", sprintf("%.4f", cor(y_hat, y)^2)))
## [1] "Calculated R^2 from correlation: 0.6466"

This \(R^2\) agrees exactly with the \(R^2\) calculated by lm.

Interpretation of \(\theta\)

Previously we showed that for single-variable linear regression, \(\beta = cov(\mathbf{y, x})/var(\mathbf{x})\). However, for multivariate linear regression, \(\theta_i = cov(\mathbf{y, x}_i)/var(\mathbf{x}_i)\) if and only if \(cov(\mathbf{x}_i, \mathbf{x}_j) = 0\), \(\forall i, j = 1...K, i \neq j\). When the column vectors of \(\mathbf{X}\) are correlated, \(|\theta_i|\) does not necessarily represent the relative importance of \(\mathbf{x}_i\). Let’s use two-variable linear regression as an example. Assume that the correlation between \(\mathbf{x}_1\) and \(\mathbf{x}_2\) is \(\rho\), and that we normalize \(\mathbf{x}_1\) and \(\mathbf{x}_2\) so that both of their standard deviations are 1. We can easily prove that \(\theta_1 = \frac{cov(\mathbf{x}_1, \mathbf{y}) - \rho \times cov(\mathbf{x}_2, \mathbf{y})}{1 - \rho^2}\) and \(\theta_2 = \frac{cov(\mathbf{x}_2, \mathbf{y}) - \rho \times cov(\mathbf{x}_1, \mathbf{y})}{1 - \rho^2}\) (see the numerical check after the list below). We can see that even if \(cov(\mathbf{x}_2, \mathbf{y}) > 0\), it is still possible that \(\theta_2 < 0\), and thus \(\theta_i\) cannot simply be used to determine the correlation between \(\mathbf{x}_i\) and \(\mathbf{y}\). In multiple linear regression, in order to judge the relative importance of \(\mathbf{x}_i\) from \(\theta_i\), several criteria must be satisfied:

  • \(\mathbf{x}_i\) must be normalized to a mean of 0 and a standard deviation of 1
  • \(cov(\mathbf{x}_i, \mathbf{y}), \forall i \in 1...K\) must have the same sign
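
Here is the numerical check of the two-variable formulas mentioned above, using a small simulation (all names and the chosen true coefficients are mine):

set.seed(7)
n <- 5000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)                           # x2 strongly correlated with x1
x1 <- as.numeric(scale(x1)); x2 <- as.numeric(scale(x2))  # standardize to sd 1, as assumed above
y2 <- 1.0 * x1 + 0.2 * x2 + rnorm(n)                      # both true coefficients are positive
rho <- cor(x1, x2)
fit2 <- lm(y2 ~ x1 + x2)
theta1 <- (cov(x1, y2) - rho * cov(x2, y2)) / (1 - rho^2)
theta2 <- (cov(x2, y2) - rho * cov(x1, y2)) / (1 - rho^2)
print(rbind(lm = coef(fit2)[c("x1", "x2")], formula = c(theta1, theta2)))  # rows should match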

Let’s take an example using the diamonds data.

library(ggplot2); data("diamonds")
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

We will use the numerical features of diamonds (carat, depth, table, x, y, z) to fit the price. From the correlation matrix we can see that the carat, or weight, of the diamond is highly correlated with its dimensions x, y and z.

y = diamonds$price
X = as.matrix(diamonds[, c("carat", "depth", "table", "x", "y", "z")])
cor(X)
##            carat       depth      table           x           y          z
## carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
## depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
## table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
## x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
## y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
## z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000

Common sense tells us that the price of a diamond must be positively correlated with the diamond’s weight and dimensions. However, due to the high correlation between weight and dimensions, let’s see what happens to the coefficient of each variable when we do a multiple linear regression:

fit = lm(y ~ X)
summary(fit)
## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23878.2   -615.0    -50.7    347.9  12759.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20849.316    447.562  46.584  < 2e-16 ***
## Xcarat      10686.309     63.201 169.085  < 2e-16 ***
## Xdepth       -203.154      5.504 -36.910  < 2e-16 ***
## Xtable       -102.446      3.084 -33.216  < 2e-16 ***
## Xx          -1315.668     43.070 -30.547  < 2e-16 ***
## Xy             66.322     25.523   2.599  0.00937 ** 
## Xz             41.628     44.305   0.940  0.34744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53933 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 5.486e+04 on 6 and 53933 DF,  p-value: < 2.2e-16

It’s interesting to see that the coefficient for the dimension x is negative. However, this is due to the high correlation between dimension x and carat. We cannot interpret the result as saying that diamond price is negatively correlated with dimension x. If we want to see the contribution of dimension x on top of carat, we should regress x against carat and use the residual as the variable in the multiple linear regression, as sketched below.
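
Here is a minimal sketch of that idea (my own illustration, not part of the original analysis): replace x with the residual of x after regressing it on carat, so the new variable captures only the part of x that carat does not explain. For simplicity the sketch keeps only carat, depth, table and the residualized x.

d2 <- diamonds
d2$x_resid <- resid(lm(x ~ carat, data = d2))             # part of x not explained by carat
fit_resid <- lm(price ~ carat + depth + table + x_resid, data = d2)
coef(fit_resid)["x_resid"]                                # marginal effect of dimension x beyond carat

With this parameterization, the coefficient of x_resid can be read as the contribution of dimension x on top of what carat already accounts for, without the sign flip caused by collinearity.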
