Wednesday, July 22, 2015

Machine Learning and Statistics: a unified perspective | 1.2 Linear Regression Variance Analysis

In the previous article, we derived the solution for the optimal parameters \(\theta = (\theta_1, ..., \theta_K)\). However, because \(\mathbf{y}\) is a random variable, the estimated parameters \(\hat{\theta}\) are random variables as well. Understanding the variance of \(\hat{\theta}\) is important for evaluating our confidence in the optimal values of \(\theta\) obtained through linear regression.

\[\mathbf{\hat{\theta} = (X^TX)^{-1}X^Ty = Wy}\]

From this expression, we can see that \(\hat{\theta}\) is a linear combination of \(\mathbf{y}\), and thus the variance of \(\hat{\theta}\) is related to the variance of \(\mathbf{y}\). So let’s calculate \(\mathbf{y}\)’s variance first. Note that here \(\mathbf{y}\) is treated as a random variable with expected value \(\mathbf{X\theta}\) and variance \(\sigma^2\); in other words, \(y \sim N(\mathbf{\theta^Tx}, \sigma^2)\). When we calculate the variance of \(\hat{\theta}\), we treat \(\hat{\theta}\) as a random variable, and in this context the variance of \(\mathbf{y}\) means the variance of the noise around the regression line, not the total sum of squares, so \(Var(\mathbf{y}) \neq SS_{tot} = \sum_i(y^{(i)} - \bar{y})^2\). To put this into mathematical equations, we have: \[Var(\mathbf{y}) = E[(\mathbf{y - X\theta})^2] = E(\epsilon^2) = \sigma^2\]
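To make this distinction concrete, here is a minimal simulated sketch (not part of the original derivation; the coefficients and noise level are made up for illustration) showing that the noise variance \(\sigma^2\) is generally much smaller than the marginal variance of \(\mathbf{y}\):

# a minimal simulated sketch: noise variance vs. marginal variance of y
set.seed(1)
N <- 1e5
x <- runif(N)
y <- 2 + 3 * x + rnorm(N, sd = 0.5)   # true sigma^2 = 0.25
mean((y - (2 + 3 * x))^2)             # E(eps^2), close to 0.25
var(y)                                # marginal variance of y, inflated by the spread of x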

The variance \(\sigma^2\) needs to be estimated from the sample. An unbiased estimator of \(\sigma^2\) is \(\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_i(y^{(i)} - \hat{y}^{(i)})^2\), which satisfies \[(N-p-1)\hat{\sigma}^2 \sim \sigma^2\chi_{N-p-1}^2,\] a chi-square distribution with \(N-p-1\) degrees of freedom, where \(p\) is the number of predictors used in the linear regression (so the model has \(p+1\) parameters including the intercept).

Thus the variance of \(\hat{\theta}\) is: \[Var(\hat{\theta}) = Var(\mathbf{Wy}) = \mathbf{W}[Var(\mathbf{y})]\mathbf{W^T} = \mathbf{(X^TX)^{-1}}\sigma^2,\] which we estimate in practice by plugging in \(\hat{\sigma}^2\): \(\widehat{Var}(\hat{\theta}) = \mathbf{(X^TX)^{-1}}\hat{\sigma}^2\).
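To see this expression in action, here is a minimal sketch on simulated data (the data and coefficients are made up and are not part of the cats example below), comparing the hand-built \(\mathbf{(X^TX)^{-1}}\hat{\sigma}^2\) with the covariance matrix that R’s vcov() reports for the same fit:

# a minimal sketch on simulated data: (X'X)^{-1} * sigma-hat^2 vs. vcov() from lm
set.seed(1)
N <- 100
x <- runif(N)
y <- 2 + 3 * x + rnorm(N, sd = 0.5)
fit0 <- lm(y ~ x)
X <- cbind(1, x)                                 # design matrix with intercept column
sigma2_hat <- sum(residuals(fit0)^2) / (N - 2)   # unbiased sigma-hat^2, N - p - 1 = N - 2
solve(t(X) %*% X) * sigma2_hat                   # hand-built covariance of theta-hat
vcov(fit0)                                       # should agree up to rounding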

In statistics, we often need to test the significance of the parameters \(\hat{\theta}\). With the variance of the parameters calculated, it is straightforward to do t-tests on \(\hat{\theta}\), because the t-value for each coefficient is simply \(t_j = \hat{\theta}_j / \sqrt{[Var(\hat{\theta})]_{jj}}\), i.e. the coefficient divided by the square root of the corresponding diagonal element of its covariance matrix. Note that \(\mathbf{t}\) is a vector with the same size as the vector \(\hat{\theta}\).

Often we need to test for the significance of groups of coefficients simultaneously, and in this case we use the F statistic: \[F = \frac{(RSS_0 - RSS_1) / (p_1 - p_0)}{RSS_1 / (N-p_1-1)},\] where \(RSS_1\) is the residual sum of squares for the least-squares fit of the bigger model with \(p_1+1\) parameters, and \(RSS_0\) the same for the nested smaller model with \(p_0+1\) parameters, having \(p_1 - p_0\) parameters constrained to be zero. The F statistic measures the change in residual sum of squares per additional parameter in the bigger model, normalized by an estimate of \(\sigma^2\).
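In R, this nested-model comparison is exactly what anova() does for two fitted models. As a minimal sketch (using the built-in mtcars data rather than the cats data of the example below), the hand-computed F from the formula above agrees with the F reported by anova():

# a minimal sketch: F test for nested models, by the formula and by anova()
data(mtcars)
fit_small <- lm(mpg ~ wt, data = mtcars)               # smaller model, p0 = 1 predictor
fit_big <- lm(mpg ~ wt + hp + qsec, data = mtcars)     # bigger model, p1 = 3 predictors
RSS0 <- sum(residuals(fit_small)^2)
RSS1 <- sum(residuals(fit_big)^2)
N <- nrow(mtcars); p1 <- 3; p0 <- 1
((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))    # hand-computed F
anova(fit_small, fit_big)                              # same F in the output table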

Let’s use these equations to calculate the t statistics and F statistic of \(\hat{\theta}\) and compare them with the results obtained by lm in R. In the previous article, we used the cats data and obtained the following results:

library(MASS); data(cats)
fit = lm(Hwt ~ Bwt, data = cats)
summary(fit)
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = cats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## Bwt           4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16
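
As a quick side check (a small sketch using vcov(), not part of the original post), the t values in that summary can be reproduced directly from the fitted object:

# reproduce the summary t values directly from the fitted object
coef(fit) / sqrt(diag(vcov(fit)))   # should match the "t value" column above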

Now let’s use the equations we derived to calculate the t statistics of \(\hat{\theta}\):

y = cats$Hwt                                   # response: heart weight
X = cbind(rep(1, nrow(cats)), cats$Bwt)        # design matrix with intercept column
theta = solve(t(X) %*% X) %*% t(X) %*% y       # theta-hat = (X'X)^{-1} X'y
sigma = sqrt(sum(residuals(fit)^2) / (length(y) - 2))   # sigma-hat with N - p - 1 = N - 2 df
var_theta = solve(t(X) %*% X) * sigma^2        # Var(theta-hat) = (X'X)^{-1} sigma-hat^2
t = theta / sqrt(diag(var_theta))              # t value for each coefficient
print(t)
##            [,1]
## [1,] -0.5152019
## [2,] 16.1193908
RSS0 = sum((y - mean(y))^2)                    # RSS of the null model (intercept only)
RSS1 = sum(residuals(fit)^2)                   # RSS of the full model
p1 = 1; p0 = 0; N = length(y)                  # 1 predictor vs. 0 predictors
F = ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))
print(sprintf("F stats is: %.2f",F))
## [1] "F stats is: 259.83"

We can see that the t statistics and F statistic calculated with our equations are exactly the same as those reported by lm in R.

The derivation of the variance of the parameters \(\theta\) is particularly easy in the case of least-squares linear regression. However, with more sophisticated models, deriving an expression for the variance of the parameters can be difficult. Luckily, with modern computers we can calculate the t-stats using numerical methods, one of which is the bootstrap. The idea of the bootstrap is to resample the data with replacement, estimate the parameters on each resampled data set, and repeat this many times to obtain a distribution of the parameters. The t-stats can then be calculated as the mean of the parameter samples divided by the standard deviation of the parameter samples. We will use the package boot to demonstrate how to use the bootstrap method to calculate t-stats.

library(boot)
# function to obtain the regression coefficients from a bootstrap sample
hat_theta <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(summary(fit)$coef[, 1])
} 
# bootstrapping with 1000 replications 
results <- boot(data=cats, statistic=hat_theta, 
    R=1000, formula=Hwt~Bwt)

# view results
par(mfrow = c(1,2))
hist(results$t[,1], breaks = 20, xlab = "intercept", main = "Intercept")
hist(results$t[,2], breaks = 20, xlab = "slope", main = "Slope")

print(sprintf("t stats for intercept: %.2f", mean(results$t[,1]) / sd(results$t[,1])))
## [1] "t stats for intercept: -0.47"
print(sprintf("t stats for slope: %.2f", mean(results$t[,2]) / sd(results$t[,2])))
## [1] "t stats for slope: 13.29"

The t-stats obtained through the bootstrap are not exactly the same as the results obtained using the equations, but they are reasonable estimates, which is especially useful when an explicit expression is difficult to derive.
