Monday, July 27, 2015

Machine Learning and Statistics: a unified perspective | 2 Generalized Linear Model

In previous articles we discussed linear regression, how to calculate the optimum parameter \(\theta\) that minimizes the loss function, and how to regularize \(\theta\). In linear regression, the variable to be estimated, \(\mathbf{y}\), is a continuous variable which is assumed to follow a Gaussian distribution with mean \(f(\theta, \mathbf{x}) = \theta^T \mathbf{x}\). We now hope to generalize the approach used in linear regression so that it applies to more general situations, where \(\mathbf{y}\) is not Gaussian. It turns out that similar minimization techniques for calculating \(\theta\) can be applied whenever \(\mathbf{y}\) has a distribution in a family of functions called the exponential family.

This article draws heavily on ideas from Machine Learning: A Probabilistic Perspective, Chapter 9, page 281. Please read that chapter if you want more details or a comparison of the two presentations.

Exponential Family

A probability distribution function (pdf) \(p(\mathbf{y|\eta})\) is said to be in the exponential family if it is of the form: \[p(\mathbf{y|\eta}) = \frac{1}{Z(\eta)}h(\mathbf{y})exp[\eta^T\phi(\mathbf{y})]\] \[ = h(\mathbf{y})exp[\eta^T\phi(\mathbf{y}) - A(\eta)]\] where \(Z(\eta)\) is called the partition function, which normalizes \(p(\mathbf{y}|\eta)\) so that it integrates to 1, and \(A(\eta) = log Z(\eta)\) is the log partition function. \(h(\mathbf{y})\) is a scaling constant, often 1. If \(\phi(\mathbf{y}) = \mathbf{y}\), we say it is a natural exponential family.

Let’s take a look at a couple of examples:

Univariate Gaussian: this corresponds to the distribution of y in linear regression \[N(y|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}exp[-\frac{(y-\mu)^2}{2\sigma^2}]\] \[ = \frac{1}{(2\pi\sigma^2)^{1/2}}exp[-\frac{y^2}{2\sigma^2} + \frac{\mu y}{\sigma^2} - \frac{\mu^2}{2\sigma^2}]\] \[ = exp[\eta^T \phi(\mathbf{y}) - A(\eta)]\] where \(\eta = (\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2})\), \(\phi(\mathbf{y}) = (y, y^2)\), and the remaining terms are absorbed into \(A(\eta)\).
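
As a quick sanity check, the exponential-family form above can be verified numerically. Below is a minimal R sketch (all numbers are arbitrary) comparing the rewritten density against dnorm:

# Write N(y | mu, sigma^2) in exponential-family form and compare to dnorm().
# eta = (mu/sigma^2, -1/(2*sigma^2)), phi(y) = (y, y^2); the remaining terms go into A(eta).
mu <- 1.5; sigma <- 2; y <- 0.7

eta <- c(mu / sigma^2, -1 / (2 * sigma^2))
phi <- c(y, y^2)
A   <- mu^2 / (2 * sigma^2) + 0.5 * log(2 * pi * sigma^2)   # log partition function

exp(sum(eta * phi) - A)          # exponential-family form
dnorm(y, mean = mu, sd = sigma)  # should match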

Multinoulli: this corresponds to the distribution of y in logistic regression, which we will discuss soon. In this case the response y is not continuous, but has several levels: \(Cat(y|\mu) = \prod_{k=1}^K\mu_k^{y_k}\), where \(y_k = I(y=k)\) \[Cat(y|\mu) = \prod_{k=1}^K\mu_k^{y_k} = exp[\sum_{k=1}^K y_k log(\mu_k)]\] \[ = exp[\sum_{k=1}^{K-1} y_k log(\frac{\mu_k}{\mu_K}) + log(\mu_K)]\] where we used \(y_K = 1 - \sum_{k=1}^{K-1} y_k\) and \(\mu_K = 1 - \sum_{k=1}^{K-1} \mu_k\). Thus \[Cat(y|\eta) = exp[\eta^T \phi(\mathbf{y}) - A(\eta)]\] where \(\eta = (log(\mu_1 / \mu_K), ..., log(\mu_{K-1} / \mu_K))\), \(\phi(y) = (I(y = 1), ..., I(y = K-1))\). We can recover the mean parameters from the canonical parameters using: \(\mu_k = \frac{exp(\eta_k)}{1 + \sum_{j=1}^{K-1}exp(\eta_j)}\), \(\mu_K = 1 - \frac{\sum_{j=1}^{K-1}exp(\eta_j)}{1 + \sum_{j=1}^{K-1}exp(\eta_j)} = \frac{1}{1 + \sum_{j=1}^{K-1}exp(\eta_j)}\)
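
A minimal R sketch of the last step (the \(\mu\) values below are arbitrary, with K = 4): going from the mean parameters to the canonical parameters and back reproduces the original \(\mu\).

# Recover the mean parameters mu_k from the canonical parameters eta
# using mu_k = exp(eta_k) / (1 + sum_j exp(eta_j)), with class K as the baseline.
mu  <- c(0.2, 0.3, 0.1, 0.4)                 # arbitrary example with K = 4
eta <- log(mu[1:3] / mu[4])                  # eta_k = log(mu_k / mu_K)

mu_recovered <- exp(eta) / (1 + sum(exp(eta)))
c(mu_recovered, 1 / (1 + sum(exp(eta))))     # reproduces the original mu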

Bernoulli: this is the special case of the Multinoulli with K = 2. In this case \(\mu = \frac{exp(\eta)}{1 + exp(\eta)} = \frac{1}{1 + exp(-\eta)}\), the logistic sigmoid function.

In all of these models, \(\mu\) is related to \(\eta\) through an invertible function, \(\mu = g^{-1}(\eta)\), where \(g\) is called the link function. We are building functions \(f(\theta, \mathbf{x})\) to estimate the mean value of y, which is \(\mu\). For generalized linear models, we take the linear predictor \(\gamma = \theta^T\mathbf{x}\) and assume \(\mu = g^{-1}(\eta) = f(\theta, \mathbf{x}) = f(\gamma)\)
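
As a small illustration, here is a sketch of the inverse-link (mean) functions \(g^{-1}\) for the three distributions above, written as R functions of the linear predictor (the function names are my own, not from any package):

# The inverse-link (mean) functions g^{-1}(eta) for the three distributions above,
# where eta = X %*% theta in a generalized linear model.
mean_gaussian    <- function(eta) eta                    # identity link
mean_bernoulli   <- function(eta) 1 / (1 + exp(-eta))    # logistic sigmoid
mean_multinoulli <- function(Eta) {                      # Eta: N x (K-1) matrix
  E <- exp(Eta)
  E / (1 + rowSums(E))                                   # class K is the baseline
}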

ML and MAP estimation of Exponential family functions

One of the appealing properties of generalized linear models (GLMs) is that the optimum parameters \(\theta\), which minimize the loss function (defined as the negative log likelihood), can be calculated using exactly the same methods as in linear regression. The negative log-likelihood has the following form: \[ L(\vec{\theta}) = -log[p(\mathbf{y} | \vec{\eta}(\vec{\theta}), \mathbf{X})] = -\vec{\eta}(\vec{\theta})^T\phi(\mathbf{y}) + A(\vec{\eta}(\vec{\theta}))\] \(\eta\) is related to \(\mu\), and \(\mu\) is estimated from \(\theta^T\mathbf{x}\) in linear models, which is why the loss function is a function of \(\theta\). To minimize \(L(\vec{\theta})\), the derivative of \(L(\vec{\theta})\) must be 0. Thus we have: \[ \nabla_{\theta} L = \frac{\partial L}{\partial \vec{\theta}} = \mathbf{X}^T\left(\frac{\partial L}{\partial \vec{\eta}}\frac{\partial \vec{\eta}}{\partial \vec{\mu}}\frac{\partial \vec{\mu}}{\partial \vec{\gamma}}\right) = 0\] If we simply choose \(\gamma = \eta\) (the canonical link), we have: \[ \Rightarrow \mathbf{X}^T(\phi(\mathbf{y}) - \frac{\partial A}{\partial \vec{\eta}}) = 0\]

For an exponential family pdf, the log partition function \(A(\vec{\eta})\) has the following very nice properties: \[ \frac{\partial A}{\partial \vec{\eta}} = E[\phi(\mathbf{y})]\] \[ \frac{\partial^2 A}{\partial \vec{\eta}^2} = cov[\phi(\mathbf{y}), \phi(\mathbf{y})]\]
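
As a quick check, for the Bernoulli distribution (the K = 2 case of the Multinoulli above), the log partition function is \(A(\eta) = log(1 + exp(\eta))\), and indeed \[\frac{\partial A}{\partial \eta} = \frac{exp(\eta)}{1 + exp(\eta)} = \mu = E[y]\] \[\frac{\partial^2 A}{\partial \eta^2} = \mu(1 - \mu) = var[y]\]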

Thus we have the condition satisfied by the minimizing \(\vec{\theta}\): \[ \nabla_{\theta} L = \mathbf{X}^T(E[\phi(\mathbf{y})] - \phi(\mathbf{y})) = 0\] \[ \Rightarrow \mathbf{X}^T(\phi(\mathbf{y}) - \mu(\vec{\eta})) = 0\]

Let’s use the univariate Gaussian as an example first. Previously we showed that \(\eta = (\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2})\), \(\phi(\mathbf{y}) = (y, y^2)\). However, \(\eta_2 = -\frac{1}{2\sigma^2}\) is not related to \(\vec{\theta}\), so we can ignore \(\eta_2\) and the corresponding component \(\phi(\mathbf{y})_2 = y^2\); the equation then becomes: \[ \mathbf{X^T(y - X\theta) = 0}\] which is the same as the result we derived previously.
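
This condition can be checked directly on simulated data. A minimal sketch (the data and coefficients below are arbitrary) solving the normal equations and comparing against R's built-in lm():

# Solving X^T (y - X theta) = 0, i.e. the normal equations X^T X theta = X^T y,
# on simulated data; the result matches ordinary least squares via lm().
set.seed(42)
N <- 100
X <- cbind(1, rnorm(N))                        # design matrix with an intercept column
y <- drop(X %*% c(2, -1) + rnorm(N, sd = 0.5)) # true theta = (2, -1) plus noise

theta_hat <- solve(t(X) %*% X, t(X) %*% y)
theta_hat
coef(lm(y ~ X[, 2]))                           # same estimates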

From this result we can see that even though \(\phi(\mathbf{y})\) may be a vector and may contain basis functions other than \(\mathbf{y}\), we are only interested in estimating \(\mathbf{y}\). Thus we can rewrite the condition for the optimum as: \[\mathbf{X}^T(\mathbf{y} - \mu(\mathbf{X}\theta)) = 0\]

Then let’s use the Multinoulli as an example. Previously we showed that for the Multinoulli, \(\mu_k = \frac{exp(\eta_k)}{1 + \sum_{j=1}^{K-1}exp(\eta_j)}\), \(\phi(\mathbf{y}) = (I(y = 1), ..., I(y = K-1))\), \(\mu_K = \frac{1}{1 + \sum_{j=1}^{K-1}exp(\eta_j)}\). Then there will be K-1 equations for \(\frac{\partial L}{\partial \Theta} = 0\), and thus \(\theta\) becomes an M-by-(K-1) matrix \(\Theta_{M \times (K-1)}\), where M is the number of features in \(\mathbf{X}_{N \times M}\) and K is the number of classes: \[\frac{\partial L}{\partial \Theta_k} = \mathbf{X}^T[I(\mathbf{y} = k) - \frac{exp(\eta_k)}{1 + \sum_{j=1}^{K-1}exp(\eta_j)}] = 0\] \[ \Rightarrow \mathbf{X}^T[I(\mathbf{y} = k) - \frac{exp(\mathbf{X}\Theta_k)}{1 + \sum_{j=1}^{K-1}exp(\mathbf{X}\Theta_j)}] = 0\] where \(k = 1...K-1\) and \(\Theta_k\) is the k-th column of \(\Theta\)
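
A minimal R sketch of this stationarity condition (simulated data with arbitrary dimensions); the optimal \(\Theta\) is the one that drives the gradient matrix below to zero:

# Each column k of the gradient is X^T [I(y = k) - mu_k(X Theta)], k = 1..K-1.
set.seed(1)
N <- 100; M <- 3; K <- 4
X <- cbind(1, matrix(rnorm(N * (M - 1)), N))          # N x M design matrix
y <- sample(1:K, N, replace = TRUE)                   # class labels 1..K
Theta <- matrix(0, M, K - 1)                          # M x (K-1) parameter matrix

softmax_probs <- function(X, Theta) {
  E <- exp(X %*% Theta)                               # N x (K-1)
  E / (1 + rowSums(E))                                # mu_k; class K is the baseline
}

Y <- outer(y, 1:(K - 1), "==") * 1                    # indicator matrix I(y = k)
grad <- t(X) %*% (Y - softmax_probs(X, Theta))        # the optimal Theta sets this to 0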

In the case of the Bernoulli, when K = 2, this becomes: \(\mathbf{X}^T[\mathbf{y} - \frac{1}{1 + exp(-\mathbf{X}\theta)}] = 0\)

Gradient descent

The equation \(\mathbf{X}^T(\phi(\mathbf{y}) - \mu(\mathbf{X}\theta)) = 0\) is not necessarily easy to solve directly. We may need to use numerical methods such as Newton’s method. For Newton’s method, the parameters \(\vec{\theta}\) are updated with the following rule: \[ \vec{\theta} := \vec{\theta} - \frac{\nabla_{\theta} L}{\nabla_{\theta}^2 L}\] \[ = \vec{\theta} - \mathbf{H}^{-1}\nabla_{\theta}L\] where \(\mathbf{H} = \nabla_{\theta}^2 L = \mathbf{X^TDX}\), and \(\mathbf{D}\) is a diagonal matrix with \(\mathbf{D}_{ii} = \frac{d\mu^{(i)}}{d\eta^{(i)}}\). \(\mathbf{H}^{-1}\) acts as a near-optimal learning rate and usually leads to quadratic convergence. However, if the loss function is not twice differentiable, we can always replace \(\mathbf{H}^{-1}\) with some fixed learning rate \(\alpha\), and \(\theta\) should still converge, although possibly more slowly. This method is called gradient descent.
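
A minimal sketch of Newton’s method for logistic regression on simulated data (the true coefficients and the 10 iterations below are arbitrary choices); for the Bernoulli case \(\mathbf{D}_{ii} = \mu^{(i)}(1 - \mu^{(i)})\):

# Newton's method for logistic regression: H = X^T D X with D_ii = mu_i (1 - mu_i).
set.seed(1)
N <- 200
X <- cbind(1, rnorm(N))
y <- rbinom(N, 1, plogis(X %*% c(-0.5, 1.2)))         # simulated 0/1 responses

theta <- c(0, 0)
for (iter in 1:10) {
  mu   <- plogis(X %*% theta)                # current mean estimates
  grad <- t(X) %*% (mu - y)                  # gradient of the negative log-likelihood
  D    <- diag(as.vector(mu * (1 - mu)))
  H    <- t(X) %*% D %*% X                   # Hessian
  theta <- theta - solve(H, grad)            # Newton update
}
theta
coef(glm(y ~ X[, 2], family = binomial))     # should agree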

If we rewrite the gradient method in scalar form, we have: \[ \theta_j := \theta_j + \alpha\sum_{i=1}^N x_j^{(i)}(y^{(i)} - \mu^{(i)})\] This sums over all the data points from 1 to N, which can be an expensive calculation when N is large. However, it is possible to update \(\theta\) one data point at a time: \[ \theta_j := \theta_j + \alpha x_j^{(i)}(y^{(i)} - \mu^{(i)}), i = 1...N\] This is an online algorithm called stochastic gradient descent, which allows an almost unlimited amount of data to be processed, even if the data set is too large to fit into memory. The convergence of this method usually looks more stochastic, but it still converges eventually, and it often converges faster than full gradient descent, especially when N is large.
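
Continuing with the simulated X, y, and N from the Newton sketch above, here is a minimal sketch of the stochastic update (the learning rate and number of epochs are arbitrary):

# Stochastic gradient descent: update theta one (randomly ordered) data point at a time.
alpha <- 0.1
theta <- c(0, 0)
for (epoch in 1:20) {
  for (i in sample(N)) {                              # visit points in random order
    mu_i  <- plogis(sum(X[i, ] * theta))              # mean estimate for point i
    theta <- theta + alpha * X[i, ] * (y[i] - mu_i)   # single-point update
  }
}
theta   # close to the Newton / glm estimates above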

Mini-batch gradient descent uses a subset of the data to update \(\theta\) at each step, so it sits somewhere between full gradient descent and stochastic gradient descent.
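
A minimal sketch, again reusing the simulated X, y, and N from above (the batch size, learning rate, and number of steps are arbitrary):

# Mini-batch gradient descent: each update uses a random batch of points.
alpha <- 0.1; batch_size <- 20
theta <- rep(0, ncol(X))
for (step in 1:500) {
  idx   <- sample(N, batch_size)                      # a random mini-batch
  mu_b  <- plogis(X[idx, ] %*% theta)
  theta <- theta + alpha * t(X[idx, ]) %*% (y[idx] - mu_b) / batch_size
}
theta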

Regularization

Similar to regularization in linear regression, a Lagrange multiplier for the constraint can easily be added to the loss function. For example, if the constraint is \(\sum_i \theta_i^2 \leq t\), then the corresponding loss function would be: \[ L(\vec{\theta}) = -\vec{\eta}(\vec{\theta})^T\phi(\mathbf{y}) + A(\vec{\eta}(\vec{\theta})) + \frac{\lambda}{2} \theta^T\theta\] and the corresponding condition for the optimum would be: \[ \mathbf{X}^T(\mathbf{y} - \mu(\mathbf{X}\theta)) - \lambda \theta = 0\]
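
A minimal sketch, continuing with the simulated X and y from the Newton example above: the ridge (L2) penalty simply adds \(\lambda\theta\) to the gradient and \(\lambda\mathbf{I}\) to the Hessian in the Newton updates (\(\lambda\) below is an arbitrary choice).

# Ridge-regularized Newton updates for logistic regression.
lambda <- 0.1
theta  <- rep(0, ncol(X))
for (iter in 1:10) {
  mu    <- plogis(X %*% theta)
  grad  <- t(X) %*% (mu - y) + lambda * theta
  H     <- t(X) %*% diag(as.vector(mu * (1 - mu))) %*% X + lambda * diag(ncol(X))
  theta <- theta - solve(H, grad)
}
theta   # shrunk toward zero relative to the unregularized fit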

Example: the "Don’t Overfit" Kaggle competition

The R package “glmnet” (generalized linear models with lasso or elastic-net regularization) is a great package covering the most useful distributions in the exponential family, and the estimated parameters can be regularized through the elastic net, an efficient algorithm that can trade off between Ridge and Lasso as we discussed in the previous article. To demonstrate the use of glmnet, we use the data from a Kaggle competition named Don’t Overfit. The data set is a simulated data set with 200 variables and 20,000 cases, but you are only given the Target value of 250 cases; the task is to build a model that gives the best predictions on the remaining 19,750 cases. Target values are either 1 or 0, which is a Bernoulli distribution, so we should use logistic regression to fit the Target value. However, with 200 features and only 250 data points, the model is likely to overfit without proper parameter selection and regularization. I am using the solution provided by Zach Mayer to demonstrate the use of glmnet to fit a logistic regression, but I skipped the feature selection part. We will discuss feature selection later.

setwd("C:/gitProjects/Kaggle")
library('caTools'); library('caret'); library('glmnet')
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2
library('ipred'); library('e1071')
############################
# Load the Data, choose target, create train and test sets
############################

Data <- read.csv("overfitting.csv", header=TRUE)

#Choose Target
Data$Target <- as.factor(ifelse(Data$Target_Practice ==1,'X1','X0'))
Data$Target_Evaluate = NULL
Data$Target_Leaderboard = NULL
Data$Target_Practice = NULL
xnames <- setdiff(names(Data),c('Target','case_id','train'))

#Order
Data <- Data[,c('Target','case_id','train',xnames)]

#Split to train and test
trainset = Data[Data$train == 1,]
testset = Data[Data$train == 0,]

#Remove unwanted columns
trainset$case_id = NULL
trainset$train = NULL

Now we have a data frame trainset with 200 features and 1 target value of either “X0” or “X1”. We need to fit the data with logistic regression and regularize it with suitable elastic-net parameters. The optimum regularization parameters can be chosen through a grid search with bootstrap resampling, and we use ROC (AUC) as the performance measure to select the best regularization parameters.

MyTrainControl=trainControl(
        method = "boot",
        number=25,
        returnResamp = "all",
        classProbs = TRUE,
        summaryFunction=twoClassSummary
        )
X <- trainset[,xnames]
y <- trainset$Target
model <- train(Target ~ .,data=trainset,method='glmnet',
    metric = "ROC",
    tuneGrid = expand.grid(.alpha=c(0,1),.lambda=seq(0,.25,by=0.005)),
    trControl=MyTrainControl)
plot(model, metric='ROC', type='p')

Finally, let’s evaluate the fitted results using the AUC metric. Even with this very simple model, without any feature selection, we achieve an AUC of about 0.87.

test <- predict(model, newdata=testset, type  = "prob")
colAUC(test, testset$Target, plotROC = T)

##                  X0        X1
## X0 vs. X1 0.8678578 0.8678578