Tuesday, August 4, 2015

Machine Learning and Statistics: a unified perspective | 4 Decision Trees

Previously we discussed using Generalized Linear Models (GLM) to estimate continuous variables (linear regression) and categorical variables (logistic regression). A GLM minimizes a loss function with respect to parameters \(\theta\), and such \(\theta\) is global in the sense that \(\theta\) does not change with the value of the variable \(\mathbf{x}\). GLM models are less prone to overfitting when the effective degrees of freedom is not large, but they may not handle local nonlinearities properly. Consider a k-nearest-neighbor (KNN) algorithm: a data point \((y, \mathbf{x})\)’s target value y is assigned the mean of the k data points nearest to \(\mathbf{x}\). This algorithm can adapt to highly local nonlinearities, but it can also have high variance and significantly overfit the data.

For models using adaptive local functions, we can write the model in the following form: \[f(\mathbf{x}) = E[y|\mathbf{x}] = \sum_{m=1}^M w_m I(\mathbf{x} \in R_m) = \sum_m w_m\phi(\mathbf{x}; \mathbf{v}_m)\] where \(R_m\) is the m-th region, \(w_m\) is the mean response in that region, and, more generally, \(\phi(\mathbf{x}; \mathbf{v}_m)\) is a local basis function which is not necessarily \(I(\mathbf{x} \in R_m)\). For decision trees, \(\mathbf{v}_m\) encodes the choice of variable to split on and the threshold value. This makes it clear that a decision tree based model is just an adaptive basis-function model, where the basis functions define the regions, and the weights specify the response value in each region.
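As a toy illustration (the regions and weights below are made up for illustration, not fitted from any data), such a piecewise-constant model can be written out directly:

# A toy piecewise-constant model on one variable x, with hand-picked regions
# R_1 = (-Inf, 0], R_2 = (0, 5], R_3 = (5, Inf) and weights w_m.
f <- function(x) {
  w <- c(1.0, 2.5, -0.5)                        # mean response in each region
  basis <- cbind(x <= 0, x > 0 & x <= 5, x > 5) # indicators I(x in R_m)
  drop(basis %*% w)                             # f(x) = sum_m w_m * I(x in R_m)
}
f(c(-2, 3, 10))   # returns 1.0  2.5 -0.5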

Building a Tree

Finding the optimal partitioning of the data is NP-complete (Hyafil and Rivest 1976), so it is common to use a greedy procedure that computes a locally optimal MLE. The split procedure is the following: \[ (j^*, t^*) = \arg\min_{j \in \{1,\dots,K\}} \min_{t \in T_j} \left[ cost(\{\mathbf{x}_i, y_i : x_{i,j} \leq t\}) + cost(\{\mathbf{x}_i, y_i : x_{i,j} > t\}) \right]\]
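As a minimal sketch (not how rpart is implemented internally), the greedy search for a single best split can be written as an exhaustive loop over features and candidate thresholds. The helper names best_split and sse below are made up for illustration, and the regression cost (sum of squared errors, defined in the list that follows) is assumed:

# Exhaustive search for the best single split (j*, t*) of a numeric matrix X
# against a response y, using the sum-of-squared-errors cost for regression.
sse <- function(y) sum((y - mean(y))^2)

best_split <- function(X, y) {
  best <- list(cost = Inf, j = NA, t = NA)
  for (j in seq_len(ncol(X))) {
    for (t in unique(X[, j])) {
      left  <- y[X[, j] <= t]
      right <- y[X[, j] >  t]
      if (length(left) == 0 || length(right) == 0) next   # skip empty partitions
      cost <- sse(left) + sse(right)
      if (cost < best$cost) best <- list(cost = cost, j = j, t = t)
    }
  }
  best   # feature index j*, threshold t*, and the achieved cost
}

best_split(as.matrix(mtcars[, c("wt", "hp")]), mtcars$mpg)  # example on built-in data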

For cost functions, we have the following considerations:

  • Regression: \(cost(\mathcal{D}) = \sum_{i \in \mathcal{D}}(y^{(i)} - \bar{y})^2\)
  • Classification: if dependent variable y has C classes, we define the class-conditional probabilities as \(\pi_c = \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}}I(y^{(i)} = c)\), where \(|\mathcal{D}|\) is the number of data points in data set \(\mathcal{D}\). There are a few similar measures of cost:
    • Misclassification rate: define the most probable class label as \(\hat{y} = \arg\max_c \hat{\pi}_c\); then the corresponding error rate is \(1 - \hat{\pi}_{\hat{y}}\)
    • Entropy: \(H(\pi) = -\sum_{c=1}^C \pi_c\log(\pi_c)\). Note that minimizing entropy is the same as maximizing information gain.
    • Gini index, or the expected error rate: \(\sum_{c=1}^C \pi_c(1-\pi_c)\)

The 3 measures for classification error are similar. Let’s use the two-class case as an example. Assume the probability of class 1 is p; then:

  • misclassification rate = \(1-\max(p, 1-p)\)
  • Entropy = \(-p\log(p) - (1-p)\log(1-p)\)
  • Gini index = \(2p(1-p)\)

All 3 measures are 0 when p = 0 or p = 1, and are maximized when p = 1/2, as the plot below illustrates.
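The short sketch below plots the three measures as functions of p to make this concrete:

# Compare the three impurity measures for a two-class problem as a function
# of p, the probability of class 1.
p <- seq(0.001, 0.999, by = 0.001)
misclass <- 1 - pmax(p, 1 - p)
entropy  <- -p * log(p) - (1 - p) * log(1 - p)
gini     <- 2 * p * (1 - p)
plot(p, entropy, type = "l", col = "red", ylab = "impurity", ylim = c(0, 0.7))
lines(p, gini, col = "blue")
lines(p, misclass, col = "black")
legend("topright", legend = c("entropy", "Gini", "misclassification"),
       col = c("red", "blue", "black"), lty = 1)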

Let’s take an example of how to build a tree. Here we use the R package rpart and its example data set kyphosis. First we plot the kyphosis data below. The data set has a dependent variable Kyphosis with two categories, present and absent, and 3 variables: Age, Number and Start. For more details, please refer to the rpart documentation (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf).

# Plot the kyphosis data from the rpart package
library(rpart); library(ggplot2)
plt <- ggplot(kyphosis, aes(x=Age, y=Start)) + 
     ggtitle("kyphosis data") +
     geom_point(aes(fill=factor(Kyphosis)), size=3, pch=21)
print(plt)

We build a tree using all 3 variables. To avoid overfitting, we need to set some limits on the complexity of the tree. In the rpart.control function below, minsplit is the minimum number of observations that must exist in a node in order for a split to be attempted, and cp is the complexity parameter: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. A decision tree is built using the code below.

fit <- rpart(Kyphosis ~ Age + Number + Start,
    method="class", data=kyphosis, control = rpart.control(minsplit = 20, cp = 0.01))
plot(fit, uniform=TRUE, 
    main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

To illustrate how cp limits the depth of the tree, we print how the cp parameter changes as the number of splits goes up. As shown below, the last row, which corresponds to a tree with 4 splits, has a cp of 0.01, which is the limit we set in rpart.control. Thus splitting stops at 4. The optimal value of cp can be determined by cross-validation.

printcp(fit) # display the results 
## 
## Classification tree:
## rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
##     method = "class", control = rpart.control(minsplit = 20, 
##         cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] Age   Start
## 
## Root node error: 17/81 = 0.20988
## 
## n= 81 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.176471      0   1.00000 1.00000 0.21559
## 2 0.019608      1   0.82353 0.94118 0.21078
## 3 0.010000      4   0.76471 1.00000 0.21559

Another criterion that can be used to prune a tree is a cost-complexity criterion like the following. Let T be the number of terminal nodes in the tree, and for terminal node t with region \(R_t\) define: \[N_t = \#\{x_i \in R_t\}\] \[\hat{c}_t = \frac{1}{N_t}\sum_{x_i \in R_t} y_i\] \[Q_t = \sum_{x_i \in R_t} (y_i - \hat{c}_t)^2\] Then the cost complexity is: \[ C_{\alpha} = \sum_{t=1}^T Q_t + \alpha T\] where \(\alpha\) is a coefficient that can be determined by cross-validation.
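In rpart, the cross-validated error for each value of cp is already reported in the cp table above. A minimal sketch (using the fit object created earlier) of choosing the cp with the lowest cross-validated error and pruning the tree accordingly:

# Pick the cp with the lowest cross-validated error (xerror) from the cp
# table, then prune the tree back to that complexity. Illustrative sketch
# using the `fit` object created above.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
plot(pruned, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
text(pruned, use.n=TRUE, cex=.8)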

The advantages of decision trees include:

  • they are easy to interpret based on the split results
  • they can easily handle mixed discrete and continuous inputs
  • they are insensitive to monotone transformations of the inputs because the split points are based on ranking the data points
  • they perform automatic variable selection
  • they are relatively robust to outliers
  • they scale well to large data sets

However, decision trees have an important disadvantage: they overfit due to the greedy nature of the tree construction algorithm, even with careful pruning. A related problem is that trees are unstable: small changes to the input data can have large effects on the structure of the tree. Solutions to this problem are discussed in the following sections.

Bagging and Random Forest

An important solution to the problems of a single decision tree is to use a bootstrap approach: create a large number of resampled data sets, fit a decision tree to each of them, and average over all the trees. This technique is called bagging, which stands for “bootstrap aggregating”.
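As a sketch of how bagging could be done by hand (the example further below uses the randomForest package instead), one could draw bootstrap samples of the kyphosis data, fit an rpart tree to each, and average the predicted class probabilities; the number of trees and the random seed here are arbitrary choices for illustration:

# Bagging by hand: B bootstrap samples, one tree per sample, average the
# predicted probabilities of class "present" over all trees.
library(rpart)
set.seed(1)                                     # arbitrary seed, for reproducibility
B <- 100
probs <- replicate(B, {
  idx  <- sample(nrow(kyphosis), replace = TRUE)          # bootstrap sample
  tree <- rpart(Kyphosis ~ Age + Number + Start,
                data = kyphosis[idx, ], method = "class")
  predict(tree, kyphosis)[, "present"]                    # P(present) for each point
})
bagged_pred <- ifelse(rowMeans(probs) > 0.5, "present", "absent")
mean(bagged_pred != kyphosis$Kyphosis)                    # in-sample error rate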

Unfortunately, simply re-running the same learning algorithm on different subsets of the data can result in highly correlated predictors, which limits the amount of variance reduction that is possible. To decorrelate the base learners, each tree is grown on a randomly chosen subset of the data as well as a randomly chosen subset of the features. This technique is called Random Forest. Let’s take a look at an example of Random Forest using the same kyphosis data set:

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
fit <- randomForest(Kyphosis ~ Age + Number + Start,   data=kyphosis)
print(fit) # view results 
## 
## Call:
##  randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 20.99%
## Confusion matrix:
##         absent present class.error
## absent      60       4   0.0625000
## present     13       4   0.7647059

Unfortunately, methods that use multiple trees lose their nice interpretability properties. Fortunately, various post-processing measures can be applied, and we can still see the relative importance of features from Random Forest.

importance(fit) # importance of each predictor
##        MeanDecreaseGini
## Age            8.837115
## Number         5.305533
## Start          9.797556
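The same ranking can also be visualized directly; a one-line sketch using randomForest's varImpPlot:

varImpPlot(fit, main = "Variable importance for kyphosis")  # dot chart of MeanDecreaseGini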

Boosting

Boosting is a greedy algorithm for fitting adaptive basis-function models by applying a weak learner sequentially to weighted versions of the data, where more weight is given to examples that were misclassified by earlier rounds. In math terms, boosting is a forward stagewise procedure: \[ \hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^N L(y^{(i)}, f_{m-1}(x^{(i)}) + T(x^{(i)}, \theta_m))\]

If the loss function is an exponential function \(L(y, f(x)) = \exp(-yf(x))\), the solution is particularly simple, and this stagewise method using the exponential loss function is referred to as AdaBoost. It can be proven relatively easily that the solution for AdaBoost is \(T(x^{(i)}, \theta_m) = \alpha_m k_m(x^{(i)})\), where \(k_m(x) = \arg\min_{k_m}\sum_i w_m^{(i)}\exp(-y^{(i)}k_m(x^{(i)}))\) is the decision tree that minimizes the loss on data set \((\mathbf{y, X})\) with weights \(w_m^{(i)} = \exp(-y^{(i)}f_{m-1}(x^{(i)}))\), and \(\alpha_m = \frac{1}{2}\log\frac{1-err_m}{err_m}\) with \(err_m\) the weighted misclassification error of \(k_m\). A realization of the AdaBoost algorithm is shown below. The in-sample error for the kyphosis data set drops to 0 at around 10 iterations.

adaboost <- function(y, X, ntree = 100) {
  library(rpart)
  # Encode the two classes as -1 / +1
  y <- ifelse(y == levels(y)[1], -1, 1)
  N <- length(y)
  f <- rep(0, N)                      # additive model, starts at 0
  in_sample_error <- c()
  for (ii in 1:ntree) {
    w <- exp(-y * f); w <- w / sum(w) # weights induced by the current model
    fit <- rpart(y ~ ., X, w, method = "class")
    g <- -1 + 2 * (predict(fit, X)[, 2] > 0.5)  # tree prediction in {-1, +1}
    e <- sum(w * ((y * g) < 0))       # weighted misclassification error
    alpha <- 0.5 * log((1 - e) / e)   # weight of this tree in the ensemble
    f <- f + alpha * g
    in_sample_error <- c(in_sample_error, sum((y * f) < 0) / N)
  }
  return(in_sample_error)
}

y = kyphosis$Kyphosis; X = kyphosis[, 2:4]
in_sample_error = adaboost(y, X, 20)
plot(1:20, in_sample_error, xlab = "Iteration", ylab = "In sample error")

The exponential loss function is easy to solve because it is analogous to the squared loss in least-squares regression. However, it is prone to overfitting because it puts too much weight on outliers. More robust loss functions such as the Huber loss or multinomial deviance can be used, but general loss functions require numerical solutions. The gradient boosting algorithm was developed to solve this problem: at the m-th iteration, induce a tree \(T(\mathbf{x}, \theta)\) whose predictions are as close as possible to the negative gradient of the loss function at each data point, \(g_{im} = \frac{\partial L(y^{(i)}, f(x^{(i)}))}{\partial f(x^{(i)})}\Big|_{f = f_{m-1}}\): \[ \hat{\theta}_m = \arg\min_{\theta}\sum_{i=1}^N(-g_{im}-T(x^{(i)}, \theta))^2\]

Since the loss function in gradient boosting can take many different forms, gradient boosting can be used for either classification or regression. The R package gbm implements gradient boosting.
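As an illustration, a minimal gbm sketch on the same kyphosis data (the parameter values below are illustrative rather than tuned; gbm's bernoulli loss expects a 0/1 response):

# Gradient boosting on the kyphosis data with a Bernoulli (logistic) loss.
# Parameter values are illustrative only.
library(gbm)
kyph <- kyphosis
kyph$Kyphosis <- as.numeric(kyph$Kyphosis == "present")   # recode to 0/1
fit_gbm <- gbm(Kyphosis ~ Age + Number + Start, data = kyph,
               distribution = "bernoulli", n.trees = 500,
               interaction.depth = 2, shrinkage = 0.01, cv.folds = 5)
best_iter <- gbm.perf(fit_gbm, method = "cv")   # number of trees chosen by CV
summary(fit_gbm, n.trees = best_iter)           # relative influence of each variable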

Comparison with Linear Regression

Tree-based methods have the following Pros and Cons:

Pros:

  • handling of mixed data types
  • handling of nonlinearity in the response
  • Insensitive to monotone transformations of inputs
  • automatic feature selection / ranking

Cons:

  • Cannot generate linear combinations of features, and the sign of the relationship between a variable and the response cannot be read directly from the model
  • Prone to overfitting and needs proper choice of tree parameters

Linear Regression has the following Pros and Cons:

Pros:

  • Less prone to overfitting due to its simple linear form
  • Easy interpretation of the sign of correlation between variable and response.

Cons:

  • Relative importance of variables in multiple linear regression can only be assessed with normalized input data and the same expected sign of correlation with the response.
  • Cannot capture nonlinearity in the response