Sunday, May 7, 2017

Machine Learning and Statistics: a unified perspective | 5.1 Gradient Boosting - Linear

Gradient Boosting is a generic method for building a strong learner out of additive weak learners. Each weak learner is created by fitting the residual left by the previous step. The mathematical formulation of the problem is as follows:
The goal of the fitting process is to find a function F(x) such that $F(x) = argmin_{F(x)} L(y, F(x)) + \Omega(F) = argmin_{F(x)} \sum_{i = 1}^{n}L(y_i, F(x_i)) + \Omega(F)$, where $L(y, F(x))$ is the loss function and $\Omega(F)$ is the regularization function. We construct F(x) by breaking it down into a sum of M functions such that $F(x) = \sum_{m = 1}^{M} h_m(x)$. Each $h_m(x)$ is derived by fitting the residual $y - F_{m-1}(x)$ from the previous step, such that $h_m(x) = F_m(x) - F_{m-1}(x) = argmin_{h_m(x) \in \mathcal{H}} \sum_{i = 1}^{n} L(y_i, F_{m-1}(x_i) + h_m(x_i)) + \Omega(F_{m-1} + h_m)$.
To solve the minimization problem, gradient boosting borrows the idea of Newton's method. If $L(y, F(x))$ is a convex function, we can minimize $L(y, F(x))$ by sequentially minimizing its local second-order Taylor expansion, i.e., $h_m(x) = argmin_{h_m(x) \in \mathcal{H}} \sum_{i = 1}^{n} [L(y_i, F_{m-1}(x_i)) + grad_i\cdot h_m(x_i) + \frac{1}{2}hess_i\cdot h_m(x_i)^2] + \Omega(F_{m-1} + h_m)$, where $grad_i = \frac{\partial L}{\partial F_{m-1}(x_i)}$ is the first-order derivative and $hess_i = \frac{\partial ^2 L}{\partial F_{m-1}(x_i)^2}$ is the second-order derivative, both of which are independent of $h_m(x)$. If $h_m(x)$ is parameterized by $d\theta$ (a change to the existing parameter $\theta_{m-1}$), then the problem turns from a function optimization into a parameter optimization: $h_m(x) = argmin_{d\theta} \sum_{i = 1}^{n} [grad_i\cdot h_m(d\theta, x_i) + \frac{1}{2}hess_i\cdot h_m(d\theta, x_i)^2] + \Omega(\theta_{m-1} + d\theta)$
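To make the grad/hess bookkeeping concrete, here is a minimal sketch (mine, not part of the original derivation) for the halved squared loss $L = \frac{1}{2}(y - F(x))^2$ used later in this article. It illustrates that the per-point Newton direction $-grad_i/hess_i$ is exactly the residual $y_i - F_{m-1}(x_i)$, which is why "fitting the residual" and "taking a second-order step" coincide for this loss.

import numpy as np

# grad/hess of the halved squared loss L = 1/2 * (y - F)^2 at the current model output
y = np.array([10.0, 20.0, 30.0])    # targets
F = np.array([ 8.0, 21.0, 27.0])    # a made-up current prediction F_{m-1}(x), for illustration only

grad = F - y              # dL/dF   = -(y - F)
hess = np.ones_like(y)    # d2L/dF2 = 1
print(-grad / hess)       # [ 2. -1.  3.]  -- equal to the residual y - F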
To start the iteration, we define $F_{-1}(x) \equiv 0$ and $h_0(x) = \gamma$ (a constant), so that $h_0(x) = F_0(x) = argmin_{\gamma} \sum_{i = 1}^{n} L(y_i, \gamma)$. This is easy to calculate regardless of the form of the loss function. For further iterations, $h(x) \in \mathcal{H}$, where $\mathcal{H}$ is a family of functions called base (or weak) learners. For linear problems, usually $h(x) = \theta^Tx + b$, while for nonlinear problems $h(x)$ is most commonly some type of decision tree, since trees are a very generic non-parametric form of nonlinear functions; other types of nonlinear base learners are also possible. In this article we focus on the linear gradient boosting problem, which is relatively simple. We use the package [XGBoost](https://xgboost.readthedocs.io/en/latest//) for demonstration. In the very simple example below, we regress y = [10,20,30] on x = [1,2,3], which should result in a linear model $F(x) = \theta x + b$, where $\theta$ is expected to be 10 and $b$ is expected to be 0.
In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
In [12]:
df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

# reg:linear is squared-error regression; the gblinear booster fits F(x) = theta^T x + b.
# base_score=0 makes the initial prediction F_0(x) = 0.
params = {"objective": "reg:linear", "booster":"gblinear", "base_score":0}
gbm = xgb.train(dtrain=T_train_xgb, params=params, num_boost_round=20)  # 20 boosting rounds
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print(Y_pred)
[ 39.2363205  48.7781105]
In [11]:
gbm.get_dump()
Out[11]:
['bias:\n1.06916\nweight:\n9.54179\n']
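As a quick sanity check (not part of the original notebook), the dumped bias and weight reproduce the predictions above, since a gblinear model is simply $F(x) = \theta x + b$:

import numpy as np

weight, bias = 9.54179, 1.06916            # values from get_dump() above
print(weight * np.array([4, 5]) + bias)    # approximately [39.2363, 48.7781], matching Y_pred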
We can see that after 20 iterations, XGBoost with the reg:linear objective and the gblinear booster calculated $\theta = 9.54179$ and $b = 1.06916$, which is close to the theoretical solution. Let's take a closer look at how linear boosting gets this result. For linear boosting, $F(x) = \theta^T \mathbf{x} + b = \sum_{m = 1}^{M} h_m(x)$, and at step m we have $F_m(x) = (\theta_{m-1} + d\theta)^T \mathbf{x} + (b_{m-1} + db)$. Here $\mathbf{x}$ can be a vector of size k, meaning that x has k features, and $\theta$ is then a vector of size k as well; in the example above, k = 1. For the loss function we use the L2 loss with elastic net regularization, i.e., $L(y, F(x)) = \frac{1}{2}\sum_{i = 1}^{n}(y_i - F(x_i))^2$ and $\Omega(\theta, b) = \frac{1}{2}\lambda ||\theta||^2 + \alpha ||\theta||_1 + \frac{1}{2}\lambda_b b^2$. Based on the previous derivation, the additive function $h_m(x)$ at each step is: $h_m(x) = argmin_{d\theta, db} G(\theta_{m-1} + d\theta, b_{m-1} + db) = argmin_{d\theta, db} \sum_{i = 1}^{n} [grad_i\cdot (d\theta^T \mathbf{x}_i + db) + \frac{1}{2}hess_i\cdot (d\theta^T \mathbf{x}_i + db)^2] + \frac{1}{2}\lambda ||\theta_{m-1} + d\theta||^2 + \alpha ||\theta_{m-1} + d\theta||_1 + \frac{1}{2}\lambda_b (b_{m-1} + db)^2$, where $grad_i = -(y_i - F_{m-1}(x_i))$ and $hess_i = 1$.
Setting $\frac{\partial G}{\partial d\theta} = 0$, we have: $\sum_i [grad_i \mathbf{x}_i + hess_i \mathbf{x}_i^2\, d\theta + hess_i \mathbf{x}_i\, db] + \lambda\, d\theta + \lambda\, \theta_{m-1} + \alpha \cdot sgn(\theta_{m-1} + d\theta) = 0$. Thus $d\theta = -\frac{\sum_i (grad_i \mathbf{x}_i + hess_i \mathbf{x}_i\, db) + \lambda\, \theta_{m-1} + \alpha \cdot sgn(\theta_{m-1} + d\theta)}{\sum_i hess_i \mathbf{x}_i^2 + \lambda}$
Setting $\frac{\partial G}{\partial db} = 0$ and taking $d\theta = 0$ (because, as explained below, b is updated before $\theta$), we similarly get: $db = -\frac{\sum_i grad_i + \lambda_b \cdot b_{m-1}}{\sum_i hess_i + \lambda_b}$
For the first boosting round (m = 1) we start from $\theta_0 = b_0 = 0$ (recall base_score = 0), and thus $grad_{1,2,3} = -10, -20, -30$ and $hess_{1,2,3} = 1, 1, 1$. We use $\lambda = 0, \lambda_b = 0, \alpha = 0$. We update b first because its update $db$ is independent of $\theta$, while the update $d\theta$ of $\theta$ depends on $db$. Thus $db = -\frac{-10 - 20 - 30}{1 + 1 + 1} = 20$, and then $d\theta = -\frac{(-10\cdot 1 - 20\cdot 2 - 30\cdot 3) + (1\cdot 1\cdot 20 + 1\cdot 2\cdot 20 + 1\cdot 3\cdot 20)}{1\cdot 1^2 + 1\cdot 2^2 + 1\cdot 3^2} = \frac{20}{14} = 1.42857$. Repeating this two-step update for 20 rounds reproduces the coefficients in the model dump above, as the sketch below shows.
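The whole procedure can be reproduced in a few lines of NumPy. The sketch below is my own reconstruction of the two update formulas derived above (not XGBoost's actual implementation), assuming $\lambda = \lambda_b = \alpha = 0$ and updating b before $\theta$ in each round. After 20 rounds it gives $\theta \approx 9.5418$ and $b \approx 1.0692$, in line with the gblinear dump above, and it approaches the theoretical solution $\theta = 10$, $b = 0$ as the number of rounds grows.

import numpy as np

# Linear gradient boosting with halved L2 loss and no regularization (lambda = lambda_b = alpha = 0)
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

theta, b = 0.0, 0.0                        # start from F_0(x) = 0 (base_score = 0)
for m in range(20):                        # num_boost_round = 20
    grad = (theta * x + b) - y             # grad_i = -(y_i - F_{m-1}(x_i))
    hess = np.ones_like(x)                 # hess_i = 1
    db = -grad.sum() / hess.sum()                                      # update b first (d_theta = 0 here)
    dtheta = -((grad + hess * db) * x).sum() / (hess * x * x).sum()    # then update theta, using db
    b += db
    theta += dtheta

print(theta, b)    # ~9.5418, ~1.0692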