Gradient boosting is a generic method for building a strong learner out of additive weak learners, where each weak learner is fit to the residual left by the previous step. The mathematical formulation of the problem is as follows:
The goal of the fitting process is to find a function $F(x)$ such that $F(x) = \arg\min_{F(x)} L(y, F(x)) + \Omega(F) = \arg\min_{F(x)} \sum_{i=1}^{n} L(y_i, F(x_i)) + \Omega(F)$, where $L(y, F(x))$ is the loss function and $\Omega(F)$ is the regularization function. We find $F(x)$ by breaking it down into a sum of $M$ functions, $F(x) = \sum_{m=1}^{M} h_m(x)$. Each $h_m(x)$ is derived by fitting the residual $y - F_{m-1}(x)$ left by the previous step, such that $h_m(x) = F_m(x) - F_{m-1}(x) = \arg\min_{h_m \in H} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h_m(x_i)) + \Omega(F_{m-1} + h_m)$. To solve this minimization problem, gradient boosting borrows the idea of Newton's method: if $L(y, F(x))$ is a convex function, we can minimize it by sequentially minimizing its local second-order Taylor expansion, i.e., $h_m(x) = \arg\min_{h_m \in H} \sum_{i=1}^{n} \left[ L(y_i, F_{m-1}(x_i)) + grad_i \cdot h_m(x_i) + \frac{1}{2} hess_i \cdot h_m(x_i)^2 \right] + \Omega(F_{m-1} + h_m)$, where $grad_i = \frac{\partial L}{\partial F_{m-1}(x_i)}$ is the first-order derivative and $hess_i = \frac{\partial^2 L}{\partial F_{m-1}(x_i)^2}$ is the second-order derivative, both of which are independent of $h_m(x)$. If $h_m(x)$ is parameterized by $d\theta$ (which represents a change to the existing parameter $\theta_{m-1}$), then the optimization problem changes from a function optimization to a parameter optimization: $h_m(x) = \arg\min_{d\theta} \sum_{i=1}^{n} \left[ grad_i \cdot h_m(d\theta, x_i) + \frac{1}{2} hess_i \cdot h_m(d\theta, x_i)^2 \right] + \Omega(\theta_{m-1} + d\theta)$
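To see why this second-order view agrees with the "fit the residual" description above, here is a short worked step under the simplifying assumption that the regularization term $\Omega$ is ignored. Minimizing the quadratic approximation point by point gives the Newton step
$h_m(x_i) = \arg\min_{h} \left[ grad_i \cdot h + \frac{1}{2} hess_i \cdot h^2 \right] = -\frac{grad_i}{hess_i}$,
and for the squared-error loss $L(y, F) = \frac{1}{2}(y - F)^2$ we have $grad_i = -(y_i - F_{m-1}(x_i))$ and $hess_i = 1$, so the Newton step is exactly the residual, $h_m(x_i) = y_i - F_{m-1}(x_i)$.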
To start the iteration, we define $F_{-1}(x) \equiv 0$ and $h_0(x) = \gamma$ (a constant), so that $h_0(x) = F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$, which is easy to compute regardless of the form of the loss function. For further iterations, $h(x) \in H$, where $H$ is a family of functions called base (or weak) learners. For linear problems, $h(x) = \theta^T x + b$; for nonlinear problems, $h(x)$ is most commonly some type of decision tree, since trees are a very generic non-parametric family of nonlinear functions, though other types of nonlinear base learners are possible as well. In this article we will focus on the linear gradient boosting problem, which is relatively simple. We use the package [XGBoost](https://xgboost.readthedocs.io/en/latest//) for demonstration. In the very simple example below, we regress $y = [10, 20, 30]$ on $x = [1, 2, 3]$, which should result in a linear model $F(x) = \theta x + b$, where $\theta$ is expected to be 10 and $b$ is expected to be 0.
In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
In [12]:
import pandas as pd
import xgboost as xgb
df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)
params = {"objective": "reg:linear", "booster":"gblinear", "base_score":0}
gbm = xgb.train(dtrain=T_train_xgb,params=params, num_boost_round=20)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print(Y_pred)
In [11]:
gbm.get_dump()
Out[11]:
We can see that after 20 iterations, xgboost with the reg:linear objective and the gblinear booster arrives at $\theta = 9.54179$ and $b = 1.06916$, which is close to the theoretical solution ($\theta = 10$, $b = 0$). Let's take a closer look at how linear boosting gets this result. For linear boosting, $F(x) = \theta^T x + b = \sum_{m=1}^{M} h_m(x)$, and after step $m$ the parameters are $\theta_m = \theta_{m-1} + d\theta$ and $b_m = b_{m-1} + db$, so $F_m(x) = (\theta_{m-1} + d\theta)^T x + (b_{m-1} + db)$. Here $x$ can be a vector of size $k$, which means that $x$ has $k$ features, and thus $\theta$ will be a vector of size $k$ as well; in the particular example above, $k = 1$. For the loss function we use the L2 loss with elastic net regularization, i.e., $L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$ and $\Omega(\theta, b) = \frac{1}{2}\lambda \|\theta\|^2 + \alpha \|\theta\|_1 + \frac{1}{2}\lambda_b b^2$. Based on the previous deduction, the additive function $h_m(x)$ at each step is: $h_m(x) = \arg\min_{d\theta, db} G(\theta_{m-1} + d\theta, b_{m-1} + db) = \arg\min_{d\theta, db} \sum_{i=1}^{n} \left[ grad_i \cdot (d\theta^T x_i + db) + \frac{1}{2} hess_i \cdot (d\theta^T x_i + db)^2 \right] + \frac{1}{2}\lambda \|\theta_{m-1} + d\theta\|^2 + \alpha \|\theta_{m-1} + d\theta\|_1 + \frac{1}{2}\lambda_b (b_{m-1} + db)^2$, where $grad_i = -(y_i - F_{m-1}(x_i))$ and $hess_i = 1$.
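As a quick cross-check of these $grad_i$ and $hess_i$ expressions, the same model can be trained by passing them to xgb.train as a custom objective. This is only a sketch: the helper name squared_error_obj is mine, it assumes T_train_xgb and the imports from the cells above are still in scope, and with base_score=0 it should land on essentially the same coefficients as the built-in reg:linear objective.
def squared_error_obj(preds, dtrain):
    # grad_i = -(y_i - F_{m-1}(x_i)) = F_{m-1}(x_i) - y_i, hess_i = 1, as derived above
    labels = dtrain.get_label()
    return preds - labels, np.ones_like(preds)

custom_params = {"booster": "gblinear", "base_score": 0}
gbm_custom = xgb.train(dtrain=T_train_xgb, params=custom_params,
                       num_boost_round=20, obj=squared_error_obj)
print(gbm_custom.get_dump())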
Setting $\frac{\partial G}{\partial d\theta} = 0$, we have: $\sum_i \left[ grad_i\, x_i + hess_i\, x_i^2\, d\theta + hess_i\, x_i\, db \right] + \lambda\, d\theta + \lambda\, \theta_{m-1} + \alpha \cdot \mathrm{sgn}(\theta_{m-1} + d\theta) = 0$. Thus $d\theta = -\frac{\sum_i \left( grad_i\, x_i + hess_i\, x_i\, db \right) + \lambda\, \theta_{m-1} + \alpha \cdot \mathrm{sgn}(\theta_{m-1} + d\theta)}{\sum_i hess_i\, x_i^2 + \lambda}$
Setting $\frac{\partial G}{\partial db} = 0$, and noting that the $hess_i\, d\theta^T x_i$ cross term vanishes because $b$ is updated first while $d\theta$ is still zero, we similarly get: $db = -\frac{\sum_i grad_i + \lambda_b\, b_{m-1}}{\sum_i hess_i + \lambda_b}$
For the first iteration ($m = 1$), we start from $\theta_0 = b_0 = 0$, and thus $grad_{1,2,3} = -10, -20, -30$ and $hess_{1,2,3} = 1, 1, 1$. We use $\lambda = \lambda_b = \alpha = 0$. We update $b$ first, because the update $db$ is independent of $\theta$, while the update $d\theta$ depends on $db$. Thus $db = -\frac{-10 - 20 - 30}{1 + 1 + 1} = 20$, and then $d\theta = -\frac{(-10 \cdot 1 - 20 \cdot 2 - 30 \cdot 3) + (1 \cdot 1 \cdot 20 + 1 \cdot 2 \cdot 20 + 1 \cdot 3 \cdot 20)}{1 \cdot 1 \cdot 1 + 1 \cdot 2 \cdot 2 + 1 \cdot 3 \cdot 3} = \frac{20}{14} = 1.42857$
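Iterating these two updates by hand quickly gets tedious, so below is a minimal NumPy sketch of the update rules derived above (not xgboost's internal updater), run for the same 20 rounds with $\lambda = \lambda_b = \alpha = 0$ and the bias updated before the weight at each step. After 20 rounds it reaches roughly $\theta \approx 9.5418$ and $b \approx 1.0692$, in line with the gblinear coefficients quoted earlier.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

theta, b = 0.0, 0.0      # start from F_0(x) = 0, matching base_score=0
lam = lam_b = 0.0        # lambda = lambda_b = 0; the alpha (L1) term is dropped since alpha = 0 here

for m in range(20):
    pred = theta * x + b
    grad = -(y - pred)        # grad_i = -(y_i - F_{m-1}(x_i))
    hess = np.ones_like(x)    # hess_i = 1 for the L2 loss

    # bias update first: db = -(sum(grad_i) + lambda_b * b_{m-1}) / (sum(hess_i) + lambda_b)
    db = -(grad.sum() + lam_b * b) / (hess.sum() + lam_b)
    b += db

    # weight update next, using the freshly applied bias change db:
    # dtheta = -(sum(grad_i*x_i + hess_i*x_i*db) + lambda*theta_{m-1}) / (sum(hess_i*x_i^2) + lambda)
    dtheta = -((grad * x + hess * x * db).sum() + lam * theta) / ((hess * x * x).sum() + lam)
    theta += dtheta

print(theta, b)               # approximately 9.54179 and 1.06916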