
Saturday, October 28, 2017

Binary search and the C++ API

C++ binary search API

  • template< class ForwardIt, class T, class Compare > ForwardIt std::upper_bound( ForwardIt first, ForwardIt last, const T& value, Compare comp ): Returns an iterator pointing to the first element in the range [first, last) that is greater than value, or last if no such element is found. Note that comp defaults to operator<.
  • template< class ForwardIt, class T, class Compare > ForwardIt std::lower_bound( ForwardIt first, ForwardIt last, const T& value, Compare comp ): Same as upper_bound() except that it returns the first element that is "not less than" value.
  • template< class ForwardIt, class T, class Compare >
    std::pair<ForwardIt,ForwardIt> equal_range( ForwardIt first, ForwardIt last, const T& value, Compare comp ):
    Returns a range containing all elements equivalent to value in the range [first, last). The range [first, last) must be partitioned with respect to comparison with value, i.e. it must satisfy all of the following requirements:
    • partitioned with respect to element < value or comp(element, value)
    • partitioned with respect to !(value < element) or !comp(value, element)
    • for all elements, if element < value or comp(element, value) is true then !(value < element) or !comp(value, element) is also true
  • set, multiset, map, and multimap all have their own upper_bound() and lower_bound() member functions returning the corresponding iterators.
  • multimap, multiset, unordered_multimap, and unordered_multiset have an equal_range() member function.
    multiset<int> s{1,2,3,3,3,3,5,7};
    auto it = s.lower_bound(3);     // member function: O(log n) on the tree
    cout << (*it) << endl;          // 3 (the first of the four 3s)
    it++; it++; it++; it++;
    cout << (*it) << endl;          // 5 (one past the last 3)

    vector<int> v{1,2,3,3,3,3,5,7};
    auto vit = lower_bound(v.begin(), v.end(), 3);
    cout << (vit - v.begin()) << endl;                               // 2
    cout << upper_bound(v.begin(), v.end(), 3) - v.begin() << endl;  // 6
    auto range = equal_range(v.begin(), v.end(), 3);
    cout << (range.first - v.begin()) << "," << (range.second - v.begin()) << endl;  // 2,6


For a sorted array in ascending order:
  • Implementing upper_bound(), i.e. the first index whose element is greater than target (v.size() if none):
    int l = 0, r = v.size();
    const int target = 3;
    while (l < r) {
        auto mid = l + (r - l)/2;
        if (v[mid] > target) r = mid;
        else l = mid + 1;
    }
    cout << r << endl;

  • Implementing reverse_upper_bound(), i.e. the last index whose element is less than target (-1 if none):
    l = -1, r = v.size()-1;
    while (l < r) {
        auto mid = r - (r - l)/2;
        if (v[mid] < target) l = mid;
        else r = mid - 1;
    }
    cout << l << endl;

  • Implementing lower_bound(), i.e. the first index whose element is not less than target: it sits one past the last element strictly less than target, whether or not target is present (when absent it coincides with upper_bound()), so
     lower_bound() == reverse_upper_bound() + 1;

  • Implementing reverse_lower_bound(), i.e. the last index whose element is not greater than target: it sits one before the first element strictly greater than target (when target is absent it coincides with reverse_upper_bound()), so
     reverse_lower_bound() == upper_bound() - 1;

  • Implementing equal_range():
     range.first = lower_bound(), range.second = upper_bound();

Sunday, May 7, 2017

Machine Learning and Statistics: a unified perspective | 5.1 Gradient Boosting - Linear

linear_gradient_boost
Gradient Boosting is a generic method to create a strong learner using additive weak learners. Each weak learner is created by fitting the residual from the previous step. The mathematical formulation of the problem is the following:
The goal of the fitting process is to find a function $F(x)$ such that $F(x) = \arg\min_{F(x)} L(y, F(x)) + \Omega(F) = \arg\min_{F(x)} \sum_{i=1}^{n} L(y_i, F(x_i)) + \Omega(F)$, where $L(y, F(x))$ is the loss function and $\Omega(F)$ is the regularization function. We find $F(x)$ by breaking it down into a sum of $M$ functions such that $F(x) = \sum_{m=1}^{M} h_m(x)$. Each $h_m(x)$ is derived by fitting the residual $y - F_{m-1}(x)$ from the previous step, such that $h_m(x) = F_m(x) - F_{m-1}(x) = \arg\min_{h_m(x) \in H} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h_m(x_i)) + \Omega(F_{m-1} + h_m)$. To solve the minimization problem, gradient boosting takes the idea from Newton's method. If $L(y, F(x))$ is a convex function, we can minimize it by sequentially minimizing its local second-order Taylor expansion, i.e., $h_m(x) = \arg\min_{h_m(x) \in H} \sum_{i=1}^{n} \left[ L(y_i, F_{m-1}(x_i)) + grad_i\, h_m(x_i) + \frac{hess_i\, h_m(x_i)^2}{2} \right] + \Omega(F_{m-1} + h_m)$, where $grad_i = \frac{\partial L}{\partial F_{m-1}(x_i)}$ is the first-order derivative and $hess_i = \frac{\partial^2 L}{\partial F_{m-1}(x_i)^2}$ is the second-order derivative, both independent of $h_m(x)$. If $h_m(x)$ is parameterized by $d\theta$ (which represents a change to the existing parameter $\theta_{m-1}$), then the problem changes from a function optimization to a parameter optimization: $h_m(x) = \arg\min_{d\theta} \sum_{i=1}^{n} \left[ grad_i\, h_m(d\theta, x_i) + \frac{1}{2} hess_i\, h_m(d\theta, x_i)^2 \right] + \Omega(\theta_{m-1} + d\theta)$
To start the iteration, we define $F_{-1}(x) \equiv 0$ and $h_0(x) = \gamma$ (a constant) so that $h_0(x) = F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$. This is easy to calculate regardless of the form of the loss function. For further iterations, $h(x) \in H$, where $H$ is a family of functions called base (or weak) learners. Usually for linear problems $h(x) = \theta^T x + b$, while for nonlinear problems $h(x)$ is most commonly some type of decision tree, since a tree is a very generic non-parametric form of nonlinear function; other types of nonlinear base learners are also possible. In this article we will focus on the linear gradient boosting problem, which is relatively simple. We use the package [XGBoost](https://xgboost.readthedocs.io/en/latest/) for demonstration. In the very simple example below, we regress $y = [10, 20, 30]$ on $x = [1, 2, 3]$, which should result in a linear model $F(x) = \theta x + b$, where $\theta$ is expected to be 10 and $b$ is expected to be 0.
In [12]:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear", "booster":"gblinear", "base_score":0}
gbm = xgb.train(dtrain=T_train_xgb,params=params, num_boost_round=20)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print(Y_pred)
[ 39.2363205  48.7781105]
In [11]:
gbm.get_dump()
Out[11]:
['bias:\n1.06916\nweight:\n9.54179\n']
We can see that after 20 iterations, xgboost using the reg:linear objective and the gblinear booster calculated $\theta = 9.54179$ and $b = 1.06916$, which is close to the theoretical solution. Let's take a closer look at how linear boosting gets this result. For linear boosting, $F(x) = \theta^T x + b = \sum_{m=1}^{M} h_m(x)$, with $\theta_m = \theta_{m-1} + d\theta$ and $b_m = b_{m-1} + db$ at each step. Here $x$ can be a vector of size $k$, which means that $x$ has $k$ features, and thus $\theta$ is a vector of size $k$ as well. In the particular example above, $k = 1$. For the loss function we use L2 loss and elastic net regularization, i.e., $L(y, F(x)) = \sum_{i=1}^{n} \frac{1}{2}(y_i - F(x_i))^2$ and $\Omega(\theta, b) = \frac{1}{2}\lambda \|\theta\|^2 + \alpha \|\theta\|_1 + \frac{1}{2}\lambda_b b^2$. Based on the previous deduction, the additive function $h_m(x)$ at each step is: $h_m(x) = \arg\min_{d\theta, db} G(\theta_{m-1} + d\theta, b_{m-1} + db) = \arg\min_{d\theta, db} \sum_{i=1}^{n} \left[ grad_i (d\theta^T x_i + db) + \frac{1}{2} hess_i (d\theta^T x_i + db)^2 \right] + \frac{1}{2}\lambda \|\theta_{m-1} + d\theta\|^2 + \alpha \|\theta_{m-1} + d\theta\|_1 + \frac{1}{2}\lambda_b (b_{m-1} + db)^2$, where $grad_i = -(y_i - F_{m-1}(x_i))$ and $hess_i = 1$.
Setting $\frac{\partial G}{\partial d\theta} = 0$, we have: $\sum_i \left[ grad_i x_i + hess_i x_i^2 \, d\theta + hess_i x_i \, db \right] + \lambda \, d\theta + \lambda \theta_{m-1} + \alpha \, \mathrm{sgn}(\theta_{m-1} + d\theta) = 0$. Thus $d\theta = -\frac{\sum_i grad_i x_i + \sum_i hess_i x_i \, db + \lambda \theta_{m-1} + \alpha \, \mathrm{sgn}(\theta_{m-1} + d\theta)}{\sum_i hess_i x_i^2 + \lambda}$
Setting $\frac{\partial G}{\partial db} = 0$, we similarly get: $db = -\frac{\sum_i grad_i + \lambda_b b_{m-1}}{\sum_i hess_i + \lambda_b}$
For $m = 0$, we have $\theta_0 = b_0 = 0$, and thus $grad_{1,2,3} = -10, -20, -30$ and $hess_{1,2,3} = 1, 1, 1$. We use $\lambda = 0$, $\lambda_b = 0$, $\alpha = 0$. We update $b$ first, because the update $db$ is independent of $\theta$, while the update $d\theta$ depends on $b$. Thus $db = -\frac{-10 - 20 - 30}{1 + 1 + 1} = 20$, and then $d\theta = -\frac{-(10 \cdot 1 + 20 \cdot 2 + 30 \cdot 3) + (1 \cdot 1 \cdot 20 + 1 \cdot 2 \cdot 20 + 1 \cdot 3 \cdot 20)}{1 \cdot 1^2 + 1 \cdot 2^2 + 1 \cdot 3^2} = \frac{20}{14} = 1.42857$