Saturday, October 28, 2017

Binary search and the C++ API

C++ binary search API

  • template< class ForwardIt, class T, class Compare > ForwardIt std::upper_bound( ForwardIt first, ForwardIt last, const T& value, Compare comp ): Returns an iterator pointing to the first element in the range [first, last) that is greater than value, or last if no such element is found. Note comp is default to operator<
  • template< class ForwardIt, class T, class Compare > ForwardIt std::lower_bound( ForwardIt first, ForwardIt last, const T& value, Compare comp ):  Same  as upper_bound() except that it is "not less than"
  • template< class ForwardIt, class T, class Compare >
    std::pair<ForwardIt,ForwardIt> equal_range( ForwardIt first, ForwardIt last, const T& value, Compare comp ):
    Returns a range containing all elements equivalent to value in the range [first, last). The range [first, last) must be partitioned with respect to comparison with value, i.e. it must satisfy all of the following requirements:
    • partitioned with respect to element < value or comp(element, value)
    • partitioned with respect to !(value < element) or !comp(value, element)
    • for all elements, if element < value or comp(element, value) is true then !(value < element) or !comp(value, element) is also true
  • set, multiset, map, multimap all have their own upper_bound() and lower_bound() method returning corresponding forward iterators.
  • multimap, multiset, unordred_multimap and ordered_multiset has equal_range() method
   multiset<int> s{1,2,3,3,3,3,5,7};
    auto it = s.lower_bound(3);
    cout << (*it) << endl;
    it++; it++; it++; it++;
    cout << (*it) << endl;

    vector<int> v{1,2,3,3,3,3,5,7};
    auto vit = lower_bound(v.begin(),v.end(), 3);
    cout << (vit-v.begin()) << endl;
    cout << upper_bound(v.begin(), v.end(), 3) - v.begin() << endl;
    auto range = equal_range(v.begin(), v.end(), 3);
    cout << (range.first - v.begin()) << "," << (range.second - v.begin()) << endl;


For a sorted array in ascending order:
  • Implementing upper_bound():
    int l = 0, r = v.size();
    const int target = 3;
    while (l < r) {
        auto mid = l + (r - l)/2;
        if (v[mid] > target) r = mid;
        else l = mid + 1;
    }
    cout << r << endl;

  • Implementing reverse_upper_bound():
    l = -1, r = v.size()-1;
    while (l < r) {
        auto mid = r - (r - l)/2;
        if (v[mid] < target) l = mid;
        else r = mid - 1;
    }
    cout << l << endl;

  • Implementing lower_bound():
     v[reverse_upper_bound() + 1] == target ? upper_bound() : reverse_upper_bound() + 1;

  • Implementing reverse_lower_bound():
     v[upper_bound() - 1] == target ? reverse_upper_bound() : upper_bound() - 1;

  • Implementing equal_range():
     range.first = lower_bound(), range.second = upper_bound();

Sunday, May 7, 2017

Machine Learning and Statistics: a unified perspective | 5.1 Gradient Boosting - Linear

linear_gradient_boost
Gradient Boosting is a generic method to create a strong learner using additive weak learners. Each weak learner is created by fitting the residual from the previous step. The mathematical formulation of the problem is the following:
The goal of the fitting process is to find a function F(x) such that $F(x) = argmin_{F(x)} L(y, F(x)) + \Omega(F) = argmin_{F(x)} \sum_{i = 1}^{n}L(y_i, F(x_i)) + \Omega(F)$, where $L(y, F(x))$ is the loss function and $\Omega(F)$ is the regularization function. We find out F(x) by breaking F(x) down into sum of M functions such that $F(x) = \sum_{m = 1}^{M} h_m(x)$. Each $h_m(x)$ is derived by fitting the residual from the previous residual of $y - F_{m-1}(x)$ such that $h_m(x) = F_m(x) - F_{m-1}(x) = argmin_{h_m(x) \in \mathcal{H}} \sum_{i = 1}^{n} L(y_i, F_{m-1}(x_i) + h_m(x_i)) + \Omega(F_{m-1} + h_m)$. To solve the minimization problem, gradient boosting took the idea from Newton's method. If $L(y, F(x))$ is a convex function, we can minimize $L(y, F(x))$ by sequentially minimizing its local second order Talor's expansion, i.e., $h_m(x) = argmin_{h_m(x) \in \mathcal{H}} \sum_{i = 1}^{n} [L(y_i, F_{m-1}(x_i)) + grad_i\cdot h_m(x_i) + hess_i\cdot h_m(x_i)^2] + \Omega(F_{m-1} + h_m)$, where $grad_i = \frac{\partial L}{\partial F_{m-1}(x_i)}$ is the first order derivative and $hess_i = \frac{\partial ^2 L}{\partial F_{m-1}(x_i)^2}$ is the second order derivative, which are all independent of $h_m(x)$. If $h_m(x)$ is parameterized by $d\theta$ (which represents a change in the existing parameter $\theta_{m-1}$), then the optimization problem changes from a function optimization to a parameter optimization: $h_m(x) = argmin_{d\theta} \sum_{i = 1}^{n} [grad_i\cdot h_m(d\theta, x_i) + \frac{1}{2}hess_i\cdot h_m(d\theta, x_i)^2] + \Omega(\theta_{m-1} + d\theta)$
To start the iteration, we define $F_{-1}(x) \equiv 0$ and $h_0(x) = \gamma (constant)$ so that $h_0(x) = F_0(x) = argmin_{\gamma} \sum_{i = 1}^{n} L(y_i, \gamma)$. This would be easy to calculate regardless of the form of the loss function. For further iterations, $h(x) \in \mathcal{H}$ where $\mathcal{H}$ is a family of functions called base (or weak) learners. Usually for linear problems, $h(x) = \theta^Tx + b$, while for nonlinear problems $h(x)$ would most commonly be some type of decision tree since tree is a very generic non-parametric form of nonlinear functions. But other types of nonlinear base learners are also possible. In this article we will focus on the linear gradient boosting problem, which is relatively simple. We use the package (XGBoost)[https://xgboost.readthedocs.io/en/latest//] for demonstration. In this very simple example below, we regress [10,20,30] to [1,2,3], which should result in a linear model of $F(x) = \theta x + b$, where $\theta$ is expected to be 10 and $b$ is expected to be 0
In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
In [12]:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear", "booster":"gblinear", "base_score":0}
gbm = xgb.train(dtrain=T_train_xgb,params=params, num_boost_round=20)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print(Y_pred)
[ 39.2363205  48.7781105]
In [11]:
gbm.get_dump()
Out[11]:
['bias:\n1.06916\nweight:\n9.54179\n']
We can see that after 20 iterations, xgboost using reg:linear objective and gblinear booster calculated $\theta = 9.54179$ and $b = 1.06916$, which is close to the theoretical solution. Let's take a closer look at how linear boosting get this result. For linear boosting, $F(x) = \theta^T \mathbf{x} + b = \sum_{m = 1}^{M} h_m(x) = (\theta_{m-1} + d\theta)^T \mathbf{x} + (b_{m-1} + db)$. Here $\mathbf{x}$ can be a vector of size k, which means that x has k features, and thus $\theta$ will be a vector of size k as well. In the particular example above, we have k = 1. For loss function we use L2 loss and elastic net regularization, i.e., $L(y, F(x)) = \sum_{m = 1}^{M}(y_i - F(x_i))^2$ and $\Omega(\theta, b) = \lambda \theta^2 + \alpha |\theta| + \lambda_b b^2$. Based on the previous deduction, the additive function $h_m(x)$ at each step is: $h_m(x) = argmin_{d\theta, db} G(\theta_{m-1} + d\theta, b_{m-1} + db) = argmin_{d\theta, db} \sum_{i = 1}^{n} [grad_i\cdot (d\theta^T \mathbf{x}_i + db) + \frac{1}{2}hess_i\cdot (d\theta^T \mathbf{x}_i + b)^2] + \frac{1}{2}\lambda ||\theta_{m-1} + d\theta||^2 + \alpha ||\theta_{m-1} + d\theta|| + \frac{1}{2}\lambda_b (b_{m-1} + db)^2$, where $grad_i = -(y_i - F_{m-1}(x_i))$ and $hess_i = 1$.
For $\frac{\partial G}{\partial d\theta} = 0$, we have: $\sum [grad_i \mathbf{x}_i + hess_i \mathbf{x}_i^2 d\theta + hess_i \mathbf{x}_i b] + \lambda d\theta + \lambda \theta_{m-1} + \alpha \cdot sgn(\theta_{m-1} + d\theta) = 0$. Thus $d\theta = -\frac{\sum grad_i \mathbf{x}_i + hess_i \mathbf{x}_i b + \lambda \theta_{m-1} + \alpha \cdot sgn(\theta_{m-1} + \theta)}{\sum hess_i \mathbf{x}_i^2 + \lambda}$
For $\frac{\partial G}{\partial b} = 0$, similarly, we can get: $db = -\frac{\sum grad_i + \lambda_b \cdot b_{m-1}}{\sum hess_i + \lambda_b}$
For m = 0, we have $\theta_0 = b_0 = 0$, and thus $grad_{1,2,3} = -10, -20, -30$, and $hess_{1,2,3} = 1, 1, 1$. We use $\lambda = 0, \lambda_b = 0, \alpha = 0$. We update b first because the update of b, $db$ is independent of $\theta$ but the update of $\theta$, $d\theta$ is dependent of $b$. Thus $db = -\frac{-10 - 20 - 30}{1 + 1 + 1} = 20$, and thus $d\theta = -\frac{(-10*1 + -20*2 + -30*3) + (1*1*20 + 1*2*20 + 1*3*20)}{1*1*1 + 1*2*2 + 1*3*3} = \frac{20}{14} = 1.42857$
In [ ]: