Gradient Boosting Tree Basics
- start with a weak, simple learner (e.g. predicting the mean of y) to get an initial prediction
- use a loss function J to compute the error between y_true and y_pred
- Loss function: L2(y_true, y_pred)
- Objective function: sum(L2(y_true, y_pred)) over all examples
- create K boosting trees; each new tree is fit to the residual (the negative gradient of the loss) of the current model
- combine the K trees (initial prediction plus the shrunken sum of tree outputs) for the final prediction
Pseudo code:
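A minimal Python sketch of the boosting loop described above, assuming squared-error (L2) loss so the negative gradient is simply the residual; DecisionTreeRegressor, the depth limit, and the learning_rate are stand-in choices, not part of the original notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, K=100, learning_rate=0.1, max_depth=3):
    """Fit K boosting trees on the residuals of an initial mean prediction."""
    f0 = y.mean()                       # weak starting learner: predict the mean
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(K):
        residual = y - pred             # negative gradient of the L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)           # each new tree predicts the current residual
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    """Combine the K trees: initial mean plus the shrunken sum of tree outputs."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```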
GBDT in SKlearn
- Objective function: sum of the loss over all examples
- trees are fit using only the 1st-order Taylor expansion (negative gradient) of the loss
- Base Tree: CART
- No regularization term in the objective
- No sparse handling of missing values; exact split finding (sorting feature values at every split) is slow
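A short usage sketch of scikit-learn's classic GBDT (GradientBoostingRegressor, which grows CART trees on the first-order gradient); the dataset and hyperparameter values here are placeholders, and loss="squared_error" assumes a recent scikit-learn version:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# squared_error loss -> each tree is fit to the residual (negative gradient)
gbdt = GradientBoostingRegressor(
    loss="squared_error",
    n_estimators=200,      # K boosting trees
    learning_rate=0.1,
    max_depth=3,
)
gbdt.fit(X_train, y_train)
print("R^2 on test:", gbdt.score(X_test, y_test))
```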
XGBoost
- Objective function = sum(L(y_true, y_pred)) over all examples + regularization term
- uses a 2nd-order Taylor expansion, so the objective can be computed from the first and second derivatives (g_i, h_i) of the loss (formulas written out after this list)
- Base learners: CART trees and a linear booster
- reg term = gamma * (number of leaves) + (lambda / 2) * sum(leaf weight^2)
- calculate the gain for each candidate split
- Greedy (exact) algorithm for split finding: enumerate every sorted feature value
- Approximate algorithm for split finding: propose candidate split points from feature quantiles
- Global variant: propose the candidates once per tree and reuse them at every level
- Local variant: re-propose candidates after each split
- Sparsity-aware split finding for features with missing/NaN values (learns a default direction per node)
- Computation optimization
- block storage: the most time-consuming step is sorting feature values for split finding, so features are pre-sorted once and kept in column blocks
- supports parallel computing (split finding is parallelized across feature blocks)
- Memory consumption is still large
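The second-order objective and the resulting split gain, written out in the standard XGBoost paper notation (g_i and h_i are the first and second derivatives of the loss at the current prediction, I_L and I_R are the instance sets of the left and right child):

```latex
% second-order Taylor approximation of the regularized objective at iteration t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

% optimal weight of leaf j, and the gain used to score a candidate split
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad
\text{Gain} = \tfrac{1}{2}\left[
  \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda}
+ \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda}
- \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
```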
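A usage sketch with the xgboost Python package that maps the bullets above onto concrete parameters (tree_method selects exact greedy vs. approximate/histogram split finding; missing values are routed by the sparsity-aware algorithm automatically); the data and parameter values are illustrative only:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
X[np.random.rand(*X.shape) < 0.05] = np.nan   # NaNs: sparsity-aware split finding learns a default direction

model = xgb.XGBRegressor(
    objective="reg:squarederror",
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    gamma=0.1,            # penalty per leaf (the gamma * T term)
    reg_lambda=1.0,       # L2 penalty on leaf weights (the lambda term)
    tree_method="hist",   # "exact" = greedy enumeration, "approx"/"hist" = quantile candidates
    n_jobs=4,             # parallel split finding over feature blocks
)
model.fit(X, y)
print(model.predict(X[:5]))
```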
LightGBM
- Gradient-based One-Side Sampling (GOSS): keep the examples with the largest gradients, randomly sample the rest, and up-weight the sampled small-gradient examples (sketch after this list)
- Exclusive Feature Bundling (EFB): Greedy Bundling groups mutually exclusive features (rarely nonzero at the same time) into bundles
- Merge Exclusive Features: each bundle is merged into a single feature by offsetting its bin values
- Computation optimization
- feature parallel: workers split the features when searching for the best split
- data parallel: Reduce-Scatter to merge local feature histograms across workers
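A minimal sketch of the GOSS sampling step referenced above (not LightGBM's internal code): keep the top_rate fraction of examples with the largest absolute gradients, randomly sample other_rate of the rest, and re-weight the small-gradient sample by (1 - top_rate) / other_rate so the estimated information gain stays approximately unbiased; all names here are illustrative:

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return sampled indices and per-example weights for one boosting iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))     # sort by |gradient|, descending
    top_idx = order[:n_top]                    # always keep large-gradient examples
    rest_idx = order[n_top:]
    sampled_rest = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, sampled_rest])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - top_rate) / other_rate   # amplify the small-gradient sample
    return idx, weights
```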
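And a usage sketch with the lightgbm Python package; boosting_type="goss" turns on the one-side sampling above and enable_bundle controls EFB (parameter names as in the LightGBM docs, though newer versions expose GOSS via data_sample_strategy instead); values are placeholders:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=5000, n_features=50, noise=0.1, random_state=0)

model = lgb.LGBMRegressor(
    boosting_type="goss",   # Gradient-based One-Side Sampling
    top_rate=0.2,           # fraction of large-gradient examples always kept
    other_rate=0.1,         # fraction of small-gradient examples sampled
    enable_bundle=True,     # Exclusive Feature Bundling (greedy bundling + merge into bins)
    n_estimators=200,
    learning_rate=0.1,
)
model.fit(X, y)
print(model.predict(X[:5]))
```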