Gradient Boosting Tree Basics
- start with a weak, simple learner (e.g. predicting the mean of y) to get an initial prediction
- use a loss function J to compute the error between y_true and y_pred
- Loss function: L2(y_true, y_pred)
- Objective function: sum(L2(y_true, y_pred)) over all examples
- create K boosting trees; each new tree is fit to the residual (the negative gradient of the loss) of the current model
- combine the K trees (initial prediction plus the shrunken sum of tree outputs) for the final prediction
Pseudo code:
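A minimal Python sketch of the boosting loop described above, assuming squared-error (L2) loss so the negative gradient is simply the residual; DecisionTreeRegressor, the depth limit, and the learning_rate are stand-in choices, not part of the original notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, K=100, learning_rate=0.1, max_depth=3):
    """Fit K boosting trees on the residuals of an initial mean prediction."""
    f0 = y.mean()                       # weak starting learner: predict the mean
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(K):
        residual = y - pred             # negative gradient of the L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)           # each new tree predicts the current residual
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    """Combine the K trees: initial mean plus the shrunken sum of tree outputs."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```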
GBDT in SKlearn
- Objective function: sum of the loss over all examples
- trees are fit using only the 1st-order Taylor expansion (negative gradient) of the loss
- Base Tree: CART
- No regularization term in the objective
- No sparse handling of missing values; exact split finding (sorting feature values at every split) is slow
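A short usage sketch of scikit-learn's classic GBDT (GradientBoostingRegressor, which grows CART trees on the first-order gradient); the dataset and hyperparameter values here are placeholders, and loss="squared_error" assumes a recent scikit-learn version:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# squared_error loss -> each tree is fit to the residual (negative gradient)
gbdt = GradientBoostingRegressor(
    loss="squared_error",
    n_estimators=200,      # K boosting trees
    learning_rate=0.1,
    max_depth=3,
)
gbdt.fit(X_train, y_train)
print("R^2 on test:", gbdt.score(X_test, y_test))
```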
XGBoost
- Objective function = sum(L(y_true, y_pred)) over all examples + regularization term
- uses a 2nd-order Taylor expansion, so the objective can be computed from the first and second derivatives (g_i, h_i) of the loss (formulas written out after this list)
- Base learners: CART trees and a linear booster
- reg term = gamma * (number of leaves) + (lambda / 2) * sum(leaf weight^2)
- calculate the gain for each candidate split
- Greedy (exact) algorithm for split finding: enumerate every sorted feature value
- Approximate algorithm for split finding: propose candidate split points from feature quantiles
- Global variant: propose the candidates once per tree and reuse them at every level
- Local variant: re-propose candidates after each split
- Sparsity-aware split finding for features with missing/NaN values (learns a default direction per node)
- Computation optimization
- block storage: the most time-consuming step is sorting feature values for split finding, so features are pre-sorted once and kept in column blocks
- supports parallel computing (split finding is parallelized across feature blocks)
- Memory consumption is still large
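The second-order objective and the resulting split gain, written out in the standard XGBoost paper notation (g_i and h_i are the first and second derivatives of the loss at the current prediction, I_L and I_R are the instance sets of the left and right child):

```latex
% second-order Taylor approximation of the regularized objective at iteration t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

% optimal weight of leaf j, and the gain used to score a candidate split
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad
\text{Gain} = \tfrac{1}{2}\left[
  \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda}
+ \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda}
- \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
```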
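A usage sketch with the xgboost Python package that maps the bullets above onto concrete parameters (tree_method selects exact greedy vs. approximate/histogram split finding; missing values are routed by the sparsity-aware algorithm automatically); the data and parameter values are illustrative only:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
X[np.random.rand(*X.shape) < 0.05] = np.nan   # NaNs: sparsity-aware split finding learns a default direction

model = xgb.XGBRegressor(
    objective="reg:squarederror",
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    gamma=0.1,            # penalty per leaf (the gamma * T term)
    reg_lambda=1.0,       # L2 penalty on leaf weights (the lambda term)
    tree_method="hist",   # "exact" = greedy enumeration, "approx"/"hist" = quantile candidates
    n_jobs=4,             # parallel split finding over feature blocks
)
model.fit(X, y)
print(model.predict(X[:5]))
```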
LightGBM
- Gradient-based One-Side Sampling (GOSS): keep the examples with the largest gradients, randomly sample the rest, and up-weight the sampled small-gradient examples (sketch after this list)
- Exclusive Feature Bundling (EFB): Greedy Bundling groups mutually exclusive features (rarely nonzero at the same time) into bundles
- Merge Exclusive Features: each bundle is merged into a single feature by offsetting its bin values
- Computation optimization
- feature parallel: workers split the features when searching for the best split
- data parallel: Reduce-Scatter to merge local feature histograms across workers
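A minimal sketch of the GOSS sampling step referenced above (not LightGBM's internal code): keep the top_rate fraction of examples with the largest absolute gradients, randomly sample other_rate of the rest, and re-weight the small-gradient sample by (1 - top_rate) / other_rate so the estimated information gain stays approximately unbiased; all names here are illustrative:

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return sampled indices and per-example weights for one boosting iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))     # sort by |gradient|, descending
    top_idx = order[:n_top]                    # always keep large-gradient examples
    rest_idx = order[n_top:]
    sampled_rest = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, sampled_rest])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - top_rate) / other_rate   # amplify the small-gradient sample
    return idx, weights
```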
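And a usage sketch with the lightgbm Python package; boosting_type="goss" turns on the one-side sampling above and enable_bundle controls EFB (parameter names as in the LightGBM docs, though newer versions expose GOSS via data_sample_strategy instead); values are placeholders:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=5000, n_features=50, noise=0.1, random_state=0)

model = lgb.LGBMRegressor(
    boosting_type="goss",   # Gradient-based One-Side Sampling
    top_rate=0.2,           # fraction of large-gradient examples always kept
    other_rate=0.1,         # fraction of small-gradient examples sampled
    enable_bundle=True,     # Exclusive Feature Bundling (greedy bundling + merge into bins)
    n_estimators=200,
    learning_rate=0.1,
)
model.fit(X, y)
print(model.predict(X[:5]))
```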