
Gradient Boosted Models

data-science, ML

Snippet

What is Boosting?

  • Type of Ensemble Models. For a graphical view, check out Tree-Based Methods
  • Basically, you start from a base stump (single Y/N boundary/question)
  • A boosted ensemble is additive: the first tree plus learning_rate * each later tree, where every new tree is fit to the mistakes of the ensemble so far (see the sketch after this list)
  • Training minimizes a loss function (for XGBoost, log loss is the default for classification and squared error for regression)
  • By the way, XGBoost is just a library that implements gradient boosted models.
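
A minimal sketch of this additive idea, assuming made-up one-dimensional data and scikit-learn decision stumps (the data and numbers are illustrative, not from the original note):

# Two-stump boosting by hand: the second stump is fit on the residuals of the
# first, and its contribution is scaled by a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
stump1 = DecisionTreeRegressor(max_depth=1).fit(X, y)          # base stump
residuals = y - stump1.predict(X)                               # its mistakes
stump2 = DecisionTreeRegressor(max_depth=1).fit(X, residuals)   # fit the mistakes

ensemble_pred = stump1.predict(X) + learning_rate * stump2.predict(X)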

Pros and Cons

  • Pros
    • Powerful and accurate, often more so than random forests
    • Good at handling complex, non-linear relationships
    • Good at dealing with imbalanced data
  • Cons
    • Slower to train, since trees must be built sequentially
    • Prone to overfitting if data is noisy
    • Harder to tune hyperparameters

Notes:

  • Gradient boosting with linear regression as the base learner doesn't help, since a sum of linear models is itself just another linear model
    • Boosting shines when there is no terse functional form around. Boosting decision trees lets the functional form of the regressor/classifier evolve slowly to fit the data, often resulting in complex shapes one could not have dreamed up by hand and eye. When a simple functional form is desired, boosting is not going to help you find it (or at least is probably a rather inefficient way to find it).
  • Different kinds of models have different advantages. The boosted trees model is very good at handling tabular data with numerical features, or categorical features with fewer than hundreds of categories. Unlike linear models, boosted trees can capture non-linear interactions between the features and the target (see the comparison below).
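
A small check of that last point, assuming a synthetic target that depends on a product of two features (which a purely linear model cannot represent):

# Illustration: the target depends on an interaction x1 * x2; a linear model
# scores near zero while shallow boosted trees pick up the interaction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print("linear R^2: ", LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
print("boosted R^2:", GradientBoostingRegressor(max_depth=2).fit(X_tr, y_tr).score(X_te, y_te))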

Example

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Colab's sample MNIST CSV has no header row, so the label column picks up the name "6"
mnist = pd.read_csv("sample_data/mnist_train_small.csv")
x = mnist[mnist.columns.difference(["6"])]
y = mnist["6"]

x_train_val, x_test, y_train_val, y_test = train_test_split(x, y, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, test_size=0.2)

# Random forest baseline (bagging: trees built independently)
model = RandomForestClassifier(n_estimators=50, max_depth=15, max_features=15)
model.fit(x_train, y_train)
print(model.score(x_val, y_val))

# Gradient boosted stumps (max_depth=1), built sequentially
model2 = XGBClassifier(objective="multi:softmax", learning_rate=0.1,
                       max_depth=1, n_estimators=330)
model2.fit(x_train, y_train)
preds = model2.predict(x_test)
print(accuracy_score(y_test, preds))

[Image: boosting.pdf]

Gradient Boosted Tree-Based Methods:

  • Each tree is typically a weak learner, meaning it performs relatively poorly on its own but contributes to the overall performance of the ensemble.
  • The trees are usually shallow and consist of only a few splits.
  • Sequentially built
    • Bagging methods like Random Forest build each tree independently
    • In boosting, trees are built sequentially, with each tree correcting the mistakes of its predecessors.
  • Each subsequent tree focuses on misclassified instances, giving those instances more weight so that later trees classify them correctly.
  • Gradient Boosting
    • A specific type of boosting where each tree is fit to the residual errors of the previous trees, i.e. to the negative gradient of the loss function being optimized (see the check after this list).
    • Generalizes the boosting method by allowing optimization of an arbitrary differentiable loss function.
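
For squared-error loss the negative gradient is exactly the residual, which is why "fit the residuals" and "fit the negative gradient" coincide in that case; a tiny numeric check with made-up values:

# For L = 0.5 * (y - f)^2, dL/df = -(y - f), so the negative gradient is the residual.
import numpy as np

y = np.array([3.0, -1.0, 2.5])   # made-up targets
f = np.array([2.0,  0.5, 2.0])   # current ensemble predictions
residual = y - f
neg_gradient = -(f - y)          # negative of dL/df
assert np.allclose(residual, neg_gradient)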

==Overfitting can be controlled by setting parameters like the subsample ratio, column subsampling, and regularization terms (example below).==
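
A hypothetical configuration showing those knobs (the parameter names are XGBoost's real ones; the values are arbitrary):

# Regularization knobs on an XGBoost classifier (values chosen for illustration only)
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,          # row subsampling ratio per tree
    colsample_bytree=0.8,   # column (feature) subsampling per tree
    reg_alpha=0.0,          # L1 regularization on leaf weights
    reg_lambda=1.0,         # L2 regularization on leaf weights
)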

Limitations of Shapley scores