Introducing CatBoost

AI Maverick
2 min read · Jul 2, 2022


Gradient Boosting Machines (GBMs) are among the best-known machine learning models and are used for many prediction problems, including weather forecasting, web search ranking, and traffic estimation. The idea is to build an ensemble model from several weak learners, which in most studies are regression decision trees. With standard decision tree approaches, categorical features in the dataset have to be transformed into numerical values before the trees can use them. The main idea of CatBoost (categorical boosting) [1] is to handle categorical features with a novel approach.

The purpose of this article is to review and introduce the CatBoost [1] model and library.


As we have already mentioned, CatBoost [1] transforms the categorical features into numerical values while reducing the overfitting issue. The authors [1] introduced a data-shuffling procedure that produces several random permutations of the training set. For the first occurrences of a categorical value in a permutation, where there is no history yet, the encoding falls back to a prior float value. For later examples, the encoding is a fraction computed only from the examples that precede it in the permutation: the numerator counts the preceding examples of the same category that belong to the positive class (for a binary problem), the denominator counts all preceding examples of that category, and a weighted prior is added to both, yielding a new float value. For regression, the average target value is used instead of the positive-class probability. This process is repeated over several permutations so that different values of the prior (the probability estimate) are obtained.
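To make the procedure concrete, here is a minimal sketch of ordered target statistics for a single categorical column. It is not CatBoost's actual implementation; the function name and the `prior` / `prior_weight` parameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def ordered_target_statistic(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode one categorical column with ordered target statistics.

    For each example, only the examples that appear *before* it in a random
    permutation contribute to the statistic, which avoids target leakage:

        encoded = (sum of previous targets of the same category + prior_weight * prior)
                  / (count of previous examples of the same category + prior_weight)
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))   # one random permutation of the rows
    sums, counts = {}, {}                      # running statistics per category
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        sums[cat] = s + targets[idx]           # update running stats *after* encoding
        counts[cat] = c + 1
    return encoded

# Toy usage: a binary target and a single categorical feature.
df = pd.DataFrame({"color": ["red", "blue", "red", "red", "blue"],
                   "y":     [1,      0,      1,     0,     1]})
print(ordered_target_statistic(df["color"].to_numpy(), df["y"].to_numpy()))
```

In CatBoost itself this is done over several permutations, so the same row can receive different encodings at different boosting steps.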

Moreover, to reduce overfitting and to exploit interactions between categorical features without information leakage, CatBoost also augments the trees with new features built from combinations of the categorical features in the dataset.
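As a rough illustration (an assumption about the mechanics, not the exact CatBoost procedure), a combination feature can be formed by concatenating the category values of two columns and then encoding the result like any other categorical column, reusing `ordered_target_statistic` from the sketch above:

```python
import pandas as pd

# Toy frame with two categorical columns and a binary target (names are illustrative).
df = pd.DataFrame({"color": ["red", "blue", "red", "red", "blue"],
                   "size":  ["S",   "L",    "S",   "L",   "S"],
                   "y":     [1,      0,      1,     0,     1]})

# A combination feature is just the pair of category values treated as a new category.
df["color_size"] = df["color"] + "|" + df["size"]
df["color_size_enc"] = ordered_target_statistic(df["color_size"].to_numpy(),
                                                df["y"].to_numpy())
print(df[["color_size", "color_size_enc"]])
```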

Model training

For the training process, CatBoost follows the same idea as Gradient Boosting Machines (GBM/GBDT): each new decision tree is fit to the residuals (the gradients) of the loss function. One innovation is that CatBoost first builds M random permutations of the dataset and transforms the categorical features accordingly, then fits individual decision trees on each permutation; the permutations used during boosting are the same ones used to compute the categorical statistics. Using several permutations strengthens the model's performance. Note that CatBoost computes the gradients of the loss function on examples of one random permutation and trains the next model with a different permutation, so the effective data ordering changes at every boosting step. In addition, CatBoost uses Oblivious Decision Trees [2], which grow symmetrically and end up with a lower final depth. Because a symmetric tree uses the same splitting condition in every region of a level, all features are transformed into binary split conditions, which speeds up decision-making.
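In practice, all of this is handled by the catboost Python package: you only pass the indices or names of the categorical columns, and the library applies its own encoding and ordered boosting. A minimal usage sketch (the hyperparameter values and column names here are arbitrary):

```python
import pandas as pd
from catboost import CatBoostClassifier

# Toy binary classification data with one numerical and one categorical column.
X = pd.DataFrame({"weight": [1.0, 2.0, 0.5, 3.0, 1.5],
                  "color":  ["red", "blue", "red", "green", "blue"]})
y = [1, 0, 1, 0, 1]

model = CatBoostClassifier(
    iterations=100,      # number of boosting rounds (oblivious trees)
    depth=4,             # depth of each symmetric tree
    learning_rate=0.1,
    verbose=False,
)

# Passing cat_features lets CatBoost apply its own categorical encoding;
# no manual one-hot or target encoding is required.
model.fit(X, y, cat_features=["color"])

print(model.predict(pd.DataFrame({"weight": [1.2], "color": ["red"]})))
```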

Related links

  • CatBoost — the new generation of gradient boosting — Anna Veronika Dorogush. link
  • Neural Oblivious Decision Ensembles (NODE) — A State-of-the-Art Deep Learning Algorithm for Tabular Data. link
  • Full Documentation. link
  • GitHub repository. link
  • CatBoost Example. link

References

[1] Dorogush, Anna Veronika, Vasily Ershov, and Andrey Gulin. “CatBoost: gradient boosting with categorical features support.” arXiv preprint arXiv:1810.11363 (2018).

[2] Kohavi, Ron. “Bottom-up induction of oblivious read-once decision graphs.” European Conference on Machine Learning. Springer, Berlin, Heidelberg, 1994.
