Feature selection is the process of choosing suitable features as a machine learning model's inputs. Most well-known methods consider the relationship between the dependent and independent variables and apply statistical tests to find the strongest relation between the input and output variables.
The feature selection process is divided into supervised (considering the target feature) and unsupervised (ignoring the target) approaches. Moreover, supervised methods include various strategies, for instance:
- Using statistical tests to pick the features.
- Scoring the strength between the variables.
- Moreover, some machine learning models, such as CatBoost, come with built-in feature selection.
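As a minimal sketch of built-in feature ranking, the example below uses scikit-learn's RandomForestRegressor as a stand-in for CatBoost (whose `get_feature_importance` behaves similarly); the synthetic data and feature names are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# only the first feature carries signal; the rest are noise
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)
for name, score in zip(["f0", "f1", "f2", "f3"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```

After fitting, the informative feature receives by far the largest importance score, which is the signal a built-in selector acts on.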
Advantages of implementing feature selection
- Drop non-informative features.
- Reduce dataset dimension and size, which makes the training procedure faster.
- Improve the predictive performance of the model.
- Rank the features by their importance.
1. Statistical tests
This category includes methods that keep the features with the best performance in statistical tests. These tests measure the linear dependency between pairs of variables. The following methods fall into this category:
- Select K Best
The two main parameters of the SelectKBest algorithm are the scoring function, which differs depending on the problem, and k, the number of top-ranked features to keep.
from sklearn.feature_selection import SelectKBest, f_regression

fs = SelectKBest(score_func=f_regression, k=10)  # keep the 10 highest-scoring features
X_new = fs.fit_transform(x_train, y_train)
You may also replace the fixed k number with a percentile by using the SelectPercentile method instead.
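A short sketch of the percentile variant; the synthetic regression data from `make_regression` is an assumption made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

# keep the top 50% of features by f_regression score
fs = SelectPercentile(score_func=f_regression, percentile=50)
X_half = fs.fit_transform(X, y)
print(X_half.shape)  # 5 of the 10 features survive
```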
- Generic Univariate Select
Generic Univariate Select is a configurable form of univariate feature selection: the selection strategy (mode) and its parameter are exposed as arguments, so they can be tuned to find the best setting.
from sklearn.feature_selection import GenericUnivariateSelect, f_regression

fs = GenericUnivariateSelect(score_func=f_regression, mode='k_best', param=10)
X_Generic = fs.fit_transform(x_train, y_train)
2. Sequential procedure
Sequential feature selection is a greedy algorithm that, at each iteration, changes the subset of selected features by one. It uses the score of a machine learning model to decide which feature to pick. The initial subset depends on the greedy direction: if it is set to backward, the algorithm starts from all features and removes one at each step; if it is set to forward, it starts from an empty set and adds one feature at each step.
Note that the ML model passed in is not pre-fitted; the algorithm fits it repeatedly through cross-validation as the selection proceeds sequentially.
Sequential Feature Selection can be applied in a pipeline to build the input before fitting the model:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline

# the final estimator step is illustrative; any regressor would do
model_reg = Pipeline([('fs', SequentialFeatureSelector(RandomForestRegressor())),
                      ('model', RandomForestRegressor())])
model_reg.fit(x_train, y_train)
3. Recursive Feature Elimination
Recursive Feature Elimination is a very common method for dropping features. It works by fitting an estimator and computing the importance of each feature; the least important feature is dropped at each step. Like the sequential approach, it relies on an estimator, but there is no cross-validation here, so the process is quicker.
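The elimination loop can be sketched with scikit-learn's RFE; the dataset and the choice of LinearRegression as the estimator are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=0)

# drop one feature per step until 3 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)  # boolean mask of the surviving features
```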
4. Dropping low-variance features
This is, in a sense, the baseline for feature selection. The only parameter is a variance threshold: any feature whose variance does not meet the threshold is removed.
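A minimal sketch with scikit-learn's VarianceThreshold; the toy matrix is constructed so that one column is nearly constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# first column is almost constant, second varies widely
X = np.array([[0.0, 1.0],
              [0.0, 3.0],
              [0.1, 5.0],
              [0.0, 7.0]])

vt = VarianceThreshold(threshold=0.1)  # drop features with variance below 0.1
X_vt = vt.fit_transform(X)
print(X_vt.shape)  # only the high-variance column remains
```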
The goal of this article was to briefly introduce the feature selection process and discuss its categories and the different approaches within each group. Python examples for some of the methods are also included.