Boruta Algorithm
Introduction
The Boruta algorithm was invented by Miron B. Kursa and Witold R. Rudnicki of University of Warsaw. It is a feature selection method for finding all relevant features using Random Forest. Different from other feature selection wrapper methods that select minimal optimal set of features such as Recursive Feature Elimination, finding all relevant features is particularly important for understanding a black-box machine learning model. The all relevant features also try to hint at causality for some observable behavior.
Kursa and Rudnicki named Boruta after the diety of forest in Slavic mythology since it’s a wrapper method using Random Forest.
The Boruta R Package was introduced in 2010 and ported to Python by Daniel Homola . The Boruta algorithm was designed to work with Random Forest and still works well with other ensemble tree methods such as Extra Trees and Gradient Boosted Trees. However, I found XGBoost and LightGBM tend to select a very minimal subset of features, even less than using them in Recursive Feature Elimination. Thus I would recommend not using XGBoost or LightGBM with Boruta.
Boruta Algorithm
The Boruta algorithm is quite ingenious and works as the following:
1. Extend the information by adding shadow copies of all features (always extended by at least 5 shadow features).
2. Shuffle the newly added shadow features in step 1 to remove their correlation with the target variable.
3. Run a Random Forest classifier on the extended feature data set and gather the Z score computed.
4. Find the maximum Z score among the shadow features added (MZSA), and then assign a hit to every feature that scored better than MZSA.
5. For each attribute with undetermined importance, perform a two-sided test of equality with the MZSA, e.g. $$Z-score_{actual}$$ > $$Z-score_{shadow}$$
6. Tag the features with importance significantly lower than MZSA as **unimportant** and remove them from the dataset.
7. Tag the features with importance significantly higher than MZSA as **important**.
8. Remove shadow attributes.
9. Repeat the procedure until all features have been confirmed or rejected, or the algorithm reached the set number of iterations.
Boruta in Python
BorutaPy
is one of the packages in the scikit-learn-contrib
collections. It can be installed via pip install Boruta
. It follows Scikit-learn’s superb API format.
Tip: BorutaPy
takes only numpy
arrays, not pandas
dataframes. Please remember to extract the values from the dataframe before shoving them through BorutaPy.
import pandas
from sklearn.ensemble import RandomForestClassifier
from boruta_py import boruta_py
# load X and y
X = pd.read_csv('my_X_table.csv', index_col=0).values
y = pd.read_csv('my_y_vector.csv', index_col=0).values
# define random forest classifier, with utilising all cores and
# sampling in proportion to y labels
forest = RandomForestClassifier(n_jobs=-1, class_weight='auto')
# define Boruta feature selection method
feat_selector = boruta_py.BorutaPy(forest, n_estimators='auto', verbose=2)
# find all relevant features
feat_selector.fit(X, y)
# check selected features
feat_selector.support_
# check ranking of features
feat_selector.ranking_
# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
References:
Feature Selection with the Boruta Package
BorutaPy: An all relevant feature selection method based on Random Forest estimators