Variable Importance on diabetes dataset

Variable importance estimates the influence of a given input variable on the predictions made by a model. To assess variable importance in a prediction problem, Breiman [1] introduced the permutation approach, in which the values of one variable/column are shuffled at a time. This permutation breaks the relationship between the variable of interest and the outcome. The loss score is then compared before and after the substitution: a significant drop in performance reflects the importance of that variable for predicting the outcome. This easy-to-use solution was shown by Strobl et al. [2] to be affected by the degree of correlation between the variables: it is biased towards truly non-significant variables that are highly correlated with the significant ones, thereby creating spurious significant variables. They introduced a solution for the Random Forest estimator based on conditional sampling, permuting within the sub-groups obtained when the space is bisected by the conditioning variables of the tree-building process. However, this solution is specific to Random Forests and is costly in high-dimensional settings. Chamma et al. [3] introduced a model-agnostic solution that bypasses the limitations of the permutation approach by using a conditional scheme. The variable of interest carries two types of information: 1) its relationship with the remaining variables and 2) its relationship with the outcome. The standard permutation, while breaking the relationship with the outcome, also destroys the dependency with the remaining variables. Therefore, instead of permuting the variable of interest directly, the variable is first predicted from the remaining variables, the residuals of this prediction are permuted, and the variable is reconstructed from the permuted residuals. This preserves the dependency with the remaining variables.
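To make the two schemes concrete, here is a minimal sketch for a single variable j, assuming a fitted model and arrays X and y. The function names are hypothetical; the actual implementations are provided by the estimators used below.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def standard_permutation_loss(model, X, y, j, rng):
    # Standard permutation: shuffle column j, which breaks both its link
    # with the outcome and its dependency with the other variables.
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    return mean_squared_error(y, model.predict(X_perm))


def conditional_permutation_loss(model, X, y, j, rng):
    # Conditional permutation: predict column j from the remaining
    # variables, permute only the residuals of that prediction, and
    # rebuild the column so its dependency with the others is kept.
    X_rest = np.delete(X, j, axis=1)
    imputer = LinearRegression().fit(X_rest, X[:, j])
    fitted = imputer.predict(X_rest)
    X_cond = X.copy()
    X_cond[:, j] = fitted + rng.permutation(X[:, j] - fitted)
    return mean_squared_error(y, model.predict(X_cond))

In both cases, the importance of variable j is the increase of the resulting loss over the loss obtained on the unpermuted data.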

In this example, we compare the standard permutation approach and its conditional variant for variable importance on the diabetes dataset, in the single-level case. The aim is to see whether the new statistically controlled solution changes the results.

References

[1] Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.

[2] Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9, 307.

[3] Chamma, A., Engemann, D. A., & Thirion, B. (2023). Statistically Valid Variable Importance Assessment through Conditional Permutations. Advances in Neural Information Processing Systems (NeurIPS).

Imports needed for this script

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.model_selection import KFold

from hidimstat.cpi import CPI
from hidimstat.loco import LOCO
from hidimstat.permutation_importance import PermutationImportance

Load the diabetes dataset
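The dataset is loaded with load_diabetes; X holds the ten standardized covariates, y the quantitative measure of disease progression, and diabetes.feature_names is kept for labelling the plot at the end.

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target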

Fit a baseline model on the diabetes dataset

We use a Ridge regression model with a 5-fold cross-validation to fit the diabetes dataset.

n_folds = 5
regressor = RidgeCV(alphas=np.logspace(-3, 3, 10))
regressor_list = [clone(regressor) for _ in range(n_folds)]
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    regressor_list[i].fit(X[train_index], y[train_index])
    score = r2_score(
        y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index])
    )
    mse = root_mean_squared_error(
        y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index])
    )

    print(f"Fold {i}: {score}")
    print(f"Fold {i}: {mse}")
Fold 0: 0.33092885980301545
Fold 0: 58.57429457038258
Fold 1: 0.46114740610001137
Fold 1: 53.69518424561582
Fold 2: 0.5333394029342122
Fold 2: 54.666537166030764
Fold 3: 0.5048253747874585
Fold 3: 54.3633702441001
Fold 4: 0.5979566135054368
Fold 4: 52.287948367546456

Fit the baseline model with a 10-fold cross-validation

We refit the Ridge regression model, this time with a 10-fold cross-validation; the per-fold estimators and the corresponding splits are reused by the importance methods below.

n_folds = 10
regressor = RidgeCV(alphas=np.logspace(-3, 3, 10))
regressor_list = [clone(regressor) for _ in range(n_folds)]
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    regressor_list[i].fit(X[train_index], y[train_index])
    score = r2_score(
        y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index])
    )
    mse = root_mean_squared_error(
        y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index])
    )

    print(f"Fold {i}: {score}")
    print(f"Fold {i}: {mse}")
Fold 0: 0.34873505091371093
Fold 0: 56.151600911163946
Fold 1: 0.2722061930940729
Fold 1: 61.35323376775815
Fold 2: 0.5316220308691071
Fold 2: 49.33374807702299
Fold 3: 0.36967014548640154
Fold 3: 59.01492149527525
Fold 4: 0.5858181205553151
Fold 4: 51.479244269131684
Fold 5: 0.46246416851449
Fold 5: 58.3119253313517
Fold 6: 0.5235794267120801
Fold 6: 51.352999936251216
Fold 7: 0.48683083150894546
Fold 7: 56.75372711094103
Fold 8: 0.665318862395647
Fold 8: 47.26090648721779
Fold 9: 0.5514585816057873
Fold 9: 55.7236176985951

Measure the importance of variables using the CPI method

CPI (Conditional Permutation Importance) follows the conditional scheme of Chamma et al. [3]: the imputation_model (here a RidgeCV) predicts the variable of interest from the remaining variables, and only the residuals of that prediction are permuted.

cpi_importance_list = []
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    cpi = CPI(
        estimator=regressor_list[i],
        imputation_model=RidgeCV(alphas=np.logspace(-3, 3, 10)),
        # covariate_estimator=HistGradientBoostingRegressor(random_state=0,),
        n_permutations=50,
        random_state=0,
        n_jobs=4,
    )
    cpi.fit(X_train, y_train)
    importance = cpi.score(X_test, y_test)
    cpi_importance_list.append(importance)
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
Fold 5
Fold 6
Fold 7
Fold 8
Fold 9

Measure the importance of variables using the LOCO method

LOCO (Leave-One-Covariate-Out) removes the variable of interest, refits the model, and takes the resulting increase in held-out loss as the importance score; a minimal sketch of the idea is given below.
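A rough illustration, reusing the imports above (the helper name is hypothetical; the hidimstat LOCO estimator handles the refitting and scoring internally):

def loco_score(estimator, X_train, y_train, X_test, y_test, j):
    # Refit a clone of the estimator without column j and compare the
    # held-out loss with that of the full model.
    full = clone(estimator).fit(X_train, y_train)
    reduced = clone(estimator).fit(np.delete(X_train, j, axis=1), y_train)
    loss_full = root_mean_squared_error(y_test, full.predict(X_test))
    loss_reduced = root_mean_squared_error(
        y_test, reduced.predict(np.delete(X_test, j, axis=1))
    )
    return loss_reduced - loss_full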

loco_importance_list = []

for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    loco = LOCO(
        estimator=regressor_list[i],
        random_state=0,
        n_jobs=4,
    )
    loco.fit(X_train, y_train)
    importance = loco.score(X_test, y_test)
    loco_importance_list.append(importance)
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
Fold 5
Fold 6
Fold 7
Fold 8
Fold 9

Measure the importance of variables using the permutation method

This is the standard (unconditional) permutation importance of Breiman [1], which shuffles each column without accounting for its dependency on the remaining variables.

pi_importance_list = []

for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    pi = PermutationImportance(
        estimator=regressor_list[i],
        n_permutations=50,
        random_state=0,
        n_jobs=4,
    )
    pi.fit(X_train, y_train)
    importance = pi.score(X_test, y_test)
    pi_importance_list.append(importance)
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
Fold 5
Fold 6
Fold 7
Fold 8
Fold 9

Define a function to compute the p-value from importance values

For each variable, the importance values are averaged across folds and a one-sided z-test (the mean divided by the standard deviation across folds, tested against zero) yields the p-value.

def compute_pval(vim):
    # one-sided z-test of the mean importance across folds against zero
    mean_vim = np.mean(vim, axis=0)
    std_vim = np.std(vim, axis=0)
    pval = norm.sf(mean_vim / std_vim)
    # clip to avoid infinite values when taking -log10 later
    return np.clip(pval, 1e-10, 1 - 1e-10)
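As a hypothetical illustration, importance values for a single variable that are consistently positive across the ten folds yield a small p-value:

# importance of one (hypothetical) variable across the 10 folds
vim_example = np.array(
    [0.021, 0.015, 0.030, 0.025, 0.010, 0.020, 0.018, 0.022, 0.027, 0.012]
).reshape(-1, 1)
print(compute_pval(vim_example))  # mean ~ 0.020, std ~ 0.006 -> p ~ 5e-4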

Analyze the results

cpi_vim_arr = np.array([x["importance"] for x in cpi_importance_list]) / 2
cpi_pval = compute_pval(cpi_vim_arr)

vim = [
    pd.DataFrame(
        {
            "var": np.arange(cpi_vim_arr.shape[1]),
            "importance": x["importance"],
            "fold": i,
            "pval": cpi_pval,
            "method": "CPI",
        }
    )
    for i, x in enumerate(cpi_importance_list)
]

loco_vim_arr = np.array([x["importance"] for x in loco_importance_list])
loco_pval = compute_pval(loco_vim_arr)

vim += [
    pd.DataFrame(
        {
            "var": np.arange(loco_vim_arr.shape[1]),
            "importance": x["importance"],
            "fold": i,
            "pval": loco_pval,
            "method": "LOCO",
        }
    )
    for i, x in enumerate(loco_importance_list)
]

pi_vim_arr = np.array([x["importance"] for x in pi_importance_list])
pi_pval = compute_pval(pi_vim_arr)

vim += [
    pd.DataFrame(
        {
            "var": np.arange(pi_vim_arr.shape[1]),
            "importance": x["importance"],
            "fold": i,
            "pval": pi_pval,
            "method": "PI",
        }
    )
    for i, x in enumerate(pi_importance_list)
]

fig, ax = plt.subplots()
df_plot = pd.concat(vim)
df_plot["pval"] = -np.log10(df_plot["pval"])
methods = df_plot["method"].unique()
colors = plt.get_cmap("tab10")

for i, method in enumerate(methods):
    subset = df_plot[df_plot["method"] == method]
    ax.bar(
        subset["var"] + i * 0.2,
        subset["pval"],
        width=0.2,
        label=method,
        color=colors(i),
    )

ax.legend(title="Method")
ax.set_ylabel(r"$-\log_{10}(\text{p-value})$")
ax.axhline(-np.log10(0.05), color="tab:red", ls="--")
ax.set_xlabel("Variable")
ax.set_xticks(np.arange(len(diabetes.feature_names)) + 0.2)
ax.set_xticklabels(diabetes.feature_names)
plt.show()
[Figure: grouped bar plot of -log10(p-value) per diabetes feature for the CPI, LOCO, and PI methods, with the 0.05 significance threshold shown as a dashed red line]

Total running time of the script: (0 minutes 8.271 seconds)

Estimated memory usage: 626 MB
