.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_2D_simulation_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_2D_simulation_example.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_2D_simulation_example.py:


Support recovery on simulated data (2D)
=======================================

This example shows the advantages of spatially relaxed inference when
dealing with high-dimensional spatial data. To do so, we compare several
statistical methods that aim at recovering the support, i.e., the
predictive features. Among those methods, some leverage the spatial
structure of the data. For more details about the inference algorithms
presented in this example, or about the generative process used to
simulate the data, please refer to Chevalier et al. (2021) [1]_.

This example corresponds to the experiment described in detail in
Chevalier et al. (2021) [1]_. In short, to simulate the data, we draw
``n_samples`` i.i.d. Gaussian vectors of size ``n_features`` and reshape
them into squares (with edges of length ``n_features ** (1/2)``). Then,
to introduce some spatial structure, we apply a Gaussian filter that
correlates nearby features. The 2D data are then flattened into a design
matrix ``X``, which casts the problem as a regression setting and eases
the computation of the simulated target ``y`` (see below). Next, we
construct the weight map ``w``, which has the same shape as the 2D data
and contains four predictive regions, one in each corner of the square.
As for the construction of ``X``, the map ``w`` is finally flattened into
a vector ``beta``. Lastly, to derive the target ``y``, we draw white
Gaussian noise ``epsilon`` and use the linear generative model
``y = X beta + epsilon``.
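Written out, the generative model and noise distribution are (``sigma``
denotes the noise standard deviation set in the simulation parameters
below):

.. math::

    y = X \beta + \epsilon, \qquad
    \epsilon \sim \mathcal{N}(0, \sigma^2 I_n),

where :math:`X \in \mathbb{R}^{n \times p}` is the design matrix,
:math:`n` is ``n_samples`` and :math:`p` is ``n_features``.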
The results of this experiment show that the methods that leverage the
spatial structure of the data are relevant. More precisely, we show that
clustered inference algorithms (e.g., CluDL) and ensembled clustered
inference algorithms (e.g., EnCluDL) are more powerful than standard
inference methods (see also Chevalier et al. (2021) [1]_). Indeed, when
the number of features is much greater than the number of samples,
standard statistical methods are unlikely to recover the support. The
idea of clustered inference is then to compress the data without breaking
its spatial structure, leading to a compressed problem that is close to
the original one. This results in a more powerful, spatially relaxed
inference: thanks to the dimension reduction, support recovery becomes
feasible. However, due to the spatial compression, there is a limited
(and quantifiable) spatial uncertainty concerning the shape of the
estimated support. Finally, by considering several choices of spatial
compression, ensembled clustered inference algorithms significantly
reduce this spatial uncertainty compared to clustered inference
algorithms, which consider only one spatial compression.

.. _References:

References
----------

.. [1] Chevalier, J. A., Nguyen, T. B., Thirion, B., & Salmon, J. (2021).
   Spatially relaxed inference on high-dimensional linear models.
   arXiv preprint arXiv:2106.02590.

.. GENERATED FROM PYTHON SOURCE LINES 53-56

.. code-block:: Python

    import matplotlib.pyplot as plt

.. GENERATED FROM PYTHON SOURCE LINES 57-59

Imports needed for this script
------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 59-72

.. code-block:: Python

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    from sklearn.feature_extraction import image

    from hidimstat.clustered_inference import clustered_inference
    from hidimstat.desparsified_lasso import (
        desparsified_lasso,
        desparsified_lasso_pvalue,
    )
    from hidimstat.ensemble_clustered_inference import ensemble_clustered_inference
    from hidimstat.scenario import multivariate_simulation
    from hidimstat.stat_tools import zscore_from_pval

.. GENERATED FROM PYTHON SOURCE LINES 73-81

Specific plotting functions
---------------------------
The functions below are used to plot the results and illustrate the
concept of spatial tolerance. If you are reading this example for the
first time, you can skip this section.

The following function builds a 2D map with four active regions that are
surrounded by thin tolerance regions.

.. GENERATED FROM PYTHON SOURCE LINES 81-113

.. code-block:: Python

    def weight_map_2D_extended(shape, roi_size, delta):
        """Build a weight map with a visible tolerance region."""
        roi_size_extended = roi_size + delta

        # Create four regions in the corners
        w = np.zeros(shape + (5,))
        w[0:roi_size, 0:roi_size, 0] = 0.5
        w[-roi_size:, -roi_size:, 1] = 0.5
        w[0:roi_size, -roi_size:, 2] = 0.5
        w[-roi_size:, 0:roi_size, 3] = 0.5

        # Add the tolerance region around each predictive region
        w[0:roi_size_extended, 0:roi_size_extended, 0] += 0.5
        w[-roi_size_extended:, -roi_size_extended:, 1] += 0.5
        w[0:roi_size_extended, -roi_size_extended:, 2] += 0.5
        w[-roi_size_extended:, 0:roi_size_extended, 3] += 0.5

        # Round the outer corner of each tolerance region
        for i in range(roi_size_extended):
            for j in range(roi_size_extended):
                if (i - roi_size) + (j - roi_size) >= delta:
                    w[i, j, 0] = 0
                    w[-i - 1, -j - 1, 1] = 0
                    w[i, -j - 1, 2] = 0
                    w[-i - 1, j, 3] = 0

        beta_extended = w.sum(-1).ravel()

        return beta_extended

.. GENERATED FROM PYTHON SOURCE LINES 114-116

To generate a plot that exhibits the true support and the supports
estimated by every method, we define the following two functions:

.. GENERATED FROM PYTHON SOURCE LINES 116-155

.. code-block:: Python

    def add_one_subplot(ax, map, title):
        """Add one subplot to the summary plot."""
        if map is not None:
            im = ax.imshow(map)
            im.set_clim(-1, 1)
            ax.tick_params(
                axis="both",
                which="both",
                bottom=False,
                top=False,
                left=False,
                labelbottom=False,
                labelleft=False,
            )
            ax.set_title(title)
        else:
            ax.axis("off")
            ax.get_xaxis().set_visible(False)
            ax.get_yaxis().set_visible(False)


    def plot(maps, titles):
        """Make a summary plot from the estimated supports."""
        fig, axes = plt.subplots(3, 2, figsize=(4, 6))

        for i in range(3):
            for j in range(2):
                k = i * 2 + j
                add_one_subplot(axes[i][j], maps[k], titles[k])

        fig.tight_layout()
        plt.show()

.. GENERATED FROM PYTHON SOURCE LINES 156-162

Generating the data
-------------------

After setting the simulation parameters, we run the function that
generates the 2D scenario briefly described in the first section of this
example.

.. GENERATED FROM PYTHON SOURCE LINES 162-176

.. code-block:: Python

    # simulation parameters
    n_samples = 100
    shape = (40, 40)
    n_features = shape[1] * shape[0]
    roi_size = 4  # size of the edge of the four predictive regions
    sigma = 2.0  # noise standard deviation
    smooth_X = 1.0  # level of spatial smoothing introduced by the Gaussian filter

    # generating the data
    X_init, y, beta, epsilon, _, _ = multivariate_simulation(
        n_samples, shape, roi_size, sigma, smooth_X, seed=1
    )
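As a quick sanity check (an addition to the original example), we can
verify that the returned arrays are consistent with the linear model
stated above. This sketch assumes that ``multivariate_simulation``
returns the realized noise vector ``epsilon``, as the generative
description suggests:

.. code-block:: Python

    # Consistency check of the generative model: if `epsilon` is the
    # realized noise, y should equal X beta + epsilon up to numerical
    # precision.
    print(X_init.shape, beta.shape)  # expected: (100, 1600) (1600,)
    print(np.allclose(y, X_init.dot(beta) + epsilon))  # expected: True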
.. GENERATED FROM PYTHON SOURCE LINES 177-194

Choosing inference parameters
-----------------------------

The choice of the number of clusters depends on several parameters, such
as: the structure of the data (a higher correlation between neighboring
features enables a greater dimension reduction, i.e., a smaller number of
clusters), the number of samples (small datasets require more dimension
reduction) and the required spatial tolerance (small clusters lead to
limited spatial uncertainty). Formally, the "spatial tolerance" is
defined as the largest distance from the true support at which the
occurrence of a false discovery is not statistically controlled (cf.
:ref:`References`). Theoretically, the spatial tolerance ``delta`` is
equal to the largest cluster diameter. However, this choice is
conservative, notably in the case of ensembled clustered inference. For
these algorithms, we recommend taking the average cluster radius. In this
example, we choose ``n_clusters = 200``, leading to a theoretical spatial
tolerance ``delta = 6``, which is still conservative (see Results).

.. GENERATED FROM PYTHON SOURCE LINES 194-205

.. code-block:: Python

    # hyper-parameters
    n_clusters = 200

    # inference parameters
    fwer_target = 0.1
    delta = 6

    # computation parameter
    n_jobs = 1

.. GENERATED FROM PYTHON SOURCE LINES 206-216

Computing z-score thresholds for support estimation
---------------------------------------------------

Below, we translate the FWER target into z-score targets. To compute the
z-score targets, we also account for the multiple testing correction. To
do so, we use the Bonferroni correction. For methods that do not reduce
the feature space, the correction consists in dividing the FWER target by
the number of features. For methods that group features into clusters,
the correction consists in dividing by the number of clusters.

.. GENERATED FROM PYTHON SOURCE LINES 216-224

.. code-block:: Python

    # computing the z-score thresholds for feature selection
    correction_no_cluster = 1.0 / n_features
    correction_cluster = 1.0 / n_clusters
    thr_c = zscore_from_pval((fwer_target / 2) * correction_cluster)
    thr_nc = zscore_from_pval((fwer_target / 2) * correction_no_cluster)
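The thresholds above can be cross-checked against the Gaussian quantile
function. The sketch below assumes that ``zscore_from_pval``, given a
single one-sided p-value, returns the corresponding standard normal
quantile; ``scipy`` is used only for the comparison:

.. code-block:: Python

    from scipy.stats import norm

    # Bonferroni-corrected two-sided threshold recomputed from the
    # inverse survival function of the standard normal; if the
    # assumption above holds, this matches thr_c.
    thr_c_check = norm.isf((fwer_target / 2) / n_clusters)
    print(thr_c, thr_c_check)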
.. GENERATED FROM PYTHON SOURCE LINES 225-230

Inference with several algorithms
---------------------------------

First, we compute a reference map that exhibits the true support and the
theoretical tolerance region.

.. GENERATED FROM PYTHON SOURCE LINES 230-234

.. code-block:: Python

    # compute true support with visible spatial tolerance
    beta_extended = weight_map_2D_extended(shape, roi_size, delta)

.. GENERATED FROM PYTHON SOURCE LINES 235-241

Now, we compute the support estimated by a high-dimensional statistical
inference method that does not leverage the data structure. This method
was introduced by Javanmard, A. et al. (2014), Zhang, C. H. et al. (2014)
and Van de Geer, S. et al. (2014) (full references are available at
https://mind-inria.github.io/hidimstat/), and is referred to as the
Desparsified Lasso.

.. GENERATED FROM PYTHON SOURCE LINES 241-257

.. code-block:: Python

    # compute desparsified lasso
    beta_hat, sigma_hat, precision_diagonal = desparsified_lasso(X_init, y, n_jobs=n_jobs)
    pval, pval_corr, one_minus_pval, one_minus_pval_corr, cb_min, cb_max = (
        desparsified_lasso_pvalue(X_init.shape[0], beta_hat, sigma_hat, precision_diagonal)
    )

    # compute estimated support (first method)
    zscore = zscore_from_pval(pval, one_minus_pval)
    selected_dl = zscore > thr_nc  # use the "no clustering" threshold

    # compute estimated support (second method)
    selected_dl = np.logical_or(
        pval_corr < fwer_target / 2, one_minus_pval_corr < fwer_target / 2
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    'max_iter' has been increased to 8000

.. GENERATED FROM PYTHON SOURCE LINES 258-261

Now, we compute the support estimated using a clustered inference
algorithm (cf. :ref:`References`) called Clustered Desparsified Lasso
(CluDL), since it applies the Desparsified Lasso technique after
clustering the data.

.. GENERATED FROM PYTHON SOURCE LINES 261-283

.. code-block:: Python

    # Define the FeatureAgglomeration object that performs the clustering.
    # This object is necessary to run the current algorithm and the following one.
    connectivity = image.grid_to_graph(n_x=shape[0], n_y=shape[1])
    ward = FeatureAgglomeration(
        n_clusters=n_clusters, connectivity=connectivity, linkage="ward"
    )

    # clustered desparsified lasso (CluDL)
    beta_hat, pval, pval_corr, one_minus_pval, one_minus_pval_corr = clustered_inference(
        X_init, y, ward, n_clusters
    )

    # compute estimated support (first method)
    zscore = zscore_from_pval(pval, one_minus_pval)
    selected_cdl = zscore > thr_c  # use the "clustering" threshold

    # compute estimated support (second method)
    selected_cdl = np.logical_or(
        pval_corr < fwer_target / 2, one_minus_pval_corr < fwer_target / 2
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 0
    Clustered inference {} hd_inference False {}
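Since the theoretical tolerance ``delta`` is tied to the cluster
diameters, it can be instructive to inspect the geometry of the clusters
produced by the ``ward`` object. The sketch below (an addition to the
original example) fits a copy of the clustering object, whose fitted
``labels_`` attribute assigns each pixel to a cluster, and computes a
bounding-box diameter for each cluster:

.. code-block:: Python

    from sklearn.base import clone

    # Fit a copy so that the original `ward` object is left untouched.
    ward_fitted = clone(ward).fit(X_init)
    labels = ward_fitted.labels_.reshape(shape)

    # Bounding-box diameter (in pixels) of each cluster on the 2D grid.
    diameters = []
    for k in range(n_clusters):
        rows, cols = np.where(labels == k)
        diameters.append(max(rows.max() - rows.min(), cols.max() - cols.min()) + 1)
    print(max(diameters), np.mean(diameters))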
.. GENERATED FROM PYTHON SOURCE LINES 284-289

Finally, we compute the support estimated by an ensembled clustered
inference algorithm (cf. :ref:`References`). This algorithm is called
Ensemble of Clustered Desparsified Lasso (EnCluDL) since it runs several
CluDL algorithms with different clustering choices. The different CluDL
solutions are then aggregated into one.

.. GENERATED FROM PYTHON SOURCE LINES 289-300

.. code-block:: Python

    # ensemble of clustered desparsified lasso (EnCluDL)
    beta_hat, pval, pval_corr, one_minus_pval, one_minus_pval_corr = (
        ensemble_clustered_inference(X_init, y, ward, n_clusters, train_size=0.3)
    )

    # compute estimated support
    selected_ecdl = np.logical_or(
        pval_corr < fwer_target / 2, one_minus_pval_corr < fwer_target / 2
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 0
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 1
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 2
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 3
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 4
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 5
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 6
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 7
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 8
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 9
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 10
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 11
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 12
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 13
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 14
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 15
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 16
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 17
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 18
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 19
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 20
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 21
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 22
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 23
    Clustered inference {} hd_inference False {}
    Clustered inference: n_clusters = 200, inference method = desparsified-lasso, seed = 24
    Clustered inference {} hd_inference False {}
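Before plotting, we can also quantify the supports numerically. The
sketch below (an addition to the original example) counts, for each
method, how many features are selected and how many of those fall outside
the delta-extended true support, i.e., beyond the spatial tolerance:

.. code-block:: Python

    # features belonging to the true support extended by the tolerance
    tolerance_region = beta_extended > 0

    for name, selected in [
        ("Desparsified Lasso", selected_dl),
        ("CluDL", selected_cdl),
        ("EnCluDL", selected_ecdl),
    ]:
        n_selected = int(np.sum(selected))
        n_outside = int(np.sum(np.logical_and(selected, ~tolerance_region)))
        print(f"{name}: {n_selected} selected, {n_outside} outside tolerance")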
.. GENERATED FROM PYTHON SOURCE LINES 301-306

Results
-------

Now we plot the true support, the theoretical tolerance regions and the
estimated supports for every method.

.. GENERATED FROM PYTHON SOURCE LINES 306-330

.. code-block:: Python

    maps = []
    titles = []

    maps.append(np.reshape(beta, shape))
    titles.append("True weights")

    maps.append(np.reshape(beta_extended, shape))
    titles.append("True weights \nwith tolerance")

    maps.append(np.reshape(selected_dl, shape))
    titles.append("Desparsified Lasso")

    maps.append(None)
    titles.append(None)

    maps.append(np.reshape(selected_cdl, shape))
    titles.append("CluDL")

    maps.append(np.reshape(selected_ecdl, shape))
    titles.append("EnCluDL")

    plot(maps, titles)

.. image-sg:: /auto_examples/images/sphx_glr_plot_2D_simulation_example_001.png
   :alt: True weights, True weights with tolerance, Desparsified Lasso, CluDL, EnCluDL
   :srcset: /auto_examples/images/sphx_glr_plot_2D_simulation_example_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 331-344

Analysis of the results
-----------------------
As argued in the first section of this example, standard inference
methods that do not compress the problem dramatically lack power. The
support estimated by CluDL provides a more reasonable solution, since we
recover all four regions. However, the shape of the estimated support is
a bit rough (as it is bound to a sub-optimal clustering). Finally, the
solution provided by EnCluDL is more accurate, since the shape of its
estimated support is closer to the true support. One can also note that
the theoretical spatial tolerance is quite conservative: in practice,
type-1 error guarantees seem to hold for a lower spatial tolerance. This
is an additional benefit of clustering randomization.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minutes 4.594 seconds)

**Estimated memory usage:** 257 MB

.. _sphx_glr_download_auto_examples_plot_2D_simulation_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_2D_simulation_example.ipynb <plot_2D_simulation_example.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_2D_simulation_example.py <plot_2D_simulation_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_2D_simulation_example.zip <plot_2D_simulation_example.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_