Variable or Feature Selection for Regression and Classification

Last update: 21 Apr 2025 21:17
First version: 14 August 2019

"Variable selection" tends to be the statisticians' name; people from data mining talk about "feature selection". The idea is the same: given some big pool of variables (or features) which could be used as inputs to predict some target feature (or variable), which ones actually go in to the predictor? Say "all" can lead to computational issues, and also statistical ones. (If you throw in lots of variables which don't matter, at finite sample sizes you'll have degraded inferences about how, and how much, all the variables matter. Even if you don't care about inferring parameters [or response functions, etc.], the predictions will suffer, because you'll be over-fitting.) Of course, sometimes every available feature really does matter. (This is, annoyingly, especially likely to be the case when the features are pre-selected on the basis of strong subject-matter knowledge.)

All of this is a special case of model selection, so I incorporate all the comments, and the recommended readings, in that notebook by reference.

--- An important model selection problem which is also sometimes called "variable selection" is deciding which nodes in a graphical model are immediately connected. (This is very important, for instance, in causal model discovery.) The obvious way to go about doing this is to run variable selection (in this sense) for each node variable. This may work, but may also not be what we want, since the goal is direct (and sometimes directed!) connections. I defer those references to the notebooks just linked to.

See also: Post-Model-Selection Inference; Regression

model selection

Genevera I. Allen, "KNIFE: Kernel Iterative Feature Extraction", arxiv:0906.4391
Leo Breiman and Philip Spector, "Submodel Selection and Evaluation in Regression: The X-Random Case", International Statistical Review 60 (1992): 291--319 [JSTOR]
Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján, "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection", Journal of Machine Learning Research 13 (2012): 27--66
Peter Bühlmann, M. Kalisch and M. H. Maathuis, "Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm", Biometrika 97 (2010): 261--278
Peter Bühlmann and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications [State-of-the-art (2011) compendium of what's known about using the Lasso, and related methods, for model selection. Mini-review]
Pascal Lavergne and Quang H. Vuong, "Nonparametric Selection of Regressors: The Nonnested Case", Econometrica 64 (1996): 207--219 [Picking which variables belong in a regression, by looking at the error of non-parametric kernel regressions. JSTOR]
Nicolai Meinshausen and Peter Bühlmann, "Stability Selection", arxiv:0809.2932
Wesley Tansey, Victor Veitch, Haoran Zhang, Raul Rabadan, David M. Blei, "The Holdout Randomization Test: Principled and Easy Black Box Feature Selection", arxiv:1811.00645 [This is a brilliant little paper. I have some reservations about the way they use importance sampling to improve power --- I am not at all sure that getting the upper and lower bounds on the importance weights they need is really that much more feasible than just improving your estimate of the conditional density --- but that's a refinement.]

Francis Bach
- "Model-Consistent Sparse Estimation through the Bootstrap", arxiv:0901.3202 ["if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection" --- compare with the "stability selection" of Meinshausen and Buhlmann]
- "High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning", arxiv:0909.0844
Rina Foygel Barber, Emmanuel J. Candès, Richard J. Samworth, "Robust inference with knockoffs", arxiv:1801.03896
Mario Beraha, Alberto Maria Metelli, Matteo Papini, Andrea Tirinzoni, Marcello Restelli, "Feature Selection via Mutual Information: New Theoretical Insights", arxiv:1907.07384
Justin Bleich, Adam Kapelner, Edward I. George, Shane T. Jensen, "Variable selection for BART: An application to gene regulation", Annals of Applied Statistics 8 (2014): 1750--1781, arxiv:1310.4887
Kasper Brink-Jensen, Claus Thorn Ekstrom, "Inference for feature selection using the Lasso with high-dimensional data", arxiv:1403.4296
Tom Burr, Herb Fry, Brian McVey, Eric Sander, Joseph Cavanaugh and Andrew Neath, "Performance of Variable Selection Methods in Regression Using Variations of the Bayesian Information Criterion", Communications in Statistics - Simulation and Computation 37 (2008): 507--520
Xin Chen, Changliang Zou, and R. Dennis Cook, "Coordinate-independent sparse sufficient dimension reduction and variable selection", Annals of Statistics 38 (2010): 3696--3723
Laëtitia Comminges and Arnak S. Dalalyan, "Tight conditions for consistency of variable selection in the context of high dimensionality", Annals of Statistics 40 (2012): 2667--2696
Laurie Davies, Lutz Dümbgen, "A Model-free Approach to Linear Least Squares Regression with Exact Probabilities and Applications to Covariate Selection", arxiv:1906.01990
Jianqing Fan and Runze Li, "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties", Journal of the American Statistical Association 96 (2001): 1348--1360 [PDF reprint via Prof. Fan]
Jianqing Fan, Richard Samworth, Yichao Wu, "Ultrahigh dimensional variable selection: beyond the linear model", arxiv:0812.3201
Mladen Kolar, Han Liu, "Optimal Feature Selection in High-Dimensional Discriminant Analysis", arxiv:1306.6557 [Heard the talk...]
Nicole Kraemer, "On the Peaking Phenomenon of the Lasso in Model Selection", arxiv:0904.4416
Pascal Lavergne, Samuel Maistre, Valentin Patilea, "A Significance Test for Covariates in Nonparametric Regression", arxiv:1403.7063
Hugh Miller and Peter Hall, "Local polynomial regression and variable selection", arxiv:1006.3342
Martin Wahl, "Variable selection in high-dimensional additive models based on norms of projections", arxiv:1406.0052
Adriano Zanin Zambom, Michael G. Akritas, "Significance Testing and Group Variable Selection", arxiv:1205.6843
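
The bootstrap-and-intersect idea quoted in the Bach entry above can be sketched in a few lines. This is only an illustration under my own assumptions, not the paper's procedure or tuning: the Lasso is solved by a hand-rolled coordinate descent (`lasso_cd`, a hypothetical helper), and the penalty level `lam`, the number of bootstrap replications, and the simulated data are all arbitrary choices.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding gives exact zeros for weak features.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def bolasso_support(X, y, lam, n_boot, rng):
    """Intersect the Lasso supports over bootstrap resamples of the rows."""
    n, p = X.shape
    support = set(range(p))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        beta = lasso_cd(X[idx], y[idx], lam)
        support &= {int(j) for j in np.flatnonzero(beta != 0.0)}
    return support


rng = np.random.default_rng(0)
n, p = 200, 10
beta_true = np.zeros(p)
beta_true[:3] = 2.0  # only the first three features matter (an assumption)
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
selected = bolasso_support(X, y, lam=0.1, n_boot=32, rng=rng)
print(sorted(selected))
```

A feature survives only if the Lasso picks it in every replication, so strong features stay while features selected by chance in some resamples tend to get dropped. Stability selection, by contrast, keeps features whose selection frequency across subsamples exceeds a threshold rather than requiring unanimity.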