Post-Model-Selection Inference
14 Aug 2019 15:48
Yet Another Inadequate Placeholder
Model selection, in statistics, means using your data to pick the correct statistical model, or at least a good one. Often we're interested in doing statistical inference with the selected model --- we might want to know confidence sets for parameters (or functions), we might want to attach measures of uncertainty to its predictions, etc. The difficulty is that we usually calculate the properties of our inferential procedures on the assumption of a fixed model, as though the right model were communicated to us by the angels. When instead it's something we've selected using the data, there are going to be problems.
The easiest way to see this may be to reflect that our data are random (that's why we're doing statistics), so which model we get from our model selection is also random (at least a little), and this will create correlations between the selected model and the outputs of statistical tests. If we're doing regression and we've used model selection to pick which variables are included as regressors, of course the selected variables are going to look significant on the data we used to pick them! (Thus the classic Freedman, 1983, which never fails to make a mind-blowing assignment for undergrads.) The whole rest of this subject is essentially refining this basic observation.
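A small simulation makes the point concrete. The sketch below is loosely modelled on the kind of experiment Freedman describes, but the specific sizes (100 observations, 50 predictors), the 25% screening level, the 5% test level, the normal approximation to the t distribution, and all function names are assumptions made for illustration. The response is pure noise, independent of every predictor, so any "significant" coefficient after screening is an artifact of selecting and testing on the same data.

```python
import math
import numpy as np


def ols_t_stats(X, y):
    """t-statistics for the slopes of an OLS fit of y on X (intercept added here)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return (beta / np.sqrt(np.diag(cov)))[1:]  # drop the intercept


def approx_p_values(t):
    """Two-sided p-values from a normal approximation to the t distribution."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])


def screening_experiment(rng, n=100, p=50, screen_level=0.25, test_level=0.05):
    """Screen pure-noise predictors, refit on the SAME data, count 'significant' ones."""
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # y is independent of every column of X
    keep = approx_p_values(ols_t_stats(X, y)) < screen_level
    if not keep.any():
        return 0, 0
    p_refit = approx_p_values(ols_t_stats(X[:, keep], y))
    return int(keep.sum()), int((p_refit < test_level).sum())


def significant_share(n_reps=100, seed=0):
    """Fraction of screened-in noise variables that pass the 5% test on refitting."""
    rng = np.random.default_rng(seed)
    kept = sig = 0
    for _ in range(n_reps):
        k, s = screening_experiment(rng)
        kept += k
        sig += s
    return sig / kept


if __name__ == "__main__":
    print(f"share of selected noise variables significant at 5%: {significant_share():.2f}")
```

If the tests were valid, roughly 5% of the retained variables would pass; under these assumptions the share should come out well above that, even though nothing in the data is related to anything else.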
One direction of refinement is to try to develop new inferential procedures, more or less approximate, which can compensate for the fact that our model was picked in a data-dependent way. This is most of what gets called "post-selection inference" or "post-model-selection inference" or "selective inference". There is a lot of intricate theory here, often relying on clever mathematical understanding of specific selection procedures and how they interact with specific assumptions about the data-generating process.
The other direction is to attack the problem at its root: using the same data for selection and inference creates correlations between them, so use different data for selection and inference. This gets called "data splitting" or "sample splitting". It's easy to do for IID data --- divide your data set, at random, into two parts, do your selection on one part, and then do the inference on the other, with no cross-contamination. (This is close to, but not quite, cross-validation.) Because the two parts are independent, the selected model is independent of the contents of the inference set, hence the usual procedures work with their usual properties. Problem solved.
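The recipe for IID data can be sketched in a few lines. Everything specific here is an assumption for illustration: the selection rule (keep the `n_keep` predictors with the largest absolute marginal correlation with the response), the half-and-half split, the rough |t| > 1.96 cutoff, and the function names. Any other selection rule could be used on the selection half, provided it never looks at the inference half.

```python
import numpy as np


def ols_t_stats(X, y):
    """t-statistics for the slopes of an OLS fit of y on X (intercept added here)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return (beta / np.sqrt(np.diag(cov)))[1:]  # drop the intercept


def split_select_infer(X, y, rng, n_keep=5):
    """Select variables on one random half, compute t-statistics on the other half."""
    n = X.shape[0]
    perm = rng.permutation(n)
    sel_idx, inf_idx = perm[: n // 2], perm[n // 2:]

    # Selection half: rank predictors by a score proportional to |correlation with y|.
    Xs = X[sel_idx] - X[sel_idx].mean(axis=0)
    ys = y[sel_idx] - y[sel_idx].mean()
    scores = np.abs(Xs.T @ ys) / np.sqrt((Xs ** 2).sum(axis=0))
    chosen = np.sort(np.argsort(scores)[-n_keep:])

    # Inference half: ordinary OLS on the chosen columns, using untouched rows only.
    t = ols_t_stats(X[inf_idx][:, chosen], y[inf_idx])
    return chosen, t


def null_rejection_rate(n_reps=200, n=200, p=50, seed=0):
    """With pure-noise data, how often do selected variables look significant?"""
    rng = np.random.default_rng(seed)
    rejections = tests = 0
    for _ in range(n_reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # independent of X, so every rejection is false
        _, t = split_select_infer(X, y, rng)
        rejections += int((np.abs(t) > 1.96).sum())
        tests += t.size
    return rejections / tests


if __name__ == "__main__":
    print(f"false rejection rate on the held-out half: {null_rejection_rate():.3f}")
```

In this pure-noise setting the rejection rate on the inference half should sit close to the nominal 5%, which is the sense in which the usual procedures keep their usual properties. The price is that each stage sees only half the data.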
Sample splitting is a simple, radical, almost a-theoretical way to solve the problem of post-selection inference, and as such it appeals to my temperament. (This is why two of my students wrote their dissertations, in part, on how to extend it to dependent data, where, alas, theory and subtlety re-enter.) With all sincere respect to those working heroically on what I called the first direction, I honestly don't know why the sample-splitting approach isn't the default we all use.
- Recommended (including by reference recommendations listed under model selection):
  - Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao, "Valid post-selection inference", Annals of Statistics 41 (2013): 802--837
  - Julian J. Faraway
    - "On the Cost of Data Analysis", Journal of Computational and Graphical Statistics 1 (1992): 213--229 [PDF preprint]
    - "Does Data Splitting Improve Prediction?", Statistics and Computing 26 (2016): 49--60, arxiv:1301.2983
  - William Fithian, Dennis Sun, Jonathan Taylor, "Optimal Inference After Model Selection", arxiv:1410.2597
  - David A. Freedman, "A Note on Screening Regression Equations", The American Statistician 37 (1983): 152--155
  - Jason D. Lee, Dennis L. Sun, Yuekai Sun, Jonathan E. Taylor, "Exact post-selection inference, with application to the lasso", arxiv:1311.6238
  - Hannes Leeb
    - "Conditional Predictive Inference Post Model Selection", Annals of Statistics 37 (2009): 2838--2876, arxiv:0908.3615
    - "Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process", Bernoulli 14 (2008): 661--690, arxiv:0802.3364
  - Hannes Leeb and Benedikt M. Pötscher
  - Alessandro Rinaldo, Larry Wasserman, Max G'Sell, Jing Lei, "Bootstrapping and Sample Splitting For High-Dimensional, Assumption-Free Inference", arxiv:1611.05401 [Disclaimer: All colleagues and friends]
- Pride compels me to recommend:
  - Robert Lunde, Bootstrapping and Sample Splitting Under Weak Dependence [Ph.D. thesis, CMU Statistics, 2018]
  - Lawrence Wang, Network Comparisons using Sample Splitting [Ph.D. thesis, CMU Statistics, 2016]
- To read:
  - Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, Christian Hansen, "Program Evaluation and Causal Inference with High-Dimensional Data", Econometrica 85 (2017): 233--298, arxiv:1311.2645
  - Yoav Benjamini, Marina Bogomolov, "Adjusting for selection bias in testing multiple families of hypotheses", arxiv:1106.3670
  - Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research 11 (2010): 2079--2107
  - Victor Chernozhukov, Christian Hansen, Martin Spindler, "Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach", Annual Review of Economics 7 (2015): 649--688, arxiv:1501.03430
  - Karl Ewald, Ulrike Schneider, "Uniformly Valid Confidence Sets Based on the Lasso", Electronic Journal of Statistics 12 (2018): 1358--1387, arxiv:1507.05315
  - Paul Kabaila and Khageswor Giri, "Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection", arxiv:0711.0993
  - Benedikt M. Pötscher
  - Yoshikazu Terada, Hidetoshi Shimodaira, "Selective inference after variable selection via multiscale bootstrap", arxiv:1905.10573 [I presume they have an answer to "why not just use sample splitting?"]
  - Xiaoying Tian, Jonathan Taylor, "Asymptotics of selective inference", arxiv:1501.03588