Notebooks

Post-Model-Selection Inference

Last update: 08 Dec 2024 11:59
First version: 14 August 2019

Model selection, in statistics, means using your data to pick the correct statistical model, or at least a good one. Often we're interested in doing statistical inference with the selected model --- we might want to know confidence sets for parameters (or functions), we might want to attach measures of uncertainty to its predictions, etc. The difficulty is that we usually calculate the properties of our inferential procedures on the assumption of a fixed model, as though the right model were communicated to us by the angels. When instead it's something we've selected using the data, there are going to be problems.

The easiest way to see this may be to reflect that our data are random (that's why we're doing statistics), so which model we get from our model selection is also random (at least a little), and this will create correlations between the selected model and the outputs of statistical tests. If we're doing regression and we've used model selection to pick which variables are included as regressors, of course the selected variables are going to look significant on the data we used to pick them! (Thus the classic Freedman, 1983, which never fails to make a mind-blowing assignment for undergrads.) The whole rest of this subject is essentially refining this basic observation.
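
To make the screening point concrete, here is a minimal simulation sketch in the spirit of the Freedman example. Everything specific in it is my assumption for illustration: the sample size, the number of pure-noise regressors, the two cutoffs (which use a normal approximation rather than exact t quantiles), and the hypothetical function names `ols_t_stats` and `screening_experiment`. The response is generated independently of every regressor, so any "significance" after screening comes from the selection itself.

```python
import numpy as np


def ols_t_stats(X, y):
    """OLS t-statistics for the columns of X (an intercept is fitted but not reported)."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - Xd.shape[1])  # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return (beta / se)[1:]


def screening_experiment(n=100, p=50, screen_cut=1.15, test_cut=1.96,
                         reps=100, seed=0):
    """Fraction of screened-in noise regressors that look significant on refit.

    Assumed setup: y is independent of all p regressors. Variables with
    |t| > screen_cut in the full regression are kept, the regression is
    refitted on the SAME data, and |t| > test_cut counts as significant.
    """
    rng = np.random.default_rng(seed)
    kept = significant = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # pure noise, unrelated to X
        keep = np.abs(ols_t_stats(X, y)) > screen_cut
        if not keep.any():
            continue
        t_refit = ols_t_stats(X[:, keep], y)
        kept += int(keep.sum())
        significant += int((np.abs(t_refit) > test_cut).sum())
    return significant / kept


if __name__ == "__main__":
    print(screening_experiment())
```

If the tests kept their nominal properties, the printed fraction would sit near 0.05; under these assumptions it should come out well above that, even though no regressor has any relationship to the response.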

One direction of refinement is to try to develop new inferential procedures, more or less approximate, which can compensate for the fact that our model was picked in a data-dependent way. This is most of what gets called "post-selection inference" or "post-model-selection inference" or "selective inference". There is a lot of intricate theory here, often relying on clever mathematical understanding of specific selection procedures and how they interact with specific assumptions about the data-generating process.

The other direction is to attack the problem at its root: using the same data for selection and inference creates correlations between them, so use different data for selection and inference. This gets called "data splitting" or "sample splitting". It's easy to do for IID data --- divide your data set, at random, into two parts, do your selection on one part, and then do the inference on the other, with no cross-contamination. (This is close to, but not quite, cross-validation.) Because the two parts are independent, the selected model is independent of the contents of the inference set, hence the usual procedures work with their usual properties. Problem solved.
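
A minimal sketch of sample splitting for IID data, set beside the naive same-data approach, might look like the following. The selection rule (keep the `k` regressors most correlated with the response), the sample sizes, the normal-approximation cutoff, and the hypothetical function names are all my assumptions for illustration; any selection procedure could stand in for `top_k_by_correlation`. As before, the response is pure noise, so every rejection is a false one.

```python
import numpy as np


def t_stats(X, y):
    """OLS t-statistics for the columns of X (an intercept is fitted but not reported)."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - Xd.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return (beta / se)[1:]


def top_k_by_correlation(X, y, k):
    """Indices of the k columns of X with the largest |sample correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]


def rejection_rates(n=200, p=50, k=10, crit=1.96, reps=200, seed=0):
    """Return (naive_rate, split_rate): false-rejection rates under a pure-noise null."""
    rng = np.random.default_rng(seed)
    naive = split = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # independent of X

        # Naive: select and test on the same data.
        chosen = top_k_by_correlation(X, y, k)
        naive += int((np.abs(t_stats(X[:, chosen], y)) > crit).sum())

        # Split: random halves, select on one, test on the other.
        perm = rng.permutation(n)
        sel, inf = perm[: n // 2], perm[n // 2:]
        chosen = top_k_by_correlation(X[sel], y[sel], k)
        split += int((np.abs(t_stats(X[inf][:, chosen], y[inf])) > crit).sum())
    total = reps * k
    return naive / total, split / total


if __name__ == "__main__":
    naive_rate, split_rate = rejection_rates()
    print(f"naive: {naive_rate:.3f}   split: {split_rate:.3f}")
```

Under these assumptions the split rate should land near the nominal 0.05, while the naive rate should be several times larger. The price of splitting is that the inference step sees only half the data.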

Sample splitting is a simple, radical, almost a-theoretical way to solve the problem of post-selection inference, and as such it appeals to my temperament. (This is why two of my students wrote their dissertations, in part, on how to extend it to dependent data, where, alas, theory and subtlety re-enter.) With all sincere respect to those working heroically on what I called the first direction, I honestly don't know why the sample-splitting approach isn't the default we all use.

