Model Selection

23 Jun 2016 01:41

(Reader, please make your own suitably awful pun about the different senses of "model selection" here, as a discouragement to those finding this page through prurient searching. Thank you.)

In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references...

Classification of approaches to model selection (probably not really exhaustive but I can't think of others, right now):

Direct optimization of some measure of goodness of fit or risk on training data.
Seems implicit in a lot of work which points to marginal improvements in "the proportion of variance explained", mis-classification rates, "perplexity", etc. Often, also, a recipe for over-fitting and chasing snarks. What's wanted is (almost always) some way of measuring the ability to generalize to new data, and in-sample performance is a biased estimate of this. Still, with enough data, if the gods of ergodicity are kind, in-sample performance is representative of generalization performance, so perhaps this will work asymptotically, though in many cases the researcher will never even glimpse Asymptopia across the Jordan.
Optimize fit with model-dependent penalty
Add on a term to each model which supposedly indicates its ability to over-fit. (Adjusted R^2, AIC, BIC, ..., all do this in terms of the number of parameters.) Sounds reasonable, but I wonder how many actually work better, in practice, than direct optimization. (See Domingos for some depressing evidence on this score.)
Classical two-part minimum description length methods were penalties; I don't yet understand one-part MDL.
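A minimal sketch of parameter-counting penalties, assuming Gaussian noise and least-squares fits: it scores a constant-mean model and a straight line by AIC and BIC. The simulated data, the two candidate models, and all function names are illustrative assumptions, not anything taken from the references.

```python
import math
import random

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, noise variance set to its MLE (RSS/n)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k log n - 2 log L (smaller is better)."""
    return k * math.log(n) - 2 * log_lik

def constant_residuals(xs, ys):
    """Residuals from fitting a constant mean."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]

def line_residuals(xs, ys):
    """Residuals from an ordinary least-squares straight line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Simulated data (an assumption for illustration): a true linear trend plus noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
n = len(xs)

# k counts every fitted parameter, including the noise variance.
candidates = [("constant", constant_residuals, 2), ("line", line_residuals, 3)]
scores = {}
for name, residual_fn, k in candidates:
    log_lik = gaussian_log_likelihood(residual_fn(xs, ys))
    scores[name] = {"aic": aic(log_lik, k), "bic": bic(log_lik, k, n)}

for name, s in scores.items():
    print(f"{name:8s}  AIC = {s['aic']:8.2f}  BIC = {s['bic']:8.2f}")
```

Both criteria take the in-sample log-likelihood and add a charge per parameter; they differ only in how big the charge is (2 for AIC, log n for BIC).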
Penalties which depend on the model class
Measure the capacity of a class of models to over-fit; penalize all models in that class accordingly, regardless of their individual properties. Outstanding example: Vapnik's "structural risk minimization" (provably consistent under some circumstances). Only sporadically coincides with *IC-type penalties based on the number of parameters.
Cross-validation
Estimate the ability to generalize to different data by, in fact, using different data. Maybe the "industry standard" of machine learning. Query, how are we to know how much different data to use?
Query, how are we to cross-validate when we have complex, relational data? That is, I understand how to do it for independent samples, and I even understand how to do it for time series, but I do not understand how to do it for networks, and I don't think I am alone in this. (Well, I understand how to do it for Erdős-Rényi networks, because that's back to independent samples...)
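For the easy case of independent samples, k-fold cross-validation can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the two candidate models (a constant and a least-squares line) and the helper names are hypothetical, and the number of folds `k` is exactly the "how much different data" choice the query above leaves open. Nothing here addresses time series or networks.

```python
import random

def fit_constant(xs, ys):
    """Return a predictor that always outputs the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Return an ordinary least-squares straight-line predictor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x

def k_fold_mse(xs, ys, fit, k=5, seed=0):
    """Mean squared prediction error over k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # disjoint, covers every point once
    total_squared_error = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total_squared_error += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return total_squared_error / len(xs)

# Simulated independent samples (an assumption for illustration).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

cv_scores = {
    "constant": k_fold_mse(xs, ys, fit_constant),
    "line": k_fold_mse(xs, ys, fit_line),
}
for name, mse in cv_scores.items():
    print(f"{name:8s}  held-out MSE = {mse:.3f}")
```

The random shuffle is what quietly assumes independence: any fold is treated as exchangeable with any other, which is precisely what fails for dependent data.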
The method of sieves
Directly optimize the fit, but within a constrained class of models; relax the constraint as the amount of data grows. If the constraint is relaxed slowly enough, should converge on the truth. (Ordinary parametric inference, within a single model class, is a limiting case where the constraint is relaxed infinitely slowly, and we converge on the pseudo-truth within that class [provided we have a consistent estimator].)
Encompassing models
The sampling distribution of any estimator of any model class is a function of the true distribution. If the true model class has been well-estimated, it should be able to predict what other, wrong model classes will estimate, but not vice versa. In this sense the true model class "encompasses the predictions" of the wrong ones. ("Truth is the criterion both of itself and of error.")
General or covering models
Come up with a single model class which includes all the interesting model classes as special cases; do ordinary estimation within it. Getting a consistent estimator of the additional parameters this introduces is often non-trivial, and interpretability can be a problem.
Model averaging
Don't try to pick the best or correct model; use them all with different weights. Choose the weighting scheme so that if one is best, it will tend to be more and more influential. Often I think the improvement is not so much from using multiple models as from smoothing, since estimates of the single best model are going to be more noisy than estimates of a bunch of models which are all pretty good. (This leads to ensemble methods.)
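One weighting scheme with the "more and more influential" property is exponential weighting by cumulative loss; this is a sketch of that one scheme, not of model averaging in general. The per-observation losses, the learning rate `eta`, and the function names below are all made-up assumptions for illustration.

```python
import math

def exponential_weights(cumulative_losses, eta=1.0):
    """Weights proportional to exp(-eta * cumulative loss), normalized to sum to 1."""
    best = min(cumulative_losses)  # subtract the minimum for numerical stability
    unnormalized = [math.exp(-eta * (loss - best)) for loss in cumulative_losses]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def averaged_prediction(predictions, weights):
    """Weighted combination of the individual models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical average loss per observation for three models; the first is best.
per_observation_loss = [1.0, 1.1, 1.5]

# As the sample grows, the cumulative-loss gaps grow, and the best model dominates.
weights_by_n = {}
for n in (10, 100, 1000):
    cumulative = [n * loss for loss in per_observation_loss]
    weights_by_n[n] = exponential_weights(cumulative, eta=0.5)
    print(n, [round(w, 3) for w in weights_by_n[n]])

# Combining three hypothetical point predictions with the small-sample weights.
combined = averaged_prediction([2.0, 2.2, 3.0], weights_by_n[10])
```

With little data the weights stay spread out, so the combined prediction is a smoothed compromise; with a lot of data the scheme effectively selects a single model.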
Adequacy testing
The correct model should be able to encode the data as uniform IID noise. Test whether "residuals", in the appropriate sense, are IID uniform. Reject models which can't hack it. Possibly none of the models on offer is adequate; this, too, is informative. Or: models make specific probabilistic assumptions (IID Gaussian noise, for example); test those. Mis-specification testing.
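A minimal sketch of the uniform-residuals check, assuming fully specified candidate models with no parameters estimated from the same data: push each observation through the model's CDF (the probability integral transform) and compare the result to the uniform distribution with a Kolmogorov-Smirnov statistic. The simulated data, the two candidate Gaussians, and the helper names are assumptions for illustration; the cutoff 1.36/sqrt(n) is the usual large-sample 5% approximation. This checks uniformity only, not independence.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic_uniform(values):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    ordered = sorted(values)
    n = len(ordered)
    return max(
        max((i + 1) / n - u, u - i / n)  # gaps above and below the empirical CDF
        for i, u in enumerate(ordered)
    )

# Simulated data (an assumption for illustration): standard Gaussian draws.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(500)]

# Approximate 5% critical value for the KS statistic at large n.
critical_value = 1.36 / math.sqrt(len(data))

# Two fully specified candidates: one matches the generator, one has the wrong spread.
candidate_models = {"N(0, sd=1)": (0.0, 1.0), "N(0, sd=3)": (0.0, 3.0)}

results = {}
for name, (mu, sigma) in candidate_models.items():
    transformed = [normal_cdf(x, mu, sigma) for x in data]
    distance = ks_statistic_uniform(transformed)
    results[name] = (distance, distance > critical_value)

for name, (distance, rejected) in results.items():
    print(f"{name}: KS distance = {distance:.3f}, rejected = {rejected}")
```

Nothing forces any candidate to pass; if every model on the list is rejected, that is a result in itself, as noted above.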

The machine-learning-ish literature on model selection doesn't seem to ever talk about setting up experiments to select among models; or do I just not read the right papers there? (The statistical literature on experimental design seems to tend to talk about "model discrimination" rather than "model selection".)

See also: Information Theory; The Minimum Description Length Principle; Occam's Razor