Model Selection
04 May 2023 11:53
(Reader, please make your own suitably awful pun about the different senses of "model selection" here, as a discouragement to those finding this page through prurient searching. Thank you.)
In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references...
Classification of approaches to model selection (probably not really exhaustive, but I can't think of others right now):
- Direct optimization of some measure of goodness of fit or risk on training data.
- Seems implicit in a lot of work which points to marginal improvements in "the proportion of variance explained", mis-classification rates, "perplexity", etc. Often, also, a recipe for over-fitting and chasing snarks. What's wanted is (almost always) some way of measuring the ability to generalize to new data, and in-sample performance is a biased estimate of this. Still, with enough data, if the gods of ergodicity are kind, in-sample performance is representative of generalization performance, so perhaps this will work asymptotically, though in many cases the researcher will never even glimpse Asymptopia across the Jordan.
- Optimize fit with model-dependent penalty
- Add on a term to each model which supposedly indicates its ability to over-fit. (Adjusted R^2, AIC, BIC, ..., all do this in terms of the number of parameters.) Sounds reasonable, but I wonder how many actually work better, in practice, than direct optimization. (See Domingos for some depressing evidence on this score.) A toy numerical sketch comparing a penalty of this kind with cross-validation appears after this list.
- Classical two-part minimum description length methods were penalties; I don't yet understand one-part MDL.
- Penalties which depend on the model class
- Measure the capacity of a class of models to over-fit; penalize all models in that class accordingly, regardless of their individual properties. Outstanding example: Vapnik's "structural risk minimization" (provably consistent under some circumstances). Only sporadically coincides with *IC-type penalties based on the number of parameters.
- Cross-validation
- Estimate the ability to generalize to different data by, in fact, using different data. Maybe the "industry standard" of machine learning. Query, how are we to know how much different data to use?
- Query, how are we to cross-validate when we have complex, relational data? That is, I understand how to do it for independent samples, and I even understand how to do it for time series, but I do not understand how to do it for networks, and I don't think I am alone in this. (Well, I understand how to do it for Erdos-Renyi networks, because that's back to independent samples...) [Update, August 2019: If you follow that last link, you will now see a bunch of references about CV for networks; I am happy that a lot of this work was done here in the CMU Statistics Department, and even happier that I didn't have to do it.]
- The method of sieves
- Directly optimize the fit, but within a constrained class of models; relax the constraint as the amount of data grows. If the constraint is relaxed slowly enough, should converge on the truth. (Ordinary parametric inference, within a single model class, is a limiting case where the constraint is relaxed infinitely slowly, and we converge on the pseudo-truth within that class [provided we have a consistent estimator].)
- Encompassing models
- The sampling distribution of any estimator of any model class is a function of the true distribution. If the true model class has been well-estimated, it should be able to predict what other, wrong model classes will estimate, but not vice versa. In this sense the true model class "encompasses the predictions" of the wrong ones. ("Truth is the criterion both of itself and of error.")
- General or covering models
- Come up with a single model class which includes all the interesting model classes as special cases; do ordinary estimation within it. Getting a consistent estimator of the additional parameters this introduces is often non-trivial, and interpretability can be a problem.
- Model averaging
- Don't try to pick the best or correct model; use them all with different weights. Choose the weighting scheme so that if one is best, it will tend to be more and more influential. Often I think the improvement is not so much from using multiple models as from smoothing, since estimates of the single best model are going to be more noisy than estimates of a bunch of models which are all pretty good. (This leads to ensemble methods.)
- Adequacy testing
- The correct model should be able to encode the data as uniform IID noise. Test whether "residuals", in the appropriate sense, are IID uniform. Reject models which can't hack it. Possibly none of the models on offer is adequate; this, too, is informative. Or: models make specific probabilistic assumptions (IID Gaussian noise, for example); test those. Mis-specification testing.
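By way of a concrete (toy) illustration of the contrast between penalized in-sample fit and cross-validation, here is a minimal sketch in Python, using only numpy. It chooses the degree of a polynomial regression once by a Gaussian AIC penalty and once by 5-fold cross-validation; the cubic data-generating process, the range of candidate degrees, and the number of folds are all arbitrary assumptions of the illustration, not anything taken from the references below.

```python
# Toy model selection: polynomial degree chosen by an AIC-type penalty
# versus by 5-fold cross-validation. All settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: cubic signal plus Gaussian noise (an assumed toy DGP)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=1.0, size=n)

def design(x, degree):
    """Polynomial design matrix with an intercept column."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_ols(X, y):
    """Ordinary least squares coefficients via lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def gaussian_aic(X, y):
    """AIC under IID Gaussian noise: n*log(RSS/n) + 2k,
    counting the noise variance as one of the k parameters."""
    beta = fit_ols(X, y)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1
    return len(y) * np.log(rss / len(y)) + 2 * k

def kfold_cv_mse(X, y, k_folds=5):
    """Average held-out squared error over k random folds."""
    n_obs = len(y)
    folds = np.array_split(rng.permutation(n_obs), k_folds)
    errs = []
    for fold in folds:
        train = np.ones(n_obs, dtype=bool)
        train[fold] = False
        beta = fit_ols(X[train], y[train])
        errs.append(np.mean((y[~train] - X[~train] @ beta) ** 2))
    return float(np.mean(errs))

degrees = range(1, 9)
aic_scores = {d: gaussian_aic(design(x, d), y) for d in degrees}
cv_scores = {d: kfold_cv_mse(design(x, d), y) for d in degrees}
print("degree chosen by AIC:      ", min(aic_scores, key=aic_scores.get))
print("degree chosen by 5-fold CV:", min(cv_scores, key=cv_scores.get))
```

With this much data the two criteria will usually agree on the cubic; the interesting (and much harder) cases are the ones the references below worry about, where the candidate set is large relative to the sample.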
The machine-learning-ish literature on model selection doesn't seem to ever talk about setting up experiments to select among models; or do I just not read the right papers there? (The statistical literature on experimental design seems to tend to talk about "model discrimination" rather than "model selection".) --- Update, 2023-05-04: That statement is now, thankfully, obsolete; there's definitely "active learning" work which does this, though how well it works is a different question.
Two technical issues spun off from here:
- Variable selection for regression models (and classifiers), i.e., picking which variables should go into the model as regressors.
- Post-model-selection inference, i.e., how having picked a model by model selection (rather than receiving the model directly from the angels) alters inferential statistics like hypothesis tests and confidence sets, and how to compensate. (A toy simulation of the distortion appears just below.)
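To make the post-selection point concrete, here is a small simulation sketch (an illustration with arbitrary settings, not anything drawn from the references below): generate ten pure-noise regressors, pick the one with the largest |t| statistic, and then report its usual 95% confidence interval as though the choice had been fixed in advance. The interval misses the true (zero) coefficient far more often than the nominal 5%.

```python
# Toy post-model-selection inference: naive confidence intervals
# under-cover after selecting the "best" of several noise regressors.
# All settings (n, p, number of replications) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, reps = 100, 10, 2000
misses = 0

for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                 # every true coefficient is zero
    betas, ses, tstats = [], [], []
    for j in range(p):                     # univariate OLS fits, no intercept
        xj = X[:, j]
        beta = (xj @ y) / (xj @ xj)
        resid = y - beta * xj
        se = np.sqrt((resid @ resid) / (n - 1)) / np.sqrt(xj @ xj)
        betas.append(beta)
        ses.append(se)
        tstats.append(beta / se)
    j_star = int(np.argmax(np.abs(tstats)))    # the "model selection" step
    crit = stats.t.ppf(0.975, df=n - 1)
    lo = betas[j_star] - crit * ses[j_star]    # naive 95% CI, ignoring selection
    hi = betas[j_star] + crit * ses[j_star]
    if not (lo <= 0.0 <= hi):
        misses += 1

print(f"naive 95% CI misses the truth in {100 * misses / reps:.1f}% of runs "
      "(nominal rate: 5%)")
```

With ten (near-)independent candidates the selected |t| statistic behaves like the maximum of ten, so the naive interval misses roughly 40% of the time; sample splitting is one simple way to get the coverage back, at a cost in efficiency.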
- See also:
- Information Theory
- The Minimum Description Length Principle
- Occam's Razor
- Random Time Changes for Stochastic Processes
- Recommended, big-picture:
- Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging [Review: How Can You Choose Just One?]
- Bruce E. Hansen, "Challenges for Econometric Model Selection", Econometric Theory 21 (2005): 60--68 ["Standard econometric model selection methods are based on four fundamental errors in approach: parametric vision, the assumption of a true [data-generating process], evaluation based on fit, and ignoring the impact of model uncertainty on inference. Instead, econometric model selection methods should be based on a semiparametric vision, models should be viewed as approximations, models should be evaluated based on their purpose, and model uncertainty should be incorporated into inference methods." PDF]
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- C. R. Rao and Y. Wu, with discussion by Sadanori Konishi and Rahul Mukerjee, "On Model Selection", in P. Lahiri (ed.), Model Selection, pp. 1--64 [Thorough review paper, if from a rather old-school statistical-theory perspective. The rest of the volume is too Bayesian to be of interest to me.]
- Brian Ripley, "Selecting Amongst Large Classes of Models" [Talk slides (PDF), but informative, chatty, and approvable]
- Recommended, close-ups:
- Alekh Agarwal, John C. Duchi, Peter L. Bartlett, Clement Levrard, "Oracle inequalities for computationally budgeted model selection" [COLT 2011]
- Sylvain Arlot
- "V-fold cross-validation improved: V-fold penalization", arxiv:0802.0566 [Seeing cross-validation as a penalization method, and improving it accordingly by strengthening the penalty term]
- "Model selection by resampling penalization", Electronic Journal of Statistics 3 (2009): 557--624, arxiv:0906.3124
- Pierre Alquier and Olivier Wintenberger, "Model selection and randomization for weakly dependent time series forecasting", Bernoulli 18 (2012): 883--913, arxiv:0902.2924
- A. C. Atkinson and A. N. Donev, Optimum Experimental Design
- Leo Breiman, "Heuristics of Instability and Stabilization in Model Selection," Annals of Statistics 24 (1996): 2350--2383
- Peter Bühlmann and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications [State-of-the art (2011) compendium of what's known about using the Lasso, and related methods, for model selection. Mini-review]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review. For avoiding model selection in favor of adaptively-weighted combinations of models.]
- Snigdhansu Chatterjee, Nitai D. Mukhopadhyay, "Risk and resampling under model uncertainty", arxiv:0805.3244 [an interesting approach to model averaging with provably good frequentist properties, via bootstrapping --- for a trivial linear-Gaussian problem; not clear to me how to generalize]
- D. R. Cox, "Tests of Separate Families of Hypotheses", Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1 (Univ. of Calif. Press, 1961), 105--123 [The origins of Cox's test for non-nested hypotheses]
- Pedro Domingos
- Marcus Hutter, "The Loss Rank Principle for Model Selection", math.ST/0702804 [This is a simplified form of Deborah Mayo's "severity".]
- Pascal Massart, Concentration Inequalities and Model Selection [Using empirical process theory to get finite-sample, i.e., non-asymptotic, risk bounds for various forms of model selection. Available for free as a large PDF preprint. Mini-review]
- Charles Mitchell and Sara van de Geer, "General Oracle Inequalities for Model Selection", Electronic Journal of Statistics 3 (2009): 176--204 [Analyzes a data-set splitting scheme (like cross-validation with only one "fold")]
- Douglas Rivers and Quang H. Vuong, "Model selection tests for nonlinear dynamic models", The Econometrics Journal 5 (2002): 1--39
- Aris Spanos, "Curve-Fitting, the Reliability of Inductive Inference and the Error-Statistical Approach", Philosophy of Science 74 (2007): 1046--1066 [PDF preprint]
- T. P. Speed and Bin Yu, "Model selection and prediction: Normal regression", Annals of the Institute of Statistical Mathematics 45 (1993): 35--54
- David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin and Angelika van der Linde, "Bayesian Measures of Model Complexity and Fit", Journal of the Royal Statistical Society B 64 (2002): 583--639 [PDF reprint. However, see Claeskens and Hjort, especially p. 92, for a discussion of how this just turns into the Takeuchi (= "model-robust" Akaike) IC in the large-sample limit.]
- Ryan J. Tibshirani, "Degrees of Freedom and Model Search", arxiv:1402.1920
- Sara van de Geer, Empirical Process Theory in M-Estimation
- V. N. (=Vladimir Naumovich) Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- Quang H. Vuong, "Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses", Econometrica 57 (1989): 307--333
- Jianming Ye, "On Measuring and Correcting the Effects of Data Mining and Model Selection", Journal of the American Statistical Association 93 (1998): 120--131
- Recommendation in limbo:
- George Casella and Guido Consonni, "Reconciling Model Selection and Prediction", arxiv:0903.3620 [This is about the issue of whether it is possible for one and the same model selection procedure to both be consistent and to have the (asymptotic) minimax rate for prediction error. The original demonstration that this is not possible is in the 2005 paper by Prof. Yuhong Yang (referenced below). Casella and Consonni try to characterize the set of counter-examples. They argue that this set, as they work it out, is "pathological". However, Prof. Yang informs me that their characterization rests on something of a mis-apprehension of the results in question, which can create a set of counter-examples for any procedure. This brings home to me that I do not yet know enough about this corner of the literature to have an opinion.]
- Modesty forbids me to recommend:
- Xiaoran Yan, Jacob E. Jensen, Florent Krzakala, Cristopher Moore, CRS, Lenka Zdeborova, Pan Zhang and Yaojia Zhu, "Model Selection for Degree-corrected Block Models", arxiv:1207.3994
- To read:
- Animashree Anandkumar, Vincent Y.F. Tan, Alan. S. Willsky, "High-Dimensional Gaussian Graphical Model Selection: Tractable Graph Families", arxiv:1107.1270
- Sylvain Arlot, "Choosing a penalty for model selection in heteroscedastic regression", arxiv:0812.3141
- Sylvain Arlot and Alain Celisse, "A survey of cross-validation procedures for model selection", Statistics Surveys 4 (2010): 40--79
- Sylvain Arlot and Pascal Massart, "Data-driven Calibration of Penalties for Least-Squares Regression", Journal of Machine Learning Research 10 (2009): 245--279
- Florent Autin, Erwan Le Pennec, Jean-Michel Loubes, Vincent Rivoirard, "Maxisets for Model Selection", arxiv:0802.4192 ["the maximal spaces (maxisets) where model selection procedures attain a given rate of convergence"]
- A. R. Baigorri, C. R. Goncalves, P. A. A. Resende, "Markov Chain Order Estimation and Relative Entropy", arxiv:0910.0264
- Maria Maddalena Barbieri and James O. Berger, "Optimal Predictive Model Selection", Annals of Statistics 32 (2004): 870--897, math.ST/0406464 [Unfortunately, Bayesian]
- Andrew Barron, Lucien Birgé, and Pascal Massart, "Risk bounds for model selection via penalization", Probability Theory and Related Fields 113 (1999): 301--413
- Lucien Birgé
- "The Brouwer Lecture 2005: Statistical estimation with model selection", math.ST/0605187
- "Model selection for Poisson processes", math/0609549
- Lucien Birgé and Pascal Massart
- "Minimal Penalties for Gaussian Model Selection", Probability Theory and Related Fields 138 (2007): 33--73
- "From model selection to adaptive estimation", pp. 55--87 in Pollard, Torgersen and Yang (eds.), Fetschrift for Lucien Le Cam: Research Papers in Probability and Statistics (1997)
- Gilles Blanchard, Olivier Bousquet, Pascal Massart, "Statistical performance of support vector machines", Annals of Statistics 36 (2008): 489--531, arxiv:0804.0551
- Borowiak, Model Discrimination for Nonlinear Regression Models
- Kenneth P. Burnham and David R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach [Here, "information-theoretic" just seems to mean "AIC"]
- Daniel R. Cavagnaro, Jay I. Myung, Mark A. Pitt and Janne V. Kujala, "Adaptive Design Optimization: A Mutual Information-Based Approach to Model Discrimination in Cognitive Science", Neural Computation 22 (2010): 887--905
- Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research 11 (2010): 2079--2107
- A. E. Clark and C. G. Troskie, "Time Series and Model Selection", Communications in Statistics: Simulation and Computation 37 (2008): 766--771 [Simulation study of the accuracy of different information criteria]
- Kevin A. Clarke, "A Simple Distribution-Free Test for Nonnested Hypotheses" [PDF preprint]
- Guilhem Coq, Olivier Alata, Marc Arnaudon and Christian Olivier, "An improved method for model selection based on Information Criteria", math.ST/0702540
- Michiel Debruyne, Mia Hubert, Johan A.K. Suykens, "Model Selection in Kernel Based Regression using the Influence Function", Journal of Machine Learning Research 9 (2008): 2377--2400
- Charanpal Dhanjal, Nicolas Baskiotis, Stéphan Clémençon and Nicolas Usunier, "An Empirical Comparison of V-fold Penalisation and Cross Validation for Model Selection in Distribution-Free Regression", arxiv:1212.1780
- Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar, "Particle Swarm Model Selection", Journal of Machine Learning Research 10 (2009): 405--440
- Robin J. Evans, "Model selection and local geometry", Annals of Statistics 48 (2020): 3513--3544, arxiv:1801.08364
- Frédéric Ferraty, Peter Hall, "An Algorithm for Nonlinear, Nonparametric Model Choice and Prediction", arxiv:1401.8097
- Magalie Fromont, "Model selection by bootstrap penalization for classification", Machine Learning 66 (2007): 165--207
- Elisabeth Gassiat, Ramon Van Handel, "Consistent order estimation and minimal penalties", arxiv:1002.1280
- Christophe Giraud, "Estimation of Gaussian graphs by model selection", arxiv:0710.2044
- Alexander Goldenshluger and Eitan Greenshtein, "Asymptotically minimax regret procedures in regression model selection and the magnitude of the dimension penalty", Annals of Statistics 28 (2000): 1620--1637 [Via Kevin Kelly.]
- Christian Gourieroux and Alain Monfort, "Testing, Encompassing, and Simulating Dynamic Econometric Models", Econometric Theory 11 (1995): 195--228 [JSTOR]
- Sonja Greven and Thomas Kneib, "On the behaviour of marginal and conditional AIC in linear mixed models", Biometrika 97 (2010): 773--789
- Jenny Häggström and Xavier de Luna, "Estimating Prediction Error: Cross-Validation vs. Accumulated Prediction Error", Communications in Statistics: Simulation and Computation 39 (2010): 880--898
- David F. Hendry and Jurgen A. Doornik, Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics
- Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid, "A Framework for Unbiased Model Selection Based on Boosting", Journal of Computational and Graphical Statistics forthcoming (2011)
- Xiaoming Huo, Xuelei (Sherry) Ni, "When do stepwise algorithms meet subset selection criteria?", Annals of Statistics 35 (2007): 870--887, arxiv:0708.2149
- Ching-Kang Ing, "Accumulated prediction errors, information criteria and optimal forecasting for autoregressive time series", Annals of Statistics 35 (2007): 1238--1277, arxiv:0708.2373
- Nicholas M. Kiefer and Hwan-Sik Choi, "Robust Model Selection in Dynamic Models with an Application to Comparing Predictive Accuracy" [SSRN]
- Sadanori Konishi and Genshiro Kitagawa, "Asymptotic theory for information criteria in model selection --- functional approach," Journal of Statistical Planning and Inference 114 (2003): 45--61
- Tri M. Le, Bertrand S. Clarke, "Model Averaging Is Asymptotically Better Than Model Selection For Prediction", Journal of Machine Learning Research 23 (2022): 33
- Matthieu Lerasle, "Optimal model selection for density estimation of stationary data under various mixing conditions", Annals of Statistics 39 (2011): 1852--1877, arxiv:0911.1497
- Chenlei Leng, "The Residual Information Criterion, Corrected", arxiv:0711.1918
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
- Wei Liu and Yuhong Yang, "Parametric or nonparametric? A parametricness index for model selection", Annals of Statistics 39 (2011): 2074--2102
- Han Liu, Kathryn Roeder, Larry Wasserman, "Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models", arxiv:1006.3316
- Andrea Mariani, Andrea Giorgetti, Marco Chiani, "Model Order Selection Based on Information Theoretic Criteria: Design of the Penalty", IEEE Transactions on Signal Processing 63 (2015): 2779--2789, arxiv:1910.03980
- Abraham Meidan and Boris Levin, "Choosing from Competing Theories in Computerised Learning", Minds and Machines 12 (2002): 119--129
- Grayham E. Mizon and Massimiliano Marcellino (eds.), Progressive Modelling: Non-nested Testing and Encompassing
- Ali Mohammad-Djafari, "Model selection for inverse problems: Best choice of basis functions and model order selection," physics/0111020
- Jose Luis Montiel Olea, Pietro Ortoleva, Mallesh M. Pai, Andrea Prat, "Competing Models", arxiv:1907.03809 [I'm a bit puzzled by the abstract, since it's trivial that for any Bayesian agent, $\Pr_{\text{me}}(\text{My model is correct} \mid \text{Data}) = 1$, regardless of the data. Every Bayesian agent begins and remains 100% confident that the truth is in the support of their prior.]
- Samuel Mueller and A. H. Welsh, "Robust model selection in generalized linear models", arxiv:0711.2349
- Danielle J. Navarro, "Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection", Computational Brain and Behavior 2 (2019): 28--34
- Zacharias Psaradakis, Martin Sola, Fabio Spagnolo and Nicola Spagnolo, "Selecting nonlinear time series models using information criteria", Journal of Time Series Analysis 30 (2009): 369--394
- Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty, "High-dimensional Ising model selection using $\ell_1$-regularized logistic regression", Annals of Statistics 38 (2010): 1287--1319, arxiv:0804.4202
- Jeremy Sabourin, William Valdar, Andrew Nobel, "A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection", arxiv:1404.2007
- Aris Spanos
- "Statistical Induction, Severe Testing, and Model Validation" [Preprint]
- "Statistical Model Specification vs. Model Selection: Akaike-type Criteria and the Reliability of Inference" [preprint kindly provided by Prof. Spanos]
- Tina Toni and Michael P. H. Stumpf
- "Parameter Inference and Model Selection in Signaling Pathway Models", arxiv:0905.4468
- "Simulation-based model selection for dynamical systems in systems and population biology", arxiv:0911.1705
- Masayuki Uchida and Nakahiro Yoshida, "Information Criteria in Model Selection for Mixing Processes", Statistical Inference for Stochastic Processes 4 (2001): 73--98
- Samuel Vaiter, Mohammad Golbabaee, Jalal Fadili, Gabriel Peyré, "Model Selection with Piecewise Regular Gauges", arxiv:1307.2342
- Tim van Erven, Peter Grunwald and Steven de Rooij, "Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma", arxiv:0807.1005
- Ramon van Handel, "On the minimal penalty for Markov order estimation", Probability Theory and Related Fields 150 (2011): 709--738, arxiv:0908.3666
- Geert Verbeke, Geert Molenberghs, Caroline Beunckens, "Formal and Informal Model Selection with Incomplete Data", Statistical Science 23 (2008): 201--218, arxiv:0808.3587
- Junhui Wang, "Consistent selection of the number of clusters via crossvalidation", Biometrika 97 (2010): 893--904
- Zijun Wang, "Finite Sample Performances of the Model Selection Approach in Nonparametric Model Specification for Time Series", Communications in Statistics: Theory and Methods 38 (2009): 2302--2330
- ChangJiang Xu and A. Ian McLeod, "Further asymptotic properties of the generalized information criterion", Electronic Journal of Statistics 6 (2012): 656--663
- Lan Xue, Annie Qu, Jianhui Zhou, "Consistent Model Selection for Marginal Generalized Additive Model for Correlated Data", Journal of the American Statistical Association forthcoming
- Yuhong Yang
- "Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation", Econometric Theory 23 (2007): 1--36
- Yi Yu, Yang Feng, "Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models", arxiv:1309.2068
- Yiyun Zhang, Runze Li and Chih-Ling Tsai, "Regularization Parameter Selections via Generalized Information Criterion", Journal of the American Statistical Association 105 (2010): 312--323
- Piotr Zwiernik, "An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models", Journal of Machine Learning Research 12 (2011): 3283--3310 [where "general Markov models" == binary graphical tree models where all the inner nodes of a tree represent binary hidden variables]
- Piotr Zwiernik, Jim Q. Smith, "The Dependence of Routine Bayesian Model Selection Methods on Irrelevant Alternatives", arxiv:1208.3553