Cross-Validation
28 Jun 2023 12:45
One of the most brilliantly simple and compelling ideas in all of statistics: to estimate how well your model will do on new data, take your data set and divide it into two parts at random. Fit the model to one part and then evaluate its predictions on the other; repeat over several random splits into training and testing sets, and average the results.
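A minimal sketch of that recipe, assuming NumPy, a polynomial regression as the stand-in model, and squared error as the loss. The function name `random_split_cv`, its parameters, and the toy data are all made up for illustration; nothing here is taken from the references below.

```python
import numpy as np

def random_split_cv(x, y, degree, n_splits=20, test_fraction=0.5, seed=0):
    """Estimate the out-of-sample mean squared error of a polynomial fit
    by averaging over repeated random training/testing splits."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_test = int(round(test_fraction * n))
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)              # random division of the data
        test, train = perm[:n_test], perm[n_test:]
        coefs = np.polyfit(x[train], y[train], degree)   # fit on one part
        pred = np.polyval(coefs, x[test])                # predict the other
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))              # average over the splits

# Toy data, invented for this example: a noisy sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=200)

# Used for model selection: compare candidate polynomial degrees.
cv_errors = {d: random_split_cv(x, y, d) for d in (1, 3, 5)}
best_degree = min(cv_errors, key=cv_errors.get)
```

Using the same seed for every candidate degree means all candidates are scored on the same splits, which makes the comparison between them a little less noisy.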
As a method of model selection; as (not quite the same thing) a means of estimating the generalization error of a statistical model; relations to bootstrapping. How best to cross-validate time series? Spatial models? Networks? Other kinds of structured data? Relation to "stability" in learning theory.
— I'd like to say I'm astonished at the number of people I encounter who think that cross-validation was invented by computer scientists rather than statisticians; but I know how academia works. Some of my references below are to help counter this amnesia.
- Recommended, big picture:
- Sylvain Arlot and Alain Celisse, "A survey of cross-validation procedures for model selection", Statistics Surveys 4 (2010): 40--79
- Seymour Geisser and William F. Eddy, "A Predictive Approach to Model Selection", Journal of the American Statistical Association 74 (1979): 153--160
- M. Stone, "Cross-Validatory Choice and Assessment of Statistical Predictions", Journal of the Royal Statistical Society B 36 (1974): 111--147 [JSTOR]
- Recommended, close-ups:
- Sylvain Arlot
- "V-fold cross-validation improved: V-fold penalization", arxiv:0802.0566 [Seeing cross-validation as a penalization method, and improving it accordingly by strengthening the penalty term]
- "Model selection by resampling penalization", Electronic Journal of Statistics 3 (2009): 557--624, arxiv:0906.3124
- Sylvain Arlot and Alain Celisse, "Segmentation of the mean of heteroscedastic data via cross-validation", Statistics and Computing 21 (2011): 613--632, arxiv:0902.3977 [MATLAB code]
- Sylvain Arlot and Matthieu Lerasle, "Choice of V for V-Fold Cross-Validation in Least-Squares Density Estimation", arxiv:1210.5830 [The paper formerly known as "Why V=5 is enough in V-fold cross-validation"]
- Prabir Burman, Edmond Chow and Deborah Nolan, "A cross-validatory method for dependent data", Biometrika 81 (1994): 351--358 [JSTOR]
- Patrick S. Carmack, William R. Schucany, Jeffrey S. Spence, Richard F. Gunst, Qihua Lin and Robert W. Haley, "Far Casting Cross Validation", Journal of Computational and Graphical Statistics 18 (2009): 879--893 [Leave-one-out CV, with a constant-radius window skipped around each hold-out point as well; this is designed to deal with correlations in time or in space.]
- Kehui Chen and Jing Lei, "Network Cross-Validation for Determining the Number of Communities in Network Data", arxiv:1411.1715
- Matthieu Cornec, "Concentration inequalities of the cross-validation estimator for Empirical Risk Minimiser", arxiv:1011.0096
- László Györfi, Michael Kohler, Adam Krzyzak and Harro Walk, A Distribution-Free Theory of Nonparametric Regression [chapters 7 and 8 have important results on data splitting and cross-validation]
- Darren Homrighausen and Daniel J. McDonald
- "Cross-validation is risk consistent for lasso", arxiv:1206.6128
- "Risk-consistency of cross-validation with lasso-type procedures", arxiv:1308.0810
- Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
- Guillaume Lecué and Charles Mitchell, "Oracle inequalities for cross-validation type procedures", Electronic Journal of Statistics 6 (2012): 1803--1837
- Charles Mitchell and Sara van de Geer, "General Oracle Inequalities for Model Selection", Electronic Journal of Statistics 3 (2009): 176--204 [Analyzes a data-set splitting scheme (like cross-validation with only one "fold")]
- Art B. Owen, Patrick O. Perry, "Bi-cross-validation of the SVD and the nonnegative matrix factorization", Annals of Applied Statistics 3 (2009): 564--594, arxiv:0908.2062
- Jeffrey S. Racine
- "Feasible Cross-Validatory Model Selection for General Stationary Processes", Journal of Applied Econometrics 12 (1997): 169--179 [JSTOR. This is closely related to (maybe algebraically just a special case of?) the familiar trick from splines of writing the CV criterion in terms of the hat/influence/projection matrix.]
- "Consistent cross-validatory model-selection for dependent data: hv-block cross-validation", Journal of Econometrics 99 (2000): 39--61
- Ryan J. Tibshirani and Robert Tibshirani, "A bias correction for the minimum error rate in cross-validation", Annals of Applied Statistics 3 (2009): 822--829 = arxiv:0908.2904
- Mark J. van der Laan and Sandrine Dudoit, "Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples" [PDF working paper, i.e., a 100-page tome. The first part proves that multi-fold cross-validation and the like will work for selecting the best estimator out of a finite set of estimators (provided the loss function is nicely bounded and the data are IID). The second part ingeniously turns this into a complete estimation procedure, by effectively creating a discrete sieve and then using CV to say which part of the sieve to use. This is a very cool set of results, but (1) the limitations to bounded loss functions make me nervous, and (2) the formulas appearing in the finite-sample and even asymptotic bounds are ugly. On the other hand, they have finite-sample bounds! — I wonder if the bounded-and-IID restrictions could be lifted using the techniques in Jiang's "On Uniform Deviation Bounds" (link and description under Learning Theory), or those in Dedecker et al.'s Weak Dependence.]
- Aad W. van der Vaart, Sandrine Dudoit and Mark J. van der Laan, "Oracle inequalities for multi-fold cross validation", Statistics and Decisions 24 (2006): 351--371 [Streamlined and improved versions of the key results from the van der Laan/Dudoit tome. Thanks to Prof. van der Vaart for a reprint]
- To read:
- Stephen Bates, Trevor Hastie and Robert Tibshirani, "Cross-Validation: What Does It Estimate and How Well Does It Do It?", Journal of the American Statistical Association forthcoming (2023)
- Yoshua Bengio and Yves Grandvalet, "No unbiased estimator of the variance of k-fold cross-validation", Journal of Machine Learning Research 5 (2004): 1089--1105
- P. Burman, "A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods", Biometrika 76 (1989): 503--514
- Alain Celisse, "Model selection in density estimation via cross-validation", arxiv:0811.0802
- Emily Colby and Eric Bair, "Cross-Validation for Nonlinear Mixed Effects Models", technical report 35, UNC-Chapel Hill Dept. of Biostatistics, 2013
- Matthieu Cornec, "Estimating Subbagging by cross-validation", arxiv:1011.5142
- Charanpal Dhanjal, Nicolas Baskiotis, Stéphan Clémençon and Nicolas Usunier, "An Empirical Comparison of V-fold Penalisation and Cross Validation for Model Selection in Distribution-Free Regression", arxiv:1212.1780
- Sandrine Dudoit and Mark J. van der Laan, "Asymptotics of Cross-Validated Risk Estimation in Estimator Selection and Performance Assessment", Statistical Methodology 2 (2005): 131--154 [preprint]
- Jianqing Fan, Shaojun Guo and Ning Hao, "Variance estimation using refitted cross-validation in ultrahigh dimensional regression", Journal of the Royal Statistical Society B 74 (2012): 37--65
- Cheryl J. Flynn, Clifford M. Hurvich, Jeffrey S. Simonoff, "On the Sensitivity of the Lasso to the Number of Predictor Variables", arxiv:1403.4544
- Jenny Häggström and Xavier de Luna, "Estimating Prediction Error: Cross-Validation vs. Accumulated Prediction Error", Communications in Statistics: Simulation and Computation 39 (2010): 880--898
- Satyen Kale, Ravi Kumar and Sergei Vassilvitskii, "Cross-Validation and Mean-Square Stability" [PDF preprint via Dr. Kale]
- Heeyoung Kim and Xiaoming Huo, "Asymptotic optimality of a multivariate version of the generalized cross validation in adaptive smoothing splines", Electronic Journal of Statistics 8 (2014): 159--183
- Tammo Krueger, Danny Panknin, Mikio Braun, "Fast Cross-Validation via Sequential Testing", arxiv:1206.2248
- Loic Le Gratiet and Claire Cannamela, "Kriging-based sequential design strategies using fast cross-validation techniques with extensions to multi-fidelity computer codes", arxiv:1210.6187
- Chinghway Lim, Bin Yu, "Estimation Stability with Cross Validation (ESCV)", arxiv:1303.3128
- Art B. Owen and Jingshu Wang, "Bi-Cross-Validation for Factor Analysis", Statistical Science 31 (2016): 119--139, arxiv:1503.03515
- M. Pavlic and M. J. van der Laan, "Fitting of mixtures with unspecified number of components using cross validation distance estimate", Computational Statistics and Data Analysis 41 (2003): 413--428
- Juan Diego Rodriguez, Aritz Perez and Jose Antonio Lozano, "Sensitivity analysis of k-fold cross validation in prediction error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010): 569--575
- Olga Y. Savchuk, Jeffrey D. Hart, and Simon J. Sheather, "Indirect Cross-Validation for Density Estimation", Journal of the American Statistical Association 105 (2010): 415--423
- Hui Shen, William J. Welch, and Jacqueline M. Hughes-Oliver, "Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery", Annals of Applied Statistics 5 (2011): 2668--2687
- David Shilane, Richard H. Liang and Sandrine Dudoit, "Loss-Based Estimation with Evolutionary Algorithms and Cross-Validation", UC Berkeley Biostatistics Working Paper 227 [Abstract, PDF]
- Ansgar Steland
- "Sequential Data-Adaptive Bandwidth Selection by Cross-Validation for Nonparametric Prediction", arxiv:1010.6202
- "Sequential Cross-Validated Bandwidth Selection Under Dependence and Anscombe-Type Extensions to Random Time Horizons", arxiv:1205.6741
- Junhui Wang, "Consistent selection of the number of clusters via crossvalidation", Biometrika 97 (2010): 893--904
- Xiaogang Wang and James V. Zidek, "Selecting likelihood weights by cross-validation", math.ST/0505599 = Annals of Statistics 33 (2005): 463--500
- Jerzy Wieczorek, Cole Guerin and Thomas McMahon, "K-fold cross-validation for complex sample surveys", Stat 11 (2022): e454 [Jerzy's self-exposition]
- Yuhong Yang, "Consistency of cross validation for comparing regression procedures", arxiv:0803.2963
- Xianli Zeng, Yingcun Xia, Linjun Zhang, "Double Cross Validation for the Number of Factors in Approximate Factor Models", arxiv:1907.01670
- To write:
- CRS, "Cross-validation for mixing processes" [using some notions from learning with dependent data]
- CRS + co-conspirators, "Cross-validation for networks"
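The annotation on Carmack et al. above describes leave-one-out CV in which a constant-radius window around each hold-out point is also skipped. A rough sketch of that general idea for an evenly spaced time series follows, assuming NumPy, a straight-line fit as the model, and a radius measured in index positions. The names `buffered_loo_cv`, `fit_line` and `predict_line` are hypothetical, and this should not be read as the procedure from that paper or from any of the other dependent-data references.

```python
import numpy as np

def buffered_loo_cv(x, y, fit, predict, radius):
    """Leave-one-out CV in which every point within `radius` index
    positions of the held-out point is also dropped from training."""
    n = len(y)
    idx = np.arange(n)
    sq_errors = np.empty(n)
    for i in range(n):
        train = np.abs(idx - i) > radius   # excludes i and its neighbours
        model = fit(x[train], y[train])
        sq_errors[i] = (y[i] - predict(model, x[i:i + 1])[0]) ** 2
    return float(sq_errors.mean())

def fit_line(x, y):
    return np.polyfit(x, y, 1)             # slope and intercept

def predict_line(coefs, x):
    return np.polyval(coefs, x)

# Toy series, invented for this example: linear trend plus AR(1) noise.
rng = np.random.default_rng(0)
n = 120
t = np.linspace(0.0, 1.0, n)
noise = np.zeros(n)
for k in range(1, n):
    noise[k] = 0.8 * noise[k - 1] + rng.normal(scale=0.1)
y = 2.0 * t + noise

plain_loo = buffered_loo_cv(t, y, fit_line, predict_line, radius=0)
buffered = buffered_loo_cv(t, y, fit_line, predict_line, radius=5)
```

With `radius=0` this reduces to ordinary leave-one-out. A larger radius keeps the held-out point's nearest, most strongly correlated neighbours out of the training set, at the cost of fitting to slightly less data.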