Statistical Learning Theory with Dependent Data

Last update: 07 Jul 2025 13:47
First version: 4 February 2009

See learning theory if that title doesn't make sense. I am particularly interested in learning for ergodic time series. A lot of the work in this area involves strong mixing conditions, especially beta-mixing. I suspect that in many cases strong mixing is not actually needed, but this is a pure hunch on my part with absolutely no evidence to back it up (right now).

Terrence M. Adams and Andrew B. Nobel, "Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling", Annals of Probability 38 (2010): 1345--1367, arxiv:1010.3162 [Finite VC dimension ensures uniform convergence for arbitrary ergodic processes — but perhaps arbitrarily slow uniform convergence.]
Ron Meir, "Nonparametric Time Series Prediction Through Adaptive Model Selection," Machine Learning 39 (2000): 5--34 [PDF. Extending the "structural risk minimization" framework due to Vapnik to time series. Unfortunately Meir's approach demands knowledge of the mixing rate of the process, which we don't really know how to estimate, but this is a very encouraging first step. --- Edit, years later: see my papers with McDonald and Schervish on how to estimate the mixing rate.]
Mehryar Mohri and Afshin Rostamizadeh
- "Rademacher Complexity Bounds for Non-I.I.D. Processes", in Advances in Neural Information Processing vol. 21 (NIPS 2008) [PDF. Stationary beta-mixing processes.]
- "Stability Bound for Stationary Phi-mixing and Beta-mixing Processes", Journal of Machine Learning Research 11 (2010): 789--814, arxiv:0811.1629 ["Stability" is the property of a learning algorithm that changing a single observation in the training set leads to only small changes in predictions on the test set. This paper shows that stable learning algorithms continue to perform well with dependent data, provided the dependence decays sufficiently quickly --- the "mixing" properties of ergodic theory.]
Goran Peskir, From Uniform Laws of Large Numbers to Uniform Ergodic Theorems [PDF reprint via Prof. Peskir. Mini-review]
Ramon van Handel, "The universal Glivenko-Cantelli property", Probability Theory and Related Fields 155 (2013): 911--934, arxiv:1009.4434
Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Has a very nice discussion of when the uniform laws of large numbers of statistical learning theory transfer from the usual IID setting to dependent processes, becoming uniform ergodic theorems. (Sufficient conditions include things like beta-mixing, but necessary and sufficient conditions seem to still be unknown.) Mini-review]
Bin Yu, "Rates of Convergence for Empirical Processes of Stationary Mixing Sequences," Annals of Probability 22 (1994): 94--116

Terrence M. Adams and Andrew B. Nobel
- "The Gap Dimension and Uniform Laws of Large Numbers for Ergodic Processes", arxiv:1007.2964 [Extends the results in their "Uniform Convergence" paper (above) to get uniform convergence of real-valued functions, in terms of the gap or "fat shattering" dimension. It's the same (new) technique, of explicitly constructing arbitrarily large shatterable sets when uniform convergence fails.]
- "Uniform Approximation and Bracketing Properties of VC classes", arxiv:1007.4037 ["We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing."]
Alekh Agarwal and John C. Duchi, "The Generalization Ability of Online Algorithms for Dependent Data", arxiv:1110.2529 [If you have a learning algorithm which achieves low regret against some class of alternatives, and the data source is suitably mixing, your algorithm will generalize almost as well as the best rival.]
Pierre Alquier and Olivier Wintenberger
- "Fast rates in learning with dependent observations", arxiv:1202.4283 [Assumes observations bounded in the interval [-B,B] and gets a generalization bound which is, if I am not mistaken, O(B³), and so not suitable for my purposes, but an interesting technique.]
- "Model selection and randomization for weakly dependent time series forecasting", Bernoulli 18 (2012): 883--913, arxiv:0902.2924
Patrizia Berti, Irene Crimaldi, Luca Pratelli, and Pietro Rigo, "Rate of convergence of predictive distributions for dependent data", Bernoulli 15 (2009): 1351--1367, arxiv:1001.2152 [Only for exchangeable sequences, sadly]
Patrizia Berti and Pietro Rigo, "A Glivenko-Cantelli theorem for exchangeable random variables", Statistics and Probability Letters 32 (1997): 385--391
Wenxin Jiang, "On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality", 10 (2009): 977--996 [This is a sweet result, despite the appearance of constants like 288 in the main theorem. The term which measures the effect of dependence is one of the weak dependence coefficients of Dedecker et al. (see under ergodic theory) --- I want to say it's their first-order gamma coefficient, but I don't have the book to hand and may be mis-remembering --- applied here to the loss function.]
Aryeh Kontorovich, "Concentration in unbounded metric spaces and algorithmic stability", arxiv:1309.1007
Aryeh Kontorovich and Maxim Raginsky, "Concentration of measure without independence: a unified approach via the martingale method", arxiv:1602.0072
Matthew O. Jackson, Ehud Kalai and Rann Smorodinsky, "Bayesian Representation of Stochastic Processes Under Learning: De Finetti Revisited", Econometrica 67 (1999): 875--893 [JSTOR. See under Characterizing Mixtures of Processes by Summarizing Statistics]
Ben London, Bert Huang, Ben Taskar, Lise Getoor, "Collective Stability in Structured Prediction: Generalization from One Example", ICML 2013 [My notes]
Kevin McGoff, Andrew B. Nobel, "Empirical risk minimization and complexity of dynamical models", Annals of Statistics 48 (2020): 2031--2054, arxiv:1611.06173 [Glad to see it's not just me with delays like that between the preprint and the journal version...]
Andrew B. Nobel, "Limits to classification and regression estimation from ergodic processes", Annals of Statistics 27 (1999): 262--273
Andrew Nobel and Amir Dembo, "A Note on Uniform Laws of Averages for Dependent Processes", Statistics and Probability Letters 17 (1993): 169--172 [An extremely easy way to extend uniform laws of large numbers to uniform ergodic theorems for mixing processes. Actually I suspect that mixing is only necessary to get an explicit rate. (That was written in 2008; see Adams and Nobel's 2010 paper, above, for a very different proof that mixing is not needed for uniform ergodicity.) PDF preprint via Dr. Nobel.]
Vladimir Pestov, "Predictive PAC learnability: a paradigm for learning from exchangeable input data", arxiv:1006.1129 [The trick here is basically as follows: Every exchangeable sequence is a mixture of IID sequences (de Finetti), and which mixture component a realization is in, is an invariant. So one demands only that the function f be well-estimated according to the component the sample is in, rather than over all of the mixture. If one could condition on the mixture component, one would lose no precision/accuracy compared to the IID situation; conditioning on $ X_1, X_2, \ldots X_n $ leaves some uncertainty about which component we're in, hence the cost. All of this should, I'd think, carry over straightforwardly to stationary mixing sequences, since (von Neumann) every stationary sequence is a mixture of ergodic sequences, and ergodic components are invariants of the motion. Perhaps a short paper? — Update, 3 years later: the short paper is Shalizi and Kantorovich, below.]
Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
Daniil Ryabko, "Pattern Recognition for Conditionally Indpendent Data", cs.LG/0507040 [It's a bit of an odd form of dependence: the sequence of labels can show essentially any form of dependence you like, but the features of the nth object have to be independent of everything else, given the label of the nth object. (Hence "conditionally independent data".) This is like some kinds of inside-out hidden Markov model...]
Sara van de Geer, Empirical Processes in M-Estimation
Alexander Zimin and Christoph H. Lampert, "Conditional Risk Minimization for Stochastic Processes", arxiv:1510.02706

Daniel J. McDonald, CRS and Mark Schervish
- "Estimating $\beta$-mixing coefficients", AISTATS 2011, arxiv:1103.0941
- "Generalization error bounds for stationary autoregressive models", arxiv:1103.0942
- "Risk bounds for time series without strong mixing", arxiv:1106.0730
- "Estimating Beta-Mixing Coefficients via Histograms", arxiv:1109.5998 [The extended journal version of the AISTATS 2011 paper]
- "Nonparametric Risk Bounds for Time-Series Forecasting", Journal of Machine Learning Research 18:32 (2017): 1--40, arxiv:1212.0463
CRS and Aryeh Kontorovich, "Predictive PAC Learning and Process Decompositions", NIPS 2013, arxiv:1309.4859

Daniel J. McDonald, Generalization Error Bounds for Time Series [Ph.D. thesis, Carnegie Mellon University, 2012]

Terrence M. Adams and Andrew B. Nobel
- "Uniform Approximation of Vapnik-Chervonenkis Classes", Bernoulli 18 (2012): 1310--1319, arxiv:1010.4515
- "A counterexample concerning the extension of uniform strong laws to ergodic processes", pp. 1--4 in M. Banerjee, F. Bunea, J. Huang, V. Koltchinskii and M. H. Maathuis (eds.), From Probability to Statistics and Back: High-Dimensional Models and Processes --- A Festschrift in Honor of Jon A. Wellner
Pierre Alquier and Xiaoyin Li, "Prediction of quantiles by statistical learning and application to GDP forecasting", arxiv:1202.4294
D. W. K. Andrews, "Consistency in nonlinear econometric models: A generic uniform law of large numbers", Econometrica 55 (1987): 1465--1471
Anuradha M. Annaswamy, Alexander L. Fradkov, "A Historical Perspective of Adaptive Control and Learning", arxiv:2108.11336
David Barrera, Emmanuel Gobet, "Generalization bounds for nonparametric regression with $ \beta $−mixing samples", arxiv:2108.00997
Jose Bento, Morteza Ibrahimi and Andrea Montanari
- "Learning Networks of Stochastic Differential Equations", NIPS 23 (2010), arxiv:1011.0415
- "Information Theoretic Limits on Learning Stochastic Differential Equations", arxiv:1103.1689
Adam Block, Alexander Rakhlin, Abhishek Shetty, "On the Performance of Empirical Risk Minimization with Smoothed Data", arxiv:2402.14987
Graiela Boente and Ricardo Fraiman, "Robust Nonparametric Regression Estimation for Dependent Observations", Annals of Statistics 17 (1989): 1242--1256 [Autoregressions for phi and alpha mixing processes]
Olivier Bousquet, Yegor Klochkov, Nikita Zhivotovskiy, "Sharper bounds for uniformly stable algorithms", arxiv:1910.07833
Xiaohong Chen and Xiaotong Shen, "Extremum Estimates for Weakly Dependent Data", Econometrica 66 (1998): 289--314 [JSTOR]
Xiaohong Chen and Halbert White, "Improved Rates and Asymptotic Normality for Nonparametric Neural Network Estimators", IEEE Transactions on Information Theory 45 (1999): 682--691
Shay B. Cohen and Noah A. Smith, "Empirical Risk Minimization with Approximations of Probabilistic Grammars", NIPS 23 (2010) [PDF]
Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
Herold Dehling (ed.), Empirical Process Techniques for Dependent Data
Francois Denis, Mattias Gybels, Amaury Habrard, "Dimension-free Concentration Bounds on Hankel Matrices for Spectral Learning", arxiv:1312.6282
Ashwin De Silva, Rahul Ramesh, Rubing Yang, Siyu Yu, Joshua T Vogelstein, Pratik Chaudhari, "Prospective Learning: Learning for a Dynamic Future", arxiv:2411.00109
Hanyuan Hang, Yunlong Feng, Ingo Steinwart and Johan A. K. Suykens, "Learning Theory Estimates with Observations from General Stationary Stochastic Processes", Neural Computation 28 (2016): 2853--2889
H. Hang, I. Steinwart, "A Bernstein-type Inequality for Some Mixing Processes and Dynamical Systems with an Application to Learning", arxiv:1501.03059
Peng He and Changshui Zhang, "Non-Asymptotic Analysis of Relational Learning with One Network", pp. 320--327 in AISTATS 2014
Purushottam Kar, Bharath K Sriperumbudur, Prateek Jain, Harish C Karnick, "On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions", arxiv:1305.2505
A. C. Lozano, S. R. Kulkarni and R. E. Schapire, "Convergence and Consistency of Regularized Boosting With Weakly Dependent Observations", IEEE Transactions on Information Theory 60 (2014): 651--660
Kevin McGoff, Andrew B. Nobel, "Variational analysis of inference from dynamical systems", arxiv:1601.05033
Blakeley B. McShane, Shane T. Jensen, Allan I. Pack and Abraham J. Wyner, "Statistical Learning With Time Series Dependence: An Application to Scoring Sleep in Mice", Journal of the American Statistical Association 108 (2013): 1147--1162
Andrew B. Nobel and Terrence M. Adams, "Estimating a function from ergodic samples with additive noise", IEEE Transactions on Information Theory 47 (2001): 2895--2902
Thomas Peel, Sandrine Anthoine, Liva Ralaivola, "Empirical Bernstein Inequalities for U-Statistics", NIPS 2010
Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary \beta-Mixing Processes", Journal of Machine Learning Research 11 (2010): 1927--1956, arxiv:0909.1933
Daniil Ryabko
- "Characterizing predictable classes of processes", arxiv:0905.4341
- "A criterion for hypothesis testing for stationary processes", arxiv:0905.4937
Yevgeny Seldin, Fran&ccdeil;ois Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, Peter Auer, "PAC-Bayesian Inequalities for Martingales", arxiv:1110.6886
Ingo Steinwart, Marian Anghel, "Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise", Annals of Statistics 37 (2009): 841--875, arxiv:0707.0322
Ingo Steinwart, Don Hush and Clint Scovel, "Learning from dependent observations", arxiv:07078.0303
Yuyi Wang, Jan Ramon, Zheng-Chu Guo, "Learning from networked examples", arxiv:1405.2600
Huan Xu and Shie Mannor, "Robustness and Generalization", arxiv:1005.2243 [Includes results for Doeblin Markov sources]
Henryk Zühle, "Qualitative robustness of statistical functionals under strong mixing", arxiv:1203.5245
Chao Zhang and Dacheng Tao, "Risk Bounds for Levy Processes in the PAC-Learning Framework", Journal of Machine Learning Research Proceedings 9 (2010): 948--955
Alexander Zimin and Christoph Lampert, "MACRO: A Meta-Algorithm for Conditional Risk Minimization", arxiv:1801.00507
Bin Zou, Luoqing Li and Zongben Xu, "The generalization performance of ERM algorithm with strongly mixing observations", Machine Learning 75 (2009): 275--295

Co-conspirators + CRS, "Risk Bounds with Finite-Dimensional Dependence Coefficients"
Co-conspirators + CRS, "Rademacher Complexity Bounds for Predictive Risk"
Co-conspirators + CRS, "Your Favorite DSGE Sucks" [working title]