## Statistical Learning Theory with Dependent Data

*07 Aug 2019 10:59*

See learning theory if that title doesn't
make sense. I am particularly interested in learning
for ergodic time series. A lot of the work
in this area involves strong mixing conditions, especially beta-mixing. I
suspect that in many cases strong mixing is *not* actually needed, but
this is a pure hunch on my part with absolutely no evidence to back it up
(right now).

See also: Empirical Process Theory; Ergodic Theory; Learning Theory; Relational Learning; Statistics; Time Series

- Recommended, big picture:
- Terrence M. Adams and Andrew B. Nobel, "Uniform convergence of
Vapnik-Chervonenkis classes under ergodic
sampling", Annals
of Probability
**38**(2010): 1345--1367, arxiv:1010.3162 [Finite VC dimension ensures uniform convergence for arbitrary ergodic processes — but perhaps arbitrarily slow uniform convergence.] - Ron Meir, "Nonparametric Time Series Prediction Through Adaptive
Model Selection," Machine Learning
**39**(2000): 5--34 [PDF. Extending the "structural risk minimization" framework due to Vapnik to time series. Unfortunately Meir's approach demands knowledge of the mixing rate of the process, which we don't really know how to estimate, but this is a very encouraging first step.] - Mehryar Mohri and Afshin Rostamizadeh
- "Rademacher Complexity Bounds for Non-I.I.D. Processes", in Advances in Neural Information Processing vol. 21 (NIPS 2008) [PDF. Stationary beta-mixing processes.]
- "Stability Bound for Stationary Phi-mixing and Beta-mixing
Processes", Journal
of Machine Learning Research
**11**(2010): 789--814, arxiv:0811.1629 ["Stability" is the property of a learning algorithm that changing a single observation in the training set leads to only small changes in predictions on the test set. This paper shows that stable learning algorithms continue to perform well with dependent data, provided the dependence decays sufficiently quickly --- the "mixing" properties of ergodic theory.]

- Goran Peskir, From Uniform Laws of Large Numbers to Uniform Ergodic Theorems [PDF reprint via Prof. Peskir]
- Ramon van Handel, "The universal Glivenko-Cantelli property",
Probability Theory and Related Fields
**155**(2013): 911--934, arxiv:1009.4434 - Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Has a very nice discussion of when the uniform laws of large numbers of statistical learning theory transfer from the usual IID setting to dependent processes, becoming uniform ergodic theorems. (Sufficient conditions include things like beta-mixing, but necessary and sufficient conditions seem to still be unknown.) Mini-review]
- Bin Yu, "Rates of Convergence for Empirical Processes of Stationary
Mixing Sequences," Annals of Probability
**22**(1994): 94--116

- Recommended, close-ups:
- Terrence M. Adams and Andrew B. Nobel
- "The Gap Dimension and Uniform Laws of Large Numbers for Ergodic Processes", arxiv:1007.2964 [Extends the results in their "Uniform Convergence" paper (above) to get uniform convergence of real-valued functions, in terms of the gap or "fat shattering" dimension. It's the same (new) technique, of explicitly constructing arbitrarily large shatterable sets when uniform convergence fails.]
- "Uniform Approximation and Bracketing Properties of VC classes", arxiv:1007.4037 ["We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing."]

- Alekh Agarwal and John C. Duchi, "The Generalization Ability of Online Algorithms for Dependent Data", arxiv:1110.2529 [If you have a learning algorithm which achieves low regret against some class of alternatives, and the data source is suitably mixing, your algorithm will generalize almost as well as the best rival.]
- Pierre Alquier and Olivier Wintenberger
- "Fast rates in learning with dependent observations",
arxiv:1202.4283 [Assumes
observations bounded in the interval [-B,B] and gets a generalization bound
which is, if I am not mistaken, O(B
^{3}), and so not suitable for my purposes, but an interesting technique.] - "Model selection and randomization for weakly dependent
time series
forecasting", Bernoulli
**18**(2012): 883--913, arxiv:0902.2924

- "Fast rates in learning with dependent observations",
arxiv:1202.4283 [Assumes
observations bounded in the interval [-B,B] and gets a generalization bound
which is, if I am not mistaken, O(B
- Patrizia Berti, Irene Crimaldi, Luca Pratelli, and Pietro Rigo,
"Rate of convergence of predictive distributions for dependent data",
Bernoulli
**15**(2009): 1351--1367, arxiv:1001.2152 [Only for exchangeable sequences, sadly] - Patrizia Berti and Pietro Rigo, "A Glivenko-Cantelli theorem for
exchangeable random variables", Statistics and Probability
Letters
**32**(1997): 385--391 - Wenxin Jiang, "On Uniform Deviations of General Empirical Risks
with Unboundedness, Dependence, and High
Dimensionality",
**10**(2009): 977--996 [This is a sweet result, despite the appearance of constants like 288 in the main theorem. The term which measures the effect of dependence is one of the weak dependence coefficients of Dedecker et al. (see under ergodic theory) --- I want to say it's their first-order gamma coefficient, but I don't have the book to hand and may be mis-remembering --- applied here to the loss function.] - Aryeh Kontorovich, "Concentration in unbounded metric spaces and algorithmic stability", arxiv:1309.1007
- Matthew O. Jackson, Ehud Kalai and Rann Smorodinsky, "Bayesian Representation of Stochastic Processes Under Learning: De Finetti Revisited",
Econometrica
**67**(1999): 875--893 [JSTOR. See under Characterizing Mixtures of Processes by Summarizing Statistics] - Ben London, Bert Huang, Ben Taskar, Lise Getoor, "Collective Stability in Structured Prediction: Generalization from One Example", ICML 2013 [My notes]
- Andrew Nobel and Amir Dembo, "A Note on Uniform Laws of Averages for Dependent Processes", Statistics and Probability Letters
**17**(1993): 169--172 [An extremely easy way to extend uniform laws of large numbers to uniform ergodic theorems for mixing processes. Actually I suspect that mixing is only necessary to get an explicit rate. (That was written in 2008; see Adams and Nobel's 2010 paper, above, for a very different proof that mixing is not needed for uniform ergodicity.) PDF preprint via Dr. Nobel.] - Vladimir Pestov, "Predictive PAC learnability: a paradigm for
learning from exchangeable input
data", arxiv:1006.1129 [The trick
here is basically as follows: Every exchangeable sequence is a mixture of IID
sequences (de Finetti), and which mixture component a realization is in, is an
invariant. So one demands only that the function
*f*be well-estimated according to the component the sample is in, rather than over all of the mixture. If one could condition on the mixture component, one would lose no precision/accuracy compared to the IID situation; conditioning on \( X_1, X_2, \ldots X_n \) leaves some uncertainty about which component we're in, hence the cost. All of this should, I'd think, carry over straightforwardly to stationary mixing sequences, since (von Neumann) every stationary sequence is a mixture of ergodic sequences, and ergodic components are invariants of the motion. Perhaps a short paper? — Update, 3 years later: the short paper is Shalizi and Kantorovich, below.] - Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
- Daniil Ryabko, "Pattern Recognition for Conditionally Indpendent Data", cs.LG/0507040 [It's a bit of an odd form of dependence: the sequence of labels can show essentially any form of dependence you like, but the features of the nth object have to be independent of everything else, given the label of the nth object. (Hence "conditionally independent data".) This is like some kinds of inside-out hidden Markov model...]
- Sara van de Geer, Empirical Processes in M-Estimation

- Modesty forbids me to recommend:
- Daniel J. McDonald, CRS and Mark Schervish
- "Estimating $\beta$-mixing coefficients", AISTATS 2011, arxiv:1103.0941
- "Generalization error bounds for stationary autoregressive models", arxiv:1103.0942
- "Risk bounds for time series without strong mixing", arxiv:1106.0730
- "Estimating Beta-Mixing Coefficients via Histograms", arxiv:1109.5998 [The extended journal version of the AISTATS 2011 paper]
- "Nonparametric Risk Bounds for Time-Series Forecasting",
Journal of Machine
Learning Research
**18:32**(2017): 1--40, arxiv:1212.0463

- CRS and Aryeh Kontorovich, "Predictive PAC Learning and Process Decompositions", NIPS 2013, arxiv:1309.4859

- Pride compels me to recommend:
- Daniel J. McDonald, Generalization Error Bounds for Time Series [Ph.D. thesis, Carnegie Mellon University, 2012]

- To read:
- Terrence M. Adams and Andrew B. Nobel
- "Uniform Approximation of
Vapnik-Chervonenkis Classes", Bernoulli
**18**(2012): 1310--1319, arxiv:1010.4515 - "A counterexample concerning the extension of uniform strong laws to ergodic processes", pp. 1--4 in M. Banerjee, F. Bunea, J. Huang, V. Koltchinskii and M. H. Maathuis (eds.), From Probability to Statistics and Back: High-Dimensional Models and Processes --- A Festschrift in Honor of Jon A. Wellner

- "Uniform Approximation of
Vapnik-Chervonenkis Classes", Bernoulli
- Pierre Alquier and Xiaoyin Li, "Prediction of quantiles by statistical learning and application to GDP forecasting", arxiv:1202.4294
- D. W. K. Andrews, "Consistency in nonlinear econometric models: A generic uniform law of large numbers", Econometrica
**55**(1987): 1465--1471 - Jose Bento, Morteza Ibrahimi and Andrea Montanari
- "Learning Networks of Stochastic Differential Equations", NIPS 23 (2010), arxiv:1011.0415
- "Information Theoretic Limits on Learning Stochastic Differential Equations", arxiv:1103.1689

- Graiela Boente and Ricardo Fraiman, "Robust Nonparametric
Regression Estimation for Dependent Observations",
Annals of Statistics
**17**(1989): 1242--1256 [Autoregressions for phi and alpha mixing processes] - Shay B. Cohen and Noah A. Smith, "Empirical Risk Minimization with Approximations of Probabilistic Grammars", NIPS 23 (2010) [PDF]
- Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters
**108**(2012): 120503, arxiv:1005.0005 - Herold Dehling (ed.), Empirical Process Techniques for Dependent Data
- Francois Denis, Mattias Gybels, Amaury Habrard, "Dimension-free Concentration Bounds on Hankel Matrices for Spectral Learning", arxiv:1312.6282
- Hanyuan Hang, Yunlong Feng, Ingo Steinwart and Johan A. K. Suykens, "Learning Theory Estimates with Observations from General Stationary Stochastic Processes", Neural Computation
**28**(2016): 2853--2889 - H. Hang, I. Steinwart, "A Bernstein-type Inequality for Some Mixing Processes and Dynamical Systems with an Application to Learning", arxiv:1501.03059
- Peng He and Changshui Zhang, "Non-Asymptotic Analysis of Relational Learning with One Network", pp. 320--327 in AISTATS 2014
- Purushottam Kar, Bharath K Sriperumbudur, Prateek Jain, Harish C Karnick, "On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions", arxiv:1305.2505
- A. C. Lozano, S. R. Kulkarni and R. E. Schapire, "Convergence and Consistency of Regularized Boosting With Weakly Dependent Observations",
IEEE Transactions on Information Theory
**60**(2014): 651--660 - Blakeley B. McShane, Shane T. Jensen, Allan I. Pack and Abraham J. Wyner, "Statistical Learning With Time Series Dependence: An Application to Scoring Sleep in Mice", Journal of the American Statistical Association
**108**(2013): 1147--1162 - Andrew B. Nobel, "Limits to classification and regression estimation from ergodic processes",
Annals of Statistics
**27**(1999): 262--273 - Andrew B. Nobel and Terrence M. Adams, "Estimating a function from ergodic samples with additive noise", IEEE Transactions on Information Theory
**47**(2001): 2895--2902 - Thomas Peel, Sandrine Anthoine, Liva Ralaivola, "Empirical Bernstein Inequalities for U-Statistics", NIPS 2010
- Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary \beta-Mixing Processes", Journal of Machine Learning Research
**11**(2010): 1927--1956, arxiv:0909.1933 - Daniil Ryabko
- "Characterizing predictable classes of processes", arxiv:0905.4341
- "A criterion for hypothesis testing for stationary processes", arxiv:0905.4937

- Yevgeny Seldin, Fran&ccdeil;ois Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, Peter Auer, "PAC-Bayesian Inequalities for Martingales", arxiv:1110.6886
- Ingo Steinwart, Marian Anghel, "Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise", Annals of Statistics
**37**(2009): 841--875, arxiv:0707.0322 - Ingo Steinwart, Don Hush and Clint Scovel, "Learning from dependent observations", arxiv:07078.0303
- Yuyi Wang, Jan Ramon, Zheng-Chu Guo, "Learning from networked examples", arxiv:1405.2600
- Huan Xu and Shie Mannor, "Robustness and Generalization", arxiv:1005.2243 [Includes results for Doeblin Markov sources]
- Henryk Zühle, "Qualitative robustness of statistical functionals under strong mixing", arxiv:1203.5245
- Chao Zhang and Dacheng Tao, "Risk Bounds for Levy Processes in the PAC-Learning Framework", Journal of Machine Learning Research Proceedings
**9**(2010): 948--955 - Bin Zou, Luoqing Li and Zongben Xu, "The generalization performance
of ERM algorithm with strongly mixing
observations", Machine
Learning
**75**(2009): 275--295

- To write:
- Co-conspirators + CRS, "Risk Bounds with Finite-Dimensional Dependence Coefficients"
- Co-conspirators + CRS, "Rademacher Complexity Bounds for Predictive Risk"
- Co-conspirators + CRS, "Your Favorite DSGE Sucks" [working title]