Notebooks

## Statistical Learning Theory with Dependent Data

26 Aug 2015 20:26

See learning theory if that title doesn't make sense. I am particularly interested in learning for ergodic time series. A lot of the work in this area involves strong mixing conditions, especially beta-mixing. I suspect that in many cases strong mixing is not actually needed, but this is a pure hunch on my part with absolutely no evidence to back it up (right now).

Recommended:
• Terrence M. Adams and Andrew B. Nobel
"Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling", Annals of Probability 38 (2010): 1345--1367, arxiv:1010.3162 [Finite VC dimension ensures uniform convergence for arbitrary ergodic processes — but perhaps arbitrarily slow uniform convergence.]
• "The Gap Dimension and Uniform Laws of Large Numbers for Ergodic Processes", arxiv:1007.2964 [Extends the results in the paper above to get uniform convergence of real-valued functions, in terms of the gap or "fat shattering" dimension. It's the same (new) technique, of explicitly constructing arbitrarily large shatterable sets when uniform convergence fails.]
• "Uniform Approximation and Bracketing Properties of VC classes", arxiv:1007.4037 ["We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing."]
• Alekh Agarwal and John C. Duchi, "The Generalization Ability of Online Algorithms for Dependent Data", arxiv:1110.2529 [If you have a learning algorithm which achieves low regret against some class of alternatives, and the data source is suitably mixing, your algorithm will generalize almost as well as the best rival.]
• Pierre Alquier and Olivier Wintenberger
• "Fast rates in learning with dependent observations", arxiv:1202.4283 [Assumes observations bounded in the interval [-B,B] and gets a generalization bound which is, if I am not mistaken, O(B3), and so not suitable for my purposes, but an interesting technique.]
• "Model selection and randomization for weakly dependent time series forecasting", Bernoulli 18 (2012): 883--913, arxiv:0902.2924
• Patrizia Berti, Irene Crimaldi, Luca Pratelli, and Pietro Rigo, "Rate of convergence of predictive distributions for dependent data", Bernoulli 15 (2009): 1351--1367, arxiv:1001.2152 [Only for exchangeable sequences, sadly]
• Patrizia Berti and Pietro Rigo, "A Glivenko-Cantelli theorem for exchangeable random variables", Statistics and Probability Letters 32 (1997): 385--391
• Wenxin Jiang, "On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality", 10 (2009): 977--996 [This is a sweet result, despite the appearance of constants like 288 in the main theorem. The term which measures the effect of dependence is one of the weak dependence coefficients of Dedecker et al. (see under ergodic theory) --- I want to say it's their first-order gamma coefficient, but I don't have the book to hand and may be mis-remembering --- applied here to the loss function.]
• Aryeh Kontorovich, "Concentration in unbounded metric spaces and algorithmic stability", arxiv:1309.1007
• Matthew O. Jackson, Ehud Kalai and Rann Smorodinsky, "Bayesian Representation of Stochastic Processes Under Learning: De Finetti Revisited", Econometrica 67 (1999): 875--893 [JSTOR]
• Ben London, Bert Huang, Ben Taskar, Lise Getoor, "Collective Stability in Structured Prediction: Generalization from One Example", ICML 2013 [My notes]
• Ron Meir, "Nonparametric Time Series Prediction Through Adaptive Model Selection," Machine Learning 39 (2000): 5--34 [PDF. Extending the "structural risk minimization" framework due to Vapnik to time series. Unfortunately Meir's approach demands knowledge of the mixing rate of the process, which we don't really know how to estimate, but this is a very encouraging first step.]
• Mehryar Mohri and Afshin Rostamizadeh
• "Rademacher Complexity Bounds for Non-I.I.D. Processes", in Advances in Neural Information Processing vol. 21 (NIPS 2008) [PDF. Stationary beta-mixing processes.]
• "Stability Bound for Stationary Phi-mixing and Beta-mixing Processes", Journal of Machine Learning Research 11 (2010): 789--814, arxiv:0811.1629 ["Stability" is the property of a learning algorithm that changing a single observation in the training set leads to only small changes in predictions on the test set. This paper shows that stable learning algorithms continue to perform well with dependent data, provided the dependence decays sufficiently quickly --- the "mixing" properties of ergodic theory.]
• Andrew Nobel and Amir Dembo, "A Note on Uniform Laws of Averages for Dependent Processes", Statistics and Probability Letters 17 (1993): 169--172 [An extremely easy way to extend uniform laws of large numbers to uniform ergodic theorems for mixing processes. Actually I suspect that mixing is only necessary to get an explicit rate. (That was written in 2008; see Adams and Nobel's 2010 paper, above, for a very different proof that mixing is not needed for uniform ergodicity.) PDF preprint via Dr. Nobel.]
• Vladimir Pestov, "Predictive PAC learnability: a paradigm for learning from exchangeable input data", arxiv:1006.1129 [The trick here is basically as follows: Every exchangeable sequence is a mixture of IID sequences (de Finetti), and which mixture component a realization is in, is an invariant. So one demands only that the function f be well-estimated according to the component the sample is in, rather than over all of the mixture. If one could condition on the mixture component, one would lose no precision/accuracy compared to the IID situation; conditioning on $X_1, X_2, \ldots X_n$ leaves some uncertainty about which component we're in, hence the cost. All of this should, I'd think, carry over straightforwardly to stationary mixing sequences, since (von Neumann) every stationary sequence is a mixture of ergodic sequences, and ergodic components are invariants of the motion. Perhaps a short paper? — Update, 3 years later: the short paper is Shalizi and Kantorovich, below.]
• Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
• Daniil Ryabko, "Pattern Recognition for Conditionally Indpendent Data", cs.LG/0507040 [It's a bit of an odd form of dependence: the sequence of labels can show essentially any form of dependence you like, but the features of the nth object have to be independent of everything else, given the label of the nth object. (Hence "conditionally independent data".) This is like some kinds of inside-out hidden Markov model...]
• Sara van de Geer, Empirical Processes in M-Estimation
• Ramon van Handel, "The universal Glivenko-Cantelli property", Probability Theory and Related Fields 155 (2013): 911--934, arxiv:1009.4434
• Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Has a very nice discussion of when the uniform laws of large numbers of statistical learning theory transfer from the usual IID setting to dependent processes, becoming uniform ergodic theorems. (Sufficient conditions include things like beta-mixing, but necessary and sufficient conditions seem to still be unknown.) Mini-review]
• Bin Yu, "Rates of Convergence for Empirical Processes of Stationary Mixing Sequences," Annals of Probability 22 (1994): 94--116
Modesty forbids me to recommend:
• Daniel J. McDonald, CRS and Mark Schervish
• "Estimating $\beta$-mixing coefficients", AISTATS 2011, arxiv:1103.0941
• "Generalization error bounds for stationary autoregressive models", arxiv:1103.0942
• "Risk bounds for time series without strong mixing", arxiv:1106.0730
• "Estimating Beta-Mixing Coefficients via Histograms", arxiv:1109.5998 [The extended journal version of the AISTATS 2011 paper]
• "Time series forecasting: model evaluation and selection using nonparametric risk bounds", arxiv:1212.0463
• CRS and Aryeh Kontorovich, "Predictive PAC Learning and Process Decompositions", forthcoming, NIPS 2013, arxiv:1309.4859