Deviation Inequalities in Probability Theory
19 Jan 2024 11:05

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]
The laws of large numbers say that averages taken over large samples converge on expectation values. But these are asymptotic statements which say nothing about what happens for samples of any particular size. A deviation inequality, by contrast, is a result which says that, for realizations of such-and-such a stochastic process, the sample value of such-and-such a functional deviates by so much from its typical value with no more than a certain probability: \[ \Prob{\left|f(X_1, X_2, \ldots X_n) - \Expect{f}\right| > h } < r(n,h,f) ~, \] where the rate function \( r \) has to be given explicitly, and may depend on the true joint distribution of the \( X_i \) (though it's more useful if it doesn't depend on that very much). (And of course one could compare to the median rather than the mean, or just look at fluctuations above the typical value rather than to either side, or get within a certain factor of the typical value rather than a certain distance, etc.) The rate should be decreasing in \( h \) and in \( n \).
An elementary example is "Markov's inequality": If \( X \) is a non-negative random variable with a finite mean, then \[ \Prob{X > h } < \frac{\Expect{X}}{h} ~. \] One can derive many other deviation inequalities from Markov's inequality by taking \( X = g(Y) \), where \( Y \) is another random variable and \( g \) is some suitable non-negative-valued function.
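As a quick numerical illustration (not part of any proof), here is a minimal Monte Carlo sanity check of Markov's inequality in Python; the exponential distribution, the sample size, and the thresholds are all arbitrary choices made for the sake of the example.

```python
# Monte Carlo sanity check of Markov's inequality, P(X > h) <= E[X]/h.
# The exponential distribution, sample size, and thresholds are arbitrary
# choices for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # non-negative, E[X] = 1

for h in (2.0, 5.0, 10.0):
    empirical = np.mean(x > h)      # estimated P(X > h)
    bound = x.mean() / h            # Markov's bound E[X]/h
    print(f"h = {h:4.1f}   P(X > h) ~ {empirical:.4f}   Markov bound = {bound:.4f}")
```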
For instance, if \( Y \) has a finite variance \( v \), then \[ \Prob{ |Y-\Expect{Y}| > h } < \frac{v}{h^2} ~. \] This is known as "Chebyshev's inequality". (Exercise: derive Chebyshev's inequality from Markov's inequality. Since Markov was in fact Chebyshev's student, it would seem that the logical order here reverses the historical one, though guessing at priority from eponyms is always hazardous.) Suppose that \( X_1, X_2, \ldots \) are random variables with a common mean \( m \) and variance \( v \), and \( Y \) is the average of the first \( n \) of these. Then Chebyshev's inequality tells us that \[ \Prob{ |Y - m| > h } < \frac{\Var{Y}}{h^2} ~. \] If the \( X_i \) are uncorrelated (e.g., independent), then \( \Var{Y} = v/n \), so the probability that the sample average differs from the expectation by \( h \) or more goes to zero as \( n \) grows, no matter how small we make \( h \). This is precisely the weak law of large numbers. If the \( X_i \) are correlated, but nonetheless \( \Var{Y} \) goes to zero as \( n \) grows (generally because correlations decay), then we get an ergodic theorem. The rate of convergence here, however, is not very good, just \( O(n^{-1}) \).
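Here is a similar simulation sketch for Chebyshev's bound on a sample mean, showing the \( O(n^{-1}) \) rate; the uniform distribution, the deviation \( h \), and the number of replications are again arbitrary illustrative assumptions.

```python
# Chebyshev's bound for the sample mean of n uncorrelated variables,
# P(|Ybar - m| > h) <= v/(n h^2), checked by simulation.
# Uniform(0,1) draws (m = 1/2, v = 1/12), the deviation h, and the number
# of replications are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
m, v, h, reps = 0.5, 1.0 / 12.0, 0.05, 20_000

for n in (10, 100, 1000):
    ybar = rng.uniform(size=(reps, n)).mean(axis=1)
    empirical = np.mean(np.abs(ybar - m) > h)   # estimated P(|Ybar - m| > h)
    bound = v / (n * h ** 2)                    # Var[Ybar]/h^2 = v/(n h^2)
    print(f"n = {n:5d}   P ~ {empirical:.4f}   Chebyshev bound = {bound:.4f}")
```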
Since \( e^u \) is a monotonically increasing function of \( u \), for any positive \( t \), \( X > h \) if and only if \( e^{tX} > e^{th} \), so applying Markov's inequality to \( e^{tX} \) gives an exponential inequality, \[ \Prob{ X > h } < e^{-th} \Expect{ e^{tX} } ~. \] Notice that the first term in the bound does not depend on the distribution of \( X \), unlike the second term, which doesn't depend on the scale of the deviation \( h \). We are in fact free to pick whichever \( t \) gives us the tightest bound. The quantity \( \Expect{ e^{tX} } \) is called the "moment generating function" of \( X \); let's abbreviate it \( M_X(t) \). It can fail to exist if some moments are infinite. (Write out the power series for \( e^u \) and take expectations term by term to see all this.) It has however the very nice property that when \( X_1 \) and \( X_2 \) are independent, \( M_{X_1+X_2}(t) = M_{X_1}(t) M_{X_2}(t) \). From this it follows that if \( Y \) is the sum of \( n \) independent and identically distributed copies of \( X \), then \( M_Y(t) = {(M_X(t))}^{n} \). Thus \[ \Prob{ Y > h } < e^{-th} (M_X(t))^{n} ~. \] If \( Z = Y/n \), the sample mean, this in turn gives \[ \Prob{ Z > h } = \Prob{ Y > hn } < e^{-thn} (M_X(t))^{n} ~. \] So we can get exponential rates of convergence for the law of large numbers from this. (Students who took the CMU statistics department's probability qualifying exam in 2010 now know who wrote problem 9.) Again, the restriction to IID random variables is not really essential: allowing dependence just means that the moment generating functions don't factor exactly, but if they almost factor then we can get results of the same form. (Often, we end up with \( n \) being replaced by \( n/n_0 \), where \( n_0 \) is something like how long it takes dependence to decay to trivial levels.)
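To see the optimization over \( t \) in action, here is a rough sketch that grid-searches \( t \) for the mean of \( n \) IID Bernoulli variables and compares the resulting bound to Chebyshev's; the Bernoulli model, the grid of \( t \) values, and the particular \( n \) and \( h \) are assumptions made purely for illustration.

```python
# Optimizing the exponential bound P(Z > h) <= e^{-thn} M_X(t)^n over t,
# for Z the mean of n IID Bernoulli(p) variables.  The Bernoulli model,
# the grid of t values, and the particular n, h are illustrative assumptions.
import numpy as np

p, n, h = 0.5, 100, 0.6
ts = np.linspace(1e-3, 10.0, 10_000)            # candidate values of t > 0

log_mgf = np.log(1.0 - p + p * np.exp(ts))      # log M_X(t) for Bernoulli(p)
log_bounds = n * (log_mgf - ts * h)             # log of e^{-thn} M_X(t)^n
exponential = np.exp(log_bounds.min())          # tightest bound on the grid

chebyshev = p * (1.0 - p) / (n * (h - p) ** 2)  # two-sided bound, so also one-sided
print(f"Exponential bound on P(Z > {h}): {exponential:.3e}")
print(f"Chebyshev bound:                 {chebyshev:.3e}")
```

For larger \( n \) the gap widens quickly, since the optimized exponential bound shrinks geometrically in \( n \) while Chebyshev's only shrinks like \( 1/n \).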
I don't feel like going into the reasoning behind the other common deviation bounds --- Bernstein, Chernoff, Hoeffding, Azuma, McDiarmid, etc. --- because I feel like I've given enough of the flavor already. I am using this notebook as, actually, a notebook, more specifically a place to collect references on deviation inequalities, especially ones that apply to collections of dependent random variables. Results here typically appeal to various notions of mixing or decay of correlations, as found in ergodic theory.
"Deviation" vs. "Concentration": A terminological note
The literature is full of references to both "deviation inequalities" and "concentration inequalities". I have never seen a rigorous definition separating the two. When I consulted four scholars more expert than I in this field (I'm just a user, not an innovator, unlike my informants), I received four distinct answers:
- "Concentration" inequalities get tighter as we consider more random variables, "deviation" inequalities do not. (So in this sense Markov's inequality is just a deviation inequality, and so is Chebyshev's, but Chebyshev's applied to means of uncorrelated random variables is a concentration inequality.)
- If we use a union bound, a "deviation" inequality becomes trivial after considering only a polynomial (in \( 1/h \)) number of events, while a "concentration" inequality can handle unions of exponentially many events. (Again, this would make Markov's inequality a "deviation" inequality, but the elaboration using the moment generating function a "concentration" inequality.)
- "Deviation" inequalities are one-sided while "concentration" inequalities are two-sided, following the usage on the first page of Massart, 2000. This would make Markov's inequality a deviation result, and Chebyshev's a concentration result.
- The terms are completely synonymous.
- See also:
- Concentration of Measure
- Cumulants, and Cumulant Generating Functions
- Ergodic Theory
- Large Deviations
- Learning Theory
- Probability
- Recommended, big picture:
- Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [Gives a very nice review of many deviation inequalities, with references.]
- Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games [Provides exemplary proofs in the appendix. Mini-review]
- Recommended, close-ups:
- Daniel Berend and Aryeh Kontorovich, "On the Convergence of the Empirical Distribution", arxiv:1205.6711
- Daniel Berend, Peter Harremoës, Aryeh Kontorovich, "Minimum KL-divergence on complements of L1 balls", arxiv:1206.6544
- G. G. Bosco, F. P. Machado and Thomas Logan Ritchie, "Exponential Rates of Convergence in the Ergodic Theorem: A Constructive Approach", Journal of Statistical Physics 139 (2010): 367--374
- J.-R. Chazottes, P. Collet, C. Kuelske and F. Redig, "Deviation inequalities via coupling for stochastic processes and random fields", math.PR/0503483 [Very cool]
- Xinjia Chen, "A New Generalization of Chebyshev Inequality for Random Vectors", arxiv:0707.0805 [A very natural bound using the inverse variance matrix (= Mahalanobis distance); see Navarro for a truly clever proof]
- Ioannis Kontoyiannis, L. A. Lastras-Montano and S. P. Meyn, "Relative Entropy and Exponential Deviation Bounds for General Markov Chains", ISIT 2005 [PDF reprint via Prof. Meyn]
- Aryeh Kontorovich and Anthony Brockwell, "A Strong Law of Large Numbers for Strongly Mixing Processes", arxiv:0807.4665
- Aryeh Kontorovich and Roi Weiss, "Uniform Chernoff and Dvoretzky-Kiefer-Wolfowitz-type inequalities for Markov chains and related processes", arxiv:1207.4678
- Pascal Massart, "About the constants in Talagrand's concentration inequalities for empirical processes", Annals of Probability 28 (2000): 863--884
- Andreas Maurer and Massimiliano Pontil, "Some Hoeffding- and Bernstein-type Concentration Inequalities", arxiv:2102.06304
- Jorge Navarro [Two wonderfully elementary proofs of Chen's multivariate Chebyshev inequality]
- "A simple proof for the multivariate Chebyshev inequality", arxiv:1305.5646
- "A very simple proof of the multivariate Chebyshev's inequality", Communications in Statistics: Theory and Methods forthcoming (2014) [preprint]
- Iosif Pinelis, "Between Chebyshev and Cantelli", arxiv:1011.6065 [Cute, with an appealing proof]
- Yongqiang Tang, "A Hoeffding-Type Inequality for Ergodic Time Series", Journal of Theoretical Probability 20 (2007): 167--176 [PDF preprint]
- Modesty forbids me to recommend:
- The lecture notes on deviation inequalities in my course on Conceptual Foundations of Statistical Learning (lectures 5--7)
- To read:
- Yannick Baraud, "A Bernstein-type inequality for suprema of random processes with an application to statistics", arxiv:0904.3295
- Olivier Catoni
- "High confidence estimates of the mean of heavy-tailed real random variables", arxiv:0909.5366
- "Challenging the empirical mean and empirical variance: a deviation study", arxiv:1009.2048
- Patrick Cattiaux and Nathael Gozlan, "Deviations bounds and conditional principles for thin sets", arxiv:math/0510257
- Patrick Cattiaux and Arnaud Guillin, "Deviation bounds for additive functionals of Markov process", math.PR/0603021 [Non-asymptotic bounds for the probability that time averages deviate from expectations with respect to the invariant measure, when the process is stationary and ergodic and the invariant measure is reasonably regular.]
- Xinjia Chen
- "A Likelihood Ratio Approach for Probabilistic Inequalities", arxiv:1308.4123
- "Concentration Inequalities from Likelihood Ratio Method", arxiv:1409.6276
- Jérôme Dedecker, Florence Merlevède, Magda Peligrad, Sergey Utev, "Moderate deviations for stationary sequences of bounded random variables", arxiv:0711.3924
- Cyrille Dubarry and Sylvain Le Corff, "Non-asymptotic deviation inequalities for smoothed additive functionals in nonlinear state-space models", Bernoulli 19 (2013): 2222--2249
- Jianqing Fan, Bai Jiang, Qiang Sun, "Hoeffding's lemma for Markov Chains and its applications to statistical learning", arxiv:1802.00211
- Xiequan Fan, Ion Grama, Quansheng Liu, "On Hoeffding's and Talagrand's inequalities", arxiv:1206.2501
- Nicolas Fournier, Arnaud Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure", arxiv:1312.2128
- David A. Freedman, "On Tail Probabilities for Martingales", Annals of Probability 3 (1975): 100--118
- Marian Grendar Jr. and Marian Grendar, "Chernoff's bound forms," math.PR/0306326
- Spencer Greenberg, Mehryar Mohri, "Tight Lower Bound on the Probability of a Binomial Exceeding its Expectation", arxiv:1306.1433
- A. Guionnet and B. Zegarlinski, Lectures on Logarithmic Sobolev Inequalities [120 pp. PDF]
- H. Hang, I. Steinwart, "A Bernstein-type Inequality for Some Mixing Processes and Dynamical Systems with an Application to Learning", arxiv:1501.03059
- Bai Jiang, Qiang Sun, Jianqing Fan, "Bernstein's inequality for general Markov chains", arxiv:1805.10721
- Emilien Joly, Gábor Lugosi, "Robust estimation of U-statistics", arxiv:1504.04580
- Vladislav Kargin, "A large deviation inequality for vector functions on finite reversible Markov Chains", Annals of Applied Probability 17 (2007): 1202--1221, arxiv:0508538
- Bahman Yari Saeed Khaloo and Gholamreza Haffari, "Novel Bernstein-like Concentration Inequalities for the Missing Mass", UAI 2015
- Carlos A. Leon and Francois Perron, "Optimal Hoeffding bounds for discrete reversible Markov chains", math.PR/0405296
- Dasha Loukianova, Oleg Loukianov, Eva Loecherbach, "Polynomial bounds in the Ergodic Theorem for positive recurrent one-dimensional diffusions and integrability of hitting times", arxiv:0903.2405 [non-asymptotic deviation bounds from bounds on moments of recurrence times]
- P. Major
- "On a multivariate version of Bernstein's inequality", math.PR/0411287
- "A multivariate generalization of Hoeffding's ineqality", math.PR/0411288
- Florence Merlevède, Magda Peligrad, Emmanuel Rio, "A Bernstein type inequality and moderate deviations for weakly dependent sequences", Probability Theory and Related Fields 151 (2011): 435--474, arxiv:0902.0582
- Blazej Miasojedow, "Hoeffding's inequalities for geometrically ergodic Markov chains on general state space", arxiv:1201.2265
- Thomas Peel, Sandrine Anthoine, Liva Ralaivola, "Empirical Bernstein Inequalities for U-Statistics", NIPS 2010
- Ohad Shamir, "A Variant of Azuma's Inequality for Martingales with Subgaussian Tails", arxiv:1110.2392
- Ted Theodosopoulos, "A Reversion of the Chernoff Bound", math.PR/0501360
- Sara van de Geer and Johannes Lederer, "The Bernstein-Orlicz norm and deviation inequalities", Probability Theory and Related Fields 157 (2013): 225--250 arxiv:1111.2450
- Lutz Warnke, "On the method of typical bounded differences", arxiv:1212.5796
- Xinxing Wu, Junping Zhang, "Distribution-dependent concentration inequalities for tighter generalization bounds", arxiv:1607.05506
- Chao Zhang, "Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence", arxiv:1309.6876
(Thanks to Michael Kalgalenko for typo-spotting)