Deviation Inequalities in Probability Theory

08 Dec 2018 11:23

The laws of large numbers say that averages taken over large samples converge on expectation values. But these are asymptotic statements which say nothing about what happens for samples of any particular size. A deviation inequality, by contrast, is a result which says that, for realizations of such-and-such a stochastic process, the sample value of this functional deviates by so much from its typical value with no more than a certain probability: \( \Pr{\left(\left|f(X_1, X_2, \ldots X_n) - \mathbb{E}[f]\right| > h \right)} < r(n,h,f) \), where the rate function \( r \) has to be given explicitly, and may depend on the true joint distribution of the \( X_i \) (though it's more useful if it doesn't depend on that very much). (And of course one could compare to the median rather than the mean, or just look at fluctuations above the typical value rather than to other side, or get within a certain factor of the typical value rather than a certain distance, etc.) The rate should be decreasing in \( h \) and in \( n \).

An elementary example is "Markov's inequality": If \( X \) is a non-negative random variable with a finite mean, then \[ \Pr{\left(X > h \right)} < \frac{\mathbb{E}[X]}{h} ~. \] One can derive many other deviation inequalities from Markov's inequality by taking \( X = g(Y) \), where \( Y \) is another random variable and \( g \) is some suitable non-negative-valued function.

For instance if \( Y \) has a finite variance \( v \), then \[ \Pr{\left(|Y-\mathbb{E}[Y]| > h \right)} < \frac{v}{h^2} ~. \] This is known as "Chebyshev's inequality". (Exercise: derive Chebyshev's inequality from Markov's inequality. Since Markov was in fact Chebyshev's student, it would seem that the logical order here reverses the historical one, though guessing at priority from eponyms is always hazardous.) Suppose that \( X_1, X_2, \ldots \) are random variables with a common mean \( m \) and variance \( v \), and \( Y \) is the average of the first \( n \) of these. Then Chebyshev's inequality tells us that \[ \Pr{\left(|Y - m| > h\right)} < \frac{\mathrm{Var}[Y]}{h^2} ~. \] If the \( X_i \) are uncorrelated (e.g., independent), then \( \mathrm{Var}[Y] = v/n \), so the probability that the sample average differs from the expectation by \( h \) or more goes to zero, no matter how small we make \( h \). This is precisely the weak law of large numbers. If the \( X_i \) are correlated, but nonetheless \( \mathrm{Var}[Y] \) goes to zero as \( n \) grows (generally because correlations decay), then we get an ergodic theorem. The rate of convergence here however is not very good, just \( O(n^{-1}) \).

Since \( e^u \) is a monotonically increasing function of \( u \), for any positive \( t \), \( X > h \) if and only if \( e^{tX} > e^{th} \), so we get an exponential inequality, \[ \Pr{\left(X > h\right)} < e^{-th} \mathbb{E}\left[e^{tX}\right] ~. \] Notice that the first term in the bound does not depend on the distribution of \( X \), unlike the second term, which doesn't depend on the scale of the deviation \( h \). We are in fact free to pick whichever \( t \) gives us the tightest bound. The quantity \( \mathbb{E}\left[e^{tX}\right] \) is called the "moment generating function" of \( X \), let's abbreviate it \( M_X(t) \), and can fail to exist if some moments are infinite. (Write the power series for \( e^u \) and take expectations term by term to see all this.) It has however the very nice property that when \( X_1 \) and \( X_2 \) are independent, \( M_{X_1+X_2}(t) = M_{X_1}(t) M_{X_2}(t) \). From this it follows that if \( Y \) is the sum of \( n \) independent and identically distributed copies of \( X \), \( M_Y(t) = {(M_X(t))}^{n} \). Thus \[ \Pr{\left(Y > h\right)} < e^{-th} (M_X(t))^{n} ~. \] If \( Z = Y/n \), the sample mean, this in turn gives \[ \Pr{\left(Z > h\right)} = \Pr{\left(Y > hn\right)} < e^{-thn} (M_X(t))^{n} ~. \] So we can get exponential rates of convergence for the law of large numbers from this. [Students who took the CMU statistics department's probability qualifying exam in 2010 now know who wrote problem 9.] Again, the restriction to IID random variables is not really essential, allowing dependence just means that the moment generating functions don't factor exactly, but if they almost factor than we can get results of the same form. (Often, we end up with \( n \) being replaced by \( n/n_0 \), where \( n_0 \) is something like how long it takes dependence to decay to trivial levels.)

I don't feel like going into the reasoning behind the other common deviation bounds — Bernstein, Chernoff, Hoeffding, Azuma, McDiarmid, etc. — because I feel like I've given enough of the flavor already. I am using this notebook as, actually, a notebook, more specifically a place to collect references on deviation inequalities, especially ones that apply to collections of dependent random variables. Results here typically appeal to various notions of mixing or decay of correlations, as found in ergodic theory.

See also: Concentration of Measure; Ergodic Theory; Large Deviations; Learning Theory; Probability

(Thanks to Michael Kalgalenko for typo-spotting)