Central Limit Theorem(s)
Last update: 08 Dec 2024; First version: 21 October 2024
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]
At some point, I want to collect all the ways of saying why we should expect Gaussian fluctuations in the large-$n$ limit when everything is nice, and why we should expect other limiting distributions in other cases... A partial, off-the-top-of-my-head list:
- The cumulant-generating-function argument: cumulant generating functions add for independent random variables, so the CGF of an (appropriately scaled and centered) average approaches the CGF of a standard Gaussian. (See below.) There are variants of this argument for moment generating functions and for "characteristic functions" [i.e. Fourier transforms]. (True, and can be made rigorous in the characteristic-function version, but I find it unilluminating somehow.)
- The maximum entropy argument: The Gaussian is the maximum-Shannon-entropy distribution with given mean and variance. (So what?)
- Decompose an arbitrary distribution into a part that's invariant under averaging and a part that isn't. The latter component goes away under repeated averaging, so we need to converge to a distribution that's invariant under averaging, such as the Gaussian... (Arguably the same as the generating-function argument, at least in the characteristic-function/Fourier-transform version?)
- The renormalization group, somehow; I understood that argument in the 2000s. (Maybe identical to some or all of the above?)
- The large-deviations argument: Suppose \( X_n \) obeys a large deviations principle, so \( \lim_{n\rightarrow\infty}{n^{-1} \log{\Prob{X_n = x}}} = -J(x) \) for some non-negative rate function $J$. (That's an inexact statement of the LDP, and you can follow the link for more precise ones.) Turned around, \( \Prob{X_n = x} \approx e^{-n J(x) } \). Now suppose/hope that the rate function has a very nice minimum at $x^*$, where $J(x^*) = 0$ and the first derivatives are zero but the second derivatives aren't. Taylor-expand around $x^*$ to get \( \Prob{X_n=x} \approx e^{-\frac{n}{2}(x-x^*)^2 J^{\prime\prime}(x^*)} \), which is the form of a Gaussian centered at $x^*$ with variance $1/(n J^{\prime\prime}(x^*))$. (When $x$ is a vector rather than a scalar, you can make the appropriate modifications yourself.) So if there's an LDP, and it has a nice minimum, we should expect locally-Gaussian fluctuations. (Here "locally" means "where doing a Taylor expansion inside an exponent is only sloppy and not positively criminal.") A worked Bernoulli example follows this list.
- In equilibrium statistical mechanics, apply Einstein's fluctuation relation, that the probability of a macrostate $m$ is $\propto e^{S(m)}$, $S$ being the (unitless) Boltzmann entropy. Now Taylor-expand around the maximum-entropy macrostate, i.e., the equilibrium macrostate. (Cf.) This is basically the same as the large-deviations argument, but tells us we shouldn't expect Gaussian fluctuations if the entropy is discontinuous or not differentiable (i.e., at a phase transition).
- Le Cam's "local asymptotic normality" (which I don't understand well enough to caricature, so I need to study more).
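To make the large-deviations argument concrete, here is the worked Bernoulli case (the standard textbook example, nothing special to this note). For $A_n$ the average of $n$ IID Bernoulli($p$) variables, Cramér's theorem gives the rate function \[ J(x) = x \log{\frac{x}{p}} + (1-x)\log{\frac{1-x}{1-p}} \] which satisfies $J(p) = 0$, $J^{\prime}(p) = 0$, and $J^{\prime\prime}(x) = \frac{1}{x(1-x)}$, so $J^{\prime\prime}(p) = \frac{1}{p(1-p)}$. The Taylor expansion then gives \[ \Prob{A_n = x} \approx e^{-\frac{n (x-p)^2}{2 p (1-p)}} \] i.e., fluctuations around $p$ that are approximately $\mathcal{N}(p, p(1-p)/n)$, which is exactly what the de Moivre-Laplace form of the CLT says.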
Things I need to think about:
- Which variants most easily allow for non-identically-distributed terms?
- Which variants most easily allow for non-independent terms?
- Which variants are clearest about the non-Gaussian limits?
- What're the trade-offs here across ease of teaching for me, ease of understanding for my (undergrad stats. major or stats. Ph.D.) students, and density of lies-told-to-children?
The cumulant-generating-function argument (15 October 2024)
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \]Write the CGF of the random variable $X$ as \( C_{X}(t) \equiv \log{\Expect{e^{t X}}} \). (All logarithms are natural logs.)
1. \( C_{a+X}(t) = at + C_{X}(t) \) for any constant $a$.
2. \( C_{bX}(t) = C_{X}(bt) \) for any constant $b$.
3. \( C_{X+Y}(t) = C_{X}(t) + C_{Y}(t) \) if $X$ and $Y$ are independent.
4. \( C_{X}(t) = t\Expect{X} + \frac{t^2}{2}\Var{X} + O(t^3) \) for small $t$.
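(These properties can also be checked numerically from samples. A minimal sketch in Python, with arbitrary choices of distributions and $t$ on my part, illustrating property (3):)

```python
import numpy as np

rng = np.random.default_rng(42)

def empirical_cgf(samples, t):
    """Estimate C_X(t) = log E[exp(t X)] by a sample average."""
    return np.log(np.mean(np.exp(t * samples)))

# two independent samples, from arbitrarily chosen distributions
X = rng.exponential(scale=1.0, size=10**6)
Y = rng.uniform(-1.0, 1.0, size=10**6)
t = 0.5

# property (3): the CGF of an independent sum is the sum of the CGFs
print(empirical_cgf(X + Y, t))                    # ~ C_{X+Y}(t)
print(empirical_cgf(X, t) + empirical_cgf(Y, t))  # ~ C_X(t) + C_Y(t)
```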
Now consider IID random variables \( X_{1}, X_{2}, \ldots \), with common mean $\mu = \Expect{X}$, and common variance $\sigma^2=\Var{X}$. Define the sum, the average, and the rescaled-and-centered average of the first $n$ variables: \[ \begin{eqnarray} S_n & \equiv & \sum_{i=1}^{n}{X_i}\\ A_n & \equiv & \frac{1}{n}S_n\\ Z_n & \equiv & \sqrt{n}\frac{A_n - \mu}{\sigma} \end{eqnarray} \]
Using property (3), we have \[ C_{S_n}(t) = n C_{X}(t) \] Using property (2), we have \[ C_{A_n}(t) = C_{S_n}(t/n) = n C_{X}(t/n) \] Using property (4), \[ C_{A_n}(t) = \mu t + \frac{\sigma^2}{2n}t^2 + O(t^3 n^{-2}) \] Now using properties (1) and (2), \[ C_{Z_{n}}(t) = \frac{1}{2}t^2 + O(t^3 \sigma^{-3} n^{-1/2}) \] so in the limit $n\rightarrow \infty$, \[ C_{Z_{n}}(t) \rightarrow \frac{1}{2}t^2 \] for each fixed $t$. We have now shown that (suitably-massaged) sample averages of IID random variables converge on the same limiting CGF, regardless of the distribution we start with, provided only that the initial CGF is well-defined (and so all moments exist).
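(A quick Monte Carlo sketch of this convergence in Python; the exponential summands, the value of $t$, and the number of replicates are all arbitrary choices of mine, not anything canonical:)

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 1.0, 1.0   # mean and s.d. of Exp(1) summands
t = 1.0

for n in [10, 100, 1000]:
    # 10^4 replicates of the centered-and-scaled average Z_n
    X = rng.exponential(scale=1.0, size=(10**4, n))
    Z = np.sqrt(n) * (X.mean(axis=1) - mu) / sigma
    # empirical CGF of Z_n at t; should approach t**2/2 = 0.5
    print(n, np.log(np.mean(np.exp(t * Z))))
```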
Finally, some character-building integrals will confirm that the CGF of the standard Gaussian distribution is precisely $t^2/2$. Now, it is not quite true that a distribution is uniquely determined by its cumulants, but if we ignore that little complication, we have shown that \[ \sqrt{n}\frac{A_n - \mu}{\sigma} \rightsquigarrow \mathcal{N}(0,1) ~. \]
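(For completeness, the character-building integral is just completing the square in the exponent: \[ C_{X}(t) = \log{\int_{-\infty}^{\infty}{e^{tx} \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx}} = \log{\left( e^{t^2/2} \int_{-\infty}^{\infty}{\frac{e^{-(x-t)^2/2}}{\sqrt{2\pi}} dx} \right)} = \frac{t^2}{2} \] since the shifted integrand is the density of a $\mathcal{N}(t,1)$ and so integrates to one.)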
One of the nice things about this argument is that it suggests ways things can be modified for non-IID summands. Suppose we can say that \[ C_{S_n}(t) = r_n \Gamma(t) + g_n(t) \] for some sequence \( r_n \rightarrow \infty \) as \( n \rightarrow \infty \), where the remainder \( g_n(t) \) is \( o(r_n) \) for each fixed $t$ and \( o(t^2) \) as \( t \rightarrow 0 \). (In the IID case, \( r_n=n \) and the remainder \( g_n=0 \).) Now modify the definition of \( A_n \) to be \( A_n = S_n/r_n \). We get \[ C_{A_n}(t) = r_n \Gamma(t/r_n) + g_n(t/r_n) = \mu t + \frac{\sigma^2}{2 r_n}t^2 + O(t^3 r_n^{-2}) \] defining the asymptotic mean and variance by \( \mu = \Gamma^{\prime}(0) \) and \( \sigma^2 = \Gamma^{\prime\prime}(0) \). (The big-$O$ term includes both the higher-order contributions from $\Gamma$, and the remainder \( g_n \).) Finally, set \( Z_n = \sqrt{r_n}\frac{A_n - \mu}{\sigma} \) to get \[ C_{Z_n}(t) = \frac{t^2}{2} + O(t^3 \sigma^{-3} r_n^{-1/2}) \] Since \( r_n \rightarrow \infty \), the last term goes to zero, and \( Z_n \rightsquigarrow \mathcal{N}(0,1) \) once again, at least up to the caveats in the previous paragraph. (This form of argument is occasionally useful when dealing with network dependence. I will leave the situation where expectation values change non-stationarily, so we define \( \mu_n = r_n^{-1} \sum_{i=1}^{n}{\Expect{X_i}} \), as an exercise in elaborating the notation.) The factor \( r_n \) is then basically the effective sample size, and saying \( r_n \rightarrow \infty \) is saying "there has to be a growing number of effectively independent random contributions to the average, so no one term can be so large compared to the others that it dominates, and correlations cannot be so strong that fluctuations in one term shift everything forever".
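(A small simulation sketch of the dependent case, using a Gaussian AR(1) process as the weakly dependent example; the process, the value of $\rho$, and the long-run-variance formula $\sigma_{\epsilon}^2/(1-\rho)^2$ are my choices for this toy case, where $r_n \propto n$ and the long-run variance plays the role of $\sigma^2$:)

```python
import numpy as np

rng = np.random.default_rng(123)
rho, n, reps = 0.5, 2000, 5000

# AR(1): X_t = rho * X_{t-1} + eps_t, a weakly dependent, mean-zero process
eps = rng.normal(size=(reps, n))
X = np.zeros((reps, n))
X[:, 0] = eps[:, 0]
for t in range(1, n):
    X[:, t] = rho * X[:, t - 1] + eps[:, t]

# long-run variance of sqrt(n) * A_n for this AR(1): 1/(1 - rho)^2
lrv = 1.0 / (1.0 - rho)**2
Z = np.sqrt(n) * X.mean(axis=1) / np.sqrt(lrv)
print(Z.mean(), Z.var())   # should be close to 0 and 1
```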
- See also:
- Cumulants, and Cumulant Generating Functions
- Maximum Entropy Methods
- Mixing and Weak Dependence of Stochastic Processes
- Phase Transitions and Critical Phenomena
- Power Laws, and Other Heavy-Tailed Distributions
- Stochastic Processes
- To re-read, because I've completely forgotten how they go:
- Iván Calvo, Juan C. Cuchí, José G. Esteve, Fernando Falceto, "Generalized Central Limit Theorem and Renormalization Group", Journal of Statistical Physics 141 (2010): 409--421, arxiv:1009.2899
- Giovanni Jona-Lasinio, "Renormalization Group and Probability Theory", Physics Reports 352 (2001): 439--458, arxiv:cond-mat/0009219
- To read, historical:
- William J. Adams, The Life and Times of the Central Limit Theorem
- Hans Fischer, History of the Central Limit Theorem: From Laplace to Donsker
- Francesco Mainardi, Sergei Rogosin, "The origin of infinitely divisible distributions: from de Finetti's problem to Lévy-Khintchine formula", arxiv:0801.1910
- Jan von Plato, Creating Modern Probability [I need to finish this, one of these years...]
- To read:
- Fulvio Baldovin, Attilio L. Stella, "Central limit theorem for anomalous scaling induced by correlations", arxiv:cond-mat/0510225
- Oliver Johnson and Andrew Barron, "Fisher Information inequalities and the Central Limit Theorem", arxiv:math.PR/0111020
- Oliver Johnson and Richard Samworth, "Central Limit Theorem and convergence to stable laws in Mallows distance", arxiv:math.PR/0406218
- Mohamed El Machkouri and Lahcen Ouchti, "Exact convergence rates in the central limit theorem for a class of martingales", arxiv:math.PR/0403385
- Magda Peligrad and Sergey Utev, "Central limit theorem for stationary linear processes", arxiv:math.PR/0509682