Notebooks

Central Limit Theorem(s)

Last update: 08 Dec 2024 00:01
First version: 21 October 2024

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

At some point, I want to collect all the ways of saying why we should expect Gaussian fluctuations in the large-$n$ limit when everything is nice, and why we should expect other limiting distributions in other cases... A partial, off-the-top-of-my-head list:

  1. The cumulant-generating-function argument: cumulant generating functions add for independent random variables, so the CGF of an (appropriately scaled and centered) average approaches the CGF of a standard Gaussian. (See below.) There are variants of this argument for moment generating functions and for "characteristic functions" [i.e. Fourier transforms]. (True, and can be made rigorous in the characteristic-function version, but I find it unilluminating somehow.)
  2. The maximum entropy argument: The Gaussian is the maximum-Shannon-entropy distribution with given mean and variance. (So what?)
  3. Decompose an arbitrary distribution into a part that's invariant under averaging and a part that isn't. The latter component goes away under repeated averaging, so we need to converge to a distribution that's invariant under averaging, such as the Gaussian... (Arguably the same as the generating-function argument, at least in the characteristic-function/Fourier-transform version?)
  4. The renormalization group, somehow; I understood that argument in the 2000s. (Maybe identical to some or all of the above?)
  5. The large-deviations argument: Suppose \( X_n \) obeys a large deviations principle, so \( \lim_{n\rightarrow\infty}{n^{-1} \log{\Prob{X_n = x}}} = -J(x) \) for some non-negative rate function $J$. (That's an inexact statement of the LDP, and you can follow the link for more precise ones.) Turned around, \( \Prob{X_n = x} \approx e^{-n J(x) } \). Now suppose/hope that the rate function has a very nice minimum at $x^*$, where $J(x^*) = 0$ and the first derivatives are zero but the second derivatives aren't. Taylor-expand around $x^*$ to get \( \Prob{X_n=x} \approx e^{-\frac{n}{2}(x-x^*)^2 J^{\prime\prime}(x^*)} \), which is the form of a Gaussian centered at $x^*$ with variance proportional to the inverse of the second derivative. (When $x$ is a vector rather than a scalar, you can make the appropriate modifications yourself.) So if there's an LDP, and it has a nice minimum, we should expect locally-Gaussian fluctuations. (Here "locally" means "where doing a Taylor expansion inside an exponent is only sloppy and not positively criminal." A small numerical illustration of this, for Bernoulli variables, appears just after this list.)
  6. In equilibrium statistical mechanics, apply Einstein's fluctuation relation, that the probability of a macrostate $m$ is $\propto e^{S(m)}$, $S$ being the (unitless) Boltzmann entropy. Now Taylor-expand around the maximum-entropy macrostate, i.e., the equilibrium macrostate. (Cf.) This is basically the same as the large-deviations argument, but it tells us we shouldn't expect Gaussian fluctuations if the entropy is discontinuous or not differentiable (i.e., at a phase transition).
  7. Le Cam's "local asymptotic normality" (which I don't understand well enough to caricature, so I need to study more).
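To make the large-deviations argument (item 5) concrete, here is a minimal numerical sketch; it assumes only numpy, and picks the Bernoulli case purely because its rate function is explicit. For the sample mean of IID Bernoulli($p$) variables, the rate function is the Kullback-Leibler divergence \( J(x) = x\log{(x/p)} + (1-x)\log{((1-x)/(1-p))} \), and \( J^{\prime\prime}(p) = 1/(p(1-p)) = 1/\Var{X} \), so the (normalized) approximation \( e^{-nJ(x)} \) should track the Gaussian with mean $p$ and variance $p(1-p)/n$ near the peak, and drift away from it out in the tails.

```python
# Sketch: compare the normalized large-deviations approximation exp(-n J(x))
# to the Gaussian obtained by Taylor-expanding J around its minimum.
# Bernoulli(p) sample means; only numpy is assumed.
import numpy as np

p, n = 0.3, 200
x = np.linspace(0.01, 0.99, 981)
dx = x[1] - x[0]

# Rate function for the mean of n IID Bernoulli(p): the KL divergence D(x || p)
J = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

# Large-deviations approximation to the density of the sample mean,
# normalized numerically so that it integrates to one
ld_density = np.exp(-n * J)
ld_density /= ld_density.sum() * dx

# Gaussian from the quadratic term: J''(p) = 1/(p(1-p)), so the
# approximating normal has mean p and variance p(1-p)/n
var_mean = p * (1 - p) / n
gauss_density = np.exp(-(x - p) ** 2 / (2 * var_mean)) / np.sqrt(2 * np.pi * var_mean)

# The two should be close near the peak and diverge in the tails
near_peak = np.abs(x - p) < 2 * np.sqrt(var_mean)
gap = np.abs(ld_density - gauss_density) / gauss_density
print("max relative gap within 2 SDs of the peak:", gap[near_peak].max())
print("relative gap 5 SDs below the peak        :",
      gap[np.argmin(np.abs(x - (p - 5 * np.sqrt(var_mean))))])
```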

Things I need to think about:

The cumulant-generating-function argument (15 October 2024)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \]

Write the CGF of the random variable $X$ as \( C_{X}(t) \equiv \log{\Expect{e^{t X}}} \). (All logarithms are natural logs.)

  1. \( C_{a+X}(t) = at + C_{X}(t) \) for any constant $a$.
  2. \( C_{bX}(t) = C_{X}(bt) \) for any constant $b$.
  3. \( C_{X+Y}(t) = C_{X}(t) + C_{Y}(t) \) if $X$ and $Y$ are independent.
  4. \( C_{X}(t) = t\Expect{X} + \frac{t^2}{2}\Var{X} + O(t^3) \) for small $t$.
Properties (1)--(3) are easy algebra from the definition; property (4) follows from the fact that the CGF is, indeed, a generating function, with the expected value and the variance as the first two cumulants.
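As a sanity check on these properties, here is a little Monte Carlo sketch; it assumes only numpy, and uses exponential variables purely because their CGF is available in closed form, \( C_{X}(t) = -\log{(1-t/\lambda)} \) for \( t < \lambda \). Properties (1) and (2) hold term-by-term in the sample, so their "check" really is just the algebra; properties (3) and (4) hold only up to Monte Carlo and Taylor error, respectively.

```python
# Empirical check of the CGF properties for Exponential(rate=2) variables.
# Only numpy is assumed; the sample sizes and constants are arbitrary.
import numpy as np

rng = np.random.default_rng(42)
rate, t, a, b = 2.0, 0.5, 3.0, 1.7
X = rng.exponential(scale=1 / rate, size=2_000_000)
Y = rng.exponential(scale=1 / rate, size=2_000_000)   # independent copy

def cgf(sample, t):
    """Empirical cumulant generating function: log of the sample mean of exp(t * sample)."""
    return np.log(np.mean(np.exp(t * sample)))

EX, VarX = 1 / rate, 1 / rate ** 2
print("closed form        :", -np.log(1 - t / rate), "vs", cgf(X, t))
print("(1) shift          :", a * t + cgf(X, t), "vs", cgf(a + X, t))
print("(2) scale          :", cgf(X, b * t), "vs", cgf(b * X, t))
print("(3) independent sum:", cgf(X, t) + cgf(Y, t), "vs", cgf(X + Y, t))
print("(4) small t        :", t * EX + t ** 2 * VarX / 2, "vs", cgf(X, t),
      "(agreement improves as t -> 0)")
```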

Now consider IID random variables \( X_{1}, X_{2}, \ldots \), with common mean $\mu = \Expect{X}$, and common variance $\sigma^2=\Var{X}$. Define the sum, the average, and the rescaled-and-centered average of the first $n$ variables: \[ \begin{eqnarray} S_n & \equiv & \sum_{i=1}^{n}{X_i}\\ A_n & \equiv & \frac{1}{n}S_n\\ Z_n & \equiv & \sqrt{n}\frac{A_n - \mu}{\sigma} \end{eqnarray} \]

Using property (3), we have \[ C_{S_n}(t) = n C_{X}(t) \] Using property (2), we have \[ C_{A_n}(t) = C_{S_n}(t/n) = n C_{X}(t/n) \] Using property (4), \[ C_{A_n}(t) = \mu t + \frac{\sigma^2}{2n}t^2 + O(t^3 n^{-2}) \] Now using properties (1) and (2), \[ C_{Z_{n}}(t) = \frac{1}{2}t^2 + O(t^3 n^{-1/2}) \] (the constant hidden in the big-$O$ is, to leading order, proportional to the skewness \( \kappa_3/\sigma^3 \)), so in the limit $n\rightarrow \infty$, for each fixed $t$, \[ C_{Z_{n}}(t) \rightarrow \frac{1}{2}t^2 \] We have now shown that (suitably-massaged) sample averages of IID variables converge on the same limiting CGF, regardless of the distribution we start with, provided only that the initial CGF is well-defined (and so all moments exist).
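To watch the convergence numerically, here is a small Monte Carlo sketch; it assumes only numpy, and uses Exponential(1) summands (so $\mu = \sigma = 1$) precisely because they are heavily skewed. The average of $n$ IID Exponential(1) variables is exactly Gamma($n$, scale $1/n$), which saves simulating the individual summands.

```python
# Estimate C_{Z_n}(t) = log E[exp(t Z_n)] by Monte Carlo for Exponential(1)
# summands and compare with the Gaussian target t^2/2.  Only numpy is assumed.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                      # mean and sd of Exponential(1)
ts = np.array([-1.0, -0.5, 0.5, 1.0])     # points at which to estimate the CGF

for n in (5, 50, 500):
    # A_n for Exp(1) summands is exactly Gamma(shape=n, scale=1/n)
    A = rng.gamma(shape=n, scale=1.0 / n, size=500_000)
    Z = np.sqrt(n) * (A - mu) / sigma
    est = np.log(np.mean(np.exp(np.outer(ts, Z)), axis=1))
    print(f"n = {n:3d}: estimated CGF", np.round(est, 3),
          " vs t^2/2 =", np.round(ts ** 2 / 2, 3))
```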

Finally, some character-building integrals will confirm that the CGF of the standard Gaussian distribution is precisely $t^2/2$. Now, it is not quite true that a distribution is uniquely determined by its cumulants, but if we ignore that little complication, we have shown that \[ \sqrt{n}\frac{A_n - \mu}{\sigma} \rightsquigarrow \mathcal{N}(0,1) ~. \]
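(For the record, the character-building integral is just a matter of completing the square: for $Z \sim \mathcal{N}(0,1)$, \[ \Expect{e^{tZ}} = \int_{-\infty}^{\infty}{e^{tz} \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz} = e^{t^2/2} \int_{-\infty}^{\infty}{\frac{e^{-(z-t)^2/2}}{\sqrt{2\pi}} dz} = e^{t^2/2} ~, \] since the last integrand is the $\mathcal{N}(t,1)$ density, so \( C_{Z}(t) = t^2/2 \) exactly, and every cumulant beyond the second is zero.)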

One of the nice things about this argument is that it suggests ways things can be modified for non-IID summands. Suppose we can say that \[ C_{S_n}(t) = r_n \Gamma(t) + g_n(t) \] for some sequence \( r_n \rightarrow \infty \) as \( n \rightarrow \infty \), where the remainder \( g_n(t) \) is \( o(r_n) \) and \( o(t^2) \). (In the IID case, \( r_n=n \) and the remainder \( g_n=0 \).) Now modify the definition of \( A_n \) to be \( A_n = S_n/r_n \). We get \[ C_{A_n}(t) = r_n \Gamma(t/r_n) + g_n(t/r_n) = \mu t + \frac{\sigma^2}{2 r_n}t^2 + O(t^3 r_n^{-2}) \] defining the asymptotic mean and variance by \( \mu = \Gamma^{\prime}(0) \) and \( \sigma^2 = \Gamma^{\prime\prime}(0) \). (The big-$O$ term includes both the higher-order contributions from $\Gamma$, and the remainder \( g_n \).) Finally, set \( Z_n = \sqrt{r_n}\frac{A_n - \mu}{\sigma} \) to get \[ C_{Z_n}(t) = \frac{t^2}{2} + O(t^3 r_n^{-1/2}) \] Since \( r_n \rightarrow \infty \), the last term goes to zero, and \( Z_n \rightsquigarrow \mathcal{N}(0,1) \) once again, at least up to the caveats in the previous paragraph. (This form of argument is occasionally useful when dealing with network dependence. I will leave the situation where expectation values change non-stationarily, so we define \( \mu_n = r_n^{-1} \sum_{i=1}^{n}{\Expect{X_i}} \), as an exercise in elaborating the notation.) The factor \( r_n \) is then basically the effective sample size, and saying \( r_n \rightarrow \infty \) is saying "there has to be a growing number of effectively independent random contributions to the average, so no one term can be so large compared to the others that it dominates, and correlations cannot be so strong that fluctuations in one term shift everything forever".
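Here is a small simulation of the dependent-summand version; it assumes only numpy, and uses an MA(1) process with centered Exponential(1) innovations purely for concreteness. For \( X_i = \epsilon_i + \theta \epsilon_{i-1} \) with IID innovations, \( C_{S_n}(t) = n C_{\epsilon}((1+\theta)t) + g_n(t) \) with $g_n$ bounded in $n$ (the boundary terms from $\epsilon_0$ and $\epsilon_n$), so \( r_n = n \), \( \mu = (1+\theta)\Expect{\epsilon} \), and \( \sigma^2 = (1+\theta)^2 \Var{\epsilon} \).

```python
# Dependent summands: X_i = eps_i + theta * eps_{i-1} with centered Exp(1)
# innovations, r_n = n, and sigma^2 = (1 + theta)^2 * Var(eps).  If the
# argument goes through, Z_n should look standard normal.  Only numpy assumed.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.8, 1_000, 10_000
sigma = (1 + theta) * 1.0                        # sd of Exp(1) innovations is 1

eps = rng.exponential(size=(reps, n + 1)) - 1.0  # centered innovations, mean zero
X = eps[:, 1:] + theta * eps[:, :-1]             # n MA(1) terms per replicate
Z = np.sqrt(n) * X.mean(axis=1) / sigma          # mu = 0 here

print("mean and variance of Z_n:", Z.mean().round(3), Z.var().round(3))
print("P(Z_n <= 1.645)         :", (Z <= 1.645).mean().round(3),
      " (standard normal: 0.950)")
```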

