Attention conservation notice: Equation-filled attempt at a teaching note on some theorems in mathematical probability and their statistical application. (Plus an oblique swipe at macroeconomists.)

The "law of large numbers" says that averages of measurements calculated over increasingly large random samples converge on the averages calculated over the whole probability distribution; since that's a vague statement, there are actually several laws of large numbers, from the various ways of making this precise. As traditionally stated, they assume that the measurements are all independent of each other. Successive observations from a dynamical system or stochastic process are generally dependent on each other, so the laws of large numbers don't, strictly, apply, but they have analogs, called "ergodic theorems". (Blame Boltzmann.) Laws of large numbers and ergodic theorems are the foundations of statistics; they say that sufficiently large samples are representative of the underlying process, and so let us generalize from training data to future or currently-unobserved occurrences.

Here is the simplest route I know to such a theorem; I can't remember if I
learned it from
Prof. A. V. Chubukov's
statistical mechanics class, or from Uriel Frisch's
marvellous Turbulence.
Start with a sequence of random variables \( X_1, X_2, \ldots X_n \). Assume
that they all have the same (finite) mean *m* and the same (finite)
variance *v*; also assume that the
covariance, \( \mathbf{E}[X_t X_{t+h}] - \mathbf{E}[X_t]\mathbf{E}[X_{t+h}] \),
depends only on the difference in times *h* and not on the starting
time *t*. (These assumptions together comprise "second-order" or "weak"
or "wide-sense" stationarity. Stationarity is *not* actually needed for
ergodic theorems, one can get away with what's called "asymptotic mean
stationarity", but stationarity simplifies the presentation here.) Call this
covariance \( c_h \). We contemplate the arithmetic mean of
the first *n* values in *X*, called the "time average":
\[
A_n = \frac{1}{n}\sum_{t=1}^{n}{X_t}
\]

What is the expectation value of the time average? Taking expectations
is a linear operator, so
\[
\mathbf{E}[A_n] = \frac{1}{n}\sum_{t=1}^{n}{\mathbf{E}[X_t]} = \frac{n}{n}m = m
\]
which is re-assuring: the expectation of the time average is the common
expectation. What we need for an ergodic theorem is to show that as *n*
grows, \( A_n \) tends, in some sense, to get closer and
closer to its expectation value.

The most obvious sense we could try is for the variance
of \( A_n \) to shrink as *n* grows. Let's work
out that variance, remembering that for any random
variable *Y*, \( \mathrm{Var}[Y] = \mathbf{E}[Y^2] - {\left(\mathbf{E}[Y]\right)}^2 \).
\[
\begin{eqnarray*}
\mathrm{Var}[A_n] & = & \mathbf{E}[A_n^2] - m^2\\
& = & \frac{1}{n^2}\mathbf{E}\left[{\left(\sum_{t=1}^{n}{X_t}\right)}^2\right] - m^2\\
& = & \frac{1}{n^2}\mathbf{E}\left[\sum_{t=1}^{n}{\sum_{s=1}^{n}{X_t X_s}}\right] - m^2\\
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\mathbf{E}\left[X_t X_s\right]}} - m^2\\
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{ c_{s-t} + m^2}} - m^2\\
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{ c_{s-t}}}\\
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=1-t}^{n-t}{ c_h}}
\end{eqnarray*}
\]

This used the linearity of expectations, and the definition of the
covariances *c*_{h}. Imagine that we write out all the
covariances in an *n***n* matrix, and average them together; that's
the variance of \( A_n \). The entries on the diagonal of
the matrix are all *c*_{0} = *v*, and the off-diagonal
entries are symmetric, because (check this!) *c*_{-h}
= *c*_{h}. So the sum over the whole matrix is the sum on
the diagonal, plus twice the sum of what's above the diagonal.
\[
\mathrm{Var}[A_n] = \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}}
\]

If the *X*_{t} were uncorrelated, we'd
have *c*_{h} = 0 for all *h* > 0, so the variance of
the time average would be *O*(*n*^{-1}). Since independent
random variables are necessarily uncorrelated (but not vice versa), we have
just recovered a form of the law of large numbers for independent data. How
can we make the remaining part, the sum over the upper triangle of the
covariance matrix, go to zero as well?

We need to recognize that it won't *automatically* do so. The
assumptions we've made so far are compatible with a process
where *X*_{1} is chosen randomly, and then all subsequent
observations are copies of it, so that then the variance of the time average
is *v*, no matter how long the time series; this is the famous problem of
checking a newspaper story by reading another copy of the same paper. (More
formally, in this situation *c*_{h} = *v* for
all *h*, and you can check that plugging this in to the equations above
gives *v* for variance of \( A_n \) for all *n*.)
So if we want an ergodic theorem, we will have to impose some assumption on the
covariances, one weaker than "they are all zero" but strong enough to exclude
the sequence of identical copies.

Using two inequalities to put upper bounds on the variance of the time
average suggests a natural and useful assumption which will give us our ergodic
theorem.
\[
\begin{eqnarray*}
\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}} & \leq & \sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{|c_h|}}\\
& \leq & \sum_{t=1}^{n-1}{\sum_{h=1}^{\infty}{|c_h|}}
\end{eqnarray*}
\]
Covariances can be negative, so we upper-bound the sum of the actual
covariances by the sum of their magnitudes. (There is no approximation here if
all covariances are positive.) Then we extend the inner sum so it
covers *all* lags. This might of course be infinite, and would be for
the sequence-of-identical-copies. Our assumption then is
\[
\sum_{h=1}^{\infty}{|c_h|} < \infty
\]
This is a sum of covariances over time, so let's write it in a way which
reflects those units: \( \sum_{h=1}^{\infty}{|c_h|} = v T \),
where *T* is called the "(auto)covariance time", "integrated
(auto)covariance time" or "(auto)correlation time". We are assuming a finite
correlation time. (Exercise: Suppose that \( c_h = v e^{-h \tau} \), as
would be the case for a first-order linear autoregressive model, and
find *T*. This confirms, by the way, that the assumption of finite
correlation time *can* be satisfied by processes with non-zero
correlations.)

Returning to the variance of the time average,
\[
\begin{eqnarray*}
\mathrm{Var}[A_n] & = & \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}}\\
& \leq & \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{v T}\\
& = & \frac{v}{n} + \frac{2(n-1) vT}{n^2}\\
& \leq & \frac{v}{n} + \frac{2 vT}{n}\\
& = & \frac{v}{n}(1+ 2T)
\end{eqnarray*}
\]
So, if we can assume the correlation time is finite, the variance of the time
averages is \( O \left( n^{-1} \right) \), just as if the data were
independent. However, the convergence is slower than for independent data by
an over-all factor which depends only on *T*. As *T* shrinks to
zero, we recover the result for uncorrelated data, an indication that our
approximations were not *too* crude.

From knowing the variance, we can get rather tight bounds on the probability
of \( A_n \)'s deviations from *m* if we assume that the
fluctuations are Gaussian. Unfortunately, none of our assumptions so far
entitle us to assume that. For independent data, we get Gaussian fluctuations
of averages via the central limit theorem, and these results, too, can
be extended to dependent
data. But the assumptions needed for dependent central limit theorems are
much stronger than merely a finite correlation time. What needs to happen,
roughly speaking, is that if I take (nearly) arbitrary functions
*f* and *g*, the correlation between \( f(X_t) \) and \( g(X_{t+h})
\) must go to zero as *h* grows. (This idea is quantified as
"mixing" or
"weak dependence".)

However, even without the Gaussian assumption, we can put *some*
bounds on deviation probabilities by bounding the variance (as we have) and
using Chebyshev's inequality:
\[
\mathrm{Pr}\left(|A_n - m| > \epsilon\right) \leq \frac{\mathrm{Var}[A_n]}{\epsilon^2} \leq \frac{v}{\epsilon^2} \frac{2T+1}{n}
\]
which goes to zero as *n* grows. So we have just proved convergence "in
mean square" and "in probability" of time averages on their stationary
expectation values, i.e., the mean square and weak ergodic theorems, under the
assumptions that the data are weakly stationary and the correlation time is
finite. There were a couple of steps in our argument where we used not very
tight inequalities, and it turns out we can weaken the assumption of finite
correlation time. The necessary and sufficient condition for the
mean-square ergodic theorem turns out to be that, as one might hope,
\[
\lim_{n\rightarrow 0}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0
\]
though I don't know of any way of proving it rigorously without using Fourier
analysis (which is linked to the autocovariance via the Wiener-Khinchin
theorem; see chapters 19 and 21 of Almost None of the Theory of
Stochastic Processes).

Reverting to the case of finite correlation time *T*, observe that we
have the same variance from *n* dependent samples as we would
from *n*/(1+2*T*) independent ones. One way to think of this is that
the dependence shrinks the *effective* sample size by a factor of
2*T*+1. Another, which is related to the name "correlation time", is
to imagine dividing the time series up into blocks of that length, i.e., a
central point and its *T* neighbors in either direction, and use only the
central points in our averages. Those are, in a sense, effectively
uncorrelated. Non-trivial correlations extend about *T* time-steps in
either direction. Knowing *T* can be very important in figuring out how
much actual information is contained in your data set.

To give an illustration not entirely at random, quantitative macroeconomic modeling is usually based on official statistics, like GDP, which come out quarterly. For the US, which is the main but not exclusive focus of these efforts, the data effectively start in 1947, as what national income accounts exist before then are generally thought too noisy to use. Taking the GDP growth rate series from 1947 to the beginning of 2010, 252 quarters in all, de-trending, I calculate a correlation time of just over ten quarters. (This granting the economists their usual, but absurd, assumption that economic fluctuations are stationary.) So macroeconomic modelers effectively have 11 or 12 independent data points to argue over.

Constructively, this idea leads to the mathematical trick of "blocking". To extend a result about independent random sequences to dependent ones, divide the dependent sequence up into contiguous blocks, but with gaps between them, long enough that the blocks are nearly independent of each other. One then has the IID result for the blocks, plus a correction which depends on how much residual dependence remains despite the filler. Picking an appropriate combination of block length and spacing between blocks keeps the correction small, or at least controllable. This idea is used extensively in ergodic theory (including the simplest possible proof of the strong ergodic theorem) and information theory (see Almost None again), in proving convergence results for weakly dependent processes, in bootstrapping time series, and in statistical learning theory under dependence.

*Manual trackback*: An Ergodic Walk (fittingly enough);
Thoughts on Economics

**Update**, 7 August: Fixed typos in equations.

**Update**, 20 January 2013: Finally thought to link to the follow-up.

Posted at July 02, 2010 13:40 | permanent link