July 02, 2010

The World's Simplest Ergodic Theorem

Attention conservation notice: Equation-filled attempt at a teaching note on some theorems in mathematical probability and their statistical application. (Plus an oblique swipe at macroeconomists.)

The "law of large numbers" says that averages of measurements calculated over increasingly large random samples converge on the averages calculated over the whole probability distribution; since that's a vague statement, there are actually several laws of large numbers, from the various ways of making this precise. As traditionally stated, they assume that the measurements are all independent of each other. Successive observations from a dynamical system or stochastic process are generally dependent on each other, so the laws of large numbers don't, strictly, apply, but they have analogs, called "ergodic theorems". (Blame Boltzmann.) Laws of large numbers and ergodic theorems are the foundations of statistics; they say that sufficiently large samples are representative of the underlying process, and so let us generalize from training data to future or currently-unobserved occurrences.

Here is the simplest route I know to such a theorem; I can't remember if I learned it from Prof. A. V. Chubukov's statistical mechanics class, or from Uriel Frisch's marvellous Turbulence. Start with a sequence of random variables \( X_1, X_2, \ldots X_n \). Assume that they all have the same (finite) mean m and the same (finite) variance v; also assume that the covariance, \( \mathbf{E}[X_t X_{t+h}] - \mathbf{E}[X_t]\mathbf{E}[X_{t+h}] \), depends only on the difference in times h and not on the starting time t. (These assumptions together comprise "second-order" or "weak" or "wide-sense" stationarity. Stationarity is not actually needed for ergodic theorems, one can get away with what's called "asymptotic mean stationarity", but stationarity simplifies the presentation here.) Call this covariance \( c_h \). We contemplate the arithmetic mean of the first n values in X, called the "time average": \[ A_n = \frac{1}{n}\sum_{t=1}^{n}{X_t} \]
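To make the setup concrete, here is a minimal Python sketch which simulates one weakly stationary process, a first-order autoregression, and computes its time average. The autoregressive coefficient and the sample size are arbitrary illustrative choices; nothing in the assumptions above requires them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) process, X_t = a*X_{t-1} + Z_t with IID Gaussian noise.
# This is weakly stationary: constant mean (0 here), constant variance, and
# covariances c_h depending only on the lag h.  The coefficient and sample
# size are illustrative choices.
a, n = 0.6, 10_000
z = rng.standard_normal(n)
x = np.empty(n)
x[0] = z[0] / np.sqrt(1.0 - a**2)   # draw X_1 from the stationary distribution
for t in range(1, n):
    x[t] = a * x[t - 1] + z[t]

A_n = x.mean()   # the time average A_n = (1/n) * (X_1 + ... + X_n)
print("time average A_n =", A_n)
```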

What is the expectation value of the time average? Taking expectations is a linear operator, so \[ \mathbf{E}[A_n] = \frac{1}{n}\sum_{t=1}^{n}{\mathbf{E}[X_t]} = \frac{n}{n}m = m \] which is re-assuring: the expectation of the time average is the common expectation. What we need for an ergodic theorem is to show that as n grows, \( A_n \) tends, in some sense, to get closer and closer to its expectation value.

The most obvious sense we could try is for the variance of \( A_n \) to shrink as n grows. Let's work out that variance, remembering that for any random variable Y, \( \mathrm{Var}[Y] = \mathbf{E}[Y^2] - {\left(\mathbf{E}[Y]\right)}^2 \). \[ \begin{eqnarray*} \mathrm{Var}[A_n] & = & \mathbf{E}[A_n^2] - m^2\\ & = & \frac{1}{n^2}\mathbf{E}\left[{\left(\sum_{t=1}^{n}{X_t}\right)}^2\right] - m^2\\ & = & \frac{1}{n^2}\mathbf{E}\left[\sum_{t=1}^{n}{\sum_{s=1}^{n}{X_t X_s}}\right] - m^2\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\mathbf{E}\left[X_t X_s\right]}} - m^2\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{ c_{s-t} + m^2}} - m^2\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{ c_{s-t}}}\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=1-t}^{n-t}{ c_h}} \end{eqnarray*} \]
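Here is a quick Monte Carlo check of both of the last two displays, using an AR(1) process as a stand-in because its exact autocovariances are available in closed form; the agreement is only up to simulation error, of course.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of E[A_n] = m and Var[A_n] = (1/n^2) * sum_{t,s} c_{s-t},
# using an AR(1) stand-in: X_t = a*X_{t-1} + Z_t with unit-variance noise,
# for which m = 0 and c_h = a^|h| / (1 - a^2).
a, n, reps = 0.6, 200, 5_000

def simulate(n):
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1 - a**2)   # stationary start
    for t in range(1, n):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

time_avgs = np.array([simulate(n).mean() for _ in range(reps)])

# The double sum over the covariance matrix, with the exact AR(1) covariances.
lags = np.subtract.outer(np.arange(n), np.arange(n))
var_formula = np.sum(a ** np.abs(lags) / (1 - a**2)) / n**2

print("mean of A_n over replications:", time_avgs.mean(), "(true m = 0)")
print("variance of A_n, simulated   :", time_avgs.var())
print("variance of A_n, from formula:", var_formula)
```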

That derivation used the linearity of expectations, and the definition of the covariances \( c_h \). Imagine that we write out all the covariances in an \( n \times n \) matrix, and average them together; that's the variance of \( A_n \). The entries on the diagonal of the matrix are all \( c_0 = v \), and the off-diagonal entries are symmetric, because (check this!) \( c_{-h} = c_h \). So the sum over the whole matrix is the sum on the diagonal, plus twice the sum of what's above the diagonal. \[ \mathrm{Var}[A_n] = \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}} \]

If the \( X_t \) were uncorrelated, we'd have \( c_h = 0 \) for all h > 0, so the variance of the time average would be \( O(n^{-1}) \). Since independent random variables are necessarily uncorrelated (but not vice versa), we have just recovered a form of the law of large numbers for independent data. How can we make the remaining part, the sum over the upper triangle of the covariance matrix, go to zero as well?
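The diagonal-plus-triangle rearrangement, and the collapse to v/n in the uncorrelated case, are easy to verify numerically; here is a small sketch with made-up covariances (geometric decay, chosen purely for illustration).

```python
import numpy as np

# Check that averaging the full n-by-n covariance matrix equals v/n plus twice
# the normalized sum over the upper triangle.  The covariances used are made
# up for illustration (geometric decay, with c_{-h} = c_h); setting them to
# zero off the diagonal recovers v/n exactly.
n, v = 50, 2.0

def check(c):   # c(h) = covariance at lag h, with c(0) = v
    full = sum(c(s - t) for t in range(1, n + 1) for s in range(1, n + 1)) / n**2
    tri = v / n + (2 / n**2) * sum(c(h) for t in range(1, n)
                                   for h in range(1, n - t + 1))
    return full, tri

full, tri = check(lambda h: v * 0.7 ** abs(h))      # correlated, made-up c_h
print("full matrix average:", full, "  diagonal + 2*triangle:", tri)
full, tri = check(lambda h: v if h == 0 else 0.0)   # uncorrelated case
print("uncorrelated case  :", full, "  which equals v/n =", v / n)
```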

We need to recognize that this remaining sum won't automatically go to zero. The assumptions we've made so far are compatible with a process where \( X_1 \) is chosen randomly, and then all subsequent observations are copies of it, so that the variance of the time average is v, no matter how long the time series; this is the famous problem of checking a newspaper story by reading another copy of the same paper. (More formally, in this situation \( c_h = v \) for all h, and you can check that plugging this in to the equations above gives v for the variance of \( A_n \) for all n.) So if we want an ergodic theorem, we will have to impose some assumption on the covariances, one weaker than "they are all zero" but strong enough to exclude the sequence of identical copies.
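Here is a sketch of the same-newspaper process, showing the variance of the time average refusing to shrink; the standard-normal choice for \( X_1 \) is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# The "same newspaper" process: draw X_1 once (standard normal here, so v = 1)
# and let every later observation be a copy of it.  Across many replications,
# Var[A_n] stays at v no matter how large n gets.
reps = 10_000
for n in (10, 100, 1000):
    x1 = rng.standard_normal(reps)             # one X_1 per replication
    x = np.tile(x1[:, None], (1, n))           # X_2, ..., X_n are copies of X_1
    A_n = x.mean(axis=1)                       # time average in each replication
    print("n =", n, "  Var[A_n] =", round(A_n.var(), 3))   # stays near v = 1
```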

Using two inequalities to put upper bounds on the variance of the time average suggests a natural and useful assumption which will give us our ergodic theorem. \[ \begin{eqnarray*} \sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}} & \leq & \sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{|c_h|}}\\ & \leq & \sum_{t=1}^{n-1}{\sum_{h=1}^{\infty}{|c_h|}} \end{eqnarray*} \] Covariances can be negative, so we upper-bound the sum of the actual covariances by the sum of their magnitudes. (There is no approximation here if all covariances are positive.) Then we extend the inner sum so it covers all lags. This might of course be infinite, and would be for the sequence-of-identical-copies. Our assumption then is \[ \sum_{h=1}^{\infty}{|c_h|} < \infty \] This is a sum of covariances over time, so let's write it in a way which reflects those units: \( \sum_{h=1}^{\infty}{|c_h|} = v T \), where T is called the "(auto)covariance time", "integrated (auto)covariance time" or "(auto)correlation time". We are assuming a finite correlation time. (Exercise: Suppose that \( c_h = v e^{-h \tau} \), as would be the case for a first-order linear autoregressive model, and find T. This confirms, by the way, that the assumption of finite correlation time can be satisfied by processes with non-zero correlations.)
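As a numerical sanity check, for one arbitrary value of \( \tau \), the exercise's covariance sequence really does give a finite correlation time; the closed form is still left as the exercise.

```python
import numpy as np

# Numerically sum |c_h| / v = e^{-h*tau} for the exercise's covariance
# sequence, truncating the infinite sum at a lag where the remaining terms
# are negligible.  The value of tau is an arbitrary illustrative choice.
tau = 0.5
h = np.arange(1, 200)            # at this tau, terms beyond lag 200 are negligible
T = np.sum(np.exp(-h * tau))
print("correlation time T =", T)  # finite, as the exercise asks you to confirm
```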

Returning to the variance of the time average, \[ \begin{eqnarray*} \mathrm{Var}[A_n] & = & \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}}\\ & \leq & \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{v T}\\ & = & \frac{v}{n} + \frac{2(n-1) vT}{n^2}\\ & \leq & \frac{v}{n} + \frac{2 vT}{n}\\ & = & \frac{v}{n}(1+ 2T) \end{eqnarray*} \] So, if we can assume the correlation time is finite, the variance of the time averages is \( O \left( n^{-1} \right) \), just as if the data were independent. However, the convergence is slower than for independent data by an over-all factor which depends only on T. As T shrinks to zero, we recover the result for uncorrelated data, an indication that our approximations were not too crude.
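Here is a check of this bound against the exact variance of the time average for the AR(1) stand-in, where v, the covariances, and hence T are all available in closed form; the coefficient and the sample sizes are illustrative.

```python
import numpy as np

# Compare the exact Var[A_n] to the bound (v/n)*(1 + 2T) for an AR(1) process
# with coefficient a and unit innovation variance (an illustrative stand-in):
# v = 1/(1 - a^2), c_h = v*a^h, and so T = sum_{h>=1} a^h = a/(1 - a).
a = 0.6
v = 1 / (1 - a**2)
T = a / (1 - a)

for n in (10, 100, 1000, 10_000):
    h = np.arange(1, n)
    # Var[A_n] = (1/n^2) * [ n*c_0 + 2 * sum_{h=1}^{n-1} (n - h) * c_h ]
    exact = v / n + (2 / n**2) * np.sum((n - h) * v * a**h)
    bound = (v / n) * (1 + 2 * T)
    print(f"n = {n:6d}  exact Var[A_n] = {exact:.6f}  bound = {bound:.6f}")
```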

From knowing the variance, we can get rather tight bounds on the probability of \( A_n \)'s deviations from m if we assume that the fluctuations are Gaussian. Unfortunately, none of our assumptions so far entitle us to assume that. For independent data, we get Gaussian fluctuations of averages via the central limit theorem, and these results, too, can be extended to dependent data. But the assumptions needed for dependent central limit theorems are much stronger than merely a finite correlation time. What needs to happen, roughly speaking, is that if I take (nearly) arbitrary functions f and g, the correlation between \( f(X_t) \) and \( g(X_{t+h}) \) must go to zero as h grows. (This idea is quantified as "mixing" or "weak dependence".)
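To illustrate the kind of decay the mixing conditions demand (and only to illustrate; this proves nothing), here is a sketch which estimates the correlation between one arbitrary pair of functions of the AR(1) stand-in at increasing separations.

```python
import numpy as np

rng = np.random.default_rng(5)

# Estimate the correlation between fairly arbitrary functions of the process
# at separated times, and watch it die off as the separation h grows, using
# the AR(1) stand-in and one arbitrary pair of functions f and g.
a, reps = 0.6, 200_000
f = lambda x: (x > 1).astype(float)      # arbitrary choice of f
g = lambda x: x**2                       # arbitrary choice of g

x0 = rng.standard_normal(reps) / np.sqrt(1 - a**2)    # X_t, started stationary
for h in (1, 2, 5, 10, 20):
    x = x0.copy()
    for _ in range(h):
        x = a * x + rng.standard_normal(reps)          # evolve forward to X_{t+h}
    corr = np.corrcoef(f(x0), g(x))[0, 1]
    print(f"h = {h:2d}   corr(f(X_t), g(X_{{t+h}})) = {corr:+.4f}")
```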

However, even without the Gaussian assumption, we can put some bounds on deviation probabilities by bounding the variance (as we have) and using Chebyshev's inequality: \[ \mathrm{Pr}\left(|A_n - m| > \epsilon\right) \leq \frac{\mathrm{Var}[A_n]}{\epsilon^2} \leq \frac{v}{\epsilon^2} \frac{2T+1}{n} \] which goes to zero as n grows. So we have just proved convergence "in mean square" and "in probability" of time averages on their stationary expectation values, i.e., the mean square and weak ergodic theorems, under the assumptions that the data are weakly stationary and the correlation time is finite. There were a couple of steps in our argument where we used not-very-tight inequalities, and it turns out we can weaken the assumption of finite correlation time. The necessary and sufficient condition for the mean-square ergodic theorem turns out to be that, as one might hope, \[ \lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0 \] though I don't know of any way of proving it rigorously without using Fourier analysis (which is linked to the autocovariance via the Wiener-Khinchin theorem; see chapters 19 and 21 of Almost None of the Theory of Stochastic Processes).
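Here is a sketch comparing the Chebyshev bound to the empirical deviation probability for the AR(1) stand-in; the bound is loose, as Chebyshev bounds usually are, but both shrink like \( 1/n \). All the numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Compare the empirical deviation probability Pr(|A_n - m| > eps) with the
# Chebyshev bound (v/eps^2)*(2T+1)/n, using the AR(1) stand-in again
# (m = 0, v = 1/(1-a^2), T = a/(1-a)); a, n, reps and eps are illustrative.
a, n, reps, eps = 0.6, 500, 10_000, 0.2
v, T = 1 / (1 - a**2), a / (1 - a)

x = rng.standard_normal(reps) / np.sqrt(1 - a**2)   # X_1 for every replication
running_sum = x.copy()
for t in range(1, n):
    x = a * x + rng.standard_normal(reps)
    running_sum += x
A_n = running_sum / n

empirical = np.mean(np.abs(A_n) > eps)
chebyshev = (v / eps**2) * (2 * T + 1) / n
print("empirical deviation probability:", empirical)
print("Chebyshev bound                :", chebyshev)
```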

Reverting to the case of finite correlation time T, observe that we have the same variance from n dependent samples as we would from n/(1+2T) independent ones. One way to think of this is that the dependence shrinks the effective sample size by a factor of 2T+1. Another, which is related to the name "correlation time", is to imagine dividing the time series up into blocks of that length, i.e., a central point and its T neighbors in either direction, and use only the central points in our averages. Those are, in a sense, effectively uncorrelated. Non-trivial correlations extend about T time-steps in either direction. Knowing T can be very important in figuring out how much actual information is contained in your data set.
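A small check of this effective-sample-size reading for the AR(1) stand-in: compare the exact variance of the time average from n dependent observations to the variance we would get from n/(2T+1) independent ones.

```python
import numpy as np

# Effective sample size: n dependent observations behave, for estimating the
# mean, like roughly n/(2T+1) independent ones.  Checked against the exact
# Var[A_n] for the AR(1) stand-in (v = 1/(1-a^2), c_h = v*a^h, T = a/(1-a)).
a, n = 0.6, 1000
v, T = 1 / (1 - a**2), a / (1 - a)

h = np.arange(1, n)
exact_var = v / n + (2 / n**2) * np.sum((n - h) * v * a**h)
n_eff = n / (2 * T + 1)
print("exact Var[A_n]              :", exact_var)
print("v / (effective sample size) :", v / n_eff)   # close for large n
```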

To give an illustration not entirely at random, quantitative macroeconomic modeling is usually based on official statistics, like GDP, which come out quarterly. For the US, which is the main but not exclusive focus of these efforts, the data effectively start in 1947, as what national income accounts exist before then are generally thought too noisy to use. Taking the GDP growth rate series from 1947 to the beginning of 2010, 252 quarters in all, and de-trending it, I calculate a correlation time of just over ten quarters. (This grants the economists their usual, but absurd, assumption that economic fluctuations are stationary.) So macroeconomic modelers effectively have 11 or 12 independent data points to argue over.
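For readers who want to try a calculation in this spirit, here is a rough sketch of how it might go. This is emphatically not the code or data pipeline behind the numbers above; the file name is hypothetical, and the de-trending method and lag cutoff are placeholder choices which genuinely affect the answer.

```python
import numpy as np

# Rough sketch of estimating a correlation time from quarterly growth rates.
# Assumes a plain text file "gdp_growth.txt" (hypothetical name) with one
# quarterly growth rate per line; the de-trending and the lag cutoff below
# are arbitrary choices, not the ones used for the figures quoted above.
x = np.loadtxt("gdp_growth.txt")
n = len(x)

# de-trend: here, remove a fitted linear trend (one of several possibilities)
t = np.arange(n)
x = x - np.polyval(np.polyfit(t, x, 1), t)

v = x.var()
max_lag = 40                                       # arbitrary cutoff
c = np.array([np.mean((x[:n - h] - x.mean()) * (x[h:] - x.mean()))
              for h in range(1, max_lag + 1)])     # sample autocovariances
T_hat = np.sum(np.abs(c)) / v                      # estimated correlation time
print("estimated correlation time:", T_hat, "quarters")
print("effective sample size     :", n / (2 * T_hat + 1))
```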

Constructively, the idea of nearly-independent stretches of data leads to the mathematical trick of "blocking". To extend a result about independent random sequences to dependent ones, divide the dependent sequence up into contiguous blocks, but with gaps between them, long enough that the blocks are nearly independent of each other. One then has the IID result for the blocks, plus a correction which depends on how much residual dependence remains despite the gaps. Picking an appropriate combination of block length and spacing between blocks keeps the correction small, or at least controllable. This idea is used extensively in ergodic theory (including the simplest possible proof of the strong ergodic theorem) and information theory (see Almost None again), in proving convergence results for weakly dependent processes, in bootstrapping time series, and in statistical learning theory under dependence.
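A minimal sketch of the blocking construction itself; the block length and gap here are placeholders, and in applications they are tuned to the correlation structure and to the result being proved.

```python
import numpy as np

def blocks_with_gaps(x, block_len, gap):
    """Cut the sequence x into contiguous blocks of length block_len,
    discarding gap observations between consecutive blocks."""
    step = block_len + gap
    n_blocks = (len(x) - block_len) // step + 1
    return [x[i * step : i * step + block_len] for i in range(n_blocks)]

# Placeholder example: with a gap much longer than the correlation time, the
# blocks are nearly independent of one another, so IID results can be applied
# to them block by block, plus a correction for the residual dependence.
x = np.arange(30)
for b in blocks_with_gaps(x, block_len=5, gap=3):
    print(b)
```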

Manual trackback: An Ergodic Walk (fittingly enough); Thoughts on Economics

Update, 7 August: Fixed typos in equations.

Update, 20 January 2013: Finally thought to link to the follow-up.

Update, 19 March 2022: I've now referred to the argument often enough in teaching and even in papers that it felt worthwhile to turn it into a brief pedagogical paper (arxiv:2203.09085), which also handles the easy non-stationary extension, and some of the conditions under which time averages do not converge.

Enigmas of Chance

Posted at July 02, 2010 13:40 | permanent link

Three-Toed Sloth