Ergodic Theory

27 Feb 2017 16:30

A measure on a mathematical space is a way of assigning weights to different parts of the space; volume is a measure on ordinary three-dimensional Euclidean space. Probability distributions are measures, such that the largest measure of any set is 1 (and some other restrictions). Suppose we're interested in a dynamical system --- a transformation that maps a space into itself. The set of points we get from applying the transformation repeatedly to a point is called its trajectory or orbit. Some dynamical systems are "measure preserving", meaning that the measure of a set is always the same as the measure of the set of points which map to it. (In symbols, using \( T \) for the map and \( P \) for the probability measure, \( P(T^{-1}(A)) = P(A) \) for any measureable set \( A \).) Some sets may be invariant: they are the same as their images. An ergodic dynamical system is one in which, with respect to some probability distribution, all invariant sets either have measure 0 or measure 1. (Sometimes non-ergodic systems can be decomposed into a number of components, each of which is separately ergodic.) The dynamics need not be deterministic; in particular, irreducible Markov chains with finite state spaces are ergodic processes, since they have a unique invariant distribution over the states. (In the Markov chain case, each of the ergodic components corresponds to an irreducible sub-space.)

Ergodicity is important because of the following theorem (due to von Neumann, and then improved substantially by Birkhoff, in the 1930s). If we take any well-behaved (integrable) function of our space, pick a point in the space at random (according to the ergodic distribution) and calculate the average of the function along the point's orbit, the time-average. Then, with probability 1, in the limit as the time goes to infinity, (1) the time-average converges to a limit and (2) that limit is equal to the weighted average of the value of the function at all points in the space (with the weights given by the same distribution), the space-average. The orbit of almost any point you please will in some sense look like the whole of the state space.

(Symbolically, write \( x \) for a point in the state space, \( f \) for the function we're averaging, and \( T \) and \( P \) for the map and the probability measure as before. The space-average, \( \overline{f} = \int{f(x)P(dx)} \). The time-average starting from \( x \), \( {\langle f\rangle}_x = \lim_{n\rightarrow\infty}{(1/n) \sum_{i=0}^{n}{f(T^i(x))}} \). The ergodic theorem asserts that if \( f \) is integrable and \( T \) is ergodic with respect to \( P \), then \( {\langle f \rangle}_x \) exists, and \( P\left\{x : {\langle f \rangle}_x = \overline{f} \right\} = 1 \). --- A similar result holds for continuous-time dynamical systems, where we replace the summation in the time average with an integral.)

This is an extremely important property for statistical mechanics. In fact, the founder of statistical mechanics, Ludwig Boltzmann, coined "ergodic" as the name for a stronger but related property: starting from a random point in state space, orbits will typically pass through every point in state space. It is easy to show (with set theory) that this isn't doable, so people appealled to a weaker property which was for a time known as "quasi-ergodicity": a typical trajectory will pass arbitrarily close to every point in phase space. Finally it became clear that only the modern ergodic property is needed. To see the relation, consider the function, call it \( I_A \), which is 1 on a certain set \( A \), and 0 elsewhere. (This is the "indicator function" of \( A \).) The time-average of \( I_A \) is the fraction of time that the orbit spends in \( A \). The space-average of \( I_A \) is the probability that a randomly picked point is in \( A \). Since the two averages are almost always equal, almost all trajectories end up covering the state space in the same way.

One way of thinking about the classical ergodic theorem is that it's a version of the law of large numbers --- it tells us that a sufficiently large sample (i.e., an average over a long time) is representative of the whole population (i.e., the space average). One thing I'd like to know more about than I do is ergodic equivalents of the central limit theorem, which say how big the sampling fluctuations are, and how they're distributed. The other thing I want to know about is the rate of convergence in the ergodic theorem --- how long must I wait before my time average is within a certain margin of probable error of the state average. Here I do know a bit more of the relevant literature, from large deviations theory.

Again in symbols: Let's write \( {\langle f\rangle}_{x,n} \) for the time-average of \( f \), starting from \( x \), taken over \( n \) steps. Then a central-limit theorem result would say that (for example) \( \frac{{\langle f\rangle}_{X,n} - \overline{f}}{\sigma^{2}r(n)} \) converges in distribution to a Gaussian with mean zero and variance one, where \( \sigma^2 \) is the (space-averaged) variance of \( f \) and \( r(n) \) is some positive, increasing function of \( n \). This would be weak convergence of the time averages to the space averages, and \( r(n) \) would give the rate. (In the usual IID case, \( r(n) = \sqrt{n} \).) Somewhat stronger would be a convegence in probability result, giving us a function \( N(\varepsilon,\delta) \) such that \( P\left\{x : \left|{\langle f\rangle}_{x,n} - \overline{f}\right| \geq \varepsilon \right\} \leq \delta \) if \( n \geq N(\varepsilon,\delta) \). Proving many of these results requires stronger assumptions than proving ergodicity does --- for instance, mixing properties.

These issues are part of a more general question about how to do statistical inference for stochastic processes, a.k.a. time-series analysis. I am especially interested in statistical learning theory in this setting, which is in part about ensuring that the ergodic theorem holds uniformly across classes of functions. Very strong results have recently been achieved on that front by Adams and Nobel (links below).

Another thing I'd like to understand, but don't have time to explain here, are Pinsker sigma-algebras.

See also: Deviation Inequalities; Dynamical Systems and Chaos; Empirical Process Theory; Information Theory; Mixing and Weak Dependence of Stochastic Processes; Nonequilibrium Statistical Mechanics; Probability Theory; Recurrence Times of Stochastic Processes; Stochastic Processes; Symbolic Dynamics; Time Series; Universal Prediction Algorithms

Updated 29 October 2007; thanks to "tushar" for pointing out an embarrassing think-o in the first paragraph.

Previous versions: 2007-10-29 9:40; 2005-11-15 16:18; first version written c. 1997 (?)