## Exchangeable Random Sequences

*14 Nov 2017 10:53*

A sequence \( X \) of random variables, \( X_1, X_2, \ldots \)
is **exchangeable** if its distribution is invariant under
permutation of the indices. What implications does such symmetry have for
a stochastic process?

(To make this precise needs some notation. Start with a finite vector of
non-repeating indices \( I \), say (1, 30, 2, 47, 94980) for definiteness.
Then the collection of the \( X_i \) at those indices, in that order, \( X_I
\), has some distribution or other. [Formally speaking, the distribution of \(
X_I \) is obtained by applying a projection operator to the
infinite-dimensional distribution of \( X \).] Now pick some permutation \( s
\) of *all* the indices, not necessarily just those in \( I \). Write
\( sX \) for the sequence we get by permuting \( X \), so \( {(sX)}_j =
X_{s^{-1}(j)} \), and so on. For instance, *s* might swap each
odd-numbered index with the even-numbered index one larger, so 1 with 2, 30
with 29, 47 with 48, 94980 with 94979. We say that the sequence \( X \) is
exchangeable when \( X_I \) has the same distribution as \( {(sX)}_{I} \), for
all permutations \( s \) and all finite sets of indices \( I \). [The
Kolmogorov extension theorem then tells us that *X* and *sX* have the
same infinite-dimensional distribution.] In particular, we'd need \( X_{(1,
30, 2, 47, 94980)} \) to have the same distribution as \( X_{(2, 29, 1, 48, 94979)} \). In fact, it is easy to check
that two necessary conditions for *X* to be exchangeable are that (i) \(
X_I \) has the same distribution as \( X_J \) whenever \( I \) and \( J \) have
the same length; and (ii) for each \( n \), the distribution of \( X_{1:n} \)
is invariant under permutations of the first \( n \) indices. I leave it as an
exercise to the reader to determine whether the combination of these necessary
conditions is sufficient for exchangeability.)
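The index bookkeeping above is easy to get backwards, so here is a small sketch (mine, not from the original) of the rule \( {(sX)}_j = X_{s^{-1}(j)} \) for the odd/even-swapping permutation, with the sequence stood in for by a dictionary of values:

```python
def s(j):
    """The permutation swapping each odd (1-based) index with the even index one larger."""
    return j + 1 if j % 2 == 1 else j - 1

# s is its own inverse, so s^{-1} = s.

def permuted_at(X, I):
    """Return (sX)_I: the entries of the permuted sequence sX at the indices in I.

    By definition (sX)_j = X_{s^{-1}(j)}, and here s^{-1} = s."""
    return tuple(X[s(j)] for j in I)

X = {i: 10 * i for i in range(1, 100001)}  # a finite stand-in for the sequence
I = (1, 30, 2, 47, 94980)
print(permuted_at(X, I))  # the values of X at indices (2, 29, 1, 48, 94979)
```

Applying `s` to the indices in \( I \) recovers exactly the second index vector in the example above, (2, 29, 1, 48, 94979).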

It's easy to confirm that a sequence of independent, identically-distributed (IID) random variables is exchangeable. This at least shows that the concept is not self-contradictory. Is it, however, just another way of saying "IID"? The answer, interestingly enough, turns out to be "almost". The classical result, generally called "de Finetti's theorem" after the person who proved the first version, is that every infinite-dimensional exchangeable distribution \( m \) is a mixture of infinite-dimensional IID distributions:
\[
m = \int_{\mathrm{IID}}{\lambda dp(\lambda)}
\]
where \( p \) is some prior distribution over IID distributions \( \lambda \). In terms of random variables and sequences, the way to generate an infinite exchangeable sequence is to first randomly choose a distribution for \( X_1 \), and then keep drawing all later \( X_i \), independently, from that same distribution. Symbolically,
\[
\begin{eqnarray*}
\lambda & \sim & p\\
X_i|\lambda & \sim_{\mathrm{IID}} & \lambda
\end{eqnarray*}
\]
(It is important that the sequences be infinite here. A finite-dimensional exchangeable distribution is not necessarily a mixture of finite-dimensional IID distributions, though in a sense it must be close to such a mixture; see Diaconis and Freedman.)
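As a minimal sketch of this two-stage recipe (my own illustration, not from the original), take the prior \( p \) to be an equal-weight mixture over two Bernoulli biases, draw \( \lambda \) once, and then draw every \( X_i \) IID from that one \( \lambda \):

```python
import random

def exchangeable_sequence(n, biases=(0.75, 0.25), rng=random):
    """Generate X_1, ..., X_n by the de Finetti recipe:
    first draw lambda ~ p (here, uniform over the given biases),
    then draw each X_i | lambda, IID, as Bernoulli(lambda)."""
    lam = rng.choice(biases)                        # lambda ~ p
    return [1 if rng.random() < lam else 0          # X_i | lambda ~ IID Bernoulli(lambda)
            for _ in range(n)]

random.seed(0)
print(exchangeable_sequence(10))
```

The crucial point is that `lam` is drawn once, outside the loop: conditionally on \( \lambda \) the draws are IID, but marginally they are dependent, as the coin-flip example below works out in detail.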

A mixture of IID distributions is *not* IID. To see this, consider a
very simple example, where the \( X_i \) are binary, say
coin-tosses, with some bias towards heads. There is a well-defined
distribution where all the coin-tosses are independent, and the probability of
heads is 75% per toss. There is another where the coin tosses are independent,
and the probability of heads is only 25%. If we take an equally-weighted
mixture of these, we get an exchangeable sequence where the probability of
heads on any given toss is 50%. However, the coin tosses in the mixture are no
longer independent. If the first \( n \) tosses contain \( h \) heads and \( n
- h \) tails, then, by Bayes's rule, the odds that we initially chose the
head-biased distribution are not 1:1 but
\[
\frac{{n \choose h} {\left(\frac{3}{4}\right)}^h {\left(\frac{1}{4}\right)}^{n-h} }{{n \choose h} {\left(\frac{1}{4}\right)}^h {\left(\frac{3}{4}\right)}^{n-h}}
= \frac{3^h}{3^{n-h}} = 3^{2h -n}
\]
and so the probability that the \( {n+1}^{\mathrm{st}} \) coin toss will be heads is
\[
\frac{3}{4}\frac{3^{2h -n}}{1+3^{2h -n}} + \frac{1}{4}\frac{1}{1+3^{2h -n}} = \frac{3^{2h-n+1}+1}{4(3^{2h-n}+1)}
\]
which will not be 1/2 unless *h* = *n*/2.
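A quick numerical check of that predictive formula (my own sketch, not in the original), comparing the closed form against a direct application of Bayes's rule to the two mixture components:

```python
def predictive_heads(n, h):
    """P(heads on toss n+1 | h heads in first n tosses), via the closed form
    (3^{2h-n+1} + 1) / (4 (3^{2h-n} + 1))."""
    r = 3 ** (2 * h - n)          # posterior odds for the head-biased coin
    return (3 * r + 1) / (4 * (r + 1))

def predictive_direct(n, h):
    """Same quantity, computed directly from Bayes's rule over the two components."""
    like_hi = 0.75 ** h * 0.25 ** (n - h)   # likelihood under the 75%-heads coin
    like_lo = 0.25 ** h * 0.75 ** (n - h)   # likelihood under the 25%-heads coin
    w = like_hi / (like_hi + like_lo)       # posterior weight on the head-biased coin
    return 0.75 * w + 0.25 * (1 - w)

print(predictive_heads(10, 5))   # h = n/2: exactly 1/2
print(predictive_heads(10, 8))   # a surplus of heads pushes it above 1/2
```

Since the predictive probability moves with \( h \), the tosses are manifestly dependent under the mixture, even though each component distribution is IID.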

The de Finetti theorem has some important uses for Bayesian statistical inference — for instance, it helps us test prior distributions, i.e., check Bayesian models — but since I am not a Bayesian I don't find those all that compelling. Rather, I am interested in the mathematical extensions. In one direction, de Finetti's theorem is a special case of the decomposition of stationary processes into stationary-and-ergodic processes. (Or, even more generally, of asymptotically mean-stationary processes into AMS-and-ergodic processes.) In another, exchangeability has a natural extension to graphs and arrays, which is important for network data analysis, and where the extremal distributions are rather more interesting than just IID sequences.

See also: Characterizing Mixtures of Processes by Summarizing Statistics

- Recommended, big picture:
  - Bert Fristedt and Lawrence Gray, A Modern Approach to Probability Theory [Very good chapter on exchangeable sequences]
  - Olav Kallenberg, Probabilistic Symmetries and Invariance Principles [Extensive, even exhaustive, treatment of exchangeability and its extensions]
  - Steffen L. Lauritzen, "Exchangeability and de Finetti's Theorem" [PDF slides]

- Recommended, close-ups:
  - Patrizia Berti, Irene Crimaldi, Luca Pratelli, and Pietro Rigo, "Rate of convergence of predictive distributions for dependent data", Bernoulli **15** (2009): 1351--1367, arxiv:1001.2152 [See comments under Statistical Learning Theory with Dependent Data]
  - Patrizia Berti and Pietro Rigo, "A Glivenko-Cantelli theorem for exchangeable random variables", Statistics and Probability Letters **32** (1997): 385--391 [See comments under Statistical Learning Theory with Dependent Data]
  - Persi Diaconis and David Freedman, "Finite Exchangeable Sequences", Annals of Probability (1980): 745--764
  - E. Hewitt and L. J. Savage, "Symmetric measures on Cartesian products", Transactions of the American Mathematical Society **80** (1955): 470--501
  - Matthew O. Jackson, Ehud Kalai and Rann Smorodinsky, "Bayesian Representation of Stochastic Processes Under Learning: De Finetti Revisited", Econometrica **67** (1999): 875--893 [JSTOR]
  - Vladimir Pestov, "Predictive PAC learnability: a paradigm for learning from exchangeable input data", arxiv:1006.1129 [See comments under Statistical Learning Theory with Dependent Data]

- Modesty forbids me to recommend:
  - CRS and Aryeh Kontorovich, "Predictive PAC Learning and Process Decompositions", NIPS 2013, arxiv:1309.4859

- To read:
  - Tim Austin, Dmitry Panchenko, "A hierarchical version of the de Finetti and Aldous-Hoover representations", arxiv:1301.1259