Some Bad (Heuristic) Combinatorial Large Deviations Arguments
Last update: 28 Jan 2026 13:42
First version: 25 January 2026
Attention conservation notice: Re-worked teaching material, in need of better notation.\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P} \left( #1 \right)} \newcommand{\InvDist}{v} \newcommand{\EmpDist}{\widehat{P}_n} \newcommand{\EmpDistIID}{\EmpDist} \newcommand{\EmpDistMarkov}{\EmpDist} \newcommand{\EmpDistCCC}{\EmpDist} \newcommand{\Transition}{\mathbf{p}} \] These are heuristic arguments for some important results about convergence to, and deviations from, the limit in various laws of large numbers, i.e., large deviations principles. In calling them "heuristic", I mean that they point in the right direction, and are easy to follow, while acknowledging that they are, strictly speaking, fallacious. I find them much more pedagogically useful than opening with the rigorous arguments. I've given versions of these multiple times over the decades, including even in these notebooks, but I find it useful, for reasons I do not want to go into here, to collect them in one place.
In every case, I am interested in the convergence of an empirical distribution, based on some sample \( X_1, X_2, \ldots X_n \), generated from some stochastic process \( \mu \). The various empirical distributions converge on the corresponding true distributions (according to \( \mu \) ), and the probability of samples that look like any fixed alternative distribution shrinks exponentially fast in \( n \). I will go through a series of increasingly complicated examples.
Throughout, I will assume that there are only a finite number \( k \) of possible values for the \( X_i \).
- IID Samples (Sanov's Theorem)
- Markov Chains
- Chains with Complete Connections / Stochastic Deterministic Finite Automata
- Patching Up the Holes In These Arguments
IID Samples (Sanov's Theorem)
The simplest and clearest situation is when the generating process \( \mu \) is just independent and identically distributed random variables, so each \( X_i \) is distributed according to (say) \( p \), which we can represent as a vector \( (p_1, p_2, \ldots p_k) \), with \( p_i \geq 0 \), \( \sum_{i=1}^{k}{p_i} = 1 \). We are interested in the convergence of the one-dimensional empirical distribution, \[ \EmpDistIID(i) \equiv \frac{1}{n}\sum_{t=1}^{n}{\delta_i(X_t)} \] What is the probability that \( \EmpDistIID \) takes any particular value, say \( q \)? The probability of the sequence \( X_1, X_2, \ldots X_n \) is \[ \prod_{t=1}^{n}{p_{X_t}} \] So the probability of any one sequence yielding \( \EmpDistIID = q \), a sequence of type \( q \), is \[ \prod_{i=1}^{k}{p_i^{n q_i}} \] The probability of all such sequences, taken together, is \[ \Prob{\EmpDistIID = q} = c_{IID}(n, q) \prod_{i=1}^{k}{p_i^{n q_i}} \] where \( c_{IID}(n, q) \) counts the number of length-\( n \) sequences of type \( q \). A little thought about combinatorics tells us that \[ c_{IID}(n,q) = \frac{n!}{\prod_{i=1}^{k}{(nq_i)!}} \]
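For a tiny case, we can check the formula \( \Prob{\EmpDistIID = q} = c_{IID}(n,q) \prod_i p_i^{n q_i} \) against brute-force enumeration of all sequences. This is an illustrative sketch; the distribution, sample size, and type below are arbitrary choices, not from the text.

```python
# Brute-force check of the type-probability formula for IID samples.
from itertools import product
from math import factorial, prod, isclose

p = (0.3, 0.7)      # true distribution over k = 2 outcomes (illustrative)
n = 6
counts = (2, 4)     # target type q = (2/6, 4/6); n*q must be integers

# Exact probability by enumerating all k^n sequences of this type
brute = sum(
    prod(p[x] for x in seq)
    for seq in product(range(len(p)), repeat=n)
    if tuple(seq.count(i) for i in range(len(p))) == counts
)

# Formula: c_IID(n, q) * prod_i p_i^{n q_i}
c_iid = factorial(n) // prod(factorial(c) for c in counts)
formula = c_iid * prod(p[i] ** counts[i] for i in range(len(p)))

assert isclose(brute, formula)
```

Here \( c_{IID} = 6!/(2!\,4!) = 15 \), and both computations agree exactly (up to floating-point rounding).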
Now we decide that we are only interested in the exponential growth (or decay) of probabilities, so what we really want is \[ \lim_{n\rightarrow\infty}{\frac{1}{n}\log{\Prob{\EmpDistIID = q}}} \]
Let's evaluate this, using Stirling's approximation to \( n! \): \begin{eqnarray} \frac{1}{n}\log{\Prob{\EmpDistIID = q}} & = & \frac{1}{n}\left[ \log{(n!)} - \sum_{i=1}^{k}{\log{((nq_i)!)}} + \sum_{i=1}^{k}{(nq_i)\log{p_i}} \right] \\ & \approx & \frac{1}{n}\left[ n\log{n} - \sum_{i=1}^{k}{n q_i (\log{n} + \log{q_i})} + n\sum_{i=1}^{k}{q_i \log{p_i}} \right] \\ & = & - \sum_{i=1}^{k}{q_i \log{\frac{q_i}{p_i}}} \\ & = & - D(q, p) \end{eqnarray} introducing \( D(q, p) \) for the Kullback-Leibler divergence of \( q \) from \( p \). It is not hard to show, though I will not do so here, that \( D(q, p) \geq 0 \), with equality if and only if \( q=p \), and that \( D(q, p) \neq D(p, q) \) (in general).
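The stated properties of the divergence are easy to confirm numerically. A minimal sketch, with arbitrary illustrative distributions:

```python
# Numerical illustration of the stated properties of D(q, p):
# nonnegativity, zero iff q = p, and asymmetry.
from math import log

def kl(q, p):
    """Kullback-Leibler divergence D(q, p) = sum_i q_i log(q_i / p_i)."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = (0.3, 0.7)
q = (0.5, 0.5)

assert kl(q, p) > 0          # positive when q != p
assert kl(p, p) == 0         # exactly zero when q = p
assert kl(q, p) != kl(p, q)  # not symmetric in general
```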
We thus have the limit we sought: \[ \lim_{n\rightarrow\infty}{\frac{1}{n}\log{\Prob{\EmpDistIID = q}}} = - D(q,p) \] In words: when IID data come from the distribution \( p \), the probability of a sample that looks like any particular alternative \( q \) goes to zero exponentially fast, but only exponentially fast, the exponential rate being given by the KL divergence.
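We can watch this convergence happen by computing the log-probability exactly (via `lgamma`, i.e., log-factorials, rather than Stirling's approximation) and comparing to \( -D(q,p) \). The particular \( p \), \( q \), and sample size below are illustrative choices.

```python
# Check that (1/n) log P(EmpDist = q) approaches -D(q, p), using exact
# log-probabilities via lgamma rather than Stirling's approximation.
from math import lgamma, log

p = (0.25, 0.75)   # true distribution (illustrative)
q = (0.5, 0.5)     # alternative type; n*q must be integers below

def log_prob_type(n):
    """Exact log P(EmpDist = q) when n*q_i are all integers."""
    counts = [n * qi for qi in q]
    assert all(c == int(c) for c in counts)
    log_c = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_c + sum(c * log(pi) for c, pi in zip(counts, p))

d = sum(qi * log(qi / pi) for qi, pi in zip(q, p))
rate = log_prob_type(4000) / 4000

assert abs(rate + d) < 0.01   # rate ~ -D(q, p); error is O(log(n)/n)
```

The gap between the finite-\( n \) rate and \( -D(q,p) \) shrinks like \( \log{n}/n \), which is exactly the Stirling correction we threw away.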
Markov Chains
Now \( X_0, X_1, \ldots X_n \) are generated from a stationary Markov chain, with transition matrix \( \Transition \), and invariant distribution over states \( \InvDist \). (The extra step at the start is to avoid having some annoying \( n-1 \)s instead of \( n \)s in later formulas.) The distribution over pairs \( (X_t, X_{t+1}) \) is \( \InvDist \otimes \Transition \), which should be the limit of \[ \EmpDistMarkov(x,y) \equiv \frac{1}{n}\sum_{t=1}^{n}{\delta_{x,y}(X_{t-1}, X_{t})} \]
The argument is very similar. Conditioned on the starting state \( X_0=x_0 \), the probability of a particular sequence is \[ \prod_{t=1}^{n}{\Transition_{X_{t-1} X_{t}}} \]
The probability of any one sequence leading to an empirical pair distribution of type \( q \) is thus \[ \prod_{i,j}{\Transition_{ij}^{n q_{ij}}} \]
The probability of all sequences of type \( q \) is \[ \Prob{\EmpDistMarkov = q|X_0=x_0} = c_{Markov}(n, q) \prod_{i,j}{\Transition_{ij}^{n q_{ij}}} \] so we need to count the number of length-\( n \) sequences of type \( q \).
A very naive idea would be that there are \( n \) positions and \( nq_{ij} \) transitions from state \( i \) to state \( j \), so the number of sequences with given transition counts would be \[ \frac{n!}{\prod_{i,j}{(nq_{ij})!}} \] However, this has to over-count, because if (say) position 5 is a transition from state \( i \) to state \( j \), then position 6 must be a transition from \( j \) to some state. So we are over-counting by a factor that reflects how many ways there are to arrange starting states over the sequence. Let's define \( q_i \equiv \sum_{j}{q_{ij}} \). Then the number of ways to arrange the starting states is \[ \frac{n!}{\prod_{i}{(nq_{i})!}} \] and \[ c_{Markov}(n, q) = \frac{\prod_{i}{(nq_{i})!}}{\prod_{i,j}{(nq_{ij})!}} \]
(For the sake of conscience, I will add that this is not quite right, but the actual formula, which involves counting Eulerian tours in directed graphs, is much more complicated, yet the exponential growth rate is exactly the same. So this is good enough.)
Now we take logs, divide by \( n \), and use Stirling: \begin{eqnarray} \frac{1}{n}\log{\Prob{\EmpDistMarkov = q|X_0=x_0}} & \approx & \frac{1}{n}\left[\sum_{i}{n q_i (\log{n} + \log{q_i})} - \sum_{ij}{n q_{ij}(\log{n} + \log{q_{ij}})} + \sum_{i, j}{n q_{ij}\log{\Transition_{ij}}} \right]\\ & = & \sum_{i}{q_i \log{q_i}} - \sum_{ij}{q_{ij}\log{q_{ij}}} + \sum_{ij}{q_{ij}\log{\Transition_{ij}}}\\ & = & \sum_{i}{q_i \log{q_i}} - \sum_{ij}{q_{ij}\log{\frac{q_{ij}}{\Transition_{ij}}}} \end{eqnarray} Now this is correct but not (necessarily) the most illuminating form we could give. Define \( q_{i\rightarrow j} \) as \( q_{ij}/q_i \); this is, so to speak, what the joint distribution \( q \) thinks the transition distribution is, and corresponds directly to \( \Transition_{ij} \). So \( q_{ij} = q_i q_{i \rightarrow j} \), and \begin{eqnarray} \sum_{i}{q_i \log{q_i}} - \sum_{ij}{q_{ij}\log{\frac{q_{ij}}{\Transition_{ij}}}} & = & -\sum_{ij}{q_{ij} \log{\frac{q_{ij}}{\Transition_{ij} q_i}}}\\ & = & -\sum_{ij}{q_{ij} \log{\frac{q_{i \rightarrow j}}{\Transition_{ij}}}}\\ & = & -\sum_{i}{q_i \sum_{j}{q_{i \rightarrow j}\log{\frac{q_{i \rightarrow j}}{\Transition_{ij}}}}} \end{eqnarray} The inner sum, over \( j \), is the KL divergence of what \( q \) thinks the transition distribution out of state \( i \) is from the real transition distribution of that state. We then take a weighted average of those KL divergences, with weights \( q_i \).
(Alternatively, we go back to the first line of the math above, and define \( q^{(1)} \otimes \Transition \) to be the joint distribution we get by combining the distribution over starting states implied by \( q \), namely \( q_i \), with the true transition probabilities. Then the expression above is just \( -D(q, q^{(1)} \otimes \Transition ) \). People use this form more often in the literature, but I find it less comprehensible than what I wrote above.)
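The algebraic identity behind the weighted-KL form can be verified numerically. A minimal sketch; the transition matrix and pair distribution below are arbitrary illustrative choices.

```python
# Numerical check that the two forms of the Markov rate agree:
#   sum_i q_i log q_i - sum_ij q_ij log(q_ij / p_ij)
#     = -sum_i q_i sum_j q_{i->j} log(q_{i->j} / p_ij)
from math import log, isclose

P = [[0.9, 0.1],    # transition matrix p_ij (illustrative)
     [0.4, 0.6]]
Q = [[0.2, 0.2],    # candidate pair distribution q_ij (sums to 1)
     [0.2, 0.4]]

qi = [sum(row) for row in Q]   # marginal q_i = sum_j q_ij

form1 = (sum(q * log(q) for q in qi)
         - sum(Q[i][j] * log(Q[i][j] / P[i][j])
               for i in range(2) for j in range(2)))

form2 = -sum(qi[i] * sum((Q[i][j] / qi[i]) * log((Q[i][j] / qi[i]) / P[i][j])
                         for j in range(2))
             for i in range(2))

assert isclose(form1, form2)
assert form2 <= 0   # minus a weighted average of KL divergences
```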
All of the above is conditional on a starting state \( x_0 \). What about the unconditional probability? Well, \[ \Prob{\EmpDistMarkov = q} = \sum_{x_0}{\Prob{\EmpDistMarkov=q|X_0=x_0} \Prob{X_0=x_0}} \] Since each of the conditional probabilities is going to zero exponentially in \( n \), their sum will be dominated by the conditional probability with the slowest decay rate. But, surprise, they all have the same decay rate. So \[ \frac{1}{n}\log{\Prob{\EmpDistMarkov=q}} \rightarrow -\sum_{i}{q_i \sum_{j}{q_{i \rightarrow j}\log{\frac{q_{i \rightarrow j}}{\Transition_{ij}}}}} \]
Chains with Complete Connections / Stochastic Deterministic Finite Automata
There's a set of states, say \( 1 \) to \( l \), which we do not observe. At time \( t \) the process is in state \( S_t \). What we observe then is \( X_t \), a random function of \( S_t \). (Or a non-random function of \( S_t \) and IID noise.) What makes transitions "deterministic" is that \( S_{t+1}=\tau(S_t, X_t) \) for some transition function \( \tau \). If you want a picture, draw a graph with nodes for each hidden state, and arrows between them labeled by possible values of \( X \). (Weight the arrows by probabilities.) Notice that, given a starting state \( S_1=s_1 \) and a sequence of observations \( X_1, \ldots X_{n-1} \), all the subsequent states are fixed recursively: \( s_2 = \tau(s_1, X_1) \), \( s_3 = \tau(s_2, X_2) \) and so on. We are interested in the joint distribution of states and observations, \[ \EmpDistCCC(s,x) = \frac{1}{n}\sum_{t=1}^{n}{\delta_{s,x}(S_t, X_t)} \] Let's write \( p_{sx} \) for the probability of observing \( x \) conditional on being in state \( s \). And, again, let's write \( q \) for the joint distribution over states and observations we're interested in. Then the probability of any one sequence \( X_{1:n} = x_{1:n} \) of type \( q \), given the starting state is \( S_1=s_1 \), is \[ \prod_{t=1}^{n}{p_{s_t x_t}} \] where \( s_t \) is defined recursively as above. Re-arranging the multiplication so it's over state-observation pairs, instead of over time-steps, \[ \prod_{s, x}{p_{sx}^{n q_{sx}}} \]
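A small simulation makes the setup concrete: with a deterministic \( \tau \), the observation sequence pins down the state sequence, and the per-time-step product of emission probabilities equals the rearranged product over state-observation pairs. All names here (the two-state automaton, its emission table) are illustrative inventions.

```python
# Sketch of a stochastic deterministic automaton: next hidden state is
# tau(current state, current observation). Checks that the product over
# time-steps equals the rearranged product over (state, observation) pairs.
import random
from math import prod, isclose

tau = {(0, 'a'): 0, (0, 'b'): 1,   # deterministic transition function
       (1, 'a'): 0, (1, 'b'): 1}
emit = {0: {'a': 0.8, 'b': 0.2},   # p_{sx}: P(X = x | S = s)
        1: {'a': 0.3, 'b': 0.7}}

random.seed(0)
n, s = 50, 0
states, obs = [], []
for _ in range(n):
    x = random.choices(['a', 'b'], weights=list(emit[s].values()))[0]
    states.append(s)
    obs.append(x)
    s = tau[(s, x)]                 # next state fixed by (state, obs)

# Product over time-steps
by_time = prod(emit[st][xt] for st, xt in zip(states, obs))

# Product over state-observation pairs, using the pair counts n*q_{sx}
counts = {}
for st, xt in zip(states, obs):
    counts[(st, xt)] = counts.get((st, xt), 0) + 1
by_pairs = prod(emit[st][xt] ** c for (st, xt), c in counts.items())

assert isclose(by_time, by_pairs)
```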
Now we're off to the races as before: \[ \Prob{\EmpDistCCC = q} = c_{CCC}(n, q) \prod_{s, x}{p_{sx}^{n q_{sx}}} \]
Again, as in the Markov case, a naive idea for the combinatorial factor would be the number of ways of arranging the necessary state-observation transitions over \( n \) positions, \( c_{CCC} = \frac{n!}{\prod_{sx}{(n q_{sx})!}} \). But, again, as in the Markov case, this over-counts by a lot: \( s_t \) and \( x_t \) together fix \( s_{t+1} \), so we are over-counting by the number of ways of arranging the starting states of the transitions over the sequence, namely \( \frac{n!}{\prod_{s}{(n q_{s})!}} \). Thus we really have \[ c_{CCC}(n, q) = \frac{\prod_{s}{(nq_{s})!}}{\prod_{sx}{(nq_{sx})!}} \] defining \( q_s \equiv \sum_{x}{q_{sx}} \).
Plugging \( c_{CCC} \) in, taking the log and dividing by \( n \), \begin{eqnarray} \frac{1}{n}\log{\Prob{\EmpDistCCC = q}} & = & \frac{1}{n}\left( \sum_{s}{\log{(n q_{s})!}} - \sum_{sx}{\log{(n q_{sx})!}} + \sum_{sx}{ n q_{sx}\log{p_{sx}}} \right)\\ & \approx & \sum_{s}{ q_s (\log{n}+\log{q_s})} - \sum_{sx}{q_{sx}(\log{n} + \log{q_{sx}})} + \sum_{sx}{q_{sx}\log{p_{sx}}}\\ & = & \sum_{s}{q_s \log{q_s}} - \sum_{sx}{q_{sx}\log{\frac{q_{sx}}{p_{sx}}}}\\ & = & -\sum_{sx}{q_{sx}\log{\frac{q_{sx}/q_s}{p_{sx}}}}\\ & = & -\sum_{s}{q_s \sum_{x}{q_{s\rightarrow x}\log{\frac{q_{s\rightarrow x}}{p_{sx}}}}} \end{eqnarray} defining \( q_{s\rightarrow x} \equiv q_{sx}/q_s \) as the emission probability implied by \( q \).
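As a sanity check on the final form, the rate (as a weighted average of KL divergences over states) should be zero exactly when \( q \)'s implied emission probabilities \( q_{s\rightarrow x} \) match \( p_{sx} \), and positive otherwise. The emission table and candidate \( q \)s below are illustrative.

```python
# The automaton rate as a weighted average of KL divergences over states:
# zero when q's implied emissions match p_{sx}, positive otherwise.
from math import log

emit = {0: {'a': 0.8, 'b': 0.2},   # true emission probabilities p_{sx}
        1: {'a': 0.3, 'b': 0.7}}

def rate(q):
    """sum_s q_s * D(q_{s->.}, p_{s.}), with q a dict of dicts q[s][x]."""
    total = 0.0
    for s, row in q.items():
        qs = sum(row.values())
        for x, qsx in row.items():
            total += qsx * log((qsx / qs) / emit[s][x])
    return total

# q whose conditional emissions agree with emit: rate is (numerically) zero
q_true = {0: {'a': 0.6 * 0.8, 'b': 0.6 * 0.2},
          1: {'a': 0.4 * 0.3, 'b': 0.4 * 0.7}}
# q with distorted emissions in state 0: rate is strictly positive
q_alt = {0: {'a': 0.6 * 0.5, 'b': 0.6 * 0.5},
         1: {'a': 0.4 * 0.3, 'b': 0.4 * 0.7}}

assert abs(rate(q_true)) < 1e-9
assert rate(q_alt) > 0
```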
(Notice, by the way, that if we identify the states and the observables, we have a Markov chain, and the calculation we just did gives us back our Markov chain result. So if there's a mistake, it's at least the same mistake in both places.)
Patching Up the Holes In These Arguments
Real large deviations principles are not about the probability of getting some particular \( q \) exactly, but about the probability of falling into some set of outcomes \( B \) (and holding \( B \) fixed as \( n \) grows). Specifically, we want something like (*): \[ \lim_{n\rightarrow\infty}{\frac{1}{n}\log{\Prob{\EmpDist \in B}}} = -\inf_{q \in B}{J(q)} \] where \( J \) is some sort of rate function. We have just seen that, in these problems, the rate function should be the KL divergence from the true distribution \( p \), or some sort of weighted average of KL divergences. People usually prove large deviations results by getting both upper and lower bounds on the probability in question, and then showing that the two bounds go to zero at the same exponential rate. I'll sketch how to make that work here.
Now at any sample size \( n \), there are only a finite set of possible empirical distributions, let's say \( E_n \). The probability, at sample size \( n \), of falling into the set \( B \) is really the probability of falling on one of the points in \( B \cap E_n \), so \[ \Prob{\EmpDistIID \in B} = \sum_{q \in B \cap E_n}{\Prob{\EmpDistIID = q}} \] This gives us the clue as to how to get both the lower and the upper bound.
Upper bound: There is one of the \( q \) in \( B \cap E_n \) which has the highest probability, let's say \( q^*_n \). So \[ \Prob{\EmpDistIID \in B} \leq \# |B \cap E_n| \Prob{\EmpDistIID = q^*_n} \] writing \( \# |A| \) for the number of points in the set \( A \). (That is, we use a union bound: probability of the most probable point times the number of points.) In fact, we can even say \[ \Prob{\EmpDistIID \in B} \leq \# |E_n| \Prob{\EmpDistIID = q^*_n} \] though this will of course be an even looser bound. We're willing to tolerate such a loose bound because we're only interested in the exponential decay rate: \[ \frac{1}{n}\log{\Prob{\EmpDistIID \in B}} \leq \frac{\log{\# |E_n|}}{n} + \frac{1}{n}\log{\Prob{\EmpDistIID = q^*_n}} \] The first term, counting the number of grid points, will vanish as \( n\rightarrow \infty \), because the number of grid points only grows polynomially. And, by our arguments above, the second term will be approaching \( -D(q^*_n,p) = -\min_{q \in B \cap E_n}{D(q,p)} \). Because the rate function is continuous in \( q \), the minimum over the grid is going to approach the infimum over the set.
(To see that \( \# |E_n| \) grows only polynomially in \( n \), remember that there is a finite number \( k \) of possible outcomes. Each coordinate \( q_i \) can take one of \( n+1 \) possible values (\( 0, 1/n, \ldots \frac{n-1}{n}, 1 \)), and there are \( k \) coordinates, so \( \# |E_n| \leq (n+1)^k \). --- This neglects the constraint that \( \sum_{i}{q_i} = 1 \), so it's yet another over-estimate, but that's fine here.)
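The grid-counting bound is easy to confirm by direct enumeration for small \( n \) and \( k \); a minimal sketch:

```python
# Enumerate the possible empirical distributions E_n for small n and k,
# confirming the polynomial bound |E_n| <= (n+1)^k.
from itertools import product

def grid_size(n, k):
    """Number of length-k count vectors summing to n, i.e. |E_n|."""
    return sum(1 for c in product(range(n + 1), repeat=k) if sum(c) == n)

for n in (4, 8, 16):
    for k in (2, 3):
        assert grid_size(n, k) <= (n + 1) ** k

# e.g. k = 3, n = 4: only 15 grid points, far below (4+1)^3 = 125
assert grid_size(4, 3) == 15
```

(The exact count is \( \binom{n+k-1}{k-1} \), still polynomial in \( n \) for fixed \( k \), but the crude bound is all the argument needs.)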
Lower bound: The probability of the whole set has to be at least the probability of the most probable point, \[ \Prob{\EmpDistIID \in B} \geq \Prob{\EmpDistIID = q^*_n} \] Now we take the log, divide by \( n \), and argue by continuity as in the upper bound.
(The argument by continuity will need some work if \( B \) is measurable but not particularly interval-like. Suffice it to say that if we both needed a soporific, I could do some actual real analysis here to patch things up.)
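For a concrete instance of the sandwich, take IID Bernoulli samples and let \( B \) be the set of distributions putting probability at least one-half on the outcome 1. The exact probability of \( \EmpDistIID \in B \) is a binomial tail sum, and its rate approaches \( -\inf_{q \in B}{D(q,p)} \), attained at the boundary. The particular \( p \) and \( n \) below are illustrative choices.

```python
# Sandwich check for a set B: for IID Bernoulli(0.3) samples, the event
# {empirical frequency of 1s >= 0.5} has probability decaying at rate
# inf_{q in B} D(q, p) = D((0.5, 0.5), (0.3, 0.7)). Exact binomial tail.
from math import lgamma, log, exp

p1 = 0.3
n = 3000   # even, so the boundary point q = (0.5, 0.5) is an exact type

def log_binom_pmf(m):
    """Exact log P(sum of n Bernoulli(p1) draws equals m)."""
    return (lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
            + m * log(p1) + (n - m) * log(1 - p1))

# log P(sum X_i >= n/2), via log-sum-exp for numerical stability
terms = [log_binom_pmf(m) for m in range(n // 2, n + 1)]
mx = max(terms)
log_tail = mx + log(sum(exp(t - mx) for t in terms))

d = 0.5 * log(0.5 / p1) + 0.5 * log(0.5 / (1 - p1))  # D(q*, p)
assert abs(log_tail / n + d) < 0.01
```

The whole tail is dominated by the single most probable grid point at the boundary, exactly as the union-bound argument says it should be.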
The same sort of argument will also work for Markov chains and for chains with complete connections / automata, though we need to do a little more work to show that the number of grid points grows only polynomially in \( n \).
*: More exactly, we want to show that \( \limsup{n^{-1}\log{\Prob{\EmpDist \in B}}} \leq -\inf_{q \in \mathrm{cl}B}{J(q)} \) (the limsup is bounded above by the infimum over the closure of \( B \)) and that \( \liminf{n^{-1} \log{\Prob{\EmpDist \in B}}} \geq -\inf_{q \in \mathrm{int}B}{J(q)} \) (the liminf is bounded below by the infimum over the interior of \( B \)). This avoids committing us to the existence of a limiting rate of decay for the probability, and is the kind of detail that matters when giving a real proof. But even in this section, I am only giving a sketch of how the proofs go. ^