Information Theory and Large Deviations in the Foundations of Statistics

Last update: 07 Jul 2025 13:35
First version: 9 November 2019

These are three subjects which matter to me, and I'm especially interested in their interconnections. Of course, lots of people are convinced that there are profound connections between information theory and statistics, but I tend to think that the most popular flavor of this, the maximum entropy principle, is deeply misguided. ("CRS disdains his elders and betters, details at 11.") I will, in the usual way, use this notebook to record useful references, things I want to read, etc., but I also can't resist sketching some of what, in my supremely arrogant opinion, is the the right way to think about how these subjects are linked. Specifically: statistics relies on large deviations properties, which are naturally expressed information-theoretically.

Estimating Distributions, Sanov's Theorem, and Relative Entropy; or, You Can Observe a Lot Just By Looking

What Pitman calls the "fundamental theorem of statistics" is the Glivenko-Cantelli Theorem, the assertion that the empirical distribution function converges on the true distribution function. Let's define some terms. \( X_1, X_2, \ldots X_n, \ldots \) are a sequence of random variables, independent and distributed according to some common probability measure \( P \) on the space \( \mathcal{X} \). The empirical measure \( \hat{P}_n \) of a set \( A \) is \[ \hat{P}_n(A) \equiv \frac{1}{n}\sum_{i=1}^{n}{\mathbf{1}_{A}(X_i)} \] or, using the Dirac delta function, \[ \hat{P}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{\delta_{X_i}} \] For one-dimensional variables, consider sets \( A = (-\infty,a] \). The Glivenko-Cantelli theorem asserts that, as \( n\rightarrow \infty \), \[ \sup_{a}{\left| \hat{P}_n(A) - P(A) \right|} \rightarrow 0 \] \( P \)-almost surely. So as we get more and more data, the sample frequencies match the probability of all such one-sided intervals arbitrarily closely. From this, it follows that the sample frequency of any interval matches the probability arbitrarily closely, and so the expectation value of any reasonable test function matches arbitrarily closely. In a phrase, \( \hat{P}_n \) converges in distribution on \( P \), almost surely. This result extends straight-forwardly to any finite-dimensional space.

How quickly does it converge? Here we need to appeal to large deviations theory, and specifically to a result called Sanov's theorem. Let's start with a ~~lie told to children~~ rough sketch. What is the probability that \( \hat{P}_n \) is close to any particular probability measure \( Q \)? Sanov's theorem says that, what \( n \) is large, \[ \Pr{\left( \hat{P}_n \approx Q \right)} \approx \exp{\left\{-nD(Q \| P)\right\}} \] where \( D(Q \| P) \) is the relative entropy or Kullback-Leibler divergence, and probability is taken under the actual distribution of the \( X_i \), i.e., under \( P \). When the space \( X \) takes values in is discrete, this is just the average log-likelihood ratio, \[ D(Q \| P) = \sum_{x}{Q(x) \log{\frac{Q(x)}{P(x)}}} \] with the understanding that \( 0\log{0} = 0 \). (Use L'Hopital's rule if you don't believe that.)

(Why is this the right rate function? Having told a lie to children, my logical inhibitions are already relaxed, so I will be utterly ~~fallacious~~ heuristic. What is the probability that \( \hat{P} \approx Q \), for any given distribution \( Q \)? We'll make the sample space discrete and finite, so every distribution is really some multinomial. When \( X_1, \ldots X_n \) are of "type \( Q \)", meaning that \( \hat{P}_n \approx Q \), the number of \( t \) with \( X_t = x \) must be \( \approx n Q(x) \) for each \( x \in \mathcal{X} \). By elementary combinatorics, the number of length-\( n \) samples of type \( Q \) is \( \frac{n!}{\prod_{x \in \mathcal{X}}{(nQ(x))!}} \). (Notice that the generating probability distribution \( P \) doesn't matter here. Also notice that I'm blithely pretending \( nQ(x) \) is an integer; like I said, this is heuristic.) The probability of the generating distribution \( P \) producing any one \( Q \) type sample is \( \prod_{x \in \mathcal{X}}{P(x)^{nQ(x)}} \), by the IID assumption. So \[ \begin{eqnarray} \Pr(\hat{P}_n \approx Q) & = & n! \prod_{x \in \mathcal{X}}{\frac{P(x)^{nQ(x)}}{(nQ(x))!}}\\ \frac{1}{n}\log{\Pr(\hat{P}_n \approx Q)} & = & \frac{1}{n}\left(\log{(n!)} + \sum_{x\in\mathcal{X}}{n Q(x)\log{P(x)} - \log{(nQ(x))!}}\right)\\ & \approx & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{P(x)} - Q(x)\log{(nQ(x))}}\\ & = & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{(P(x))} - Q(x)\log{n} - Q(x)\log{Q(x)}}\\ & = & \sum_{x \in \mathcal{X}}{Q(x)\log{\frac{P(x)}{Q(x)}}}\\ & = & - D(Q\|P) \end{eqnarray} \] using Stirling's approximation, \( \log{n!} \approx n\log{n} \), and the fact that \( Q \) is a probability distribution, so \( \sum_{x \in \mathcal{X}}{Q(x)\log{n}} = \log{n} \). Remarkably enough, this ~~tissue of fallacies~~ heuristic sketch can be made completely rigorous, even for continuous distributions [e.g., by a sequences of successively refined discretizations of the continuous space].)

Ordinarily, in information theory, if the data comes from \( P \) and we consider using the wrong distribution \( Q \), the penalty we pay for coding is \( D(P \| Q ) \), which is the other way around. But we can understand why the measures should be flipped here. If \( Q(x) = 0, P(x) > 0 \) for some \( x \), nothing special happens, that particular \( x \) just drops out of the sum. But if \( Q(x) > 0, P(x) = 0 \) for some \( x \), then \( D(Q\|P) = \infty \). Said in words, if \( Q \) merely gives zero probability to some values that \( P \) allows, it becomes exponentially unlikely that a large sample looks like \( Q \). But if \( Q \) gives positive probability to a value that \( P \) forbids, there is absolutely no chance that any sample looks like \( Q \).

When \(0 < D(Q \| P) < \infty\), Sanov's theorem tells us that the probability of getting a sample that looks like \( Q \) when drawing data from \( P \) is exponentially small. If \( D(Q \| P) = 0 \), then the probability if getting samples like \( Q \) is decaying sub-exponentially if at all, but one can show that \( D(Q \| P) = 0 \Leftrightarrow Q = P \), so that probability is actually tending towards 1.

I said that my statement of Sanov's theorem was rough. A more precise statement is that, as \( n \rightarrow \infty \), \[ -\frac{1}{n}\log{\Pr{\left(\hat{P}_n \in B\right)}} \rightarrow \inf_{Q \in B}{D(Q \| P)} \] where now \( B \) is some set of probability measures. The relative entropy is the rate function, giving the rate at which the probability of a large deviation from the expected behavior goes to zero. One way to read the \( \inf \) is is to say that the probability of the sample being in some "bad" set \( B \) is dominated by the least-bad distribution in the set --- "if an improbable event happens, it tends to happen in the least-improbable way".

If \( X \) is continuous, then we could use the probability density to define the divergence, \[ D(Q\|P) = \int_{x}{q(x) \log{\frac{q(x)}{p(x)}} dx} \]

In yet more general cases, we fall back on measure theory, and use the Radon-Nikodym derivative, \[ D(Q\|P) = \int{\log{\frac{dQ}{dP}(x)}dQ(x)} \] which reduces to the two earlier formulas in the special cases. We can still read Sanov's theorem as above.

(Strictly speaking, I should really distinguish between the probability of \( \hat{P}_n \) of being in the closure of \( B \) and the probability of its being in the interior, and between \( \limsup \) and \( \liminf \) of the rates, but those details are not important for the present purposes.)

Recall that we wanted to know how quickly the empirical distribution \( \hat{P}_n \) converges to the true distribution \( P \). Fix some neighborhood \( N_{\epsilon} \) of \( P \), with whatever way you like of gauging how far apart two distributions. Then we have \[ -\frac{1}{n}\log{\Pr{\left( \hat{P}_n \not\in N_{\epsilon} \right)}} \rightarrow \inf_{Q \not\in N_{\epsilon}}{D(Q \| P)} \] That infimum will be some increasing function of \( \epsilon \), say \( r(\epsilon) \). [If we measure the distance between probability measures in total variation, it's \( O(\epsilon^2) \).] So, roughly speaking, the probability of being more than \( \epsilon \) away from the truth is about \( \exp{\left\{ -nr(\epsilon) \right\}} \), neither more nor less. (It's exponentially small, but only exponentially small.) More formally, Sanov's theorem gives matching upper and lower bounds on the asymptotic large deviations probability.

Hypothesis Testing, the Neyman-Pearson Lemma, and Relative Entropy

Sanov's theorem thus tells us something about the efficacy of estimating a distribution by just sitting and counting. It can also tell us about hypothesis testing. We take the most basic possible hypothesis testing problem, of distinguishing between a null distribution \( P_0 \) and an alternative distribution \( P_1 \). We know, from the Neyman-Pearson lemma, that the optimal test is a likelihood-ratio test: say "1" when \[ \Lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\log{\frac{p_1(x_i)}{p_0(x_i)}}} \] is above some threshold \( t_n \), and say "0" when it's below. The value of this threshold is usually picked to enforce a desired false-alarm rate \( \alpha \), i.e., it is the shadow price of power at a certain sample size.

There are two error probabilities here: the probability of saying "1" under \( P_0 \), and the probability of saying "0" under \( P_1 \). The first, the false-alarm rate, is controlled by fixing \( t \).

Since the log likelihood ratio is a function of the sample, we can write it in terms of the empirical distribution: \[ \Lambda_n = \mathbf{E}_{\hat{P}_n}\left[ \log{\frac{p_1(X)}{p_0(X)}} \right] \]

Saying that \( \Lambda_n < t_n \) is thus equivalent to saying that \( \hat{P}_n \in B(t_n) \), for some set of distributions \( B(t_n) \). Enforcing the constraint on the false alarm probability means that \[ \Pr_{P_0}{\left( \hat{P}_n \in B(t_n) \right)} = 1-\alpha \] At the same time, again by Sanov's theorem, the probability, under \( P_0 \), that \( -\Lambda_n \not \in (D(P_0\|P_1) - \epsilon, D(P_0\|P_1) + \epsilon) \) is going to zero exponentially fast in \( n \). So it must be the case that \( t_n \rightarrow -D(P_0\|P_1) \).

Now what about those miss (type II error) probabilities? This is when the test should say "1" but says instead "0". \[ \beta_n = \Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)} \] So, using Sanov again, \[ -\frac{1}{n}\log{\beta_n} = \frac{1}{n}\log{\Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)}} \rightarrow \inf_{Q \in B(t_n)}{D(Q \| P_1)} \] When is \( Q \in B(t_n) \)? To keep things simple, assume we can multiply and divide by \( q(x) \) everywhere we need to. \[ \begin{eqnarray*} t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{p_0(x)}}\right]\\ t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{q(x)}\frac{q(x)}{p_0(x)}}\right]\\ t_n & \geq & \mathbf{E}_{Q}\left[ -\log{\frac{p_1(x)}{q(x)}}+\log{\frac{q(x)}{p_0(x)}}\right]\\ t_n & \geq & D(Q\|P_0) - D(Q\|P_1)\\ D(Q\|P_1) & \geq & D(Q\|P_0) - t_n \end{eqnarray*} \] In other words, we're looking for distributions which are sufficiently closer, in divergence, to \( P_0 \) as opposed to \( P_1 \). Since \( t_n \rightarrow -D(P_0\|P_1) \), as we saw above, in the long run we need \( D(Q\|P_1) \geq D(Q\|P_0) + D(P_0\|P_1) \). To make this as small as possible, set \( Q = P_0 \). Thus \[ -\frac{1}{n}\log{\beta_n} = D(P_0 \| P_1) \]

This argument can also be made to work if the size is allowed to go to zero --- if we constrain the size to be \( \alpha_n \rightarrow 0 \). By parallel reasoning, if we wanted to require a fixed power (constrained to \( \beta \), or constrained to \( \beta_n \rightarrow 0 \) ) and let size go to zero, the best exponential rate we could attain for the false alarm rate is \( D(P_1 \| P_0) \).

To sum up, the large deviation rate function gives the error rate for optimal hypothesis tests.

The Point Where I Gave Up Writing This Out

There is a natural duality between hypothesis testing and parameter estimation. If I have a consistent estimator \( \hat{\theta} \), I can use it to test whether \( \theta = \theta_0 \) by seeing whether \( \hat{\theta} \approx \theta_0 \). In particular, estimators cannot converge too fast, or they would give us tests which would break the bounds we've just seen. Lower bounds on hypothesis testing this breed lower bounds on estimation error. This actually gives us the Cramer-Rao bound, as well as some refinements. (Though there is a slicker way of deriving Cramer-Rao from divergence.)

The General Pattern

Statistical inference depends on some kind of empirical distribution --- the marginal distribution (as above), or over pairs (as in a Markov process), or path measures, or motifs in a graph. When the probability gods are kind, that empirical distribution obeys a large deviations principle, which gives matching upper and lower bounds for fluctuations away from the data-generating distribution (at least on an exponential scale). The rate in the LDP is (usually) a relative entropy / KL divergence. This is why information-theoretic quantities control statistical inference.

Further topics I'd cover here, if I had the energy:

Cashing the promissory note above about parameter estimation, including Fisher information, and the general tricks for getting at minimax rates via Fano's inequality
Sufficiency and preservation of divergence
Extension of these results from IID data to stochastic processes
Going from source coding results to limits on inference
Bahadur efficiency
Empirical likelihood

R. R. Bahadur, Some Limit Theorems in Statistics [1971. The notation is now much more transparent, and the proofs of many basic theorems considerably simplified. But if there's a better source for statistical applications than this little book, I've yet to find it.]
Cover and Thomas, Elements of Information Theory [Specifically, the chapters on large deviations and on statistics]
Solomon Kullback, Information Theory and Statistics

Andrew Barron and Nicolas Hengartner, "Information theory and superefficiency", Annals of Statistics 26 (1998): 1800--1825
M. S. Bartlett, "The Statistical Significance of Odd Bits of Information", Biometrika 39 (1952): 228--237 [A goodness-of-fit test based on fluctuations of the entropy. JSTOR]
I. Csiszár, "Maxent, Mathematics, and Information Theory", pp. 35--50 in Kenneth M. Hanson and Richard N. Silver (eds.), Maximum Entropy and Bayesian Methods: Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods
James C. Fu, "Large Sample Point Estimation: A Large Deviation Theory Approach", Annals of Statistics 10 (1982): 762--771
Fuqing Gao and Xingqiu Zhao, "Delta method in large deviations and moderate deviations for estimators", Annals of Statistics 39 (2011): 1211-1240, arxiv:1105.3552 [This is based on an extension of the "contraction principle" which is of independent interest]
Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, arxiv:2110.01584
Alexander Korostelev, "A minimaxity criterion in nonparametric regression based on large-deviations probabilities", Annals of Statistics 24 (1996): 1075--1083
Solomon Kullback and R. A. Leibler, "On Information and Sufficiency", Annals of Mathematical Statistics 22 (1951): 79--86
Neri Merhav, "Bounds on Achievable Convergence Rates of Parameter Estimators via Universal Coding", IEEE Transactions on Information Theory 40 (1994): 1210--1215 [PDF reprint via Prof. Merhav. Thanks to Max Raginsky for pointing out this interesting paper to me.]
Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286 [Exposition]
Jonathan Scarlett and Volkan Cevher, "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation", arxiv:1901.00555
Suresh Venkatasubramanian, "A brief note on Fano's inequality", The Geomblog, 6 August 2014
Aolin Xu, Maxim Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms", arxiv:1705.07809 [Related slides from Maxim]

Miguel A. Arcones, "Bahadur Efficiency of the Likelihood Ratio Test" [PDF preprint from 2005, presumably since published...]
Bucklew, Large Deviation Techniques in Decision, Simulation, and Estimation
Imre Csiszar and Paul Shields, Information Theory and Statistics: A Tutorial [Fulltext PDF. The only reason I don't list this as "recommended" is that I'm sticking to my rule of not doing so unless I've read all of it...]
Te Sun Han, "Hypothesis Testing with the General Source", IEEE Transactions on Information Theory 46 (2000): 2415--2427, math.PR/0004121 ["The asymptotically optimal hypothesis testing problem with the general sources as the null and alternative hypotheses is studied.... Our fundamental philosophy in doing so is first to convert all of the hypothesis testing problems completely to the pertinent computation problems in the large deviation-probability theory. ... [This] enables us to establish quite compact general formulas of the optimal exponents of the second kind of error and correct testing probabbilities for the general sources including all nonstationary and/or nonergodic sources with arbitrary abstract alphabet (countable or uncountable). Such general formulas are presented from the information-spectrum point of view."]
Zhishui Hu, John Robinson, Qiying Wang, "Cramér-type large deviations for samples from a finite population", Annals of Statistics 35 (2007): 673--696, arxiv:0708.1880
Dayu Huang, Sean Meyn, "Generalized Error Exponents For Small Sample Universal Hypothesis Testing", IEEE Transactions on Information Theory 59 (2013): 8157--8181, arxiv:1204.1563
K. Iriyama, "Error Exponents for Hypothesis Testing of the General Source", IEEE Transactions on Information Theory 51 (2005): 1517--1522
D. F. Kerridge, "Inaccuracy and Inference", Journal of the Royal Statistical Society B 23 (1961): 184--194
Yuichi Kitamura, "Empirical likelihood methods in econometrics: Theory and Practice", Cowles Foundation Discussion Paper No. 1569 (2006)
David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
Liam Paninski, "Asymptotic Theory of Information-Theoretic Experimental Design", Neural Computation 17 (2005): 1480--1507
Vincent Y. F. Tan, Animashree Anandkumar, Lang Tong and Alan S. Willsky, "A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures", IEEE Transactions on Information Theory 57 (2011): 1714--1735, arxiv:0905.0940 [Large deviations for Chow-Liu trees]
José Trashorras, Olivier Wintenberger, "Large deviations for bootstrapped empirical measures", arxiv:1110.4620
Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", Annals of Statistics 27 (1999): 1564--1599