## Information Theory and Large Deviations in the Foundations of Statistics

*09 Nov 2019 16:48*

These are three subjects which are deeply important to me, and I'm deeply
interested in their interconnections. Of course, lots of people are convinced
that there are deep connections between information theory and statistics, but
I tend to think that the most popular flavor of this, the maximum entropy
principle, is deeply misguided. ("CRS disdains his
elders and betters, details at 11.") I will, in the usual way, use this
notebook to record useful references, things I want to read, etc., but I also
can't resist sketching some of what I think the *right* way to
think about how these subjects are linked. Specifically: statistics
relies on large deviations properties, which are naturally expressed
information-theoretically.

#### Estimating Distributions, Sanov's Theorem, and Relative Entropy; or, You Can Observe a Lot Just By Looking

What Pitman
calls the "fundamental theorem of statistics" is the **Glivenko-Cantelli
Theorem**, the assertion that the empirical distribution function
converges on the true distribution function. Let's define some terms. \( X_1,
X_2, \ldots X_n, \ldots \) are a sequence of random variables, independent of
one another and identically distributed according to a common probability measure \( P \) on the space
\( \mathcal{X} \). The empirical measure \( \hat{P}_n \) of a set \( A \) is
\[
\hat{P}_n(A) \equiv \frac{1}{n}\sum_{i=1}^{n}{\mathbf{1}_{A}(X_i)}
\]
or, using the Dirac delta function,
\[
\hat{P}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{\delta_{X_i}}
\]
For one-dimensional variables, consider sets \( A = (-\infty,a] \). The
Glivenko-Cantelli theorem asserts that, as \( n\rightarrow \infty \),
\[
\sup_{a}{\left| \hat{P}_n(A) - P(A) \right|} \rightarrow 0
\]
\( P \)-almost surely. So as we get more and more data, the sample frequencies
match the probability of all such one-sided intervals arbitrarily closely.
From this, it follows that the sample frequency of any interval matches the
probability arbitrarily closely, and so the expectation value of any reasonable
test function matches arbitrarily closely. In a phrase, \( \hat{P}_n \)
converges in distribution on \( P \), almost surely. This result extends
straightforwardly to any finite-dimensional space.
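For those who like to see such things computed, here is a quick Python sketch (my own throwaway illustration, nothing canonical): with Uniform(0,1) data the true CDF is just \( F(a) = a \), so the sup over one-sided intervals --- the Kolmogorov-Smirnov statistic --- can be evaluated exactly from the order statistics, and watched shrinking as \( n \) grows.

```python
# A throwaway simulation of Glivenko-Cantelli (assumed: Python + numpy).
# With Uniform(0,1) data, the true CDF is F(a) = a, so the sup-distance
# between empirical and true CDFs is exactly computable from the order
# statistics.
import numpy as np

rng = np.random.default_rng(42)

def ks_distance(sample):
    """sup_a |empirical CDF - true CDF| for Uniform(0,1) data."""
    x = np.sort(sample)
    n = len(x)
    # The sup is attained at a data point, just at or just before a jump
    # of the empirical CDF (which rises by 1/n at each observation).
    upper = np.max(np.arange(1, n + 1) / n - x)
    lower = np.max(x - np.arange(0, n) / n)
    return max(upper, lower)

for n in [100, 10_000, 1_000_000]:
    print(n, ks_distance(rng.random(n)))
```

Typical runs show the distance falling off roughly like \( 1/\sqrt{n} \), which already hints that there are rates to be quantified.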

How quickly does it converge? Here we need to appeal
to large deviations theory, and
specifically to a result called **Sanov's theorem**. Let's start
with a ~~lie told to children~~ rough sketch. What is the
probability that \( \hat{P}_n \) is close to any particular probability measure
\( Q \)? Sanov's theorem says that, when \( n \) is large,
\[
\Pr{\left( \hat{P}_n \approx Q \right)} \approx \exp{\left\{-nD(Q \| P)\right\}}
\]
where \( D(Q \| P) \) is the **relative entropy**
or **Kullback-Leibler divergence**, and probability is taken under
the actual distribution of the \( X_i \), i.e., under \( P \). When the space
\( \mathcal{X} \) in which the \( X_i \) take values is discrete, this is just the average log-likelihood
ratio,
\[
D(Q \| P) = \sum_{x}{Q(x) \log{\frac{Q(x)}{P(x)}}}
\]
with the understanding that \( 0\log{0} = 0 \). (Use L'Hopital's rule if you
don't believe that.)
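In code, the conventions amount to two special cases (again a hand-rolled Python sketch; the function name and the dictionary representation of distributions are my own inventions, not any standard API):

```python
# Discrete relative entropy D(Q||P), with the 0 log 0 = 0 convention,
# and D = +infinity as soon as Q puts mass where P puts none.
# Distributions are represented as dicts mapping outcomes to probabilities.
from math import log, inf

def relative_entropy(Q, P):
    d = 0.0
    for x, qx in Q.items():
        if qx == 0:
            continue                    # the 0 log 0 = 0 convention
        if P.get(x, 0) == 0:
            return inf                  # Q charges a point P forbids
        d += qx * log(qx / P[x])
    return d

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}
print(relative_entropy(P, P))                      # 0.0: vanishes iff Q = P
print(relative_entropy(Q, P))                      # strictly positive
print(relative_entropy({"a": 0.5, "c": 0.5}, P))   # inf
```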

(Why is this the right rate function? Having told a lie to children, my
logical inhibitions are already relaxed, so I will be
utterly ~~fallacious~~ heuristic. What is the probability that \(
\hat{P} \approx Q \), for any given distribution \( Q \)? We'll make the
sample space discrete and finite, so every distribution is
really some multinomial. When \( X_1, \ldots X_n \) are of "type \( Q \)",
meaning that \( \hat{P}_n \approx Q \), the number of indices \( i \) with \( X_i = x \) must be \( \approx n Q(x) \) for each \( x \in \mathcal{X} \).
By
elementary combinatorics, the number of length-\( n \) samples of type \( Q \)
is \( \frac{n!}{\prod_{x \in \mathcal{X}}{(nQ(x))!}} \). (Notice that the generating
probability distribution \( P \) doesn't matter here. Also notice that I'm
blithely pretending \( nQ(x) \) is an integer; like I said, this is "heuristic".) The probability
of the generating distribution \( P \) producing any one \( Q \) type sample
is \( \prod_{x \in \mathcal{X}}{P(x)^{nQ(x)}} \), by the IID assumption. So
\[
\begin{eqnarray}
\Pr(\hat{P}_n \approx Q) & = & n! \prod_{x \in \mathcal{X}}{\frac{P(x)^{nQ(x)}}{(nQ(x))!}}\\
\frac{1}{n}\log{\Pr(\hat{P}_n \approx Q)} & = & \frac{1}{n}\left(\log{(n!)} + \sum_{x\in\mathcal{X}}{n Q(x)\log{P(x)} - \log{(nQ(x))!}}\right)\\
& \approx & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{P(x)} - Q(x)\log{(nQ(x))}}\\
& = & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{(P(x))} - Q(x)\log{n} - Q(x)\log{Q(x)}}\\
& = & \sum_{x \in \mathcal{X}}{Q(x)\log{\frac{P(x)}{Q(x)}}}\\
& = & - D(Q\|P)
\end{eqnarray}
\]
using Stirling's approximation, \( \log{n!} \approx n\log{n} \), and the fact that
\( Q \) is a probability distribution, so \( \sum_{x \in \mathcal{X}}{Q(x)\log{n}} = \log{n} \).
Remarkably enough, this ~~tissue
of fallacies~~ heuristic sketch can be made completely rigorous, even
for continuous distributions (e.g., by a sequence of successively refined
discretizations of the continuous space).
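The heuristic can also be checked numerically: for a two-point space, the exact probability under \( P \) of seeing the counts \( nQ(x) \) follows from the multinomial formula above (computed in log space with `lgamma`, so nothing overflows), and \( -\frac{1}{n} \) times its log creeps toward \( D(Q\|P) \). A Python sketch, with \( P \) and \( Q \) arbitrary made-up examples:

```python
# Checking the type-counting heuristic exactly.  The multinomial
# probability of getting counts n*Q(x) under P is computed in log
# space via lgamma (log-Gamma, so lgamma(n+1) = log n!).
from math import lgamma, log

P = [0.5, 0.5]
Q = [0.8, 0.2]

def log_prob_of_type(n):
    """log of Pr_P(counts = n*Q(x)), assuming the n*Q(x) are integers."""
    counts = [round(n * q) for q in Q]
    lp = lgamma(n + 1)                      # log n!
    for c, p in zip(counts, P):
        lp += c * log(p) - lgamma(c + 1)    # + nQ(x) log P(x) - log (nQ(x))!
    return lp

D = sum(q * log(q / p) for q, p in zip(Q, P))
for n in [10, 100, 10_000]:
    print(n, -log_prob_of_type(n) / n, "vs D(Q||P) =", D)
```

The convergence is from above, as it must be: the probability of any single type is bounded by \( e^{-nD(Q\|P)} \) exactly, and the Stirling corrections are \( O(\log(n)/n) \).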

Ordinarily, in information theory, if the data comes from \( P \) and we consider using the wrong distribution \( Q \), the penalty we pay for coding is \( D(P \| Q ) \), which is the other way around. But we can understand why the measures should be flipped here. If \( Q(x) = 0, P(x) > 0 \) for some \( x \), nothing special happens: that particular \( x \) just drops out of the sum. But if \( Q(x) > 0, P(x) = 0 \) for some \( x \), then \( D(Q\|P) = \infty \). Said in words, if \( Q \) just gives zero probability to some values that \( P \) allows, it becomes exponentially unlikely that a large sample looks like \( Q \). But if \( Q \) gives positive probability to a value that \( P \) forbids, there is absolutely no chance that any sample looks like \( Q \).

When \(0 < D(Q \| P) < \infty\), Sanov's theorem tells us that the probability of getting a sample that looks like \( Q \), when drawing data from \( P \), is exponentially small. If \( D(Q \| P) = 0 \), then the probability of getting samples like \( Q \) decays sub-exponentially, if at all; in fact one can show that \( D(Q \| P) = 0 \Leftrightarrow Q = P \), so that probability actually tends to 1.

I said that my statement of Sanov's theorem was rough. A more precise
statement is that, as \( n \rightarrow \infty \),
\[
-\frac{1}{n}\log{\Pr{\left(\hat{P}_n \in B\right)}} \rightarrow \inf_{Q \in
B}{D(Q \| P)}
\]
where now \( B \) is some set of probability measures. The relative entropy is
the **rate function**, giving the rate at which the probability of
a large deviation from the expected behavior goes to zero. One way to read the
\( \inf \) is to say that the probability of the sample being in some "bad"
set \( B \) is dominated by the least-bad distribution in the set --- "if an
improbable event happens, it tends to happen in the least-improbable way".

If \( X \) is continuous, then we could use the probability density to define the divergence, \[ D(Q\|P) = \int_{x}{q(x) \log{\frac{q(x)}{p(x)}} dx} \]

In yet more general cases, we fall back on measure theory, and use the Radon-Nikodym derivative, \[ D(Q\|P) = \int{\log{\frac{dQ}{dP}(x)}dQ(x)} \] which reduces to the two earlier formulas in the special cases. We can still read Sanov's theorem as above.
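For a concrete continuous example, the divergence between two Gaussians has the closed form \( D(\mathcal{N}(m_1,s_1^2)\|\mathcal{N}(m_0,s_0^2)) = \log{(s_0/s_1)} + \frac{s_1^2 + (m_1-m_0)^2}{2s_0^2} - \frac{1}{2} \), which brute-force numerical integration of the density formula reproduces (a Python sketch; the parameter values are arbitrary):

```python
# The density form of the divergence, checked against the closed-form
# answer for two Gaussians.
import numpy as np

m1, s1 = 1.0, 1.0     # Q = N(m1, s1^2)
m0, s0 = 0.0, 2.0     # P = N(m0, s0^2)

def gauss(x, m, s):
    return np.exp(-((x - m) ** 2) / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-20.0, 20.0, 200_001)   # tails beyond +/- 20 are negligible
dx = x[1] - x[0]
q, p = gauss(x, m1, s1), gauss(x, m0, s0)
f = q * np.log(q / p)
numeric = np.sum((f[1:] + f[:-1]) / 2) * dx   # trapezoid rule, by hand

closed_form = np.log(s0 / s1) + (s1 ** 2 + (m1 - m0) ** 2) / (2 * s0 ** 2) - 0.5
print(numeric, closed_form)   # agree to several decimal places
```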

(Strictly speaking, I should really distinguish between the probability of \( \hat{P}_n \) of being in the closure of \( B \) and the probability of its being in the interior, and between \( \limsup \) and \( \liminf \) of the rates, but those details are not important for the present purposes.)

Recall that we wanted to know how quickly the empirical distribution \(
\hat{P}_n \) converges to the true distribution \( P \). Fix some neighborhood
\( N_{\epsilon} \) of \( P \), under whatever way you like of gauging how far
apart two distributions are. Then we have
\[
-\frac{1}{n}\log{\Pr{\left( \hat{P}_n \not\in N_{\epsilon} \right)}} \rightarrow \inf_{Q \not\in N_{\epsilon}}{D(Q \| P)}
\]
That infimum will be some increasing function of \( \epsilon \), say \(
r(\epsilon) \). [If we measure the distance between probability measures in
total variation, it's \( O(\epsilon^2) \).] So, roughly speaking, the
probability of being more than \( \epsilon \) away from the truth is about \(
\exp{\left\{ -nr(\epsilon) \right\}} \), neither more nor less. (It's
exponentially small, but *only* exponentially small.) More formally,
Sanov's theorem gives matching upper *and lower* bounds on the
asymptotic large deviations probability.
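All of this can be made exact, with no asymptotics at all, for a fair coin: the "bad" set is the sample frequencies more than \( \epsilon \) away from \( 1/2 \), the infimum of the divergence over that set is attained at the boundary \( \mathrm{Ber}(1/2 \pm \epsilon) \), and the exact binomial tail probability decays at just that rate. A Python sketch (with \( \epsilon = 0.1 \) an arbitrary choice), using exact integer arithmetic so the exponentially small tails don't underflow:

```python
# The exponential decay in action, for a fair coin: the exact probability
# that the sample frequency strays more than eps from 1/2 decays at the
# Sanov rate inf_Q D(Q||P) = D(Ber(1/2 + eps) || Ber(1/2)).
# Exact binomial counting with Python integers, so no underflow.
from math import comb, log

eps = 0.1

def log_tail_prob(n):
    """log of P(|#heads/n - 1/2| > eps) under a fair coin, exactly."""
    num = sum(comb(n, k) for k in range(n + 1) if abs(k / n - 0.5) > eps)
    return log(num) - n * log(2)            # probability = num / 2^n

def kl_bern(a, b):
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

rate = kl_bern(0.5 + eps, 0.5)   # the least-bad Q outside the eps-ball
for n in [100, 1000, 5000]:
    print(n, -log_tail_prob(n) / n, "vs rate", rate)
```

The empirical exponents approach the rate from above, the gap being the usual polynomial-in-\( n \) prefactor that the large deviations limit ignores.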

#### Hypothesis Testing, the Neyman-Pearson Lemma, and Relative Entropy

Sanov's theorem thus tells us something about the efficacy of estimating a distribution by just sitting and counting. It can also tell us about hypothesis testing. We take the most basic possible hypothesis testing problem, of distinguishing between a null distribution \( P_0 \) and an alternative distribution \( P_1 \). We know, from the Neyman-Pearson lemma, that the optimal test is a likelihood-ratio test: say "1" when \[ \Lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\log{\frac{p_1(x_i)}{p_0(x_i)}}} \] is above some threshold \( t_n \), and say "0" when it's below. The value of this threshold is usually picked to enforce a desired false-alarm rate \( \alpha \), i.e., it is the shadow price of power at a certain sample size.

There are two error probabilities here: the probability of saying "1" under \( P_0 \), and the probability of saying "0" under \( P_1 \). The first, the false-alarm rate, is controlled by fixing \( t_n \).

Since the log likelihood ratio is a function of the sample, we can write it in terms of the empirical distribution: \[ \Lambda_n = \mathbf{E}_{\hat{P}_n}\left[ \log{\frac{p_1(X)}{p_0(X)}} \right] \]

Saying that \( \Lambda_n < t_n \) is thus equivalent to saying that \( \hat{P}_n \in B(t_n) \), for some set of distributions \( B(t_n) \). Enforcing the constraint on the false alarm probability means that \[ \Pr_{P_0}{\left( \hat{P}_n \in B(t_n) \right)} = 1-\alpha \] At the same time, again by Sanov's theorem, the probability, under \( P_0 \), that \( -\Lambda_n \not \in (D(P_0\|P_1) - \epsilon, D(P_0\|P_1) + \epsilon) \) is going to zero exponentially fast in \( n \). So it must be the case that \( t_n \rightarrow -D(P_0\|P_1) \).

Now what about those miss (type II error) probabilities? This is when the test should say "1" but says instead "0". \[ \beta_n = \Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)} \] So, using Sanov again, \[ -\frac{1}{n}\log{\beta_n} = -\frac{1}{n}\log{\Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)}} \rightarrow \inf_{Q \in B(t_n)}{D(Q \| P_1)} \] When is \( Q \in B(t_n) \)? To keep things simple, assume we can multiply and divide by \( q(x) \) everywhere we need to. \[ \begin{eqnarray*} t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{p_0(x)}}\right]\\ t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{q(x)}\frac{q(x)}{p_0(x)}}\right]\\ t_n & \geq & \mathbf{E}_{Q}\left[ -\log{\frac{q(x)}{p_1(x)}}+\log{\frac{q(x)}{p_0(x)}}\right]\\ t_n & \geq & D(Q\|P_0) - D(Q\|P_1)\\ D(Q\|P_1) & \geq & D(Q\|P_0) - t_n \end{eqnarray*} \] In other words, we're looking for distributions which are sufficiently closer, in divergence, to \( P_0 \) as opposed to \( P_1 \). Since \( t_n \rightarrow -D(P_0\|P_1) \), as we saw above, in the long run we need \( D(Q\|P_1) \geq D(Q\|P_0) + D(P_0\|P_1) \). To make this as small as possible, set \( Q = P_0 \), which attains the bound. Thus \[ -\frac{1}{n}\log{\beta_n} \rightarrow D(P_0 \| P_1) \]
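Once more a coin makes everything exactly computable. Testing \( P_0 = \mathrm{Ber}(1/2) \) against \( P_1 = \mathrm{Ber}(3/4) \), with the threshold pinned at its limiting value \( -D(P_0\|P_1) \), the likelihood-ratio test says "0" essentially when at most half the tosses come up heads, and the exact miss probability under \( P_1 \) decays at rate \( D(P_0\|P_1) \), as derived above. (A Python sketch; the parameter values are my own toy choice.)

```python
# The miss-probability exponent, exactly, for Bernoulli(1/2) vs
# Bernoulli(3/4) with the threshold at its limiting value -D(P0||P1).
# Exact integer arithmetic: under P1, pmf(k) = C(n,k) 3^k / 4^n.
from math import comb, log

th0, th1 = 0.5, 0.75

def kl_bern(a, b):
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

D = kl_bern(th0, th1)
t = -D                              # limiting threshold t_n -> -D(P0||P1)

def log_beta(n):
    """log of the exact miss probability under P1 at sample size n."""
    def llr(k):                     # averaged log-likelihood ratio, given k heads
        return (k / n) * log(th1 / th0) + (1 - k / n) * log((1 - th1) / (1 - th0))
    # miss = say "0", i.e., averaged LLR at or below the threshold
    num = sum(comb(n, k) * 3 ** k for k in range(n + 1) if llr(k) <= t)
    return log(num) - n * log(4)

for n in [100, 1000, 4000]:
    print(n, -log_beta(n) / n, "vs D(P0||P1) =", D)
```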

This argument can also be made to work if the size is allowed to go to zero --- if we constrain the size to be \( \alpha_n \rightarrow 0 \) (sufficiently slowly). By parallel reasoning, if we fix the power instead (constraining the miss probability to \( \beta \), or to \( \beta_n \rightarrow 0 \)) and let the size go to zero, the best exponential rate we can attain for the false-alarm probability is \( D(P_1 \| P_0) \).

To sum up, the large deviation rate function gives the error rate for optimal hypothesis tests.

#### The Point Where I Gave Up Writing This Out

There is a natural duality between hypothesis testing and parameter estimation. If I have a consistent estimator \( \hat{\theta} \), I can use it to test whether \( \theta = \theta_0 \) by seeing whether \( \hat{\theta} \approx \theta_0 \). In particular, estimators cannot converge too fast, or they would give us tests which would break the bounds we've just seen. Lower bounds on hypothesis testing thus breed lower bounds on estimation error. This actually gives us the Cramer-Rao bound, as well as some refinements. (Though there is a slicker way of deriving Cramer-Rao from divergence.)

#### The General Pattern

Statistical inference depends on some kind of empirical distribution --- the marginal distribution (as above), or over pairs (as in a Markov process), or path measures, or motifs in a graph. When the probability gods are kind, that empirical distribution obeys a large deviations principle, which gives matching upper and lower bounds for fluctuations away from the data-generating distribution (at least on an exponential scale). The rate in the LDP is (usually) a relative entropy / KL divergence. This is why information-theoretic quantities control statistical inference.

Further topics I'd cover here, if I had the energy:

- Cashing the promissory note above about parameter estimation, including Fisher information, and the general tricks for getting at minimax rates via Fano's inequality
- Sufficiency and preservation of divergence
- Extension of these results from IID data to stochastic processes
- Going from source coding results to limits on inference
- Bahadur efficiency
- Empirical likelihood

See also: Estimating Entropies and Informations; Information Theory; Large Deviations Theory; Statistics;

- Recommended, big picture:
  - R. R. Bahadur, *Some Limit Theorems in Statistics* [1971. The notation is now somewhat dated, and the proofs of many of the basic theorems have since been considerably simplified. But if there's a better source for statistical applications than this little book, I've yet to find it.]
  - Cover and Thomas, *Elements of Information Theory* [Specifically, the chapters on large deviations and on statistics]
  - Solomon Kullback, *Information Theory and Statistics*

- Recommended, close-ups:
  - Andrew Barron and Nicolas Hengartner, "Information theory and superefficiency", *Annals of Statistics* **26** (1998): 1800--1825
  - M. S. Bartlett, "The Statistical Significance of Odd Bits of Information", *Biometrika* **39** (1952): 228--237 [A goodness-of-fit test based on fluctuations of the entropy. JSTOR]
  - I. Csiszár, "Maxent, Mathematics, and Information Theory", pp. 35--50 in Kenneth M. Hanson and Richard N. Silver (eds.), *Maximum Entropy and Bayesian Methods: Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods*
  - James C. Fu, "Large Sample Point Estimation: A Large Deviation Theory Approach", *Annals of Statistics* **10** (1982): 762--771
  - Fuqing Gao and Xingqiu Zhao, "Delta method in large deviations and moderate deviations for estimators", *Annals of Statistics* **39** (2011): 1211--1240, arxiv:1105.3552 [This is based on an extension of the "contraction principle" which is of independent interest]
  - Alexander Korostelev, "A minimaxity criterion in nonparametric regression based on large-deviations probabilities", *Annals of Statistics* **24** (1996): 1075--1083
  - Solomon Kullback and R. A. Leibler, "On Information and Sufficiency", *Annals of Mathematical Statistics* **22** (1951): 79--86
  - Neri Merhav, "Bounds on Achievable Convergence Rates of Parameter Estimators via Universal Coding", *IEEE Transactions on Information Theory* **40** (1994): 1210--1215 [PDF reprint via Prof. Merhav. Thanks to Max Raginsky for pointing out this interesting paper to me.]
  - Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286 [Exposition]
  - Jonathan Scarlett and Volkan Cevher, "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation", arxiv:1901.00555
  - Suresh Venkatasubramanian, "A brief note on Fano's inequality", The Geomblog, 6 August 2014
  - Aolin Xu and Maxim Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms", arxiv:1705.07809 [Related slides from Maxim]

- To read:
  - Miguel A. Arcones, "Bahadur Efficiency of the Likelihood Ratio Test" [PDF preprint from 2005, presumably since published...]
  - Bucklew, *Large Deviation Techniques in Decision, Simulation, and Estimation*
  - Imre Csiszár and Paul Shields, *Information Theory and Statistics: A Tutorial* [Fulltext PDF. The only reason I don't list this as "recommended" is that I'm sticking to my rule of not doing so unless I've read *all* of it...]
  - Te Sun Han, "Hypothesis Testing with the General Source", *IEEE Transactions on Information Theory* **46** (2000): 2415--2427, math.PR/0004121 ["The asymptotically optimal hypothesis testing problem with the general sources as the null and alternative hypotheses is studied.... Our fundamental philosophy in doing so is first to convert all of the hypothesis testing problems completely to the pertinent computation problems in the large deviation-probability theory. ... [This] enables us to establish quite compact general formulas of the optimal exponents of the second kind of error and correct testing probabilities for the general sources including all nonstationary and/or nonergodic sources with arbitrary abstract alphabet (countable or uncountable). Such general formulas are presented from the information-spectrum point of view."]
  - Zhishui Hu, John Robinson, and Qiying Wang, "Cramér-type large deviations for samples from a finite population", *Annals of Statistics* **35** (2007): 673--696, arxiv:0708.1880
  - Dayu Huang and Sean Meyn, "Generalized Error Exponents For Small Sample Universal Hypothesis Testing", *IEEE Transactions on Information Theory* **59** (2013): 8157--8181, arxiv:1204.1563
  - K. Iriyama, "Error Exponents for Hypothesis Testing of the General Source", *IEEE Transactions on Information Theory* **51** (2005): 1517--1522
  - D. F. Kerridge, "Inaccuracy and Inference", *Journal of the Royal Statistical Society B* **23** (1961): 184--194
  - Yuichi Kitamura, "Empirical likelihood methods in econometrics: Theory and Practice", Cowles Foundation Discussion Paper No. 1569 (2006)
  - David J. C. MacKay, *Information Theory, Inference and Learning Algorithms* [Online version]
  - Liam Paninski, "Asymptotic Theory of Information-Theoretic Experimental Design", *Neural Computation* **17** (2005): 1480--1507
  - Vincent Y. F. Tan, Animashree Anandkumar, Lang Tong, and Alan S. Willsky, "A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures", *IEEE Transactions on Information Theory* **57** (2011): 1714--1735, arxiv:0905.0940 [Large deviations for Chow-Liu trees]
  - José Trashorras and Olivier Wintenberger, "Large deviations for bootstrapped empirical measures", arxiv:1110.4620
  - Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", *Annals of Statistics* **27** (1999): 1564--1599