Notebooks
http://bactra.org/notebooks
Cosma's Notebooks: Information Theory and Large Deviations in the Foundations of Statistics
http://bactra.org/notebooks/2023/05/08#information-theory-and-statistics
<P>These are three subjects which matter to me, and I'm <em>especially</em>
interested in their interconnections. Of course, lots of people are convinced
that there are profound connections between information theory and statistics,
but I tend to think that the most popular flavor of this, the maximum entropy
principle, is <a href="max-ent.html">deeply misguided</a>. ("CRS disdains his
elders and betters, details at 11.") I will, in the usual way, use this
notebook to record useful references, things I want to read, etc., but I also
can't resist sketching some of what, in my supremely arrogant opinion, is the
<em>right</em> way to think about how these subjects are linked.
Specifically: statistics relies on large deviations properties, which are
naturally expressed information-theoretically.
<h4>Estimating Distributions, Sanov's Theorem, and Relative Entropy; or, You Can Observe a Lot Just By Looking</h4>
<P>What <a href="http://bactra.org/reviews/pitman-basic-theory/">Pitman</a>
calls the "fundamental theorem of statistics" is the <strong>Glivenko-Cantelli
Theorem</strong>, the assertion that the empirical distribution function
converges on the true distribution function. Let's define some terms. \( X_1,
X_2, \ldots X_n, \ldots \) are a sequence of random variables, independent and
distributed according to some common probability measure \( P \) on the space
\( \mathcal{X} \). The empirical measure \( \hat{P}_n \) of a set \( A \) is
\[
\hat{P}_n(A) \equiv \frac{1}{n}\sum_{i=1}^{n}{\mathbf{1}_{A}(X_i)}
\]
or, using the Dirac delta function,
\[
\hat{P}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{\delta_{X_i}}
\]
For one-dimensional variables, consider sets \( A = (-\infty,a] \). The
Glivenko-Cantelli theorem asserts that, as \( n\rightarrow \infty \),
\[
\sup_{a}{\left| \hat{P}_n(A) - P(A) \right|} \rightarrow 0
\]
\( P \)-almost surely. So as we get more and more data, the sample frequencies
match the probability of all such one-sided intervals arbitrarily closely.
From this, it follows that the sample frequency of any interval matches the
probability arbitrarily closely, and so the expectation value of any reasonable
test function matches arbitrarily closely. In a phrase, \( \hat{P}_n \)
converges in distribution on \( P \), almost surely. This result extends
straight-forwardly to any finite-dimensional space.
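<P>A minimal simulation sketch of the Glivenko-Cantelli statement, assuming (purely for illustration) that \( P \) is the uniform distribution on \( [0,1] \), so that \( P((-\infty,a]) = a \) there. The function name and the seed are made up for this example.

```python
import random

def sup_distance_uniform(sample):
    """sup over a of |P_hat_n((-inf, a]) - a| for a sample from Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    # The empirical CDF only jumps at the order statistics, so it is enough
    # to check the gap just before and just after each jump.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)  # fixed seed, only so that the run is reproducible
for n in (100, 1_000, 10_000, 100_000):
    d = sup_distance_uniform([rng.random() for _ in range(n)])
    print(n, round(d, 4))
```

The printed distances should shrink as \( n \) grows; how fast they shrink is the next question.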
<P>How quickly does it converge? Here we need to appeal
to <a href="large-deviations.html">large deviations</a> theory, and
specifically to a result called <strong>Sanov's theorem</strong>. Let's start
with a <strike>lie told to children</strike> rough sketch. What is the
probability that \( \hat{P}_n \) is close to any particular probability measure
\( Q \)? Sanov's theorem says that, when \( n \) is large,
\[
\Pr{\left( \hat{P}_n \approx Q \right)} \approx \exp{\left\{-nD(Q \| P)\right\}}
\]
where \( D(Q \| P) \) is the <strong>relative entropy</strong>
or <strong>Kullback-Leibler divergence</strong>, and probability is taken under
the actual distribution of the \( X_i \), i.e., under \( P \). When the space
\( X \) takes values in is discrete, this is just the average log-likelihood
ratio,
\[
D(Q \| P) = \sum_{x}{Q(x) \log{\frac{Q(x)}{P(x)}}}
\]
with the understanding that \( 0\log{0} = 0 \). (Use L'Hopital's rule if you
don't believe that.)
<P>(Why is this the right rate function? Having told a lie to children, my
logical inhibitions are already relaxed, so I will be
utterly <strike>fallacious</strike> heuristic. What is the probability that \(
\hat{P}_n \approx Q \), for any given distribution \( Q \)? We'll make the
sample space discrete and finite, so every distribution is
really some multinomial. When \( X_1, \ldots X_n \) are of "type \( Q \)",
meaning that \( \hat{P}_n \approx Q \), the number of \( t \) with \( X_t = x \) must be \( \approx n Q(x) \) for each \( x \in \mathcal{X} \).
By
elementary combinatorics, the number of length-\( n \) samples of type \( Q \)
is \( \frac{n!}{\prod_{x \in \mathcal{X}}{(nQ(x))!}} \). (Notice that the generating
probability distribution \( P \) doesn't matter here. Also notice that I'm
blithely pretending \( nQ(x) \) is an integer; like I said, this is heuristic.) The probability
of the generating distribution \( P \) producing any one \( Q \) type sample
is \( \prod_{x \in \mathcal{X}}{P(x)^{nQ(x)}} \), by the IID assumption. So
\[
\begin{eqnarray}
\Pr(\hat{P}_n \approx Q) & = & n! \prod_{x \in \mathcal{X}}{\frac{P(x)^{nQ(x)}}{(nQ(x))!}}\\
\frac{1}{n}\log{\Pr(\hat{P}_n \approx Q)} & = & \frac{1}{n}\left(\log{(n!)} + \sum_{x\in\mathcal{X}}{n Q(x)\log{P(x)} - \log{(nQ(x))!}}\right)\\
& \approx & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{P(x)} - Q(x)\log{(nQ(x))}}\\
& = & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{(P(x))} - Q(x)\log{n} - Q(x)\log{Q(x)}}\\
& = & \sum_{x \in \mathcal{X}}{Q(x)\log{\frac{P(x)}{Q(x)}}}\\
& = & - D(Q\|P)
\end{eqnarray}
\]
using Stirling's approximation, \( \log{n!} \approx n\log{n} \), and the fact that
\( Q \) is a probability distribution, so \( \sum_{x \in \mathcal{X}}{Q(x)\log{n}} = \log{n} \).
Remarkably enough, this <strike>tissue
of fallacies</strike> heuristic sketch can be made completely rigorous, even
for continuous distributions [e.g., by a sequence of successively refined
discretizations of the continuous space].)
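<P>The type-counting heuristic can be checked numerically. The sketch below assumes a three-letter alphabet with illustrative choices \( P = (0.5, 0.3, 0.2) \) and \( Q = (0.2, 0.3, 0.5) \), and takes \( n \) to be a multiple of 10 so that \( nQ(x) \) really is an integer. It computes the exact multinomial probability of a type-\( Q \) sample (in logs, via <code>math.lgamma</code>) and compares \( -\frac{1}{n}\log \) of it with \( D(Q\|P) \).

```python
import math

P = (0.5, 0.3, 0.2)      # generating distribution (illustrative choice)
Q_WEIGHTS = (2, 3, 5)    # Q = (0.2, 0.3, 0.5), stored as integers out of 10

def log_prob_of_type(n):
    """Exact log Pr(empirical distribution == Q) for n IID draws from P.

    n must be a multiple of 10 so that every count n*Q(x) is an integer.
    """
    counts = [n * w // 10 for w in Q_WEIGHTS]
    # log of n! / prod_x (n Q(x))!
    log_multinomial = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # log of prod_x P(x)^(n Q(x))
    log_one_sample = sum(c * math.log(p) for c, p in zip(counts, P))
    return log_multinomial + log_one_sample

def divergence():
    """D(Q || P) in nats for the distributions above."""
    return sum((w / 10) * math.log((w / 10) / p) for w, p in zip(Q_WEIGHTS, P))

print("D(Q || P) =", divergence())
for n in (10, 100, 1_000, 10_000):
    print(n, -log_prob_of_type(n) / n)
```

The gap between the two numbers is of order \( (\log n)/n \), which is exactly the kind of sub-exponential correction that the crude form of Stirling's approximation throws away.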
<P>Ordinarily, in information theory, if the data comes from \( P \) and we
consider using the wrong distribution \( Q \), the penalty we pay for coding is
\( D(P \| Q ) \), which is the other way around. But we can understand why the
measures should be flipped here. If \( Q(x) = 0, P(x) > 0 \) for some \( x \),
nothing special happens, that particular \( x \) just drops out of the sum.
But if \( Q(x) > 0, P(x) = 0 \) for some \( x \), then \( D(Q\|P) = \infty \).
Said in words, if \( Q \) merely gives zero probability to some values that \( P
\) allows, it becomes exponentially unlikely that a large sample looks like \(
Q \). But if \( Q \) gives positive probability to a value that \( P \)
forbids, there is absolutely no chance that any sample looks like \( Q \).
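<P>A small sketch of the discrete divergence and its asymmetry, with distributions represented as Python dicts (a representation chosen only for this example). It applies the \( 0\log{0} = 0 \) convention and returns infinity when \( Q \) puts mass where \( P \) puts none.

```python
import math

def kl_divergence(q, p):
    """D(Q || P) in nats, for distributions given as dicts over a finite set."""
    total = 0.0
    for x, qx in q.items():
        if qx == 0.0:
            continue                 # 0 log 0 = 0: this x drops out of the sum
        px = p.get(x, 0.0)
        if px == 0.0:
            return math.inf          # Q allows a value that P forbids
        total += qx * math.log(qx / px)
    return total

fair = {"heads": 0.5, "tails": 0.5}
always_heads = {"heads": 1.0, "tails": 0.0}

print(kl_divergence(always_heads, fair))   # finite: log 2
print(kl_divergence(fair, always_heads))   # infinite
```

The finite value matches a direct calculation: a fair coin produces \( n \) heads in a row with probability \( 2^{-n} = \exp\{-n\log{2}\} \), while a coin that always lands heads can never produce a sample that looks fair.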
<P>When \(0 < D(Q \| P) < \infty\), Sanov's theorem tells us that the
probability of getting a sample that looks like \( Q \) when drawing data from
\( P \) is exponentially small. If \( D(Q \| P) = 0 \), then the probability
of getting samples like \( Q \) is decaying sub-exponentially if at all, but
one can show that \( D(Q \| P) = 0 \Leftrightarrow Q = P \), so that
probability is actually tending towards 1.
<P>I said that my statement of Sanov's theorem was rough. A more precise
statement is that, as \( n \rightarrow \infty \),
\[
-\frac{1}{n}\log{\Pr{\left(\hat{P}_n \in B\right)}} \rightarrow \inf_{Q \in
B}{D(Q \| P)}
\]
where now \( B \) is some set of probability measures. The relative entropy is
the <strong>rate function</strong>, giving the rate at which the probability of
a large deviation from the expected behavior goes to zero. One way to read the
\( \inf \) is to say that the probability of the sample being in some "bad"
set \( B \) is dominated by the least-bad distribution in the set --- "if an
improbable event happens, it tends to happen in the least-improbable way".
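<P>To see the infimum at work, here is a sketch for fair coin flips, taking \( B \) to be the set of distributions giving probability at least 0.7 to heads (both choices are illustrative assumptions). The least-bad member of \( B \) is the one with exactly 0.7 on heads, so the exact binomial tail probability should decay at rate \( D(0.7 \| 0.5) \).

```python
import math

def kl_bernoulli(q, p):
    """D(Bernoulli(q) || Bernoulli(p)) in nats, assuming 0 < p < 1."""
    total = 0.0
    for a, b in ((q, p), (1 - q, 1 - p)):
        if a > 0:                    # 0 log 0 = 0
            total += a * math.log(a / b)
    return total

def tail_rate(n):
    """-(1/n) log Pr(at least 70% heads) in n fair flips; n a multiple of 10."""
    k_min = 7 * n // 10
    # Exact integer count of favourable sequences; each has probability 2^-n.
    favourable = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return -(math.log(favourable) - n * math.log(2)) / n

print("inf over B of D(Q || P) =", kl_bernoulli(0.7, 0.5))
for n in (100, 500, 2_000):
    print(n, tail_rate(n))
```

The whole tail, not just its boundary point, is being summed here, and yet the rate is set by the boundary alone.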
<P>If \( X \) is continuous, then we could use the probability density to
define the divergence,
\[
D(Q\|P) = \int_{x}{q(x) \log{\frac{q(x)}{p(x)}} dx}
\]
<P>In yet more general cases, we fall back on measure theory, and use the Radon-Nikodym derivative,
\[
D(Q\|P) = \int{\log{\frac{dQ}{dP}(x)}dQ(x)}
\]
which reduces to the two earlier formulas in the special cases. We can still
read Sanov's theorem as above.
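<P>As a sanity check on the density form of the divergence, the sketch below integrates \( q\log{(q/p)} \) numerically for two Gaussians and compares the result with the standard closed form \( \log{\frac{\sigma_p}{\sigma_q}} + \frac{\sigma_q^2 + (\mu_q-\mu_p)^2}{2\sigma_p^2} - \frac{1}{2} \). The Gaussian choice, the integration range and the step count are all assumptions made for the example.

```python
import math

def normal_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def kl_numeric(mu_q, sigma_q, mu_p, sigma_p, lo=-20.0, hi=20.0, steps=20_000):
    """Midpoint-rule approximation to the integral of q log(q/p) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        log_q = normal_logpdf(x, mu_q, sigma_q)
        log_p = normal_logpdf(x, mu_p, sigma_p)
        # Work with log densities so that tiny tail values do not divide by zero.
        total += math.exp(log_q) * (log_q - log_p) * h
    return total

def kl_closed_form(mu_q, sigma_q, mu_p, sigma_p):
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2) - 0.5)

# Q = N(1, 1), P = N(0, 2^2)
print(kl_numeric(1.0, 1.0, 0.0, 2.0), kl_closed_form(1.0, 1.0, 0.0, 2.0))
```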
<P>(Strictly speaking, I should really distinguish between the probability of
\( \hat{P}_n \) being in the closure of \( B \) and the probability of its
being in the interior, and between \( \limsup \) and \( \liminf \) of the
rates, but those details are not important for the present purposes.)
<P>Recall that we wanted to know how quickly the empirical distribution \(
\hat{P}_n \) converges to the true distribution \( P \). Fix some neighborhood
\( N_{\epsilon} \) of \( P \), with whatever way you like of gauging how far
apart two distributions are. Then we have
\[
-\frac{1}{n}\log{\Pr{\left( \hat{P}_n \not\in N_{\epsilon} \right)}} \rightarrow \inf_{Q \not\in N_{\epsilon}}{D(Q \| P)}
\]
That infimum will be some increasing function of \( \epsilon \), say \(
r(\epsilon) \). [If we measure the distance between probability measures in
total variation, it's \( O(\epsilon^2) \).] So, roughly speaking, the
probability of being more than \( \epsilon \) away from the truth is about \(
\exp{\left\{ -nr(\epsilon) \right\}} \), neither more nor less. (It's
exponentially small, but <em>only</em> exponentially small.) More formally,
Sanov's theorem gives matching upper <em>and lower</em> bounds on the
asymptotic large deviations probability.
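<P>For a concrete \( r(\epsilon) \), take distributions on two points, where the total variation distance between Bernoulli(\( q \)) and Bernoulli(\( p \)) is just \( |q - p| \). The sketch below assumes \( p = 0.3 \) and compares \( r(\epsilon) \) with \( 2\epsilon^2 \), the lower bound from Pinsker's inequality (a standard result brought in here for comparison, not something derived above).

```python
import math

def kl_bernoulli(q, p):
    """D(Bernoulli(q) || Bernoulli(p)) in nats, assuming 0 < p < 1."""
    total = 0.0
    for a, b in ((q, p), (1 - q, 1 - p)):
        if a > 0:                    # 0 log 0 = 0
            total += a * math.log(a / b)
    return total

def rate_function(eps, p=0.3):
    """inf of D(Q || P) over Bernoulli Q at total variation >= eps from P.

    D(q || p) is convex in q with its minimum at q = p, so the infimum sits
    on the boundary of the neighbourhood, at q = p - eps or q = p + eps.
    Assumes eps < min(p, 1 - p).
    """
    return min(kl_bernoulli(p - eps, p), kl_bernoulli(p + eps, p))

for eps in (0.01, 0.05, 0.1, 0.2):
    r = rate_function(eps)
    print(eps, r, 2 * eps ** 2, r / eps ** 2)
```

The last column, \( r(\epsilon)/\epsilon^2 \), stays bounded away from zero and infinity over this range, which is what "of order \( \epsilon^2 \)" amounts to in this small example.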
<h4>Hypothesis Testing, the Neyman-Pearson Lemma, and Relative Entropy</h4>
<P>Sanov's theorem thus tells us something about the efficacy of estimating a
distribution by just sitting and counting. It can also tell us about
hypothesis testing. We take the most basic possible hypothesis testing
problem, of distinguishing between a null distribution \( P_0 \) and an
alternative distribution \( P_1 \). We know, from
the <a href="http://bactra.org/weblog/630.html">Neyman-Pearson lemma</a>, that
the optimal test is a likelihood-ratio test: say "1" when
\[
\Lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\log{\frac{p_1(x_i)}{p_0(x_i)}}}
\]
is above some threshold \( t_n \), and say "0" when it's below. The value of
this threshold is usually picked to enforce a desired false-alarm rate \(
\alpha \), i.e., <a href="http://bactra.org/weblog/630.html">it is the shadow price of power at a certain sample size</a>.
<P>There are two error probabilities here: the probability of saying "1" under
\( P_0 \), and the probability of saying "0" under \( P_1 \). The first, the
false-alarm rate, is controlled by fixing \( t_n \).
<P>Since the log likelihood ratio is a function of the sample, we can write it in terms of the empirical distribution:
\[
\Lambda_n = \mathbf{E}_{\hat{P}_n}\left[ \log{\frac{p_1(X)}{p_0(X)}} \right]
\]
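<P>A quick check that the two expressions for \( \Lambda_n \) agree, on a made-up three-letter alphabet with illustrative \( P_0 \) and \( P_1 \): once as an average over the sample, once as an expectation under the empirical distribution.

```python
import math
import random
from collections import Counter

P0 = {"a": 0.5, "b": 0.3, "c": 0.2}   # null distribution (illustrative)
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}   # alternative distribution (illustrative)

def llr_from_sample(xs):
    """Lambda_n as the sample average of log p1(x_i)/p0(x_i)."""
    return sum(math.log(P1[x] / P0[x]) for x in xs) / len(xs)

def llr_from_empirical(xs):
    """Lambda_n as the expectation of log p1(X)/p0(X) under the empirical measure."""
    n = len(xs)
    empirical = {x: count / n for x, count in Counter(xs).items()}
    return sum(weight * math.log(P1[x] / P0[x]) for x, weight in empirical.items())

rng = random.Random(1)  # fixed seed, only for reproducibility
xs = rng.choices(list(P0), weights=list(P0.values()), k=1_000)
print(llr_from_sample(xs), llr_from_empirical(xs))
```

The two numbers agree up to floating-point rounding, because the sample enters the statistic only through the counts of each symbol.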
<P>Saying that \( \Lambda_n < t_n \) is thus equivalent to saying that \(
\hat{P}_n \in B(t_n) \), for some set of distributions \( B(t_n) \). Enforcing
the constraint on the false alarm probability means that
\[
\Pr_{P_0}{\left( \hat{P}_n \in B(t_n) \right)} = 1-\alpha
\]
At the same time, again by Sanov's theorem, the probability, under \( P_0 \),
that \( -\Lambda_n \not \in (D(P_0\|P_1) - \epsilon, D(P_0\|P_1) + \epsilon) \)
is going to zero exponentially fast in \( n \). So it must be the case that \(
t_n \rightarrow -D(P_0\|P_1) \).
<P>Now what about those miss (type II error) probabilities? This is when the
test should say "1" but says instead "0".
\[
\beta_n = \Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)}
\]
So, using Sanov again,
\[
-\frac{1}{n}\log{\beta_n} = -\frac{1}{n}\log{\Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)}} \rightarrow \inf_{Q \in B(t_n)}{D(Q \| P_1)}
\]
When is \( Q \in B(t_n) \)? To keep things simple, assume we can multiply and divide by \( q(x) \) everywhere we need to.
\[
\begin{eqnarray*}
t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{p_0(x)}}\right]\\
t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{q(x)}\frac{q(x)}{p_0(x)}}\right]\\
t_n & \geq & \mathbf{E}_{Q}\left[ -\log{\frac{q(x)}{p_1(x)}}+\log{\frac{q(x)}{p_0(x)}}\right]\\
t_n & \geq & D(Q\|P_0) - D(Q\|P_1)\\
D(Q\|P_1) & \geq & D(Q\|P_0) - t_n
\end{eqnarray*}
\]
In other words, we're looking for distributions which are sufficiently closer,
in divergence, to \( P_0 \) as opposed to \( P_1 \). Since \( t_n \rightarrow
-D(P_0\|P_1) \), as we saw above, in the long run we need \( D(Q\|P_1) \geq
D(Q\|P_0) + D(P_0\|P_1) \). To make this as small as possible, set \( Q = P_0
\). Thus
\[
-\frac{1}{n}\log{\beta_n} \rightarrow D(P_0 \| P_1)
\]
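<P>This limit can be checked exactly for coin flips. The sketch assumes \( P_0 \) = Bernoulli(0.3), \( P_1 \) = Bernoulli(0.6) and \( \alpha = 0.05 \), all illustrative. Since \( p_1 > p_0 \), the likelihood ratio is increasing in the number of ones, so the test reduces to a threshold on that count. The version here is non-randomized, so its size is at most \( \alpha \) rather than exactly \( \alpha \).

```python
import math

def log_binom_pmf(k, n, p):
    """log Pr(K = k) for K ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def miss_rate_exponent(n, p0=0.3, p1=0.6, alpha=0.05):
    """-(1/n) log beta_n for the count-threshold test with size at most alpha."""
    # Find the smallest c such that Pr_{P0}(K >= c) <= alpha; say "1" when K >= c.
    tail, c = 0.0, n + 1
    for k in range(n, -1, -1):
        tail += math.exp(log_binom_pmf(k, n, p0))
        if tail > alpha:
            break
        c = k
    # beta_n = Pr_{P1}(K < c), summed in log space to avoid underflow.
    logs = [log_binom_pmf(k, n, p1) for k in range(c)]
    top = max(logs)
    log_beta = top + math.log(sum(math.exp(v - top) for v in logs))
    return -log_beta / n

d01 = 0.3 * math.log(0.3 / 0.6) + 0.7 * math.log(0.7 / 0.4)   # D(P0 || P1)
print("D(P0 || P1) =", d01)
for n in (250, 1_000, 4_000):
    print(n, miss_rate_exponent(n))
```

Convergence is slow, because the threshold sits about \( n^{-1/2} \) above \( p_0 \) at any fixed \( \alpha \), so the finite-sample exponent approaches \( D(P_0\|P_1) \) from below.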
<P>This argument can also be made to work if the size is allowed to go to zero
--- if we constrain the size to be \( \alpha_n \rightarrow 0 \). By parallel
reasoning, if we wanted to require a fixed power (constrained to \( \beta \),
or constrained to \( \beta_n \rightarrow 0 \) ) and let size go to zero, the
best exponential rate we could attain for the false alarm rate is \( D(P_1 \|
P_0) \).
<P>To sum up, the large deviation rate function gives the error rate for
optimal hypothesis tests.
<h4>The Point Where I Gave Up Writing This Out</h4>
<P>There is a natural duality between hypothesis testing and parameter
estimation. If I have a consistent estimator \( \hat{\theta} \), I can use it
to test whether \( \theta = \theta_0 \) by seeing whether \( \hat{\theta}
\approx \theta_0 \). In particular, estimators cannot converge too fast, or
they would give us tests which would break the bounds we've just seen. Lower
bounds on hypothesis testing thus breed lower bounds on estimation error. This
actually gives us the Cramer-Rao bound, as well as some refinements. (Though
there is
a <a href="https://infostructuralist.wordpress.com/2011/07/27/divergence-in-everything-cramr-rao-from-data-processing/">slicker
way</a> of deriving Cramer-Rao from divergence.)
<h4>The General Pattern</h4>
<P>Statistical inference depends on some kind of empirical distribution --- the
marginal distribution (as above), or over pairs (as in a Markov process), or
<a href="mixtures-of-processes.html">path measures</a>,
or <a href="graph-limits.html">motifs in a graph</a>. When the probability
gods are kind, that empirical distribution obeys a large deviations principle,
which gives matching upper and lower bounds for fluctuations away from the
data-generating distribution (at least on an exponential scale). The rate in
the LDP is (usually) a relative entropy / KL divergence. This is why
information-theoretic quantities control statistical inference.
<P>Further topics I'd cover here, if I had the energy:
<ul>
<li> Cashing the promissory note above about parameter estimation, including Fisher information, and the general tricks for getting at minimax rates via Fano's inequality
<li> Sufficiency and preservation of divergence
<li> Extension of these results from IID data to stochastic processes
<li> Going from source coding results to limits on inference
<li> Bahadur efficiency
<li> Empirical likelihood
</ul>
<P>See also:
<a href="entropy-estimation.html">Estimating Entropies and Informations</a>;
<a href="information-theory.html">Information Theory</a>;
<a href="large-deviations.html">Large Deviations Theory</a>;
<a href="statistics.html">Statistics</a>;
<ul>Recommended, big picture:
<li>R. R. Bahadur, <cite>Some Limit Theorems in Statistics</cite>
[1971. The notation is now much more transparent, and the proofs of many
basic theorems considerably simplified. But if there's a better source for
statistical applications than this little book, I've yet to find it.]
<li>Cover and Thomas, <cite>Elements of Information Theory</cite> [Specifically, the chapters on large deviations and on statistics]
<li>Solomon Kullback, <cite>Information Theory and Statistics</cite>
</ul>
<ul>Recommended, close-ups:
<li>Andrew Barron and Nicolas Hengartner, "Information theory and superefficiency", <a href="http://projecteuclid.org/euclid.aos/1024691358"><cite>Annals of Statistics</cite> <strong>26</strong> (1998):
1800--1825</a>
<li>M. S. Bartlett, "The Statistical Significance of Odd Bits of
Information", <a href="http://dx.doi.org/10.2307/2334019"><cite>Biometrika</cite> <strong>39</strong> (1952): 228--237</a> [A
goodness-of-fit test based on fluctuations of the
entropy. <a href="http://www.jstor.org/stable/2334019">JSTOR</a>]
<li>I. Csiszár, "Maxent, Mathematics, and Information Theory",
pp. 35--50 in Kenneth M. Hanson and Richard N. Silver (eds.), <cite>Maximum
Entropy and Bayesian Methods: Proceedings of the Fifteenth International
Workshop on Maximum Entropy and Bayesian Methods</cite>
<li>James C. Fu, "Large Sample Point Estimation: A Large Deviation Theory Approach", <a href="http://dx.doi.org/10.1214/aos/1176345869"><cite>Annals of Statistics</cite> <strong>10</strong> (1982): 762--771</a>
<li>Fuqing Gao and Xingqiu Zhao, "Delta method in large deviations and moderate deviations for estimators", <cite>Annals of Statistics</cite>
<strong>39</strong> (2011):
1211--1240, <a href="http://arxiv.org/abs/1105.3552">arxiv:1105.3552</a> [This
is based on an extension of the "contraction principle" which is of independent
interest]
<li>Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, <a href="http://arxiv.org/abs/2110.01584">arxiv:2110.01584</a>
<li>Alexander Korostelev, "A minimaxity criterion in nonparametric regression based on large-deviations probabilities", <a href="http://dx.doi.org/10.1214/aos/1032526957"><cite>Annals of Statistics</cite> <strong>24</strong> (1996): 1075--1083</a>
<li>Solomon Kullback and R. A. Leibler, "On Information and
Sufficiency",
<a href="http://dx.doi.org/10.1214/aoms/1177729694"><cite>Annals of Mathematical Statistics</cite> <strong>22</strong> (1951): 79--86</a>
<li>Neri Merhav, "Bounds on Achievable Convergence Rates of Parameter Estimators via Universal Coding", <a href="http://dx.doi.org/"><cite>IEEE Transactions on Information Theory</cite> <strong>40</strong> (1994): 1210--1215</a>
[<a href="http://webee.technion.ac.il/people/merhav/papers/esterrfrmunivcod.pdf">PDF reprint</a> via Prof. Merhav. Thanks to Max Raginsky for pointing out this interesting paper to me.]
<li>Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", <a href="http://arxiv.org/abs/1010.2286">arxiv:1010.2286</a> [<a href="https://infostructuralist.wordpress.com/2010/10/13/sincerely-your-biggest-fano/">Exposition</a>]
<li>Jonathan Scarlett and Volkan Cevher, "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation", <a href="http://arxiv.org/abs/1901.00555">arxiv:1901.00555</a>
<li>Suresh Venkatasubramanian, "A brief note on Fano's inequality",
<a href="http://blog.geomblog.org/2014/08/a-brief-note-on-fanos-inequality.html">The Geomblog, 6 August 2014</a>
<li>Aolin Xu, Maxim Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms", <a href="http://arxiv.org/abs/1705.07809">arxiv:1705.07809</a> [<a href="http://maxim.ece.illinois.edu/pubs/Raginsky%20NASIT%202019.pdf">Related slides</a> from Maxim]
</ul>
<ul>To read:
<li>Miguel A. Arcones, "Bahadur Efficiency of the Likelihood Ratio
Test" [<a href="http://www.math.binghamton.edu/arcones/prep/pv.pdf">PDF preprint</a> from 2005, presumably since published...]
<li>Bucklew, <cite>Large Deviation Techniques in Decision, Simulation,
and Estimation</cite>
<li>Imre Csiszar and Paul Shields, <cite>Information Theory and
Statistics: A Tutorial</cite>
[<a
href="http://www.renyi.hu/%7Ecsiszar/Publications/Information_Theory_and_Statistics%3A_A_Tutorial.pdf">Fulltext
PDF</a>. The only reason I don't list this as "recommended" is that I'm
sticking to my rule of not doing so unless I've read <em>all</em> of it...]
<li>Te Sun Han, "Hypothesis Testing with the General Source",
<a href="http://dx.doi.org/10.1109/18.887854"><cite>IEEE Transactions on
Information Theory</cite> <strong>46</strong> (2000): 2415--2427</a>,
<a href="http://arxiv.org/abs/math.PR/0004121">math.PR/0004121</a> ["The
asymptotically optimal hypothesis testing problem with the general sources as
the null and alternative hypotheses is studied.... Our fundamental philosophy
in doing so is first to convert all of the hypothesis testing problems
completely to the pertinent computation problems in the large
deviation-probability theory. ... [This] enables us to establish quite compact
general formulas of the optimal exponents of the second kind of error and
correct testing probabilities for the general sources including all
nonstationary and/or nonergodic sources with arbitrary abstract alphabet
(countable or uncountable). Such general formulas are presented from the
information-spectrum point of view."]
<li>Zhishui Hu, John Robinson, Qiying Wang, "Cramér-type large deviations for samples from a finite population", <cite>Annals of Statistics</cite> <strong>35</strong> (2007): 673--696, <a href="http://arxiv.org/abs/0708.1880">arxiv:0708.1880</a>
<li>Dayu Huang, Sean Meyn, "Generalized Error Exponents For Small Sample Universal Hypothesis Testing",
<a href="http://dx.doi.org/10.1109/TIT.2013.2283266"><cite>IEEE Transactions on Information Theory</cite> <strong>59</strong> (2013): 8157--8181</a>,
<a href="http://arxiv.org/abs/1204.1563">arxiv:1204.1563</a>
<li>K. Iriyama, "Error Exponents for Hypothesis Testing of the General
Source", <a href="http://dx.doi.org/10.1109/TIT.2004.842774"><cite>IEEE
Transactions on Information Theory</cite> <strong>51</strong> (2005):
1517--1522</a>
<li>D. F. Kerridge, "Inaccuracy and Inference", <cite>Journal of
the Royal Statistical Society B</cite> <strong>23</strong> (1961): 184--194
<li>Yuichi Kitamura, "Empirical likelihood methods in econometrics: Theory and Practice", <a href="http://cowles.econ.yale.edu/P/cd/d15b/d1569.pdf">Cowles Foundation Discussion Paper No. 1569</a> (2006)
<li>David J. C. MacKay, <cite>Information Theory, Inference and
Learning Algorithms</cite> [<a
href="http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html">Online
version</a>]
<li>Liam Paninski, "Asymptotic Theory of Information-Theoretic
Experimental
Design", <a
href="http://neco.mitpress.org/cgi/content/abstract/17/7/1480"><cite>Neural
Computation</cite> <strong>17</strong> (2005): 1480--1507</a>
<li>Vincent Y. F. Tan, Animashree Anandkumar, Lang Tong and Alan
S. Willsky, "A Large-Deviation Analysis of the Maximum-Likelihood Learning of
Markov Tree
Structures", <a href="http://dx.doi.org/10.1109/TIT.2011.2104513"><cite>IEEE Transactions on Information Theory</cite> <strong>57</strong> (2011): 1714--1735</a>, <a href="http://arxiv.org/abs/0905.0940">arxiv:0905.0940</a>
[Large deviations for Chow-Liu trees]
<li>José Trashorras, Olivier Wintenberger, "Large deviations for bootstrapped empirical measures", <a href="http://arxiv.org/abs/1110.4620">arxiv:1110.4620</a>
<li>Yuhong Yang and Andrew Barron,
"Information-theoretic determination of minimax rates of convergence",
<a href="http://dx.doi.org/10.1214/aos/1017939142"><cite>Annals of Statistics</cite> <strong>27</strong> (1999): 1564--1599</a>
</ul>