## Estimating Entropies and Informations

*01 Jun 2018 10:51*

The central mathematical objects in information theory are the entropies of random variables. These ("Shannon") entropies are properties of the probability distributions of the variables, rather than of particular realizations. (This is unlike the Boltzmann entropy of statistical mechanics, which is an objective property of the macroscopic state, at least once we fix our partition of microscopic states into macroscopic states. Confusing the two entropies is common but bad.) The question which concerns me here is how to estimate the entropy of the distribution, given a sample of realizations. The most obvious approach, if one knows the form of the distribution but not the parameters, is to estimate the parameters and then plug in. But it feels like one should be able to do this more non-parametrically. The obvious non-parametric estimator is of course just the entropy of the empirical distribution, sometimes called the "empirical entropy". However, the empirical distribution isn't always the best estimate of the true distribution (one might perfer, e.g., some kind of kernel density estimate). For that matter, we often don't really care about the distribution, just its entropy, so some more direct estimator would be nice.

What would be really nice would be to not just have point estimates but also confidence intervals. Non-parametrically, my guess is that the only feasible way to do this is bootstrapping.

For finite alphabets, one approach would be to use something like variable length Markov chains, or causal state reconstruction, to reconstruct a get machine capable of generating the sequence. From the machine, it is easy to calculate the entropy of words or blocks of any finite length, and even the entropy rate. My experience with using CSSR is that the entropy rate estimates can get very good even when the over-all reconstruction of the structure is very poor, but I don't have any real theory on that. I suspect CSSR converges on the true entropy rate faster than do variable length Markov chains, because the former has greater expressive power, but again I don't know that for sure.

Using gzip is a bad idea (for this purpose; it works fine for data compression).

(Thanks to Sankaran Ramakrishnan for pointing out a think-o in an earlier version.)

See also:
Bootstrapping Entropy Estimates;
Complexity, Entropy and the Physics of `gzip`

- Recommended (somewhat miscellaneous):
- D. J. Albers, George Hripcsak, "Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series", arxiv:1110.1615
- Jose M. Amigo, Janusz Szczepanski, Elek Wajnryb and Maria
V. Sanchez-Vives, "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv
Complexity",
Neural
Computation
**16**(2004): 717--736 [Normally, I have strong views on using Lempel-Ziv to measure entropy rates, but here they are using the 1976 Lempel-Ziv definitions, not the 1978 ones. The difference is subtle, but important; 1978 leads to gzip and practical compression algorithms, but very bad entropy estimates; 1976 leads, as they show numerically, to reasonable entropy rate estimates, at least for some processes. Thanks to Dr. Szczepanski for correspondence about this paper.] - J.-R. Chazottes and D. Gabrielli, "Large deviations for empirical
entropies of Gibbsian sources", Nonlinearity
**18**(2005): 2545--2563 , math.PR/0406083 [This is a very cool result which shows that block entropies, and entropy rates estimated from those blocks, obey the large deviation principle even as one lets the length of the blocks grow with the amount of data, provided the block-length doesn't grow too quickly (only logarithmically). I wish I could write papers like this.] - Simon DeDeo, Robert X. D. Hawkins, Sara Klingenstein, Tim Hitchcock. "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems", arxiv:1302.0907
- John W. Fisher III, Alexander T. Ihler and Paula A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
- Yongmiao Hong and Halbert White, "Asymptotic Distribution Theory
for Nonparametric Entropy Measures of Serial Dependence",
Econometrica
**73**(2005): 837--901 [JSTOR; PDF reprint via Prof. White] - Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins, "Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations", NIPS 2015, arxiv:1411.4342
- Matthew B. Kennel, Jonathon Shlens, Henry D. I. Abarbanel and
E. J. Chichilnisky, "Estimating Entropy Rates with Bayesian Confidence
Intervals", Neural
Computation
**17**(2005): 1531--1576 - Alexander Kraskov, Harald Stögbauer and Peter Grassberger,
"Estimating Mutual Information", Physical Review
E
**69**(2004): 066138, cond-mat/0305641 - Dávid Pál, Barnabás Póczos, Csaba Szepesvári, "Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs", arxiv:1003.1954
- Liam Paninski, "Estimation of Entropy and Mutual Information",
Neural Computation
**15**(2003): 1191--1253 [Preprint; code] - Barnabas Poczos, Jeff Schneider, "On the Estimation of alpha-Divergences", AIStats 2011
- Thomas Schuermann and Peter Grassberger, "Entropy estimation of
symbol sequences," Chaos
**6**(1996): 414--427, cond-mat/0203436 - Jonathon Shlens, Matthew B. Kennel, Henry D. I. Abarbanel, E. J. Chichilnisky, "Estimating Information Rates with Confidence Intervals in Neural Spike Trains", Neural Computation
**19**(2007): 1683--1719 - Jonathan D. Victor, "Asymptotic Bias in Information Estimates and
the Exponential (Bell) Polynomials", Neural
Computation
**12**(2000): 2797--2804 [Calculates the bias in the empirical entropy, as an estimator of the true entropy, under IID sampling of a discrete space. Interestingly, the first-order (1/n) term in the bias does not depend on the actual distribution, though the higher-order terms do.] - Vincent Q. Vu, Bin Yu, Robert E. Kass, "Information In The Non-Stationary Case", arxiv:0806.3978
- Benjamin Weiss, Single Orbit Dynamics [Discusses procedures for non-parametrically estimating entropy of suitably ergodic sources, using just one realization of the process.]

- To read:
- Evan Archer, Il Memming Park, Jonathan Pillow, "Bayesian Entropy Estimation for Countable Discrete Distributions", arxiv:1302.0328
- M. S. Baptista, E. J. Ngamga, Paulo R. F. Pinto, Margarida Brito, J. Kurths, "Kolmogorov-Sinai entropy from recurrence times", arxiv:0908.3401
- Andre C. Barato, David Hartich and Udo Seifert, "Rate of Mutual Information Between Coarse-Grained Non-Markovian Variables", Journal of Statistical Physics
**153**(2013): 460--478 - D. Benedetto, E. Caglioti, G. Cristadoro, M. Degli Esposti, "Relative entropy via non-sequential recursive pair substitutions", arxiv:1007.3384 [The first two authors were also the lead authors of the epic-fail "Language Trees and Zipping" paper, arxiv:cond-mat/0108530, but perhaps they've improved.]
- Visar Berisha, Alfred O. Hero, "Empirical non-parametric estimation of the Fisher Information",
IEEE Signal Processing
Letters
**22**(2015): 988--992, arxiv:1408.1182 - Juan A. Bonachela, Haye Hinrichsen, Miguel A. Munoz, "Entropy estimates of small data sets", arxiv:0804.4561
- Salim Bouzebda and Issam Elhattab, "Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy", Electronic Journal of Statistics
**5**(2011): 440--459 - Clive G. Bowsher, Margaritis Voliotis, "Mutual Information and Conditional Mean Prediction Error", arxiv:1407.7165
- H. Cai, S. R. Kulkarni and S. Verdu, "Universal Entropy Estimation
via Block Sorting", IEEE Transactions on Information Theory
**50**(2004): 1551--1561 - C. J. Cellucci, A. M. Albano and P. E. Rapp, "Statistical
validation of mutual information calculations: Comparison of alternative
numerical algorithms", Physical Review
E
**71**(2005): 066208 - Ishanu Chattopadhyay, Hod Lipson, "Computing Entropy Rate Of Symbol Sources & A Distribution-free Limit Theorem", arxiv:1401.0711
- J.-R. Chazottes and E. Uglade, "Entropy estimation and fluctuations of Hitting and Recurrence Times for Gibbsian sources", math.DS/0401093
- Gabriela Ciuperca and Valerie Girardin, "Estimation of the
Entropy Rate of a Countable Markov Chain", Communications in Statistics: Theory and Methods
**36**(2007): 2543--2557 - G. Ciuperca, V. Girardin and L. Lhote, "Computation and Estimation of Generalized Entropy Rates for Denumerable Markov Chains", IEEE Transactions on Information Theory
**57**(2011): 4026--4034 [The estimation is just plugging in the MLE of the parameters, for finitely-parametrized chains, but they claim to show that works well] - J.-R. Chazottes, C. Maldonado, "Concentration bounds for entropy estimation of one-dimensional Gibbs measures", arxiv:1102.1816
- Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets
Using Efficient Estimation of High-Dimensional Mutual Information", IEEE
Transactions on Neural Networks
**16**(2005): 213--224 - Peter Clifford and Ioana Ada Cosma, "A simple sketching algorithm for entropy estimation", AISTATS 2013 196--206, arxiv:0908.3961
- Marshall Crumiller, Bruce Knight, Yunguo Yu and Ehud Kaplan, "Estimating the amount of information conveyed by a population of neurons" [PDF preprint via Dr. Kaplan]
- J. M. Finn, J. D. Goettee, Z. Toroczkai, M. Anghel and B. P. Wood,
"Estimation of entropies and dimensions by nonlinear symbolic time series
analysis", Chaos
**13**(2003): 444--456 - Shuyang Gao, Greg Ver Steeg and Aram Galstyan
- "Estimating Mutual Information by Local Gaussian Approximation", UAI 2015
- "Efficient Estimation of Mutual Information for Strongly Dependent Variables", arxiv:1411.2003

`<li>Yun Gao, Ioannis Kontoyiannis, Elie Bienenstock <ul> <li>"From the entropy`

to the statistical structure of spike trains", arxiv:0710.4117

- "Estimating the entropy of binary time series: Methodology, some theory and a simulation study", arxiv:0802.4363

**17**(2005): 277--297

**10**(2009): 1469--1484

**79**(2009): 056202

**30**(2018): 885--944

**48**(2005): 633--657, arxiv:cs/0403025

**59**(2013): 6220--6242, arxiv:1201.2334

- "Order-Optimal Estimation of Functionals of Discrete Distributions", arxiv:1406.6956
- "Maximum Likelihood Estimation of Functionals of Discrete Distributions", arxiv:1406.6959

**76**(2007): 026209

**20**(2016): 2358--2361

**36**(2008): 2153--2182, arxiv:0810.5302

**79**(2009): 046208

**96**(2006): 254102

**36**(2007): 1577--1589

**22**(2010): 998--1024

*m*Bins Given Fewer Than

*m*Samples", IEEE Transactions on Information Theory

**50**(2004): 2200--2203

**17**(2005): 1962--2005

- "Bias Analysis in Entropy Estimates", cond-mat/0403192
- "Scaling behaviour of entropy estimates," cond-mat/0203409
- "A Note on Entropy Estimation",
Neural Computation
**27**(2015): 2097--2106

**58**(2012): 1940--1952

**59**(2013): 4374--4388

- "Empirical estimation of entropy functionals with confidence", arxiv:1012.4188
- "Estimation of Nonlinear Functionals of Densities With Confidence", IEEE Transactions on Information Theory
**58**(2012): 4135--4159

**26**(2014): 1840--1872

- "Entropy Estimation in Turing's Perspective", Neural Computation
**24**(2012): 1368--1389 - "A Normal Law for the Plug-in Estimator of Entropy",
IEEE Transactions on
Information Theory
**58**(2012): 2745--2747

**26**(2014): 2570--2593 [Kullback did not spell his name with an umlaut]