Estimating Entropies and Informations

01 Jun 2018 10:51

The central mathematical objects in information theory are the entropies of random variables. These ("Shannon") entropies are properties of the probability distributions of the variables, rather than of particular realizations. (This is unlike the Boltzmann entropy of statistical mechanics, which is an objective property of the macroscopic state, at least once we fix our partition of microscopic states into macroscopic states. Confusing the two entropies is common but bad.) The question which concerns me here is how to estimate the entropy of the distribution, given a sample of realizations. The most obvious approach, if one knows the form of the distribution but not the parameters, is to estimate the parameters and then plug in. But it feels like one should be able to do this more non-parametrically. The obvious non-parametric estimator is of course just the entropy of the empirical distribution, sometimes called the "empirical entropy". However, the empirical distribution isn't always the best estimate of the true distribution (one might prefer, e.g., some kind of kernel density estimate). For that matter, we often don't really care about the distribution, just its entropy, so some more direct estimator would be nice.
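To make the plug-in ("empirical entropy") estimator concrete, here is a minimal sketch for a discrete sample; the Miller-Madow term shown is one standard first-order bias correction, included purely for illustration and not something the discussion above endorses:

```python
from collections import Counter
from math import log, log2

def plugin_entropy(sample):
    """Entropy (in bits) of the empirical distribution of a sample."""
    n = len(sample)
    counts = Counter(sample)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def miller_madow_entropy(sample):
    """Plug-in entropy plus the Miller-Madow bias correction.

    The plug-in estimate is biased downward; the correction adds
    (k - 1) / (2 n ln 2) bits, with k the number of observed symbols
    and n the sample size."""
    n = len(sample)
    k = len(set(sample))
    return plugin_entropy(sample) + (k - 1) / (2 * n * log(2))
```

For a sample with equal counts over two symbols, `plugin_entropy` returns exactly 1 bit; with small samples over large alphabets the two estimators diverge noticeably, which is the regime where the bias problem bites.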

What would be really nice would be to not just have point estimates but also confidence intervals. Non-parametrically, my guess is that the only feasible way to do this is bootstrapping.
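A percentile bootstrap along those lines might look like the following sketch (the plug-in estimator is re-used as the resampled statistic; the confidence level, replicate count, and function names are all illustrative choices, and the usual caveats about bootstrap coverage near the boundary of the parameter space apply):

```python
import random
from collections import Counter
from math import log2

def plugin_entropy(sample):
    """Entropy (in bits) of the empirical distribution."""
    n = len(sample)
    return -sum((c / n) * log2(c / n) for c in Counter(sample).values())

def bootstrap_entropy_ci(sample, level=0.90, reps=2000, seed=0):
    """Percentile bootstrap confidence interval for the entropy.

    Resamples the data with replacement, recomputes the plug-in
    entropy each time, and reads off the empirical quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(plugin_entropy(rng.choices(sample, k=n))
                   for _ in range(reps))
    alpha = (1 - level) / 2
    return stats[int(alpha * reps)], stats[int((1 - alpha) * reps) - 1]
```

Note that this naive resampling treats the data as i.i.d.; for dependent sequences one would need a block bootstrap or similar.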

For finite alphabets, one approach would be to use something like variable length Markov chains, or causal state reconstruction, to reconstruct a machine capable of generating the sequence. From the machine, it is easy to calculate the entropy of words or blocks of any finite length, and even the entropy rate. My experience with using CSSR is that the entropy rate estimates can get very good even when the over-all reconstruction of the structure is very poor, but I don't have any real theory on that. I suspect CSSR converges on the true entropy rate faster than do variable length Markov chains, because the former has greater expressive power, but again I don't know that for sure.
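For a reconstructed machine the entropy-rate calculation is indeed mechanical: it is the transition entropy averaged over the stationary distribution of states, \(h = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}\). A sketch for the simplest case, an ordinary first-order Markov chain given as a row-stochastic matrix (the function names and the power-iteration shortcut for the stationary distribution are my own illustrative choices):

```python
from math import log2

def stationary_distribution(P, iters=10_000):
    """Approximate the stationary distribution of a row-stochastic
    matrix P (list of lists) by power iteration from the uniform
    distribution; assumes the chain is ergodic."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def markov_entropy_rate(P):
    """Entropy rate in bits per symbol:
    h = -sum_i pi_i sum_j P_ij log2 P_ij."""
    pi = stationary_distribution(P)
    k = len(P)
    return -sum(pi[i] * P[i][j] * log2(P[i][j])
                for i in range(k) for j in range(k) if P[i][j] > 0)
```

For a fair coin, `markov_entropy_rate([[0.5, 0.5], [0.5, 0.5]])` gives 1 bit per symbol; for a sticky chain with 0.9 self-transition probability it gives the binary entropy of 0.1, about 0.469 bits, even though the one-symbol marginal is still uniform.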

Using gzip is a bad idea (for this purpose; it works fine for data compression).

(Thanks to Sankaran Ramakrishnan for pointing out a think-o in an earlier version.)

See also: Bootstrapping Entropy Estimates; Complexity, Entropy and the Physics of gzip

  • M. N. Goria, N. N. Leonenko, V. V. Mergel and P. L. Novi Inverardi, "A new class of random vector entropy estimators and its applications in testing statistical hypotheses", Journal of Nonparametric Statistics 17 (2005): 277--297
  • Peter Grassberger, "Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution," physics/0207023 [On Jimenez-Montano, Ebeling and Poeschel]
  • Jean Hausser and Korbinian Strimmer, "Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks", Journal of Machine Learning Research 10 (2009): 1469--1484
  • Detlef Holstein and Holger Kantz, "Optimal Markov approximations and generalized embeddings", Physical Review E 79 (2009): 056202
  • Wentao Huang and Kechen Zhang, "Information-Theoretic Bounds and Approximations in Neural Population Coding", Neural Computation 30 (2018): 885--944
  • Marcus Hutter, Marco Zaffalon, "Distribution of Mutual Information from Complete and Incomplete Data", Computational Statistics and Data Analysis 48 (2005): 633--657, arxiv:cs/0403025
  • Siddharth Jain, Rakesh Kumar Bansal, "On Large Deviation Property of Recurrence Times", ISIT 2013, arxiv:1303.1093
  • Jiantao Jiao, Haim H. Permuter, Lei Zhao, Young-Han Kim, Tsachy Weissman, "Universal Estimation of Directed Information", IEEE Transactions on Information Theory 59 (2013): 6220--6242, arxiv:1201.2334
  • Jiantao Jiao, Kartik Venkat, Tsachy Weissman
    • "Order-Optimal Estimation of Functionals of Discrete Distributions", arxiv:1406.6956
    • "Maximum Likelihood Estimation of Functionals of Discrete Distributions", arxiv:1406.6959
  • Miguel Angel Jimenez-Montano, Werner Ebeling, and Thorsten Poeschel, "SYNTAX: A computer program to compress a sequence and to estimate its information content," cond-mat/0204134
  • David Källberg, Nikolaj Leonenko, Oleg Seleznjev, "Statistical Inference for Rényi Entropy Functionals", arxiv:1103.4977
  • Alexei Kaltchenko, "Algorithms for Estimating Information Distance with Applications to Bioinformatics and Linguistics", cs.CC/0404039
  • Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, III, Vladimir Protopopescu, and George Ostrouchov, "Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data", Physical Review E 76 (2007): 026209
  • John M. Konstantinides and Ioannis Andreadis, "Optimal Code Length Estimates From Dependent Samples With Bounds on the Estimation Error", IEEE Communications Letters 20 (2016): 2358--2361
  • Ioannis Kontoyiannis, P. H. Algoet, Yu. M. Suhov and A. J. Wyner, "Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text"
  • Ioannis Kontoyiannis, Maria Skoularidou, "Estimating the Directed Information and Testing for Causality", arxiv:1507.01234
  • Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos, Larry Wasserman, "Nonparametric Estimation of Renyi Divergence and Friends", arxiv:1402.2966
  • Nikolai Leonenko, Luc Pronzato and Vippal Savani, "A class of Rényi information estimators for multidimensional densities", Annals of Statistics 36 (2008): 2153--2182, arxiv:0810.5302
  • Annick Lesne, Jean-Luc Blanc and Laurent Pezard, "Entropy estimation of very short symbolic sequences", Physical Review E 79 (2009): 046208
  • Christophe Letellier, "Estimating the Shannon Entropy: Recurrence Plots versus Symbolic Dynamics", Physical Review Letters 96 (2006): 254102
  • Johan Lim, "Estimation of the Entropy Functional from Dependent Samples", Communications in Statistics: Theory and Methods 36 (2007): 1577--1589
  • Tiger W. Lin and George N. Reeke, "A Continuous Entropy Rate Estimator for Spike Trains Using a K-Means-Based Context Tree", Neural Computation 22 (2010): 998--1024
  • Han Liu, Larry Wasserman and John D. Lafferty, "Exponential Concentration for Mutual Information Estimation with Application to Forests", NIPS 2012
  • Kevin R. Moon, Alfred O. Hero III, "Multivariate f-Divergence Estimation With Confidence", arxiv:1411.2045
  • Ilya Nemenman, "Inference of entropies of discrete random variables with unknown cardinalities," physics/0207009
  • Ilya Nemenman, William Bialek and Rob de Ruyter van Steveninck, "Entropy and information in neural spike trains: Progress on the sampling problem", physics/0306063
  • XuanLong Nguyen, Martin J. Wainwright, Michael I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization", arxiv:0809.0853
  • Sebastian Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", arxiv:1206.4620
  • Leandro Pardo, Statistical Inference Based on Divergence Measures
  • Liam Paninski, "Estimating Entropy on m Bins Given Fewer Than m Samples", IEEE Transactions on Information Theory 50 (2004): 2200--2203
  • Angeliki Papana and Dimitris Kugiumtzis, "Evaluation of Mutual Information Estimators for Time Series", arxiv:0904.4753
  • Paulo R. F. Pinto, M. S. Baptista, Isabel S. Labouriau, "Density of first Poincaré returns, periodic orbits, and Kolmogorov-Sinai entropy", arxiv:0908.4575
  • Barnabas Poczos, Zoubin Ghahramani, Jeff Schneider, "Copula-based Kernel Dependency Measures", arxiv:1206.4682
  • G. Pola, R. S. Petersen, A. Thiele, M. P. Young and S. Panzeri, "Data-Robust Tight Lower Bounds to the Information Carried by Spike Times of a Neuronal Population", Neural Computation 17 (2005): 1962--2005
  • Avraham Ruderman, Mark Reid, Dario Garcia-Garcia, James Petterson, "Tighter Variational Representations of f-Divergences via Restriction to Probability Measures", arxiv:1206.4664
  • Luis G. Sanchez Giraldo, Murali Rao, Jose C. Principe, "Measures of Entropy from Data Using Infinitely Divisible Kernels", arxiv:1211.2459
  • Thomas Schürmann
  • J. F. Silva and S. Narayanan, "Complexity-Regularized Tree-Structured Partition for Mutual Information Estimation", IEEE Transactions on Information Theory 58 (2012): 1940--1952
  • Kumar Sricharan, "Ensemble Estimators for Multivariate Entropy Estimation", IEEE Transactions on Information Theory 59 (2013): 4374--4388
  • Kumar Sricharan, Alfred O. Hero III, "Ensemble estimators for multivariate entropy estimation", arxiv:1203.5829
  • Kumar Sricharan, Raviv Raich, Alfred O. Hero III
  • Taiji Suzuki, Masashi Sugiyama and Toshiyuki Tanaka, "Mutual information approximation via maximum likelihood estimation of density ratio", ISIT 2009 [PDF preprint via Prof. Sugiyama]
  • Evgeniy Timofeev, Alexei Kaltchenko, "Nearest-neighbor Entropy Estimators with Weak Metrics", arxiv:1205.5856
  • Nicholas Watters and George N. Reeke, "Neuronal Spike Train Entropy Estimation by History Clustering", Neural Computation 26 (2014): 1840--1872
  • Yihong Wu, Pengkun Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation", arxiv:1407.0381
  • Zhiyi Zhang
  • Zhiyi Zhang and Michael Grabchak, "Nonparametric Estimation of Küllback-Leibler Divergence", Neural Computation 26 (2014): 2570--2593 [Kullback did not spell his name with an umlaut]