Notebooks
http://bactra.org/notebooks
Cosma's NotebooksenMaximum Entropy Methods (MaxEnt)
http://bactra.org/notebooks/2019/08/15#max-ent
<!--
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
!-->
<P>The maximum entropy method is usually stated in a deceptively simple way:
from among all the probability distributions compatible with empirical data,
pick the one with the
highest <a href="information-theory.html">information-theoretic</a> entropy.
To really understand where this comes from, and appreciate it at its proper
worth, we need to look at its origins in equilibrium <a href="stat-mech.html">statistical mechanics</a>.
<P>We start with an assemblage of N particles; they each have a
three-dimensional position and momentum, and possibly some internal degrees of
freedom, so the dimension of the microscopic state space, or phase space, is at
least 6N. We define some macroscopic observables; these are functions of the
microscopic state, so they partition phase space into regions where the
macrovariables are constant. A macroscopic state is a value for the
macroscopic variables, which, in extension, corresponds to one of these regions
of phase space. Boltzmann's postulate is that the probability of a macroscopic
state is proportional to the <em>volume</em> of microscopic phase space
compatible with it. The equilibrium macrostate is the most-probable, largest
volume macrostate. (There is no notion of an equilibrium microstate.) The
logarithm of this volume is (proportional) to the Boltzmann entropy, which is
the same as the thermodynamic entropy. If we could somehow fix the value of
the macroscopic variables, Boltzmann predicts a distribution over phase space
which is uniform over the compatible microstates.
<P>A remarkable result comes from this. (The essence of this goes back to
Boltzmann.) Suppose that the macroscopic observables are extensive, so that if
we divide the assemblage into two parts, the over-all value of the
macrovariables is the sum of separate contributions from the two
sub-assemblages. Make one sub-assemblage, the "system", very small in
comparison to the other, the "environment". Fix the vector of extensive
macro-variables in the over-all assemblage at the value $ t $; then
$ t = t_{\mathrm{sys}} + t_{\mathrm{env}} $. Then the <em>marginal</em>
distribution for the microstate of the system, $ x_{\mathrm{sys}} $, is not
uniform but an an <a href="exponential-families.html">exponential family</a>
distribution, with the macroscopic observables being the sufficient statistics.
The reason is that any system microstate $ x_{\mathrm{sys}} $ will have a
probability proportional to the volume of phase space in
the <em>environment</em>, $ V_{\mathrm{env}} $, which is compatible with the
corresponding macrovariable:
\[
p(x_{\mathrm{sys}}) \propto V_{\mathrm{env}}(t-t_{\mathrm{sys}}(x_{\mathrm{sys}})) = e^{H_{\mathrm{env}}(t-t_{\mathrm{sys}}(x_{\mathrm{sys}}))}
\]
where $ H = \log{V} $ is the Boltzmann entropy. To make use of this, we
compare the probability of two system microstates, $ x_1 $
and $ x_2 $, with two different values of the macroscopic
observables, $ t_1 $ and $ t_2 $:
\[
\frac{p(x_1)}{p(x_2)} = \frac{e^{H_{\mathrm{env}}(t-t_1)}}{e^{H_{\mathrm{env}}(t-t_2)}} = e^{H_{\mathrm{env}}(t-t_1) - H_{\mathrm{env}}(t-t_2)}
\]
<P>The crucial step is the next one: because the system is much smaller than
the environment, and the macroscopic observables are extensive, the total value
of the observables for the whole assemblage <i>t</i> should be much greater
than either <i>t</i><sub>1</sub> or <i>t</i><sub>2</sub>. Assuming
that <i>H</i><sub>env</sub> is a nice function, we can then expand it in a
Taylor series about <i>t</i>, and discard terms after first order:
\[
\frac{p(x_1)}{p(x_2)} \approx e^{H_{\mathrm{env}}(t) - t_1 \cdot H^{\prime} - H_{\mathrm{env}}(t) + t_2 \cdot h^{\prime}} = e^{(t_2 - t_1) \cdot H^{\prime}}
\]
where \( H^{\prime} = {\left. \nabla H_{\mathrm{env}}\right|}_t \).
(Notice that we are doing a Taylor expansion <em>inside</em> an exponential, so
the approximation is going to be rather loose, unless the system is indeed
truly quite small compared to the environment.) Thus
\[
p(x_{\mathrm{sys}}) \propto e^{- t_{\mathrm{sys}}(x_{\mathrm{sys}}) \cdot H^{\prime}}
\]
which is an exponential family. The natural sufficient statistics are the
extensive macroscopic variables, and the natural parameters are \( - H^{\prime}
\), the partial derivatives of the environment's Boltzmann
entropy <i>H</i><sub>env</sub> with respect to the extensive variables.
Thermodynamics tells us to call the components of \( - H^{\prime} \) the
intensive variables conjugate to the extensive ones, so we get (inverse)
temperature as the derivative of entropy with respect to energy, pressure for
the volume derivative, chemical potential for molecular species number, etc.
(Note that the temperature of the <em>system</em> depends on the derivative of
the <em>environment's</em> entropy. At equilibrium, since the entropy of the
whole assemblage is maximized, the gradient of environment entropy has to equal
the gradient for system entropy, which sounds better. It also tells us that
equilibrium requires constancy of the intensive thermodynamic variables.) In
this derivation, the system will have certain expectation values for the
extensive variables, but these are secondary consequences of the natural
parameters/intensive variables, which come from the environment's entropy and
the global totals of the assemblage. The system will also fluctuate around
these expectation values.
<P>If we want to deal with the whole assemblage of particles, the uniform
distribution over a sub-space is really a very awkward mathematical object. It
would be much more convenient, calculationally, to have an exponential family.
Gibbs had a brilliant idea about how to get one. Instead of fixing
the <em>actual</em> values of the macroscopic observables, fix
their <em>expectation</em> values. This does not pick out a unique
distribution, but, acting by analogy from equilibrium maximizing the Boltzmann
entropy, maximize the <strong>Gibbs entropy</strong>,
\[
S[p] = -\int{p(x) \log{p(x)} dx}
\]
where <i>x</i> is of course an abbreviation for the huge vector of coordinates
in phase space.
<P>Boltzmann's global microstate distribution is uniform over the set
\[
\left\{ x: T(x) = t \right\}
\]
Gibbs's distribution is the solution to
\[
\max_{p}{-\int{p(x) \log{p(x)} dx}} ~ \mathrm{such\ that}~\int{p(x) T(x) dx} = t, ~ \int{p(x) dx} = 1
\]
We can turn this constrained optimization problem into an unconstrained
one through introducing a Lagrange multiplier for each macroscopic
observable (and another to make sure the distribution integrates to 1):
\[
\max_{p}{-\int{p(x) \log{p(x)} dx}} + \lambda_0\left(\int{ p(x) dx} - 1\right) + \sum_{i=1}^{d}{\lambda_i \left(\int{p(x) T_i(x) dx } - t_i\right)}
\]
Take the derivative with respect to $ p(x) $, and set it equal to
zero at the maximum. (If you are worried about functional-calculus issues, I
applaud your mathematical caution, but you will never make it as a physicist.)
The solution, $ p^* $, satisfies
\[
-\log{p^*(x)} - 1 + \lambda_0 +\sum_{i=1}^{d}{\lambda_i T_i(x)} = 0
\]
yielding
\[
p^*(x) = e^{\lambda_0-1 + \lambda \cdot T(x)}
\]
which is an exponential family again. The factor \( e^{\lambda_0-1} \) is
just the normalization constant, and not very interesting. The intensive
parameters in \( \lambda \) are implicit functions of the constrained
expectation values <i>t</i>, and are the thermodynamic variables conjugate to
the extensive macroscopic observables.
<P>Since we now have an exponential family again, calculation is very easy.
Moreover, one can show that if <i>N</i> is very large and <i>t</i> is the
equilibrium value, then
\( S[p^*(t)] \approx H[t] \). (More exactly, the difference between the
two is <i>o</i>(<i>N</i>), so the specific entropies coincide.) So for many
calculational purposes, we can replace the awkward Boltzmann distribution with
the slick Gibbs distribution. Moreover, we get exactly the same form as
Boltzmann gives us for the marginal distribution of a small part of an
equilibrium assemblage. Thus we get to use the same math twice, which always
makes physicists happy. So, three cheers for Gibbs, and for
maximum entropy reasoning in equilibrium statistical mechanics.
<P>Now, ever since the pioneering work of E. T. Jaynes in the 1950s, the idea
of maximum entropy has come to mean something different, or at least broader.
Jaynes gave, or claimed to give, a completely general prescription for
inferring distributions from data, which started with the observation
that the Gibbs entropy looks <em>exactly the same</em> as Shannon's
<a href="information-theory.html">information-theoretic</a> entropy. (That's
why Shannon called it "entropy", after all.) According to Jaynes, then, the
"least biased" guess at a distribution is to find the distribution of highest
entropy such that the expectation value of our selected observables equals
their observed values. Mathematically, of course, this just leads to the same
exponential family as before, with parameters set to enforce the
expectation-value constraints. But within an exponential family, the
maximum-likelihood estimate of the parameters is the one where observations
match expectations, so the maximum entropy distribution is the
maximum-likelihood estimate within that exponential family. For the Jaynesian,
max-ent school, statistical mechanics follows from a general rule of the logic
of inductive inference, namely to maximize entropy under constraints. The
exponential family distribution is primary, and Boltzmann's
uniform-within-a-macrostate distribution is idle. Moreover, this perspective
opens, or seems to open, all sorts of connections between informational and
physical quantities. I think it is safe to say that this idea is pretty
thoroughly entrenched among physicist who are interested in the intersection of
their subject with <a href="physics-computation-information.html">information
theory and computation</a>, as I am.
<P>As a general prescription for statistical inference, however, I think
max-ent sucks. Yet lots of the distributions we encounter in practice are not
exponential families. (A subtlety: you can always get your favorite
probability density <i>f</i> by claiming to constrain log<i>f</i>, but
generally this will not give you the right <em>family</em> of distributions.)
Something then is obviously lacking in a prescription which always and only
gives us exponential families.
<P>One of the places it goes wrong is at the beginning, with the constraint
rule. The idea of constraining expectation values to match observations is
completely unmotivated in logic or probability theory, but without it of course
one gets very different distributions indeed. (See Uffink.) It gets a
deceptive amount of plausibility from our experience with unimodal (and indeed
sharply peaked) distributions, where the expectation value is close to the most
probable value (the mode), but this is very far from being the general case.
In
<a href="ergms.html">exponential families of random graphs</a>, for instance,
it is very common for their to be <em>two</em> modes, with the expectation
value in between them --- and <a href="../weblog/361.html">deeply
improbable</a>. So even though one is maximizing entropy and likelihood, and
making expectations match observations, the only natural conclusion is that
there's no way in Hell that model would give you what you actually observed.
(Jaynes was canny enough to recognize the importance of testing his models, but
the procedures he used were of course inconsistent with his supposedly-Bayesian
ideology, and it does nothing to redeem the general prescription.) As
Seidenfeld points out, taking into account exactly this sort
of <em>distribution</em> of observations is crucial to proper statistical
inference, but systematically neglected in this recipe.
<P>More broadly, there is a deep disanalogy between the situation faced by the
statistical data analyst and that faced by the statistical mechanic. The data
analyst <em>has all the data</em> --- measurements on each sample; also,
generally, some idea of the dependence structure between samples. Assuming
independence for simplicity, the data analyst can always reduce the data by
taking the empirical distribution (= order statistics or histogram, as
applicable), but no more. Constrain the expected empirical distribution and
maximize the entropy, and in general you get --- the empirical distribution.
Any further reduction of the data, beyond the empirical distribution, imposes
assumptions about what the true distribution is --- it's saying that certain
statistics alone are <a href="sufficient-statistics.html">sufficient</a>, which
is implicitly ruling out all the models in which those are <em>not</em>
sufficient. This may be a good idea, and it might even be a good idea to use
an exponential family, but dictated by the fundamental logic of inductive
inference it is not.
<P>The situation facing us in statistical mechanics is very different. Rather
than having lots of observations on comparable units or samples, we have only
measurements of the macroscopic observables, which are a small number of
coarse-grained and <em>collective</em> degrees of freedom. To be in the same
situation as the data analyst, we would have to have lots of measurements of
individual molecules. Or rather, to be in the same position as the statistical
mechanic, the data analyst would have to be forbidden from looking at
individual samples, and allowed to see only certain more-or-less complicated
functionals of the empirical distribution.
<P>This last point is in fact the clue to why, mathematically, maximum entropy
often works as an approximate method. (What follows is shamelessly ripped off
from writers like Richard Ellis and Imre Csiszar.) One of the fundamental
results in
<a href="large-deviations.html">large deviations</a> theory is something called
Sanov's Theorem, which concerns the deviations of the empirical distribution
away from the true distribution. (I'll go over the independent-samples version
now for simplicity.) It needs a moment or two of set-up. If we have two
distributions $ P $ and $ Q $, with densities $ p $ and $ q $, then their <strong>relative entropy</strong> or <strong>Kullback-Leibler divergence</strong> is
\[
D(P\|Q) \equiv \int{p(x) \log{\frac{p(x)}{q(x)}} dx} \geq 0
\]
with \( D(P \| Q) = 0 \) if and only if $ P= Q $. If we
write $ \hat{P}_n $ for the empirical distribution after $ n $
samples, then Sanov's theorem concerns the probability that it falls into some
set of distributions, generically $ A $. Specifically, it says that
\[
\frac{1}{n}\log{P(\hat{P}_n \in A)} \rightarrow -\inf_{Q \in A}{D(Q\|P)}
\]
(Actually, the precise statement is a little more complicated, to handle some
subtle points about differences between open and closed sets.) Now suppose
that the empirical distribution is already known to satisfy some constraints,
say on the averages of certain quantities, i.e., we know that $ Q $ is in
$ C $. Then, under some further regularity conditions,
\[
\frac{1}{n}\log{P(\hat{P}_n \in A| \hat{P}_n \in C)} \rightarrow -\inf_{Q \in A\cap C}{D(Q\|P)} + \inf_{R \in C}{D(R\|P)}
\]
Very approximately, then, for $ Q $ within the constraint set $ C $,
\[
P(\hat{P}_n \approx Q) \approx \exp{\left\{-n\left[D(Q\|P) - D(R(C)\|P)\right]\right\}}
\]
where $ R(C) $ is the distribution in the constraint set which
minimizes the relative entropy. If the true, generating distribution $ P $
is uniform, then \( D(Q\|P) = \mathrm{constant} - S[Q] \), so
\[
P(\hat{P}_n \approx Q) \propto \exp{\left\{n S[Q]\right\}}
\]
at least to leading order in the exponent.
<P>Let me sum up the previous long and complicated paragraph. If we draw a
very large sample from a uniform distribution, and throw out all the samples
which do not have certain average values, then with exponentially-large
probability, the <em>empirical distribution</em> of the remaining samples will
be very close to the one which maximizes the Gibbs-Shannon entropy under the
constraints. Maximizing the entropy under the constraints then provides a good
approximation to the sample distribution, though one which ignores sampling
fluctuations. Max-Ent, in other words, works not because of inductive logic,
but as a short-cut for exact probability calculations under special
circumstances --- much as Boltzmann would have said. Indeed, statistical
mechanics as a whole can be built up in a mathematically sound and (to my eyes)
pleasing manner on the basis of large deviations theory, with maximum-entropy
reasoning as a useful asymptotic hack.
<P>See also:
<a href="exponential-families.html">Exponential Families of
Probability Measures</a>;
<a href="information-theory-and-statistics.html">Information Theory and Large Deviations in the Foundations of Statistics</a>;
<a href="large-deviations.html">Large Deviations</a>;
<a href="stat-mech-foundations.html">Foundations of Statistical
Mechanics</a>;
<a href="stat-mech.html">Statistical Mechanics</a>;
<a href="statistics.html">Statistics</a>;
<a href="tsallis.html">Tsallis Statistics</a>
<ul>Recommended:
<li>I. Csiszár, "Maxent, Mathematics, and Information Theory",
pp. 35--50 in Kenneth M. Hanson and Richard N. Silver (eds.), <cite>Maximum
Entropy and Bayesian Methods: Proceedings of the Fifteenth International
Workshop on Maximum Entropy and Bayesian Methods</cite> [How
<a href="large-deviations.html">large deviations</a> results sometimes give
maximum entropy distributions as large sample approximations.]
<li>Penha Maria Cardoso Dias and Abner Shimony, "A Critique of Jaynes' Maximum Entropy
Principle", <a href="http://dx.doi.org/10.1016/0196-8858(81)90003-8"><cite>Advances in Applied Mathematics</cite> <strong>2</strong>
(1981): 172--211</a>
<li>Kenneth Friedman and Abner Shimony, "Jaynes' Maximum Entropy Prescription
and Probability Theory," <a href="http://dx.doi.org/10.1007/BF01008275"><cite>Journal of Statistical
Physics</cite> <strong>3</strong> (1971): 381--384</a>
<li>E. T. Jaynes
<ul>
<li>"Information Theory and Statistical Mechanics I,"
<cite>Physical Review</cite> <strong>106</strong> (1957): 620--630
<li>"Information Theory and Statistical Mechanics II,"
<cite>Physical Review</cite> <strong>108</strong> (1957): 171--190
<li><cite>Papers on Probability, Statistics, and Statistical
Physics</cite> [Reprints both those papers, with many other important ones by
Jaynes]
</ul>
<li><a href="http://www.stat.cmu.edu/~kass/">Robert E. Kass</a>
and <a href="http://www.stat.cmu.edu/~larry/">Larry Wasserman</a>, "The
Selection of Prior Distributions by Formal Rules", <cite>Journal of the
American Statistical Association</cite> <strong>91</strong> (1996): 1343--1370
[<a href="http://www.stat.cmu.edu/~kass/papers/rules.pdf">PDF reprint</a>]
<li>Benoit Mandelbrot, "The Role of Sufficiency and of Estimation in
Thermodynamics", <a href="http://projecteuclid.org/euclid.aoms/1177704470"><cite>Annals
of Mathematical Statistics</cite> <strong>33</strong> (1962): 1021--1038</a>
[See comments under <a href="sufficient-statistics.html">Sufficient
Statistics</a>]
<li>Teddy Seidenfeld [Demonstrations that max-ent methods are, in fact,
plagued by the same problems as the old Principle of Insufficient Reason, and
not consistent with Bayesian inference. Claims from the Albany group that
maximum entropy is always compatible with Bayesian updating are thus
incorrect.]
<ul>
<li>"Why I Am Not an Objective Bayesian: Some Reflections
Prompted by Rosenkrantz", <cite>Theory and Decision</cite> <strong>11</strong>
(1979): 413--440
<li>"Entropy and Uncertainty", pp. 259--287 in I. B. MacNeill
and G. J. Umphrey (eds.), <cite>Foundations of Statistical Inference</cite>
(1987)
</ul>
<li>Jos Uffink
<ul>
<li>"Can the Maximum Entropy Principle be explained as a
Consistency Requirement?", <cite>Studies in History and Philosophy of Modern
Physics</cite> <strong>26B</strong> (1995): 223-261 [<a
href="http://www.phys.uu.nl/~wwwgrnsl/jos/mepabst/mepabst.html">Abstract, with
links to PDF and PS</a>]
<li>"The Constraint Rule of the Maximum Entropy
Principle," <cite>Studies in History and Philosophy of Modern
Physics</cite> <strong>27</strong> (1996): 47--79 [<a
href="Http://www.phys.uu.nl/~wwwgrnsl/jos/mep2def/mep2def.html">Abstract, with
links to PDF and PS</a>]
</ul>
</ul>
<ul>Modesty forbids me to recommend:
<li>CRS, "The Backwards Arrow of Time of the Consistently Bayesian
Statistical Mechanic", <a
href="http://arxiv.org/abs/cond-mat/0410063">cond-mat/0410063</a>
</ul>
<ul>To read (thanks to Edward Burns for recommendations):
<li>K. Bandyopadhyay, A. K. Bhattacharya, Parthapratim Biswas and
D. A. Drabold, "Maximum entropy and the problem of moments: A stable
algorithm", <a
href="http://arxiv.org/abs/cond-mat/0412717">cond-mat/0412717</a>
<li>Imre Csiszar, "Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems", <a href="http://dx.doi.org/10.1214/aos/1176348385"><cite>Annals of Statistics</cite> <strong>19</strong> (1991): 2032--2066</a> [This deserves to be worked through carefully, but on first examination, none of the theorems say "inferences following these axioms will be accurate" or "will be reliable", or anything like that; they just characterize
inferential methods which obey the axioms.]
<li>Peter Grunwald and A. Philip Dawid, "Game Theory, Maximum Entropy,
Minimum Discrepancy and Robust Bayesian Decision
Theory", <a href="http://dx.doi.org/10.1214/009053604000000553"><cite>Annals of
Statistics</cite> <strong>32</strong> (2004): 1367--1433</a>
<li>Patrick Haffner, Steven Phillips and Rob Schapire, "Efficient
Multiclass Implementations of L1-Regularized Maximum Entropy", <a
href="http://arxiv.org/abs/cs.LG/0506101">cs.LG/0506101</a>
<li>Prakash Ishwar and Pierre Moulin, "On the existence and characterization of the maxent distribution under general moment inequality constraints", <a href="http://dx.doi.org/10.1109/TIT.2005.853317"><cite>IEEE Transactions
on Information Theory</cite> <strong>51</strong> (2005): 3322--3333</a>, <a href="http://arxiv.org/abs/cs.IT/0506013">cs.IT/0506013</a>
<li>Oliver Johnson and Christophe Vignat, "Some results concerning
maximum Renyi entropy distributions", <a
href="http://arxiv.org/abs/math.PR/0507400">math.PR/0507400</a>
<li>Jill North, "Symmetry and Probability", <a
href="http://philsci-archive.pitt.edu/archive/00002978/">phil-sci/2978</a> [I
heard Prof. North talk about this at PSA 2006, and it sounded good, but I need
to read the details.]
</ul>
<P><em>Previous versions</em>: 2011-01-15; 2011-07-11; 2017-02-27