Attention conservation notice: Incredibly geeky exposition of statistical arcana, along with unbecoming negative emotions. Contains equations. Without some knowledge of calculus and probability, only the whinging will be clear.

An exponential
family is a collection of probability distributions over some space of
configurations, where the probability density at a particular
configuration \( x \) has the form
\[
p_{\theta}(x) = \frac{e^{-\theta \cdot T(x)}}{z(\theta)}
\]
(Note to statisticians: I'm glossing over some details about things like
the choice of reference measure here.) The vector \( T(x) \)
is a
(finite) collection of statistics, \( T_i(x) \), which we calculate on
individual configurations \( x \). \( \theta \)
is a corresponding vector of parameters, which essentially say how much weight
we give to each of the statistics. If one of those parameters is large and
positive, then configurations with large values of the matching statistic are,
all else being equal, exponentially unlikely. These parameters —
generally called the *natural* parameters — index the different
distributions. Clearly, any vector
\( \eta = f(\theta) \), where \( f \) is invertible, could also be used, but the nice algebraic
form of the density would generally be messed up. (In information geometry, we'd
say that the family is a manifold of distributions, and the natural parameters
are one coordinate system on that manifold. Changing to a different coordinate
system would be the same as reparametrization.) Finally, \( z \) is just a
normalizing factor:
\[
z(\theta) = \int{dx ~ e^{-\theta \cdot T(x)}}
\]
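To make the definition concrete, here is a minimal sketch on a discrete toy space, so that the integral defining \( z \) becomes a sum. The configuration space (three binary sites) and the two statistics in `T` are assumptions picked purely for illustration; nothing in the definition singles them out.

```python
import math
from itertools import product

# Toy configuration space (an assumption for illustration): 3 binary sites.
CONFIGS = list(product([0, 1], repeat=3))

def T(x):
    """Statistics of a configuration: (number of ones, number of equal neighbours)."""
    ones = sum(x)
    equal_neighbours = sum(1 for a, b in zip(x, x[1:]) if a == b)
    return (ones, equal_neighbours)

def dot(theta, t):
    """Inner product theta . T(x)."""
    return sum(th * ti for th, ti in zip(theta, t))

def z(theta):
    """Normalizing factor: a sum over configurations instead of an integral."""
    return sum(math.exp(-dot(theta, T(x))) for x in CONFIGS)

def p(theta, x):
    """Probability of configuration x under natural parameters theta."""
    return math.exp(-dot(theta, T(x))) / z(theta)

if __name__ == "__main__":
    theta = (2.0, 0.0)  # large positive weight on the "number of ones" statistic
    for x in CONFIGS:
        print(x, round(p(theta, x), 4))
```

With a large positive first parameter, configurations with many ones come out exponentially less likely than those with few, all else being equal.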

OK, that's an exponential family. So what? Because the probability density has this nice exponential form, many things are very easy to do. The statistics \( T_i \) turn out to be sufficient statistics, so to do essentially any kind of inference, we only need to know their values, not the complete configuration \( x \). Further, the values of the natural parameters that maximize the likelihood of the data, \( \hat{\theta} \), turn out to be the ones where the expected values of the statistics equal their observed values. I think that calculation is very cute, so I'll impose on your patience to actually go through it. At the maximum, the derivative of the likelihood with respect to each parameter must vanish, so by the quotient rule
\[
\begin{eqnarray*}
0 & = & {\left.\frac{\partial p_{\theta}(x)}{\partial \theta_i}\right|}_{\theta=\hat{\theta}}\\
& = & {\left. \frac{z(\theta)\frac{\partial}{\partial \theta_i}e^{-\theta\cdot T(x)} - e^{-\theta\cdot T(x)}\frac{\partial}{\partial \theta_i}z(\theta)}{z^2(\theta)} \right|}_{\theta=\hat{\theta}}
\end{eqnarray*}
\]
Since \( z^2(\theta) \) is positive, the numerator must itself vanish:
\[
{\left. -z(\theta)T_i(x) e^{-\theta\cdot T(x)} - e^{-\theta\cdot T(x)} \frac{\partial}{\partial \theta_i}z(\theta)\right|}_{\theta=\hat{\theta}} = 0
\]

Now the derivative of \( z \) has a nice form:
\[
\begin{eqnarray*}
\frac{\partial}{\partial \theta_i}z(\theta) & = & \int{dx ~ \frac{\partial}{\partial \theta_i} e^{-\theta \cdot T(x)}}\\
& = & \int{dx ~ (-T_i(x)) e^{-\theta \cdot T(x)}}\\
& = & \int{dx ~ (-T_i(x)) p_{\theta}(x) z(\theta)}\\
& = &-z(\theta) \int{dx ~ T_i(x) p_{\theta}(x)}\\
& = & -z(\theta) \mathbf{E}_{\theta}[T_i]
\end{eqnarray*}
\]

(If you worry about differentiating under the integral sign, I applaud
your caution. Imagine for right now that the space of configurations is
discrete, so that integral is really just a sum.)
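In that discrete setting the identity \( \partial z / \partial \theta_i = -z(\theta) \mathbf{E}_{\theta}[T_i] \) is easy to check numerically. The following is only a sanity check under assumed choices (three binary sites, two made-up statistics, an arbitrary parameter vector), comparing a central finite difference against the right-hand side.

```python
import math
from itertools import product

CONFIGS = list(product([0, 1], repeat=3))  # toy discrete space (assumption)

def T(x):
    """Illustrative statistics: (number of ones, number of equal neighbours)."""
    return (sum(x), sum(1 for a, b in zip(x, x[1:]) if a == b))

def weight(theta, x):
    """Unnormalized weight exp(-theta . T(x))."""
    return math.exp(-sum(th * t for th, t in zip(theta, T(x))))

def z(theta):
    return sum(weight(theta, x) for x in CONFIGS)

def expected_T(theta, i):
    """E_theta[T_i], computed by direct summation."""
    return sum(T(x)[i] * weight(theta, x) for x in CONFIGS) / z(theta)

def dz_numeric(theta, i, h=1e-6):
    """Central finite-difference estimate of dz/dtheta_i."""
    up = list(theta)
    up[i] += h
    down = list(theta)
    down[i] -= h
    return (z(up) - z(down)) / (2 * h)

theta = (0.4, -0.9)  # arbitrary test point
for i in range(2):
    print(i, dz_numeric(theta, i), -z(theta) * expected_T(theta, i))
```

The two printed columns should agree up to finite-difference error.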

Substituting this into the condition for the maximum,
\[
{\left. -z(\theta)T_i(x) e^{-\theta\cdot T(x)} + e^{-\theta\cdot T(x)}z(\theta) \mathbf{E}_{\theta}[T_i]\right|}_{\theta=\hat{\theta}} = 0
\]
Canceling the common factor \( z(\theta) e^{-\theta\cdot T(x)} \), which is never zero, we're left with
\[
T_i(x) = \mathbf{E}_{\hat{\theta}}[T_i]
\]
Q.E.D.
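The moment-matching result can also be watched happening numerically. This is a sketch under assumptions of my own choosing: three binary sites, a single statistic (the number of ones), and plain gradient ascent on the log-likelihood, whose derivative is \( \mathbf{E}_{\theta}[T] - T(x) \) and which has the same maximizer as the likelihood itself. For the observed configuration below the answer is available in closed form, \( \hat{\theta} = \ln 2 \), which gives something to compare against. (If the observed statistic sat at its extreme value, 0 or 3 here, no finite maximizer would exist and the loop would just drift.)

```python
import math
from itertools import product

CONFIGS = list(product([0, 1], repeat=3))  # toy discrete space (assumption)

def T(x):
    """Single illustrative statistic: the number of ones."""
    return sum(x)

def expected_T(theta):
    """E_theta[T], computed by direct summation over configurations."""
    weights = [math.exp(-theta * T(x)) for x in CONFIGS]
    return sum(T(x) * w for x, w in zip(CONFIGS, weights)) / sum(weights)

def fit(x_obs, steps=200, rate=0.5):
    """Gradient ascent on log p_theta(x_obs); step size chosen by hand."""
    theta = 0.0
    for _ in range(steps):
        # d/dtheta log p_theta(x) = E_theta[T] - T(x)
        theta += rate * (expected_T(theta) - T(x_obs))
    return theta

x_obs = (1, 0, 0)
theta_hat = fit(x_obs)
print(theta_hat, expected_T(theta_hat), T(x_obs))
```

At the fitted parameter, the expected and observed values of the statistic coincide.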

I could go on about the wonderful properties of exponential families at some length — say, a chapter in a first course on theoretical statistics, or even whole monographs — but I'll forbear. The art in getting exponential families to work consists of picking the right set of statistics — the right functions of the data to calculate.

Some months ago — you'll see why I disguise the details in a moment — I heard a talk a statistician gave on a subject of mutual concern to her tribe, and to physicists interested in complex systems. As I've mentioned a number of times before, these two tribes essentially never talk to each other, so this was pleasing to me. Unfortunately, the talk itself was less than successful, and it nearly resulted in a highly regrettable action on my part.

This was because the statistician was doing something very clever, which should have been quite transparent to the physicists, but was in fact utterly opaque. She was trying to fit the properties of these systems using exponential families, which is why I bring them up. Most of the talk she had intended to give was about how the set of statistics most people had thought were important in these systems actually turned out not to work, but a different, much larger set did. These new variables could be broken down into two sorts, which corresponded to two different mechanisms in the system in question. Both mechanisms had been postulated by physicists, but their relative contributions hadn't been clearly distinguished in data before.

If there is any *one* idea in theoretical statistics which should be
natural for physicists, I'd think it's that of an exponential family. This is
because classical statistical mechanics is *all about* one particular
exponential family, namely the Boltzmann distribution. The sufficient
statistics \( T_i \)
correspond to the extensive thermodynamic
variables, like molecular numbers, volume and energy, while the natural
parameters correspond to the intensive variables, like chemical potentials,
pressure and (inverse) temperature. Saying that you need to find the right
statistics to get good results is the same as our bit of lore that the crucial
first step is to find the right collective degrees of freedom — that you
can't hope to make progress until you've identified the order parameters, etc.
The normalizing factor is just the partition function (which is why I wrote
it \( z \)). And the equality of observed and expected values at the
maximum likelihood parameter turns out to be entropy maximization (as Jaynes
pointed out sometime in the '60s).
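
To spell out the dictionary in the simplest case (a sketch for the canonical ensemble only, with energy as the single statistic; \( E(x) \), \( \beta \) and \( Z \) are the usual physics symbols, not anything defined above), take \( T(x) = E(x) \) and \( \theta = \beta \), the inverse temperature:
\[
p_{\beta}(x) = \frac{e^{-\beta E(x)}}{Z(\beta)}, \qquad Z(\beta) = \int{dx ~ e^{-\beta E(x)}}
\]
The derivative formula for the normalizing factor then becomes the familiar
\[
-\frac{\partial \ln Z(\beta)}{\partial \beta} = \mathbf{E}_{\beta}[E]
\]
and the maximum likelihood condition says to pick the temperature at which the mean energy equals the observed energy.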

In fact, *none* of this was clear to the other physicists in the
audience, who were not (for the most part) dumb. They didn't get what the
normalizing factor was, they didn't really get the difference between a
statistic and a parameter, and they even had trouble understanding that finding
the parameter value for which the observed configuration is more likely than it
is with any other parameter value is not the same as finding the parameters
where the observed configuration is more likely than any other configuration
— that maximizing the likelihood is not the same as making the data the
mode. Hell, some of them had trouble understanding that the mean of a
distribution and its mode are not necessarily the same. At one point I wanted
to yell at one of them, who was being particularly obtuse, "J., can't you even
recognize a #!@% partition function?"; but that wouldn't have been proper even
if I'd been the one giving the talk, which I wasn't.
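
The distinction between maximizing the likelihood and making the data the mode shows up even in a very small example. This is a sketch under assumed choices (three binary sites, the number of ones as the only statistic, one observed configuration), using the closed-form estimate \( \hat{\theta} = \ln 2 \) that moment matching gives for this particular case.

```python
import math
from itertools import product

CONFIGS = list(product([0, 1], repeat=3))  # toy discrete space (assumption)

def p(theta, x):
    """Probability of x when the only statistic is the number of ones."""
    zed = sum(math.exp(-theta * sum(c)) for c in CONFIGS)
    return math.exp(-theta * sum(x)) / zed

x_obs = (1, 0, 0)
theta_hat = math.log(2)  # solves E_theta[T] = T(x_obs) = 1 for this toy family

# Most probable configuration under the fitted distribution.
mode = max(CONFIGS, key=lambda c: p(theta_hat, c))
print("mode:    ", mode, p(theta_hat, mode))
print("observed:", x_obs, p(theta_hat, x_obs))
```

No other parameter value makes the observed configuration more probable, and yet under the fitted distribution the all-zeros configuration is twice as probable as the observed one. The same example has mean \( T = 1 \) and modal \( T = 0 \).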

All this comes to mind, of course, because I've been writing a page explaining what I do, and why I'm doing it in a statistics department. I don't think I've ever been so glad of my new affiliation as I was when that talk ended. This was some time ago, now, and so I'm able to think more calmly, and can envision another seminar — the dual, as it were — where a physicist talked about "state" and "coarse-grained collective degrees of freedom", and statisticians were equally baffled, because they didn't realize he was talking about causal screening, low-dimensional statistics, etc.

In fact, I can only too easily envision *giving* that talk. But
that's a somewhat limited ambition: my true goal is to produce
work *both* statisticians and physicists will find incomprehensible.

Posted at June 12, 2005 21:10