## June 12, 2005

### Exponential Families and Hybridity (Why Oh Why Can't Physicists Learn Better Probability and Statistics, Part N)

Attention conservation notice: Incredibly geeky exposition of statistical arcana, along with unbecoming negative emotions. Contains equations. Without some knowledge of calculus and probability, only the whinging will be clear.

An exponential family is a collection of probability distributions over some space of configurations, where the probability density at a particular configuration $x$ has the form $p_{\theta}(x) = \frac{e^{-\theta \cdot T(x)}}{z(\theta)}$. (Note to statisticians: I'm glossing over some details about things like the choice of reference measure here.) The vector $T(x)$ is a (finite) collection of statistics, $T_i(x)$, which we calculate on individual configurations $x$. $\theta$ is a corresponding vector of parameters, which essentially say how much weight we give to each of the statistics. If one of those parameters is large and positive, then configurations with large values of the matching statistic are, all else being equal, exponentially unlikely. These parameters — generally called the natural parameters — index the different distributions. Clearly, any vector $\eta = f(\theta)$, where $f$ is invertible, could also be used, but the nice algebraic form of the density would generally be messed up. (In information geometry, we'd say that the family is a manifold of distributions, and the natural parameters are one coordinate system on that manifold. Changing to a different coordinate system would be the same as reparametrization.) Finally, $z$ is just a normalizing factor: $z(\theta) = \int{dx ~ e^{-\theta \cdot T(x)}}$.
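To make the definition concrete, here is a tiny Python sketch (all the function names are mine, not any library's) that writes the humble Bernoulli distribution in exactly this form, with a one-element statistic vector $T(x) = x$ on the configuration space $\{0, 1\}$:

```python
from math import exp

def z(theta, T, configs):
    # Normalizing factor z(theta) = sum over configurations of exp(-theta * T(x)).
    # A finite configuration space stands in for the integral.
    return sum(exp(-theta * T(x)) for x in configs)

def density(x, theta, T, configs):
    # p_theta(x) = exp(-theta * T(x)) / z(theta)
    return exp(-theta * T(x)) / z(theta, T, configs)

# The Bernoulli distribution as a one-parameter exponential family:
# configurations {0, 1}, sufficient statistic T(x) = x, so
# z(theta) = 1 + exp(-theta) and p(1) = exp(-theta) / (1 + exp(-theta)).
configs = [0, 1]
T = lambda x: x
theta = 0.5
probs = [density(x, theta, T, configs) for x in configs]
print(probs)  # the two probabilities sum to 1
```

Making $\theta$ more positive pushes probability away from $x = 1$, just as the weighting story above says it should.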

OK, that's an exponential family — so what? Because the probability density has this nice exponential form, many things are very easy to do. The statistics $T_i$ turn out to be sufficient statistics, so to do essentially any kind of inference, we only need to know their values, not the complete configuration $x$. Further, the values of the natural parameters that maximize the likelihood of the data, $\hat{\theta}$, turn out to be the ones where the expected values of the statistics equal their observed values. I think that calculation is very cute, so I'll impose on your patience and actually go through it. $\begin{eqnarray*} {\left.\frac{\partial p_{\theta}(x)}{\partial \theta_i}\right|}_{\theta=\hat{\theta}} & = & 0\\ & = & {\left. \frac{z(\theta)\frac{\partial}{\partial \theta_i}e^{-\theta\cdot T(x)} - e^{-\theta\cdot T(x)}\frac{\partial}{\partial \theta_i}z(\theta)}{z^2(\theta)} \right|}_{\theta=\hat{\theta}}\\ & = & {\left. -z(\theta)T_i(x) e^{-\theta\cdot T(x)} - e^{-\theta\cdot T(x)} \frac{\partial}{\partial \theta_i}z(\theta)\right|}_{\theta=\hat{\theta}} \end{eqnarray*}$ (In the last step I multiplied through by $z^2(\hat{\theta})$, which is positive, so the right-hand side is still zero.)

Now the derivative of $z$ has a nice form: $\begin{eqnarray*} \frac{\partial}{\partial \theta_i}z(\theta) & = & \int{dx ~ \frac{\partial}{\partial \theta_i} e^{-\theta \cdot T(x)}}\\ & = & \int{dx ~ (-T_i(x)) e^{-\theta \cdot T(x)}}\\ & = & \int{dx ~ (-T_i(x)) p_{\theta}(x) z(\theta)}\\ & = &-z(\theta) \int{dx ~ T_i(x) p_{\theta}(x)}\\ & = & -z(\theta) \mathbf{E}_{\theta}[T_i] \end{eqnarray*}$
(If you worry about differentiating under the integral sign, I applaud your caution. Imagine for right now that the space of configurations is discrete, so that integral is really just a sum.)
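On a discrete configuration space the identity $\frac{\partial z}{\partial \theta_i} = -z(\theta)\,\mathbf{E}_{\theta}[T_i]$ can be checked numerically, which is a good sanity test if the derivation felt slippery. A sketch (with an invented four-point configuration space):

```python
from math import exp

# Check dz/dtheta = -z(theta) * E_theta[T] on a small discrete configuration
# space, where the "integral" defining z really is just a sum.
configs = [0, 1, 2, 3]
T = lambda x: x
theta = 0.7

def z(th):
    return sum(exp(-th * T(x)) for x in configs)

def expect_T(th):
    # E_theta[T] = sum of T(x) * p_theta(x) over configurations
    return sum(T(x) * exp(-th * T(x)) for x in configs) / z(th)

h = 1e-6
dz_numeric = (z(theta + h) - z(theta - h)) / (2 * h)  # central difference
dz_formula = -z(theta) * expect_T(theta)
print(dz_numeric, dz_formula)  # agree to numerical precision
```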

So ${\left. -z(\theta)T_i(x) e^{-\theta\cdot T(x)} + e^{-\theta\cdot T(x)}z(\theta) \mathbf{E}_{\theta}[T_i]\right|}_{\theta=\hat{\theta}} = 0$ Canceling common factors, we're left with $T_i(x) = \mathbf{E}_{\hat{\theta}}[T_i]$ Q.E.D.
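To see the moment-matching property in action, take the exponential distribution, $p_{\theta}(x) = \theta e^{-\theta x}$ on $x > 0$, which is the family above with $T(x) = x$ and $z(\theta) = 1/\theta$, so $\mathbf{E}_{\theta}[T] = 1/\theta$. Maximizing the likelihood numerically (a crude grid search over invented data — this is a sketch, not a recipe) lands right where the expected value of $T$ matches its sample mean:

```python
from math import log

# Made-up sample; T(x) = x, so the observed statistic is the sample mean.
data = [0.8, 1.3, 0.4, 2.1, 0.9]
mean_T = sum(data) / len(data)

def log_likelihood(theta):
    # sum of log(theta * exp(-theta * x)) = n*log(theta) - theta*sum(x)
    return len(data) * log(theta) - theta * sum(data)

# Crude grid search for the maximizer; no optimizer library needed.
grid = [0.01 * k for k in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)

# E_theta[T] = 1/theta, so moment matching predicts theta_hat = 1/mean_T.
print(theta_hat, 1.0 / mean_T)
```

The maximizer agrees with $1/\bar{x}$ to within the grid spacing, exactly as $T(x) = \mathbf{E}_{\hat{\theta}}[T]$ demands.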

I could go on about the wonderful properties of exponential families at some length — say, a chapter in a first course on theoretical statistics, or even whole monographs — but I'll forbear. The art in getting exponential families to work consists of picking the right set of statistics — the right functions of the data to calculate.

Some months ago — you'll see why I disguise the details in a moment — I heard a talk a statistician gave on a subject of mutual concern to her tribe and to physicists interested in complex systems. As I've mentioned a number of times before, these two tribes essentially never talk to each other, so this was pleasing to me. Unfortunately, the talk itself was less than successful, and it nearly resulted in a highly regrettable action on my part.

This was because the statistician was doing something very clever, which should have been quite transparent to the physicists, but was in fact utterly opaque. She was trying to fit the properties of these systems using exponential families, which is why I bring them up. Most of the talk she had intended to give was about the choice of statistics: the set most people had thought was important for these systems actually turned out not to work, but a different, much larger set did. Better yet, the new statistics could be broken down into two sorts, corresponding to two different mechanisms in the system in question, both of which had been postulated by physicists, but whose relative contributions hadn't been clearly distinguished in data before.

If there is any one idea in theoretical statistics which should be natural for physicists, I'd think it's that of an exponential family. This is because classical statistical mechanics is all about one particular exponential family, namely the Boltzmann distribution. The sufficient statistics $T_i$ correspond to the extensive thermodynamic variables, like molecular numbers, volume and energy, while the natural parameters correspond to the intensive variables, like chemical potentials, pressure and (inverse) temperature. Saying that you need to find the right statistics to get good results is the same as our bit of lore that the crucial first step is to find the right collective degrees of freedom — that you can't hope to make progress until you've identified the order parameters, etc. The normalizing factor is just the partition function (which is why I wrote it $z$). And the equality of observed and expected values at the maximum likelihood parameter turns out to be entropy maximization (as Jaynes pointed out sometime in the '60s).
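The dictionary fits in a few lines of Python. For a toy system with a handful of energy levels (the values are invented), the partition function is the normalizing factor $z$, inverse temperature is the natural parameter, energy is the sufficient statistic, and the standard stat-mech identity $\langle E \rangle = -\partial \log Z / \partial \beta$ is just the derivative calculation from above in logarithmic clothing:

```python
from math import exp, log

# A three-level toy system: energies are the sufficient statistic, inverse
# temperature beta is the natural parameter, Z(beta) is z(theta).
energies = [0.0, 1.0, 2.5]
beta = 1.2

def Z(b):
    return sum(exp(-b * E) for E in energies)

def mean_energy(b):
    # Expected energy under the Boltzmann distribution at inverse temperature b
    return sum(E * exp(-b * E) for E in energies) / Z(b)

# -d(log Z)/d(beta) should equal the mean energy; check by central difference.
h = 1e-6
dlogZ = (log(Z(beta + h)) - log(Z(beta - h))) / (2 * h)
print(-dlogZ, mean_energy(beta))  # agree to numerical precision
```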

In fact, none of this was clear to the other physicists in the audience, who were not (for the most part) dumb. They didn't get what the normalizing factor was, they didn't really get the difference between a statistic and a parameter, and they even had trouble understanding that finding the parameter value for which the observed configuration is more likely than it is with any other parameter value is not the same as finding the parameters where the observed configuration is more likely than any other configuration — that maximizing the likelihood is not the same as making the data the mode. Hell, some of them had trouble understanding that the mean of a distribution and its mode are not necessarily the same. At one point I wanted to yell at one of them, who was being particularly obtuse, "J., can't you even recognize a #!@% partition function?"; but that wouldn't have been proper even if I'd been the one giving the talk, which I wasn't.
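The confusion about likelihood versus mode has a one-line counterexample. Fit the exponential density $p_{\theta}(x) = \theta e^{-\theta x}$ to a single observation $x = 2$ (numbers invented for illustration): the MLE is $\hat{\theta} = 1/x$, yet under that fitted density the mode sits at $0$, not at the data point, and the mean ($1/\hat{\theta} = 2$) is not the mode either.

```python
from math import exp

x_obs = 2.0
theta_hat = 1.0 / x_obs  # maximizes theta * exp(-theta * x_obs) over theta

def density(x, theta):
    # Exponential density on x > 0 (and its limit at x = 0)
    return theta * exp(-theta * x)

mode = 0.0               # every exponential density peaks at x = 0
mean = 1.0 / theta_hat   # ... while its mean is 1/theta

# The data point is NOT the most probable configuration under the MLE fit:
print(density(mode, theta_hat) > density(x_obs, theta_hat))
print(mean, mode)  # mean and mode of the same distribution, not equal
```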

All this comes to mind, of course, because I've been writing a page explaining what I do, and why I'm doing it in a statistics department. I don't think I've ever been so glad of my new affiliation as I was when that talk ended. This was some time ago, now, and so I'm able to think more calmly, and can envision another seminar — the dual, as it were — where a physicist talked about "state" and "coarse-grained collective degrees of freedom", and statisticians were equally baffled, because they didn't realize he was talking about causal screening, low-dimensional statistics, etc.

In fact, I can only too easily envision giving that talk. But that's a somewhat limited ambition: my true goal is to produce work both statisticians and physicists will find incomprehensible.

Posted at June 12, 2005 21:10 | permanent link