Notebooks

Cumulants, and Cumulant Generating Functions

24 Mar 2024 13:32

Attention conservation notice: Mostly an embarrassing admission of not understanding stuff I should have grasped decades ago.
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

I still have no real intuition for cumulants, despite their importance to probability theory, their role in quantum field theory (which I first studied more than a quarter century ago), and despite my having used cumulant generating functions in multiple papers (e.g.). Since I am middle-aged, and trying to be more shameless about admitting to ignorance, I will follow my usual procedure of writing down what I do understand, until I get to the point where I know I'm lost. This includes a lot of "working things out from first principles", which really means "reconstructing from memory (badly)".

Moment generating functions

I understand, I think, moment generating functions. I start with my favorite random variable \( X \), which has (I assume) finite moments \( \Expect{X^k} \) for all non-negative integers \( k \). As was revealed to us by the Illuminati, if I collect all of those in the right power series, \[ M_X(t) \equiv \sum_{k=0}^{\infty}{\frac{\Expect{X^k}}{k!}t^k} \] I get a function of \( t \), the moment generating function (MGF), where the derivatives at the origin "encode" the moments: \[ \left. \frac{d^k M_X}{d t^k} \right|_{t=0} = \Expect{X^k} \]

(For this to make sense dimensionally, the units of \( t \) need to be the reciprocal of whatever the units of \( X \) happen to be --- inverse kilograms, or inverse dollars, or square inches per pound, as the case happens to be. --- Also, I will sometimes drop the subscript \( X \) for brevity, but only when it can be understood from context.)
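
As a quick sanity check on that derivative property, here is a sympy sketch, using the rate-1 exponential distribution as a concrete example (my choice; its MGF is \( 1/(1-t) \) for \( t < 1 \), and its \( k^{\mathrm{th}} \) moment is \( k! \)):

import sympy as sp

t, x = sp.symbols('t x')

# Rate-1 exponential: density e^{-x} on x > 0; its MGF is 1/(1 - t) for t < 1.
M = 1 / (1 - t)

for k in range(1, 5):
    from_mgf = sp.diff(M, t, k).subs(t, 0)                            # k-th derivative at the origin
    from_integral = sp.integrate(x**k * sp.exp(-x), (x, 0, sp.oo))    # E[X^k] by direct integration
    print(k, from_mgf, from_integral)                                 # both columns should read k!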

Generating Functions: Why???!?

At this point, while I'm in a confessional mood, I should mention that I never got the point of generating functions while I was a student. (I am sure my teachers explained it but I tuned them out or otherwise didn't get it.) If you start from the definition, as I gave it above, then it seems like you have to already know all the moments to get the MGF, so the MGF doesn't tell you anything. Maybe there are some circumstances where if you forget one of the moments but remember the generating function, it's easier to differentiate \( M_X \) than it is to integrate \( \int{ x^k p(x) dx } \), but that hardly seemed important enough to warrant all the fuss. The answer, which didn't click for me until embarrassingly far into my teaching career, is that generating functions are useful when there's a trick to get the generating function first, and then we can differentiate it to extract the series it encodes.

Back to the Moment Generating Function

Expectations are linear, so we can equally well write \[ M_X(t) = \Expect{\sum_{k=0}^{\infty}{\frac{X^k}{k!} t^k}} \] and, recognizing the power series inside the expectation, \[ M_X(t) = \Expect{e^{tX}} \] Indeed many sources will start with that exponential form as the definition, which makes things look a little more mysterious. But the form facilitates manipulations. For instance, what's the MGF of \( a+X \), for constant \( a \)? \[ \Expect{e^{t(a+X)}} = \Expect{e^{ta} e^{tX}} = e^{ta} \Expect{e^{tX}} = e^{ta} M_X(t) \] What's the MGF of \( b X \), for constant \( b \)? \[ \Expect{e^{t bX}} = \Expect{e^{(tb)X}} = M_X(bt) \]
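
Here is a sympy sketch checking the shift and scale rules against direct integration, using the standard Gaussian as the concrete example (my choice), with \( a \) and \( b \) left symbolic:

import sympy as sp

t, x, a = sp.symbols('t x a', real=True)
b = sp.symbols('b', positive=True)

# Standard Gaussian density; all three MGFs below are computed by direct integration.
phi = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)

def mgf(rv):
    return sp.simplify(sp.integrate(sp.exp(t * rv) * phi, (x, -sp.oo, sp.oo)))

M_X = mgf(x)                                           # exp(t**2/2)
print(sp.simplify(mgf(a + x) - sp.exp(a * t) * M_X))   # 0: shift rule
print(sp.simplify(mgf(b * x) - M_X.subs(t, b * t)))    # 0: scale rule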

Even more importantly, if \( X \) and \( Y \) are statistically independent, what's the MGF of their sum \( X+Y \)? \[ M_{X+Y}(t) = \Expect{e^{t(X+Y)}} = \Expect{e^{tX} e^{tY}} = \Expect{e^{tX}} \Expect{e^{tY}} = M_X(t) M_Y(t) \] so the MGF for the sum of two independent random variables is just the product of their individual MGFs. One can, in fact, use these three rules to give a heuristic "derivation" of the central limit theorem: define \( \overline{X}_n = n^{-1}\sum_{i=1}^{n}{X_i} \) for independent and identically distributed \( X_i \), with mean \( \mu \) and variance \( \sigma^2 \), then \[ \sqrt{n}\frac{\overline{X}_n - \mu}{\sigma} \rightarrow \mathcal{N}(0,1) \] the right-hand side being the standard Gaussian (or "Normal") distribution. (Exercise: Do this, first working out the MGF of the standard Gaussian.) I put "derivation" in scare quotes because what this shows is that the (appropriately centered and scaled) sample mean ends up having increasingly Gaussian-looking moments, and it's not quite true that convergence of all the moments implies convergence of the whole distribution.
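
As a numerical illustration of where that heuristic points (my own toy example, not the exercise): for rate-1 exponentials, \( \mu = \sigma = 1 \) and \( M_X(s) = 1/(1-s) \), so the three rules give the MGF of \( \sqrt{n}(\overline{X}_n - \mu)/\sigma \) as \( \left[ e^{-t/\sqrt{n}} / (1 - t/\sqrt{n}) \right]^n \), which should creep towards \( e^{t^2/2} \) as \( n \) grows:

import math

def mgf_standardized_mean(t, n):
    # MGF of sqrt(n) * (Xbar_n - 1) for n i.i.d. rate-1 exponentials (mu = sigma = 1);
    # only valid for t < sqrt(n).
    s = t / math.sqrt(n)
    return (math.exp(-s) / (1 - s)) ** n

for t in (0.5, 1.0, 2.0):
    approximations = [mgf_standardized_mean(t, n) for n in (100, 10_000, 1_000_000)]
    print(t, approximations, math.exp(t**2 / 2))   # last entry is the standard Gaussian MGF

(The approach is slower for larger \( t \); for this example the error in the exponent is of order \( t^3/\sqrt{n} \).)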

From the Moment Generating Function to the Cumulant Generating Function and Then Cumulants

So much for the moment generating function. The cumulant generating function is defined in terms of the MGF: \[ C_X(t) \equiv \log{M_X(t)} = \log{\Expect{e^{tX}}} \] The cumulants are defined, as it were, derivatively: \[ \kappa_k \equiv \left. \frac{d^k C_X}{d t^k}\right|_{t=0} \]

(I have sometimes seen people try to motivate this by claiming to want a function that's on the same scale as \( X \), rather than exponentiated, but I'm not sure how that makes sense. As I said earlier, \( t \) needs to have units inverse to \( X \), so \( t X \) is already a dimensionless quantity...)
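
A concrete example, to fix the definitions (my choice of the Poisson): if \( X \sim \mathrm{Poisson}(\lambda) \), then \( M_X(t) = e^{\lambda(e^t - 1)} \), so \( C_X(t) = \lambda(e^t - 1) \), and every cumulant equals \( \lambda \). In sympy:

import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)

# Poisson(lambda): M(t) = exp(lambda*(exp(t) - 1)), so C(t) = lambda*(exp(t) - 1).
C = lam * (sp.exp(t) - 1)

# Every derivative of C at the origin is lambda: all the Poisson cumulants equal the mean.
print([sp.diff(C, t, k).subs(t, 0) for k in range(1, 6)])   # [lambda, lambda, lambda, lambda, lambda]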

Some character-building work with the chain rule and quotient rule shows that \begin{eqnarray} \kappa_1 & = & \frac{M^{\prime}(0)}{M(0)}\\ & = & \Expect{X}\\ \kappa_2 & = & \frac{M(0) M^{\prime\prime}(0) - (M^{\prime}(0))^2}{M^2(0)}\\ & = & \Expect{X^2} - (\Expect{X})^2 \equiv \mathrm{Var}(X)\\ \kappa_3 & = & \frac{M^2(0) M^{\prime\prime\prime}(0) - 3 M(0) M^{\prime}(0) M^{\prime\prime}(0) + 2 (M^{\prime}(0))^3}{M^3(0)}\\ & = & \Expect{X^3} - 3\Expect{X^2}\Expect{X} + 2(\Expect{X})^3 \end{eqnarray}

The first cumulant is the first moment. The second cumulant is the second central moment, because \( \Expect{X^2} - (\Expect{X})^2 = \Expect{(X-\Expect{X})^2} \). The third cumulant is actually also the third central moment, but this is not true of the higher cumulants.
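
Here is a sympy sketch that reproduces these formulas (plus the fourth cumulant) by series-expanding \( \log{M(t)} \) with symbolic moments \( m_1, \ldots, m_4 \) (my notation), and also checks the claims about central moments:

import sympy as sp

t = sp.symbols('t')
m = sp.symbols('m1:5')   # symbolic raw moments E[X], E[X^2], E[X^3], E[X^4]

# Truncated MGF: M(t) = 1 + sum_k m_k t^k / k!
M = 1 + sum(m[k - 1] * t**k / sp.factorial(k) for k in range(1, 5))

# Cumulants are k! times the Taylor coefficients of C(t) = log M(t).
C = sp.log(M).series(t, 0, 5).removeO()
kappa = [sp.expand(sp.factorial(k) * C.coeff(t, k)) for k in range(1, 5)]

# Central moments are k! times the Taylor coefficients of E[e^{t(X - E[X])}] = e^{-m1 t} M(t).
Mc = (sp.exp(-m[0] * t) * M).series(t, 0, 5).removeO()
central = [sp.expand(sp.factorial(k) * Mc.coeff(t, k)) for k in range(1, 5)]

print(kappa[:3])                          # [m1, m2 - m1**2, m3 - 3*m1*m2 + 2*m1**3]
print(sp.simplify(kappa[1] - central[1])) # 0: second cumulant = second central moment
print(sp.simplify(kappa[2] - central[2])) # 0: third cumulant = third central moment
print(sp.factor(kappa[3] - central[3]))   # -3*(m2 - m1**2)**2: the fourth cumulant is the
                                          # fourth central moment minus 3*kappa_2**2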

Now here is where my failure-to-grasp begins. (Or maybe it began earlier and this is just where it becomes unmistakable.) I understand what the first three cumulants are saying about the distribution. I have no intuition for what the fourth cumulant measures, or any higher cumulant, or why all of these are measuring the same kind of thing, the way I grasp how all the moments, and all the central moments, are measuring the same kind of thing. I can show, algebraically, that the \( k^{\mathrm{th}} \) cumulant is a polynomial, of order \( k \), in the first \( k \) moments, and that the \( k^{\mathrm{th}} \) moment \( \Expect{X^k} \) is always the first term in that polynomial. (And once I've shown that, I can recover the moments from the cumulants.) But why those are the right polynomials, I can't tell you (or a student).
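
For the reverse direction just mentioned, a sketch along the same lines (with symbolic cumulants \( \kappa_1, \ldots, \kappa_4 \)): exponentiate a truncated cumulant generating function and read the moments back off as Taylor coefficients:

import sympy as sp

t = sp.symbols('t')
kappa = sp.symbols('kappa1:5')   # symbolic cumulants

# Truncated CGF; then M(t) = exp(C(t)), and moments are k! times its Taylor coefficients.
C = sum(kappa[k - 1] * t**k / sp.factorial(k) for k in range(1, 5))
M = sp.exp(C).series(t, 0, 5).removeO()
moments = [sp.expand(sp.factorial(k) * M.coeff(t, k)) for k in range(1, 5)]
print(moments)
# e.g. E[X^2] = kappa2 + kappa1**2 and E[X^3] = kappa3 + 3*kappa1*kappa2 + kappa1**3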

(I realize that a moment ago I was talking about how, as a student, I didn't see the point of the moment generating function when it seemed that we needed the moments to get it, and now I'm complaining that I don't understand the point of the cumulants which are, seemingly, best defined in terms of the cumulant generating function. The common themes here, across decades, are about my unpleasant combination of intellectual arrogance with mathematical ineptitude.)

I can tell you that when you take the sum of two independent random variables, their cumulant generating functions add, \[ C_{X+Y}(t) = \log{\Expect{e^{tX} e^{tY}}} = \log{\left( \Expect{e^{tX}} \Expect{e^{tY}} \right)} = \log{\Expect{e^{tX}}} + \log{\Expect{e^{tY}}} = C_X(t) + C_Y(t) \] (using independence to factor the expectation), and consequently their cumulants, of whatever order, must add. I think it's the case that if you demand polynomials in the moments which add up for sums of independent random variables, regardless of their distribution, you are forced to use the cumulants, but I'm not sure of that. (That sounds like the kind of fact I could look up, if I were sufficiently motivated.) Even if so, that doesn't give me any intuition about why (say) the third cumulant needs to be \( \Expect{X^3} - 3\Expect{X^2}\Expect{X} + 2(\Expect{X})^3 \).
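
Here is a sympy sketch of that additivity for two deliberately different distributions (my choices: a rate-1 exponential and a Gaussian with symbolic mean and variance); the cumulants extracted from the product of the MGFs should be the sums of the individual cumulants:

import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

def cumulants(M, n=4):
    # First n cumulants, read off as derivatives of log M at the origin.
    return [sp.simplify(sp.diff(sp.log(M), t, k).subs(t, 0)) for k in range(1, n + 1)]

M_X = 1 / (1 - t)                              # rate-1 exponential
M_Y = sp.exp(mu * t + sigma**2 * t**2 / 2)     # Gaussian with mean mu, variance sigma^2

kx, ky, kxy = cumulants(M_X), cumulants(M_Y), cumulants(M_X * M_Y)
print([sp.simplify(kxy[i] - (kx[i] + ky[i])) for i in range(4)])   # [0, 0, 0, 0]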

The Cumulant Generating Function and Exponential Tail Bounds

Suppose the moment generating function exists. Then it takes a little work with Markov's inequality to conclude that \[ \Prob{ X > h } \leq e^{-th} M_X(t) = e^{-th+C_X(t)} \] for any \( t > 0 \). Notice that the \( e^{-th} \) factor declines exponentially as \( h \) increases, but the \( e^{C_X(t)} \) factor doesn't depend on \( h \), so we're getting an exponential bound on the probability of large values of \( X \). (And we'd better! Otherwise, \( \Expect{ e^{tX} } \) would be infinite.) Because this is true for any \( t \), we can optimize to get the tightest bound: \[ C_X^*(h) = \sup_{t > 0}{\left( th - C_X(t) \right)} \] and then \[ \Prob{X > h} \leq e^{-C^*_X(h)} \] So the cumulant generating function, through its transformed version \( C^* \), lets us upper-bound the probability of very large values of \( X \). Additivity for independent random variables then becomes a handy way of getting laws of large numbers, the upper-bound half of large deviation principles, etc.
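
A worked instance (my choice of the standard Gaussian, where \( C(t) = t^2/2 \), the supremum is attained at \( t = h \), and so \( C^*(h) = h^2/2 \)), comparing the bound \( e^{-h^2/2} \) with the exact tail probability:

import math

# Standard Gaussian: C(t) = t^2/2, so th - C(t) is maximized at t = h, giving C*(h) = h^2/2.
for h in (1.0, 2.0, 3.0, 4.0):
    chernoff_bound = math.exp(-h**2 / 2)
    exact_tail = 0.5 * math.erfc(h / math.sqrt(2))   # P(X > h) for X ~ N(0, 1)
    print(h, exact_tail, chernoff_bound)             # the bound always sits above the exact tail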

"Understanding"

As I write this all out, it occurs to me that I am not really sure what would satisfy me here. A physical sense of what the cumulants measure would be ideal, but perhaps a bit much to hope for. Another possibility would be something like the following: I can never remember the exact coefficients of the Hermite polynomials, or the Laguerre polynomials, or any of the other orthogonal polynomial series. But I can remember that the members of each family are supposed to be orthogonal to each other, under such-and-such a distribution. Since there are \( n+1 \) coefficients in the polynomial of order \( n \), and it needs to be orthogonal to the previous \( n \) polynomials (including the order-zero, constant polynomial) and the leading term should be just \( x^n \), there are \( n+1 \) linear equations to solve for the coefficients, and I can find each successive polynomial recursively. Moreover, I get why we want orthogonal functions! So one possible answer to my puzzlement about cumulants would be "here is a desirable property of some transformation of the moments, and here's a rule for getting those transformations", with bonus points if it's a recurrence relation which tells me how to get higher cumulants once I know low-order ones. (Something like "this is the part of the \( k^{\mathrm{th}} \) moment you couldn't guess, somehow, from the lower moments"?) Or, alternately, "Here's a desirable property of some transformation of the moments, and here's a procedure for getting all the terms which have to go into the \( k^{\mathrm{th}} \)-order cumulant".
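
For what it's worth, here is that recursive recipe carried out in sympy for the standard Gaussian weight; Gram-Schmidt on \( 1, x, x^2, \ldots \) recovers the monic (probabilists') Hermite polynomials:

import sympy as sp

x = sp.symbols('x')
w = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)   # standard Gaussian weight

def inner(p, q):
    return sp.integrate(p * q * w, (x, -sp.oo, sp.oo))

# Gram-Schmidt on 1, x, x^2, ...: monic polynomials, each orthogonal to all its predecessors.
polys = []
for n in range(5):
    p = x**n - sum(inner(x**n, q) / inner(q, q) * q for q in polys)
    polys.append(sp.expand(p))
print(polys)   # [1, x, x**2 - 1, x**3 - 3*x, x**4 - 6*x**2 + 3]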

With any luck, this is all well-known in some corner of the mathematical universe, and posting this will lead to my being pointed in the right direction.

Partition functions

Suppose I'm doing equilibrium statistical mechanics, and have a bunch of discrete states \( k=0, 1, \ldots \), each with energy \( u_k \). Under the Boltzmann distribution the probability of finding the system in state \( j \) is then \[ \frac{e^{-\beta u_j}}{\sum_{k=0}^{\infty}{e^{-\beta u_k}}} \] where \( \beta = 1/(k_B \tau) \), \( \tau \) is the absolute temperature, and \( k_B \) is Boltzmann's constant. The normalizing factor in the denominator gets broken out as its own thing, the partition function, \[ z(\beta) \equiv \sum_{k=0}^{\infty}{e^{-\beta u_k}} \]

This isn't the moment generating function, but it's close to the moment generating function, thanks to the magic properties of exponentials: \begin{eqnarray} M_U(t) & = & \Expect{e^{tU}}\\ & = & \sum_{k}{e^{t u_k} \frac{e^{-\beta u_k}}{z(\beta)}}\\ & = & \frac{\sum_{k}{e^{-(\beta-t) u_k}}}{z(\beta)}\\ & = & \frac{z(\beta-t)}{z(\beta)} \end{eqnarray}

That is, the MGF is the ratio of values of the partition function. It follows that \[ C_U(t) = \log{z(\beta-t)} - \log{z(\beta)} \]

So the cumulant generating function is the difference between log partition functions. In terms of extracting the cumulants, however, what we care about are derivatives with respect to \( t \), so the second term doesn't actually matter.

Now, in statistical mechanics, we know that \( f(\beta) \equiv \log{z(\beta)} \) is, up to a factor of \( -1/\beta \), the (Helmholtz) free energy, so we've just convinced ourselves that the free energy is basically the cumulant generating function (at least for Boltzmann distributions). But this is a little funny; in stat. mech. we're taught to take derivatives of \( f \) at inverse temperature \( \beta \) to find the expected energy, the variance around it, etc., but to extract cumulants we take derivatives at \( t=0 \). But \[ C^{\prime}(0) = \left. -\frac{z^{\prime}(\beta-t)}{z(\beta-t)}\right|_{t=0} = -\frac{z^{\prime}(\beta)}{z(\beta)} \] and so on for the higher derivatives, which is what we'd get by taking derivatives of \( f \) at \( \beta \)...
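
Here is a small numerical check of that last point, with some made-up energy levels (my own choice) and finite differences standing in for the derivatives of \( \log{z} \):

import numpy as np

u = np.array([0.0, 1.0, 2.5, 4.0])   # made-up energy levels
beta = 0.7                           # an arbitrary inverse temperature

def log_z(b):
    return np.log(np.sum(np.exp(-b * u)))

# Mean and variance of the energy under the Boltzmann distribution at this beta.
p = np.exp(-beta * u) / np.sum(np.exp(-beta * u))
mean_u = np.sum(p * u)
var_u = np.sum(p * u**2) - mean_u**2

# Finite-difference derivatives of log z at beta.
eps = 1e-4
d1 = (log_z(beta + eps) - log_z(beta - eps)) / (2 * eps)
d2 = (log_z(beta + eps) - 2 * log_z(beta) + log_z(beta - eps)) / eps**2
print(mean_u, -d1)   # first cumulant of U: E[U] = -d log z / d beta
print(var_u, d2)     # second cumulant of U: Var(U) = d^2 log z / d beta^2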

