Notebooks

Likelihood, and Maximum Likelihood, in Statistics

Last update: 01 Jul 2026 23:12
First version: 24 June 2026

We observe a random variable \( X \). (It may be a big, hairy, high-dimensional beast, with lots of components, but we'll treat it as one object for now.) We also have a probability model with an adjustable parameter \( \theta \). (This may also be an enormous infinite-dimensional object.) For each \( \theta \) we get a distribution for \( X \), say \( p(x;\theta) \equiv \mathrm{Prob}_{\theta}(X=x) \). That is, the probability model tells us, for each parameter value, the probability of any particular outcome. Ordinarily, we tend to look at how \( p(x;\theta) \) changes with \( x \), for some particular, fixed \( \theta \).

What statisticians have come to call the likelihood function is \[ L(\theta) \equiv p(X; \theta) . \] This is the probability of the data as a function of the parameter. That is, it tells us the probability of observing what we did observe, as we consider varying the parameter.
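To make the change of viewpoint concrete, here is a minimal sketch for a toy model of my own choosing (not anything from above): independent coin flips, each coming up heads with probability \( \theta \). The function name `likelihood` and the data in `observed` are purely illustrative assumptions.

```python
from math import prod

def likelihood(theta, flips):
    """Probability of the observed 0/1 sequence `flips`, assuming each flip
    is an independent Bernoulli(theta) draw (an assumed toy model)."""
    return prod(theta if x == 1 else 1.0 - theta for x in flips)

# Hypothetical observed data: 7 heads in 10 flips.
observed = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# The data stay fixed; only the parameter varies.
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"theta = {theta:.1f}  L(theta) = {likelihood(theta, observed):.6f}")
```

Note that the printed values need not sum to one over \( \theta \): the likelihood is a function of the parameter, not a probability distribution over it.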

A natural and compelling approach to parameter estimation is then the method of maximum likelihood: guess that the true parameter value is the one which makes the observed data as probable as possible. This is, as I said, natural and compelling, and it works (is consistent/probably-approximately-correct) under a broad range of circumstances, but unfortunately it doesn't always work.
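As a sketch of what this looks like in practice, take a toy model of independent coin flips with heads-probability \( \theta \) (my assumption, chosen only because everything can be checked by hand). The code maximizes the log-likelihood over a grid of parameter values and compares the result with the closed-form answer for this model, the observed fraction of heads. The names `log_likelihood` and `mle_grid`, the grid size, and the data are all hypothetical.

```python
from math import log

def log_likelihood(theta, flips):
    """Log-probability of the 0/1 sequence `flips` under independent
    Bernoulli(theta) flips (an assumed toy model)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return heads * log(theta) + tails * log(1.0 - theta)

def mle_grid(flips, grid_size=999):
    """Crude maximum likelihood: try evenly spaced values of theta strictly
    between 0 and 1 and keep the one with the largest log-likelihood."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda theta: log_likelihood(theta, flips))

# Hypothetical observed data: 7 heads in 10 flips.
observed = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

theta_hat = mle_grid(observed)
print("grid-search MLE:   ", theta_hat)
print("fraction of heads: ", sum(observed) / len(observed))
```

Working with the logarithm changes nothing about where the maximum sits, since the log is increasing, but it turns products into sums and avoids numerical underflow with larger data sets. A grid search is only sensible for a one-dimensional toy like this one.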

To see a little bit about why it typically works, but doesn't always, notice that \( L(\theta) \) is a random function, i.e., a stochastic process. (It is a process "indexed", as we say in the trade, by the parameter space, which may be weird, but still a process.) The method of maximum likelihood looks for the maximum of this random function, and hopes that it converges on the true parameter value. But convergence of stochastic processes is a somewhat delicate business. In many situations, the likelihood function (suitably normalized, and on a log scale) does converge to a sensible, deterministic limiting function which is uniquely maximized at the true parameter value. (When this happy state of affairs applies, the limiting function has nice information-theoretic interpretations.) But there are, alas, times when the convergence just does not work.
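For the simplest case, here is a sketch of what the happy state of affairs looks like. Assume (this is my illustrative assumption, not a general requirement) that \( X = (X_1, \ldots, X_n) \) consists of IID discrete observations, with true parameter value \( \theta_0 \), and write \( L_n \) for the likelihood based on \( n \) of them. Then, for each fixed \( \theta \), the law of large numbers gives \[ \frac{1}{n} \log L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p(X_i; \theta) \rightarrow \mathbb{E}_{\theta_0}\left[ \log p(X_1; \theta) \right] = -H(\theta_0) - D(\theta_0 \| \theta) , \] where \( H(\theta_0) = -\sum_{x} p(x;\theta_0) \log p(x;\theta_0) \) is the entropy of the true distribution, and \( D(\theta_0 \| \theta) = \sum_{x} p(x;\theta_0) \log \frac{p(x;\theta_0)}{p(x;\theta)} \) is the Kullback-Leibler divergence. Since \( D(\theta_0 \| \theta) \geq 0 \), with equality only when the two distributions coincide, the limiting function is maximized at \( \theta_0 \), and uniquely so if no other parameter value gives the same distribution. This is only convergence at each \( \theta \) separately; getting the location of the maximum to converge generally takes something stronger, such as convergence which is uniform over the parameter space, and that is one of the places where things can go wrong.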

[TODO: likelihood ratio tests]

Now, I should at this point admit that the way I've defined likelihood above only works when \( X \) is discrete. If \( X \) is continuous, then one needs to work with probability densities rather than mass functions, which I think makes the rhetoric a bit less persuasive. It also opens the way, to those who've learned measure-theoretic probability, to a more general definition.

(For each \( \theta \), say \( P_{\theta} \) is a probability measure on \( \mathcal{X} \), and these are all absolutely continuous with respect to some reference measure \( M \) (not necessarily a probability measure). Then we define \( L(\theta) = \frac{dP_{\theta}}{d M}(X) \), using the Radon-Nikodym derivative. This makes the exact likelihood function relative to the choice of reference measure \( M \), but notice that for any other reference measure \( N \), with respect to which \( M \) is itself absolutely continuous, we'd have \( \frac{dP_{\theta}}{d N}(X) = \frac{d P_{\theta}}{dM}(X) \frac{dM}{dN}(X) \). The extra factor \( \frac{dM}{dN}(X) \) does not depend on \( \theta \), so changing the reference measure doesn't change relative likelihoods, the location of the maximum likelihood estimate, etc.)
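A small numerical check of the invariance, under assumptions of my own: the model is IID exponential with rate \( \theta \), \( M \) is Lebesgue measure on the positive reals, and the second reference measure \( N \) has \( dN = dx/x \), so that \( \frac{dM}{dN}(x) = x \). The data, the grid, and the function names are all hypothetical.

```python
from math import log

def loglik_lebesgue(theta, xs):
    """Log-likelihood using densities w.r.t. Lebesgue measure M:
    dP_theta/dM(x) = theta * exp(-theta * x)."""
    return sum(log(theta) - theta * x for x in xs)

def loglik_other(theta, xs):
    """Log-likelihood using densities w.r.t. N, where dN = dx / x, so that
    dM/dN(x) = x and dP_theta/dN(x) = x * theta * exp(-theta * x)."""
    return sum(log(x) + log(theta) - theta * x for x in xs)

# Hypothetical positive observations.
xs = [0.8, 2.1, 0.3, 1.5, 1.3]

# Candidate rates 0.01, 0.02, ..., 5.00.
grid = [i / 100 for i in range(1, 501)]

argmax_m = max(grid, key=lambda t: loglik_lebesgue(t, xs))
argmax_n = max(grid, key=lambda t: loglik_other(t, xs))

# The two log-likelihoods should differ by sum(log x_i), whatever theta is.
gaps = [loglik_other(t, xs) - loglik_lebesgue(t, xs) for t in grid]

print("maximizer under M:", argmax_m)
print("maximizer under N:", argmax_n)
print("spread of the gap across theta:", max(gaps) - min(gaps))
```

The two log-likelihood curves are vertical shifts of each other, so they peak at the same grid point, and any likelihood ratio between two parameter values is the same under either reference measure.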

I should also admit that the idea that one can simply calculate the probability of a given outcome from a probability model is often rather optimistic. This has opened up a range of pseudo-, quasi-, synthetic, and other likelihoods, which try to retain some of the formal structure, while ditching the full probability calculations. One of my reasons for breaking out this notebook is the hope that it will encourage me to wrap my head around these not-quite-likelihoods. (I think I could define the difference between a pseudo- and a quasi- likelihood if I had to, but it's embarrassing for someone in my position not to be sure.)

