Likelihood, and Maximum Likelihood, in Statistics
Last update: 01 Jul 2026 23:12
First version: 24 June 2026
We observe a random variable \( X \). (It may be a big, hairy, high-dimensional beast, with lots of components, but we'll treat it as one object for now.) We also have a probability model with an adjustable parameter \( \theta \). (This may also be an enormous infinite-dimensional object.) For each \( \theta \) we get a distribution for \( X \), say \( p(x;\theta) \equiv \mathrm{Prob}_{\theta}(X=x) \). That is, the probability model tells us, for each parameter value, the probability of any particular outcome. Ordinarily, we tend to look at how \( p(x;\theta) \) changes with \( x \), for some particular, fixed \( \theta \).
What statisticians have come to call the likelihood function is \[ L(\theta) \equiv p(X; \theta) \] This is the probability of the data as a function of the parameter. That is, it tells us the probability of observing what we did observe, as we consider varying the parameter.
A natural and compelling approach to parameter estimation is then the method of maximum likelihood: guess that the true parameter value is the one which makes the observed data as probable as possible. This is, as I said, natural and compelling, and it works (is consistent/probably-approximately-correct) under a broad range of circumstances, but unfortunately it doesn't always work.
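To make this concrete, here is a minimal sketch in Python. The binomial model, the observed counts, and the grid of candidate values are all assumptions chosen for illustration; nothing above commits to them. It evaluates \( L(\theta) \) at the observed data for each candidate \( \theta \) and picks the maximizer.

```python
import math

def binomial_likelihood(theta, successes, trials):
    """L(theta) = Prob_theta(X = successes) when X ~ Binomial(trials, theta)."""
    return (math.comb(trials, successes)
            * theta ** successes
            * (1.0 - theta) ** (trials - successes))

# Hypothetical observation: 7 successes in 20 trials.
successes, trials = 7, 20

# Hold the data fixed and vary the parameter over a grid in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
likelihoods = [binomial_likelihood(t, successes, trials) for t in grid]

# The maximum likelihood estimate is the grid point with the largest likelihood.
theta_hat = grid[likelihoods.index(max(likelihoods))]
print(theta_hat)
```

For the binomial the maximizer is known in closed form (the observed frequency, successes over trials), so the grid search is only there to show the "vary \( \theta \), hold the data fixed" logic directly.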
To see a little bit about why it typically works, but doesn't always, notice that \( L(\theta) \) is a random function, i.e., a stochastic process. (It is a process "indexed", as we say in the trade, by the parameter space, which may be weird, but still a process.) The method of maximum likelihood looks for the maximum of this random function, and hopes that it converges on the true parameter value. But convergence of stochastic processes is a somewhat delicate business. In many situations, the likelihood function does converge to a sensible, deterministic limiting function which is uniquely maximized at the true parameter value. (When this happy state of affairs applies, the limiting function has nice information-theoretic interpretations.) But there are, alas, times when the convergence just does not work.
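The happy case can be sketched numerically. The following assumes i.i.d. Bernoulli observations with a hypothetical true parameter of 0.3 (neither assumption comes from the text above), and compares the normalized log-likelihood \( n^{-1} \log L_n(\theta) \) with its deterministic limit, the expected log-likelihood of a single observation.

```python
import math
import random

def average_log_likelihood(theta, xs):
    """(1/n) log L_n(theta) for i.i.d. Bernoulli(theta) observations xs."""
    mean = sum(xs) / len(xs)
    return mean * math.log(theta) + (1.0 - mean) * math.log(1.0 - theta)

def limiting_function(theta, theta_true):
    """Expected log-likelihood of one observation when the truth is theta_true."""
    return (theta_true * math.log(theta)
            + (1.0 - theta_true) * math.log(1.0 - theta))

theta_true = 0.3                        # hypothetical true parameter
grid = [k / 100 for k in range(5, 96)]  # candidate values 0.05, ..., 0.95
rng = random.Random(0)                  # fixed seed, for reproducibility

# Largest gap over the grid between the random function and its limit.
gaps = {}
for n in (10, 1000, 100000):
    xs = [1 if rng.random() < theta_true else 0 for _ in range(n)]
    gaps[n] = max(abs(average_log_likelihood(t, xs)
                      - limiting_function(t, theta_true))
                  for t in grid)
    print(n, gaps[n])

# The limiting function is maximized at the true parameter value.
limit_argmax = max(grid, key=lambda t: limiting_function(t, theta_true))
print(limit_argmax)
```

In this toy model the amount by which the limiting function falls below its maximum at any \( \theta \) is the Kullback-Leibler divergence from the true distribution to the one at \( \theta \), which is one instance of the information-theoretic interpretation. The grid is kept away from 0 and 1 on purpose, since the logarithms blow up at the boundary and uniform convergence there is exactly the kind of delicate business in question.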
[TODO: likelihood ratio tests]
Now, I should at this point admit that the way I've defined likelihood above only works when \( X \) is discrete. If \( X \) is continuous, then one needs to work with probability densities rather than mass functions, which I think makes the rhetoric a bit less persuasive. It also opens the way, to those who've learned measure-theoretic probability, to a more general definition.
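A small Python sketch of the continuous case, assuming a Gaussian model with known standard deviation and a made-up five-point data set (both are illustrative assumptions): the likelihood is now a product of density values, which can exceed 1 and so cannot be read as "the probability of the data", yet maximizing it proceeds exactly as before.

```python
import math

def normal_density(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, with respect to Lebesgue measure."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(mu, data, sigma):
    """Product of densities of independent observations, as a function of mu."""
    return math.prod(normal_density(x, mu, sigma) for x in data)

data = [0.9, 1.0, 1.1, 1.05, 0.95]  # hypothetical observations
sigma = 0.1                         # assumed known

# Grid of candidate means from 0.5 to 1.5 in steps of 0.001.
grid = [0.5 + k / 1000 for k in range(1001)]
mu_hat = max(grid, key=lambda mu: likelihood(mu, data, sigma))

print(mu_hat)                           # maximizer on the grid
print(likelihood(mu_hat, data, sigma))  # a density value, larger than 1 here
```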
(For each \( \theta \), say \( P_{\theta} \) is a probability measure on \( \mathcal{X} \), and these are all absolutely continuous with respect to some reference measure \( M \) (not necessarily a probability measure). Then we define \( L(\theta) = \frac{dP_{\theta}}{d M}(X) \), using the Radon-Nikodym derivative. This makes the exact likelihood function relative to the choice of reference measure \( M \), but notice that for any other reference measure \( N \) (one with respect to which \( M \) is itself absolutely continuous), we'd have \( \frac{dP_{\theta}}{d N}(X) = \frac{d P_{\theta}}{dM}(X) \frac{dM}{dN}(X) \), so changing the reference measure doesn't change relative likelihoods, the location of the maximum likelihood estimate, etc.)
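Spelled out as a worked equation: the sketch below assumes \( \frac{dM}{dN}(X) \) exists and is strictly positive and finite at the observed \( X \), and uses \( \theta_1, \theta_2 \) for two arbitrary parameter values (names introduced here only for illustration). The factor \( \frac{dM}{dN}(X) \) does not involve \( \theta \), so it cancels from ratios and cannot move the maximizer.

```latex
\[
\frac{\frac{dP_{\theta_1}}{dN}(X)}{\frac{dP_{\theta_2}}{dN}(X)}
= \frac{\frac{dP_{\theta_1}}{dM}(X)\,\frac{dM}{dN}(X)}
       {\frac{dP_{\theta_2}}{dM}(X)\,\frac{dM}{dN}(X)}
= \frac{\frac{dP_{\theta_1}}{dM}(X)}{\frac{dP_{\theta_2}}{dM}(X)},
\qquad
\operatorname*{argmax}_{\theta} \frac{dP_{\theta}}{dN}(X)
= \operatorname*{argmax}_{\theta} \frac{dP_{\theta}}{dM}(X)
\]
```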
I should also admit that the idea that one can simply calculate the probability of a given outcome from a probability model is often rather optimistic. This has opened up a range of pseudo-, quasi-, synthetic, and other likelihoods, which try to retain some of the formal structure, while ditching the full probability calculations. One of my reasons for breaking out this notebook is the hope that it will encourage me to wrap my head around these not-quite-likelihoods. (I think I could define the difference between a pseudo- and a quasi- likelihood if I had to, but it's embarrassing for someone in my position not to be sure.)
- See also:
- Empirical Likelihood
- Large Deviations and Information Theory in the Foundations of Statistics
- Recommended, big picture:
- Stephen M. Stigler, "The Epic Story of Maximum Likelihood", Statistical Science 22 (2007): 598--620, arxiv:0804.2996
- Recommended, close-ups:
- Ronald W. Butler, "Predictive Likelihood Inference with Applications", Journal of the Royal Statistical Society B 48 (1986): 1--38 ["in the predictive setting, all parameters are nuisance parameters". JSTOR]
- Bradley Efron, "Maximum Likelihood and Decision Theory", The Annals of Statistics 10 (1982): 340--356
- Charles J. Geyer, "Le Cam Made Simple: Asymptotics of Maximum Likelihood without the LLN or CLT or Sample Size Going to Infinity", arxiv:1206.4762 [There are two separable points here. One is that much of the usual asymptotic theory of maximum likelihood follows from the quadratic form of the likelihood alone; whenever and however that is reached, those consequences follow. Approximately quadratic likelihoods imply approximations to the usual asymptotics. This is unquestionably correct. The other is some bashing of results like the law of large numbers and central limit theorem, which seems misguided to me.]
- Bruce E. Hansen, "The Likelihood Ratio Test Under Nonstandard Conditions: Testing the Markov Switching Model of GNP", Journal of Applied Econometrics 7 (1992): S61--S82 [I very much like the approach of treating the likelihood ratio as an empirical process; why haven't I seen it before? (Also, the state-of-the-art in simulating Gaussian processes must be much better now than what Hansen had in '92, which would make this even more practical.) PDF reprint.]
- Christopher C. Heyde, Quasi-Likelihood and Its Applications: A General Approach to Optimal Parameter Estimation
- Lucien Le Cam, "Maximum Likelihood; An Introduction" [PDF. Not an introduction, but rather a collection of examples of where it just does not work, or at least doesn't work well. That this is presented as "an introduction" is entirely characteristic of the author.]
- Erich L. Lehmann, "On likelihood ratio tests", math.ST/0610835
- Steven G. Self and Kung-Yee Liang, "Asymptotic Properties of Maximum Likelihood Estimators and Likelihood Ratio Tests Under Nonstandard Conditions", Journal of the American Statistical Association 82 (1987): 605--610 [JSTOR]
- Quang H. Vuong, "Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses", Econometrica 57 (1989): 307--333
- To read:
- Daniel Commenges, "Statistical models: Conventional, penalized and hierarchical likelihood", Statistics Surveys 3 (2009): 1--17, arxiv:0808.4042
- Daniel Commenges, Helene Jacqmin-Gadda, Cecile Proust, and Jeremie Guedj, "A Newton-Like Algorithm for Likelihood Maximization: The Robust-Variance Scoring Algorithm", math.ST/0610402
- Joshua V. Dillon and Guy Lebanon, "Stochastic Composite Likelihood", Journal of Machine Learning Research 11 (2010): 2597--2633, apparently the final version of arxiv:1003.0691
- Mathias Drton, "Likelihood ratio tests and singularities", Annals of Statistics 37 (2009): 979--1012, arxiv:math.ST/0703360
- David Hinkley, "Predictive Likelihood", Annals of Statistics 7 (1979): 718--728
- Thomas Jaki and R. Webster West, "Maximum Kernel Likelihood Estimation", Journal of Computational and Graphical Statistics 17 (2008): 976--993
- Jiantao Jiao, Kartik Venkat, Tsachy Weissman, "Maximum Likelihood Estimation of Functionals of Discrete Distributions", arxiv:1406.6959
- Adam M. Johansen, Arnaud Doucet and Manuel Davy, "Particle methods for maximum likelihood estimation in latent variable models", Statistics and Computing 18 (2008): 47--57
- Youngjo Lee and John A. Nelder, "Likelihood Inference for Models with Unobservables: Another View", Statistical Science 24 (2009): 255--269, arxiv:1010.0303 [with discussion and replies following]
- Richard Nickl, "Donsker-type theorems for nonparametric maximum likelihood estimators", Probability Theory and Related Fields 138 (2007): 411--449
- Yudi Pawitan, In All Likelihood: Statistical Modeling and Inference Using Likelihood
- Sylvain Rubenthaler, Tobias Ryden and Magnus Wiktorsson, "Fast simulated annealing in \( \mathbb{R}^d \) and an application to maximum likelihood estimation", math.PR/0609353
- Xiaogang Wang and James V. Zidek, "Selecting likelihood weights by cross-validation", Annals of Statistics 33 (2005): 463--500, math.ST/0505599