Notebooks

Empirical Likelihood

04 Dec 2024 09:33

Yet Another Inadequate Placeholder

This is an extremely clever idea for retaining many of the conveniences of ordinary likelihood-based statistical methods, when the assumptions behind those methods are just too much to swallow. Embarrassingly, however, I can never remember the tricks, so I break this out as a notebook in large part to force myself to re-re-re-learn it, in the hope that it will stick this time.

Inference for the expected value of a function

Say we have data points \( Z_1, Z_2, \ldots Z_n \), iidly distributed with common distribution $\mu$. As every schoolchild knows, the nonparametric maximum likelihood estimate of $\mu$ is the empirical distribution, \[ \hat{\mu}_n(z) \equiv \frac{1}{n}\sum_{i=1}^{n}{\delta_{Z_i}(z)} \] (If you're a physicist you'd rather write the summands as \( \delta(z-Z_i) \), big deal.) This is because any other distribution either puts probability mass on points which aren't observed (hence lowering the likelihood), or mis-allocates probability among the observed points (hence lowering the likelihood).
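As a sanity check on that claim (and because I will otherwise forget how to set these problems up numerically), here is a minimal sketch using generic numpy/scipy optimization, not any dedicated empirical-likelihood package: maximizing \( \sum_{i=1}^{n}{\log{p_i}} \) over probability vectors supported on the observed points does indeed return the uniform weights \( 1/n \).

# Minimal sketch: check numerically that the uniform weights 1/n maximize
# the nonparametric log-likelihood sum_i log(p_i), subject only to the
# weights forming a probability distribution on the observed points.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 10
Z = rng.normal(size=n)             # the data values only enter through n here

res = minimize(
    lambda p: -np.sum(np.log(p)),  # negative log-likelihood of the weights
    rng.dirichlet(np.ones(n)),     # an arbitrary starting allocation
    bounds=[(1e-10, 1.0)] * n,     # p_i >= 0, kept away from 0 for the log
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
    method="SLSQP",
)
print(np.max(np.abs(res.x - 1.0 / n)))   # ~ 0: back to the empirical weights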

Now suppose we think there ought to be some restriction on the distribution, and that this takes the form of some aspect of the distribution matching some function of a parameter: \[ \Expect{g(Z;\theta)} = 0 \]

(In many cases, \( g(Z;\theta) = f(Z) - h(\theta) \), but some flexibility to allow for more complicated forms is harmless at this point.) How do we maximize likelihood, non-parametrically, while respecting this constraint?
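The simplest example, for concreteness: inference for the mean, where \( g(Z;\theta) = Z - \theta \), so the constraint just says that \( \theta \) is the expected value of \( Z \). (This is the case the numerical sketch further below uses.)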

First, we need to recognize that it's still the case that putting any probability on un-observed values of $z$ just lowers the likelihood. So we are only interested in distributions which re-allocate probability among the $n$ data points. These can of course be given by $n$ numbers \( p_1, \ldots p_n \), with \( p_i \geq 0 \), \( \sum_{i=1}^{n}{p_i} = 1 \). Using Lagrange multipliers to turn the constrained problem into an unconstrained one, we get the optimization problem \[ \max_{p_1, \ldots p_n, \theta}{\left\{ \sum_{i=1}^{n}{\log{p_i}} - \lambda \left( \sum_{i=1}^{n}{p_i} - 1 \right) - \gamma \sum_{i=1}^{n}{p_i g(Z_i;\theta)} \right\}} \] Let's pretend that [[the angels]] have told us $\theta$, so we only need to maximize over the weights. Take derivatives with respect to the \( p_i \) and set them to zero at the optimum: \[ \frac{1}{\hat{p}_i} - \hat{\lambda} - \hat{\gamma} g(Z_i;\theta) = 0 \]
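For future reference, the standard next step in this derivation: multiply the first-order condition through by \( \hat{p}_i \) and sum over \( i \). Since \( \sum_{i=1}^{n}{\hat{p}_i} = 1 \) and \( \sum_{i=1}^{n}{\hat{p}_i g(Z_i;\theta)} = 0 \), this gives \( n - \hat{\lambda} = 0 \), i.e., \( \hat{\lambda} = n \), and so \[ \hat{p}_i = \frac{1}{n + \hat{\gamma} g(Z_i;\theta)} \] with \( \hat{\gamma} \) chosen so that the moment constraint actually holds.

And here is a minimal numerical sketch of all this, for the mean-constraint example \( g(Z;\theta) = Z - \theta \) with \( \theta \) handed to us, again using generic numpy/scipy routines rather than any dedicated empirical-likelihood package: solve the constrained maximization directly, and check it against the closed form just above.

# Minimal sketch: empirical likelihood weights at a fixed theta, with the
# moment function g(z; theta) = z - theta (i.e., constraining the mean).
# Solved two ways: (i) direct constrained maximization of sum_i log(p_i),
# and (ii) the closed form p_i = 1 / (n + gamma * g_i), with gamma chosen
# so that sum_i p_i * g_i = 0.  The two solutions should agree.
import numpy as np
from scipy.optimize import brentq, minimize

rng = np.random.default_rng(1)
n = 50
Z = rng.normal(size=n)
theta = 0.2                        # the value "the angels" handed us
g = Z - theta                      # g(Z_i; theta); assumes min(g) < 0 < max(g)

# (i) direct constrained maximization over the weights
res = minimize(
    lambda p: -np.sum(np.log(p)),
    np.ones(n) / n,                # start from the unconstrained optimum
    bounds=[(1e-10, 1.0)] * n,     # p_i >= 0, kept away from 0 for the log
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
                 {"type": "eq", "fun": lambda p: np.dot(p, g)}],
    method="SLSQP",
)
p_numeric = res.x

# (ii) closed form: find gamma so the weighted moment condition holds.
# Every denominator n + gamma * g_i must stay positive, which brackets gamma.
lo = -n / g.max() + 1e-6
hi = -n / g.min() - 1e-6
gamma = brentq(lambda gam: np.sum(g / (n + gam * g)), lo, hi)
p_closed = 1.0 / (n + gamma * g)

print(np.max(np.abs(p_numeric - p_closed)))   # ~ 0: the two solutions agree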

(Here ends the nap-time during which I am writing this; updates to follow, inshallah.)

