A very naive approach to held-out likelihood for topic models
Last update: 21 Apr 2025 21:17
First version: 19 January 2024
Attention conservation notice: Assumes you know what a topic model is, and care. This arose from trying to help undergrads come up with a quick-and-dirty hack for a data-analysis project, one they could implement with minimal coding. I haven't checked it against the literature, and, to the extent it has any value, it's probably due to half-digested memories of actual papers.
Say we have a fitted topic model, in the form of a set of topic distributions $\beta_1, \beta_2, \ldots \beta_K$ over a vocabulary of $V$ word types (so $\beta_{kv}$ is the probability of word $v$ under topic $k$), a base measure $m$ on the simplex $\Delta^{K-1}$, and a concentration parameter $s$. We also have a new document which we've reduced to a bag of words. Specifically, $n_v$ is the number of times word $v$ appears in the new document, and $n = \sum_{v=1}^{V}{n_v}$.
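To make the later steps concrete, here is a minimal sketch (in Python) of the objects I'll assume in the code fragments below; the array names `beta`, `counts`, `s` and `m`, and the toy numbers, are mine, not anything from a real fitted model.

```python
# Toy stand-ins for a fitted topic model and a new bag-of-words document.
import numpy as np

K, V = 3, 8                               # number of topics, vocabulary size
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(V), size=K)  # K x V; row k is topic k's distribution over words
m = np.ones(K) / K                        # base measure on the simplex
s = 5.0                                   # concentration parameter
counts = rng.multinomial(200, m @ beta)   # n_v for the new document; n = counts.sum()
```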
We want to evaluate the probability of this document, according to this model:
$$
P(n_1, n_2, \ldots n_V) = \int_{\Delta^{K-1}}{\binom{n}{n_1, n_2, \ldots n_V} \prod_{v=1}^{V}{\left( \sum_{k=1}^{K}{\theta_k \beta_{kv}} \right)^{n_v}} \mathrm{Dir}(\theta; s m) \, d\theta} \tag{1}
$$
where $\theta \in \Delta^{K-1}$ is the vector of within-document topic shares, with its $\mathrm{Dir}(s m)$ prior. This is a notoriously intractable integral, even though the first part of the integrand is an easy multinomial,
$$
P(n_1, n_2, \ldots n_V \mid \theta) = \binom{n}{n_1, n_2, \ldots n_V} \prod_{v=1}^{V}{\left( \sum_{k=1}^{K}{\theta_k \beta_{kv}} \right)^{n_v}} , \tag{2}
$$
and the second part is just a Dirichlet.
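To fix ideas, here is a sketch of the easy conditional part, Eq. 2, on a log scale, using the arrays from the snippet above; the function name is mine, and it assumes every word has positive probability under the mixture.

```python
import numpy as np
from scipy.special import gammaln

def doc_log_likelihood(theta, beta, counts):
    """Log of Eq. 2: multinomial log-likelihood of the word counts,
    given within-document topic shares theta (length K, summing to 1)."""
    word_probs = theta @ beta                  # length-V mixture of the topics
    n = counts.sum()
    log_multinom_coef = gammaln(n + 1) - gammaln(counts + 1).sum()
    return log_multinom_coef + (counts * np.log(word_probs)).sum()
```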
Now here is a very naive approach.
Define $f_v \equiv n_v / n$, the empirical frequency of the $v^{\mathrm{th}}$ word in the document. Successive words are conditionally IID, so $f_v$ should, in a large document, approach the probability of word $v$ under the mixture distribution used to generate the document. In fact, for large $n$,
$$
f \rightarrow \beta^{T} \theta
$$
(where $\beta$ is the $K \times V$ matrix of topic-word probabilities), i.e., for all $v$,
$$
f_v \approx \sum_{k=1}^{K}{\theta_k \beta_{kv}} .
$$
This last equation suggests a simple way to estimate $\theta$: run a linear regression of the $f_v$ (the dependent variable) on the $\beta_{kv}$ (the independent variables).
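A minimal sketch of that regression, continuing the snippets above. The post only calls for a plain linear regression (no intercept) of the $f_v$ on the $\beta_{kv}$; using non-negative least squares and renormalizing, so that the estimate lands on the simplex, is my own tweak.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_theta(beta, counts):
    """Estimate within-document topic shares by regressing f_v on the beta_{kv}."""
    f = counts / counts.sum()           # empirical word frequencies
    theta_hat, _ = nnls(beta.T, f)      # least squares for f ~ beta^T theta, with theta >= 0
    return theta_hat / theta_hat.sum()  # renormalize so the shares sum to 1
```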
Having obtained an estimate $\hat{\theta}$, I propose we evaluate the likelihood conditional on that estimate,
$$
P(n_1, n_2, \ldots n_V \mid \hat{\theta}) \, \mathrm{Dir}(\hat{\theta}; s m) ,
$$
using Eq. 2, and the corresponding Dirichlet factor.
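Putting the pieces together, a sketch of this plug-in evaluation, carrying over the functions defined above; the small clipping away from the boundary of the simplex is my own guard, so the Dirichlet density stays finite and computable.

```python
import numpy as np
from scipy.stats import dirichlet

def plugin_log_prob(beta, counts, s, m, eps=1e-12):
    theta_hat = estimate_theta(beta, counts)
    theta_hat = np.clip(theta_hat, eps, None)              # keep strictly inside the simplex
    theta_hat = theta_hat / theta_hat.sum()
    log_lik = doc_log_likelihood(theta_hat, beta, counts)  # Eq. 2 at theta-hat
    log_prior = dirichlet.logpdf(theta_hat, s * m)         # the corresponding Dirichlet factor
    return log_lik + log_prior
```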
The rationale for doing so is that multinomial distributions enjoy nice large-deviations properties, meaning that for large samples, the probability of the empirical distribution being even $\epsilon$ away in KL divergence from the true distribution is exponentially small in $n$. So the integral in Eq. 1 has the form $\int{e^{-n h(\theta)} d\theta}$, where the $\theta$-dependent part of $h$ is essentially $D(f \| \beta^T \theta)$, and I am proposing to just replace it with evaluation at the minimizing $\theta$. One could of course do a fuller Laplace approximation, perhaps justified by the fact that an $O(\epsilon)$ change to a parametric distribution typically generates only an order $O(\epsilon^2)$ KL divergence. (That is, $D\left( \rho_{\psi} \| \rho_{\psi + \epsilon} \right) = O(\epsilon^2)$.)
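As a quick numerical illustration of that last claim (my own toy example, not anything from above), take the Bernoulli family: shifting the parameter by $\epsilon$ changes the distribution by $O(\epsilon)$, but the KL divergence shrinks like $\epsilon^2$.

```python
import numpy as np

def kl_bernoulli(p, q):
    # KL divergence D( Bernoulli(p) || Bernoulli(q) )
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
for eps in [0.1, 0.01, 0.001]:
    print(eps, kl_bernoulli(p, p + eps), kl_bernoulli(p, p + eps) / eps**2)
# the ratio to eps^2 settles down near 1 / (2 p (1 - p)) ~= 2.38 as eps shrinks
```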
(I have written out some notes
explaining Laplace approximation. The essence is to Taylor-expand
$h$ around its minimum to second order, so one picks up contributions not just from the minimum, but from a region around the minimum whose
size depends on how sharp the minimum is, via the Hessian matrix of
second partial derivatives. Again, see that notebook.)
Applied here, Laplace's method says
$$
\int_{\Delta^{K-1}}{e^{-n h(\theta)} d\theta} \approx e^{-n h(\hat{\theta})} \sqrt{\frac{(2\pi/n)^{K-1}}{\left| \nabla \nabla h(\hat{\theta}) \right|}} .
$$
Here the Hessian $\nabla \nabla h(\hat{\theta})$, taken along the $K-1$ free directions of the simplex, is basically the Fisher information of the conditional likelihood, Eq. (2), with respect to the within-document topic shares. On a log-likelihood scale, the whole factor in the square root will be $O(\log{n})$, and so the contribution to the log-likelihood per observation will be a vanishing $O(\log{n}/n)$, compared to just using the maximand.
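For completeness, here is a sketch of that fuller Laplace approximation, again carrying over the functions above. It treats the regression estimate as if it were the exact minimizer, assumes it lies in the interior of the simplex, and gets the Hessian by finite differences over the first $K-1$ coordinates; all of those choices are mine, not guarantees about the right way to do this.

```python
import numpy as np
from scipy.stats import dirichlet

def neg_log_integrand(theta_free, beta, counts, s, m):
    """n*h(theta): minus the log of the integrand of Eq. 1, with theta
    parameterized by its first K-1 coordinates."""
    theta = np.append(theta_free, 1.0 - theta_free.sum())
    return -(doc_log_likelihood(theta, beta, counts) + dirichlet.logpdf(theta, s * m))

def laplace_log_prob(beta, counts, s, m, h=1e-5):
    theta_hat = estimate_theta(beta, counts)
    x0 = theta_hat[:-1]
    f = lambda x: neg_log_integrand(x, beta, counts, s, m)
    d = len(x0)
    # central-difference Hessian of the negative log integrand at theta-hat
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h**2)
    _, logdet = np.linalg.slogdet(H)
    # log of Eq. 1 via Laplace: the maximand plus the Gaussian-volume correction
    return -f(x0) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```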
First draft, 28 November 2023. Corrected some obvious typos and worked in a paragraph from
other notes explaining Laplace approximation, 5 March 2024.