Notebooks

## A very naive approach to held-out likelihood for topic models

06 Mar 2024 21:16

Attention conservation notice: Assumes you know what a topic model is, and care. This arose from trying to help undergrads come up with a quick-and-dirty hack for a data-analysis project, that they could implement with minimal coding. I haven't checked it against the literature, and, to the extent it has any value, it's probably due to half-digested memories of actual papers.
$\newcommand{\DocumentIndex}{d} \newcommand{\TopicIndex}{t} \newcommand{\WordIndex}{w} \newcommand{\NumTopics}{k} \newcommand{\TopicDist}{\phi} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\TopicSharesInDoc}{\theta} \newcommand{\TopicSharesInCorpus}{\alpha}$

Say we have a fitted topic model, in the form of a set of topic distributions $$\TopicDist_1, \TopicDist_2, \ldots, \TopicDist_\NumTopics$$, a base measure $$\TopicSharesInCorpus$$ on $$1:\NumTopics$$, and a concentration parameter $$\alpha_0$$. We also have a new document which we've reduced to a bag of words. Specifically, $$x_\WordIndex$$ is the number of times word $$\WordIndex$$ appears in the new document, and $$\sum_{\WordIndex}{x_\WordIndex} = n$$.

We want to evaluate the probability of this document, according to this model: $$p(x; \TopicDist, \alpha, \alpha_0) = \int{p(x|\TopicSharesInDoc; \TopicDist, \alpha, \alpha_0) p(\TopicSharesInDoc; \TopicDist, \alpha, \alpha_0) d\TopicSharesInDoc} ~ (1) \label{eqn:integrated-likelihood}$$ This is a notoriously intractable integral, even though the first part is an easy multinomial, $$p(x|\TopicSharesInDoc; \TopicDist, \alpha, \alpha_0) = \prod_{\WordIndex}{\left(\sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\TopicIndex} \TopicDist_{\TopicIndex}(\WordIndex)}\right)^{x_{\WordIndex}}} ~ (2) \label{eqn:conditional-likelihood}$$ and the second part is just a Dirichlet.
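(The easy part, Eq. 2, really is easy: here is a minimal NumPy sketch of the conditional log-likelihood. The function name and array layout are my choices, not anything fixed by the model.)

```python
import numpy as np

def conditional_log_lik(x, theta, phi):
    """Log of Eq. 2: log p(x | theta; phi), word-by-word, with no
    multinomial coefficient (matching Eq. 2 as written).

    x     : (V,) word counts for the new document
    theta : (k,) within-document topic shares, summing to 1
    phi   : (k, V) topic-word distributions, rows summing to 1
    """
    p = theta @ phi            # mixture word probabilities, shape (V,)
    return float(x @ np.log(p))
```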

Now here is a very naive approach. Define $$y_{\WordIndex} = x_{\WordIndex}/n$$, the empirical frequency of the word $$\WordIndex$$ in the document. Successive words are conditionally IID, so $$y_{\WordIndex}$$ should, in a large document, approach the probability of $$\WordIndex$$ used to generate the document. In fact, for large $$n$$, $$y \approx \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\TopicIndex} \TopicDist_{\TopicIndex}}$$ i.e., for all $$\WordIndex$$, $$y_\WordIndex \approx \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\TopicIndex} \TopicDist_{\TopicIndex}(\WordIndex)}$$ This last equation suggests a simple way to estimate $$\TopicSharesInDoc$$: run a linear regression of $$y_\WordIndex$$ (the dependent variable) on $$\TopicDist_{\TopicIndex}(\WordIndex)$$ (the independent variables).
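A minimal sketch of that regression step, assuming NumPy arrays. Plain least squares ignores the constraints that the shares be nonnegative and sum to one, so I clip and renormalize afterwards; that crude repair is my addition, not part of the proposal above.

```python
import numpy as np

def estimate_theta(x, phi):
    """Estimate within-document topic shares by regressing the empirical
    word frequencies y_w on the topic distributions phi_t(w), no intercept.

    x   : (V,) word counts for the new document
    phi : (k, V) topic-word distributions
    """
    y = x / x.sum()                              # empirical frequencies y_w
    # Least squares for y ~ phi.T @ theta:
    theta, *_ = np.linalg.lstsq(phi.T, y, rcond=None)
    # Crude repair of the simplex constraints (my addition):
    theta = np.clip(theta, 1e-12, None)
    return theta / theta.sum()
```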

Having obtained an estimate $$\widehat{\TopicSharesInDoc}$$, I propose we evaluate the likelihood conditional on that estimate: $$p(x; \TopicDist, \alpha, \alpha_0) \approx p(x|\widehat{\TopicSharesInDoc}; \TopicDist) p(\widehat{\TopicSharesInDoc}; \TopicDist, \alpha, \alpha_0)$$ using Eq. 2, and the corresponding Dirichlet factor.
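Putting the two factors together on a log scale, a sketch using SciPy's Dirichlet density; I am reading the notation as saying the Dirichlet parameter vector is $$\alpha_0 \TopicSharesInCorpus$$, which is an assumption on my part.

```python
import numpy as np
from scipy.stats import dirichlet

def plug_in_log_lik(x, theta_hat, phi, alpha, alpha0):
    """Plug-in approximation to the log of Eq. 1:
    log p(x | theta_hat; phi) + log Dirichlet(theta_hat; alpha0 * alpha).

    x         : (V,) word counts
    theta_hat : (k,) estimated topic shares (strictly inside the simplex)
    phi       : (k, V) topic-word distributions
    alpha     : (k,) base measure on topics
    alpha0    : scalar concentration parameter
    """
    p = theta_hat @ phi
    cond = float(x @ np.log(p))                       # Eq. 2, log scale
    prior = dirichlet.logpdf(theta_hat, alpha0 * np.asarray(alpha))
    return cond + prior
```

Note that `dirichlet.logpdf` will complain if any component of `theta_hat` is exactly zero, which is another reason to nudge the regression estimate into the interior of the simplex.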

The rationale for doing so is that multinomial distributions enjoy nice large-deviations properties, meaning that for large samples, the probability of the empirical distribution being even $$h$$ away in KL divergence from the true distribution is exponentially small in $$nh$$. So the integral in Eq. 1 has the form $$\int{e^{-nf(b)} \pi(b) db}$$, and I am proposing to just replace it with evaluation at the minimizing $$b^*$$. One could of course do a fuller Laplace approximation, perhaps justified by the fact that an $$\epsilon$$ change to a parametric distribution typically generates only an order $$\epsilon^2$$ KL divergence. (That is, $$h \simeq \epsilon^2$$.)

(I have written out some notes explaining Laplace approximation. The essence is to Taylor-expand $$f$$ around its minimum $$b^*$$ to second order, so one picks up contributions not just from the minimum, but from a region around the minimum whose size depends on how sharp the minimum is, via the Hessian matrix of second partial derivatives. Again, see that notebook.)

Applied here, Laplace's method says $$p(x; \TopicDist, \alpha, \alpha_0) \approx p(x|\widehat{\TopicSharesInDoc}; \TopicDist) p(\widehat{\TopicSharesInDoc}; \TopicDist, \alpha, \alpha_0) \sqrt{\frac{(2\pi)^{\NumTopics}}{n^{\NumTopics} |\nabla \nabla f(\widehat{\TopicSharesInDoc})|}}$$ Here the Hessian $$\nabla\nabla f$$ is basically the Fisher information of the conditional likelihood, Eq. (2), with respect to the within-document topic shares. On a log-likelihood scale, the whole factor in the square root will be $$O(\log{n})$$, and so the contribution to the log-likelihood per observation will be vanishing, $$O(\frac{\log{n}}{n})$$, compared to just using the maximand.
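For the curious, here is a sketch of the log of that square-root factor. With $$f(\TopicSharesInDoc) = -\frac{1}{n}\sum_{\WordIndex}{x_\WordIndex \log{\sum_{\TopicIndex}{\TopicSharesInDoc_\TopicIndex \TopicDist_\TopicIndex(\WordIndex)}}}$$, the Hessian is available in closed form, $$\frac{1}{n}\sum_{\WordIndex}{\frac{x_\WordIndex}{p_\WordIndex^2} \TopicDist(\WordIndex)\TopicDist(\WordIndex)^T}$$. Treating $$\TopicSharesInDoc$$ as $$\NumTopics$$ unconstrained coordinates is a simplification on my part; a careful version would reparametrize onto the $$(\NumTopics-1)$$-dimensional simplex.

```python
import numpy as np

def laplace_log_correction(x, theta_hat, phi):
    """Log of the Laplace factor sqrt((2*pi/n)^k / det(Hessian of f)),
    where f(theta) = -(1/n) * sum_w x_w * log(sum_t theta_t * phi_t(w)).

    Treats theta as k free coordinates, ignoring the simplex constraint.
    """
    n = x.sum()
    k = len(theta_hat)
    p = theta_hat @ phi                       # (V,) mixture probabilities
    # Hessian of f: (1/n) * sum_w (x_w / p_w^2) * phi[:, w] phi[:, w]^T
    H = (phi * (x / p**2)) @ phi.T / n        # (k, k)
    _sign, logdet = np.linalg.slogdet(H)
    return 0.5 * (k * np.log(2 * np.pi / n) - logdet)
```

Adding this to the plug-in log-likelihood gives the Laplace-corrected approximation; as noted, the correction is $$O(\log{n})$$ in total, so it washes out per-word.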

First draft, 28 November 2023. Corrected some obvious typos and worked in a paragraph from other notes explaining Laplace approximation, 5 March 2024.