A very naive approach to held-out likelihood for topic models

06 Mar 2024 21:16

Attention conservation notice: Assumes you know what a topic model is, and care. This arose from trying to help undergrads come up with a quick-and-dirty hack for a data-analysis project, that they could implement with minimal coding. I haven't checked it against the literature, and, to the extent it has any value, it's probably due to half-digested memories of actual papers.
\[ \newcommand{\DocumentIndex}{d} \newcommand{\TopicIndex}{t} \newcommand{\WordIndex}{w} \newcommand{\NumTopics}{k} \newcommand{\TopicDist}{\phi} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\TopicSharesInDoc}{\theta} \newcommand{\TopicSharesInCorpus}{\alpha} \]

Say we have a fitted topic model, in the form of a set of topic distributions \( \TopicDist_1, \TopicDist_2, \ldots \TopicDist_\NumTopics \), a base measure \( \TopicSharesInCorpus \) on \( 1:k \), and a concentration parameter \( \alpha_0 \). We also have a new document which we've reduced to a bag of words. Specifically, \( x_\WordIndex \) is the number of times word \( \WordIndex \) appears in the new document, and \( \sum_{\WordIndex}{x_\WordIndex} = n \).

We want to evaluate the probability of this document, according to this model: \begin{equation} p(x; \TopicDist, \alpha, \alpha_0) = \int{p(x|\TopicSharesInDoc; \TopicDist, \alpha, \alpha_0) p(\TopicSharesInDoc; \TopicDist, \alpha, \alpha_0) d\TopicSharesInDoc} ~ (1) \label{eqn:integrated-likelihood} \end{equation} This is a notoriously intractable integral, even though the first part is an easy multinomial, \begin{equation} p(x|\TopicSharesInDoc; \TopicDist, \alpha, \alpha_0) = \prod_{\WordIndex}{\left(\sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\TopicIndex} \TopicDist_{\TopicIndex}(\WordIndex)}\right)^{x_{\WordIndex}}} ~ (2) \label{eqn:conditional-likelihood} \end{equation} and the second part is just a Dirichlet.

Now here is a very naive approach. Define \( y_{\WordIndex} = x_{\WordIndex}/n \), the empirical frequency of the word \( \WordIndex \) in the document. Successive words are conditionally IID, so \( y_{\WordIndex} \) should, in a large document, approach the probability of \( \WordIndex \) used to generate the document. In fact, for large \( n \), \begin{equation} y \approx \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\TopicIndex} \TopicDist_{\TopicIndex}} \end{equation} i.e., for all \( \WordIndex \), \begin{equation} y_\WordIndex \approx \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\TopicIndex} \TopicDist_{\TopicIndex}(\WordIndex)} \end{equation} This last equation suggests a simple way to estimate \( \TopicSharesInDoc \): run a linear regression of \( y_\WordIndex \) (the dependent variable) on \( \TopicDist_{\TopicIndex}(\WordIndex) \) (the independent variables).

Having obtained an estimate \( \widehat{\TopicSharesInDoc} \), I propose we evaluate the likelihood conditional on that estimation: \begin{equation} p(x; \TopicDist, \alpha, \alpha_0) \approx p(x|\widehat{\TopicSharesInDoc}; \TopicDist) p(\widehat{\TopicSharesInDoc}; \TopicDist, \alpha, \alpha_0) \end{equation} using Eq. 2, and the corresponding Dirichlet factor.

The rationale for doing so is that multinomial distributions enjoy nice large-deviations properties, meaning that for large samples, the probability of the empirical distribution being even \( h \) away in KL divergence from the true distribution is exponentially small in \( nh \). So the integral in Eq. 1 has the form \( \int{e^{-nf(b)} \pi(b) db} \), and I am proposing to just replace it with evaluation at the minimizing \( b^* \). One could of course do a fuller Laplace approximation, perhaps justified by the fact that an \( \epsilon \) change to a parametric distribution typically generates only an order \( \epsilon^2 \) KL divergence. (That is, \( h \simeq \epsilon^2 \).)

(I have written out some notes explaining Laplace approximation. The essence is to Taylor-expand \( f \) around its minimum \( b^* \) to second order, so one picks up contributions not just from the minimum, but from a region around the minimum whose size depends on how sharp the minimum is, via the Hessian matrix of second partial derivatives. Again, see that notebook.)

Applied here, Laplace's method says \begin{equation} p(x; \TopicDist, \alpha, \alpha_0) \approx p(x|\widehat{\TopicSharesInDoc}; \TopicDist) p(\widehat{\TopicSharesInDoc}; \TopicDist, \alpha, \alpha_0) \sqrt{\frac{(2\pi)^{\NumTopics}}{n |\nabla \nabla f(\widehat{\TopicSharesInDoc})|}} \end{equation} Here the Hessian \( \nabla\nabla f^{\prime\prime} \) is basically the Fisher information of the conditional likelihood, Eq. (2), with respect to the within-document topic shares. On a log-likelihood scale, the whole factor in the square root will be \( O(\log{n}) \), and so the contribution to the log-likelihood per observation will be vanishing, \( O(\frac{\log{n}}{n}) \), compared to just using the maximand.

First draft, 28 November 2023. Corrected some obvious typos and worked in a paragraph from other notes explaining Laplace approximation, 5 March 2024.