Notebooks

Topic Models

05 Mar 2024 10:52

Spun off from Text Mining. What follows, between the horizontal rules, are some notes I prepared for students in the statistical-projects class in 2023, followed by my usual link-dump.


\[ \newcommand{\MixtureDist}{f_{\mathrm{mix}}} \newcommand{\DocumentIndex}{d} \newcommand{\TopicIndex}{t} \newcommand{\WordIndex}{w} \newcommand{\NumTopics}{k} \newcommand{\TopicDist}{\phi} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\TopicSharesInDoc}{\theta} \newcommand{\TopicSharesInCorpus}{\alpha} \]
Attention conservation notice: Contains lies told to children (mostly marked as such); you're probably better off reading Blei and Lafferty's review paper (link below).

1. Mixture Models

A basic mixture model is a way of defining a complicated distribution as a combination or mixture of simpler distributions. Each component or cluster of the mixture has its own distribution over stuff we can observe, say \(f_1(x), f_2(x), \ldots f_\NumTopics(x) \) if there are \(\NumTopics \) clusters. (Here \(f \) is a probability mass function if the observable \(x \) is discrete, but a probability density function if \(x \) is continuous.) There are also shares, weights or probabilities for the clusters, \(\alpha_1, \alpha_2, \ldots \alpha_\NumTopics \), with the obvious restrictions that \(\alpha_\TopicIndex \geq 0 \), \(\sum_{\TopicIndex=1}^{\NumTopics}{\alpha_\TopicIndex} = 1 \). These tell us how much of the over-all distribution comes from each cluster.

A mixture model then defines an over-all distribution for the random variable \(X \), as a weighted combination of the clusters' distributions for \(X \): \[\begin{equation} \MixtureDist(x) = \sum_{\TopicIndex=1}^{\NumTopics}{\alpha_\TopicIndex f_\TopicIndex(x)} ~ (1) \label{eqn:mixture-dist} \end{equation} \]

1.1 Simulating a mixture model

There are two steps to drawing from a mixture distribution, a.k.a. simulating a mixture model.

  1. Pick a random cluster \(C \), with the probability that \(C=c \) being given by \(\alpha_c \).
  2. Draw \(X \) from the corresponding cluster's distribution \(f_C \).

You can check that this procedure gives exactly the distribution for \(X \) from Eq. (1) above. (Hint: use the law of total probability.)
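
Here is a minimal sketch of that two-step recipe, assuming Python with numpy and a made-up mixture of three Gaussians (the weights, means, and standard deviations are arbitrary, purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up 3-cluster Gaussian mixture, for illustration only.
alpha = np.array([0.5, 0.3, 0.2])   # cluster shares: non-negative, sum to 1
means = np.array([-2.0, 0.0, 3.0])  # each cluster's mean
sds   = np.array([1.0, 0.5, 1.5])   # each cluster's standard deviation

def draw_from_mixture(n):
    # Step 1: pick a random cluster C for each draw, with P(C = c) = alpha_c.
    clusters = rng.choice(len(alpha), size=n, p=alpha)
    # Step 2: draw X from the chosen cluster's distribution f_C.
    return rng.normal(loc=means[clusters], scale=sds[clusters])

x = draw_from_mixture(10_000)   # a histogram of x approximates f_mix
```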

1.2 Probability of a new observation

If we're handed a new observation, say \(x_0 \), and want to know its probability according to a given mixture model, we just plug \(x_0 \) into the formula for \(\MixtureDist \) in Eq. (1).

If we want to know the log probability of \(x_0 \), that's \[\begin{equation} \log{\sum_{\TopicIndex=1}^{\NumTopics}{\alpha_\TopicIndex f_\TopicIndex(x_0)}} \end{equation} \] which unfortunately does not simplify (log of a sum isn't the sum of logs). If we were trying to estimate the mixture model by maximum likelihood, i.e., maximum log-likelihood, that not-simplifying would be a pain, and would lead to all sorts of computational tricks like the E-M algorithm. But if we've got an estimated mixture model, and we just want to see what probability it assigns to a new data point, there's no trickery needed.
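
A small sketch of that calculation, reusing the made-up Gaussian mixture from the previous sketch; scipy's logsumexp keeps the log-of-a-sum numerically stable when the component probabilities are tiny.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Same made-up mixture parameters as in the previous sketch.
alpha = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
sds   = np.array([1.0, 0.5, 1.5])

def log_mixture_density(x0):
    # log f_mix(x0) = log sum_t alpha_t f_t(x0), computed as a
    # logsumexp over log(alpha_t) + log f_t(x0).
    return logsumexp(np.log(alpha) + norm.logpdf(x0, loc=means, scale=sds))

print(log_mixture_density(1.3))   # log-probability-density of a new point
```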

1.3 Picking the number of mixture components

The following approach is simple and even "basic", but it works well.

  1. Randomly divide the data into two parts, a training and a testing set. A testing set of, say, 10% of the data will usually be fine.
  2. Fit mixture models with different numbers of clusters, from \(\NumTopics_{\min} \) up to \(\NumTopics_{\max} \), on the training set, and only on the training set.
    • \(\NumTopics_{\min} \) could be as small as 1, but you might feel that you know there have to be at least so many different types or clusters in your data set.
    • Similarly, \(\NumTopics_{\max} \) has to be less than the number of training points, but in practice you'll usually want to limit yourself to something much smaller, for purely computational reasons.
  3. For each value of \(\NumTopics \), calculate the log-probability of every point in your testing set.
  4. Pick the value of \(\NumTopics \) with the highest average log-probability on the testing set.

This is the value of \(\NumTopics \) which does the best job of generalizing to a new data point it didn't get to see during estimation.
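
A minimal sketch of this recipe, assuming Gaussian mixture components and scikit-learn, whose GaussianMixture.score() returns the average per-point log-likelihood; the data here are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Made-up data: two lumps in one dimension.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, size=(500, 1)),
                    rng.normal(3.0, 1.5, size=(300, 1))])

# 1. Random train/test split, holding out 10% for testing.
X_train, X_test = train_test_split(X, test_size=0.1, random_state=0)

# 2.-4. Fit on the training set only; score each k by the average
#       log-probability of the test points; keep the best k.
scores = {}
for k in range(1, 9):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    scores[k] = gm.score(X_test)   # mean log-likelihood per test point

best_k = max(scores, key=scores.get)
```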

2 Topic models

A topic model is a kind of mixture model for documents. Document \(\DocumentIndex \) consists of (say) \(n_\DocumentIndex \) words. We ignore the order of the words (for these purposes), and just care about how often each word appears[1]. We can take every word in the dictionary and count how many times word \(\WordIndex \) appears in document \(\DocumentIndex \), getting (say) \(x_{\DocumentIndex\WordIndex} \), with the constraints that \(x_{\DocumentIndex\WordIndex} \geq 0 \), \(\sum_{\WordIndex}{x_{\DocumentIndex\WordIndex}} = n_\DocumentIndex \). We can also think of \[\begin{equation} p_{\DocumentIndex\WordIndex} = \frac{x_{\DocumentIndex\WordIndex}}{n_\DocumentIndex} \end{equation} \]

as a distribution over words in document \(\DocumentIndex \) --- it's the probability that if we pick a word at random from document \(\DocumentIndex \), then the word will be \(\WordIndex \).[2] We'll write \(p_\DocumentIndex \) for the distribution of words within document \(\DocumentIndex \).
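
As a concrete (if crude) illustration of the counts \(x_{\DocumentIndex\WordIndex} \) and proportions \(p_{\DocumentIndex\WordIndex} \), here is a tiny Python sketch using the sentence from endnote 2; the whitespace tokenization is deliberately naive.

```python
from collections import Counter

doc = "The quick brown fox jumped over the lazy dog but the cow jumped over the moon"
tokens = doc.lower().split()                 # crude tokenization, for illustration
x_d = Counter(tokens)                        # x_{dw}: count of word w in document d
n_d = sum(x_d.values())                      # n_d: number of word tokens in d
p_d = {w: c / n_d for w, c in x_d.items()}   # p_{dw} = x_{dw} / n_d

print(x_d["the"], p_d["the"])                # 4 tokens of "the", probability 4/16
```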

A topic, in this way of thinking, is a distribution over words. If there are \(\NumTopics \) topics, there will be distributions \(\TopicDist_{1}, \TopicDist_{2}, \ldots \TopicDist_\NumTopics \), each one of which will be a distribution over all the words in the dictionary. Thus \(\TopicDist_{\TopicIndex \WordIndex} \) is the probability which topic \(\TopicIndex \) gives to word \(\WordIndex \): if a document was exclusively about topic \(\TopicIndex \), what's the probability that a randomly-chosen word would be \(\WordIndex \)? The key idea of the topic model is that every document's distribution over words should be built up from the topic distributions (plus a little noise). In symbols, the idea is that \[\begin{equation} p_\DocumentIndex \approx \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \TopicDist_\TopicIndex} \end{equation} \]

i.e., that for each word \(\WordIndex \), \[\begin{equation} p_{\DocumentIndex\WordIndex} \approx \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \TopicDist_{\TopicIndex\WordIndex}} ~ (2) \label{eqn:documents-shares-topics} \end{equation} \]

The quantity \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \) says what share of document \(\DocumentIndex \) comes from topic \(\TopicIndex \), so \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \geq 0 \), \(\sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\DocumentIndex\TopicIndex}} = 1 \).
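
To make Eq. (2) concrete, here is a minimal numeric sketch with a made-up topic matrix (two topics over a four-word dictionary) and made-up shares for one document; the right-hand side of Eq. (2) is just a matrix-vector product.

```python
import numpy as np

# Made-up model: k = 2 topics over a V = 4 word dictionary.
# Each row of phi is one topic's distribution over words (rows sum to 1).
phi = np.array([[0.5, 0.3, 0.1, 0.1],    # topic 1
                [0.1, 0.1, 0.4, 0.4]])   # topic 2

theta_d = np.array([0.7, 0.3])   # document d's topic shares: non-negative, sum to 1

# Eq. (2): the document's word distribution is a mixture of the topic rows.
p_d = theta_d @ phi
print(p_d, p_d.sum())   # sums to 1 because theta_d and each row of phi do
```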

2.1 Within-document topic shares

If we knew the topic distributions \(\TopicDist_\TopicIndex \), we could try to find the topic shares \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \) by, essentially, regression, starting with Eq. (2). That is, we'd take the proportions \(p_{\DocumentIndex\WordIndex} \) as the target or "dependent" variable and linearly regress them on \(\TopicDist_{\TopicIndex\WordIndex} \); the coefficients in this regression would be (estimates of) \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \). This isn't usually how it's actually done, for a variety of reasons[3], but it's not a bad "origin myth" to keep around for thinking about where the topic shares come from.

Notice that if we knew the topic shares \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \), we could take the same equation and linearly regress the \(p_{\DocumentIndex\WordIndex} \)'s on those \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \)'s. The coefficients in that regression would be (estimates of) the \(\TopicDist_{\TopicIndex\WordIndex} \)'s. So if we know either the topic shares or the topic distributions, we could use linear regression to estimate the other. This suggests an approach called alternating least squares where we start with a guess about either \(\TopicSharesInDoc \) or \(\TopicDist \), and switch back and forth between re-estimating \(\TopicSharesInDoc \) given our current estimate of \(\TopicDist \) and re-estimating \(\TopicDist \) given our current estimate of \(\TopicSharesInDoc \). This can work but, again, it's not the usual approach for reasons described in the footnote.
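
Here is a rough sketch of that alternating "origin myth", not how topic models are actually fit, assuming scipy is available; it uses non-negative least squares (scipy.optimize.nnls) plus renormalization as a cheap stand-in for the constrained, weighted regressions discussed in the footnote.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_theta(p_d, phi):
    # Regress the word proportions p_d (length V) on the topic rows of phi (k x V),
    # keeping coefficients non-negative, then renormalize them to sum to 1.
    coefs, _ = nnls(phi.T, p_d)
    return coefs / coefs.sum()

def estimate_phi(P, Theta):
    # Regress each word's column of P (D x V) on the shares Theta (D x k),
    # then renormalize each topic's row into a distribution over words.
    B = np.column_stack([nnls(Theta, P[:, w])[0] for w in range(P.shape[1])])
    return B / B.sum(axis=1, keepdims=True)

def alternating_least_squares(P, k, n_iter=50, seed=0):
    # Start from a random guess for phi and alternate the two regressions.
    # (Assumes every topic keeps some weight, so the renormalizations are safe.)
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.ones(P.shape[1]), size=k)   # k random word distributions
    for _ in range(n_iter):
        Theta = np.vstack([estimate_theta(p_d, phi) for p_d in P])
        phi = estimate_phi(P, Theta)
    return Theta, phi
```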

2.2 Over-all topic shares

Since each document \(\DocumentIndex \) in the corpus has a share for each topic, we could aggregate these to say what's the over-all share or weight \(\alpha_\TopicIndex \) of topic \(\TopicIndex \) in the topic model. The obvious way to do it would be to just declare \(\alpha_\TopicIndex \) the average of \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \) over documents. Again, in computational practice we usually try to maximize the likelihood[4], which doesn't quite amount to simple averaging.

2.3 Simulating, and the role of document-specific topic shares

I said above that when we try to simulate a basic mixture model, we randomly choose a cluster or mixture component \(C \), and then draw the observable \(X \) from that component's distribution \(f_C \). That works for drawing one word from a topic model: we choose a topic at random, and then choose a word from that topic's distribution. What do we do when we want to generate multiple words in the same document?

One possibility is that we keep drawing words from the same topic \(C \). If we did this, every simulated document would come from one and only one topic. This would have two undesirable consequences. First, when we fit topic models, it's very rare for a document to have a single topic, or even to give weight near 1 to a single topic; this approach is unrealistic. Second, when we look at long documents, by the law of large numbers, the words-in-document distribution \(p_\DocumentIndex \) will approach the words-in-topic distribution \(\TopicDist_C \). Since there are only \(\NumTopics \) topics, there should be only \(\NumTopics \) distinct words-in-document distributions for large documents. Again, this is unrealistic: even with book-length documents, every document in the corpus has a somewhat distinct distribution.

Another possibility is that rather than picking a single random topic for each document, we go back to the distribution over topics and pick a new topic for each word. This too has undesirable consequences. First, every document will have the same distribution over topics, namely the over-all topic shares (\(\pm \) sampling noise which will go to zero as the documents get longer, by the law of large numbers). Second, every document will have the same distribution over words, namely the over-all word distribution of the topic model. (Again, \(\pm \) shrinking sampling noise.)

The very ingenious solution of the inventors of "latent Dirichlet allocation" (Blei, Ng, and Jordan 2003) was instead as follows:

  1. For each document \(\DocumentIndex \), pick a random distribution over topics \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \), in such a way that \(\Expect{\TopicSharesInDoc_{\DocumentIndex\TopicIndex}} = \alpha_\TopicIndex \), i.e., on average within-document topic shares track corpus topic-shares. The specific trick used to do this is called the Dirichlet distribution, detailed in the next sub-subsection.
  2. For each word in document \(\DocumentIndex \), pick a random topic according to \(\TopicSharesInDoc_{\DocumentIndex\TopicIndex} \), and then pick the word from that topic's distribution.

This accomplishes several things:

  1. Each document gets to have its own distribution of words.
  2. Those word-in-document distributions are still mixtures of the basic topic distributions (\(\pm \) sampling noise that shrinks with the number of words in the document).
  3. If we generate many documents, i.e., if we simulate a whole corpus, the over-all distribution of words in the corpus will match the over-all distribution of the topic model (\(\pm \) sampling noise that shrinks with the number of documents).
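
Here is a minimal sketch of this two-stage generative process (steps 1 and 2 above), with a made-up topic matrix and made-up Dirichlet parameters; numpy's dirichlet and choice samplers do the two kinds of random draws.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up model: k = 2 topics over a V = 4 word dictionary.
phi = np.array([[0.5, 0.3, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])
alpha = np.array([2.0, 1.0])   # Dirichlet parameters; E[theta] = alpha / alpha.sum()

def simulate_document(n_words):
    # 1. Draw this document's own topic shares from the Dirichlet distribution.
    theta_d = rng.dirichlet(alpha)
    # 2. For each word, pick a topic from theta_d, then a word from that topic.
    topics = rng.choice(len(alpha), size=n_words, p=theta_d)
    words = [rng.choice(phi.shape[1], p=phi[t]) for t in topics]
    return theta_d, np.bincount(words, minlength=phi.shape[1])   # bag-of-words counts

theta_d, x_d = simulate_document(200)
```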

2.3.1 The Dirichlet distribution

(This sub-sub-section is a little deeper into the weeds than the rest of these notes.)

I said above that the key trick is to pick a random distribution \(B \) over topics in a way which ensures that \(\Expect{B_\TopicIndex} = \alpha_\TopicIndex \). There are many ways one could do this, but the way which Blei, Ng, and Jordan (2003) chose was based on what's called the Dirichlet distribution (hence "latent Dirichlet allocation" or LDA). A Dirichlet distribution is a distribution over sets of \(\NumTopics \) real numbers, say \(B_1, B_2, \ldots B_\NumTopics \), with the restrictions that \(B_\TopicIndex \geq 0 \), \(\sum_{\TopicIndex=1}^{\NumTopics}{B_\TopicIndex} = 1 \). That is, the numbers \(B_\TopicIndex \) have to be (potentially) valid probabilities over a size-\(\NumTopics \) set[5]. There are \(\NumTopics \) parameters, conventionally \(\alpha_1, \alpha_2, \ldots \alpha_\NumTopics \), which are required to be \(> 0 \) but are not required to add up to one. The pdf at the point \(b_1, b_2, \ldots b_\NumTopics \) is \[\begin{equation} p(b_1, \ldots b_\NumTopics; \alpha_1, \ldots \alpha_\NumTopics) = \frac{1}{Z(\alpha)} \prod_{\TopicIndex=1}^{\NumTopics}{b_\TopicIndex^{\alpha_\TopicIndex - 1}} \end{equation} \] (The constant of proportionality \(Z(\alpha) \) can be worked out and involves a mess of gamma functions.) People often refer to \(\sum_{\TopicIndex=1}^{\NumTopics}{\alpha_\TopicIndex} \) as \(\alpha_0 \), but that's an abbreviation rather than an additional parameter.

By playing around with plots of this, when \(\NumTopics=2 \) or \(\NumTopics=3 \), you can convince yourself of a couple of things:

  • When all the \(\alpha_\TopicIndex = 1 \), the density is flat over the whole set of valid probability vectors.
  • Making the \(\alpha_\TopicIndex \) larger concentrates the density around a central point, while making them smaller (below 1) pushes it out towards the corners and edges, i.e., towards distributions dominated by one or a few topics.

It's a bit more work --- you do need to get the constant of proportionality \(Z(\alpha) \) --- to get the expected values, but when you do, it's \(\Expect{B_\TopicIndex} = \frac{\alpha_\TopicIndex}{\alpha_0} \equiv \pi_\TopicIndex \). If we keep these ratios \(\frac{\alpha_\TopicIndex}{\alpha_0} \) constant, but scale \(\alpha_0 \) up or down, the variances \(\Var{B_\TopicIndex} \) decrease as \(\alpha_0 \) increases[6].
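
A quick simulation check of both facts, assuming numpy's Dirichlet sampler; the expected values \(\pi \) and the concentration levels \(\alpha_0 \) below are made up. The sample means stay near \(\pi \) at every concentration level, while the sample variances shrink as \(\alpha_0 \) grows.

```python
import numpy as np

rng = np.random.default_rng(2)
pi = np.array([0.5, 0.3, 0.2])   # target expected values pi_t

for alpha0 in (1.0, 10.0, 100.0):
    draws = rng.dirichlet(alpha0 * pi, size=100_000)   # alpha_t = alpha0 * pi_t
    print(alpha0, draws.mean(axis=0).round(3), draws.var(axis=0).round(4))
```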

2.4 Probability of a new document; dependence between words within a document

Suppose we have a new document, and we want to know what probability it'd get under an existing, fitted topic model. That is, we have topic shares \(\alpha_1, \ldots \alpha_\NumTopics \), and for each topic \(\TopicIndex \) we have a distribution of words \(\TopicDist_\TopicIndex \), and now we want to know the probability of a certain bag of words, with counts \(x_\WordIndex \) for each word \(\WordIndex \). How do we do this?

If we knew the shares of each topic in this document, \(\TopicSharesInDoc_\TopicIndex \) for topic \(\TopicIndex \), then all the words would be independent of each other. Applying Eq. (1), the probability that any random word equals \(\WordIndex \) would be \[\begin{equation} p_\WordIndex(\TopicSharesInDoc; \TopicDist, \alpha) = \sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_\TopicIndex \TopicDist_{\TopicIndex \WordIndex}} ~ (3) \label{eqn:prob-word-given-doc-shares} \end{equation} \] The probability of the document would just be a multinomial, \[\begin{equation} p(x|\TopicSharesInDoc; \TopicDist, \alpha) = \frac{n!}{\prod_{\WordIndex}{x_{\WordIndex}!}} \prod_{\WordIndex}{p_\WordIndex(\TopicSharesInDoc)^{x_\WordIndex}} ~ (4) \end{equation} \] with the products running over all words \(\WordIndex \) in the dictionary. (Notice that the big combinatorial factor out front doesn't involve any of the model parameters, so it won't actually matter if we're trying to compare different models.)

However, we don't know \(\TopicSharesInDoc \) for this new document, and it's part of the model that the \(\TopicSharesInDoc \) is actually randomly generated for each new document. So the over-all probability of the document should be given by summing over all possible \(\TopicSharesInDoc \)'s (that's the law of total probability): \[\begin{equation} p(x; \TopicDist, \alpha) = \int{p(x|\TopicSharesInDoc; \TopicDist, \alpha) p(\TopicSharesInDoc; \alpha) d\TopicSharesInDoc} ~ (5) \label{eqn:total-prob-for-topic-model} \end{equation} \] This looks like a mess --- try writing it out more explicitly by substituting in from Eqs. (3) and (4). In fact it is a mess, and computing it is intractable (Blei, Ng, and Jordan 2003, sec. 5.1), so software packages use computationally-tractable approximations.

One quick-and-dirty approach is as follows: Think of Eq. (5) as an integral of the form \(\int{e^{nf(b)} p(b) db} \). This is a reasonable thing to do, because \(p(x|\TopicSharesInDoc; \TopicDist, \alpha) \) is going to arise from multiplying together \(n \) different terms, and \(e^{n f(b)} = {(e^{f(b)})}^n \), so \(f(b) \) is the average log-probability per word. Written in this way, it becomes plausible that the integral will be dominated by the contribution coming from the value \(b^* \) which maximizes \(f(b) \). If we consider an alternative value of \(b = b^* + \epsilon \), then \(f(b^* + \epsilon) < f(b^*) \), because \(b^* \) is the maximum, so \(e^{nf(b^*+\epsilon)} \) will be exponentially smaller than \(e^{nf(b^*)} \). Of course the "prior" factor \(p(b) \) matters too, but that is constant order in \(n \), while the exponential factor is becoming more and more important as \(n \) grows. So one crude approach is simply to find the \(\TopicSharesInDoc \) which maximizes Eq. (4), and use that value. A slightly more refined approach would try to include contributions from values of \(\TopicSharesInDoc \) close to the best-fitting one --- this is called Laplace approximation (Tierney, Kass, and Kadane 1989), and I have elaborated on this in a separate notebook.
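
Here is a rough sketch of that crude plug-in idea, under stated assumptions: maximize the log of Eq. (4) over \(\TopicSharesInDoc \) (dropping the combinatorial factor, which doesn't involve \(\TopicSharesInDoc \)), using a softmax reparameterization to stay on the simplex and a generic optimizer. The function name is made up, and this is not what LDA packages actually do.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def plugin_log_prob(x_d, phi):
    # x_d: length-V vector of word counts for the new document.
    # phi: k x V matrix of topic distributions over words.
    def neg_log_lik(z):
        theta = softmax(z)                         # maps R^k onto the simplex
        p_w = theta @ phi                          # Eq. (3): per-word probabilities
        return -np.sum(x_d * np.log(p_w + 1e-12))  # minus log of Eq. (4), sans constant
    k = phi.shape[0]
    res = minimize(neg_log_lik, np.zeros(k), method="Nelder-Mead")
    return -res.fun, softmax(res.x)                # plug-in log-prob, best-fitting theta
```

For instance, feeding this the simulated x_d and made-up phi from the earlier sketch returns an estimate of that document's topic shares along with the plug-in log-probability.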

2.4.1 Dependence between words

As I said, in the topic model, all the words in document \(\DocumentIndex \) are assumed to be statistically independent, conditional on the document's topic-shares \(\TopicSharesInDoc_\DocumentIndex \). However, unconditionally, the words are statistically dependent. To see this intuitively, imagine that one of the words in the document is "anteater". Seeing this should increase our probability that we're looking at topics which have to do with animals, which in turn increases the probability of seeing words like "jaguar" and "sloth". If, on the other hand, we see "Anabaptist", this should increase our probability that we're reading about Christian denominations, increasing the probability of words like "Amish" (who are a kind of Anabaptist). In symbols, \[\begin{eqnarray} p(x_{\WordIndex^{\prime}}|x_\WordIndex) & = & \int{p(x_{\WordIndex^{\prime}}|\TopicSharesInDoc, x_{\WordIndex}) p(\TopicSharesInDoc|x_{\WordIndex}) d\TopicSharesInDoc}\\ & = & \int{p(x_{\WordIndex^{\prime}}|\TopicSharesInDoc) p(\TopicSharesInDoc|x_{\WordIndex}) d\TopicSharesInDoc} \end{eqnarray} \] using the fact that the words are conditionally independent given the topic shares for each document. The first factor in the integrand says "what kind of words can we expect, given the topic shares?", but the second factor is "what kind of mix of topics are we looking at, given the (other) words?" Since words are informative about topics, there'll be a dependence between words.
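
A quick simulation illustrating this, reusing the made-up two-topic model from the earlier sketches: given a document's topic shares, the word counts are a single multinomial draw (words independent), but across documents the counts of words that load on the same topic are positively correlated, and counts of words from different topics are negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.array([[0.5, 0.3, 0.1, 0.1],    # words 0 and 1 are favored by topic 1
                [0.1, 0.1, 0.4, 0.4]])   # words 2 and 3 are favored by topic 2
alpha = np.array([1.0, 1.0])

def simulate_counts(n_docs, n_words):
    counts = np.zeros((n_docs, phi.shape[1]), dtype=int)
    for d in range(n_docs):
        theta_d = rng.dirichlet(alpha)
        # Given theta_d, the words are independent draws from Eq. (3);
        # renormalize to guard against floating-point round-off.
        p = theta_d @ phi
        counts[d] = rng.multinomial(n_words, p / p.sum())
    return counts

X = simulate_counts(2_000, 100)
# Unconditionally, words that share a topic co-occur: counts of words 0 and 1
# are positively correlated across documents, words 0 and 2 negatively.
print(np.corrcoef(X, rowvar=False).round(2))
```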

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3: 993--1022. http://jmlr.csail.mit.edu/papers/v3/blei03a.html.

Tierney, Luke, Robert E. Kass, and Joseph B. Kadane. 1989. "Fully Exponential Laplace Approximations to Expectations and Variances of Nonpositive Functions." Journal of the American Statistical Association 84: 710--716. http://www.stat.cmu.edu/~kass/papers/nonlinearFunctions.pdf.

Endnotes

  1. That is, we use a bag-of-words representation of the document. (Imagine printing out the document on paper, cutting each word out with scissors, and putting the pieces in a bag.)

  2. Pedantically: The probability that if we pick a word token at random, the word type will be \(\WordIndex \). (In the sentence "The quick brown fox jumped over the lazy dog, but the cow jumped over the moon", there are four tokens of the word type "the", two tokens each of "jumped" and "over", and one token of every other word type in the sentence.)

  3. In the first place, we know that all the \(\TopicSharesInDoc \)'s are non-negative, and \(\sum_{\TopicIndex=1}^{\NumTopics}{\TopicSharesInDoc_{\DocumentIndex\TopicIndex}} = 1 \) for each \(\DocumentIndex \). Just doing linear regression won't necessarily respect these constraints. In the second place, ordinary least squares is inefficient unless the variance is constant. But the variance of \(p_{\DocumentIndex\WordIndex} \) isn't constant across words \(\WordIndex \). Instead, higher-probability words will have higher variance (until the probability crosses \(1/2 \) --- why?). We could fix this by doing weighted least squares and imposing constraints on our estimates, but at that point we're doing enough computational work that we might as well just calculate the log-likelihood and maximize it.

  4. More exactly, most computational implementations are "Bayesian": they start with a prior distribution over all the parameters and then use Bayes's rule and the likelihood to update it (if \(\theta \) stands for the vector of all parameters and \(x \) for the complete data, \(p(\theta|x) = \frac{p(x|\theta) p(\theta)}{\int{p(x|\theta^{\prime}) p(\theta^{\prime}) d\theta^{\prime}}} \)). The first factor in the numerator is the likelihood, and the second is the prior. (The denominator is what makes the calculation hard.) Programs return summary statistics about this "posterior" distribution, or samples from it, or a point estimate which is the "maximum a posteriori" (MAP) estimate, i.e., the \(\theta \) which maximizes the product \(p(x|\theta) p(\theta) \). The MAP estimate is equivalent to maximizing \(\log{p(x|\theta)} + \log{p(\theta)} \), so it's a kind of penalized or regularized maximum likelihood.

  5. In geometry, a set like this is called the \((\NumTopics-1) \)-dimensional simplex (plural, simplices). So the 1-dimensional simplex is just the unit interval \([0,1] \), the 2-dimensional simplex is the right triangle with corners at \((0,0) \), \((1,0) \) and \((0,1) \), etc.

  6. In the main text, I followed most authorities by defining the concentration level \(\alpha_0 \) and the expected values \(\pi_\TopicIndex \) in terms of the \(\alpha_\TopicIndex \). But we could equally well take the expected values \(\pi_\TopicIndex \) and the concentration level \(\alpha_0 \) as basic, and define \(\alpha_\TopicIndex = \alpha_0 \pi_\TopicIndex \). This would be a different coordinate system or "parameterization" for the Dirichlet distributions, but it would convey exactly the same information. Something is funny here, however, because there are \(\NumTopics \) parameters \(\alpha_1, \ldots \alpha_\NumTopics \), but \(\NumTopics+1 \) parameters \(\alpha_0, \pi_1, \ldots \pi_\NumTopics \). Why don't the two parameter counts match?



First draft, 28 November 2023. Fixed a few typos, and moved some material about Laplace approximation into the separate notebook on A very naive approach to held-out likelihood for topic models, 5 March 2024.

