Saturday, August 29, 2009

A step-by-step derivation of collapsed Gibbs sampling for LDA

How we can derive the collapsed Gibbs sampling procedure for LDA from scratch: a step-by-step illustration. This is based on Wang Yi's tech report.
  1. To fully describe an LDA model, we need to solve the following three problems:
     1. the latent topic assignment $z_i$ for each word $w_i$, where $i$ indexes a word token from the whole training set, i.e., $i \in \{1, \dots, N\}$ with $N$ the total number of word tokens;
     2. the parameters $\Theta = \{\vec\theta_m\}$, where $\vec\theta_m$ specifies the topic distribution for document $d = m$;
     3. the parameters $\Phi = \{\vec\varphi_k\}$, where $\vec\varphi_k$ specifies the word distribution for topic $z = k$.
  2. Given the training data set, we only need to infer the latent variables $\mathbf{z}$, because the two parameters $\Theta$ and $\Phi$ can be computed as statistics of the association between the observed words $\mathbf{w}$ and the corresponding topic assignments $\mathbf{z}$.
  3. The first problem (inferring the latent variables) cannot be solved deterministically due to the noise in the data, so a practical solution is to estimate the posterior $p(\mathbf{z} \mid \mathbf{w})$.
  4. Estimating $p(\mathbf{z} \mid \mathbf{w})$ directly is difficult due to the complex form of this distribution in LDA. Gibbs sampling solves the problem by approximating $p(\mathbf{z} \mid \mathbf{w})$ with samples drawn from it after the burn-in period, i.e., once the Markov chain has become stationary.
  5. To draw samples from $p(\mathbf{z} \mid \mathbf{w})$, we need not know the exact form of this distribution. All we need is the full conditional $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})$, where $\mathbf{z}_{\neg i}$ denotes all topic assignments except $z_i$. The remaining work is to derive this function.
  6. The conditional distribution can be derived from the joint distribution $p(\mathbf{z}, \mathbf{w})$:
$$
p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})
= \frac{p(\mathbf{z}, \mathbf{w})}{p(\mathbf{z}_{\neg i}, \mathbf{w})}
= \frac{p(\mathbf{z}, \mathbf{w})}{p(\mathbf{z}_{\neg i}, \mathbf{w}_{\neg i})\, p(w_i)} .
$$
The second equality comes from the fact that $w_i$ only depends on $z_i$.
  7. The denominator has a similar form to the numerator, so what we really need is to derive the form of the joint distribution $p(\mathbf{z}, \mathbf{w})$.
  8. $p(\mathbf{z}, \mathbf{w})$ can be derived by using the conditional independence property of LDA: $p(\mathbf{z}, \mathbf{w} \mid \vec\alpha, \vec\beta) = p(\mathbf{w} \mid \mathbf{z}, \vec\beta)\, p(\mathbf{z} \mid \vec\alpha)$.
  9. The two distributions in step 8 are both multinomial distributions with Dirichlet conjugate priors, so their derivations are similar. I summarize the key steps of the two derivations and compare them side by side to emphasize these similarities:
|  | $p(\mathbf{w} \mid \mathbf{z}, \vec\beta)$ | $p(\mathbf{z} \mid \vec\alpha)$ |
| --- | --- | --- |
| Multinomial parameter | $\Phi = \{\vec\varphi_k\}$, the per-topic word distributions | $\Theta = \{\vec\theta_m\}$, the per-document topic distributions |
| Dirichlet prior | $\vec\beta$ | $\vec\alpha$ |
| Integrate out the parameter | $p(\mathbf{w} \mid \mathbf{z}, \vec\beta) = \int p(\mathbf{w} \mid \mathbf{z}, \Phi)\, p(\Phi \mid \vec\beta)\, d\Phi$ | $p(\mathbf{z} \mid \vec\alpha) = \int p(\mathbf{z} \mid \Theta)\, p(\Theta \mid \vec\alpha)\, d\Theta$ |
| Result | $\prod_{k=1}^{K} \dfrac{\Delta(\vec n_k + \vec\beta)}{\Delta(\vec\beta)}$ | $\prod_{m=1}^{M} \dfrac{\Delta(\vec n_m + \vec\alpha)}{\Delta(\vec\alpha)}$ |

Here $\vec n_k = (n_k^{(t)})_{t=1}^{V}$ counts how often each term $t$ is assigned to topic $k$, $\vec n_m = (n_m^{(k)})_{k=1}^{K}$ counts how often each topic $k$ occurs in document $m$, and $\Delta(\vec x) = \prod_i \Gamma(x_i) / \Gamma(\sum_i x_i)$ is the normalization constant of the Dirichlet distribution.
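To see how the "Result" row follows from the "Integrate out the parameter" row, here is a sketch of the $p(\mathbf{w} \mid \mathbf{z}, \vec\beta)$ side; the $p(\mathbf{z} \mid \vec\alpha)$ side is identical with documents in place of topics. The notation ($n_k^{(t)}$, $\Delta(\cdot)$) follows the common Heinrich-style convention and may differ slightly from the original report:

$$
\begin{aligned}
p(\mathbf{w} \mid \mathbf{z}, \vec\beta)
  &= \int p(\mathbf{w} \mid \mathbf{z}, \Phi)\, p(\Phi \mid \vec\beta)\, d\Phi
   = \int \prod_{k=1}^{K} \frac{1}{\Delta(\vec\beta)}
     \prod_{t=1}^{V} \varphi_{k,t}^{\,n_k^{(t)} + \beta_t - 1}\, d\vec\varphi_k \\
  &= \prod_{k=1}^{K} \frac{\Delta(\vec n_k + \vec\beta)}{\Delta(\vec\beta)} .
\end{aligned}
$$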
  10. Substituting the results of step 9 into step 8 and then into step 6, we can express the conditional distribution as a function of the word-topic co-occurrence counts $n_k^{(t)}$, the topic-document co-occurrence counts $n_m^{(k)}$, and the hyperparameters $\vec\alpha$ and $\vec\beta$:
$$
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
\frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left( n_{k,\neg i}^{(t')} + \beta_{t'} \right)}
\left( n_{m,\neg i}^{(k)} + \alpha_k \right),
$$
where $t = w_i$, $m$ is the document containing word $i$, and the subscript $\neg i$ means the current assignment $z_i$ is excluded from the counts. We can therefore draw samples directly from this distribution.
  11. The procedure of Gibbs sampling for LDA learning can then be summarized in a figure from Wang Yi's tech report; a minimal code sketch of the same procedure is given after this list.
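Since the figure is not reproduced here, below is a minimal Python sketch of the loop it summarizes: random initialization of $\mathbf{z}$, then repeated sweeps that resample each $z_i$ from the full conditional of step 10. The function and variable names (`collapsed_gibbs_lda`, `n_kt`, `n_mk`) are my own, and symmetric scalar hyperparameters are assumed for brevity; this is an illustration, not the report's reference implementation.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=1000, seed=0):
    """Collapsed Gibbs sampler for LDA.

    docs : list of documents, each a list of term ids in 0..V-1
    V    : vocabulary size, K : number of topics
    Returns the final topic assignments and the count matrices.
    """
    rng = np.random.default_rng(seed)

    n_kt = np.zeros((K, V))           # word-topic counts  n_k^(t)
    n_mk = np.zeros((len(docs), K))   # document-topic counts  n_m^(k)
    n_k = np.zeros(K)                 # total words assigned to each topic
    z = []                            # topic assignment for every word token

    # Random initialization of the latent variables z.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.integers(K)
            z_m.append(k)
            n_kt[k, t] += 1
            n_mk[m, k] += 1
            n_k[k] += 1
        z.append(z_m)

    # Gibbs sweeps: resample each z_i from its full conditional (step 10).
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts (the "not i" counts).
                n_kt[k, t] -= 1; n_mk[m, k] -= 1; n_k[k] -= 1

                # p(z_i = k | z_-i, w)  proportional to  (n_kt + beta)/(n_k + V*beta) * (n_mk + alpha)
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())

                # Add the new assignment back into the counts.
                z[m][i] = k
                n_kt[k, t] += 1; n_mk[m, k] += 1; n_k[k] += 1

    return z, n_kt, n_mk
```

For instance, `collapsed_gibbs_lda([[0, 1, 2, 1], [2, 3, 3, 0]], V=4, K=2, n_iter=200)` runs the sampler on a toy corpus of two documents; in practice one would discard the burn-in sweeps before using the resulting counts.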

This sampling scheme integrates out the model parameters $\Theta$ and $\Phi$, and this strategy is called "collapsed" Gibbs sampling.
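Although $\Theta$ and $\Phi$ are integrated out during sampling, they can be recovered afterwards from the counts of a sample; a standard read-out, using the same notation as above, is

$$
\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'} \left( n_m^{(k')} + \alpha_{k'} \right)},
\qquad
\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'} \left( n_k^{(t')} + \beta_{t'} \right)} .
$$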
