Saturday, August 29, 2009

A step-by-step derivation of collapsed Gibbs sampling for LDA

How we can derive the collapsed Gibbs sampling procedure for LDA from scratch: a step-by-step illustration. This is based on Wang Yi's tech report.
  1. To fully describe an LDA model, we need to solve the following three problems:
     1. the latent topic assignment $z_i$ for each word $w_i$, where $i$ indexes a word token from the whole training set, i.e., $i \in \{1, \dots, N\}$ with $N$ the total number of word tokens;
     2. the parameters $\Theta = \{\vec\theta_m\}$, where $\vec\theta_m$ specifies the topic distribution for document $d = m$;
     3. the parameters $\Phi = \{\vec\varphi_k\}$, where $\vec\varphi_k$ specifies the word distribution for topic $z = k$.
  2. Given the training data set, we only need to infer the latent variables $\mathbf{z}$, because the two parameters $\Theta$ and $\Phi$ can be computed as statistics of the association between the observed words $\mathbf{w}$ and the corresponding topic assignments $\mathbf{z}$.
  3. The first problem (inferring the latent variables) cannot be solved deterministically due to the noise in the data, so a practical solution is to estimate the posterior $p(\mathbf{z} \mid \mathbf{w})$.
  4. Estimating $p(\mathbf{z} \mid \mathbf{w})$ directly is difficult due to the complex form of this distribution in LDA. Gibbs sampling solves the problem by approximating $p(\mathbf{z} \mid \mathbf{w})$ with samples drawn from it after the burn-in period, i.e., once the Markov chain has become stationary.
  5. To draw samples from $p(\mathbf{z} \mid \mathbf{w})$, we need not know the exact form of this distribution. All we need is the full conditional $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})$, where $\mathbf{z}_{\neg i}$ denotes all topic assignments except $z_i$. The remaining work is to derive this function.
  6. The conditional distribution can be derived from the joint distribution $p(\mathbf{z}, \mathbf{w})$:
$$
p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})
= \frac{p(\mathbf{z}, \mathbf{w})}{p(\mathbf{z}_{\neg i}, \mathbf{w})}
= \frac{p(\mathbf{z}, \mathbf{w})}{p(\mathbf{z}_{\neg i}, \mathbf{w}_{\neg i})\, p(w_i)} .
$$
The second equality comes from the fact that $w_i$ only depends on $z_i$.
  7. The denominator has a similar form to the numerator, so what we really need is to derive the form of the joint distribution $p(\mathbf{z}, \mathbf{w})$.
  8. $p(\mathbf{z}, \mathbf{w})$ can be derived by using the conditional independence property of LDA: $p(\mathbf{z}, \mathbf{w} \mid \vec\alpha, \vec\beta) = p(\mathbf{w} \mid \mathbf{z}, \vec\beta)\, p(\mathbf{z} \mid \vec\alpha)$.
  9. The two distributions in step 8 are both multinomial distributions with Dirichlet conjugate priors, so their derivations are similar. I summarize the key steps of the two derivations and compare them side by side to emphasize these similarities:
|  | $p(\mathbf{w} \mid \mathbf{z}, \vec\beta)$ | $p(\mathbf{z} \mid \vec\alpha)$ |
| --- | --- | --- |
| Multinomial parameter | $\Phi = \{\vec\varphi_k\}$, the per-topic word distributions | $\Theta = \{\vec\theta_m\}$, the per-document topic distributions |
| Dirichlet prior | $\vec\beta$ | $\vec\alpha$ |
| Integrate out the parameter | $p(\mathbf{w} \mid \mathbf{z}, \vec\beta) = \int p(\mathbf{w} \mid \mathbf{z}, \Phi)\, p(\Phi \mid \vec\beta)\, d\Phi$ | $p(\mathbf{z} \mid \vec\alpha) = \int p(\mathbf{z} \mid \Theta)\, p(\Theta \mid \vec\alpha)\, d\Theta$ |
| Result | $\prod_{k=1}^{K} \dfrac{\Delta(\vec n_k + \vec\beta)}{\Delta(\vec\beta)}$ | $\prod_{m=1}^{M} \dfrac{\Delta(\vec n_m + \vec\alpha)}{\Delta(\vec\alpha)}$ |

Here $\vec n_k = (n_k^{(t)})_{t=1}^{V}$ counts how often each term $t$ is assigned to topic $k$, $\vec n_m = (n_m^{(k)})_{k=1}^{K}$ counts how often each topic $k$ occurs in document $m$, and $\Delta(\vec x) = \prod_i \Gamma(x_i) / \Gamma(\sum_i x_i)$ is the normalization constant of the Dirichlet distribution.
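To see how the "Result" row follows from the "Integrate out the parameter" row, here is a sketch of the $p(\mathbf{w} \mid \mathbf{z}, \vec\beta)$ side; the $p(\mathbf{z} \mid \vec\alpha)$ side is identical with documents in place of topics. The notation ($n_k^{(t)}$, $\Delta(\cdot)$) follows the common Heinrich-style convention and may differ slightly from the original report:

$$
\begin{aligned}
p(\mathbf{w} \mid \mathbf{z}, \vec\beta)
  &= \int p(\mathbf{w} \mid \mathbf{z}, \Phi)\, p(\Phi \mid \vec\beta)\, d\Phi
   = \int \prod_{k=1}^{K} \frac{1}{\Delta(\vec\beta)}
     \prod_{t=1}^{V} \varphi_{k,t}^{\,n_k^{(t)} + \beta_t - 1}\, d\vec\varphi_k \\
  &= \prod_{k=1}^{K} \frac{\Delta(\vec n_k + \vec\beta)}{\Delta(\vec\beta)} .
\end{aligned}
$$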
  10. Substituting the results of step 9 into step 8 and then into step 6, we can express the conditional distribution as a function of the word-topic co-occurrence counts $n_k^{(t)}$, the topic-document co-occurrence counts $n_m^{(k)}$, and the hyperparameters $\vec\alpha$ and $\vec\beta$:
$$
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
\frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left( n_{k,\neg i}^{(t')} + \beta_{t'} \right)}
\left( n_{m,\neg i}^{(k)} + \alpha_k \right),
$$
where $t = w_i$, $m$ is the document containing word $i$, and the subscript $\neg i$ means the current assignment $z_i$ is excluded from the counts. We can therefore draw samples directly from this distribution.
  11. The procedure of Gibbs sampling for LDA learning can then be summarized in a figure from Wang Yi's tech report; a minimal code sketch of the same procedure is given after this list.
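Since the figure is not reproduced here, below is a minimal Python sketch of the loop it summarizes: random initialization of $\mathbf{z}$, then repeated sweeps that resample each $z_i$ from the full conditional of step 10. The function and variable names (`collapsed_gibbs_lda`, `n_kt`, `n_mk`) are my own, and symmetric scalar hyperparameters are assumed for brevity; this is an illustration, not the report's reference implementation.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=1000, seed=0):
    """Collapsed Gibbs sampler for LDA.

    docs : list of documents, each a list of term ids in 0..V-1
    V    : vocabulary size, K : number of topics
    Returns the final topic assignments and the count matrices.
    """
    rng = np.random.default_rng(seed)

    n_kt = np.zeros((K, V))           # word-topic counts  n_k^(t)
    n_mk = np.zeros((len(docs), K))   # document-topic counts  n_m^(k)
    n_k = np.zeros(K)                 # total words assigned to each topic
    z = []                            # topic assignment for every word token

    # Random initialization of the latent variables z.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.integers(K)
            z_m.append(k)
            n_kt[k, t] += 1
            n_mk[m, k] += 1
            n_k[k] += 1
        z.append(z_m)

    # Gibbs sweeps: resample each z_i from its full conditional (step 10).
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts (the "not i" counts).
                n_kt[k, t] -= 1; n_mk[m, k] -= 1; n_k[k] -= 1

                # p(z_i = k | z_-i, w)  proportional to  (n_kt + beta)/(n_k + V*beta) * (n_mk + alpha)
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())

                # Add the new assignment back into the counts.
                z[m][i] = k
                n_kt[k, t] += 1; n_mk[m, k] += 1; n_k[k] += 1

    return z, n_kt, n_mk
```

For instance, `collapsed_gibbs_lda([[0, 1, 2, 1], [2, 3, 3, 0]], V=4, K=2, n_iter=200)` runs the sampler on a toy corpus of two documents; in practice one would discard the burn-in sweeps before using the resulting counts.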

This sampling scheme integrates out the model parameters $\Theta$ and $\Phi$, and this strategy is called "collapsed" Gibbs sampling.
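Although $\Theta$ and $\Phi$ are integrated out during sampling, they can be recovered afterwards from the counts of a sample; a standard read-out, using the same notation as above, is

$$
\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'} \left( n_m^{(k')} + \alpha_{k'} \right)},
\qquad
\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'} \left( n_k^{(t')} + \beta_{t'} \right)} .
$$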
