Now, after implementing Gibbs sampling for LDA learning, I can understand the usefulness of such a generative process. The generative process in the LDA model actually defines a joint distribution over all the observed words $\mathbf{w}$ and all the unobserved topic indicators $\mathbf{z}$,

$$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha).$$
Generally, for a model with observed variables $\mathbf{x}$ and hidden variables $\mathbf{z}$, the idea of Gibbs sampling is to sample one hidden variable at a time, conditioning on all the other hidden variables and the observed variables, i.e., to draw samples from $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{x})$. This sampling step is repeated for all $i$, either in a fixed order or in a random order. To derive this conditional distribution, we usually need to start from the full joint distribution $p(\mathbf{x}, \mathbf{z})$. The generative process of a topic model then justifies its usefulness in the Gibbs sampling procedure, because it specifies exactly this joint distribution.
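As a minimal sketch of this general scheme (not taken from the original implementation), one sweep of a Gibbs sampler could look like the following, assuming a hypothetical helper `conditional(i, z, x)` that returns the probability vector $p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{x})$ over the possible values $k$:

```python
import numpy as np

def gibbs_sweep(z, x, conditional, rng, random_order=False):
    """One Gibbs sweep: resample every hidden variable z[i] from its full
    conditional p(z_i | z_{-i}, x), in a fixed or random order."""
    order = np.arange(len(z))
    if random_order:
        rng.shuffle(order)
    for i in order:
        probs = conditional(i, z, x)            # p(z_i = k | z_{-i}, x) for each k
        z[i] = rng.choice(len(probs), p=probs)  # draw the new value of z_i
    return z
```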
In LDA there is one topic indicator $z_i$ for each word $w_i$, where $i$ indexes a word from the whole training set, i.e., $i \in \{1, 2, \ldots, N\}$. Each indicator is drawn as $z_i \sim \mathrm{Multinomial}(\theta_{d_i})$, where $\theta_d$ specifies the topic distribution for document $d$, and the word is then drawn as $w_i \sim \mathrm{Multinomial}(\phi_{z_i})$, where $\phi_k$ specifies the word distribution for topic $k$.
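To make these definitions concrete, here is a small sketch of the generative process under the usual Dirichlet priors $\theta_d \sim \mathrm{Dirichlet}(\alpha)$ and $\phi_k \sim \mathrm{Dirichlet}(\beta)$; the flat (document id, word id, topic id) layout of the corpus is my own choice for illustration, not something fixed by the model:

```python
import numpy as np

def generate_corpus(D, K, V, N_d, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process:
    D documents, K topics, V vocabulary words, N_d words per document."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K, alpha), size=D)  # theta[d]: topic distribution of document d
    phi = rng.dirichlet(np.full(V, beta), size=K)     # phi[k]: word distribution of topic k
    docs, words, topics = [], [], []
    for d in range(D):
        for _ in range(N_d):
            z = rng.choice(K, p=theta[d])             # topic indicator z_i
            w = rng.choice(V, p=phi[z])               # observed word w_i
            docs.append(d)
            words.append(w)
            topics.append(z)
    return np.array(docs), np.array(words), np.array(topics)
```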
Running Gibbs sampling over the topic indicators produces samples of $\mathbf{z}$ that can be treated as draws from the posterior $p(\mathbf{z} \mid \mathbf{w})$ after the burn-in period, i.e., when the Markov chain is stationary.
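Continuing the hypothetical `gibbs_sweep` helper sketched above (the burn-in length and sample count here are arbitrary placeholders, not values from the original experiments), collecting samples might look like:

```python
# Discard the burn-in sweeps, then keep each subsequent sample of z.
n_burn_in, n_samples = 500, 100
samples = []
for sweep in range(n_burn_in + n_samples):
    z = gibbs_sweep(z, x, conditional, rng)
    if sweep >= n_burn_in:
        samples.append(z.copy())
```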
Each of those samples is produced by sweeping over all word positions and drawing every $z_i$ from the conditional distribution $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})$. The only thing left to do is to derive this function.
For LDA this conditional can be obtained, with $\theta$ and $\phi$ integrated out, from the joint distribution over words and topic indicators:

$$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) = \frac{p(\mathbf{w}, \mathbf{z})}{p(\mathbf{w}, \mathbf{z}_{\neg i})} = \frac{p(\mathbf{w} \mid \mathbf{z})\, p(\mathbf{z})}{p(w_i \mid \mathbf{w}_{\neg i}, \mathbf{z}_{\neg i})\, p(\mathbf{w}_{\neg i} \mid \mathbf{z}_{\neg i})\, p(\mathbf{z}_{\neg i})} \propto \frac{p(\mathbf{w} \mid \mathbf{z})}{p(\mathbf{w}_{\neg i} \mid \mathbf{z}_{\neg i})} \cdot \frac{p(\mathbf{z})}{p(\mathbf{z}_{\neg i})},$$

where $\mathbf{z}_{\neg i}$ and $\mathbf{w}_{\neg i}$ denote all topic indicators and words except the $i$-th. The second equality comes from the fact that $p(\mathbf{w}, \mathbf{z}_{\neg i}) = p(w_i \mid \mathbf{w}_{\neg i}, \mathbf{z}_{\neg i})\, p(\mathbf{w}_{\neg i} \mid \mathbf{z}_{\neg i})\, p(\mathbf{z}_{\neg i})$, and the proportionality holds because $p(w_i \mid \mathbf{w}_{\neg i}, \mathbf{z}_{\neg i})$ does not depend on $z_i$. Both remaining ratios are ratios of Dirichlet-multinomial integrals, and evaluating them gives the familiar sampling formula

$$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{k,\neg i}^{(w_i)} + \beta}{n_{k,\neg i}^{(\cdot)} + V\beta} \cdot \bigl(n_{d_i,\neg i}^{(k)} + \alpha\bigr),$$

where $V$ is the vocabulary size and the subscript $\neg i$ means the counts exclude the current position $i$. Hence the conditional depends only on the co-occurrence of word and topic, $n_{k}^{(w)}$, the co-occurrence of topic and document, $n_{d}^{(k)}$, and the hyperparameters, and thus we can draw the samples therefrom.
Because $\theta$ and $\phi$ are integrated out rather than sampled, this strategy is called "collapsed" Gibbs sampling.
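As a sketch of how this formula could turn into code (the count arrays `n_kw`, `n_dk`, `n_k` and their names are my own bookkeeping choices, assumed to be kept consistent with the current assignment `z`), one collapsed update for the token at position `i` could be:

```python
import numpy as np

def resample_token(i, z, docs, words, n_kw, n_dk, n_k, alpha, beta, rng):
    """Collapsed Gibbs update for one token: remove its current assignment
    from the counts, form p(z_i = k | z_{-i}, w) from the counts and the
    hyperparameters, draw a new topic, and add the token back."""
    d, w, k_old = docs[i], words[i], z[i]
    V = n_kw.shape[1]                       # vocabulary size
    # Exclude position i so the counts become the "not i" statistics.
    n_kw[k_old, w] -= 1
    n_dk[d, k_old] -= 1
    n_k[k_old] -= 1
    # p(z_i = k | z_{-i}, w) is proportional to
    # (n_kw + beta) / (n_k + V * beta) * (n_dk + alpha), evaluated per topic k.
    probs = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
    probs /= probs.sum()
    k_new = rng.choice(len(probs), p=probs)
    # Restore the counts with the newly sampled topic.
    n_kw[k_new, w] += 1
    n_dk[d, k_new] += 1
    n_k[k_new] += 1
    z[i] = k_new
```

Sweeping this update over all word positions $i$, and collecting $\mathbf{z}$ after burn-in, is then one way to realize the collapsed sampler described above.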