Monday, August 31, 2009

Generative process and Gibbs sampling

I had been confused by the generative process of generative models/topic models for years. Almost every paper on topic models includes a description of how to generate the words from the model. I used to wonder why we need such a generative process: don't the words already exist in the documents? Why do we need to "generate" them?

Now, after implementing Gibbs sampling for LDA learning, I can understand the usefulness of such a generative process. The generative process of the LDA model actually defines a joint distribution over all the observed words $\mathbf{w}$ and all the unobserved topic indicators $\mathbf{z}$,

$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta),$

and from this joint distribution we can derive the conditional distribution

$p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w}),$

which defines the Gibbs updating rule for LDA.
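Written out, the relation between the two is as follows (a sketch in the notation I use in the derivation post below; $\alpha$ and $\beta$ are the Dirichlet hyperparameters):

% the generative process factorizes the joint distribution of words and
% topic assignments; the Gibbs updating rule is its full conditional
\begin{align*}
  p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
    &= p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha) \\
  p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})
    &= \frac{p(\mathbf{w}, \mathbf{z})}{p(\mathbf{w}, \mathbf{z}_{\neg i})}
     \;\propto\; p(\mathbf{w}, \mathbf{z})
\end{align*}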

Generally, for a model with observed variables $\mathbf{x}$ and hidden variables $\mathbf{z}$, the idea of Gibbs sampling is to sample one of the hidden variables conditioned on all the other hidden and observed variables, i.e., to draw samples from $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{x})$. This sampling step is repeated for all $i$, either in a fixed order or in a random order. To derive this conditional distribution, we usually need to start from the full joint distribution:

$p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{x}) = \dfrac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x}, \mathbf{z}_{\neg i})} \propto p(\mathbf{x}, \mathbf{z}).$

The generative process of a topic model is exactly what specifies this joint distribution, which justifies its usefulness in the Gibbs sampling procedure. A toy example of this generic scheme is sketched below.
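To illustrate the generic scheme (this toy example is mine, unrelated to the LDA code below): Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho, where each full conditional is a univariate Gaussian. A minimal Matlab sketch:

% Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian with
% correlation rho. The full conditionals are
%   p(z1 | z2) = N(rho*z2, 1-rho^2),  p(z2 | z1) = N(rho*z1, 1-rho^2).
rho = 0.8;               % correlation between the two hidden variables
T   = 5000;              % number of Gibbs sweeps
z   = zeros(T, 2);       % z(t, :) stores the sample after sweep t
for t = 2:T
    % sample z1 conditioned on the current value of z2
    z(t, 1) = rho * z(t-1, 2) + sqrt(1 - rho^2) * randn;
    % sample z2 conditioned on the just-updated z1
    z(t, 2) = rho * z(t, 1)   + sqrt(1 - rho^2) * randn;
end
burnin = 1000;                     % discard the burn-in samples
corrcoef(z(burnin+1:end, :))       % off-diagonal entries should be close to rho

After the burn-in period the rows of z are (correlated) draws from the joint distribution, even though we only ever sampled from the one-dimensional conditionals.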

A Matlab demo of collapsed Gibbs sampling for learning LDA

The equations of the conditional posterior are based on the technical note by Yi Wang,
"Gibbs Sampling and Latent Dirichlet Allocation: The Gritty Details"

I have summarized a step-by-step illustration of the derivation of this method in a previous post:
A step-by-step derivation of collapsed Gibbs sampling of LDA

The code can be downloaded from my SkyDrive.
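To give an idea of what such a sampler looks like, here is a minimal Matlab sketch of the collapsed Gibbs updates (this is my own illustration, not the code on SkyDrive; the function name lda_gibbs and the variables w, d, z, nwk, ndk, nk are names I made up):

function [z, nwk, ndk] = lda_gibbs(w, d, K, V, M, alpha, beta, niter)
% Collapsed Gibbs sampling for LDA -- a minimal sketch.
% w(i): word id of token i (1..V);  d(i): document id of token i (1..M)
% K topics, symmetric hyperparameters alpha and beta.
N   = numel(w);
z   = randi(K, N, 1);          % random initial topic assignments
nwk = zeros(V, K);             % word-topic counts
ndk = zeros(M, K);             % document-topic counts
nk  = zeros(1, K);             % total tokens assigned to each topic
for i = 1:N                    % build the initial count tables
    nwk(w(i), z(i)) = nwk(w(i), z(i)) + 1;
    ndk(d(i), z(i)) = ndk(d(i), z(i)) + 1;
    nk(z(i)) = nk(z(i)) + 1;
end
for it = 1:niter
    for i = 1:N
        k = z(i);
        % remove token i from the counts
        nwk(w(i), k) = nwk(w(i), k) - 1;
        ndk(d(i), k) = ndk(d(i), k) - 1;
        nk(k) = nk(k) - 1;
        % full conditional p(z_i = k | z_-i, w), up to a constant
        p = (nwk(w(i), :) + beta) ./ (nk + V*beta) .* (ndk(d(i), :) + alpha);
        p = p / sum(p);
        % draw the new topic from the discrete distribution p
        k = find(cumsum(p) >= rand, 1, 'first');
        z(i) = k;
        % add token i back with its new topic
        nwk(w(i), k) = nwk(w(i), k) + 1;
        ndk(d(i), k) = ndk(d(i), k) + 1;
        nk(k) = nk(k) + 1;
    end
end
end

Each token is removed from the count tables, a new topic is drawn from the full conditional, and the token is added back in; the per-document denominator is omitted from p because it is constant over k.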


Saturday, August 29, 2009

papers: Pachinko Allocation

I found a few papers about the Pachinko Allocation (PA) model. Basically, it is an enhanced LDA model: in LDA the topics are assumed to be independent, while PA introduces a DAG structure over topics to capture their correlations and hierarchies. This sounds like a reasonable assumption. The model has also been extended to a non-parametric version and applied to object recognition.


Pachinko Allocation (PA) model:



Extension of PA to non-parametric model:
Applications in object recognition:
Code of PAM at Wei Li's webpage

A step-by-step derivation of collapsed Gibbs sampling of LDA

How can we derive the collapsed Gibbs sampling procedure for LDA from scratch? Here is a step-by-step illustration, based on Wang Yi's tech report.
  1. To fully describe an LDA model, we need to solve the following three problems:
     1. the latent variables $z_i$ for each word $w_i$, where $i$ indexes a word token in the whole training set, i.e., $i \in \{1, \dots, W\}$ with $W$ the total number of tokens;
     2. the parameters $\vartheta_m$, where $\vartheta_m$ specifies the topic distribution for document $d = m$;
     3. the parameters $\varphi_k$, where $\varphi_k$ specifies the word distribution for topic $z = k$.
  2. Given the training data set, we only need to know the latent variables $\mathbf{z}$, because the two sets of parameters can be recovered as statistics of the association between the observed $\mathbf{w}$ and the corresponding $\mathbf{z}$.
  3. Problem 1 cannot be solved deterministically due to the noise in the data, so a practical solution is to estimate the posterior $p(\mathbf{z} \mid \mathbf{w})$.
  4. Directly estimating $p(\mathbf{z} \mid \mathbf{w})$ is difficult due to the complex form of the distributions in LDA. Gibbs sampling solves this problem by approximating $p(\mathbf{z} \mid \mathbf{w})$ with samples drawn from it after the burn-in period, i.e., once the Markov chain has become stationary.
  5. To draw samples from $p(\mathbf{z} \mid \mathbf{w})$, we do not need to know the exact form of this distribution. All we need is the full conditional $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})$. The remaining work is to derive this function.
  6. The conditional distribution can be derived from the joint distribution: $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w}) = \dfrac{p(\mathbf{w}, \mathbf{z})}{p(\mathbf{w}, \mathbf{z}_{\neg i})} = \dfrac{p(\mathbf{w}, \mathbf{z})}{p(\mathbf{w}_{\neg i}, \mathbf{z}_{\neg i})\, p(w_i)}$. The second equality comes from the fact that $w_i$ only depends on $z_i$.
  7. The denominator has a form similar to the numerator, so we only need to derive the form of $p(\mathbf{w}, \mathbf{z})$.
  8. $p(\mathbf{w}, \mathbf{z})$ can be derived by using the conditional independence properties of LDA: $p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha)$.
  9. The two distributions in step 8 are both multinomial distributions with Dirichlet conjugate priors, so their derivations are similar. I summarize the key steps of their derivations side by side to emphasize these similarities:
     For $p(\mathbf{z} \mid \alpha)$ (topic assignments, Dirichlet prior on each $\vartheta_m$): integrating out $\Theta = \{\vartheta_m\}$ gives $p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \dfrac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}$, where $\mathbf{n}_m$ counts how many tokens of document $m$ are assigned to each topic.
     For $p(\mathbf{w} \mid \mathbf{z}, \beta)$ (words given topics, Dirichlet prior on each $\varphi_k$): integrating out $\Phi = \{\varphi_k\}$ gives $p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \dfrac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}$, where $\mathbf{n}_k$ counts how often each vocabulary word is assigned to topic $k$.
     Here $\Delta(\cdot)$ denotes the normalization term of the Dirichlet distribution.
  10. Substituting the results of step 9 into step 8 and then into step 6, we can express the conditional distribution as a function of the word-topic co-occurrence counts $n_k^{(w_i)}$, the topic-document co-occurrence counts $n_m^{(k)}$, and the hyperparameters, and we can thus draw samples from it (the resulting update formula is written out at the end of this post).
  11. The procedure of Gibbs sampling for LDA learning can then be summarized in a figure from Wang Yi's tech report:

This sampling scheme integrates out the model parameters $\Theta$ and $\Phi$, which is why the strategy is called "collapsed" Gibbs sampling.
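For reference, here is the resulting full conditional written out, assuming symmetric hyperparameters $\alpha$ and $\beta$ and a vocabulary of size $V$ (a sketch following the count notation above; factors that are constant in $k$ are dropped):

% collapsed Gibbs update for LDA: probability of assigning token i
% (word w_i in document m) to topic k, given all other assignments
\[
  p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
  \frac{n_{k,\neg i}^{(w_i)} + \beta}{\sum_{t=1}^{V} n_{k,\neg i}^{(t)} + V\beta}
  \,\bigl(n_{m,\neg i}^{(k)} + \alpha\bigr)
\]
% n_{k,\neg i}^{(t)}: times word t is assigned to topic k, excluding token i
% n_{m,\neg i}^{(k)}: tokens of document m assigned to topic k, excluding token i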

a better way to write Blog

I found a very good online editor, zoho, which provides much richer formatting. You can just use it as an alternative to Word or WinEdt. Combined with the codecogs equation editor, we can write blog posts with complex equations easily. My previous post was written with these tools.

Later on, I found it is best to use zoho exclusively, because zoho can export your file in LaTeX format, which can be reused in your other documents. codecogs also provides LaTeX code, but we need to add some HTML code to put it into the blog, which makes the blog messy. zoho has no such problem.

Conjugacies of the Bernoulli distribution and its multivariate case

I am reading
Parameter estimation for text analysis
by Gregor Heinrich and
Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details
by Yi Wang, hoping to grasp the technical details of Gibbs sampling in the LDA model. Both are excellent materials for newbies. Gregor's report provides more background knowledge, such as Bayesian learning, Bayes networks and the related distributions; the derivation of Gibbs sampling for LDA occupies only the second half of that report. Yi's report expands on the derivations in that latter part. So it is better to read Gregor's report before Yi's. I did not know this, started with Yi's, and found it would have been easier the other way around.

The following is a summary of the conjugacy of the Bernoulli distribution and its multivariate counterpart, summarized from Gregor's report.



bi-variate (Bernoulli/binomial, parameter $p$) vs. multi-variate (multinomial, parameter vector $\mathbf{p}$):

single trial:
  bi-variate: $\mathrm{Bern}(x \mid p) = p^{x}(1-p)^{1-x}$, $x \in \{0, 1\}$
  multi-variate: $p(x = t \mid \mathbf{p}) = p_t$, $t \in \{1, \dots, K\}$

multiple trials, n times:
  bi-variate: $\mathrm{Bin}(n^{(1)} \mid p, n) = \binom{n}{n^{(1)}} p^{n^{(1)}}(1-p)^{n^{(0)}}$, with $n^{(1)} + n^{(0)} = n$
  multi-variate: $\mathrm{Mult}(\mathbf{n} \mid \mathbf{p}, n) = \frac{n!}{\prod_t n_t!} \prod_t p_t^{n_t}$, with $\sum_t n_t = n$

conjugate prior to p:
  bi-variate: $\mathrm{Beta}(p \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$
  multi-variate: $\mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha}) = \frac{1}{\Delta(\boldsymbol{\alpha})} \prod_t p_t^{\alpha_t - 1}$

conjugate prior's normalization term:
  bi-variate: $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$
  multi-variate: $\Delta(\boldsymbol{\alpha}) = \frac{\prod_t \Gamma(\alpha_t)}{\Gamma(\sum_t \alpha_t)}$

posterior of p given n observations:
  bi-variate: $\mathrm{Beta}(p \mid \alpha + n^{(1)}, \beta + n^{(0)})$
  multi-variate: $\mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha} + \mathbf{n})$

likelihood of the n observations in terms of the hyperparameters, by integrating out the parameter p:
  bi-variate: $\frac{B(\alpha + n^{(1)}, \beta + n^{(0)})}{B(\alpha, \beta)}$
  multi-variate: $\frac{\Delta(\boldsymbol{\alpha} + \mathbf{n})}{\Delta(\boldsymbol{\alpha})}$
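As a worked check of the last two rows (my own illustration, in the same notation), the posterior follows from multiplying the prior by the multiple-trials likelihood, and the marginal likelihood follows from integrating the parameter out:

\begin{align*}
  % Beta-Bernoulli: prior times likelihood is again a Beta density
  p(p \mid \mathbf{x}, \alpha, \beta)
    &\propto p^{n^{(1)}}(1-p)^{n^{(0)}}\, p^{\alpha-1}(1-p)^{\beta-1}
     \;\propto\; \mathrm{Beta}\bigl(p \mid \alpha + n^{(1)},\, \beta + n^{(0)}\bigr) \\
  % Dirichlet-multinomial: the same pattern in K dimensions
  p(\mathbf{p} \mid \mathbf{x}, \boldsymbol{\alpha})
    &\propto \prod_t p_t^{n_t} \prod_t p_t^{\alpha_t - 1}
     \;\propto\; \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha} + \mathbf{n}) \\
  % integrating out the parameter gives the last row of the table
  p(\mathbf{x} \mid \boldsymbol{\alpha})
    &= \int \prod_t p_t^{n_t}\, \frac{1}{\Delta(\boldsymbol{\alpha})} \prod_t p_t^{\alpha_t - 1}\, d\mathbf{p}
     = \frac{\Delta(\boldsymbol{\alpha} + \mathbf{n})}{\Delta(\boldsymbol{\alpha})}
\end{align*}

This last ratio is exactly the form that appears in the collapsed Gibbs derivation for LDA above.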