Saturday, November 28, 2009

fast and parallel Gibbs sampling for LDA

Gibbs sampling for LDA is very simple to understand and implement, especially collapsed Gibbs sampling. But one drawback of Gibbs sampling is that its complexity is linear in the number of word tokens. This problem becomes even more serious when we apply LDA-based approaches to computer vision, where visual words extracted from images play the role of words in documents. To maximize the chance of detecting an object in an image, we need a large number of visual word tokens. It is increasingly popular to extract features on a dense regular grid over the image, and in the extreme case some people extract features at every pixel over several scales. We also often extract several types of features and hope they complement each other, since we usually do not know which type of feature is most useful for a particular object category. Combining these factors, there are often 10k ~ 50k word tokens extracted per image. For Gibbs sampling, this is a nightmare!
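To see why the cost is linear in the number of tokens, here is a minimal sketch of one sweep of collapsed Gibbs sampling for LDA in Python (the variable names are my own; this only illustrates the standard update, it is not code from either of the papers below):

```python
import numpy as np

def collapsed_gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, K, V):
    """One full sweep of collapsed Gibbs sampling for LDA.

    docs : list of documents, each a list of word ids
    z    : current topic assignment of every token, same shape as docs
    n_dk : D x K matrix, topic counts per document
    n_kw : K x V matrix, word counts per topic
    n_k  : length-K vector, total word count per topic
    """
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):            # one pass over EVERY token
            k_old = z[d][i]
            # remove the token's current assignment from the counts
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1

            # full conditional p(z = k | rest), evaluated for all K topics
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = np.random.choice(K, p=p / p.sum())

            # add the token back under its newly sampled topic
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
```

Every token is visited once per sweep and each visit costs O(K) work for the full conditional, so with 10k ~ 50k visual words per image and hundreds of topics, a single sweep over one image already means millions of operations.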

So a fast or parallel Gibbs sampler is absolutely a rescue. Two such papers have appeared recently, both with published code (that is great!):

PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications by Wang Yi et al. at Google, code (the data-parallel idea behind it is sketched below)

Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation by Porteous et al. (2008); here is a comment on it from LingPipe's blog.
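Roughly, the data-parallel idea behind PLDA (it follows the approximate distributed AD-LDA scheme) is: partition the documents across workers, let each worker run an ordinary collapsed Gibbs sweep over its own shard against a local snapshot of the topic-word counts, then merge everyone's count changes back into the global counts before the next sweep. Here is a minimal Python sketch of just that bookkeeping, reusing collapsed_gibbs_sweep from above; the workers are simulated sequentially and the names are made up, so this is only an illustration, not the released MPI/MapReduce code:

```python
import numpy as np

def parallel_lda_iteration(shards, n_kw, n_k, alpha, beta, K, V):
    """One AD-LDA style iteration.

    shards : list of (docs, z, n_dk) tuples, one per worker
    n_kw   : global K x V topic-word counts (updated in place)
    n_k    : global length-K topic totals (updated in place)
    """
    deltas = []
    for docs, z, n_dk in shards:          # in PLDA these run in parallel
        local_n_kw = n_kw.copy()          # worker-local snapshot of the counts
        local_n_k = n_k.copy()
        collapsed_gibbs_sweep(docs, z, n_dk, local_n_kw, local_n_k,
                              alpha, beta, K, V)
        # record how this worker changed the topic-word counts
        deltas.append((local_n_kw - n_kw, local_n_k - n_k))

    # merge: the global counts absorb every worker's changes
    for d_kw, d_k in deltas:
        n_kw += d_kw
        n_k += d_k
```

Because the workers only exchange count deltas once per sweep, the Gibbs sweeps themselves run independently, which is what lets the scheme scale to large corpora, at the price of each worker sampling against slightly stale counts.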

Another paper related to topic model inference on large-scale corpora is
