Sunday, October 4, 2009

Author-topic model and transformed LDA

Latent Dirichlet Allocation (LDA) is essentially a generative model for document analysis rather than classification, and it is an unsupervised rather than supervised learning algorithm. Given a new document, LDA outputs topic proportions instead of a document category, so LDA cannot be used directly for classification.

The author-topic model (ATM), on the other hand, can be used for classification, as long as we view the author as the category label.

A comparison between the two models can be summarized as follows, where the figures are from the UAI 2004 paper by M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth:

LDA generative process:
  • choose a topic mixture θ_d ~ Dirichlet(α)
  • for each of the N_d words in document d:
    • choose a topic z ~ Multinomial(θ_d)
    • choose a word w ~ Multinomial(φ_z)

ATM generative process:
  • for each of the N_d words in document d:
    • choose an author x uniformly from a_d, the author set of document d
    • choose a topic z ~ Multinomial(θ_x)
    • choose a word w ~ Multinomial(φ_z)
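To make the contrast concrete, here is a minimal sketch of both generative processes in Python/NumPy. All sizes, hyperparameters, and the two author names are hypothetical choices for illustration, not values from the paper; the per-topic word distributions φ are drawn once and shared by both models, as in the graphical models above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
K, V, N_d = 4, 50, 20            # topics, vocabulary size, words in document d
alpha, beta = 0.5, 0.1           # symmetric Dirichlet hyperparameters
phi = rng.dirichlet([beta] * V, size=K)  # per-topic word distributions phi_z

def generate_lda_doc():
    """LDA: draw a fresh topic mixture theta_d for each document."""
    theta_d = rng.dirichlet([alpha] * K)       # choose theta_d ~ Dirichlet(alpha)
    words = []
    for _ in range(N_d):
        z = rng.choice(K, p=theta_d)           # choose topic z ~ Mult(theta_d)
        words.append(rng.choice(V, p=phi[z]))  # choose word w ~ Mult(phi_z)
    return words

def generate_atm_doc(theta_by_author, author_set):
    """ATM: no per-document mixture; each word first picks an author uniformly."""
    words = []
    for _ in range(N_d):
        x = rng.choice(author_set)                # author x ~ Uniform(a_d)
        z = rng.choice(K, p=theta_by_author[x])   # topic z ~ Mult(theta_x)
        words.append(rng.choice(V, p=phi[z]))     # word w ~ Mult(phi_z)
    return words

# Example: two (hypothetical) authors, each with a fixed topic mixture.
theta_by_author = {a: rng.dirichlet([alpha] * K) for a in ["A", "B"]}
doc_lda = generate_lda_doc()
doc_atm = generate_atm_doc(theta_by_author, ["A", "B"])
```

The only structural difference sits in the first line of each loop body: LDA samples θ per document, while ATM looks θ up from a fixed per-author table.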

Notice that the most significant difference between ATM and LDA is that the topic mixture weights are not generated per document; rather, there is a finite number of possible topic mixture weights, indexed by the author information of each document.

For document classification, if we view the author as the class label and restrict each document's author set a_d to a single author, the ATM model can be applied directly.
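Under that single-author assumption, classification amounts to picking the author whose topic mixture best explains the new document. Below is a hedged sketch: the parameters θ (per-author topic mixtures) and φ (topic-word distributions) are assumed to have already been estimated (e.g., by Gibbs sampling as in the paper); the values and author names here are toy placeholders.

```python
import numpy as np

# Assumed already estimated by ATM training; toy placeholder values here.
K, V = 4, 50
rng = np.random.default_rng(1)
phi = rng.dirichlet([0.1] * V, size=K)          # topic-word distributions, (K, V)
theta = {"author_A": rng.dirichlet([0.5] * K),  # per-author topic mixtures
         "author_B": rng.dirichlet([0.5] * K)}

def log_likelihood(words, theta_x):
    # p(w | x) = sum_z theta_x[z] * phi[z, w], marginalizing the topic per word
    word_probs = theta_x @ phi                  # shape (V,): mixture over topics
    return np.log(word_probs[words]).sum()

def classify(words):
    # Single-author assumption: the class label is the author maximizing
    # the document likelihood.
    return max(theta, key=lambda x: log_likelihood(words, theta[x]))

doc = rng.choice(V, size=30)                    # a toy "document" of word indices
label = classify(doc)
```

This is the usual Bayes-classifier reading of ATM: each author plays the role of a class-conditional distribution over words.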

My interest in the ATM model is due to Sudderth's transformed LDA model, which reduces to an ATM when the spatial transformation is ignored (see the part inside the big red square).




