Latent Dirichlet Allocation (LDA) is a generative model for document analysis rather than classification, and it is an unsupervised rather than a supervised learning algorithm. Given a new document, LDA outputs topic proportions rather than a document category, so LDA cannot be used directly for classification.
The author-topic model (ATM), on the other hand, can be used for classification, as long as we view the author as the category label.
A comparison of the two models is summarized below; the graphical-model figures (omitted here) are from the
UAI 2004 paper by M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth:
LDA generative process:
- choose a topic mixture θ_d ~ Dirichlet(α) for document d
- for each of the N_d words in document d:
  - choose a topic z ~ Multinomial(θ_d)
  - choose a word w ~ Multinomial(φ_z), where φ_z is the word distribution of topic z

ATM generative process:
- for each of the N_d words in document d:
  - choose an author x from a_d, the author set of document d, following a uniform distribution
  - choose a topic z ~ Multinomial(θ_x), where θ_x is the topic mixture of author x
  - choose a word w ~ Multinomial(φ_z)
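The two generative processes can be sketched as samplers to make the contrast concrete. This is a minimal sketch, not the paper's code; the vocabulary size, topic count, hyperparameters, and author names are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 20          # vocabulary size, number of topics (hypothetical)
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters (hypothetical)
phi = rng.dirichlet([beta] * V, size=K)  # per-topic word distributions, shared by both models

def generate_lda(n_words):
    """LDA: a fresh topic mixture theta is drawn for each document."""
    theta = rng.dirichlet([alpha] * K)          # theta_d ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)              # topic z ~ Multinomial(theta_d)
        words.append(rng.choice(V, p=phi[z]))   # word w ~ Multinomial(phi_z)
    return words

def generate_atm(n_words, authors, theta_by_author):
    """ATM: theta comes from a fixed per-author table, not per document."""
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)                 # author x uniform over the document's author set
        z = rng.choice(K, p=theta_by_author[x]) # topic from that author's mixture
        words.append(rng.choice(V, p=phi[z]))
    return words

theta_by_author = {a: rng.dirichlet([alpha] * K) for a in ["alice", "bob"]}
doc = generate_atm(50, ["alice", "bob"], theta_by_author)
```

The only structural change between the two samplers is where θ comes from: drawn fresh per document in LDA, looked up in a fixed per-author table in ATM.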
Notice that the most significant difference of ATM from LDA is that the topic mixture weight θ is not generated for each document; rather, there is only a finite set of possible topic mixture weights, one per author, and the author information of each document specifies which of them are used.
For document classification, if we view the author as the class label and let the author set of each document be a single scalar (one class label per document), the ATM can be applied directly.
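As a sketch of this classification use, suppose the topic-word distributions φ and one topic mixture θ per class have already been estimated; a new document can then be scored under each class by its log-likelihood. All names and numbers below are hypothetical:

```python
import numpy as np

def classify(doc_words, phi, theta_by_class):
    """Pick the class whose topic mixture gives the highest log-likelihood,
    using p(w | c) = sum_z theta_c[z] * phi[z, w]."""
    scores = {}
    for c, theta in theta_by_class.items():
        word_probs = theta @ phi                      # length-V vector p(w | c)
        scores[c] = np.log(word_probs[doc_words]).sum()
    return max(scores, key=scores.get)

# toy example: 2 topics, 3 vocabulary words
phi = np.array([[0.8, 0.1, 0.1],    # topic 0 favours word 0
                [0.1, 0.1, 0.8]])   # topic 1 favours word 2
theta_by_class = {"author_a": np.array([1.0, 0.0]),
                  "author_b": np.array([0.0, 1.0])}
label = classify([0, 0, 2], phi, theta_by_class)  # → "author_a"
```

This is just maximum likelihood over the per-class mixtures; a prior over classes could be added to the score for a MAP decision.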
My interest in the ATM is due to
Sudderth's transformed LDA model, which reduces to an ATM when the spatial transformation is ignored (see the part inside the big red square).