Monday, November 30, 2009

papers: prototype theory

In my previous post, "basic level classes and subordinate class", I mentioned Aharon Bar-Hillel's paper Subordinate class recognition using relational object models . Now I am going to build topic models for object hierarchies and need to have a better understanding of the prototype theory by Rosch. Here are some papers I found about this topic:

the seminal paper: Basic Objects in Natural Categories, cognitive psychology 1976. Another link

several blogs on this theory:

http://mixingmemory.blogspot.com/2005/03/basics-of-basic-level.html

http://poorbuthappy.com/ease/archives/2003/11/20/1944/basic-level-categories

http://everything2.com/title/basic+level+categories

Sunday, November 29, 2009

papers: Estimation of Dirichlet Distribution Parameters

Recently, I am interested in apply Pachinko Allocation topic models to the object recognition problems. Mixtures of Hierarchical Topics with Pachinko Allocation, ICML 2007 mentioned several methods in training the hPAM model, and here are the related papers:

Gibbs EM:

An introduction to MCMC for machine learning, Machine Learning 2003
implementations of the Monte Carlo EM Algorithm, 2001

fixed point iteration method for estimating the Dirichlet parameters:

Estimating a Dirichlet distribution, Minka 2000, and code

Maximum Likelihood Estimation of Dirichlet Distribution Parameters by Jonathan Huang

papers: syntax and topic model

Syntactic constraint is an important ingredient in NLP. At the beginning, topic models, such as LDA, assume bag-of-word model and thus ignore the syntax. Later on, this constraint is added to the topic model to improve the modeling power. Here are a few papers regarding this issue:

Integrating topics and syntax, NIPS 2005
Style and Topic Language Model Adaptation Using HMM-LDA, EMNLP 2006
Hidden Topic Markov Models, AISTATS 2007, presentation video
Topic Modeling: Beyond Bag-of-Words, ICML 2006, slides
Syntactic Topic Models, NIPS 2008, supplement materials
Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors, In Proceedings of the Workshop on Prior Knowledge for Text and language (held in conjunction with ICML/UAI/COLT), 2008

Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval, ICML 2007, technical report

paper: Rethinking LDA: Why Priors Matter

Rethinking LDA: Why Priors Matter, Hanna M. Wallach David Mimno Andrew McCallum, NIPS 2009

Abstract:

Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling.

Saturday, November 28, 2009

paper: On Smoothing and Inference for Topic Models

On Smoothing and Inference for Topic Models, UAI 2009
abstract:

Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.

paper: Multilevel Bayesian Models of Categorical Data Annotation

A paper I found from LingPipe's blog:

Multilevel Bayesian Models of Categorical Data Annotation

It seems to be close related to image annotation. More comments will follow after reading it.

fast and parallel Gibbs sampling for LDA

Gibbs sampling for LDA is very simple to understand and implement, especially the collapsed Gibbs sampling. But one drawbacks of GS is its complexity is linear to the number of word tokens. This problem is even more serious when we apply LDA-based approaches to computer vision problems where we use visual words in images to replace words in documents. To maximize our chance to detect the object in an image, we need large number of visual word tokens. It is more and more popular to extract features at dense regular grids over images, and to one extreme, someone extract features at every pixel with several scales. Also we often need to extract several types of features and hope them to be complementary to each other since we usually do not which type of feature is more useful for a particular object category. Combine these factors together, there are often more than 10k ~ 50k word tokens per image extracted. For Gibbs sampling, this is a nightmare!

So a fast Gibbs sampling or parallel Gibbs sampling are absolutely rescues. There are two such papers recently, with published codes (that is great!):

PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications by Wang Yi et al at Google, code

Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation at UCI, code

here is a comment from LingPipe's blog:

Porteous et al. (2008) Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation

Another paper related to topic model inference in large scale corpus is

Efficient Methods for Topic Model Inference on Streaming Document Collections

Friday, November 27, 2009

book: Data Analysis Using Regression and Multilevel/Hierarchical Models

LingPipe's blog
Finkel and Manning (2009) Hierarchical Bayesian Domain Adaptation
mentioned a book
Data Analysis Using Regression and Multilevel/Hierarchical Models.
After going through the table of contents, I found this book may be quite useful for me, especially read it together with Bishop's book PRML.

Thursday, November 26, 2009

paper: Boosted Bayesian Network Classifiers

Jing Yushi has a ICML 2005/ ML2008 paper:
Boosted Bayesian Network Classifiers
It seems very interesting to me. Now I am using topic models to implement my ideas. But generative models usually can not beat discriminative classifiers such as SVM in many cases. It is of interests of the generative guys to combine these two methods to benefit from both. Jing's paper show the boosted version of Naive Bayesian. Can we develop boosted topic models? It is good direction. I Googled and find no such work so far.

Google's image swirl

Jing Yushi @ Google just announced a new Google lab tool: Google Image Swirl, which is built on his previous work VisualRank .
I am really excited by this work, and also admire Yushi has such a good opportunity to realize the ideas in research domain to a real stuff workable on a real-world platform. Sure, there is still long to go but at least we see some light of the dawn.

Sunday, November 8, 2009

Google's new toy

Google's new toy: a portable search panel

If only one word is permitted to used to describe this Google's new toy, it will be "cool". It integrates the techniques of localization (probability with GPS or other instruments), image retrieval, OCR, object recognition, and combines with Google's cloud computing capability. OMG, it seems to me part of computer visioner's dream will soon come true, and many of us will lose our jobs on the other hand :(

Later on, I found this is a product concept design, not a Google product. This is the link of the author's blog.
Future of Internet Search: Mobile Vision

Saturday, November 7, 2009

A clever way to derive the collapsed Gibbs sampling for LDA

Inspired by Tom Griffiths technical report, Gibbs Sampling in the Generative Model of Latent Dirichlet Allocation, I derive the collapsed Gibbs sampling for LDA by myself using the tricks in Tom's report. These tricks are universally applicable to other topic models:

simplify the conditional property by employing Bayes theorem and
d-separation property
derive the results of conditional probability directly from the result of
the predictive likelihood of Dirichlet/multinomial distribution

There is no lengthy and complex computation like those in Wang Yi's note or Gregor Heinrich's note. It is easy to understand and has intuitive explanation for the formulas involved. I wrote up a report for my derivation, as a complementary to Tom's note:
Derivation of Collapsed Gibbs Sampling for LDA

Sunday, October 25, 2009

paper: DeltaLDA

DeltaLDA is a modification of the Latent Dirichlet Allocation (LDA) model which uses two different topic mixing weight priors to jointly model two corpora with a shared set of topics, where one topic mixing weight prior to model the normal pattern and the other for the abnormal pattern.
The graphical model:

An illustration of topic mixture weights in two scenarios:

This looks like quite similar to the Adapted Vocabularies for Generic Visual Categorization, ECCV 2006 in the way they split the topic/vocabulary into two sets, though there are fundamental difference in their underneath mechanism.

Friday, October 23, 2009

paper: Sketch2Photo: Internet Image Montage

Amazing realistic montage!

A few students from Tsinghua Univ. present this montage using images downloaded from the internet.

The links are as follows:
project web page
The demo on youtube
The paper on ACM SIGGRAPH ASIA 2009, ACM Transactions on Graphics

papers: visual attribute and object class recognition

Several recent papers discussed the methods to extract visual attributes from images and/or use these attributes for object class recognition. We can view visual attributes as another type of annotation. While image annotation is applied to individual image, visual attributes are specified to an object class; image annotations are usually words, i.e., discrete values, visual attributes can be either discrete (e.g. color={red, blue, ...}) or continuous values (e.g., average size). Here are a few papers I am reading:

Describing Objects by their Attributes, CVPR 2009
Joint learning of visual attributes, object classes and visual saliency, ICCV 2009
Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer, CVPR 2009
Learning Visual Attributes, NIPS 2007

Here is a blog about visual attribute from Tombone.

Sunday, October 18, 2009

segmentation vs. recognition

In What is segmentation-driven object recognition?,
Tomasz finally remarked that the learning-driven segmentation may be a hot topic in the next few years. I totally agree with him. The problem for us is how to design such algorithms to be robust to intra-class variations, scales and pose. This remains a challenging problem in recognition community.

In the comments of this blog, someone suggest to check the most up-to-date segmentation results in PASCAL 2009.

basic level classes and subordinate class

Comments: the following paper provides a good insight into the role of generative and discriminative models in learning a large number of object categories, i.e., we can use the generative models to distinguish categories at basic level, and discriminative models to differentiate lower-level and similar categories.

In Subordinate class recognition using relational object models
Aharon Bar-Hillel, Daphna Weinshall, NIPS, 2006, the authors illustrate some interesting points:

"Human categorization is fundamentally hierarchical, where categories are organized in tree-like hierarchies.

higher nodes close to the root describe inclusive classes (like vehicles),
intermediate nodes describe more specific categories (like motorcycles),
lower nodes close to the leaves capture fine distinctions between objects (e.g., cross vs. sport motorcycles).

Intuitively one could expect such hierarchy to be learnt either bottom-up or top-down (or both), but surprisingly, this is not the case. In fact, there is a well defined intermediate level in the hierarchy, called basic level, which is learnt first [11]...."

"The primary role of basic level categories seems related to the structure of objects in the world. In [13], Tversky & Hemenway promote the hypothesis that the explanation lies in the notion of parts.Their experiments show that

basic level categories (like cars and flowers) are often described as a combination of distinctive parts (e.g., stem and petals), which are mostly unique.
higher levels (superordinate and more inclusive) are more often described by their function (e.g., ’used for transportation’),
lower levels (sub-ordinate and more specific) are often described by part properties (e.g., red petals) and other fine details."

Based on these assumptions, Bar-Hillel and Weinshall proposed a two stage approach for subordinate class recognition:

First we should learn a generative model for the basic category. Using such a model, the object parts should be identified in each image, and their descriptions can be concatenated into an ordered vector. This stage is used to solve the correspondence problem: features in the same entry in two different image vectors correspond since they implement the same part.
In a second stage, the distinction between subordinate classes can be done by applying standard machine learning tools, like SVM, to the resulting ordered vectors, since the correspondence problem has been solved in the first stage.

Another paper reinforce this idea from the psychology study: Comparison Processes in Category learning: From Theory to Behavior, Rubi Hammer, Aharon Bar-Hillel, Tomer Hertz, Daphna Weinshall and Shaul Hochstein, Brain Research, Special issue on 'Brain and Vision', 2008.

Wednesday, October 14, 2009

a good summary on generative vs. discriminative models

The GenDisc2009 NIPS workshop is call for papers. Though I have no time to catch up the deadline, I found the brief discussion on the generative vs. discriminative models are quite useful. In case I lose the link or the link is broken in the future, I copy some contents as follows:

In generative approaches for prediction tasks, one models a joint distribution on inputs and outputs and parameters are typically estimated using a likelihood-based criterion. In discriminative approaches, one directly models the mapping from inputs to outputs (either as a conditional distribution or simply as a prediction function); parameters are estimated by optimizing objectives related to various loss functions. Discriminative approaches have shown better performance given enough data, as they are better tailored to the prediction task and appear more robust to model misspecification. Despite the strong empirical success of discriminative methods in a wide range of applications, when the structures to be learned become more complex than the amount of training data (e.g., in machine translation, scene understanding, biological process discovery), some other source of information must be used to constrain the space of candidate models (e.g., unlabeled examples, related data sources or human prior knowledge). Generative modeling is a principled way of encoding this additional information, e.g., through probabilistic graphical models or stochastic grammar rules. Moreover, they provide a natural way to use unlabeled data and are sometimes more computationally efficient.

Theoretical analysis of generative versus discriminative learning has a long history in statistics, where the focus was on asymptotic analyses (e.g. [Efron 75]). Ng and Jordan provided an initial comparison of generative versus discriminative learning in the non-asymptotic regime in the most cited paper on the topic in machine learning [Ng 02]. For a few years, this paper was one of the only machine learning papers providing a theoretical comparison, and was responsible for the conventional wisdom: "use generative learning for small amount of data and discriminative learning for large amounts". Recently, there has been new advances on our theoretical understanding [Liang 08, Xue 08] and their combination [Bouchard 07, Xue 09].

On the empirical side, combinations of discriminative and generative methodologies have been explored by several authors [Raina 04, Bouchard 04, McCallum 06, Bishop 07, Schmah 09] in many fields such as natural language processing, speech recognition, and computer vision. In particular, the recent "deep learning" revolution of neural networks relies heavily on a hybrid generative-discriminative approach: an unsupervised generative learning phase ("pre-training") is followed by discriminative fine-tuning. Given these recent trends, a workshop on the interplay of generative and discriminative learning seem especially relevant.

Hybrid generative-discriminative techniques face computational challenges. For some models, training these hybrids is akin to the discriminative training of generative models, which is a notoriously hard problem ([Bottou 91] for discriminatively trained HMM, [Jebara 04, Salojarvi 05] for EM-like algorithms), though for other models, learning can be in fact simple [Raina 04, Wettig 03]. Alternatively, the use of generative models in predictive settings has been be explored, e.g., through the use of Fisher kernels [Jaakkola 98] or other probabilistic kernels. One of the goal of the workshop will be to highlight the connections between these approaches.

The aim of this workshop is .... (ignored)

References

[Bishop 07] C. M. Bishop and J. Lasserre, Generative or Discriminative? getting the best of both worlds. In Bayesian Statistics 8, Bernardo, J. M. et al. (Eds), Oxford University Press. 3–23, 2007.

[Bottou 91] L. Bottou, Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole. Doctoral dissertation, Université de Paris XI, 1991.

[Bouchard 04] G. Bouchard and B. Triggs, The tradeoff between generative and discriminative classifiers. In J. Antoch, editor, Proc. of COMPSTAT'04, 16th Symposium of IASC, volume 16. Physica-Verlag, 2004.

[Bouchard 07] G. Bouchard, Bias-variance tradeoff in hybrid generative-discriminative models. In proc. of the Sixth International conference on Machine Learning and Applications (ICMLA 07), Cincinnati, Ohio, USA, 13-15 December 2007.

[Efron 75] B. Efron, The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70(352), 892—898, 1975.

[Greiner 02] R. Greiner and W. Zhou. Structural extension to logistic regression: Discriminant parameter learning of belief net classifiers. In Proceedings of the Eighteenth Annual National Conference on Artificial Intelligence (AAAI-02), 167–173, 2002.

[Jaakkola 98] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1998.

[Jaakkola 99] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems 12. MIT Press, 1999.

[Jebara 04] T. Jebara, Machine Learning - Discriminative and Generative. International Series in Engineering and Computer Science, Springer, Vol. 755, 2004.

[Liang 08] P. Liang and M. I. Jordan, An asymptotic analysis of generative, discriminative, and pseudo-likelihood estimators. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[McCallum 06] A. McCallum, C. Pal, G. Druck and X. Wang, Multi-Conditional Learning: Generative/Discriminative Training for Clustering and Classification. AAAI, 2006.

[Ng 02] A. Y. Ng and M. I. Jordan, On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes. In Advances in Neural Information Processing Systems 14, 2002.

[Salojarvi 05] J. Salojärvi, K. Puolamäki and S. Kaski, Expectation maximization algorithms for conditional likelihoods. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.

[Schmah 09] T. Schmah, G. E Hinton, R. Zemel, S. L. Small and S. Strother, Generative versus discriminative training of RBMs for classification of fMRI images. In Advances in Neural Information Processing Systems 21, 2009.

[Wettig 03] H. Wettig, P. Grünwald, T. Roos, P. Myllymäki and H.Tirri, When discriminative learning of Bayesian network parameters is easy. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 2003), 491-496, August 2003

[Xue 08] J.-H Xue and D.M. Titterington, Comment on "discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes". Neural Processing Letters, 28(3), 169-187, 2008.

[Xue 09] J.-H Xue and D.M. Titterington, Interpretation of hybrid generative/discriminative algorithms. Neurocomputing, 72(7-9), 1648-1655, 2009.

Thursday, October 8, 2009

test

I just found the editor in Google Docs is a pretty handy tool to write blog with pictures, tables and Latex equations. The following is an example. It seems the only feature lacking in Google is the capability to download the file as Latex source. But this is not so crucial. When I need to write a big a document in Latex, I will write using WinEdt or other Latex editor and upload the PDF file to the blog, just as I have done in the previous few technical notes.

$\phi_{ji} | \phi_{j1}, .., \phi_{j, i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{T_j} \frac{n_{jt}}{\alpha_0 + i -1} \delta_{\psi_{jt}} + \frac{\alpha_0}{\alpha_0 + i-1} G_0$

Wednesday, October 7, 2009

an easy way to write Latex codes in Blogger

I just found an easy way to write Latex codes directly in Blogger.
See How To Install Latex On Blogger/Blogspot
Following the steps described there, I can write the Latex codes between double dollar signs as I do in TexCenter or WinEdt, and get the desired equation. It is really cool!

A test:
The Latex codes are:
${\phi_{ji} | \phi_{j1}, .., \phi_{j, i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{T_j} \frac{n_{jt}}{\alpha_0 + i -1} \delta_{\psi_{jt}} + \frac{\alpha_0}{\alpha_0 + i-1} G_0}$
and the result is:
$\phi_{ji} | \phi_{j1}, .., \phi_{j, i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{T_j} \frac{n_{jt}}{\alpha_0 + i -1} \delta_{\psi_{jt}} + \frac{\alpha_0}{\alpha_0 + i-1} G_0$

It works very well.

I have tried to immigrate from Blogger to WordPress for the Latex capability in WordPress. But I found there are often errors in render Latex in WordPress, which makes this feature less attractive. Also I did not figure out how to change font and font size in WordPress. So I keep staying in Blogger. I hope Google can soon release a much power editor, with buttons on top the editor to enable insert Latex and symbols, and as well as tables. These features are the most frequently used features in my experience. Before Google doing so, I think zoho will be my favorite.

Sunday, October 4, 2009

Author-topic model and transformed LDA

Latent Dirichlet Allocation (LDA) is essentially a generative model for document analysis rather than classification, and it is an unsupervised rather than supervised learning algorithm. Given a new document, the output of LDA is the topic proportion instead of document category. So LDA can not be directly used for classification.

Author-topic model (ATM), on the other hand, can be used in classification, as long as we view the author as the category label.

Comparison between the above two models can be summarized as follows, where the figures are from the UAI paper by M. Rosen-Zvi, T. Griffiths, M. Steyvers, P. Smyth, 2004:

LDA	ATM

generative process: choose for each of the words in document d choose choose	generative process: for each of the words in document d choose an author x from , the author set of document d following a uniform distribution choose choose

Notice the most significant difference in ATM compared to LDA is that the topic mixture weight

is not generated for each document; rather, there are finite number of possible topics mixture weights, which is specified by the author information in each document.

For document classification, if we view the author as the class label, and let

is a scalar, the ATM model can be directly applied.

My interest on the ATM model is due to Sudderth's transformed LDA model, which reduces to an ATM when ignore the spatial transformation (see the part inside the big red square).

a blog about faculty job hunting

I found a blog talking about faculty job hunting, which is very useful:

http://nlpers.blogspot.com/2009/09/some-notes-on-job-search.html

Sunday, September 20, 2009

a good review article for LDA

I happened to find a review article for LDA and its application for text, vision and music.
The link is Latent Dirichlet Allocation for Text, Images, and Music
and the slides is here

They are worth to read carefully.

Friday, September 18, 2009

papers: employing semantic hierarchy in object recognition

Semantic hierarchy could play an important role in object recognition. For example, if we know mini-van is a type of car, and we have already a model for car vs. the rest of the world, then we only need to differentiate mini-van from the car, which reduces lots of work. Similar idea has been noticed by object recognition researchers and there are several papers in recent years:

Exploiting object hierarchy: Combining models from different category levels, ICCV 2007
Learning and using taxonomies for fast visual categorization, CVPR 2008
Unsupervised discovery of visual object class hierarchies, CVPR 2008
Constructing category hierarchies for visual recognition, ECCV 2008
Unsupervised learning of visual taxonomies, CVPR 2008
Semantic Hierarchies for Visual Object Recognition, CVPR 2007
Latent Topic Random Fields: Learning using a taxonomy of labels, CVPR 2008

where some are listed in Trevor's course page.

Thursday, September 17, 2009

papers: supervised or discriminative topic model

Topic models are originally designed for topic discovery/clustering, not for classification. To use topic models for classification task, we have modify the structure of the topic model to add the class label and use it to bias the topic discovery process.

the following paper present some supervised/discriminative topic models by machine learning guys:

MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification, ICML 2009
DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification, NIIPS 2008
Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, EMNLP 2009
Supervised topic models, NIPS 2007

some supervised/discriminative topic models by computer vision guys:

A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Simultaneous Image Classification and Annotation, CVPR 2009
Spatially coherent latent topic model for concurrent object segmentation and classification, ICCV 2007
Towards Total Scene Understanding Classification, Annotation and Segmentation in an Automatic Framework, CVPR 2009
What, where and who? Classifying events by scene and object recognition, ICCV 2007
Learning Hierarchical Models of Scenes, Objects, and Parts, ICCV 2005

Wednesday, September 16, 2009

papers: extend topic model to deal with temporal dependency

Using topic models to analyze the tread and change of topics along the time line in a document corpus is definitely a cool idea. It has plenty of potentials in video analysis, human action understanding, etc. The following are a few papers related to this idea:

Dynamic Topic Models, ICML 2006
Continuous Time Dynamic Topic Models, UAI 2008
Hidden Topic Markov Models, AISTATS 2007

non-parametric models:

an HDP-HMM model is described in Yee Whye Teh's HDP paper
An HDP-HMM for Systems with State Persistence, ICML 2008
Infinite Hierarchical Hidden Markov Models. K. Heller, Y.W. Teh and D. Gorur. AISTATS 2009

Here is a good discussion on several infinite HMM:
a blog by Jurgen Van Gael, discussing several infinite HMM

Monday, September 14, 2009

Hierarchical Dirichlet Process: A Gentle Introduction

I wrote up a gentle introduction on HDP based on my two previous post:
Chinese restaurant process and Chinese restaurant franchise
from DP mixture to hierarchical DP mixture

The PDF file is here.

Friday, September 11, 2009

Chinese restaurant process and Chinese restaurant franchise

An illustration of Chinese restaurant process (CRP) and Chinese restaurant franchise (CRF). Materials are from Yee Whye Teh's 2004 technical report on "Hierarchical Dirichlet Process"

Chinese restaurant process (CRP):

For a set of random variables

distributed according to

, we have the following conditional distribution:

This can be described in a Chinese restaurant process metaphor:

Consider a Chinese restaurant with an unbounded number of tables,
The first customer sits at table 1
Suppose there are K tables occupied before the i-th customer comes, he can sit at

The relationship between

and

can be best illustrated as in the following figure.

random variables	meaning	metophor
	random variables	customer i
	distinct values within all	table k
	the number of associated to	the number of customers sitting around table k

Chinese restaurant franchise (CRF):

An essentially two-level Chinese restaurant process:

Within a restaurant, customers choose tables
Within all restaurants, tables choose dishes

In both levels, the choosing follows the Chinese restaurant process as illustrated above.

At the restaurant level, customers choose tables according to the following distribution:

At this level, the metaphor of CRF is the same as the one of CRP as described above, except some changes on symbols

Consider restaurant j with an unbounded number of tables,
The first customer sits at table 1
Suppose there are tables occupied before the i-th customer comes, he can sit at

The relationship between

and

can be best illustrated as in the following figure:

At the franchise level, table choose dishes according to the following distribution:

This can be described in a Chinese restaurant franchise metaphor:

Consider a Chinese restaurant franchise, whose J restaurants share a menu with unbounded number of dishes,
At each table of each restaurant, one dish is ordered from the public menu by the first customer who sits there, and it is shared among all customers who sit at that table. Multiple tables at multiple restaurants can serve the same dish
Suppose there are tables occupied before the i-th customer comes restaurant j and there are total K dishes has been ordered among all restaurants in the franchise. He can sit at an occupied table or a new table with certain probability, as described above. If he sits at an occupied table, he shares the dish that has been ordered at that table. If he sits at a new table, he order a dish for that table according to its popularity among the whole franchise, while a new dish can also be tried.

The relationship between

and

can be best illustrated as in the following figure.

random variables	meaning	metophor
	random variables	customer i in restaurant j
	distinct values of in group j	table t in restaurant j
	index of associated to ,	the table taken by customer i in restaurantj,.i.e., Table()=, Customer() =
	the number of associated to in group j	the number of customers sitting around table t in restaurantj
	distinct values in within all groups,	dish k, which is shared within all restaurants
	index of associated to ,	the dish ordered by table t in restaurantj, i.e., Dish() = , Table() =
	the number of associated to in group j	the number of tables ordered dish k in restaurantj
	, i.e., the number of associated to over all j	the total number of tables ordered dish k within all restaurants