Wednesday, December 8, 2010

Ph.D. proposal exam passed

I passed my Ph.D. proposal exam on Monday, Dec. 6th, 2010. The title of the proposal is "Bridging the Semantic Gap: Image and Video Understanding by Integrating Vision and Language". The hypothesis of the thesis is that many vision problems cannot be solved from visual data alone. Knowledge and reasoning, both of which language provides, need to be integrated into the loop. This idea is embodied in several recent projects, described on my research web page.


Now I need to work hard to finish the remaining work towards my Ph.D. Hopefully, it can be done within the next year.

Twenty Questions Game and Object Recognition

Objects can be defined by many features, parts, and attributes, each of which can be viewed as a test. Object recognition/detection then amounts to combining the outputs of these tests. Instead of performing all possible tests, a smarter approach is to select a small set of tests without sacrificing recognition accuracy. The process of selecting the right tests can be formulated as a 20-question game (see vision.ucsd.edu/sites/default/files/Visipedia20q.pdf): the object is recognized by sequentially asking an Oracle one question at a time and analyzing the answers it returns. The criterion for selecting the next question is the information gain expected from its answer. This approach is also called "Active Testing" in "An Active Testing Model for Tracking Roads in Satellite Images", PAMI 1996.
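The selection criterion above can be sketched in a few lines of Python. This is only a toy illustration, not the method of any of the cited papers: the object classes, the binary attributes, the uniform prior, and the noise-free Oracle are all assumptions made up for the example, and the names `OBJECTS`, `information_gain`, and `recognize` are hypothetical.

```python
import math

# Hypothetical toy knowledge base: each class is described by binary tests.
OBJECTS = {
    "cat":   {"has_fur": True,  "has_wheels": False, "can_fly": False, "is_alive": True},
    "bird":  {"has_fur": False, "has_wheels": False, "can_fly": True,  "is_alive": True},
    "car":   {"has_fur": False, "has_wheels": True,  "can_fly": False, "is_alive": False},
    "plane": {"has_fur": False, "has_wheels": True,  "can_fly": True,  "is_alive": False},
}


def entropy(posterior):
    """Shannon entropy (in bits) of a distribution over classes."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


def condition(posterior, test, answer):
    """Posterior after the Oracle gives `answer` to `test` (assumed noise-free)."""
    kept = {c: p for c, p in posterior.items() if OBJECTS[c][test] == answer}
    total = sum(kept.values())
    return {c: p / total for c, p in kept.items()} if total > 0 else {}


def information_gain(posterior, test):
    """Expected reduction in entropy from asking `test`."""
    expected = 0.0
    for answer in (True, False):
        p_answer = sum(p for c, p in posterior.items() if OBJECTS[c][test] == answer)
        if p_answer > 0:
            expected += p_answer * entropy(condition(posterior, test, answer))
    return entropy(posterior) - expected


def recognize(oracle, tests):
    """Ask the most informative test first; stop once one class remains."""
    posterior = {c: 1.0 / len(OBJECTS) for c in OBJECTS}  # assumed uniform prior
    remaining, asked = list(tests), []
    while len(posterior) > 1 and remaining:
        best = max(remaining, key=lambda t: information_gain(posterior, t))
        if information_gain(posterior, best) <= 0:
            break  # no remaining test can separate the surviving classes
        remaining.remove(best)
        asked.append(best)
        posterior = condition(posterior, best, oracle(best))
    label = max(posterior, key=posterior.get) if posterior else None
    return label, asked


if __name__ == "__main__":
    # The Oracle here simply looks up the true answers for a "plane".
    label, asked = recognize(lambda t: OBJECTS["plane"][t], list(OBJECTS["plane"]))
    print(label, asked)
```

In this toy setting every class is identified with two questions instead of all four tests, which is the saving the 20-question formulation aims for. A real system would replace the lookup table with image-based tests and would have to model noisy answers.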

As far as I know, the earliest work using this idea for object recognition is due to Donald Geman of JHU, described in his 1993 technical report "Shape Recognition and Twenty Questions". Each test is a local functional of the image, loosely corresponding either to configurations (vertex labels) resembling "endings", "junctions", and "turns", or to invariant relations (relational labels) between two vertex labels, e.g., "same class" or "same orientation".

The most recent works are "Active Testing for Face Detection and Localization", PAMI 2010, "Visual Recognition with Humans in the Loop", ECCV 2010a, and "Indoor Scene Recognition Through Object Detection Using Adaptive Objects Search", ECCV 2010b. In the PAMI 2010 paper, the tests are a specific type of image functional (i.e., the proportion of edges at a particular orientation and scale) within a local region. In the ECCV 2010b paper, the tests are object detectors. In the ECCV 2010a paper, the tests are object attributes and the Oracle is a human.

This idea can be extended in several directions. In terms of application domain, it can be used for scene and activity recognition. In terms of the questions to ask, we can go beyond "What" to richer questions such as "Where", "How Many", and "How Big". We are currently investigating these problems.

Wednesday, December 1, 2010

Microsoft Kinect: the next generation of HCI device?

With the release of Kinect, Microsoft has become a star in the eyes of computer vision and HCI researchers. It is a really cool idea to control your computer with your hands and body, without attaching or holding any other device. It opens up numerous possibilities. I think there will be a boom of games and VR/AR applications using Kinect in the next few years.

Kinect for Xbox 360 review

Open source Kinect driver

Kinect's open-source ambitions