One of the great problems in computer vision is to say what people are
doing from a picture or a video of them doing it. There are numerous
applications, ranging from building models of how people behave (to inform architectural design) to surveillance. This problem is very hard indeed, for several reasons. We don't really have a clean vocabulary for what people are doing, particularly for everyday activity. Often the interesting stuff is very rare indeed, and most of the time people just walk. People can do a great many different things. Finally, what they're doing looks different when seen from different angles.
There are several themes in my work here, which I break out in the list of papers below. I have been fortunate to work with several talented people on this topic, including Deva Ramanan, Nazli Ikizler, Ali Farhadi, Alex Sorokin and Du Tran.
I have done quite a lot of work with NSF support, which is also described on this page.
David A. Forsyth, Okan Arikan, Leslie Ikemoto, James O'Brien, Deva Ramanan, "Computational Studies of Human Motion: Part 1, Tracking and Motion Synthesis," Foundations and Trends® in Computer Graphics and Vision, Volume 1, Issue 2/3 (255 pp), 2006
We built a motion synthesizer that could produce a sequence of actions carrying specified labels. It was natural to couple this to a motion analyzer: match the observed motion to labelled synthesized motion and keep the labels. This produced quite good labellings of video sequences.
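The label-transfer idea can be sketched very simply: given motion with known activity labels, match the observed motion to it and carry the labels across. Below is a minimal sketch, assuming pose descriptors have already been extracted from both the labelled motion library and the video track; the function name, the descriptors, and the nearest-neighbour matching score are placeholders for illustration, not the actual system.

import numpy as np

def transfer_labels(video_features, library_features, library_labels):
    """Assign each video frame the label of its nearest labelled library frame.

    video_features   : (T, D) array of pose descriptors from the video track
    library_features : (N, D) array of pose descriptors from labelled motion
    library_labels   : length-N list of activity labels (e.g. 'walk', 'wave')
    """
    labels = []
    for f in video_features:
        # Euclidean nearest neighbour in descriptor space; a placeholder
        # matching score standing in for the coupled analyzer/synthesizer.
        d = np.linalg.norm(library_features - f, axis=1)
        labels.append(library_labels[int(np.argmin(d))])
    return labels

In practice one would also smooth the labels over time, since per-frame matches can flicker between similar activities.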
The difficulty with the original motion synthesizer approach was that one couldn't label activities one hadn't seen examples of. We built a method to construct models of human activity from motion capture that could be strung together, both across the body and across time. This meant that we could search for motions for which we had no visual example, using a simple query language based around finite state automata.
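The query-by-automaton idea: accept a motion if its label sequence drives a finite state automaton describing the query into an accepting state. A toy sketch, assuming per-frame (or per-segment) labels are already available; the query "walking followed by carrying", the label names, and the state names are made up for illustration.

def run_query(label_sequence, transitions, start, accepting):
    """Run a finite state automaton over a sequence of activity labels.

    transitions : dict mapping (state, label) -> next state; missing
                  entries leave the automaton in its current state.
    Returns True if the sequence reaches an accepting state.
    """
    state = start
    for lab in label_sequence:
        state = transitions.get((state, lab), state)
        if state in accepting:
            return True
    return state in accepting

# Toy query: a stretch of walking followed by a stretch of carrying.
transitions = {
    ("start", "walk"): "walking",
    ("walking", "carry"): "done",
}
print(run_query(["stand", "walk", "walk", "carry"], transitions, "start", {"done"}))  # True

Composite queries across the body work the same way, with one automaton per body part run over that part's label stream.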
We would like to build discriminative models of particular activities, and spot them in video. A natural place to start is word-spotting in sign language.
Here we want to build models of activity in one view and then use them in another viewpoint; or we might want to build models from a dictionary, and find examples in real video (the view-transfer idea is sketched after the citation below).
Ali Farhadi, David Forsyth, Ryan White, "Transfer Learning
in Sign Language," IEEE Conference on Computer Vision and Pattern
Recognition, 2007
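One simple way to think about view transfer (a sketch of the general idea under stated assumptions, not the specific construction in the paper): use a handful of words observed in both views to estimate a linear map between the two feature spaces, then apply classifiers trained in the source view to mapped target-view features. The function names and the least-squares map are illustrative choices.

import numpy as np

def fit_view_map(src_feats, tgt_feats):
    """Least-squares linear map from target-view features to source-view
    features, estimated from examples of the same words seen in both views.

    src_feats, tgt_feats : (M, D) arrays; row i of each describes the same word.
    """
    W, *_ = np.linalg.lstsq(tgt_feats, src_feats, rcond=None)
    return W  # shape (D, D)

def classify_in_new_view(tgt_example, W, source_view_classifier):
    """Map a target-view example into the source view and reuse a
    classifier trained there (e.g. a nearest-word-model scorer)."""
    return source_view_classifier(tgt_example @ W)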