
Labelling Human Activity

One of the great problems in computer vision is to say what people are doing from a picture or a video of them doing it. Applications are numerous, ranging from building models of how people behave (to improve architectural design) to surveillance. This problem is very hard, for several reasons. We don't have a clean vocabulary for what people are doing, particularly for everyday activity. The interesting events are often very rare; most of the time, people just walk. People can do a lot of different things. Finally, the same activity looks different when seen from different angles.

Several themes run through my work here, which I break out in the list of papers. I have been fortunate to work with several talented people on this topic, including Deva Ramanan, Nazli Ikizler, Ali Farhadi, Alex Sorokin and Du Tran.

I have done quite a lot of work with NSF support, which is also described on this page.


David A. Forsyth, Okan Arikan, Leslie Ikemoto, James O'Brien, Deva Ramanan, Computational Studies of Human Motion: Part 1, Tracking and Motion Synthesis, Foundations and Trends® in Computer Graphics and Vision, Volume 1, Issue 2/3 (255pp), 2006

Labelling using a motion synthesizer

We built a motion synthesizer that could produce a sequence of actions with specified labels. It was natural to couple this synthesizer to a motion analyzer and keep the labels. This produced quite good labellings of sequences.

  1. O. Arikan, D.A. Forsyth, and J. O'Brien, "Motion Synthesis from Annotations," SIGGRAPH 2003, San Diego, CA, Jul. 2003, in ACM Transactions on Graphics, vol. 22:3, 402-408, 2003.
  2. Ramanan, D. and Forsyth, D. A. "Automatic Annotation of Everyday Movements" Neural Info. Proc. Systems (NIPS), Vancouver, Canada, Dec 2003. poster (Longer technical report version)

Labelling using a generative model

The difficulty with the original motion synthesizer approach was that one couldn't label activities one hadn't seen examples of. We built a method to construct models of human activity from motion capture that could be strung together, both across the body and across time. This meant that we could search for motions for which we had no visual example, using a simple query language based around finite state automata.
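
As a rough illustration of what a finite-state query over labelled motion might look like, here is a minimal sketch. It is not the method or the query language of the papers listed in this section. It assumes that each video frame has already been given a single activity label, and that a query is a simple chain of labels in which adjacent labels differ. The names `find_matches`, `frames` and `query`, and the label strings, are all hypothetical.

```python
from typing import List, Sequence, Tuple


def find_matches(frame_labels: Sequence[str],
                 query: Sequence[str]) -> List[Tuple[int, int]]:
    """Find frame ranges accepted by a chain-shaped finite state automaton.

    Each query label is one state with a self-loop: it must hold for one or
    more consecutive frames before the automaton moves to the next state.
    Assumes adjacent query labels are distinct. Returns non-overlapping
    (start, end) ranges with end exclusive.
    """
    if not query:
        return []  # an empty query accepts nothing in this sketch
    matches = []
    n = len(frame_labels)
    start = 0
    while start < n:
        pos = start
        for label in query:
            run_start = pos
            # Self-loop: consume every consecutive frame carrying this label.
            while pos < n and frame_labels[pos] == label:
                pos += 1
            if pos == run_start:
                break  # this state saw no frame, so the query fails here
        else:
            # Every state consumed at least one frame: accept the segment.
            matches.append((start, pos))
            start = pos
            continue
        start += 1
    return matches


# Hypothetical per-frame labels and a query for "walk, then pick up, then carry".
frames = ["stand", "walk", "walk", "pickup", "carry", "carry",
          "stand", "walk", "pickup", "carry"]
query = ["walk", "pickup", "carry"]
print(find_matches(frames, query))  # [(1, 6), (7, 10)]
```

The point of the sketch is only that a query can be written down without any video example of the composite activity, as long as the individual labels can be recognised. A fuller version would compose labels across body parts as well as across time.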

  1. Nazli Ikizler and David Forsyth, "Searching video for complex activities with finite state models," IEEE Conference on Computer Vision and Pattern Recognition, 2007
  2. N. Ikizler and D.A. Forsyth, "Searching for Complex Human Activities with No Visual Examples," Int. J. Computer Vision, (accepted for publication, electronic version available online already), 2008.

Labelling using a discriminative model

We would like to build discriminative models of particular activities, and spot them in video. A natural place to start is word-spotting in sign language.

  1. A. Farhadi and D. A. Forsyth, "Aligning ASL for Statistical Translation Using a Discriminative Word Model," Proc. CVPR 2006, pp. 1471-1476, 2006

Transfer learning

Here we want to build models of activity from one view, and then use them in another viewpoint; or we might want to build models using a dictionary, and find examples in real video.

  1. Ali Farhadi, David Forsyth, Ryan White, "Transfer Learning in Sign Language," IEEE Conference on Computer Vision and Pattern Recognition, 2007