Activity modelling using finite state automata

We have built a method of representing human activities that allows a collection of motions, seen in video, to be queried without any video examples of the activity. We use a simple and effective query language, built on a finite state machine. Our approach is based on units of activity defined at segments of the body, such as the legs or the arms. These units are learned from motion capture data that was labelled with activity tags for other purposes (some four years ago). Our units can be composed across time and across the body to produce complex queries, such as legs-walk-arms-walk then arms-pickup then legs-walk-arms-carry. The presence of search units is inferred automatically by tracking the body, lifting the tracks to 3D, and comparing them to our models. We have obtained interim results for a large range of queries applied to a collection of some 74 video sequences of complex motion and activity; our collection includes multiple actors wearing several different sets of clothing. This work is joint with Nazli Ikizler; it has been written up and appeared as

Nazli Ikizler and David Forsyth, “Searching video for complex activities with finite state models,” IEEE Conference on Computer Vision and Pattern Recognition, 2007.

A journal version is currently in review.
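
The query composition described above (for example, legs-walk-arms-walk then arms-pickup then legs-walk-arms-carry) can be illustrated with a minimal sketch. This is not the published method: it assumes each frame has already been reduced to one hard label per body part, whereas the actual system infers units by tracking, lifting to 3D, and comparing to learned models. All names here (`parse_query`, `accepts`, `Stage`, `Frame`) are hypothetical, and the matching rule (stages must occur in order, with other frames allowed in between) is a simplifying assumption.

```python
from typing import List, Optional, Sequence, Tuple

BODY_PARTS = ("legs", "arms")

# A stage gives the label required for (legs, arms); None means "any label".
Stage = Tuple[Optional[str], Optional[str]]
# A frame gives the observed (legs, arms) labels. Hard labels are an assumption.
Frame = Tuple[str, str]


def parse_query(query: str) -> List[Stage]:
    """Turn 'legs-walk-arms-walk then arms-pickup' into a list of stages."""
    stages: List[Stage] = []
    for part in query.split(" then "):
        tokens = part.strip().split("-")
        if len(tokens) % 2 != 0:
            raise ValueError(f"expected part-action pairs, got {part!r}")
        wanted = dict(zip(tokens[0::2], tokens[1::2]))
        unknown = set(wanted) - set(BODY_PARTS)
        if unknown:
            raise ValueError(f"unknown body part(s): {sorted(unknown)}")
        stages.append((wanted.get("legs"), wanted.get("arms")))
    return stages


def frame_satisfies(stage: Stage, frame: Frame) -> bool:
    """A frame satisfies a stage if every constrained body part matches."""
    return all(want is None or want == got for want, got in zip(stage, frame))


def accepts(stages: Sequence[Stage], frames: Sequence[Frame]) -> bool:
    """Left-to-right finite state machine: the state is the index of the next
    stage still to be observed; the sequence is accepted once all stages have
    been seen in order."""
    state = 0
    for frame in frames:
        if state < len(stages) and frame_satisfies(stages[state], frame):
            state += 1
    return state == len(stages)
```

With this sketch, a query is parsed once and then run against each labelled sequence in the collection; a sequence whose stages occur in the wrong order is rejected because the state never advances past the first unmatched stage.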

Finding: We have compared with discriminative methods applied to tracker data; our method offers significantly improved performance. We have obtained experimental evidence that our method is robust to view direction and is unaffected by changes in clothing.

A set of sample results
Activity modelling using finite state automata: Our representation can give quite accurate results for complex motion queries, regardless of the clothing worn by the subject. Left: The results of ranking for 15 queries over our video collection (using k=40 in k-means). In this image, a colored pixel indicates a relevant video. An ideal search would result in an image where all the colored pixels are on the left of the image. Each color represents a different outfit. Note that the choice of outfit does not affect the performance. We have three types of query here. Type I: single activities where there is a different action for legs and arms (e.g., walk-carry). Type II: two consecutive actions, like a crouch followed by a run. Type III: activities that are more complex, consisting of three consecutive actions where different body parts may be doing different things (e.g., walk-stand-walk for legs; walk-wave-walk for arms). Right: Average precision values for each query. Overall, mean average precision for HMM models using 40 clusters is 0.5636.
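
For readers unfamiliar with the measure reported above, the following is a minimal sketch of average precision for one ranked result list, and of its mean over queries. It uses the standard non-interpolated definition; the source does not state which variant was used, so that choice is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, Sequence


def average_precision(ranked_relevance: Iterable[bool]) -> float:
    """Average of precision-at-rank taken at each rank holding a relevant item.

    ranked_relevance[i] is True if the video at rank i + 1 is relevant.
    Returns 0.0 when the list contains no relevant item.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

An ideal ranking, with every relevant video ahead of every irrelevant one (all colored pixels on the left of the image), scores 1.0 under this definition.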