Egon Pasztor
CS 294 -- Computer Games
Scribe Notes for Thursday, January 29, 2004
Concluding "Combinatorial Motion Synthesis" from previous class
In the first part of class, we watch several movies corresponding to
the "combinatorial motion synthesis" papers.
Movie -- from Interactive Motion Generation from Examples, by O. Arikan and D. Forsyth.
-- This is the one where the football-player-character learns to follow
a path sketched on the ground plane, and his motion becomes more believable
as the algorithm progresses.
The iterative refinement process produces motions that are
appropriate on short time scales, but the user cannot specify large-scale
properties.
Intermediate position or body-pose constraints can be given, but he
can't be told, for example, to wave during his walk. An example is shown
where the character is told to move from a start to end position, and the
converged solution is a football-crouch-backward-run. He cannot be told
to run, say, backward for only half the distance.
Movie -- from Motion Synthesis from Annotations, by O. Arikan, D. A. Forsyth, and J. F. O'Brien.
-- This is the one where the same football-character's motion capture
data has been annotated with labels like "walk", "run" or "kneel". The
video shows a user scripting a sequence of annotations on a timeline.
There are verbs and adjectives, like a region where it says "run",
and halfway through that region "backward" is requested.
Movie -- from Motion
Graphs by Lucas Kovar, Michael Gleicher, and Frederic Pighin.
-- This shows a stick figure character walking along a path which
spells out the words "Motion Graphs". The concept -- "sticking together
fragments from a large motion-capture sequence" -- is similar to the
preceding.
Movie -- from Snap-Together Motion: Assembling Run-Time Animations by
Michael Gleicher, Hyun Joon Shin, Lucas Kovar, and Andrew Jepsen.
-- This one shows the same stick figure character as above, but constrained
to move along a flower-petal-shaped transition graph. The user uses a video
game controller to make the character perform a sequence corresponding to
one petal-shaped loop.
"Motion synthesis by smoothing" paper discussion
Next, we begin discussing the first of the papers that we were supposed to
have read for class: Motion texture: A two-level statistical model for character motion synthesis by
Y. Li, T. Wang, and H-Y. Shum.
General Overview -- this is an algorithm that begins with
some motion capture data of a distinctive motion, and purports to generate
an infinite stream of "similar" motion for some definition of "similar".
The algorithm begins with a stream of motion capture data.
Using Bayesian machine-learning techniques, the algorithm
attempts to model the
data as the output of a set of linear dynamic models connected by a
Markov transition network. That is, the parameters of the linear dynamic
models and the probabilities of transitions between them are learned
so as to maximize the likelihood that the given (input) motion
could be generated by the model. Or something to that effect.
Then, with this model learned, it can be run indefinitely to
generate more motion data with similar qualities to the original data.
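As this scribe understands the two-level idea, it could be sketched roughly as follows: a Markov chain picks which linear dynamic system (LDS) is active, and the active LDS generates pose frames. All matrices, probabilities, and dimensions below are invented for illustration; they are not the paper's learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy "pose" dimension

# Two invented LDSs: (transition matrix A, initial pose, noise scale)
models = [
    (0.9 * np.eye(dim), np.ones(dim), 0.05),
    (0.8 * np.eye(dim), -np.ones(dim), 0.05),
]

# Invented Markov transition probabilities between the two models
transitions = np.array([[0.7, 0.3],
                        [0.4, 0.6]])

def synthesize(n_blocks, block_len):
    """Generate a motion stream by alternating LDSs via the Markov chain."""
    frames = []
    model_idx = 0
    for _ in range(n_blocks):
        A, x0, sigma = models[model_idx]
        x = x0.copy()
        for _ in range(block_len):
            x = A @ x + sigma * rng.standard_normal(dim)  # x_t = A x_{t-1} + noise
            frames.append(x.copy())
        model_idx = rng.choice(len(models), p=transitions[model_idx])
    return np.array(frames)

motion = synthesize(n_blocks=3, block_len=10)  # 30 frames of toy "motion"
```

Because the chain can run forever, the stream is in principle infinite, which is the appeal of the two-level structure.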
David: The idea is conceptually pleasing. "You want to think
about motion occurring in chunks" .. and you want to stick the chunks
together. The algorithm effectively represents both large-time-scale
and small-time-scale motion in the same model.
However, the paper contains technical errors.
David writes the following equation from the paper on the board.
 
This formula is not generally true, suggesting that the authors
were lax about mathematical rigor. (David suggests that a paper
which handles these ideas with a correct mathematical
treatment would be a successful one.)
The paper claims that their Markov network of linear models is fit
using the EM algorithm -- however, the implementation steps they
describe are not consistent with EM.
The paper takes a motion sequence, and tries to segment it into
blocks, with each block containing data generatable by a different
linear model. David describes this as "moving beads along a line",
where each "bead" represents a transition point from one linear
dynamic model to another. The algorithm moves the beads around until
they converge to an optimum set of positions.
This is not really consistent with the EM algorithm as it is generally
described. However, the class is unable to think of an alternative algorithm
that literally ran EM and made as much sense.
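A toy version of the bead-moving picture, as this scribe understands it: a single boundary (bead) splits a 1-D sequence into two blocks, each fit with its own scalar dynamics x_t = a * x_{t-1}, and the bead slides to wherever the total fitting error is smallest. This is only an illustrative sketch, not the paper's actual procedure.

```python
import numpy as np

def fit_error(seq):
    """Least-squares fit of x_t = a * x_{t-1}; return residual energy."""
    prev, cur = seq[:-1], seq[1:]
    a = prev @ cur / (prev @ prev)
    return float(np.sum((cur - a * prev) ** 2))

def best_bead(seq, margin=3):
    """Slide one bead over all positions; keep the split with least error."""
    errors = {b: fit_error(seq[:b]) + fit_error(seq[b:])
              for b in range(margin, len(seq) - margin)}
    return min(errors, key=errors.get)

# Synthetic sequence with two regimes: decay at rate 0.5, then growth at 1.1
seq = np.concatenate([0.5 ** np.arange(20), 1.1 ** np.arange(20)])
bead = best_bead(seq)  # should land at the regime change, index 20
```

With many beads one would iterate this kind of local adjustment until the segmentation converges, which is the "moving beads along a line" picture.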
We also note that the paper ran on an input sequence of 48800 frames
in total, which were broken into 246 blocks. Thus each block contained
approximately 200 frames, which is barely a few seconds of motion. So each
"linear dynamic model" in this paper is actually capable of generating only
very short blips of motion.
But is this bad? A student asks David if he expects an LDS
should be able to generate long sequences of
motion, and David replies that he actually isn't sure, and has believed both
yes and no at various points in time, and research with Okan has led him to
alter his opinion on this, back and forth, over the years.
David puts up the following equations which describe a Linear Dynamic
Model as a visual aid:

X_t = A X_{t-1} + N_t

Here X_t is the body pose at frame t, A is a learned matrix, and N_t is
a noise term.
The algorithm learns models by minimizing the difference between predicted
poses generated by the model and actual poses in the training data.
David points out that this is good only if the residual doesn't
carry information -- that is, if it looks like white noise. To be rigorous,
the residual should be examined, and it appears this paper didn't do so.
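For concreteness, fitting A by least squares and inspecting the residual might look like the following sketch on synthetic data (the ground-truth A and the noise level here are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, T = 3, 500

# Invented ground truth: a stable A (orthogonal matrix scaled by 0.95)
A_true = 0.95 * np.linalg.qr(rng.standard_normal((dim, dim)))[0]

# Simulate training frames x_t = A_true x_{t-1} + small white noise
X = np.zeros((T, dim))
X[0] = rng.standard_normal(dim)
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + 0.01 * rng.standard_normal(dim)

# Least squares: minimize || X_t - A X_{t-1} ||^2 over all frame pairs
prev, cur = X[:-1], X[1:]
B, *_ = np.linalg.lstsq(prev, cur, rcond=None)  # prev @ B ~ cur
A_hat = B.T

# The residual is what the model failed to predict; for the fit to be
# trustworthy it should look like white noise (near-zero mean, no structure)
residual = cur - prev @ A_hat.T
```

Here the residual really is white by construction; David's point is that on real motion data this check should actually be performed.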
We watch the movie associated with this paper, and the attention
of the class is captured by seeing how the transition points are
connected.
The transition-point-connection problem as this scribe understands it:
If the linear dynamic models could really be counted on to generate streams
of realistic data, there would be no connection issue. One could imagine
a linear-dynamic-model P, started off in some initial state, generating
a sequence of body pose frames. Then, at some point the large-scale transition
network decides to transition to a new linear-dynamic-model Q, to generate
the next few hundred frames. As long as Q is started off in a state that
matches the last frame from P, the transition would be seamless.
However, the paper explains that since each linear-dynamic-model is
trained to generate only a brief blip of motion -- roughly 200 frames,
as computed above -- their linear-dynamic-models tend to peter out over
time: the quality of the motion drops the longer the LDS is run, until
it reaches a "steady state".
The video didn't show this (and in fact the class didn't discuss it)
but I'd imagine it wouldn't look pretty when mapped onto a skeleton.
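The petering-out behavior is easy to see in a toy noise-free LDS: if the eigenvalues of A lie inside the unit circle, the contribution of the initial pose decays geometrically and the "motion" dies away toward a steady state. The matrix and initial pose below are made up for illustration.

```python
import numpy as np

# A stable transition matrix: eigenvalues 0.95 and 0.90, inside the unit circle
A = np.array([[0.95, 0.10],
              [0.00, 0.90]])
x = np.array([10.0, 10.0])  # a distinctive initial pose

norms = []
for _ in range(200):
    x = A @ x                # noise-free part of the dynamics
    norms.append(np.linalg.norm(x))

# The pose energy collapses over the run: early frames are lively,
# late frames have essentially stopped moving.
early, late = norms[0], norms[-1]
```

With the noise term added back in, the output doesn't go to zero but settles into low-energy jitter, which is presumably the unappealing "steady state" the paper describes.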
To correct this problem, the algorithm stores along with each linear
dynamic model a representation of the initial pose that the LDS is
supposed to start on -- the first two frames from the training data, the paper states.
When synthesizing new motion, each LDS always begins in its specific initial pose.
It is therefore essential that the final frame generated before a transition to
a new LDS matches the initial pose of the new LDS.
The motion synthesis blocks are thus constrained by both an initial
and a final state, and the class debates for some time about how this might be
possible.
The class discusses a "toy" situation where an LDS is to be run for
only three frames, but the first and last frames are given as constraints.
The following is drawn on the board:

X_2 = A X_1 + N_2
X_3 = A X_2 + N_3 = A^2 X_1 + A N_2 + N_3
The model has learned and fixed A -- so if both the first and the last
frames are given, then the last equation can be seen as a constraint on the noise
variables N_2 and N_3.
Thus the class consensus is that the noise should be used as a control signal,
with the noise in each frame carefully chosen so that (a) it is as small
as possible while (b) it gradually leads the LDS to the necessary
target final frame.
David suggests that appropriate noise values can be found using
dynamic programming.
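In the two-noise toy problem, this can even be solved in closed form rather than by dynamic programming: stacking the two board equations gives X_3 = A^2 X_1 + A N_2 + N_3, a single linear constraint on the stacked vector [N_2; N_3], and the pseudoinverse picks the minimum-norm noise satisfying it. Everything below (A, the frames, the dimension) is invented for illustration.

```python
import numpy as np

dim = 3
rng = np.random.default_rng(2)
A = 0.9 * np.eye(dim) + 0.05 * rng.standard_normal((dim, dim))  # invented dynamics

X1 = rng.standard_normal(dim)   # given first frame
X3 = rng.standard_normal(dim)   # given (target) final frame

# Constraint:  [A | I] @ [N2; N3] = X3 - A^2 X1
M = np.hstack([A, np.eye(dim)])
b = X3 - A @ A @ X1
noise = np.linalg.pinv(M) @ b   # minimum-norm noise satisfying the constraint
N2, N3 = noise[:dim], noise[dim:]

X2 = A @ X1 + N2
assert np.allclose(A @ X2 + N3, X3)  # the controlled run hits the target exactly
```

Over many frames the same constraint involves many noise variables, and dynamic programming (as David suggests) is a natural way to spread the correction across them.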
It is suggested that getting the noise to be "as small as possible" might
defeat the purpose of getting variety in the motion. Perhaps it's better
to aim for a specific non-zero amount of noise?
However, the paper doesn't take this approach. Instead, the noise is
chosen randomly and fixed. In this formulation, the two iterations of the
LDS in our toy problem can be seen as two conflicting equations for X_2.
Thus, rather than an under-constrained system for the noise, we have an over-constrained
system for the intermediate frames, and the paper describes a least-squares method to
select the intermediate frames that are actually shown.
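As this scribe understands the paper's formulation, with the noise drawn randomly and fixed, the two equations X_2 = A X_1 + N_2 and A X_2 = X_3 - N_3 generally conflict, and X_2 is chosen in the least-squares sense. A sketch with invented values:

```python
import numpy as np

dim = 3
rng = np.random.default_rng(3)
A = 0.9 * np.eye(dim) + 0.05 * rng.standard_normal((dim, dim))  # invented dynamics

X1, X3 = rng.standard_normal(dim), rng.standard_normal(dim)     # fixed endpoints
N2, N3 = 0.1 * rng.standard_normal(dim), 0.1 * rng.standard_normal(dim)  # fixed noise

# Stack both (conflicting) equations:  [I; A] X_2 ~ [A X_1 + N_2; X_3 - N_3]
lhs = np.vstack([np.eye(dim), A])
rhs = np.concatenate([A @ X1 + N2, X3 - N3])
X2, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)  # least-squares middle frame
```

The trade-off versus the control-signal approach is that neither equation is satisfied exactly, so the displayed frames deviate slightly from what the LDS would actually generate.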
Class consensus is that it would be better to use the noise as a control signal;
however, there is some debate.
With time running out, we watch the movie for the paper Motion Capture
Assisted Animation: Texturing and Synthesis by K. Pullen and C. Bregler.
In this video, the user sets up a series of keyframes which are
matched to specific frames in the motion capture data. The user
keyframes only some joints, and the computer searches through
the motion capture data for fragments where the joints do the appropriate thing.
We do not really have time to discuss it, and likely discussion will be moved to the
following class.