Egon Pasztor
CS 294 -- Computer Games
Scribe Notes for Thursday, January 29, 2004
Concluding "Combinatorial Motion Synthesis" from previous class
In the first part of class, we watch several movies corresponding to
the "combinatorial motion synthesis" papers.
Movie -- from Interactive Motion Generation from Examples, by O. Arikan and D. Forsyth.
-- This is the one where the football-player-character learns to follow
a path sketched on the ground plane, and his motion becomes more believable
as the algorithm progresses.
The iterative refinement process produces motions that are
appropriate on short time scales, but the user cannot specify large-scale
properties.
Intermediate position or body-pose constraints can be given, but he
can't be told, for example, to wave during his walk. An example is shown
where the character is told to move from a start to end position, and the
converged solution is a football-crouch-backward-run. He cannot be told
to run, say, backward for only half the distance.
Movie -- from Motion Synthesis from Annotations, by O. Arikan, D. A. Forsyth, and J. F. O'Brien.
-- This is the one where the same football-character's motion capture
data has been annotated with labels like "walk", "run" or "kneel". The
video shows a user scripting a sequence of annotations on a timeline.
There are verbs and adjectives, like a region where it says "run",
and halfway through that region "backward" is requested.
Movie -- from Motion
Graphs by Lucas Kovar, Michael Gleicher, and Frederic Pighin.
-- This shows a stick figure character walking along a path which
spells out the words "Motion Graphs". The concept -- "sticking together
fragments from a large motion-capture sequence" -- is similar to the
preceding.
Movie -- from Snap-Together Motion: Assembling Run-Time Animations by
Michael Gleicher, Hyun Joon Shin, Lucas Kovar, and Andrew Jepsen.
-- This one shows the same stick figure character as above, but constrained
to move along a flower-petal-shaped transition graph. The user uses a video
game controller to make the character perform a sequence corresponding to
one petal-shaped loop.
"Motion synthesis by smoothing" paper discussion
Next, we begin discussing the first of the papers that we were supposed to
have read for class: Motion texture: A two-level statistical model for character motion synthesis by
Y. Li, T. Wang, and H-Y. Shum.
General Overview -- this is an algorithm that begins with
some motion capture data of a distinctive motion, and purports to generate
an infinite stream of "similar" motion for some definition of "similar".
The algorithm begins with a stream of motion capture data.
Using Bayesian machine-learning techniques, the algorithm
attempts to model the
data as the output of a set of linear dynamic models connected by a
Markov transition network. That is, the parameters of the linear dynamic
models and the probabilities of transitions between them are learned
so as to maximize the likelihood that the given (input) motion
could be generated by the model. Or something to that effect.
Then, with this model learned, it can be run indefinitely to
generate more motion data with similar qualities to the original data.
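As this scribe understands the two-level idea, it could be sketched roughly as follows: a Markov chain picks which linear dynamic system (LDS) is active, and the active LDS generates pose frames. All matrices, probabilities, and dimensions below are invented for illustration; they are not the paper's learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy "pose" dimension

# Two invented LDSs: (transition matrix A, initial pose, noise scale)
models = [
    (0.9 * np.eye(dim), np.ones(dim), 0.05),
    (0.8 * np.eye(dim), -np.ones(dim), 0.05),
]

# Invented Markov transition probabilities between the two models
transitions = np.array([[0.7, 0.3],
                        [0.4, 0.6]])

def synthesize(n_blocks, block_len):
    """Generate a motion stream by alternating LDSs via the Markov chain."""
    frames = []
    model_idx = 0
    for _ in range(n_blocks):
        A, x0, sigma = models[model_idx]
        x = x0.copy()
        for _ in range(block_len):
            x = A @ x + sigma * rng.standard_normal(dim)  # x_t = A x_{t-1} + noise
            frames.append(x.copy())
        model_idx = rng.choice(len(models), p=transitions[model_idx])
    return np.array(frames)

motion = synthesize(n_blocks=3, block_len=10)  # 30 frames of toy "motion"
```

Because the chain can run forever, the stream is in principle infinite, which is the appeal of the two-level structure.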
David: The idea is conceptually pleasing. "You want to think
about motion occurring in chunks" .. and you want to stick the chunks
together. The algorithm effectively represents both large-time-scale
and small-time-scale motion in the same model.
However, the paper contains technical errors.
David writes the following equation from the paper on the board.
 
This formula is not generally true, suggesting that the authors
were lax about mathematical rigor. (David suggests that a paper
which handles these ideas with a correct mathematical
treatment would be a successful one.)
The paper claims that their Markov network of linear models is fit
using the EM algorithm -- however, the implementation steps they
describe are not consistent with EM.
The paper takes a motion sequence, and tries to segment it into
blocks, with each block containing data generatable by a different
linear model. David describes this as "moving beads along a line",
where each "bead" represents a transition point from one linear
dynamic model to another. The algorithm moves the beads around until
they converge to an optimum set of positions.
This is not really consistent with the EM algorithm as it is generally
described. However, the class is unable to think of an alternative algorithm
that literally ran EM and made as much sense.
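A toy version of the bead-moving picture, as this scribe understands it: a single boundary (bead) splits a 1-D sequence into two blocks, each fit with its own scalar dynamics x_t = a * x_{t-1}, and the bead slides to wherever the total fitting error is smallest. This is only an illustrative sketch, not the paper's actual procedure.

```python
import numpy as np

def fit_error(seq):
    """Least-squares fit of x_t = a * x_{t-1}; return residual energy."""
    prev, cur = seq[:-1], seq[1:]
    a = prev @ cur / (prev @ prev)
    return float(np.sum((cur - a * prev) ** 2))

def best_bead(seq, margin=3):
    """Slide one bead over all positions; keep the split with least error."""
    errors = {b: fit_error(seq[:b]) + fit_error(seq[b:])
              for b in range(margin, len(seq) - margin)}
    return min(errors, key=errors.get)

# Synthetic sequence with two regimes: decay at rate 0.5, then growth at 1.1
seq = np.concatenate([0.5 ** np.arange(20), 1.1 ** np.arange(20)])
bead = best_bead(seq)  # should land at the regime change, index 20
```

With many beads one would iterate this kind of local adjustment until the segmentation converges, which is the "moving beads along a line" picture.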
We also note that the paper ran on an input sequence of 48800 frames
in total, which were broken into 246 blocks. Thus each block contained
approximately 200 frames, which is barely a few seconds of motion. So each
"linear dynamic model" in this paper is actually capable of generating only
very short blips of motion.
But is this bad? A student asks David if he expects an LDS
should be able to generate long sequences of
motion, and David replies that he actually isn't sure, and has believed both
yes and no at various points in time, and research with Okan has led him to
alter his opinion on this, back and forth, over the years.
David puts up the following equations which describe a Linear Dynamic
Model as a visual aid:

X_t = A X_{t-1} + N_t

Here X_t is the body pose at frame t, A is a learned matrix, and N_t is
a noise term.
The algorithm learns models by minimizing the difference between predicted
poses generated by the model and actual poses in the training data.
David points out that this is good only if the residual doesn't
carry information -- that is, if it looks like white noise. To be rigorous,
the residual should be examined, and it appears this paper didn't do so.
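For concreteness, fitting A by least squares and inspecting the residual might look like the following sketch on synthetic data (the ground-truth A and the noise level here are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, T = 3, 500

# Invented ground truth: a stable A (orthogonal matrix scaled by 0.95)
A_true = 0.95 * np.linalg.qr(rng.standard_normal((dim, dim)))[0]

# Simulate training frames x_t = A_true x_{t-1} + small white noise
X = np.zeros((T, dim))
X[0] = rng.standard_normal(dim)
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + 0.01 * rng.standard_normal(dim)

# Least squares: minimize || X_t - A X_{t-1} ||^2 over all frame pairs
prev, cur = X[:-1], X[1:]
B, *_ = np.linalg.lstsq(prev, cur, rcond=None)  # prev @ B ~ cur
A_hat = B.T

# The residual is what the model failed to predict; for the fit to be
# trustworthy it should look like white noise (near-zero mean, no structure)
residual = cur - prev @ A_hat.T
```

Here the residual really is white by construction; David's point is that on real motion data this check should actually be performed.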
We watch the movie associated with this paper, and the attention
of the class is captured by seeing how the transition points are
connected.
The transition-point-connection problem as this scribe understands it:
If the linear dynamic models could really be counted on to generate streams
of realistic data, there would be no connection issue. One could imagine
a linear-dynamic-model P, started off in some initial state, generating
a sequence of body pose frames. Then, at some point the large-scale transition
network decides to transition to a new linear-dynamic-model Q, to generate
the next few hundred frames. As long as Q is started off in a state that
matches the last frame from P, the transition would be seamless.
However, the paper explains that since each linear-dynamic-model is
trained to generate only a brief blip of motion -- roughly 200 frames,
as computed above -- their linear-dynamic-models tend to peter out over
time: the quality of the motion drops the longer the LDS is run, until
it reaches a "steady state".
The video didn't show this (and in fact the class didn't discuss it)
but I'd imagine it wouldn't look pretty when mapped onto a skeleton.
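The petering-out behavior is easy to see in a toy noise-free LDS: if the eigenvalues of A lie inside the unit circle, the contribution of the initial pose decays geometrically and the "motion" dies away toward a steady state. The matrix and initial pose below are made up for illustration.

```python
import numpy as np

# A stable transition matrix: eigenvalues 0.95 and 0.90, inside the unit circle
A = np.array([[0.95, 0.10],
              [0.00, 0.90]])
x = np.array([10.0, 10.0])  # a distinctive initial pose

norms = []
for _ in range(200):
    x = A @ x                # noise-free part of the dynamics
    norms.append(np.linalg.norm(x))

# The pose energy collapses over the run: early frames are lively,
# late frames have essentially stopped moving.
early, late = norms[0], norms[-1]
```

With the noise term added back in, the output doesn't go to zero but settles into low-energy jitter, which is presumably the unappealing "steady state" the paper describes.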
To correct this problem, the algorithm stores along with each linear
dynamic model a representation of the initial pose that the LDS is
supposed to start on -- the first two frames from the training data, the paper states.
When synthesizing new motion, each LDS always begins in its specific initial pose.
It is therefore essential that the final frame generated before a transition to
a new LDS matches the initial pose of the new LDS.
The motion synthesis blocks are thus constrained by both an initial
and a final state, and the class debates for some time about how this might be
possible.
The class discusses a "toy" situation where an LDS is to be run for
only three frames, but the first and last frames are given as constraints.
The following is drawn on the board:

X_2 = A X_1 + N_2
X_3 = A X_2 + N_3 = A^2 X_1 + A N_2 + N_3
The model has learned and fixed A -- so if both the first and the last
frames are given, then the last equation can be seen as a constraint on the noise
variables N_2 and N_3.
Thus the class consensus is that the noise should be used as a control signal,
with the noise in each frame carefully chosen so that (a) it is as small
as possible while (b) it gradually leads the LDS to the necessary
target final frame.
David suggests that appropriate noise values can be found using
dynamic programming.
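In the two-noise toy problem, this can even be solved in closed form rather than by dynamic programming: stacking the two board equations gives X_3 = A^2 X_1 + A N_2 + N_3, a single linear constraint on the stacked vector [N_2; N_3], and the pseudoinverse picks the minimum-norm noise satisfying it. Everything below (A, the frames, the dimension) is invented for illustration.

```python
import numpy as np

dim = 3
rng = np.random.default_rng(2)
A = 0.9 * np.eye(dim) + 0.05 * rng.standard_normal((dim, dim))  # invented dynamics

X1 = rng.standard_normal(dim)   # given first frame
X3 = rng.standard_normal(dim)   # given (target) final frame

# Constraint:  [A | I] @ [N2; N3] = X3 - A^2 X1
M = np.hstack([A, np.eye(dim)])
b = X3 - A @ A @ X1
noise = np.linalg.pinv(M) @ b   # minimum-norm noise satisfying the constraint
N2, N3 = noise[:dim], noise[dim:]

X2 = A @ X1 + N2
assert np.allclose(A @ X2 + N3, X3)  # the controlled run hits the target exactly
```

Over many frames the same constraint involves many noise variables, and dynamic programming (as David suggests) is a natural way to spread the correction across them.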
It is suggested that getting the noise to be "as small as possible" might
defeat the purpose of getting variety in the motion. Perhaps it's better
to aim for a specific non-zero amount of noise?
However, the paper doesn't take this approach. Instead, the noise is
chosen randomly and fixed. In this formulation, the two iterations of the
LDS in our toy problem can be seen as two conflicting equations for X_2.
Thus, rather than an under-constrained system for the noise, we have an over-constrained
system for the intermediate frames, and the paper describes a least-squares method to
select the intermediate frames that are actually shown.
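As this scribe understands the paper's formulation, with the noise drawn randomly and fixed, the two equations X_2 = A X_1 + N_2 and A X_2 = X_3 - N_3 generally conflict, and X_2 is chosen in the least-squares sense. A sketch with invented values:

```python
import numpy as np

dim = 3
rng = np.random.default_rng(3)
A = 0.9 * np.eye(dim) + 0.05 * rng.standard_normal((dim, dim))  # invented dynamics

X1, X3 = rng.standard_normal(dim), rng.standard_normal(dim)     # fixed endpoints
N2, N3 = 0.1 * rng.standard_normal(dim), 0.1 * rng.standard_normal(dim)  # fixed noise

# Stack both (conflicting) equations:  [I; A] X_2 ~ [A X_1 + N_2; X_3 - N_3]
lhs = np.vstack([np.eye(dim), A])
rhs = np.concatenate([A @ X1 + N2, X3 - N3])
X2, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)  # least-squares middle frame
```

The trade-off versus the control-signal approach is that neither equation is satisfied exactly, so the displayed frames deviate slightly from what the LDS would actually generate.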
Class consensus is that it would be better to use the noise as a control signal;
however, there is some debate.
With time running out, we watch the movie for the paper Motion Capture
Assisted Animation: Texturing and Synthesis by K. Pullen and C. Bregler.
In this video, the user sets up a series of keyframes which are
matched to specific frames in the motion capture data. The user
keyframes only some joints, and the computer searches through
the motion capture data for fragments where the joints do the appropriate thing.
We do not really have time to discuss it, and likely discussion will be moved to the
following class.