Forecasting Characteristic 3D Poses of Human Actions
Christian Diller¹   Thomas Funkhouser²   Angela Dai¹
¹Technical University of Munich   ²Google
[Figure 1 image: input pose sequences and target characteristic poses for the actions “Drink”, “Play”, and “Pass”, shown on a 0–5.5 s timeline.]
Figure 1. For a real-world 3d skeleton sequence of a human performing an action, we propose to forecast the semantically meaningful
characteristic 3d pose, representing the action goal for this sequence. As input, we take a short observation of a sequence of consecu-
tive poses leading up to the target characteristic pose. Thus, we propose to take a goal-oriented approach, predicting the key moments
characterizing future behavior, instead of predicting continuous motion, which can occur at varying speeds with predictions more easily
diverging for longer-term (>1s) predictions. We develop an attention-driven probabilistic approach to capture the most likely modes of
possible future characteristic poses.
Abstract
We propose the task of forecasting characteristic 3d
poses: from a short sequence observation of a person,
predict a future 3d pose of that person in a likely action-
defining, characteristic pose: for instance, from observing
a person picking up an apple, predict the pose of the per-
son eating the apple. Prior work on human motion predic-
tion estimates future poses at fixed time intervals. Although
easy to define, this frame-by-frame formulation confounds
temporal and intentional aspects of human action. Instead,
we define a semantically meaningful pose prediction task
that decouples the predicted pose from time, taking inspira-
tion from goal-directed behavior. To predict characteristic
poses, we propose a probabilistic approach that models the
possible multi-modality in the distribution of likely char-
acteristic poses. We then sample future pose hypotheses
from the predicted distribution in an autoregressive fash-
ion to model dependencies between joints. To evaluate our
method, we construct a dataset of manually annotated char-
acteristic 3d poses. Our experiments with this dataset sug-
gest that our proposed probabilistic approach outperforms
state-of-the-art methods by 26% on average.
1. Introduction
Future human pose forecasting is fundamental towards a
comprehensive understanding of human behavior, and con-
sequently towards achieving higher-level perception in ma-
chine interactions with humans, such as autonomous robots
or vehicles. In fact, prediction is considered to play a foun-
dational part in intelligence [3, 9, 13]. In particular, predict-
ing the 3d pose of a human in the future lays a basis for both
structural and semantic understanding of human behavior,
and for an agent to take fine-grained anticipatory action to-
wards the forecasted future. For example, a robotic surgical
assistant should predict in advance where best to place a tool
to assist the surgeon’s next action, what sensor viewpoints
will be best to observe the surgeon’s next manipulation, and
how to position itself to be out of the way at critical future
moments.
Recently, we have seen notable progress in the task of fu-
ture 3d human motion prediction from an initial observa-
tion of a person, forecasting the 3d behavior of that person
up to 1 second in the future [10, 17, 21–23]. Various meth-
ods have been developed, leveraging RNNs [10, 12, 17,23],
graph convolutional neural networks [20, 22], and atten-
tion [21, 28]. However, these approaches all take a tem-
poral approach towards forecasting future 3d human poses,
and predict poses at fixed time intervals to imitate the fixed
frame rate of camera capture. This makes it difficult to pre-
dict longer-term (several seconds) behavior, which requires
predicting both the time-based speed of movement as well
as the higher-level goal of the future action.
Thus, we propose to decouple the temporal and inten-
tional behavior, and introduce a new task of forecasting
characteristic 3d poses of a person’s future action: from
a short pose sequence observation of a human, the goal is to
predict a future pose of the person in a characteristic, action-
defining moment. This has many potential applications,
including HRI, surveillance, visualization, simulation, and
content creation. It could be used to predict the hand-off
point when a robot is passing an object to a person; to de-
tect and display future poses worthy of alerts in a safety
monitoring system; to coordinate grasps when assisting a
person lifting a heavy object; to assist tracking through oc-
clusions; or to predict future keyframes, as is done in video
generation [18, 25].
Fig. 2 visualizes the difference between this new task and
the traditional, time-based approach: our task is to predict
a next characteristic pose at action-defining moments (blue
dots) rather than at fixed time-intervals (red dots). As shown
in Fig. 1, the characteristic 3d poses are more semantically
meaningful and rarely occur at exactly the same times in the
future. We believe that predicting possible future character-
istic 3d poses takes an important step towards forecasting
[Figure 2 plots: one joint’s location over time; left, characteristic poses c₀–c₃ at the action-defining moments “pick up”, “drink”, “put down”, and “step back”; right, poses x₀–x₉ at fixed time steps.]
Figure 2. These plots show the salient difference between our
new task (left) and the traditional one (right). The orange curve
depicts the motion of one joint (e.g., hand position as a person
drinks from a glass). It represents a typical piecewise continuous
motion, which has discrete action-defining characteristic poses at
cusps of the motion curves (e.g., grasping the glass on the table,
putting it to one’s mouth, etc.) separating smooth trajectories
connecting them (e.g., raising or lowering the glass). Our task is to
predict future characteristic poses (blue dots on left) rather than
in-between poses at regular time intervals (red dots on right).
human action, by understanding the objectives underlying
a future action or movement separately from the speed at
which they occur.
Since future characteristic 3d poses often occur at
longer-term intervals (> 1s) in the future, there may be mul-
tiple likely modes of the characteristic poses, and we must
capture this multi-modality in our forecasting. Rather than
forecasting deterministically, as many 3d human
pose forecasting approaches do [20–22], we develop an
attention-driven prediction of probability heatmaps repre-
senting the likelihood of each human pose joint in its future
location. This enables generation of multiple, diverse hy-
potheses for the future pose. To generate a coherent pose
prediction across all pose joints’ potentially multi-modal fu-
tures, we make autoregressive predictions for the end effec-
tors of the actions (e.g., predicting the right hand, then the
left hand conditioned on the predicted right hand location);
this enables a tractable modeling of the joint distribution
of the human pose joints.
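Concretely, this autoregressive scheme factorizes the joint pose distribution into a product of conditionals. In a hypothetical notation (our own, for illustration) where y_r and y_l denote the right- and left-hand end effectors, y_b the remaining body joints, and X the observed input sequence, one such ordering reads:

```latex
p(Y \mid X) = p(y_r \mid X)\, \cdot \, p(y_l \mid y_r, X)\, \cdot \, p(y_b \mid y_r, y_l, X)
```

Sampling each factor in turn then yields a pose whose joints are mutually consistent, even when the individual per-joint distributions are multi-modal.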
To demonstrate our proposed approach, we introduce a
new benchmark on characteristic 3d pose prediction. We
annotate characteristic keyframes in sequences from the
GRAB [27] and Human3.6M [15] datasets. Experiments on
this benchmark show that our probabilistic approach outper-
forms time-based state of the art by 26% on average.
In summary, we present the following contributions:
• We propose the task of forecasting characteristic 3d poses: predicting likely next action-defining future moments from a sequence observation of a person, towards goal-oriented understanding of pose forecasting.
• We introduce an attention-driven, probabilistic approach to tackle this problem and model the most likely modes for the next characteristic pose, and show that it outperforms the state of the art.
• We autoregressively model the multi-modal distribution of future pose joint locations, casting pose prediction as a product of conditional distributions of end effector locations (e.g., hands) and the rest of the body.
• We introduce a dataset and benchmark for characteristic 3d pose prediction, comprising 1535 annotated characteristic pose frames from the GRAB [27] and Human3.6M [15] datasets.
2. Related Work
Deterministic Human Motion Forecasting. Many
works have focused on human motion forecasting, cast as
a sequential task to predict a sequence of human poses ac-
cording to the fixed frame rate capture of a camera. For this
sequential task, recurrent neural networks have been widely
used for human motion forecasting [1, 7, 10, 11, 17, 23, 31].
Such approaches have achieved impressive success in
[Figure 3 diagram: input pose sequence + previous joint predictions (if any) → encoder with attention → multi-modal heatmap and per-voxel offsets → heatmap sampling → k skeleton samples.]
Figure 3. Overview of our approach for characteristic 3d pose prediction. From an input observed pose sequence, as well as any prior
joint predictions, we leverage attention to learn inter-joint dependencies, and decode a 3d volumetric heatmap representing the probability
distribution for the next joint to be predicted, as well as a per-voxel offset field of the same size for improved joint placement. This enables
autoregressive sampling to obtain final pose hypotheses characterizing likely characteristic 3d poses.
shorter-term prediction (up to 1s, occasionally several
seconds for longer-term predictions), but the RNN
summarization of history into a fixed-size representation
struggles to maintain the long-term dependencies needed
for forecasting further into the future.
To address some of the drawbacks of RNNs, non-
recurrent models have also been adopted, encoding tempo-
ral history with convolutional or fully connected networks
[5, 19, 22], or attention [21, 28]. Li et al. [34] proposed an
auto-conditioned approach enabling synthesizing pose se-
quences up to 300 seconds of periodic-like motions (walk-
ing, dancing). However, these works all focus on frame-by-
frame synthesis, with benchmark evaluation of up to 1000
milliseconds. Instead of a frame-by-frame synthesis, we
propose a goal-directed task to capture perception of longer-
term human action, which not only lends itself towards fore-
casting more semantically meaningful key moments, but en-
ables a more predictable evaluation: as seen in Fig. 1, there
can be significant ambiguity in the number of pose frames
to predict towards a key or goal pose, making frame-based
evaluation difficult in longer-term forecasting.
Multi-Modal Human Motion Forecasting. While 3d
human motion forecasting has typically been addressed in a
deterministic fashion, several recent works have introduced
multi-modal future pose sequence predictions. These ap-
proaches leverage well-studied approaches for multi-modal
predictions, such as generative adversarial networks [4] and
variational autoencoders [2, 32, 33]. For instance, Aliakbarian
et al. [2] stochastically combine random noise with previous
pose observations, leading to more diverse sequence
predictions. Yuan et al. [33] learn a set of mapping functions
which are then used for sampling from a trained VAE,
yielding greater diversity in the sequence predictions
than simple random sampling. In contrast to these time-
based approaches, we consider goal-oriented prediction of
characteristic poses, and model multi-modality explicitly
as predicted heatmaps for body joints in an autoregressive
fashion to capture inter-joint dependencies.
Goal-oriented Forecasting. While a time-based, frame-
by-frame prediction is the predominant approach towards
future forecasting tasks, several works have proposed to
tackle goal-oriented forecasting. Recently, Jayaraman et
al. [18] proposed to predict “predictable” future video
frames in a time-agnostic fashion, and represent the predic-
tions as subgoals for robotic tasks. Pertsch et al. [25] pre-
dict future keyframes representing a future video sequence
of events. Cao et al. [6] plan human trajectories from an
image and 2d pose history, first predicting 2d goal locations
for a person to walk to in order to synthesize the path. In-
spired by such goal-based abstractions, we aim to represent
3d human actions by their key, characteristic poses.
3. Method Overview
Given a sequence of N 3d pose observations X_{1:N} = [x_1, x_2, ..., x_N] of a person, our aim is to estimate a characteristic 3d pose of that person, characterizing the intent of the person’s future action. We take J joint locations (represented as their 3d coordinates) for each pose of the input sequence, i.e., x_i ∈ ℝ^{J×3}. From this input sequence, we predict a joint distribution of J probability heatmaps H_j and finally sample K output pose hypotheses Y_{1:K}, characterized by their J 3d joints: y_i ∈ ℝ^{J×3}. By representing
probability heatmaps for the joint predictions, we can cap-
ture multiple different modes in likely characteristic poses,
enabling more diverse future pose prediction. To the best
of our knowledge, we are the first to use volumetric heatmaps
for future human pose forecasting; previous work used
them for the more deterministic task of pose estimation
from multiple images [16, 29].
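As a concrete illustration of sampling from such a volumetric heatmap, the following sketch draws a continuous 3d joint location from a grid of predicted scores plus per-voxel offsets. The function name, array shapes, and the softmax-then-sample decoding are assumptions for illustration, not the paper’s exact implementation:

```python
import numpy as np

def sample_joint_location(logits, offsets, voxel_size=1.0, rng=None):
    """Sample a continuous 3d joint location from a volumetric heatmap.

    logits:  (D, H, W) unnormalized scores over a 3d grid (hypothetical shape).
    offsets: (D, H, W, 3) per-voxel continuous offsets, in voxel units.
    Returns the sampled voxel coordinate plus its offset, scaled by voxel_size.
    """
    rng = rng or np.random.default_rng()
    # Softmax over the flattened grid gives a categorical distribution over voxels.
    flat = logits.reshape(-1)
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()
    # Sample a voxel index, then recover its 3d grid coordinate.
    idx = rng.choice(flat.size, p=probs)
    d, h, w = np.unravel_index(idx, logits.shape)
    # Refine the discrete voxel choice with the predicted continuous offset.
    return (np.array([d, h, w], dtype=float) + offsets[d, h, w]) * voxel_size
```

Because the voxel is drawn from a categorical distribution rather than taken as the argmax, repeated calls can land in different modes of a multi-modal heatmap, which is what enables diverse pose hypotheses.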
From the input sequence, we develop a neural network
architecture to predict a probability heatmap over a volu-
metric 3d grid for each joint, corresponding to likely future
positions of that joint. This enables effective modeling of
multi-modality, but remains tied to a discrete grid, so we