
will be best to observe the surgeon’s next manipulation, and
how to position itself to be out of the way at critical future
moments.
Recently, we have seen notable progress in the task of future 3d human motion prediction – from an initial observation of a person, forecasting the 3d behavior of that person up to ≈ 1 second in the future [10, 17, 21–23]. Various methods have been developed, leveraging RNNs [10, 12, 17, 23], graph convolutional neural networks [20, 22], and attention [21, 28]. However, these methods all forecast future 3d human poses temporally, predicting poses at fixed time intervals that imitate the fixed frame rate of camera capture. This makes it difficult to predict longer-term (several seconds) behavior, which requires predicting both the time-based speed of movement as well as the higher-level goal of the future action.
Thus, we propose to decouple the temporal and inten-
tional behavior, and introduce a new task of forecasting
characteristic 3d poses of a person’s future action: from
a short pose sequence observation of a human, the goal is to
predict a future pose of the person in a characteristic, action-
defining moment. This has many potential applications,
including human-robot interaction (HRI), surveillance, visualization, simulation, and
content creation. It could be used to predict the hand-off
point when a robot is passing an object to a person; to de-
tect and display future poses worthy of alerts in a safety
monitoring system; to coordinate grasps when assisting a
person lifting a heavy object; to assist tracking through oc-
clusions; or to predict future keyframes, as is done in video
generation [18, 25].
Fig. 2 visualizes the difference between this new task and
the traditional, time-based approach: our task is to predict
a next characteristic pose at action-defining moments (blue
dots) rather than at fixed time-intervals (red dots). As shown
in Fig. 1, the characteristic 3d poses are more semantically
meaningful and rarely occur at exactly the same times in the
future. We believe that predicting possible future characteristic 3d poses takes an important step towards forecasting human action, by understanding the objectives underlying a future action or movement separately from the speed at which it occurs.
[Figure 2: two plots of joint location over time. Left, "Characteristic Poses": a continuous movement curve with characteristic poses c_0, c_1, c_2, c_3 marked at the moments "pick up", "drink", "put down", and "step back". Right, "Poses at fixed time steps": the same curve with poses x_0, ..., x_9 marked at regular intervals.]
Figure 2. These plots show the salient difference between our new task (left) and the traditional one (right). The orange curve depicts the motion of one joint (e.g., hand position as a person drinks from a glass). It represents a typical piecewise continuous motion, which has discrete action-defining characteristic poses at cusps of the motion curves (e.g., grasping the glass on the table, putting it to one's mouth, etc.) separating smooth trajectories connecting them (e.g., raising or lowering the glass). Our task is to predict future characteristic poses (blue dots on left) rather than in-between poses at regular time intervals (red points on right).
Since future characteristic 3d poses often occur at
longer-term intervals (> 1s) in the future, there may be mul-
tiple likely modes of the characteristic poses, and we must
capture this multi-modality in our forecasting. Rather than forecasting deterministically, as many 3d human pose forecasting approaches do [20–22], we develop an
attention-driven prediction of probability heatmaps repre-
senting the likelihood of each human pose joint in its future
location. This enables generation of multiple, diverse hy-
potheses for the future pose. To generate a coherent pose
prediction across all pose joints’ potentially multi-modal fu-
tures, we make autoregressive predictions for the end effec-
tors of the actions (e.g., predicting the right hand, then the
left hand conditioned on the predicted right hand location)
– this enables a tractable modeling of the joint distribution
of the human pose joints.
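The autoregressive, heatmap-based sampling described above can be sketched as follows. This is a minimal illustration, not the paper's actual network: the two predictor functions are uniform and distance-based placeholders standing in for the attention-driven model, and the voxel grid resolution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_heatmap(heatmap, grid):
    """Sample a 3d joint location from a discretized probability heatmap.

    heatmap: (N,) probabilities over N voxel centers, summing to 1.
    grid: (N, 3) voxel center coordinates.
    """
    idx = rng.choice(len(heatmap), p=heatmap)
    return grid[idx]

# Placeholder predictors standing in for the attention-driven network:
# each returns a heatmap over the voxel grid, later ones conditioned on
# already-sampled joint locations (the autoregressive factorization).
def predict_right_hand(observation, grid):
    # uniform heatmap, purely illustrative
    return np.full(len(grid), 1.0 / len(grid))

def predict_left_hand(observation, right_hand, grid):
    # favor voxels near the sampled right hand, purely illustrative
    dist = np.linalg.norm(grid - right_hand, axis=1)
    weights = np.exp(-dist)
    return weights / weights.sum()

# Coarse 3d voxel grid over a 1 m cube (illustrative resolution).
axis = np.linspace(0.0, 1.0, 5)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)

observation = None  # stands in for the observed pose sequence
right = sample_from_heatmap(predict_right_hand(observation, grid), grid)
left = sample_from_heatmap(predict_left_hand(observation, right, grid), grid)
```

Sampling the second heatmap conditioned on the first draw is what keeps a multi-modal prediction coherent: if the right hand lands in one mode, the left-hand distribution shifts accordingly rather than averaging over all modes.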
To demonstrate our proposed approach, we introduce a
new benchmark on characteristic 3d pose prediction. We
annotate characteristic keyframes in sequences from the
GRAB [27] and Human3.6M [15] datasets. Experiments on
this benchmark show that our probabilistic approach outperforms the time-based state of the art by 26% on average.
In summary, we present the following contributions:
• We propose the task of forecasting characteristic 3d
poses: predicting likely next action-defining future
moments from a sequence observation of a person, to-
wards goal-oriented understanding of pose forecasting.
• We introduce an attention-driven, probabilistic ap-
proach to tackle this problem and model the most
likely modes for the next characteristic pose, and show
that it outperforms state of the art.
• We autoregressively model the multi-modal distribu-
tion of future pose joint locations, casting pose predic-
tion as a product of conditional distributions of end ef-
fector locations (e.g., hands), and the rest of the body.
• We introduce a dataset and benchmark for characteristic 3d pose prediction, comprising 1535 annotated characteristic pose frames from the GRAB [27] and Human3.6M [15] datasets.
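The product-of-conditionals factorization in the third contribution can be written out explicitly. The symbols below are our own shorthand (rh/lh for the right/left hand end effectors, b for the remaining body joints, $\mathcal{O}$ for the observed pose sequence), not necessarily the paper's notation:

```latex
p(x_{\mathrm{rh}}, x_{\mathrm{lh}}, x_{\mathrm{b}} \mid \mathcal{O})
  = p(x_{\mathrm{rh}} \mid \mathcal{O})\,
    p(x_{\mathrm{lh}} \mid x_{\mathrm{rh}}, \mathcal{O})\,
    p(x_{\mathrm{b}} \mid x_{\mathrm{rh}}, x_{\mathrm{lh}}, \mathcal{O})
```

Sampling each factor in turn yields a single coherent pose even when the individual joint distributions are multi-modal.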
2. Related Work
Deterministic Human Motion Forecasting. Many
works have focused on human motion forecasting, cast as
a sequential task to predict a sequence of human poses ac-
cording to the fixed frame rate capture of a camera. For this
sequential task, recurrent neural networks have been widely
used for human motion forecasting [1, 7, 10, 11, 17, 23, 31].
Such approaches have achieved impressive success in