
will be best to observe the surgeon’s next manipulation, and
how to position itself to be out of the way at critical future
moments.
Recently, we have seen notable progress in the task of future 3d human motion prediction – from an initial observation of a person, forecasting the 3d behavior of that person up to ≈ 1 second in the future [10, 17, 21–23]. Various methods have been developed, leveraging RNNs [10, 12, 17, 23], graph convolutional neural networks [20, 22], and attention [21, 28]. However, these methods all forecast future 3d human poses temporally, predicting poses at fixed time intervals that imitate the fixed frame rate of camera capture. This makes it difficult to predict longer-term (several seconds) behavior, which requires predicting both the time-based speed of movement as well as the higher-level goal of the future action.
Thus, we propose to decouple the temporal and inten-
tional behavior, and introduce a new task of forecasting
characteristic 3d poses of a person’s future action: from
a short pose sequence observation of a human, the goal is to
predict a future pose of the person in a characteristic, action-
defining moment. This has many potential applications,
including human-robot interaction (HRI), surveillance, visualization, simulation, and
content creation. It could be used to predict the hand-off
point when a robot is passing an object to a person; to de-
tect and display future poses worthy of alerts in a safety
monitoring system; to coordinate grasps when assisting a
person lifting a heavy object; to assist tracking through oc-
clusions; or to predict future keyframes, as is done in video
generation [18, 25].
Fig. 2 visualizes the difference between this new task and
the traditional, time-based approach: our task is to predict
a next characteristic pose at action-defining moments (blue
dots) rather than at fixed time-intervals (red dots). As shown
in Fig. 1, the characteristic 3d poses are more semantically
meaningful and rarely occur at exactly the same times in the
future. We believe that predicting possible future characteristic 3d poses takes an important step towards forecasting human action, by understanding the objectives underlying a future action or movement separately from the speed at which it occurs.
[Figure 2: two plots of joint location over time. Left, "Characteristic Poses": a continuous movement curve with characteristic poses c_0, c_1, c_2, c_3 marked at the moments "pick up", "drink", "put down", and "step back". Right, "Poses at fixed time steps": the same curve with poses x_0, ..., x_9 marked at regular intervals.]
Figure 2. These plots show the salient difference between our new task (left) and the traditional one (right). The orange curve depicts the motion of one joint (e.g., hand position as a person drinks from a glass). It represents a typical piecewise continuous motion, which has discrete action-defining characteristic poses at cusps of the motion curves (e.g., grasping the glass on the table, putting it to one's mouth, etc.) separating smooth trajectories connecting them (e.g., raising or lowering the glass). Our task is to predict future characteristic poses (blue dots on left) rather than in-between poses at regular time intervals (red points on right).
Since future characteristic 3d poses often occur at
longer-term intervals (> 1s) in the future, there may be mul-
tiple likely modes of the characteristic poses, and we must
capture this multi-modality in our forecasting. Rather than forecasting deterministically, as many 3d human pose forecasting approaches do [20–22], we develop an
attention-driven prediction of probability heatmaps repre-
senting the likelihood of each human pose joint in its future
location. This enables generation of multiple, diverse hy-
potheses for the future pose. To generate a coherent pose
prediction across all pose joints’ potentially multi-modal fu-
tures, we make autoregressive predictions for the end effec-
tors of the actions (e.g., predicting the right hand, then the
left hand conditioned on the predicted right hand location)
– this enables a tractable modeling of the joint distribution
of the human pose joints.
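The autoregressive, heatmap-based sampling described above can be sketched as follows. This is a minimal illustration, not the paper's actual network: the two predictor functions are uniform and distance-based placeholders standing in for the attention-driven model, and the voxel grid resolution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_heatmap(heatmap, grid):
    """Sample a 3d joint location from a discretized probability heatmap.

    heatmap: (N,) probabilities over N voxel centers, summing to 1.
    grid: (N, 3) voxel center coordinates.
    """
    idx = rng.choice(len(heatmap), p=heatmap)
    return grid[idx]

# Placeholder predictors standing in for the attention-driven network:
# each returns a heatmap over the voxel grid, later ones conditioned on
# already-sampled joint locations (the autoregressive factorization).
def predict_right_hand(observation, grid):
    # uniform heatmap, purely illustrative
    return np.full(len(grid), 1.0 / len(grid))

def predict_left_hand(observation, right_hand, grid):
    # favor voxels near the sampled right hand, purely illustrative
    dist = np.linalg.norm(grid - right_hand, axis=1)
    weights = np.exp(-dist)
    return weights / weights.sum()

# Coarse 3d voxel grid over a 1 m cube (illustrative resolution).
axis = np.linspace(0.0, 1.0, 5)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)

observation = None  # stands in for the observed pose sequence
right = sample_from_heatmap(predict_right_hand(observation, grid), grid)
left = sample_from_heatmap(predict_left_hand(observation, right, grid), grid)
```

Sampling the second heatmap conditioned on the first draw is what keeps a multi-modal prediction coherent: if the right hand lands in one mode, the left-hand distribution shifts accordingly rather than averaging over all modes.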
To demonstrate our proposed approach, we introduce a
new benchmark on characteristic 3d pose prediction. We
annotate characteristic keyframes in sequences from the
GRAB [27] and Human3.6M [15] datasets. Experiments on
this benchmark show that our probabilistic approach outperforms the time-based state of the art by 26% on average.
In summary, we present the following contributions:
• We propose the task of forecasting characteristic 3d
poses: predicting likely next action-defining future
moments from a sequence observation of a person, to-
wards goal-oriented understanding of pose forecasting.
• We introduce an attention-driven, probabilistic ap-
proach to tackle this problem and model the most
likely modes for the next characteristic pose, and show
that it outperforms state of the art.
• We autoregressively model the multi-modal distribu-
tion of future pose joint locations, casting pose predic-
tion as a product of conditional distributions of end ef-
fector locations (e.g., hands), and the rest of the body.
• We introduce a dataset and benchmark for characteristic 3d pose prediction, comprising 1535 annotated characteristic pose frames from the GRAB [27] and Human3.6M [15] datasets.
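The product-of-conditionals factorization in the third contribution can be written out explicitly. The symbols below are our own shorthand (rh/lh for the right/left hand end effectors, b for the remaining body joints, $\mathcal{O}$ for the observed pose sequence), not necessarily the paper's notation:

```latex
p(x_{\mathrm{rh}}, x_{\mathrm{lh}}, x_{\mathrm{b}} \mid \mathcal{O})
  = p(x_{\mathrm{rh}} \mid \mathcal{O})\,
    p(x_{\mathrm{lh}} \mid x_{\mathrm{rh}}, \mathcal{O})\,
    p(x_{\mathrm{b}} \mid x_{\mathrm{rh}}, x_{\mathrm{lh}}, \mathcal{O})
```

Sampling each factor in turn yields a single coherent pose even when the individual joint distributions are multi-modal.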
2. Related Work
Deterministic Human Motion Forecasting. Many
works have focused on human motion forecasting, cast as
a sequential task to predict a sequence of human poses ac-
cording to the fixed frame rate capture of a camera. For this
sequential task, recurrent neural networks have been widely
used for human motion forecasting [1, 7, 10, 11, 17, 23, 31].
Such approaches have achieved impressive success in