TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting

Rohan Choudhury, Kris M. Kitani, László A. Jeni
Robotics Institute, Carnegie Mellon University

TEMPO accurately and efficiently estimates and tracks poses from multiple camera views.


Existing volumetric methods for 3D human pose estimation are accurate, but computationally expensive and optimized for single time-step prediction. We present TEMPO, an efficient multi-view pose estimation model that learns a robust spatiotemporal representation, improving pose accuracy while also tracking and forecasting human pose. We significantly reduce computation compared to the state of the art by recurrently computing per-person 2D pose features, fusing both spatial and temporal information into a single representation. In doing so, our model is able to use spatiotemporal context to predict more accurate human poses without sacrificing efficiency. We further use this representation to track human poses over time as well as predict future poses. Finally, we demonstrate that our model is able to generalize across datasets without scene-specific fine-tuning. TEMPO achieves 10% better MPJPE with a 33x improvement in FPS compared to TesseTrack on the challenging CMU Panoptic Studio dataset.
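For readers unfamiliar with the accuracy metric quoted above: MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions, so lower is better. The snippet below is a minimal sketch of that definition for a single skeleton. It is not the evaluation code used for TEMPO; the function name `mpjpe`, the three-joint toy skeleton, and the millimetre values are assumptions made for illustration.

```python
import math
from typing import Sequence

Joint = Sequence[float]  # (x, y, z) in one consistent unit, e.g. millimetres


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean per-joint position error: average Euclidean distance per joint."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy skeleton with three joints (hypothetical values, in millimetres).
gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 10.0)]

# Per-joint errors are 5 mm, 0 mm and 10 mm, so the mean is 5 mm.
print(mpjpe(pred, gt))
```

In practice the metric is averaged over every person and frame in the test set, and benchmarks differ in how predicted skeletons are matched to ground-truth people, so numbers are only comparable under the same protocol.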

How does it work?

A brief explanation of TEMPO.

Sample Results

We present some sample video results on YouTube. Click the thumbnails to play the videos.

Sample results for the pizza sequence.

Sample results for the Ian sequence.

Sample results for the haggling sequence.

Sample results for the band sequence.

Related Work


          title={TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting},
          author={Choudhury, Rohan and Kitani, Kris M. and Jeni, Laszlo A.},
          booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},