Abstract: Learning to capture human motion is essential for 3D human pose and shape estimation from monocular video. However, existing methods mainly rely on recurrent or convolutional operations to model such temporal information, which limits their ability to capture non-local context relations of human motion. To address this problem, we propose a motion pose and shape network (MPS-Net) to effectively capture humans in motion and estimate accurate, temporally coherent 3D human pose and shape from a video. Specifically, we first propose a motion continuity attention (MoCA) module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in the sequence, so as to better capture motion continuity dependencies. We then develop a hierarchical attentive feature integration (HAFI) module that effectively combines adjacent past and future feature representations to strengthen temporal correlation and refine the feature representation of the current frame. By coupling the MoCA and HAFI modules, the proposed MPS-Net excels at estimating 3D human pose and shape in video. Though conceptually simple, MPS-Net not only outperforms state-of-the-art methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmark datasets, but also uses fewer network parameters. Video demos can be found at this https URL.
Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video [CVPR 2022]
Our Motion Pose and Shape Network (MPS-Net) effectively captures humans in motion to estimate accurate and temporally coherent 3D human pose and shape from a video.
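For readers who want a rough mental model of the two temporal modules described in the abstract, below is a minimal, hypothetical PyTorch sketch of the underlying ideas: non-local self-attention over per-frame features (the kind of temporal attention MoCA builds on) and soft fusion of adjacent past/future frame features (the kind of integration HAFI performs). This is not the authors' implementation; all class names, layer sizes, and the window size are assumptions, and the code in this repository should be consulted for the actual modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelfAttention(nn.Module):
    # Illustrative non-local self-attention over a sequence of per-frame features.
    # NOT the MoCA module; dimensions are assumptions for the sketch.
    def __init__(self, feat_dim=2048, attn_dim=256):
        super().__init__()
        self.query = nn.Linear(feat_dim, attn_dim)
        self.key = nn.Linear(feat_dim, attn_dim)
        self.value = nn.Linear(feat_dim, feat_dim)

    def forward(self, x):                          # x: (B, T, feat_dim) per-frame features
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, T, T)
        return x + attn @ v                        # residual update of every frame

class NeighborFeatureFusion(nn.Module):
    # Illustrative soft fusion of each frame with its adjacent past/future frames,
    # loosely in the spirit of HAFI. Again an assumption-based sketch, not the released code.
    def __init__(self, feat_dim=2048, window=3):
        super().__init__()
        self.window = window
        self.score = nn.Linear(feat_dim, 1)        # soft weight per neighboring frame

    def forward(self, x):                          # x: (B, T, feat_dim)
        pad = self.window // 2
        xp = F.pad(x, (0, 0, pad, pad))            # pad along the time axis
        windows = xp.unfold(1, self.window, 1)     # (B, T, feat_dim, window)
        windows = windows.permute(0, 1, 3, 2)      # (B, T, window, feat_dim)
        w = torch.softmax(self.score(windows), dim=2)   # attention over the local window
        return (w * windows).sum(dim=2)            # refined per-frame features (B, T, feat_dim)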
Please refer to our arXiv report for further details.
Check our YouTube videos below for a 5-minute video presentation of our work.
Getting Started
MPS-Net has been implemented and tested on Ubuntu 18.04 with python >= 3.7.
Clone the repo:
git clone https://github.com/MPS-Net/MPS-Net_release.git
Installation
Install the requirements using virtualenv:
cd $PWD/MPS-Net_release
source scripts/install_pip.sh
Download the Required Data
You can just run:
source scripts/get_base_data.sh
or
You can download the required data and the pre-trained MPS-Net model from here. Unzip the contents; the data directory structure should follow the hierarchy below.
${ROOT}
|-- data
| |-- base_data
| |-- preprocessed_data
Evaluation
Run the command below to evaluate a pretrained model on the 3DPW test set.
# dataset: 3dpw
python evaluate.py --dataset 3dpw --cfg ./configs/repr_table1_3dpw_model.yaml --gpu 0
You should be able to obtain the output below:
PA-MPJPE: 52.1, MPJPE: 84.3, MPVPE: 99.7, ACC-ERR: 7.4
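For reference, PA-MPJPE is the mean per-joint position error (in mm) after aligning the prediction to the ground truth with a similarity (Procrustes) transform, MPJPE is the same error without the alignment, MPVPE is the per-vertex error on the recovered mesh, and ACC-ERR measures acceleration error between consecutive frames. Below is a minimal NumPy sketch of PA-MPJPE for a single frame, intended only as an illustration; it is not the evaluation code used by evaluate.py.

import numpy as np

def pa_mpjpe(pred, gt):
    # Procrustes-aligned MPJPE for one frame.
    # pred, gt: (J, 3) joint positions in millimeters.
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g                    # center both joint sets
    K = p.T @ g                                      # 3x3 cross-covariance
    U, s, Vt = np.linalg.svd(K)
    Z = np.eye(3)
    Z[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ Z @ U.T                               # optimal rotation
    scale = np.trace(R @ K) / (p ** 2).sum()         # optimal scale
    aligned = scale * p @ R.T + mu_g                 # aligned prediction
    return np.linalg.norm(aligned - gt, axis=1).mean()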
Running the Demo
We have prepared demo code to run MPS-Net on arbitrary videos. To do this, you can just run:
python demo.py --vid_file sample_video.mp4 --gpu 0
sample_video.mp4 demo output:
sample_video2.mp4 demo output:
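To run the demo on your own video, point --vid_file at its path, for example:
python demo.py --vid_file /path/to/your_video.mp4 --gpu 0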
Citation
@inproceedings{WeiLin2022mpsnet,
  title     = {Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video},
  author    = {Wei, Wen-Li and Lin, Jen-Chun and Liu, Tyng-Luh and Liao, Hong-Yuan Mark},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022}
}
License
This project is licensed under the terms of the MIT license.
References
The base code is largely borrowed from the great resources VIBE and TCMR.