🤖 AI Summary
Human3R is a unified, feed-forward system for online 4D human–scene reconstruction from casually captured monocular video that jointly recovers global multi-person SMPL-X bodies, dense 3D scene geometry, and camera trajectories in a single forward pass. Built on the CUT3R backbone and using parameter-efficient visual prompt tuning, Human3R encodes each frame into image tokens with patch-level detection; detected head tokens are concatenated with Multi‑HMR/ViT‑DINO human-prior tokens to form human prompts that act as discriminative ID queries. Those prompts self-attend to image tokens and cross-attend to a scene state to produce temporally consistent human tokens in the world frame. Importantly, only human-related layers are fine-tuned while CUT3R parameters remain frozen, yielding a compact model that, after one day of training on the synthetic BEDLAM dataset on a single GPU, runs in real time (~15 FPS) with an 8 GB memory footprint.
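The prompt-as-query mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration (numpy instead of a deep-learning framework, single-head attention, made-up token counts and dimensions), not the paper's actual implementation: per-person human prompts first attend over the frame's image tokens, then cross-attend to the persistent scene state to yield one temporally consistent human token per detected person.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: queries q read from key/value memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 64                                      # token dimension (hypothetical)
img_tokens    = rng.normal(size=(196, d))   # patch-level image tokens, one frame
human_prompts = rng.normal(size=(2, d))     # per-person prompts (head token fused
                                            # with a human-prior token)
scene_state   = rng.normal(size=(32, d))    # persistent scene-state tokens

# Step 1: human prompts attend over the frame's image tokens.
h = attention(human_prompts, img_tokens, img_tokens)

# Step 2: cross-attend to the scene state for world-frame consistency.
human_tokens = h + attention(h, scene_state, scene_state)  # residual connection

print(human_tokens.shape)  # one output token per detected human
```

In the full model these attention layers are the only human-related components trained; the frozen CUT3R backbone supplies the image tokens and scene state.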
Significance for the AI/ML community lies in eliminating heavy multi-stage pipelines (SLAM, depth pre-processing, iterative contact refinement, separate detection) and delivering a single-stage, efficient baseline that is competitive with or state-of-the-art against prior work on global motion estimation, local mesh recovery, video depth, and camera pose tasks. Practical implications include easier deployment for AR/VR, telepresence, and robotics. Limitations remain, however: human-scene penetrations and coarse interaction modeling persist, and could benefit from downstream contact-aware optimization or more expressive interaction modules.