PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation

cs.CV Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Li Yi, Hao Zhao · Mar 23, 2026
AI Review
Plain-language introduction

Hand-object interaction (HOI) video generation is currently split between pose-only synthesis, static appearance generation, and motion methods requiring ground-truth first frames. This paper introduces PAM, a three-stage Pose–Appearance–Motion engine that generates high-resolution HOI videos from only initial/target poses and object geometry, achieving true sim-to-real transfer. The system combines GraspXL for pose trajectory generation, Flux for appearance synthesis with multimodal ControlNet conditioning, and CogVideoX for motion generation, producing 480×720 videos while improving FVD from 38.83 to 29.13 on DexYCB compared to prior work.
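To put the headline numbers on a common scale, the reported figures can be turned into relative reductions. The sketch below uses only values quoted on this page; the helper name `relative_reduction` is illustrative and not taken from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline


# (baseline, PAM) pairs as reported in the abstract; lower is better for all three.
reported = {
    "FVD, DexYCB (vs. InterDyn)": (38.83, 29.13),
    "MPJPE in mm, DexYCB (vs. CosHand)": (30.05, 19.37),
    "FVD, OAKINK2": (68.76, 46.31),
}

for name, (baseline, ours) in reported.items():
    print(f"{name}: {relative_reduction(baseline, ours):.1%} lower")
```

Under this reading, the FVD gain on DexYCB is about a quarter and the MPJPE and OAKINK2 gains are about a third.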

Critical review
Verdict
Bottom line

PAM delivers state-of-the-art quantitative results on DexYCB and OAKINK2, with particularly strong hand pose accuracy (MPJPE of 19.37 mm vs. 30.05 mm for CosHand) and substantially higher resolution (480×720 vs. 256×256/256×384). The three-stage decoupled design enables controllable generation and shows practical utility: a downstream hand pose estimator trained on 50% of the real data plus synthetic videos matches the baseline trained on 100% of the real data. However, because the stages run sequentially, errors in an early stage can propagate to later ones, and the substantial computational cost (301 seconds per video) may limit accessibility.

“On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480×720 videos compared to 256×256/256×384 baselines.”
PAM paper · Abstract
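For readers unfamiliar with the pose metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and reference joints. The following is a minimal sketch of that generic definition only; the paper's exact evaluation protocol (joint set, alignment, how joints are extracted from generated video) is not described in this review, and the toy coordinates are made up for illustration.

```python
import math
from typing import Sequence

Point3D = Sequence[float]


def mpjpe(pred: Sequence[Point3D], gt: Sequence[Point3D]) -> float:
    """Mean Euclidean distance between corresponding joints, in the input units."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example with three joints, coordinates in millimetres (illustrative values).
pred_joints = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 20.0, 0.0)]
gt_joints = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0), (0.0, 20.0, 12.0)]
print(f"MPJPE: {mpjpe(pred_joints, gt_joints):.2f} mm")
```

The per-joint errors in the toy example are 5 mm, 0 mm, and 12 mm, so the mean is 17/3 mm.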
What holds up

The multi-condition design, which integrates depth, segmentation, and hand keypoints, is validated through ablations (Table 3) showing that combining all three modalities yields the best performance (FVD 29.13 vs. 30.00 for depth-only). The downstream task validation (Section 4.5) demonstrates the sim-to-real value: augmenting with 3,400 synthetic videos allows a model trained on only 50% of the real samples to match full real-data performance, which supports the quality of the generated data.

“For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.”
PAM paper · Abstract
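The data-mixing setup in this experiment can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: the review does not say how the 50% real subset was sampled, so `build_training_set`, the uniform random subsampling, and the placeholder sample lists are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # (source tag, index); stands in for a real data record


def build_training_set(
    real: Sequence[Sample],
    synthetic: Sequence[Sample],
    real_fraction: float = 0.5,
    seed: int = 0,
) -> List[Sample]:
    """Keep a random fraction of the real samples and append all synthetic ones."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_real = round(len(real) * real_fraction)
    mixed = rng.sample(list(real), n_real) + list(synthetic)
    rng.shuffle(mixed)
    return mixed


# Placeholder data: 1,000 real items (arbitrary size) and 3,400 synthetic videos.
real_items = [("real", i) for i in range(1000)]
synthetic_items = [("synthetic", i) for i in range(3400)]
train = build_training_set(real_items, synthetic_items)
print(len(train), "training items")

# 207k frames over 3,400 videos implies roughly 61 frames per synthetic video.
print(f"{207_000 / 3_400:.1f} frames per video on average")
```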
Main concerns

The pipeline suffers from error propagation between decoupled stages; as noted in Section 7.7, "Stage-I geometric errors (e.g., interpenetration or missing contact) can propagate, leading to physically implausible interactions even if the generated video appears photorealistic." Additionally, Stage-III quality depends heavily on Stage-II appearance guidance, creating a fragile dependency chain. The computational requirements are also heavy: Stage-II alone consumes 41.4 GB of memory, and the full pipeline requires 301 seconds per video on an NVIDIA H20 GPU (Table 5), which raises scalability concerns. Finally, the zero-shot generalization claim (Section 4.6) rests on qualitative results only, with no quantitative metrics for the cross-domain (single-hand to bimanual) transfer.

“Stage-I geometric errors (e.g., interpenetration or missing contact) can propagate, leading to physically implausible interactions even if the generated video appears photorealistic.”
PAM paper · Section 7.7
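The scalability concern can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that every video costs the reported 301 seconds, that videos are generated one at a time per GPU with no batching, and that work splits evenly across GPUs; none of these assumptions is confirmed by the paper.

```python
SECONDS_PER_VIDEO = 301  # full-pipeline inference time reported in Table 5 (NVIDIA H20)


def wall_clock_hours(
    num_videos: int,
    seconds_per_video: float = SECONDS_PER_VIDEO,
    num_gpus: int = 1,
) -> float:
    """Estimated wall-clock hours, assuming one video at a time per GPU."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_videos * seconds_per_video / 3600 / num_gpus


# Generating a synthetic set the size of the one used downstream (3,400 videos).
for gpus in (1, 8):
    print(f"{gpus} GPU(s): {wall_clock_hours(3400, num_gpus=gpus):.1f} hours")
```

Under these assumptions, a 3,400-video set takes roughly 284 GPU-hours, or about a day and a half on eight GPUs.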
Evidence and comparison

The comparisons to InterDyn and CosHand favor PAM on both resolution and FVD, but the resolution gap itself (480×720 vs. 256×256/256×384) may confound a direct metric comparison. The critique that ManiVideo "requires human appearance data, which is not available from simulators" (Section 2.3) is accurate, yet it also underscores that PAM shifts the dependency to simulator-provided geometry rather than eliminating external data requirements. The paper should report resolution-normalized metrics or control experiments to ensure fairness, since output resolution can influence perceptual metrics such as FVD and LPIPS.

“ManiVideo introduces an occlusion-aware representation but requires human appearance data, which is not available from simulators like GraspXL.”
PAM paper · Section 2.3
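One possible resolution control is to crop and downsample PAM's outputs to each baseline's resolution before computing metrics. The sketch below only computes the centered crop needed to match a target aspect ratio; it assumes the quoted sizes read as height × width, and `center_crop_box` is a hypothetical helper, not something the paper describes.

```python
from typing import Tuple


def center_crop_box(
    height: int, width: int, target_height: int, target_width: int
) -> Tuple[int, int, int, int]:
    """Return (top, left, crop_h, crop_w) of the largest centered crop
    whose aspect ratio matches target_height x target_width."""
    # Integer cross-multiplication compares aspect ratios without float error.
    if width * target_height > target_width * height:
        # Source is relatively wider than the target: keep full height, trim width.
        crop_h = height
        crop_w = height * target_width // target_height
    else:
        # Source is as wide or narrower: keep full width, trim height if needed.
        crop_w = width
        crop_h = width * target_height // target_width
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, crop_h, crop_w


# PAM output (assumed 480 high, 720 wide) against the two baseline sizes.
print(center_crop_box(480, 720, 256, 384))  # same 2:3 ratio, so no crop is needed
print(center_crop_box(480, 720, 256, 256))  # square target, so the sides are trimmed
```

After cropping, the frames would be resized to the target size with a standard image library before any metric is computed; the 256×384 case needs no crop at all, which makes it the cleaner control.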
Reproducibility

Reproducibility is significantly hampered by the absence of a code release or project repository link in the main text (beyond the Flux GitHub reference) and by the prohibitive hardware requirements (training on 8× NVIDIA H800 GPUs). While hyperparameters are detailed (AdamW, $1\times 10^{-4}$ learning rate, 8,000 steps), exact model checkpoints for the fine-tuned Flux and CogVideoX components are not mentioned, and the staged pipeline depends on multiple distinct pretrained models (GraspXL for pose generation, HaMeR for evaluation), which complicates independent reproduction. The 301-second inference time per video (Table 5) further limits large-scale replication studies without substantial compute resources.

Abstract

Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
