UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
UniMotion addresses the fragmentation in human motion modeling by unifying motion, text, and RGB understanding/generation within a single 1.5B parameter architecture. Unlike prior work relying on discrete tokenization or handling only partial modality subsets, it treats motion as a continuous first-class modality via a Cross-Modal Aligned Motion VAE (CMA-VAE). The framework introduces Dual-Posterior KL Alignment to distill visual semantics into motion representations without requiring images at inference, and Latent Reconstruction Alignment to bootstrap the motion pathway through dense self-supervision before sparse text calibration.
The paper presents a compelling architectural advance in multimodal motion understanding, successfully demonstrating that continuous latent representations with cross-modal alignment outperform discrete tokenization across seven diverse tasks. The symmetric dual-path design elegantly reconciles autoregressive text generation with flow-based motion synthesis through hybrid attention and modality-routed LoRA. However, the evaluation relies heavily on constrained indoor datasets (Human3.6M) for vision tasks, and the staged training pipeline with partial LLM unfreezing only in the final stage suggests the "unified" capability is achieved through progressive specialization rather than true simultaneous joint training from scratch.
The continuous motion representation via CMA-VAE is rigorously validated through comprehensive ablations demonstrating superior reconstruction fidelity (APE 3.53cm vs 17.15cm for VQ-VAE) and downstream task performance. The Dual-Posterior KL Alignment (DPA) effectively injects visual-semantic priors without inference overhead, with Table 9 showing consistent gains across all tasks when DPA is included (T2M R@3 improves from 0.818 to 0.841). Additionally, the Latent Reconstruction Alignment (LRA) successfully addresses the cold-start problem, establishing that motion latents $z \in \mathbb{R}^{T_z \times d}$ serve as effective dense supervision signals for pre-training the motion pathway.
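To make the DPA idea concrete, the following is a minimal sketch of a KL term between two posteriors, assuming both encoders output diagonal Gaussians parameterized by a mean and a log-variance. The function name `kl_diag_gaussians`, the argument layout, and the direction of the KL are illustrative assumptions; the review does not state the exact form the paper uses.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and log-variances.

    Hypothetical helper: which posterior plays q and which plays p in DPA
    is an assumption here, not something taken from the paper.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        var_q = math.exp(lq)
        var_p = math.exp(lp)
        # Closed form per dimension:
        # 0.5 * (log(var_p / var_q) + (var_q + (mq - mp)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (var_q + (mq - mp) ** 2) / var_p - 1.0)
    return total

# Toy latents: q stands in for the motion-only posterior,
# p for the vision-fused posterior (assumed roles).
motion_only = ([0.0, 0.0], [0.0, 0.0])
vision_fused = ([1.0, 0.0], [0.0, 0.0])
alignment_loss = kl_diag_gaussians(*motion_only, *vision_fused)
```

Because the term only compares encoder outputs during training, a loss of this shape would be consistent with the review's point that no image is needed at inference: only the motion-only encoder would be called once training is done.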
The evaluation of vision capabilities is predominantly limited to controlled indoor environments (Human3.6M), leaving significant uncertainty about robustness to severe occlusion, camera motion, and diverse in-the-wild scenarios despite a single zero-shot test on 3DPW. The claim of being the "first unified framework" for motion-text-vision overlooks potential concurrent work and relies on a staged training pipeline (Table 15) where the LLM backbone remains frozen or partially frozen through most stages, suggesting the architecture achieves unification through sequential adaptation rather than inherent simultaneous tri-modal reasoning. Furthermore, the Motion-guided Image Editing evaluation uses a Procrustes-aligned error threshold that may favor pose shape matching over absolute spatial accuracy.
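The concern about Procrustes alignment can be illustrated with a small sketch. It is a simplified 2D stand-in (joints as complex numbers, similarity alignment solved by least squares), not the 3D metric used in the paper; the names `mean_joint_error` and `procrustes_align_2d` and the toy poses are hypothetical.

```python
def mean_joint_error(pred, gt):
    """Mean Euclidean distance between corresponding joints (2D points as complex)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def procrustes_align_2d(pred, gt):
    """Align pred to gt with the best translation, rotation, and scale (least squares)."""
    mu_p = sum(pred) / len(pred)
    mu_g = sum(gt) / len(gt)
    xp = [p - mu_p for p in pred]
    xg = [g - mu_g for g in gt]
    # One complex factor a encodes rotation and scale: minimize sum |a*x - y|^2.
    a = sum(x.conjugate() * y for x, y in zip(xp, xg)) / sum(abs(x) ** 2 for x in xp)
    return [a * x + mu_g for x in xp]

# Toy ground-truth pose and a prediction with the same shape but
# rotated by 90 degrees, shrunk by half, and shifted far away.
gt = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 2j]
pred = [g * 0.5j + (3 + 4j) for g in gt]

raw_error = mean_joint_error(pred, gt)
aligned_error = mean_joint_error(procrustes_align_2d(pred, gt), gt)
```

In this toy case the raw error is large while the aligned error is essentially zero, which is the behavior the review flags: a threshold on the aligned error rewards matching pose shape and cannot detect errors in global position, orientation, or scale.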
The evidence robustly supports the core claim that continuous representations outperform discrete tokenization, with Table 6 providing systematic comparisons across reconstruction quality metrics including APE, AVE, and FID, alongside downstream transfer performance. However, comparisons to vision-language baselines like Show-o2 on Vision-to-Text tasks (Table 10) are confounded by UniMotion's incorporation of pose-aware vision backbones and motion-specific inductive biases absent in general MLLMs. The authors appropriately temper their claims by acknowledging that specialist models still outperform UniMotion on Vision-to-Motion tasks (MPJPE 75.0 vs 50.8 for SMPLer), correctly framing the contribution as unified capability coverage rather than absolute domain superiority.
The paper provides substantial architectural specifications in the supplementary material, including exact hyperparameters for the 269-dimensional motion representation, multi-stage training configurations (Table 15), and flow matching details (Euler ODE with 50 steps, CFG scale $s=3.0$). However, critical implementation details such as the complete instruction template sets for each task, the specific data preprocessing pipelines for the motion representation, and code for the CMA-VAE dual-posterior training are referenced but not fully specified. The requirement of $4\times$A6000 GPUs for training presents a significant barrier to independent reproduction, and while the authors reference a project URL, the current submission lacks an accessible public repository or dataset preprocessing scripts necessary for full reproducibility.
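The sampling settings quoted above (Euler ODE, 50 steps, CFG scale $s=3.0$) can be sketched as follows. This is a scalar toy under stated assumptions: the guidance convention `v_uncond + s * (v_cond - v_uncond)` is one common form and may differ from the paper's, and `toy_velocity` is a hypothetical stand-in for the learned flow head.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional velocity (assumed form)."""
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(x0, velocity_fn, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy field below stays finite
        x = x + dt * velocity_fn(x, t)
    return x

def toy_velocity(target):
    """Hypothetical velocity field that transports any x to `target` at t=1."""
    return lambda x, t: (target - x) / (1.0 - t)

def sample_with_cfg(x0, cond_target, uncond_target, scale=3.0, num_steps=50):
    v_c = toy_velocity(cond_target)
    v_u = toy_velocity(uncond_target)
    field = lambda x, t: guided_velocity(v_c(x, t), v_u(x, t), scale)
    return euler_sample(x0, field, num_steps)

sample = sample_with_cfg(x0=0.5, cond_target=1.0, uncond_target=0.0)
```

With `scale=1.0` the sample lands on the conditional target, and with a larger scale it is pushed further from the unconditional one. In this toy field the guided sample ends near `uncond_target + scale * (cond_target - uncond_target)`, which shows why the guidance scale is a hyperparameter that reproduction attempts would need to match.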
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.