Identity-Consistent Video Generation under Large Facial-Angle Variations
Single-view reference-to-video methods struggle to preserve identity when faces rotate through large angles. This paper proposes Mv2ID, a multi-view conditioning framework that uses region-masking and a decoupled positional encoding scheme to prevent view-dependent copy-paste artifacts without requiring expensive cross-paired training data. The work is relevant for digital character creation and visual effects where identity must remain consistent across extreme viewpoints.
The paper presents a technically sound solution to a genuine problem in human-centric video generation. The core insight—that multi-view references improve consistency but risk copy-paste artifacts that must be actively mitigated—is well-supported. However, the claim of "outperforming existing approaches trained with cross-paired data" is nuanced: while Mv2ID achieves superior identity consistency (MvRC-Arc 0.544 vs. HuMo 0.493), it slightly trails the best competitor on motion naturalness (NaturalScore 4.69 vs. 4.71), indicating a trade-off rather than strict dominance.
The multi-view conditioning strategy is convincingly validated: quantitative results show clear gains over single-view baselines (e.g., Phantom-MV improves on Phantom-SV by 0.079 on MvRC-Arc). The region-masking approach is principled, forcing the model to aggregate complementary identity cues rather than shortcutting through view-specific features. The reference-decoupled RoPE (RD-RoPE) provides a clean mathematical solution to the heterogeneous token problem, and the trajectory visualizations (Figure 5) compellingly demonstrate smoother viewpoint transitions compared to baseline methods.
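To make the region-masking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each reference view is divided into a grid of patches and that a fixed fraction of patches is hidden independently per view, so that no single view exposes the whole face. The grid layout, the uniform random patch selection, and the names `sample_region_mask` and `mask_views` are assumptions made for illustration; only the 60% masking ratio comes from the paper's reported implementation details.

```python
import random
from typing import List, Optional

Mask = List[List[bool]]  # True means the patch is hidden from the model


def sample_region_mask(grid_h: int, grid_w: int, mask_ratio: float,
                       rng: random.Random) -> Mask:
    """Hide a fixed fraction of patches in one reference view (assumed scheme)."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    num_patches = grid_h * grid_w
    num_masked = round(num_patches * mask_ratio)
    # Choose which flat patch indices to hide, without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(row * grid_w + col) in masked for col in range(grid_w)]
            for row in range(grid_h)]


def mask_views(num_views: int, grid_h: int, grid_w: int,
               mask_ratio: float = 0.6,
               seed: Optional[int] = None) -> List[Mask]:
    """Draw an independent mask per view so the hidden regions differ."""
    rng = random.Random(seed)
    return [sample_region_mask(grid_h, grid_w, mask_ratio, rng)
            for _ in range(num_views)]


if __name__ == "__main__":
    masks = mask_views(num_views=3, grid_h=5, grid_w=4, seed=0)
    for view_index, mask in enumerate(masks):
        hidden = sum(cell for row in mask for cell in row)
        print(f"view {view_index}: {hidden} of 20 patches hidden")
```

Because each view hides a different random subset, a region masked in one reference is often visible in another, which is the property credited above with discouraging view-specific shortcuts.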
The ablation study (Table 2) reveals a tension that is not fully addressed: adding Region Masking (M) to the Base+RD-RoPE model actually reduces identity consistency metrics (MvRC-Arc drops from 0.552 to 0.535). The authors attribute the drop to "less copy-paste", but this explanation is not fully reconciled with the claim of improved consistency. The comparison to cross-paired methods is also slightly misleading: HuMo, which is trained with cross-paired data, achieves a better NaturalScore, suggesting Mv2ID's advantage lies primarily in consistency metrics, not naturalness. Additionally, the paper lacks a systematic analysis of failure modes or boundary conditions (e.g., extreme lighting variations or occlusions).
The evidence generally supports the core technical claims, though with caveats. The dedicated MvRC metric (averaging cosine similarity across 10 multi-view references) is appropriate for the task, and the NaturalScore protocol from OpenS2V-Nexus provides a standardized naturalness benchmark. The user study ($n=25$ cases, 10 evaluators each) shows statistical significance for identity consistency, though the sample size is modest. Comparisons to Phantom, MAGREF, and HuMo are fair in terms of architecture (all DiT-based), though the paper does not control for dataset scale—Mv2ID uses a specifically curated 22K video dataset with pose filtering, which may confound comparisons.
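The description of the MvRC metric suggests a simple computation, sketched below under stated assumptions. The sketch averages cosine similarity between face embeddings of generated frames and each multi-view reference embedding. The embedding extractor is omitted (the "Arc" suffix suggests an ArcFace-style model, but that is an inference), averaging over frames in this way is an assumption, and `multi_view_consistency` is a hypothetical name rather than the paper's.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def multi_view_consistency(frame_embeddings: Sequence[Vector],
                           reference_embeddings: Sequence[Vector]) -> float:
    """Mean cosine similarity over all (frame, reference) pairs (assumed form)."""
    if not frame_embeddings or not reference_embeddings:
        raise ValueError("need at least one frame and one reference")
    total = 0.0
    for frame in frame_embeddings:
        for reference in reference_embeddings:
            total += cosine_similarity(frame, reference)
    return total / (len(frame_embeddings) * len(reference_embeddings))


if __name__ == "__main__":
    # Toy 2-D embeddings: one frame, two references (one aligned, one orthogonal).
    frames = [[1.0, 0.0]]
    references = [[1.0, 0.0], [0.0, 1.0]]
    print(multi_view_consistency(frames, references))
```

Scoring against all references rather than a single frontal one rewards videos that stay close to the identity at every viewpoint, which is presumably why the metric suits large-angle evaluation.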
Reproducibility is limited by the absence of a code release announcement and by the unclear public availability of the curated 22K-video dataset. While the paper provides implementation details (Wan-2.1-T2V-14B base, 16 fps, 480×832 resolution, 60% masking ratio), critical hyperparameters such as learning rate, batch size, and training iterations are not specified in the main text. The dataset construction pipeline is described conceptually (three-stage filtering with RetinaFace face detection and pose estimation), but without access to the specific filtering thresholds or the final dataset, independent reproduction would require substantial resources.
Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
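The idea of giving video and conditioning tokens distinct positional encodings can be illustrated with a small sketch. The text here does not specify how the decoupling is done, so the sketch assumes one simple option: standard 1-D rotary position embedding (RoPE), with reference tokens indexed in a range disjoint from the video tokens. The offset scheme and the names `rope_rotate` and `decoupled_positions` are hypothetical and should not be read as the paper's actual RD-RoPE formulation.

```python
import math
from typing import List, Sequence, Tuple


def rope_rotate(vec: Sequence[float], position: int,
                base: float = 10000.0) -> List[float]:
    """Apply standard 1-D RoPE: rotate each consecutive pair by a position-scaled angle."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out: List[float] = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def decoupled_positions(num_video_tokens: int, num_reference_tokens: int,
                        reference_offset: int) -> Tuple[List[int], List[int]]:
    """Index reference tokens in a range disjoint from video tokens (assumed scheme)."""
    if reference_offset < num_video_tokens:
        raise ValueError("reference range would overlap the video range")
    video = list(range(num_video_tokens))
    references = [reference_offset + j for j in range(num_reference_tokens)]
    return video, references


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    video_pos, ref_pos = decoupled_positions(4, 2, reference_offset=100)
    query, key = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
    # Attention logit between the last video token and the first reference token.
    print(dot(rope_rotate(query, video_pos[-1]), rope_rotate(key, ref_pos[0])))
```

Since RoPE attention scores depend only on the difference between positions, placing reference tokens in their own index range means a reference image is never treated as if it were an adjacent video frame, which is one plausible reading of "decoupled" in this context.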