Identity-Consistent Video Generation under Large Facial-Angle Variations

cs.CV Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, Jingdong Wang · Mar 22, 2026

Plain-language introduction

Single-view reference-to-video methods struggle to preserve identity when faces rotate through large angles. This paper proposes Mv2ID, a multi-view conditioning framework that uses region-masking and a decoupled positional encoding scheme to prevent view-dependent copy-paste artifacts without requiring expensive cross-paired training data. The work is relevant for digital character creation and visual effects where identity must remain consistent across extreme viewpoints.

Critical review

Verdict

Bottom line

The paper presents a technically sound solution to a genuine problem in human-centric video generation. The core insight—that multi-view references improve consistency but risk copy-paste artifacts that must be actively mitigated—is well-supported. However, the claim of "outperforming existing approaches trained with cross-paired data" is nuanced: while Mv2ID achieves superior identity consistency (MvRC-Arc 0.544 vs. HuMo 0.493), it slightly trails the best competitor on motion naturalness (NaturalScore 4.69 vs. 4.71), indicating a trade-off rather than strict dominance.

“Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.”

paper · Abstract

“Mv2ID ... MvRC-Arc 0.544 ... NaturalScore 4.69 ... HuMo ... MvRC-Arc 0.493 ... NaturalScore 4.71”

paper · Table 1
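The "trade-off rather than strict dominance" reading can be checked directly from the two Table 1 rows quoted above. The following is a minimal sketch; the `dominates` helper is illustrative and not part of the paper, and it assumes higher is better for both metrics.

```python
# Scores quoted from Table 1 of the paper (assumed: higher is better for both).
scores = {
    "Mv2ID": {"MvRC-Arc": 0.544, "NaturalScore": 4.69},
    "HuMo": {"MvRC-Arc": 0.493, "NaturalScore": 4.71},
}


def dominates(a, b):
    """True if `a` is at least as good as `b` on every metric and strictly better on one."""
    at_least_as_good = all(a[m] >= b[m] for m in a)
    strictly_better = any(a[m] > b[m] for m in a)
    return at_least_as_good and strictly_better


mv2id_dominates = dominates(scores["Mv2ID"], scores["HuMo"])
humo_dominates = dominates(scores["HuMo"], scores["Mv2ID"])
print(mv2id_dominates, humo_dominates)  # prints: False False
```

Neither method dominates the other on these two metrics, which is what a trade-off looks like numerically.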

What holds up

The multi-view conditioning strategy is convincingly validated: quantitative results show clear gains over single-view baselines (e.g., Phantom-MV gains 0.079 on MvRC-Arc over Phantom-SV). The region-masking approach is principled, forcing the model to aggregate complementary identity cues rather than taking a shortcut through view-specific features. The reference-decoupled RoPE (RD-RoPE) provides a clean mathematical solution to the heterogeneous token problem, and the trajectory visualizations (Figure 5) show smoother viewpoint transitions than the baseline methods.

“By partially removing information from every conditioning image, it forces the model to aggregate complementary cues across multi-view to generate the target frame”

paper · Section 4.3
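The idea in the quote above, removing part of every conditioning image so that no single view is sufficient, might be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the square patch grid, the patch size, uniform random patch selection, and the `mask_regions` helper are all hypothetical. Only the 60% masking ratio comes from the reported setup.

```python
import random


def mask_regions(image, patch=4, ratio=0.6, seed=None):
    """Zero out about `ratio` of the patch x patch blocks of a 2D image (list of rows).

    Assumption: blocks are chosen uniformly at random; the paper may select
    regions differently (e.g. by facial part).
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    blocks = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    masked = set(rng.sample(blocks, round(ratio * len(blocks))))
    out = [row[:] for row in image]  # copy so the input is left untouched
    for r0, c0 in masked:
        for r in range(r0, min(r0 + patch, h)):
            for c in range(c0, min(c0 + patch, w)):
                out[r][c] = 0
    return out, masked


# Three toy 16x16 "reference views", each masked independently.
refs = [[[1] * 16 for _ in range(16)] for _ in range(3)]
masked_refs = [mask_regions(img, seed=i) for i, img in enumerate(refs)]
```

Because each view loses a different set of blocks, a model trained on such inputs cannot reconstruct a face by copying one reference wholesale and has to combine what remains across views.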

“In contrast, our results is more smooth and reasonable”

paper · Figure 5
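For readers unfamiliar with the mechanism, the sketch below shows one way that "distinct positional encoding" for video and reference tokens could be realized with standard 1D rotary embeddings. The disjoint index range, the `ref_offset` value, and both function names are assumptions made for illustration; the paper's actual RD-RoPE design may differ.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard 1D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # frequency for this pair of dims
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def decoupled_positions(num_video_frames, num_refs, ref_offset=1000):
    """Hypothetical assignment: frames use 0..T-1, references use a disjoint range.

    Assumes ref_offset >= num_video_frames so the two ranges never overlap.
    """
    video_pos = list(range(num_video_frames))
    ref_pos = [ref_offset + k for k in range(num_refs)]
    return video_pos, ref_pos


video_pos, ref_pos = decoupled_positions(num_video_frames=8, num_refs=3)
query = [1.0, 0.0, 0.5, -0.5]
rotated_video = rope_rotate(query, video_pos[2])
rotated_ref = rope_rotate(query, ref_pos[0])
```

The point of the sketch is only that reference tokens never share a position index with a video frame, so attention can distinguish "this is a conditioning image" from "this is frame t" by position alone.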

Main concerns

The ablation study (Table 2) reveals a tension that is not fully addressed: adding Region Masking (M) to the Base+RD-RoPE model reduces identity consistency (MvRC-Arc drops from 0.552 to 0.535). The authors attribute the drop to "less copy-paste" but do not fully reconcile it with the claim of improved consistency. The comparison to cross-paired methods is also slightly misleading: HuMo, which is trained with cross-paired data, achieves a marginally higher NaturalScore (4.71 vs. 4.69), suggesting that Mv2ID's advantage lies primarily in consistency metrics, not naturalness. Additionally, the paper lacks a systematic analysis of failure modes or boundary conditions (e.g., extreme lighting variations or occlusions).

“B + R ... MvRC-Arc 0.552 ... B + R + M ... MvRC-Arc 0.535”

paper · Table 2

“Region masking training and RD-RoPE improve motion naturalness by 0.52 and 0.23, respectively ... Although region masking slightly reduces identity consistency(caused by the less copy-paste)”

paper · Section 5.3
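To put the ablation numbers in proportion, the short calculation below computes the absolute and relative MvRC-Arc change from the two Table 2 values quoted above. It uses only those two figures.

```python
# MvRC-Arc values quoted from Table 2 of the paper.
base_rope = 0.552        # B + R
base_rope_mask = 0.535   # B + R + M

absolute_drop = round(base_rope - base_rope_mask, 3)
relative_drop_pct = round(100 * (base_rope - base_rope_mask) / base_rope, 1)
print(absolute_drop, relative_drop_pct)  # prints: 0.017 3.1
```

A drop of about 3% in the consistency metric is small, but it runs in the opposite direction from the headline claim, so it deserves an explicit discussion in the paper.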

Evidence and comparison

The evidence generally supports the core technical claims, though with caveats. The dedicated MvRC metric (averaging cosine similarity across 10 multi-view references) is appropriate for the task, and the NaturalScore protocol from OpenS2V-Nexus provides a standardized naturalness benchmark. The user study ($n=25$ cases, 10 evaluators each) shows statistical significance for identity consistency, though the sample size is modest. Comparisons to Phantom, MAGREF, and HuMo are fair in terms of architecture (all DiT-based), though the paper does not control for dataset scale—Mv2ID uses a specifically curated 22K video dataset with pose filtering, which may confound comparisons.

“Both one-way ANOVA and the Kruskal–Wallis test show significant differences among methods (ANOVA: F=81.70, p<10^{-70}; Kruskal–Wallis: H=283.75, p<10^{-50})”

paper · Section 5.4
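As a reminder of what the reported F value measures, here is a minimal sketch of the one-way ANOVA F statistic. The toy ratings are invented for illustration and are not the user-study data.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of rating groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n_total = len(groups), len(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy data: three hypothetical methods, three ratings each (not from the paper).
toy_ratings = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_ratings)
```

For this toy data the between-group sum of squares is 26 and the within-group sum is 6, giving F = 13. A large F means the methods differ far more from one another than raters differ within a method. In practice, `scipy.stats.f_oneway` and `scipy.stats.kruskal` compute the two tests named in the quote.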

“Our dataset explicitly focuses on human-centric samples with large facial-angle variations”

paper · Table 3 caption
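The MvRC metric described above, an average of cosine similarities against several multi-view references, can be sketched as below. The function names are illustrative, and the sketch assumes face embeddings have already been extracted by some recognition model; the embedding step itself is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multi_view_consistency(frame_embedding, reference_embeddings):
    """Average similarity of one generated-frame embedding to all reference views."""
    sims = [cosine_similarity(frame_embedding, r) for r in reference_embeddings]
    return sum(sims) / len(sims)


# Toy 2D embeddings: one reference matches the frame, the other is orthogonal.
frame = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0]]
score = multi_view_consistency(frame, references)  # 0.5
```

Averaging over many views is what makes the metric suited to this task: a frame that matches only the frontal reference scores lower than one that stays close to the identity across all viewpoints.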

Reproducibility

Reproducibility is limited by the absence of a code release announcement and the unclear public availability of the curated 22K-video dataset. While the paper provides implementation details (Wan-2.1-T2V-14B base, 16 fps, 480×832 resolution, 60% masking ratio), critical hyperparameters such as learning rate, batch size, and training iterations are not specified in the main text. The dataset construction pipeline is described conceptually (three-stage filtering with RetinaFace face detection and pose estimation), but without the specific filtering thresholds or the final dataset, independent reproduction would require substantial resources.

“We build our method on Wan-2.1-T2V-14B and use three reference images by default. All training videos are resampled to 16 fps and resized to a spatial resolution of 480×832”

paper · Section 5.1

“We construct the dataset through a three-stage pipeline ... coarse filtering using face detection”

paper · Section 4.1

Abstract

Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose Mv2ID, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
