EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization

cs.CV Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu · Mar 22, 2026



Plain-language introduction

EmoTaG tackles few-shot 3D talking-head synthesis with emotional expressiveness using only 5 seconds of target video. The core insight is to predict FLAME parameters (expression and jaw pose) rather than directly deforming 3D Gaussians, providing explicit geometric priors for stability. A Gated Residual Motion Network (GRMN) disentangles phonetic articulation from emotion-driven variations with a learned gate $g \in [0,1]$, while Semantic Emotion Guidance distills knowledge from a pretrained DeepFace recognizer to supervise emotional intensity without manual labels.

Critical review

Verdict

Bottom line

EmoTaG presents a compelling solution for emotion-aware few-shot personalization, achieving strong quantitative gains on emotional metrics (LMD, AUE) and visual quality (PSNR 30.02 vs 28.92 for InsTaG) via its FLAME-based structural prior and residual motion decomposition. However, the method relies on auxiliary pose and expression frames at inference, so it is not fully audio-driven, and its emotional evaluation set is small (2 identities per emotion, derived from MEAD). The Sync-C score (6.212) also trails large-scale pretrained baselines like Real3DPortrait (6.719), indicating a trade-off between personalization depth and audio-lip synchronization.

“audio carries very limited information about upper-face expression or head pose, we additionally provide a set of pose & expression frames as auxiliary inputs during inference”

paper · Section 3.3

“Sync-C ... Real3DPortrait ... 6.719 ... EmoTaG (Ours) ... 6.212”

paper · Table 1

“emotional set derived from MEAD ... 2 identities per emotion”

paper · Section 4

What holds up

The FLAME-Gaussian formulation provides robust geometric stability by binding Gaussians to mesh triangles via barycentric interpolation and updating them through the rigged mapping $\mathcal{G}_i$. The GRMN's three-branch design (base, residual, gate) decouples neutral articulation from emotional deviation via $\boldsymbol{\delta} = \boldsymbol{\delta}_{\text{b}} + g \cdot \boldsymbol{\delta}_{\text{r}}$, supported by ablations showing Sync-C drops when removing the gate (-0.34) or the residual branch (-1.25). Semantic Emotion Guidance injects emotion awareness without expensive manual annotations, using $e = 1 - p_{\text{emo}}(\text{neutral})$ and KL divergence on the latent $\mathbf{z}_e$.
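The two formulas above can be illustrated with a minimal sketch. It assumes a scalar gate and plain Python lists for the FLAME parameter offsets; the function names `gated_motion` and `emotion_intensity` and the dictionary layout of the emotion probabilities are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def gated_motion(
    delta_base: Sequence[float],
    delta_residual: Sequence[float],
    gate: float,
) -> List[float]:
    """Combine branches as delta = delta_b + g * delta_r (scalar gate assumed)."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(delta_base) != len(delta_residual):
        raise ValueError("base and residual offsets must have the same length")
    return [b + gate * r for b, r in zip(delta_base, delta_residual)]


def emotion_intensity(emotion_probs: Dict[str, float]) -> float:
    """Emotion intensity e = 1 - p_emo(neutral).

    `emotion_probs` is a hypothetical mapping from emotion label to
    probability, standing in for the output of an emotion recognizer.
    """
    return 1.0 - emotion_probs["neutral"]


# With the gate closed, only the base (neutral articulation) branch remains.
neutral_only = gated_motion([1.0, 2.0], [0.5, -1.0], gate=0.0)
# With the gate fully open, the emotional residual is added in full.
fully_emotional = gated_motion([1.0, 2.0], [0.5, -1.0], gate=1.0)
```

Under this reading, the gate interpolates between purely phonetic motion and motion with the full emotional residual, which is why removing either the gate or the residual branch would be expected to change the output.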

Main concerns

The evaluation scope for emotions is narrow: only 2 identities per emotion across 5 emotion categories from MEAD, raising questions about identity-emotion entanglement and generalization to diverse demographics. The method is not fully audio-driven; it requires external upper-face AU inputs from OpenFace and pose cues at inference, complicating deployment. The model also depends heavily on AdaIN-based identity modulation: removing it causes the largest ablation drop (-1.53 Sync-C, -1.57 PSNR, from the Table 5 values 6.147 → 4.621 and 29.95 → 28.38), suggesting that personalization hinges on this single component. Additionally, using DeepFace as a teacher model introduces potential biases from its training data, which the paper does not address.

“w/o Identity Modulation (AdaIN) ... PSNR 28.38 ... Sync-C 4.621 ... EmoTaG (Full Model) ... PSNR 29.95 ... Sync-C 6.147”

paper · Table 5

“we additionally provide a set of pose & expression frames as auxiliary inputs during inference”

paper · Section 3.3
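The size of the AdaIN ablation drop can be checked directly from the Table 5 values quoted above. The snippet below is only an arithmetic check; the dictionary layout and the helper name `ablation_drop` are illustrative choices.

```python
# Values as quoted from Table 5 of the paper.
full_model = {"PSNR": 29.95, "Sync-C": 6.147}
without_adain = {"PSNR": 28.38, "Sync-C": 4.621}


def ablation_drop(full: dict, ablated: dict) -> dict:
    """Return full-minus-ablated for every metric, rounded to 3 decimals."""
    return {name: round(full[name] - ablated[name], 3) for name in full}


drops = ablation_drop(full_model, without_adain)
print(drops)
```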

Evidence and comparison

Comparisons against InsTaG, MimicTalk, and Real3DPortrait are fair and cover appropriate baselines (few-shot vs. one-shot vs. per-subject training). The metric suite is comprehensive, covering rendering ($\mathcal{L}_1$, LPIPS), geometry (LMD), emotion (AUE-L/U), and sync (Sync-C/E). Claims of state-of-the-art emotional expressiveness are supported by user study scores (4.50 vs 3.80 for InsTaG). However, the paper glosses over the fact that EmoTaG underperforms large pretrained models on Sync-C and requires auxiliary conditioning, which baselines like Real3DPortrait do not. The qualitative results in Figure 5 highlight geometric artifacts in competitors using red-box callouts, which is a standard presentation.

“EmoTaG (Ours) ... Emo Expr ... 4.50 ... InsTaG ... 3.80”

paper · Table 4

Reproducibility

The paper provides detailed implementation specifics: 250K pretraining iterations, 20K adaptation iterations, AdamW with learning rates $5 \times 10^{-3}$ and $5 \times 10^{-4}$, and loss weights ($\lambda_{\text{D-SSIM}} = 0.2$). It uses public datasets (HDTF for training, MEAD for emotional evaluation) and standard tools (VHAP for FLAME tracking, Wav2Vec 2.0 for audio, OpenFace for AU extraction, DeepFace for emotion guidance). However, no code or pretrained models are mentioned, and the pipeline depends on multiple external preprocessing tools (VHAP, OpenFace, DeepFace) whose versions are not specified. The geometric loss $\mathcal{L}_{\text{Geo}}$ relies on Sapiens-derived pseudo-ground-truth depth and normals, adding another external dependency that may limit exact reproduction.

“Pretraining and adaptation are performed for 250K and 20K iterations, respectively, using AdamW with learning rates of $5 \times 10^{-3}$ and $5 \times 10^{-4}$”

paper · Section 4 (Implementation Details)
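For anyone attempting a reproduction, the reported hyperparameters could be collected in a small configuration object like the sketch below. The class and field names are hypothetical. The review does not state which learning rate belongs to which stage or parameter group, so the two values are kept as an unassigned pair. The `photometric_loss` helper assumes the weighting convention of the original 3D Gaussian Splatting work, $(1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}$, which may differ from EmoTaG's exact formulation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters as reported in the paper's implementation details."""

    pretrain_iterations: int = 250_000
    adapt_iterations: int = 20_000
    optimizer: str = "AdamW"
    # Both rates are reported; their mapping to stages or parameter
    # groups is not stated here, so they are left as an unassigned pair.
    learning_rates: Tuple[float, float] = (5e-3, 5e-4)
    lambda_dssim: float = 0.2


def photometric_loss(l1: float, d_ssim: float, lam: float) -> float:
    """Assumed 3DGS-style weighting: (1 - lam) * L1 + lam * D-SSIM."""
    return (1.0 - lam) * l1 + lam * d_ssim


setup = TrainingSetup()
# Example with made-up loss values, only to show how the weight is applied.
example = photometric_loss(l1=0.1, d_ssim=0.3, lam=setup.lambda_dssim)
```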

“$\mathcal{L}_{\text{Geo}} = \mathcal{L}_D(D, D_{GT}) + \mathcal{L}_N(N, N_{GT})$ ... pseudo-ground-truths from Sapiens”

paper · Section 3.4
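The quoted geometric loss is a sum of a depth term and a normal term. The sketch below shows one plausible instantiation, assuming a mean absolute error for $\mathcal{L}_D$ and a mean cosine distance for $\mathcal{L}_N$; the quoted text does not specify either term, so both choices, and all function names, are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def depth_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed L_D: mean absolute difference between depth values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def normal_loss(pred: Sequence[Vec3], target: Sequence[Vec3]) -> float:
    """Assumed L_N: mean of (1 - cosine similarity) between normals."""
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        if norm == 0.0:
            raise ValueError("normals must be non-zero vectors")
        total += 1.0 - dot / norm
    return total / len(pred)


def geometric_loss(
    depth: Sequence[float],
    depth_gt: Sequence[float],
    normals: Sequence[Vec3],
    normals_gt: Sequence[Vec3],
) -> float:
    """L_Geo = L_D(D, D_GT) + L_N(N, N_GT), as in the quoted formula."""
    return depth_loss(depth, depth_gt) + normal_loss(normals, normals_gt)
```

In a reproduction, `depth_gt` and `normals_gt` would come from the Sapiens pseudo-ground-truth, so the loss inherits whatever errors that model makes.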

Abstract

Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
