VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

cs.CV Xinghan Li, Junhao Xu, Jingjing Chen · Mar 23, 2026
AI Review
Plain-language introduction

VIGIL tackles hallucination in multimodal deepfake detection by decoupling claim generation from evidence sourcing through a part-centric plan-then-examine pipeline. The framework first plans which facial parts to inspect using global visual cues, then examines each part with independently sourced forensic evidence delivered via a stage-gated injection mechanism. Combined with a progressive three-stage training paradigm featuring part-aware reinforcement learning rewards, the method aims to produce verifiable, anatomically grounded explanations rather than confabulated reasoning chains.
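
The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `plan_then_examine` and the three callables (`plan`, `evidence_for`, `examine`) are hypothetical stand-ins for the model's planning step, the external forensic evidence source, and the per-part examination step. The only property the sketch tries to capture is the stage gate, meaning that no forensic evidence is fetched until the part plan is fixed.

```python
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical; the paper's actual interfaces are not
# given in the reviewed text.
Finding = Tuple[bool, str]  # (part judged manipulated?, evidence that was used)


def plan_then_examine(
    image: object,
    plan: Callable[[object], List[str]],
    evidence_for: Callable[[object, str], str],
    examine: Callable[[object, str, str], Finding],
) -> Dict[str, Finding]:
    # Planning stage: the planner sees only the image, so part selection
    # cannot be biased by external forensic signals.
    parts = plan(image)

    findings: Dict[str, Finding] = {}
    for part in parts:
        # Examination stage: evidence is sourced only now, and only for
        # parts that were already planned.
        evidence = evidence_for(image, part)
        findings[part] = examine(image, part, evidence)
    return findings


if __name__ == "__main__":
    # Stub components that only illustrate the call order.
    result = plan_then_examine(
        image="face.png",
        plan=lambda img: ["eyes", "mouth"],
        evidence_for=lambda img, part: f"signal for {part}",
        examine=lambda img, part, ev: (part == "mouth", ev),
    )
    print(result)
```

Passing the evidence source as a separate callable keeps the claimant (the model) and the evidence provider apart, which is the separation the review credits to the design.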

Critical review
Verdict
Bottom line

VIGIL presents a compelling architectural solution to the evidence-hallucination problem in deepfake detection and achieves strong empirical results on the proposed OmniFake benchmark. The part-centric decomposition and stage-gated signal injection are genuine innovations that address the "claimant and evidence provider" conflict inherent in end-to-end MLLM reasoning. However, the evaluation relies on hypothetical future models (GPT-5.2, Gemini-3-Pro, Qwen3-VL-8B) that do not currently exist, which makes the results hard to reproduce and casts doubt on the reported comparisons. The OmniFake benchmark itself offers a valuable hierarchical evaluation protocol, though the paper reports no inter-annotator agreement statistics for its automated annotation pipeline.

“The reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions.”
paper · Section 1
“VIGIL achieves 93.1% overall accuracy, with an average improvement of 4.3% over the previous best expert detector DDA and 5.5% over the concurrent MLLM-based method Veritas.”
paper · Section 5.2
What holds up

Decoupling planning from examination via stage-gated injection is well motivated and supported by the ablations, which show a 3.9% gain on Level 5 when stage-gating is included. The progressive training paradigm yields clear incremental benefits: adding Stage 2 (hard-sample self-training) before RL refinement improves coverage of challenging cases. The ablation study is thorough, separating the contribution of forensic signals (3.9%) from that of part-centric reasoning (2.8%) and showing that the two are complementary (8.4% combined gain).

“Removing both part-centric reasoning and forensic signals reduces the averaged accuracy by 8.4%. Forensic signals alone contribute 3.9% and part-centric reasoning alone contributes 2.8%. Their combined gain of 8.4% exceeds the sum (6.7%), confirming that the two components are complementary.”
paper · Section 5.3
“Stage-gated injection contributes 2.2% overall, but its effect concentrates on harder levels (+3.9% on L5, +3.4% on L4, vs. +0.6% on L1).”
paper · Section 5.3
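
The complementarity claim quoted from Section 5.3 rests on simple arithmetic, which the short check below reproduces. The three input figures are taken from the quote; the variable names and the "interaction surplus" label are introduced here for illustration only.

```python
# Figures quoted from the paper's ablation (percentage points of accuracy).
forensic_only = 3.9   # contribution of forensic signals alone
part_only = 2.8       # contribution of part-centric reasoning alone
combined = 8.4        # drop when both components are removed

# If the two components were purely additive, the combined gain would
# equal the sum of the individual gains.
additive_expectation = round(forensic_only + part_only, 1)

# Any excess over the additive expectation is attributable to interaction.
interaction_surplus = round(combined - additive_expectation, 1)

print(additive_expectation)  # 6.7
print(interaction_surplus)   # 1.7
```

The surplus of 1.7 points is what supports reading the two components as complementary rather than merely additive.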
Main concerns

The experimental validation relies on models and tools that appear to be hypothetical or future releases, including "GPT-5.2," "Gemini-3-Pro," "Qwen3-VL-8B," and "DeepSeek-V3.2" as the judge LLM. As noted in Section 5.1, these serve as both baseline competitors and evaluation judges, so the reported performance gaps cannot be verified with currently available technology. The automated annotation pipeline (Section 4.2) claims to use "multiple off-the-shelf MLLMs" with consensus filtering, yet reports no inter-annotator agreement metrics or error rates for the automated labeling, which raises questions about ground-truth quality for the 200K-image dataset. Additionally, the fixed partition into exactly 8 anatomical parts assumes that all forgeries respect these semantic boundaries; cross-region artifacts, which are addressed only via a global evidence summary, may not be adequately captured by this rigid partitioning.

“Multiple off-the-shelf MLLMs independently produce visual descriptions, with consensus filtering to remove hallucinated observations.”
paper · Section 4.2
“GPT-5.2 serves as the judge LLM $\mathcal{J}$.”
paper · Section 5.1
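
To make the annotation concern concrete, the sketch below shows one way consensus filtering and an accompanying agreement statistic could be computed. It is an assumption-laden illustration: the reviewed text does not state the paper's voting rule, so the `min_votes` threshold, the representation of each annotator's output as a set of normalized observation strings, and the choice of mean pairwise Jaccard similarity as the agreement measure are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def consensus_filter(descriptions: List[Set[str]], min_votes: int) -> Set[str]:
    """Keep observations reported by at least `min_votes` annotators."""
    votes = Counter(obs for desc in descriptions for obs in desc)
    return {obs for obs, count in votes.items() if count >= min_votes}


def mean_pairwise_jaccard(descriptions: List[Set[str]]) -> float:
    """Average Jaccard similarity over all annotator pairs (a simple agreement score)."""
    scores = []
    for a, b in combinations(descriptions, 2):
        union = a | b
        # Two empty descriptions are treated as full agreement.
        scores.append(len(a & b) / len(union) if union else 1.0)
    # With fewer than two annotators there are no pairs; report full agreement.
    return sum(scores) / len(scores) if scores else 1.0


if __name__ == "__main__":
    # Hypothetical outputs from three annotator models for one image.
    annotators = [
        {"blurred hairline", "asymmetric pupils"},
        {"blurred hairline"},
        {"blurred hairline", "extra tooth"},
    ]
    print(consensus_filter(annotators, min_votes=2))
    print(mean_pairwise_jaccard(annotators))
```

Reporting a statistic of this kind alongside the filtered labels would let readers judge how much the annotators actually agreed before filtering.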
Evidence and comparison

The evidence largely supports the core architectural claims, with Table 3 showing that removing both part-centric reasoning and forensic signals causes an 8.4% accuracy drop, and Table 4 demonstrating that stage-gated injection specifically benefits harder levels (+3.9% on L5). However, comparisons to concurrent MLLM-based methods like Veritas (ICLR'26) and FakeVLM (NeurIPS'25) are suspect given these appear to be citations to future or non-existent publications (ICLR 2026 and NeurIPS 2025 have not occurred), making fair comparison impossible to verify. The OmniFake benchmark construction is well-documented in terms of data sources, though without access to the actual dataset or the claimed project page (https://vigil.best), the hierarchical evaluation protocol cannot be independently validated.

“Applying GRPO directly after SFT (S1+S3) achieves 92.0%, while adding hard-sample self-training before GRPO (S1+S2+S3) yields the best performance at 93.3%.”
paper · Section 5.3
Reproducibility

The paper provides detailed hyperparameters for all three training stages (e.g., Stage 1: lr $5\times10^{-5}$, batch size 1; Stage 3: lr $1\times10^{-6}$, GRPO with $G=8$ rollouts), yet lacks explicit code availability statements or links to repositories in the provided text. The reliance on hypothetical models (Qwen3-VL-8B, DeepSeek-V3.2) as both the backbone and judge creates a significant barrier to reproduction until these models are publicly released. While the OmniFake dataset composition is described in detail (Section 3.1), the text notes only that "Detailed protocols are provided in the supplementary material," which is not included in the provided excerpt, leaving critical data preprocessing steps undocumented for replication.

“Stage 1 trains for 3 epochs (lr $5\times10^{-5}$, batch size 1); Stage 2 follows the same hyperparameters for 1 epoch; Stage 3 uses lr $1\times10^{-6}$, batch size 8, $G=8$ rollouts, temperature 1.0.”
paper · Section 5.1
“Detailed protocols are provided in the supplementary material.”
paper · Section 3.1
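
For readers unfamiliar with the Stage 3 setting quoted above, GRPO is commonly described as scoring each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows that group-relative normalization for a group of $G=8$ rewards. It is a generic illustration under stated assumptions: the reward values are made up, the use of the population standard deviation and the `eps` stabilizer are choices made here, and the paper's part-aware reward function is not reproduced.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group.

    Assumption: population standard deviation plus a small eps; actual
    implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for G = 8 rollouts of a single prompt.
    rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.1f} advantage={a:+.3f}")
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a hard-sample stage before RL can matter.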
Abstract

Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence--conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
