PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models

cs.CV Yiwei Xie, Zheng Zhang, Ping Liu · Mar 23, 2026

Plain-language introduction

Text-to-video concept erasure methods claim to remove sensitive content, but current evaluation checks only whether the concept is absent from generated frames. PROBE introduces a diagnostic protocol that optimizes a pseudo-token embedding, with all model weights frozen, to test whether erased concepts can be reactivated. By probing residual capacity across three architectures and three erasure strategies, the authors find that all tested methods leave measurable residual capacity. They also identify temporal re-emergence, a video-specific failure mode in which concepts suppressed in early frames resurface later in the sequence.

Critical review

Verdict

Bottom line

This paper makes a valuable contribution by extending concept erasure robustness evaluation from images to video, introducing a temporal dimension previously overlooked. The demonstration that embedding-level probing can recover erased concepts across multiple T2V architectures and erasure paradigms is convincing and practically important for safety auditing. However, the strong claim that current methods achieve only output-level suppression rather than representational removal rests on reading successful recovery as evidence of intact representations. Alternative explanations, such as manifold proximity to non-erased concepts enabling reconstruction, could account for the same recoveries.

“These findings suggest that current erasure methods achieve output-level suppression rather than representational removal”

Paper · Abstract

“because no model weights are modified, any successful recovery must originate from information already encoded in the frozen parameters”

Paper · Section IV-A
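To make the frozen-parameter argument concrete, here is a minimal toy sketch of embedding-only optimization. It is not the authors' implementation: the two fixed linear maps `FROZEN_W` and `FROZEN_A` are hypothetical stand-ins for a frozen denoiser and a frozen latent encoder, the loss is assumed to take the form reconstruction plus `lam` times alignment, and plain gradient descent replaces the AdamW optimizer reported in the paper.

```python
# Toy illustration only: two fixed linear maps stand in for a frozen T2V model.
# All names here are hypothetical and do not come from the PROBE code release.
FROZEN_W = ((1.0, 0.5), (0.0, 1.0))  # stand-in for the frozen denoiser
FROZEN_A = ((0.5, 0.0), (0.0, 0.5))  # stand-in for a frozen latent encoder


def matvec(m, v):
    """Multiply matrix m (sequence of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sq_norm(v):
    return sum(x * x for x in v)


def optimize_pseudo_token(noise_target, latent_ref, lam=1.0, lr=0.02, steps=500):
    """Gradient descent on the embedding only; the frozen maps are never updated."""
    e = [0.0, 0.0]  # the single trainable object: a 2-d pseudo-token embedding
    w_t, a_t = transpose(FROZEN_W), transpose(FROZEN_A)
    history = []
    for _ in range(steps):
        rec = [p - t for p, t in zip(matvec(FROZEN_W, e), noise_target)]
        ali = [p - t for p, t in zip(matvec(FROZEN_A, e), latent_ref)]
        history.append(sq_norm(rec) + lam * sq_norm(ali))
        # Analytic gradient of ||W e - target||^2 + lam * ||A e - ref||^2 w.r.t. e.
        g_rec, g_ali = matvec(w_t, rec), matvec(a_t, ali)
        grad = [2.0 * gr + 2.0 * lam * ga for gr, ga in zip(g_rec, g_ali)]
        e = [x - lr * g for x, g in zip(e, grad)]
    return e, history


embedding, losses = optimize_pseudo_token([1.0, 2.0], [0.2, 0.9])
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In this toy setting, any drop in the loss comes from moving the embedding through maps that never change, which mirrors the quoted argument. It does not settle the review's concern about whether such recovery implies an intact concept representation.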

What holds up

The multi-level evaluation framework is comprehensive, combining classifier-based detection, CLIP semantic similarity, temporal reactivation curves, and human validation to assess recovery. The empirical finding that erasure robustness correlates with intervention depth, with weight-space unlearning (T2VUnlearning) resisting recovery more strongly than activation steering (SAFREE) or negative prompting (NegPrompt), is consistent across all tested concept categories (objects, NSFW content, celebrities) and architectures (CogVideoX-2B/5B, Wan2.2-5B). The identification of temporal re-emergence is a novel contribution specific to video generation, where frame-level metrics fail to detect progressive concept resurfacing across frames.

“temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics”

Paper · Section I

“erasure robustness correlates with intervention depth: input-level methods are most vulnerable, while weight-space unlearning provides stronger but still incomplete removal”

Paper · Section V-B
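The temporal re-emergence pattern quoted above can be illustrated with a small sketch over per-frame detector scores. The scores, the 0.5 threshold, the four-frame "clean prefix", and all function names are assumptions chosen for illustration; the paper's actual classifiers and curve definitions may differ.

```python
def first_reactivation_frame(scores, threshold=0.5):
    """Index of the first frame whose detector score reaches the threshold, else None."""
    for index, score in enumerate(scores):
        if score >= threshold:
            return index
    return None


def is_temporal_reemergence(scores, threshold=0.5, clean_prefix=4):
    """True when the first `clean_prefix` frames look clean but a later frame does not."""
    first = first_reactivation_frame(scores, threshold)
    return first is not None and first >= clean_prefix


def reactivation_curve(videos, threshold=0.5):
    """For each frame position, the fraction of videos flagged at that position."""
    num_frames = len(videos[0])
    return [
        sum(video[t] >= threshold for video in videos) / len(videos)
        for t in range(num_frames)
    ]


# Hypothetical per-frame scores: suppressed early, resurfacing late.
resurfacing = [0.05, 0.08, 0.10, 0.12, 0.30, 0.55, 0.70, 0.80]
suppressed = [0.10] * 8

mean_score = sum(resurfacing) / len(resurfacing)
print(mean_score < 0.5)                        # averaged score stays under threshold
print(first_reactivation_frame(resurfacing))   # frame index where the concept resurfaces
print(is_temporal_reemergence(resurfacing))
print(reactivation_curve([resurfacing, suppressed]))
```

With these made-up numbers, both an early-frame check and a frame-averaged score would pass the video as clean, while the per-frame curve exposes the late reactivation.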

Main concerns

The interpretation of PROBE results as evidence of representational non-removal conflates recoverability via optimization with the existence of intact concept representations. While the frozen-parameter constraint ensures no new capacity is introduced during probing, successful optimization could exploit geometric properties of the latent space (e.g., proximity to non-erased concepts sharing visual attributes) rather than revealing dormant concept encodings. Human validation in Section VII-C shows discrepancies with automatic metrics for object and identity concepts, suggesting that some 'recovery' may not correspond to perceptually recognizable concepts. The theoretical analysis of temporal re-emergence hypothesizes propagation through temporal attention layers but provides no mechanistic evidence (e.g., attention head analysis) to support this causal claim.

“human-perceived scores for objects occasionally decrease after PROBE, and identity scores decrease across both model scales. These patterns are consistent with known discrepancies between automatic metrics and human judgment”

Paper · Section VII-C

“concept traces that survive erasure in the spatial representation can propagate through temporal attention layers, manifesting as delayed or progressive reactivation patterns”

Paper · Section IV-E

Evidence and comparison

The comparison across three intervention depths (input, activation, weight-space) provides strong evidence for the hierarchy of erasure robustness, aligning with theoretical expectations that deeper interventions should yield more persistent removal. The paper fairly positions its contribution relative to T2I predecessors like Ring-A-Bell and adversarial prompt search methods, noting that PROBE adapts textual inversion for diagnostic measurement rather than personalization and adds latent alignment to prevent co-occurring attribute collapse. The ablation study (Table XV) convincingly demonstrates that the latent alignment term ($\mathcal{L}_{\text{align}}$) consistently improves recovery rates over reconstruction-only probing, validating the design choice to anchor recovery to spatiotemporal structure. However, the comparison with P4D-K (Table X), in which PROBE achieves a higher reactivation rate (28.90% vs. 20.45%) on nudity concepts under T2VUnlearning, is limited to a single configuration, and the claim that discrete prompt search is less suited to T2V models would benefit from broader architectural coverage.

“reconstruction-only probing yields limited recovery across all configurations, while adding latent alignment consistently improves reactivation rates”

Paper · Table XV

“PROBE achieves 28.90%, providing a substantially stronger diagnostic signal [than P4D-K's 20.45%]”

Paper · Table X
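For readers who want the size of the Table X gap in both absolute and relative terms, the arithmetic can be sketched as below. The helper names are hypothetical, and the definition of a reactivation rate as the percentage of probed generations flagged by a detector is an assumption. Only the two percentages come from the review.

```python
def reactivation_rate(num_reactivated, num_total):
    """Percentage of probed generations in which the erased concept is detected."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_reactivated / num_total


def compare_rates(rate_a, rate_b):
    """Return (absolute gap in percentage points, relative gain of a over b)."""
    gap = rate_a - rate_b
    return gap, gap / rate_b


# Rates quoted above for nudity concepts under T2VUnlearning (Table X).
gap_points, relative_gain = compare_rates(28.90, 20.45)
print(f"gap = {gap_points:.2f} points, relative gain = {relative_gain:.1%}")
```

The gap works out to 8.45 percentage points, or roughly a 41% relative gain. Because it comes from a single configuration, it says little about how the two probes compare on other architectures or concepts.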

Reproducibility

The authors release code and report implementation details, including hyperparameters (learning rate 0.02, $\lambda=1$, 5 pseudo-tokens, AdamW optimizer) and hardware requirements (NVIDIA H100). However, reproducibility is hindered by substantial computational requirements: training converges in 1k–3k steps, which takes approximately 3–10 hours per concept on H100 GPUs depending on frame count. The sensitivity analysis shows that results depend on alignment weight $\lambda$ (Table XV), reference data size (Table XVI shows degradation at 200 samples), and pseudo-token count (10 tokens perform worse than 5), indicating that exact reproduction requires careful tuning. The paper uses generated reference videos rather than external datasets, which standardizes the reference distribution but means reproduction requires first generating these from unerased models, adding computational overhead.

“Training typically converges within 1k–3k steps... For CogVideoX-2B... training times of approximately 10, 3, and 3 hours”

Paper · Section V-A

“recovery improves from 10 to 100 reference samples (25.60% → 28.90%) but drops at 200 samples (23.47%), likely due to over-smoothing”

Paper · Section VII-B2
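The settings mentioned in this section can be gathered into one configuration object, sketched below. The class and field names are hypothetical and are not taken from the released code. The default of 100 reference samples is inferred from the quoted sensitivity numbers rather than stated in the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical container for the hyperparameters listed in the review."""
    learning_rate: float = 0.02
    align_weight: float = 1.0        # lambda, weight on the latent alignment term
    num_pseudo_tokens: int = 5       # 10 tokens reportedly perform worse than 5
    optimizer: str = "AdamW"
    max_steps: int = 3000            # upper end of the reported 1k-3k range
    num_reference_videos: int = 100  # assumption: best of the three sizes quoted


# Recovery rates (%) by reference-set size, as quoted from Section VII-B2.
REPORTED_RECOVERY = {10: 25.60, 100: 28.90, 200: 23.47}

config = ProbeConfig()
best_size = max(REPORTED_RECOVERY, key=REPORTED_RECOVERY.get)
print(config)
print(f"best reported reference size: {best_size}")
```

Freezing the dataclass keeps a run's settings immutable, which makes it easier to log exactly which values produced a given reactivation rate.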

Abstract

Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.
