PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models
Text-to-video concept erasure methods claim to remove sensitive content, but current evaluation checks only whether the concept is absent from generated frames. PROBE introduces a diagnostic protocol that optimizes a pseudo-token embedding with frozen model weights to test whether erased concepts can be reactivated. By probing residual capacity across three architectures and three erasure strategies, the authors find that all tested methods leave measurable residual capacity and identify temporal re-emergence, a video-specific failure mode in which concepts suppressed in early frames resurface later in the sequence.
This paper makes a valuable contribution by extending concept erasure robustness evaluation from images to video, introducing a temporal dimension previously overlooked. The demonstration that embedding-level probing can recover erased concepts across multiple T2V architectures and erasure paradigms is convincing and practically important for safety auditing. However, the strong claim that current methods achieve only output-level suppression rather than representational removal relies on interpreting recovery success as evidence of intact representations, although alternative explanations (such as manifold proximity enabling reconstruction) could account for the same results.
The multi-level evaluation framework is comprehensive, combining classifier-based detection, CLIP semantic similarity, temporal reactivation curves, and human validation to robustly assess recovery. The empirical finding that erasure robustness correlates with intervention depth, with weight-space unlearning (T2VUnlearning) showing stronger resistance than activation steering (SAFREE) or negative prompting (NegPrompt), is consistent across all tested concept categories (objects, NSFW content, celebrities) and architectures (CogVideoX-2B/5B, Wan2.2-5B). The identification of temporal re-emergence represents a genuinely novel contribution specific to video generation, where frame-level metrics fail to detect progressive concept resurfacing across frames.
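To make the point about frame-level metrics concrete, the following is a minimal sketch of how a temporal reactivation curve might be computed from per-frame detector scores. The paper's exact definition is not reproduced in this review, so the function names, the 0.5 threshold, and the scores themselves are illustrative assumptions rather than the authors' data or code.

```python
# Hypothetical per-frame detector scores: SCORES[v][t] is the score for
# video v at frame t. These numbers are made up for illustration only.
SCORES = [
    [0.1, 0.1, 0.2, 0.2, 0.3, 0.6, 0.8, 0.9],
    [0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.7, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3],
    [0.2, 0.2, 0.1, 0.1, 0.1, 0.4, 0.6, 0.8],
]


def reactivation_curve(scores, threshold=0.5):
    """Fraction of videos whose detector score meets the threshold at each frame."""
    n_frames = len(scores[0])
    return [
        sum(1 for video in scores if video[t] >= threshold) / len(scores)
        for t in range(n_frames)
    ]


def pooled_frame_rate(scores, threshold=0.5):
    """Single frame-level number: fraction of all frames flagged, ignoring time."""
    flags = [s >= threshold for video in scores for s in video]
    return sum(flags) / len(flags)


def first_reactivation_frame(video_scores, threshold=0.5):
    """Index of the first flagged frame, or None if the concept never appears."""
    for t, s in enumerate(video_scores):
        if s >= threshold:
            return t
    return None


curve = reactivation_curve(SCORES)
print(curve)                                  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.75, 0.75]
print(pooled_frame_rate(SCORES))              # 0.21875
print(first_reactivation_frame(SCORES[0]))    # 5
```

In this toy data the pooled frame-level rate is about 22%, while the curve shows that 75% of videos contain the concept by the final frames, which is the kind of gap the temporal analysis is meant to expose.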
The interpretation of PROBE results as evidence of representational non-removal conflates recoverability via optimization with the existence of intact concept representations. While the frozen-parameter constraint ensures no new capacity is introduced during probing, successful optimization could exploit geometric properties of the latent space (e.g., proximity to non-erased concepts sharing visual attributes) rather than revealing dormant concept encodings. Human validation in Section VII-C shows discrepancies with automatic metrics for object and identity concepts, suggesting that some 'recovery' may not correspond to perceptually recognizable concepts. The theoretical analysis of temporal re-emergence hypothesizes propagation through temporal attention layers but provides no mechanistic evidence (e.g., attention head analysis) to support this causal claim.
The comparison across three intervention depths (input, activation, weight-space) provides strong evidence for the hierarchy of erasure robustness, aligning with theoretical expectations that deeper interventions should yield more persistent removal. The paper fairly positions its contribution relative to T2I predecessors like Ring-A-Bell and adversarial prompt search methods, noting that PROBE adapts textual inversion for diagnostic measurement rather than personalization and adds latent alignment to prevent co-occurring attribute collapse. The ablation study (Table XV) convincingly demonstrates that the latent alignment term ($\mathcal{L}_{\text{align}}$) consistently improves recovery rates over reconstruction-only probing, validating the design choice to anchor recovery to spatiotemporal structure. However, the comparison with P4D-K (Table X) showing PROBE achieves higher reactivation rates (28.90% vs 20.45%) on nudity concepts under T2VUnlearning is limited to a single configuration, and the claim that discrete prompt search is less suited to T2V models would benefit from broader architectural coverage.
The authors release code and report implementation details, including hyperparameters (learning rate 0.02, $\lambda=1$, 5 pseudo-tokens, AdamW optimizer) and hardware requirements (NVIDIA H100). However, reproducibility is hindered by substantial computational requirements: training converges in 1k–3k steps, which takes approximately 3–10 hours per concept on H100 GPUs depending on frame count. The sensitivity analysis shows that results depend on alignment weight $\lambda$ (Table XV), reference data size (Table XVI shows degradation at 200 samples), and pseudo-token count (10 tokens perform worse than 5), indicating that exact reproduction requires careful tuning. The paper uses generated reference videos rather than external datasets, which standardizes the reference distribution but means reproduction requires first generating these from unerased models, adding computational overhead.
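As a rough illustration of the reproduction cost, the sketch below collects the reported hyperparameters in a config object and multiplies out a GPU-hour range. The class and function names are hypothetical, and the calculation assumes that every (concept, architecture, erasure method) combination needs its own probing run at the reported 3–10 hours; the paper may structure its runs differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hyperparameters as reported in the review; field names are illustrative."""
    learning_rate: float = 0.02
    align_weight: float = 1.0          # lambda, weight on the alignment term
    num_pseudo_tokens: int = 5
    optimizer: str = "AdamW"
    min_hours_per_run: float = 3.0     # reported lower bound per concept on H100
    max_hours_per_run: float = 10.0    # reported upper bound per concept on H100


def gpu_hour_range(cfg, n_concepts, n_models, n_methods):
    """Lower and upper GPU-hour estimate, assuming one run per combination."""
    runs = n_concepts * n_models * n_methods
    return runs * cfg.min_hours_per_run, runs * cfg.max_hours_per_run


# Hypothetical audit: 3 concepts, 3 architectures, 3 erasure methods = 27 runs.
low, high = gpu_hour_range(ProbeConfig(), n_concepts=3, n_models=3, n_methods=3)
print(low, high)  # 81.0 270.0
```

Under these assumptions even a small audit falls between 81 and 270 H100 hours, before counting the cost of generating reference videos from the unerased models.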
Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the \textit{reactivation potential} of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.
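The probing idea described in the abstract can be sketched with a deliberately small toy. This is not the authors' implementation: the frozen "model" is replaced by two fixed linear maps, both loss terms are assumed to be squared errors, plain gradient descent stands in for AdamW, and all names and numbers are hypothetical. It only illustrates the structure of the objective, $\mathcal{L}_{\text{rec}} + \lambda \mathcal{L}_{\text{align}}$, minimized over the pseudo-token embedding alone.

```python
# Frozen stand-ins for the erased model (never updated below).
W = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.3, 0.7]]  # toy noise predictor
A = [[0.5, 0.0, 0.5], [0.1, 0.9, 0.0]]                    # toy latent projector
EPS_TARGET = [0.3, -0.2, 0.5]  # made-up noise target for the reconstruction term
Z_REF = [0.4, -0.1]            # made-up reference latent for the alignment term


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def sq_err(pred, target):
    """Sum of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def probe_loss(e, lam):
    """Assumed objective: reconstruction error plus lam times alignment error."""
    return sq_err(matvec(W, e), EPS_TARGET) + lam * sq_err(matvec(A, e), Z_REF)


def probe_grad(e, lam):
    """Analytic gradient of probe_loss with respect to the embedding e only."""
    r_rec = [p - t for p, t in zip(matvec(W, e), EPS_TARGET)]
    r_align = [p - t for p, t in zip(matvec(A, e), Z_REF)]
    grad = []
    for j in range(len(e)):
        g = 2 * sum(W[i][j] * r_rec[i] for i in range(len(W)))
        g += 2 * lam * sum(A[i][j] * r_align[i] for i in range(len(A)))
        grad.append(g)
    return grad


def optimize_pseudo_token(steps=1000, lr=0.02, lam=1.0):
    """Gradient descent on the embedding; W and A stay untouched."""
    e = [0.0, 0.0, 0.0]
    history = [probe_loss(e, lam)]
    for _ in range(steps):
        g = probe_grad(e, lam)
        e = [x - lr * gx for x, gx in zip(e, g)]
        history.append(probe_loss(e, lam))
    return e, history
```

Calling `optimize_pseudo_token()` returns the learned embedding and the loss history. Because only `e` is updated, any drop in the loss comes from the input side rather than from new model capacity, which mirrors the frozen-parameter argument the paper relies on.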