CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

cs.CV Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu · Mar 23, 2026
Plain-language introduction

Video subtitle removal traditionally requires expensive per-frame mask annotations and external detection modules during both training and inference. CLEAR introduces a two-stage mask-free framework that decouples prior extraction (via self-supervised disentangled feature learning) from generative refinement (via LoRA-adapted diffusion with adaptive weighting). The method claims to train only 0.77% of base model parameters while achieving +6.77 dB PSNR gains and zero-shot generalization across six languages without ground-truth masks at inference.

Critical review
Verdict
Bottom line

CLEAR presents a technically coherent solution to mask-free video subtitle removal through its two-stage design (prior extraction + adaptive diffusion refinement). The core innovation—using an occlusion head that learns adaptive weighting through generation feedback rather than explicit mask prediction—enables genuine end-to-end inference. However, the strong quantitative results rely on a large private dataset of 160,000 video pairs, and the zero-shot cross-lingual claims rest primarily on qualitative visualization rather than systematic quantitative benchmarking across languages.

“our method only requires 0.77% of the parameters of the base diffusion model for training”
Paper · Abstract
“we collect over 160,000 pairs of data for model training, including subtitled videos with Chinese fonts of various styles and sizes”
Paper · Section 4.1
What holds up

The two-stage training strategy is well-motivated: Stage I uses pixel-difference pseudo-labels with orthogonality constraints ($\mathcal{L}_{\text{ortho}}$) and adversarial purification to separate subtitle from content features without manual masks, while Stage II internalizes adaptive weighting into LoRA-augmented attention via the occlusion head $\mathcal{H}$. The parameter efficiency claim (0.77% trainable parameters) is specific and verifiable given the base Wan2.1-Fun 1.3B model and rank-64 LoRA configuration. Table 2's ablation demonstrates that each component (prior learning, context distillation, generation feedback) contributes measurably to the final performance, with the full system achieving 26.80 dB PSNR versus 21.62 dB for the LoRA-only baseline, a gain of 5.18 dB.

“We employ dual encoders to separate subtitle-specific and content-specific features ... To enforce disentanglement, we minimize feature correlation through orthogonality constraint”
Paper · Section 3.2.2
“Baseline (LoRA-only) PSNR 21.62 ... + M4: Context Consistency (CLEAR) PSNR 26.80”
Paper · Table 2
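
To make the disentanglement idea concrete, here is a minimal sketch of one common way to penalize correlation between two feature sets: the mean squared cosine similarity between paired subtitle and content vectors. This is an assumption for illustration only. The review does not give the exact form of $\mathcal{L}_{\text{ortho}}$, and the names `cosine` and `ortho_penalty` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero

def ortho_penalty(subtitle_feats, content_feats):
    """Mean squared cosine similarity over paired feature vectors.

    Hypothetical stand-in for an orthogonality loss: it is 0 when every
    subtitle vector is orthogonal to its content vector and approaches 1
    when the pairs are parallel.
    """
    sims = [cosine(s, c) ** 2 for s, c in zip(subtitle_feats, content_feats)]
    return sum(sims) / len(sims)

aligned = ortho_penalty([[1.0, 0.0]], [[2.0, 0.0]])     # parallel pair
orthogonal = ortho_penalty([[1.0, 0.0]], [[0.0, 3.0]])  # orthogonal pair
print(aligned, orthogonal)
```

Minimizing a penalty of this kind pushes the two encoders toward non-overlapping directions in feature space, which is the behavior the quoted passage describes.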
Main concerns

The primary limitation is data availability and reproducibility: the 160,000 video pairs are custom-collected and not publicly available, so the headline numbers cannot be independently verified. The pixel-difference pseudo-labeling (Equation 2: $\hat{\mathbf{M}}_{t}(i,j)=1$ if $\Delta_t(i,j) > \mu_t + \sigma_t$) assumes subtitles are visually distinct from the background, an assumption that fails for semi-transparent text, motion-blurred regions, or subtitles whose colors match the scene. While the paper emphasizes 'mask-free inference,' the method still requires paired training data (subtitled + clean), which is often harder to obtain than the binary masks that existing inpainting methods need. Additionally, the zero-shot generalization to six languages (Figure 1) is not backed by quantitative metrics: Table 1 reports results only on Chinese subtitles, leaving open how much performance degrades on the Latin, Cyrillic, Hangul, and kana scripts of the six target languages or on complex typography.

“These labels are intentionally noisy due to lighting changes, semi-transparent subtitles, and motion blur at boundaries”
Paper · Section 3.2.1
“CLEAR, trained on Chinese video subtitle data ... demonstrates excellent zero-shot generalization capabilities to other languages”
Paper · Section 4.2
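
The thresholding rule in Equation 2 can be sketched on a toy frame to show where it misses pixels. The sketch assumes that $\Delta_t$ is the absolute per-pixel difference between the subtitled and clean frame and that $\mu_t$ and $\sigma_t$ are the mean and (population) standard deviation of that difference map. Those readings, the function name `pseudo_mask`, and the pixel values are illustrative assumptions, not details confirmed by the paper.

```python
from statistics import mean, pstdev

def pseudo_mask(subtitled, clean):
    """Binary pseudo-label: 1 where the pixel difference exceeds mean + std."""
    diff = [[abs(s - c) for s, c in zip(row_s, row_c)]
            for row_s, row_c in zip(subtitled, clean)]
    flat = [d for row in diff for d in row]
    threshold = mean(flat) + pstdev(flat)  # assumed per-frame statistics
    return [[1 if d > threshold else 0 for d in row] for row in diff]

# Toy 4x4 grayscale frames (values are made up for illustration).
clean = [[100] * 4 for _ in range(4)]

# Opaque white stroke on the bottom row: large difference, easy to label.
opaque = [row[:] for row in clean]
opaque[3] = [100, 255, 255, 100]

# Same stroke plus a faint semi-transparent halo on the row above it.
mixed = [row[:] for row in opaque]
mixed[2] = [100, 140, 140, 100]

mask_opaque = pseudo_mask(opaque, clean)
mask_mixed = pseudo_mask(mixed, clean)
print(mask_opaque[3], mask_mixed[2], mask_mixed[3])
```

In this toy case the opaque stroke is labeled, but the faint halo falls below the frame-level threshold and is labeled as background, which is the kind of label noise the review's concern refers to.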
Evidence and comparison

The comparison to baselines (ProPainter, MiniMax-Remover, DiffuEraser) is fair in one respect: all baselines receive identical binary masks generated via the same thresholding procedure, so the performance gap cannot be attributed to better input masks. The evaluation protocol is comprehensive, spanning reconstruction (PSNR, SSIM), perceptual (LPIPS, DISTS, VFID), and temporal consistency metrics (TWE, TC). However, the comparison conflates architectural differences with input-modality differences. CLEAR receives the full subtitled video while the baselines receive explicit masks, so the results show that internal adaptive weighting outperforms external binary masking, not necessarily that CLEAR would outperform these architectures if all used the same inputs. The large VFID reduction (-74.7% vs. the best baseline) suggests strong perceptual quality, but without human evaluation or user studies, the subjective 'clean removal' claims remain unvalidated.

“For comparison, we provide identical binary masks generated via the same thresholding procedure used in CLEAR's Stage I training”
Paper · Section 4.1
“CLEAR ... VFID 20.37 vs MiniMax-Remover 95.39”
Paper · Table 1
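
For readers less familiar with the reconstruction metric, the following sketch computes PSNR from its standard definition, $10\log_{10}(\text{MAX}^2/\text{MSE})$, and converts a dB gain into the equivalent reduction in mean squared error. The pixel values are made up, and the conversion holds exactly only for a single comparison at the same peak value; PSNR averaged over a test set need not translate one-to-one.

```python
import math

def psnr(ref, out, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

ref = [100.0, 120.0, 140.0, 160.0]
out = [104.0, 116.0, 144.0, 156.0]  # every pixel off by 4, so MSE = 16

score = psnr(ref, out)
ratio = mse_ratio(6.77)  # the gain reported against the baselines
print(round(score, 2), round(ratio, 2))
```

Under that simplification, a +6.77 dB gain corresponds to an MSE roughly 4.75 times lower than the baseline's.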
Reproducibility

The authors make code available via GitHub, and Appendix B contains detailed hyperparameters: LoRA rank 64 applied to all attention and FFN layers, the AdamW optimizer with learning rates of $2\times 10^{-5}$ (Stage I) and $1\times 10^{-4}$ (Stage II), and resolution-adaptive training at 1280×720. However, reproduction is severely hindered by the reliance on a private dataset (160,000 Chinese subtitle pairs). The base model, Wan2.1-Fun-V1.1-1.3B, is publicly available, but training requires 8 GPUs and approximately 1 day for Stage II to converge. Critical missing details include the exact data synthesis pipeline for the 100 green-screen pairs, the font distributions in the training set, and whether the 400 test samples are held out from the same distribution or drawn from distinct content. The occlusion head architecture (2.1M parameters) is specified as two convolutional layers, but the exact layer dimensions and activation patterns beyond Equation 13 are not provided.

“We set the LoRA rank to 64 and apply low-rank decomposition ... to all attention components (q, k, v, o) and feed-forward network layers”
Paper · Appendix B
“taking approximately 1 day for full convergence on our 8-GPU setup”
Paper · Appendix B
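
A reader who wants to sanity-check the 0.77% figure can start from the standard LoRA parameter count, $r(d_{\text{in}} + d_{\text{out}})$ per adapted linear layer. The sketch below computes the trainable-parameter budget implied by 0.77% of a nominal 1.3B-parameter base and the cost of one rank-64 adapter. The hidden width `HIDDEN` is a hypothetical placeholder, not a value taken from the paper, and whether the 0.77% figure includes the 2.1M-parameter occlusion head is not stated in this review.

```python
def lora_params(d_in, d_out, rank):
    """Trainable weights a rank-r LoRA adapter adds to one d_in x d_out linear layer.

    LoRA factors the update as B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), hence rank * (d_in + d_out) weights.
    """
    return rank * (d_in + d_out)

BASE_PARAMS = 1_300_000_000          # nominal "1.3B" size of the base model
budget = BASE_PARAMS * 77 // 10_000  # 0.77% of the base, in integer arithmetic

HIDDEN = 1024                        # hypothetical width, NOT from the paper
per_projection = lora_params(HIDDEN, HIDDEN, rank=64)

print(budget)                    # implied trainable-parameter budget
print(per_projection)            # cost of one square rank-64 adapter
print(budget // per_projection)  # how many such adapters fit in the budget
```

Substituting the real layer shapes of the base model and summing over every adapted projection (q, k, v, o, and the feed-forward layers) gives a total that can be compared against the implied budget of about 10 million parameters.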
Abstract

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
