Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos

cs.CV Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger · Mar 22, 2026
AI Review
Plain-language introduction

Video facial expression recognition (FER) suffers from severe subject-specific distribution shifts that degrade CLIP model performance at test time. This paper proposes TTA-CaP, a gradient-free test-time adaptation method that personalizes models using three coordinated caches—a fixed source-domain prototype cache, a dynamic positive target cache for reliable samples, and a negative cache for uncertain predictions—coupled with a tri-gate filtering mechanism to prevent error accumulation.

Critical review
Verdict
Bottom line

TTA-CaP presents a well-engineered solution for subject-level personalization in video FER. The three-cache architecture with tri-gate filtering is novel and demonstrates strong empirical results on BioVid and StressID, achieving 81.0% and 81.5% WAR respectively. However, gains on the BAH dataset are marginal (+0.2 WAR points over T3AL), and performance is inconsistent across subjects: some target subjects degrade significantly compared to baselines (for example, BioVid Sub-3 scores 33.3 versus 43.5 for DPE). While the efficiency advantages over prompt-tuning are clear, the reliance on DBSCAN clustering with per-subject-class hyperparameter selection raises practical deployment concerns.

“TTA-CaP (ours) ... BioVid WAR: 81.0 ... StressID WAR: 81.5 ... BAH WAR: 69.2”
paper · Table 1
“Sub-3 ... TTA-CaP (ours) 33.3 ... DPE 43.5”
paper · Table 2
What holds up

The tri-gate mechanism effectively addresses noisy pseudo-labels in video FER through conservative updates. The mechanism combines temporal stability (majority voting over window $\mathcal{W}$), entropy thresholds ($\tau_h^+, \tau_h^-$), and prototype consistency checks ($\Delta_{\text{proto}} > \tau_\Delta$) to filter unreliable samples. This is validated by the gate pass-rate analysis showing high temporal (84.47%) and entropy (85.0%) pass rates but selective prototype filtering (29.22%), resulting in updates for only 23.87% of frames. The embedding-level fusion $\boldsymbol{z}^{\text{tgt,fuse}}_{t}=\boldsymbol{z}^{\text{tgt}}_{t}+\boldsymbol{z}^{\text{src,tgt}}_{t}+\boldsymbol{z}^{\text{tgt}+}_{t}-\boldsymbol{z}^{\text{tgt}-}_{t}$ preserves CLIP's cosine-similarity geometry while incorporating cached evidence, which is more effective than logit-level fusion for temporal aggregation.

“Temporal 84.47 ... Entropy 85.0 ... Proto 29.22 ... All 23.87 ... Pos 22.75 ... Neg 4.12”
paper · Figure 3 (right)
“if $\tilde{y}^{\mathrm{tgt}}_{t} \neq \mathrm{Maj}(\mathcal{W})$ then continue ... if $h^{\mathrm{tgt}}_{k} < \tau_{h}^{+}$ ... $(c^{\star}=\tilde{y}^{\mathrm{tgt}}_{k}) \& (\Delta_{\text{proto}}>\tau_{\Delta})$”
paper · Algorithm 1
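
The gating and fusion steps described above can be sketched roughly as follows. This is an illustrative reading of the quoted Algorithm 1 fragment, not the authors' implementation. Several details are assumptions that the review does not confirm: entropy is normalized by $\log K$ so that the thresholds 0.5 and 0.8 are meaningful for any number of classes, $\Delta_{\text{proto}}$ is taken to be the margin between the two most similar class prototypes, frames with entropy above $\tau_h^-$ are routed to the negative cache, and the fused embedding is L2-normalized. The names `tri_gate`, `fuse`, and `proto_sims` are hypothetical.

```python
import math
from collections import Counter

# Thresholds as listed in the review's Reproducibility section.
TAU_H_POS, TAU_H_NEG, TAU_DELTA = 0.5, 0.8, 0.05


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so values lie in [0, 1] (assumption)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def tri_gate(pred, window_preds, probs, proto_sims):
    """Return 'positive', 'negative', or 'skip' for one frame.

    pred:         pseudo-label of the current frame
    window_preds: pseudo-labels inside the temporal window W
    probs:        class probabilities for the frame
    proto_sims:   similarity to each class prototype (hypothetical input)
    """
    # Gate 1: temporal stability via majority vote over the window.
    majority = Counter(window_preds).most_common(1)[0][0]
    if pred != majority:
        return "skip"

    h = normalized_entropy(probs)
    if h < TAU_H_POS:
        # Gate 3: the nearest prototype must agree with the pseudo-label
        # and beat the runner-up by more than TAU_DELTA (assumed definition).
        ranked = sorted(range(len(proto_sims)),
                        key=lambda c: proto_sims[c], reverse=True)
        margin = proto_sims[ranked[0]] - proto_sims[ranked[1]]
        if ranked[0] == pred and margin > TAU_DELTA:
            return "positive"
        return "skip"
    if h > TAU_H_NEG:
        # Assumed reading: highly uncertain frames feed the negative cache.
        return "negative"
    return "skip"


def fuse(z_tgt, z_src, z_pos, z_neg):
    """Embedding-level fusion z_tgt + z_src + z_pos - z_neg.

    The final L2 normalization is an assumption, added so the result can be
    compared to text embeddings by cosine similarity.
    """
    fused = [t + s + p - n for t, s, p, n in zip(z_tgt, z_src, z_pos, z_neg)]
    norm = math.sqrt(sum(v * v for v in fused))
    return [v / norm for v in fused] if norm > 0 else fused
```

Under these assumptions, a confident, temporally stable frame such as `tri_gate(0, [0, 0, 1], [0.95, 0.05], [0.9, 0.6])` is admitted to the positive cache, while the same frame with prototype similarities `[0.62, 0.60]` is skipped because the margin falls below $\tau_\Delta$. That selectivity is consistent with the low prototype pass rate quoted from Figure 3.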
Main concerns

First, subject-specific performance is inconsistent: BioVid Subject 3 drops to 33.3 F1 (vs 43.5 for DPE), and Subject 8 remains low at 47.8, indicating the method can harm certain expression patterns. Second, the negative cache mechanism assumes uncertainty correlates with systematic error—storing the least-likely class as negative evidence risks penalizing correct classes when the model is merely uncertain rather than wrong. Third, source cache construction relies on DBSCAN clustering with parameters selected via bootstrap stability analysis per subject-class combination; this complexity is not fully characterized for sensitivity, and the reliance on 'neutral' anchor frames assumes their availability and representativeness. Finally, the paper lacks statistical significance testing (e.g., paired t-tests) to verify that subject-level improvements are systematic rather than driven by high-variance outliers.

“Sub-3 ... 33.3 ... Sub-8 ... 47.8”
paper · Table 2
“DBSCAN depends on two parameters ... selected separately for each subject-class subset ... evaluated using ... stability under bootstrap resampling”
paper · Section 3.1
“negative cache ... stores information from uncertain samples ... negative pseudo-label ... indicating the least-likely class”
paper · Section 3.2
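
The second concern can be made concrete with a toy case. The sketch below assumes, following the quoted description, that the negative pseudo-label is simply the class with the lowest predicted probability. The probabilities and the ground-truth label are made-up placeholders for illustration, not values from the paper, and `negative_pseudo_label` is a hypothetical helper name.

```python
def negative_pseudo_label(probs):
    """Index of the least-likely class, used here as the negative pseudo-label."""
    return min(range(len(probs)), key=lambda c: probs[c])


# Hypothetical, nearly uniform prediction for a 3-class problem.
uncertain_probs = [0.36, 0.33, 0.31]
true_label = 2  # hypothetical ground truth, unknown at test time

neg = negative_pseudo_label(uncertain_probs)
penalizes_true_class = (neg == true_label)
print(neg, penalizes_true_class)  # prints "2 True"
```

When the distribution is this flat, the least-likely class is separated from the most likely one by only a few hundredths, so the negative cache can end up storing evidence against the correct class even though the model is merely uncertain.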
Evidence and comparison

The evidence strongly supports the efficiency claims: TTA-CaP runs at 110 ms per batch with 904 MB of memory, versus 771-900 ms and 2800-3100 MB for prompt-tuning methods (TPT, PromptAlign). Accuracy improvements are substantial on controlled datasets (+4.9 WAR points over T3AL on BioVid, +5.6 WAR points on StressID) but marginal on the more challenging BAH dataset (+0.2 WAR points over T3AL), suggesting diminishing returns under severe appearance variability. Comparisons to cache-based baselines (TDA, DPE, ReTA) are fair and favor TTA-CaP, though the advantage over T3AL, a video-specific TTA method, narrows on BAH. The ablation in Table 6 supports the claim that all three gates contribute, with the full tri-gate achieving 81.0% WAR versus 75.2% for entropy-only gating.

“TTA-CaP ... 110.0 ... 904.0 ... 81.0 ... TPT ... 771.2 ... 2800”
paper · Table 6 (Complexity)
“Entropy only ... 75.2 ... Tri-gate ... 81.0”
paper · Table 6 (Ablation)
“combining static and dynamic caches yields the best result of 81.0 WAR, improving over dynamic-only by +3 WAR”
paper · Section 4.5
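
For readers who want the efficiency and ablation gaps as ratios, the short calculation below uses only the figures quoted from the paper's tables in this review (TTA-CaP versus TPT, and tri-gate versus entropy-only gating). The dictionary keys are illustrative names.

```python
# Figures quoted from the paper's tables in this review.
tta_cap = {"ms_per_batch": 110.0, "memory_mb": 904.0}
tpt = {"ms_per_batch": 771.2, "memory_mb": 2800.0}

speedup = tpt["ms_per_batch"] / tta_cap["ms_per_batch"]
memory_ratio = tpt["memory_mb"] / tta_cap["memory_mb"]
gate_gain = 81.0 - 75.2  # tri-gate WAR minus entropy-only WAR

print(f"{speedup:.1f}x faster, {memory_ratio:.1f}x less memory, "
      f"+{gate_gain:.1f} WAR points")
# prints "7.0x faster, 3.1x less memory, +5.8 WAR points"
```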
Reproducibility

Implementation details are thorough: CLIP ViT-B/32 backbone, temporal window length $\mathcal{W}=3$, thresholds $(\tau_h^+, \tau_h^-, \tau_\Delta) = (0.5, 0.8, 0.05)$, and cache capacities (positive: 5, negative: 4) are specified. However, reproduction faces three barriers: (1) the DBSCAN prototype clustering requires implementing the bootstrap stability criterion with adjusted Rand index (ARI) evaluation described in Section 3.1, which is non-trivial and computationally expensive; (2) the personalized source cache depends on Fréchet distance-based subject selection using 'neutral' anchor frames, requiring knowledge of which frames are neutral; (3) the code is included in the supplementary materials but is not yet public ('will be made public'). Even with that code, independent verification requires the specific cross-subject splits and DBSCAN parameter selection pipelines used for the three datasets.

“temporal window length $\mathcal{W}=3$ ... prototype margin threshold $\tau_{\Delta}=0.05$ ... positive cache capacity is set to 5 with $\tau_{h}^{+}=0.5$”
paper · Section 4.1
“Candidate settings are evaluated using ... stability under bootstrap resampling, quantified by the mean adjusted rand index (ARI)”
paper · Section 3.1
“Our code is included in the supplementary materials and will be made public”
paper · Abstract
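
To give a sense of what the first reproduction barrier involves, here is a minimal sketch of bootstrap stability scored by mean ARI. It is one plausible protocol, not the paper's: the review does not specify how resamples are compared, so this version compares each resample's clustering against a reference clustering of the full set, restricted to the resampled items. The function names, `n_boot`, and `seed` are hypothetical, and `cluster_fn` stands in for a DBSCAN call with one candidate parameter setting (in practice, for example, scikit-learn's `DBSCAN` together with `adjusted_rand_score`). Noise labels are treated here as an ordinary cluster.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same items (1.0 for identical partitions)."""
    n = len(labels_a)  # needs n >= 2
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case: both labelings are one cluster, or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def bootstrap_stability(points, cluster_fn, n_boot=20, seed=0):
    """Mean ARI between a reference clustering and clusterings of resamples."""
    rng = random.Random(seed)
    reference = cluster_fn(points)
    scores = []
    for _ in range(n_boot):
        # Sample indices with replacement, then cluster the resample.
        idx = [rng.randrange(len(points)) for _ in points]
        resampled = cluster_fn([points[i] for i in idx])
        scores.append(adjusted_rand_index([reference[i] for i in idx], resampled))
    return sum(scores) / len(scores)
```

A candidate parameter setting would then be scored with something like `bootstrap_stability(embeddings, cluster_fn)`, and settings with a higher mean ARI preferred. Repeating this for every subject-class subset and every candidate setting is where the computational cost noted above comes from.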
Abstract

Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
