A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis
This paper benchmarks Vision Transformer backbones (ViT-B, ViT-L, ViT-H) within a Local pattern Self-Supervised Auxiliary Task (L-SSAT) framework. The core idea fuses Local Directional Pattern (LDP) texture descriptors with RGB inputs via Masked Autoencoder reconstruction as an auxiliary task to primary face classification. The study addresses whether a unified backbone exists across diverse face analysis tasks including deepfake detection (FaceForensics++), attribute prediction (CelebA), and emotion recognition (AffectNet).
The paper investigates a legitimate architectural question about backbone capacity scaling in multi-task face analysis, but suffers from significant numerical inconsistencies and overstated claims. While the finding that optimal backbone depth is task-dependent (ViT-H for deepfakes/emotion, ViT-B for attributes) is practically useful, the analysis of why this occurs remains superficial. The work is undermined by discrepancies between abstract claims and tabulated results, and by a lack of comparison against external state-of-the-art baselines.
The systematic comparison of five masking/reconstruction configurations (e.g., MLDP/RRGB/CRGB vs MRGB/RLDP/CRGB) across three distinct face analysis datasets provides empirical evidence that the interaction between backbone capacity and task type is non-trivial. The observation that "Larger backbones, like ViT-H, are better at distinguishing between different types of tasks, such as identifying deepfakes" while smaller models suffice for attribute prediction gives practitioners concrete guidance when weighing efficiency against accuracy.
Severe numerical inconsistencies undermine credibility: the abstract aggregates peak accuracies (0.94 FF++, 0.87 CelebA, 0.88 AffectNet) as if achieved by a single configuration, yet the tables reveal these require different backbones (ViT-H for FF++, ViT-B for CelebA and AffectNet). The claim that "ViT-L and ViT-H backbones demonstrated better consistency" for CelebA directly contradicts Table 2, where ViT-L collapses to 0.55 average accuracy versus ViT-B's 0.85.
Methodological gaps persist: the LDP extraction pipeline lacks implementation details (are the descriptors computed differentiably inside the network or pre-computed offline?), the loss weighting $\lambda=0.1$ heavily favors reconstruction over classification without justification, and no statistical significance testing is performed. The study also applies VideoMAE, a video-specific model, to still-image tasks (CelebA, AffectNet) without clarifying how the temporal dimension is handled for single-frame inputs.
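To make the "pre-computed" reading of the LDP pipeline concrete, here is a minimal sketch of the standard LDP code: eight Kirsch compass responses per pixel, with the k largest absolute responses set to 1. The choices of k = 3, edge padding, and bit ordering are assumptions, as are the names `kirsch_masks` and `ldp_codes`; the paper does not confirm any of them.

```python
import numpy as np

def kirsch_masks():
    """Return the 8 Kirsch compass masks (3x3, centre 0) as an (8, 3, 3) array."""
    # Border cells of a 3x3 window, walked clockwise from the top-left corner.
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    base = [5, 5, 5, -3, -3, -3, -3, -3]  # three adjacent 5s, five -3s
    masks = []
    for shift in range(8):
        m = np.zeros((3, 3), dtype=np.int32)
        for idx, (r, c) in enumerate(ring):
            m[r, c] = base[(idx - shift) % 8]
        masks.append(m)
    return np.stack(masks)

def ldp_codes(gray, k=3):
    """Compute an 8-bit LDP code per pixel of a 2-D grayscale image.

    Assumptions (not from the paper): k=3, edge padding, and bit i
    corresponding to the i-th mask returned by kirsch_masks().
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    responses = np.empty((8, h, w))
    for i, mask in enumerate(kirsch_masks()):
        acc = np.zeros((h, w))
        for dr in range(3):
            for dc in range(3):
                acc += mask[dr, dc] * padded[dr:dr + h, dc:dc + w]
        responses[i] = np.abs(acc)
    # Indices of the k strongest directions per pixel (ties broken by mask order).
    top = np.argsort(-responses, axis=0, kind="stable")[:k]
    codes = np.zeros((h, w), dtype=np.uint8)
    for j in range(k):
        codes |= (1 << top[j]).astype(np.uint8)
    return codes
```

The top-k selection and bit packing are not differentiable, so a pipeline of this form would have to run offline or as a fixed preprocessing step. A differentiable variant would need a soft relaxation, which the paper would have to describe explicitly.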
The evidence supports internal comparisons within the L-SSAT framework but fails to establish external validity. The ROC curves (Figures 3-5) visually demonstrate backbone ranking but lack confidence intervals or AUC values for statistical comparison. Crucially, the paper omits comparison against standard MAE baselines or contemporary deepfake/attribute recognition methods (e.g., Face X-ray, EfficientNet-based detectors), making it impossible to assess whether the proposed texture integration provides absolute gains or merely scales with model capacity.
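As an illustration of the kind of statistical reporting requested here, the following is a minimal sketch of a percentile-bootstrap confidence interval for binary AUC, as would apply to real/fake deepfake scores. The function names, the number of resamples, and the 95% level are assumptions chosen for illustration. Multi-class tasks such as AffectNet would need a one-vs-rest extension that is not shown.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    n = labels.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue  # resample contains a single class; AUC is undefined
        stats.append(auc_score(labels[idx], scores[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc_score(labels, scores), float(lower), float(upper)
```

Reporting such intervals per backbone would show whether the visual ranking in the ROC figures reflects a real difference or falls within resampling noise.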
Reproducibility is hampered by missing implementation details and restricted code availability. While hyperparameters are specified (batch size 8, SGD with a cosine learning-rate schedule from $5\times10^{-5}$ to $10^{-6}$, 75 epochs, masking ratio 0.75), critical gaps remain: the LDP feature extraction code is not public, preprocessing pipelines for texture descriptors are unspecified, and the AffectNet test split ("randomly split 50% of the validation set") lacks random seed documentation. The authors state that "code generated in this study are available from the corresponding author on reasonable request" rather than providing a public repository, which blocks independent verification and falls short of current open science standards.
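A documented split of this kind is cheap to provide. The sketch below shows one way a seeded 50/50 partition of a validation set could be specified so that others can regenerate it. The function name, the seed value, and the use of NumPy's `default_rng` are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def split_validation(n_samples, test_fraction=0.5, seed=0):
    """Deterministically partition indices 0..n_samples-1 into (val, test)."""
    rng = np.random.default_rng(seed)      # fixed seed makes the split repeatable
    perm = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    test_idx = np.sort(perm[:n_test])
    val_idx = np.sort(perm[n_test:])
    return val_idx, test_idx
```

Publishing the seed together with the resulting index lists would remove the ambiguity even if the underlying random number generator changes between library versions.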
In this work, we benchmark with different backbones and study their impact for self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address the three research questions: "What is the role of the backbone in performance L-SSAT?", "What type of backbone is effective for different face analysis tasks?", and "Is there any generalized backbone for effective face analysis with L-SSAT?". Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone.