A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis

cs.CV Shukesh Reddy, Abhijit Das · Mar 23, 2026

AI Review

Plain-language introduction

This paper benchmarks Vision Transformer backbones (ViT-B, ViT-L, ViT-H) within a Local pattern Self-Supervised Auxiliary Task (L-SSAT) framework. The core idea fuses Local Directional Pattern (LDP) texture descriptors with RGB inputs via Masked Autoencoder reconstruction as an auxiliary task to primary face classification. The study addresses whether a unified backbone exists across diverse face analysis tasks including deepfake detection (FaceForensics++), attribute prediction (CelebA), and emotion recognition (AffectNet).
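
To make the texture descriptor concrete, the following is a minimal sketch of the classic LDP formulation: eight Kirsch compass responses per pixel, with the k strongest directions (by absolute value) set to 1 in an 8-bit code. The function name `ldp_codes`, the default `k=3`, the tie-breaking rule (lowest direction index first), and the offline NumPy computation are all assumptions for illustration; the review does not state which LDP variant or extraction pipeline the paper uses.

```python
import numpy as np

# Kirsch compass masks, one per direction; bit d of the code maps to mask d.
KIRSCH = np.array([
    [[-3, -3, 5], [-3, 0, 5], [-3, -3, 5]],    # east
    [[-3, 5, 5], [-3, 0, 5], [-3, -3, -3]],    # north-east
    [[5, 5, 5], [-3, 0, -3], [-3, -3, -3]],    # north
    [[5, 5, -3], [5, 0, -3], [-3, -3, -3]],    # north-west
    [[5, -3, -3], [5, 0, -3], [5, -3, -3]],    # west
    [[-3, -3, -3], [5, 0, -3], [5, 5, -3]],    # south-west
    [[-3, -3, -3], [-3, 0, -3], [5, 5, 5]],    # south
    [[-3, -3, -3], [-3, 0, 5], [-3, 5, 5]],    # south-east
], dtype=np.int32)


def ldp_codes(gray, k=3):
    """Return the LDP code image for a 2-D grayscale array (valid region only).

    Hypothetical helper: pre-computed and non-differentiable by construction.
    """
    gray = np.asarray(gray, dtype=np.int32)
    h, w = gray.shape
    # Correlate the image with each mask over the (h-2) x (w-2) interior.
    responses = np.zeros((8, h - 2, w - 2), dtype=np.int32)
    for d in range(8):
        for dy in range(3):
            for dx in range(3):
                responses[d] += KIRSCH[d, dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    # Rank directions by absolute response; stable sort breaks ties by index.
    order = np.argsort(-np.abs(responses), axis=0, kind="stable")
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for rank in range(k):
        codes |= (1 << order[rank]).astype(np.uint8)
    return codes
```

Every code produced this way has exactly k bits set, so with k=3 only 56 of the 256 possible byte values can occur. A reconstruction target built from such codes is therefore much more constrained than raw RGB.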

Critical review

Verdict

Bottom line

The paper investigates a legitimate architectural question about backbone capacity scaling in multi-task face analysis, but suffers from significant numerical inconsistencies and overstated claims. While the finding that optimal backbone depth is task-dependent (ViT-H for deepfakes/emotion, ViT-B for attributes) is practically useful, the analysis of why this occurs remains superficial. The work is undermined by discrepancies between abstract claims and tabulated results, and by a lack of comparison against external state-of-the-art baselines.

What holds up

The systematic comparison of five masking/reconstruction configurations (e.g., MLDP/RRGB/CRGB vs MRGB/RLDP/CRGB) across three distinct face analysis datasets provides empirical evidence that the interaction between backbone capacity and task type is non-trivial. The observation that "Larger backbones, like ViT-H, are better at distinguishing between different types of tasks, such as identifying deepfakes" while smaller models suffice for attribute prediction is a valuable contribution for practitioners selecting efficiency-accuracy trade-offs.

“For the FF++ dataset, the ViT-H variant consistently exhibited the highest detection accuracy... The ViT-B configuration already achieved strong performance (average accuracy of 0.85) across six binary facial attributes.”

paper · Section 4.3

Main concerns

Severe numerical inconsistencies undermine credibility: the abstract aggregates peak accuracies (0.94 FF++, 0.87 CelebA, 0.88 AffectNet) as if achieved by a single configuration, yet the tables reveal these require different backbones (ViT-H for FF++, ViT-B for CelebA and AffectNet). The claim that "ViT-L and ViT-H backbones demonstrated better consistency" for CelebA directly contradicts Table 2, where ViT-L collapses to 0.55 average accuracy versus ViT-B's 0.85.

Methodological gaps persist: the LDP extraction pipeline lacks implementation details (is LDP computed differentiably inside the network or pre-computed offline?), the loss weighting $\lambda=0.1$ heavily favors reconstruction over classification without justification, and no statistical significance testing is performed. The study also applies VideoMAE, a video-specific backbone, to still-image tasks (CelebA, AffectNet) without clarifying how the temporal dimension is handled.
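
How strongly $\lambda=0.1$ tilts the objective depends on where $\lambda$ sits in the combined loss, which is why the exact formulation matters. The sketch below compares two common conventions; both are assumptions for illustration, and the function name `effective_weights` and the convention labels are hypothetical rather than taken from the paper.

```python
def effective_weights(lam, convention):
    """Return (classification_weight, reconstruction_weight), normalised to sum to 1.

    Two assumed conventions for combining a primary and an auxiliary loss:
      "lambda_on_reconstruction":  L = L_cls + lam * L_rec
      "lambda_on_classification":  L = lam * L_cls + (1 - lam) * L_rec
    """
    if convention == "lambda_on_reconstruction":
        w_cls, w_rec = 1.0, lam
    elif convention == "lambda_on_classification":
        w_cls, w_rec = lam, 1.0 - lam
    else:
        raise ValueError(f"unknown convention: {convention}")
    total = w_cls + w_rec
    return w_cls / total, w_rec / total


# The same lambda yields opposite emphases under the two conventions.
for name in ("lambda_on_reconstruction", "lambda_on_classification"):
    w_cls, w_rec = effective_weights(0.1, name)
    print(f"{name}: classification={w_cls:.3f}, reconstruction={w_rec:.3f}")
```

Under the first convention reconstruction carries roughly 9% of the weight; under the second it carries 90%. These shares ignore the raw magnitudes of the two loss terms, which can shift the balance further.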

“achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet”

paper · Abstract

“On the CelebA attribute classification task... ViT-L and ViT-H backbones demonstrated better consistency... with average accuracies of approximately 0.78-0.80”

paper · Section 4.3

“VideoMAE/ViT-L ... 0.55 (avg for MRGB,R{LDP,RGB},CRGB)”

paper · Table 2

Evidence and comparison

The evidence supports internal comparisons within the L-SSAT framework but fails to establish external validity. The ROC curves (Figures 3-5) visually demonstrate backbone ranking but lack confidence intervals or AUC values for statistical comparison. Crucially, the paper omits comparison against standard MAE baselines or contemporary deepfake/attribute recognition methods (e.g., Face X-ray, EfficientNet-based detectors), making it impossible to assess whether the proposed texture integration provides absolute gains or merely scales with model capacity.

“The ROC/AUC curves corroborate these findings by demonstrating enhanced separability for deeper backbones in forgery detection.”

paper · Section 4.3
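
One way the missing AUC values and confidence intervals could be reported is a percentile bootstrap over test samples. The sketch below is a generic illustration, not the paper's evaluation code: the function names, the 1000 resamples, the 95% level, and the seed are all assumptions.

```python
import numpy as np


def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]  # all positive/negative score pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (auc, lower, upper) using a percentile bootstrap over samples."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    point = auc_score(labels, scores)  # also validates that both classes exist
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample with replacement
        if labels[idx].all() or not labels[idx].any():
            continue  # resample lacked one class; draw again
        stats.append(auc_score(labels[idx], scores[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return point, float(lower), float(upper)
```

Reporting such intervals per backbone would show whether the visual ranking in the ROC curves survives sampling noise. For comparing two backbones on the same test set, a paired bootstrap of the AUC difference would be the more direct test.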

Reproducibility

Reproducibility is hampered by missing implementation details and restricted code availability. While hyperparameters are specified (batch size 8, SGD with cosine schedule $5\times10^{-5}$ to $10^{-6}$, 75 epochs, masking ratio 0.75), critical gaps remain: the LDP feature extraction code is not public, preprocessing pipelines for texture descriptors are unspecified, and the AffectNet test split ("randomly split 50% of the validation set") lacks random seed documentation. The authors state that "code generated in this study are available from the corresponding author on reasonable request" rather than providing a public repository, which blocks independent verification and contradicts modern open science standards.

“The learning rate is established by a cosine schedule that commences at 0.00005 and aims to achieve a minimum learning rate of 1e-6... The various loss terms are equalized using a Lambda $\lambda$ value of 0.1”

paper · Section 4.2

“Processed data or code generated in this study are available from the corresponding author on reasonable request.”

paper · Data Availability Statement
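
The stated schedule and the undocumented split can both be pinned down in a few lines. The sketch below assumes a per-epoch cosine decay without warm-up that reaches its minimum at the final epoch, and it uses a hypothetical seed of 42 for the split. None of these details are confirmed by the paper, and the function names are illustrative.

```python
import math
import random


def cosine_lr(epoch, total_epochs=75, lr_max=5e-5, lr_min=1e-6):
    """Cosine-annealed learning rate for a 0-indexed epoch (assumed: no warm-up)."""
    progress = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def split_half(sample_ids, seed=42):
    """Deterministically split ids into two halves, e.g. validation and test.

    The seed value is a placeholder; the point is that it is recorded at all.
    """
    ids = sorted(sample_ids)  # fix the input order before shuffling
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]


# Learning rate at the first, middle, and last epoch of a 75-epoch run.
for epoch in (0, 37, 74):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

Publishing the seed, or better the resulting list of test-set file names, would let others evaluate on exactly the same AffectNet samples.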

Abstract

In this work, we benchmark with different backbones and study their impact for self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address the three research questions: "What is the role of the backbone in performance L-SSAT?", "What type of backbone is effective for different face analysis tasks?", and "Is there any generalized backbone for effective face analysis with L-SSAT?". Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone.
