CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs

cs.CL cs.AI Ravi Ranjan, Utkarsh Grover, Mayur Akewar, Xiaomin Lin, Agoritsa Polyzou · Mar 23, 2026
Plain-language introduction

Large Language Models often inherit societal biases that manifest as stereotyped associations across demographic groups. This paper proposes CatRAG, a dual-mechanism debiasing framework that combines a category-theoretic functor-guided projection—collapsing protected-attribute directions in embedding space via spectral decomposition—with diversity-aware Retrieval-Augmented Generation to ground inference in balanced evidence. Evaluated on the BBQ benchmark across Llama-3, GPT-OSS, and Gemma-3, the method claims to reduce bias scores from ~60% to near zero while improving accuracy by up to 40% over base models.

Critical review
Verdict
Bottom line

The paper presents a technically sound combination of representation-level projection and retrieval augmentation for debiasing LLMs, with strong empirical results on BBQ. However, the category-theoretic framing appears to be mathematical formalism for standard Fisher Linear Discriminant Analysis (maximizing $S_O$ scatter subject to $S_D$ constraints), and the surprisingly strong near-zero bias results warrant scrutiny regarding generalization beyond the curated corpus and anchor sets. The work is useful but overstates novelty through unnecessary abstraction.

What holds up

The dual-mechanism approach is well-motivated: ablations confirm that functor-only projection (70.5% acc, 0.15 BS) and RAG-only (65.3% acc, 0.24 BS) combine synergistically to achieve the best trade-off (81.2% acc, 0.01 BS). The evaluation is comprehensive, covering intersectional Race×Gender subsets and three distinct model families (Llama-3.2-1B, GPT-OSS-20B, Gemma-3), with consistent patterns across architectures. The scatter plot analysis (Figure 3) provides intuitive evidence that the method reduces variance in stereotype preference.

“Functor-only: 70.5±1.2 acc, 0.15±0.02 BS; RAG-only: 65.3±1.4 acc, 0.24±0.03 BS; Full CatRAG: 81.2±1.0 acc, 0.01±0.01 BS”
paper · Section V-D, Table III
“CatRAG achieves 79.5% accuracy with 0.02 BS on Race×Gender (intersectional) for Llama-3, compared to Base 42.1% with 0.71 BS”
paper · Section V-B, Table II
Main concerns

The category theory framework (Section III-C) is essentially mathematical dressing for a standard linear discriminant objective: maximize $\text{Tr}(U^\top S_O U)$ subject to $U^\top(S_D+\epsilon I)U=I$, which is textbook Fisher LDA with demographics as 'within-class' and occupations as 'between-class' scatter. The paper does not justify why functorial language is necessary beyond metaphor, nor does it explain how the projection matrix $U$ is learned: which data it is fit on and which optimization procedure is used. The RAG corpus construction (Section IV-C) is worryingly vague: it is described only as 'short factual snippets relevant to BBQ-style scenarios' with 'documented sources', which raises concerns about whether the corpus contains dataset-specific leakage or overly convenient counter-stereotypical priming. The near-zero bias scores (0.01) are suspiciously strong compared to established baselines and may indicate overfitting to the specific anchor sets $\mathcal{D}$ and $\mathcal{O}$ rather than robust debiasing.

“$\max_{U}\ \text{Tr}(U^\top S_O U)\ \ \text{s.t.}\ \ U^\top(S_D+\epsilon I)U=I_{d_u}$”
paper · Section IV-B, Equation (6)
“We build a compact, domain-aligned corpus K containing short factual snippets relevant to BBQ-style scenarios... For example, for questions involving gender and professions, the corpus includes balanced statements such as: 'A substantial fraction of nurses are men'”
paper · Section IV-C, Corpus construction
“the projection is linear and depends on how well the chosen anchor sets capture the relevant sensitive directions, while RAG effectiveness depends on corpus coverage and diversity”
paper · Section VI, Limitations
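
To make the LDA comparison concrete, here is a minimal sketch of how an objective of this form is usually solved: as a generalized symmetric eigenproblem $S_O u = \lambda (S_D+\epsilon I) u$, keeping the eigenvectors with the largest eigenvalues. This is not the authors' code. The function name `solve_projection`, the value of `eps`, and the synthetic scatter matrices are assumptions made only for illustration.

```python
import numpy as np
from scipy.linalg import eigh


def solve_projection(S_O, S_D, d_u, eps=1e-3):
    """Maximize Tr(U^T S_O U) subject to U^T (S_D + eps*I) U = I.

    Hypothetical helper, not taken from the paper's repository.
    """
    d = S_O.shape[0]
    B = S_D + eps * np.eye(d)  # regularized constraint matrix (positive definite)
    # Generalized symmetric eigenproblem S_O u = lambda * B u.
    # eigh returns ascending eigenvalues and B-orthonormal eigenvectors.
    eigvals, eigvecs = eigh(S_O, B)
    top = np.argsort(eigvals)[::-1][:d_u]  # indices of the d_u largest eigenvalues
    return eigvecs[:, top], eigvals[top]


# Synthetic stand-ins for the scatter matrices (assumption: not the paper's data).
rng = np.random.default_rng(0)
d = 16
A = rng.standard_normal((d, 4))
C = rng.standard_normal((d, 2))
S_O = A @ A.T  # low-rank PSD matrix playing the role of occupation scatter
S_D = C @ C.T  # low-rank PSD matrix playing the role of demographic scatter

U, lams = solve_projection(S_O, S_D, d_u=3)

# Check the constraint and the objective value at the solution.
B = S_D + 1e-3 * np.eye(d)
constraint_residual = np.abs(U.T @ B @ U - np.eye(3)).max()
objective = np.trace(U.T @ S_O @ U)
```

At the solution the objective equals the sum of the selected eigenvalues, which is the same closed form used in textbook discriminant analysis. Nothing in this computation requires categorical machinery.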
Evidence and comparison

The evidence supports the claim that combining projection and retrieval outperforms single-stage methods, but comparison to prior work is incomplete. The CatRAG results (80.7% acc, 0.01 BS) are indeed better than Causal Debiasing (78.6% acc, 0.10 BS), but the paper notes that 'Causal slightly outperforms CatRAG in some cases because its constraint-based adjustment aligns closely with the dataset's spurious correlation structure'—hinting that CatRAG's advantage may be dataset-specific. The category-theoretic distinction from 'ad-hoc projection/editing' (Section I, third bullet) is not substantiated: the spectral solution in Equation (7) is identical to standard LDA, not a novel structure-preserving map necessitating category theory.

“CatRAG (Ours): 80.7% Acc, 0.01 BS; Causal Debiasing: 78.6% Acc, 0.10 BS”
paper · Section V-B, Table I
“Causal slightly outperforms CatRAG in some cases because its constraint-based adjustment aligns closely with the dataset's spurious correlation structure”
paper · Section VI, Discussion
Reproducibility

Code is publicly available at the provided GitHub link, which supports reproducibility. However, critical experimental details are missing: the exact composition of the anchor sets $\mathcal{D}$ and $\mathcal{O}$ (only examples like {man, woman} and {doctor, nurse} are given), the procedure for constructing the 'audit-able counter-stereotypical corpus' including document sources and balancing methodology, and the specific hyperparameters for the projection ($d_u=256$ is mentioned only in Section V-D) and retrieval ($K=3$). The generalized eigenvalue problem (Eq. 7) requires computing scatter matrices from data, yet the paper never specifies where the anchor embeddings used to compute them come from: the base model's pretrained embeddings, a held-out set, or the test data itself. These omissions substantially hinder independent reproduction of the spectral projection step.

“Full CatRAG uses d_u=256, gender+occupation anchors, retrieval K=3”
paper · Section V-D, Table III
“Let D be the protected demographic set (e.g., {man, woman}), and O be the occupational set (e.g., {doctor, nurse})... we construct the corresponding scatter matrices S_D and S_O”
paper · Section III-A and IV-B
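
The sketch below illustrates why the anchor-set composition matters for reproduction. It builds scatter matrices from anchor-word embeddings under two loudly stated assumptions: the scatter is the centered outer-product sum $\sum_i (e_i-\bar e)(e_i-\bar e)^\top$, and random vectors stand in for the embedding source, which the paper leaves unspecified. The names `scatter_matrix` and `embed` are hypothetical.

```python
import numpy as np


def scatter_matrix(embeddings):
    """Centered scatter: sum_i (e_i - mean)(e_i - mean)^T over the rows.

    Assumed definition; the paper's exact construction is not given.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    return centered.T @ centered


# Hypothetical embedding lookup: random vectors stand in for whichever
# embedding source the paper actually uses (unspecified in the paper).
rng = np.random.default_rng(0)
dim = 8
vocab = ["man", "woman", "doctor", "nurse"]
embed = {w: rng.standard_normal(dim) for w in vocab}

D_anchors = ["man", "woman"]     # example protected set quoted from the paper
O_anchors = ["doctor", "nurse"]  # example occupational set quoted from the paper

S_D = scatter_matrix(np.stack([embed[w] for w in D_anchors]))
S_O = scatter_matrix(np.stack([embed[w] for w in O_anchors]))

# n centered anchor vectors span at most n - 1 directions.
rank_S_D = np.linalg.matrix_rank(S_D)
rank_S_O = np.linalg.matrix_rank(S_O)
```

With only two anchors per set, each scatter matrix has rank 1, so $S_D$ alone is singular and the $\epsilon I$ term is what makes the constraint well-posed. The learned directions therefore depend heavily on how many anchors are used and which ones, which is exactly the information the paper omits.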
Abstract

Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.
