CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs
Large Language Models often inherit societal biases that manifest as stereotyped associations across demographic groups. This paper proposes CatRAG, a dual-mechanism debiasing framework that combines a category-theoretic functor-guided projection—collapsing protected-attribute directions in embedding space via spectral decomposition—with diversity-aware Retrieval-Augmented Generation to ground inference in balanced evidence. Evaluated on the BBQ benchmark across Llama-3, GPT-OSS, and Gemma-3, the method claims to reduce bias scores from ~60% to near zero while improving accuracy by up to 40% over base models.
The paper presents a technically sound combination of representation-level projection and retrieval augmentation for debiasing LLMs, with strong empirical results on BBQ. However, the category-theoretic framing appears to be mathematical formalism for standard Fisher Linear Discriminant Analysis (maximizing $S_O$ scatter subject to $S_D$ constraints), and the surprisingly strong near-zero bias results warrant scrutiny regarding generalization beyond the curated corpus and anchor sets. The work is useful but overstates novelty through unnecessary abstraction.
The dual-mechanism approach is well-motivated: ablations confirm that functor-only projection (70.5% acc, 0.15 BS) and RAG-only (65.3% acc, 0.24 BS) combine synergistically to achieve the best trade-off (81.2% acc, 0.01 BS). The evaluation is comprehensive, covering intersectional Race×Gender subsets and three distinct model families (Llama-3.2-1B, GPT-OSS-20B, Gemma-3), with consistent patterns across architectures. The scatter plot analysis (Figure 3) provides intuitive evidence that the method reduces variance in stereotype preference.
The category theory framework (Section III-C) is essentially mathematical dressing for a standard linear discriminant objective: maximize $\text{Tr}(U^\top S_O U)$ subject to $U^\top(S_D+\epsilon I)U=I$, which is textbook Fisher LDA with demographics as 'within-class' and occupations as 'between-class' scatter. The paper does not justify why functorial language is necessary beyond metaphor, nor does it explain how the projection matrix $U$ is learned (what data? what optimization?). The RAG corpus construction (Section IV-C) is worryingly vague—described only as 'short factual snippets relevant to BBQ-style scenarios' with 'documented sources'—raising concerns about whether the corpus contains dataset-specific leakage or overly convenient counter-stereotypical priming. The near-zero bias scores (0.01) are suspiciously strong compared to established baselines and may indicate overfitting to the specific anchor sets $\mathcal{D}$ and $\mathcal{O}$ rather than robust debiasing.
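The equivalence with a generalized eigenproblem can be checked in two lines. The following is a sketch of the standard argument, assuming $S_O$ and $S_D$ are symmetric positive semidefinite and $\epsilon > 0$; the multiplier matrix $\Lambda$ and the shorthand $B$ are notation introduced here, not taken from the paper.

```latex
\begin{aligned}
B &:= S_D + \epsilon I, \qquad
\mathcal{L}(U,\Lambda) = \operatorname{Tr}\!\left(U^\top S_O U\right)
  - \operatorname{Tr}\!\left(\Lambda\,(U^\top B\, U - I)\right),\\[2pt]
\nabla_U \mathcal{L} = 0
  \;&\Longrightarrow\; S_O U = B\, U \Lambda
  \qquad (\Lambda = \Lambda^\top),\\[2pt]
\Lambda = Q\,\mathrm{diag}(\lambda_1,\dots,\lambda_{d_u})\,Q^\top,\;
U \leftarrow UQ
  \;&\Longrightarrow\; S_O u_k = \lambda_k\, B\, u_k,
  \qquad \operatorname{Tr}\!\left(U^\top S_O U\right) = \sum_{k=1}^{d_u} \lambda_k .
\end{aligned}
```

The rotation $U \leftarrow UQ$ leaves both the objective and the constraint unchanged, so the maximizer is spanned by the top $d_u$ generalized eigenvectors of the pair $(S_O, B)$. This is the same eigenproblem that Fisher LDA solves, with $S_O$ in the between-class role and the regularized $S_D$ in the within-class role.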
The evidence supports the claim that combining projection and retrieval outperforms single-stage methods, but comparison to prior work is incomplete. The CatRAG results (80.7% acc, 0.01 BS) are indeed better than Causal Debiasing (78.6% acc, 0.10 BS), but the paper notes that 'Causal slightly outperforms CatRAG in some cases because its constraint-based adjustment aligns closely with the dataset's spurious correlation structure'—hinting that CatRAG's advantage may be dataset-specific. The category-theoretic distinction from 'ad-hoc projection/editing' (Section I, third bullet) is not substantiated: the spectral solution in Equation (7) is identical to standard LDA, not a novel structure-preserving map necessitating category theory.
Code is publicly available at the provided GitHub link, which supports reproducibility. However, critical experimental details are missing: the exact composition of the anchor sets $\mathcal{D}$ and $\mathcal{O}$ (only examples like {man, woman} and {doctor, nurse} are given), the procedure for constructing the 'audit-able counter-stereotypical corpus' including document sources and balancing methodology, and the specific hyperparameters for the projection ($d_u=256$ is mentioned only in Section V-D) and retrieval ($K=3$). The generalized eigenvalue problem (Eq. 7) requires computing scatter matrices from data, yet the paper never specifies where the anchor embeddings used to compute them come from: the base model's pretrained embeddings, a held-out set, or the test data itself. These omissions substantially hinder independent reproduction of the spectral projection step.
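To make concrete which inputs a reproduction would need, here is a minimal NumPy sketch of the spectral step. It is not the authors' implementation. The centered-scatter definition, the function names `scatter` and `solve_projection`, the value of `eps`, the toy dimensions, and the random placeholder embeddings are all assumptions made for illustration; the paper's actual anchor embeddings are exactly what is left unspecified.

```python
import numpy as np

def scatter(embeddings):
    """Centered scatter matrix sum_i (e_i - mean)(e_i - mean)^T (assumed definition)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    return centered.T @ centered

def solve_projection(S_O, S_D, d_u, eps=1e-3):
    """Maximize Tr(U^T S_O U) subject to U^T (S_D + eps*I) U = I.

    Solved by whitening with the Cholesky factor of B = S_D + eps*I and
    taking the top-d_u eigenvectors of the whitened S_O.
    """
    d = S_D.shape[0]
    B = S_D + eps * np.eye(d)
    L = np.linalg.cholesky(B)          # B = L @ L.T
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_O @ L_inv.T          # whitened S_O
    M = (M + M.T) / 2                  # remove round-off asymmetry
    eigvals, V = np.linalg.eigh(M)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d_u]
    U = L_inv.T @ V[:, top]            # map back to the original coordinates
    return U, eigvals[top]

# Placeholder data: random vectors standing in for anchor embeddings.
rng = np.random.default_rng(0)
dim = 16
E_D = rng.normal(size=(8, dim))    # hypothetical demographic anchors
E_O = rng.normal(size=(12, dim))   # hypothetical occupation anchors

S_D, S_O = scatter(E_D), scatter(E_O)
U, lams = solve_projection(S_O, S_D, d_u=4)
```

Everything downstream depends on `E_D` and `E_O`. If those embeddings were drawn from test items, the projection would be fit to the evaluation data, which is why the missing specification matters.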
Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor-guided structural debiasing with Retrieval-Augmented Generation (RAG). The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.