Greater accessibility can amplify discrimination in generative AI

cs.CL Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher · Mar 23, 2026

Plain-language introduction

Audio-enabled large language models promise to broaden AI access for users with disabilities or limited literacy, but voice interfaces expose paralinguistic cues (pitch, timbre, prosody) that carry demographic signals and that users cannot easily mask. This paper demonstrates that state-of-the-art audio LLMs systematically discriminate based on speaker voice, shifting toward gender-stereotyped adjectives and professions solely on the basis of acoustic features. Crucially, the authors show that voice inputs amplify bias beyond text-only baselines: models exhibit stronger stereotypical associations when processing speech than when processing equivalent text with gendered name cues. The study establishes a causal link through pitch manipulation experiments and surveys 1,000 users, finding that infrequent chatbot users are the most hesitant about undisclosed attribute inference and the most likely to disengage when such practices are revealed.

Critical review

Verdict

Bottom line

The paper presents a rigorously designed audit study that isolates voice-based discrimination through content-matched audio pairs and establishes causality via pitch manipulation. The finding that gender detection capability correlates with discriminatory output strength (measured as the correlation $r$ between detection accuracy and log odds ratios) is particularly compelling, suggesting that benchmark-driven optimization for speaker attribute recognition directly incentivizes bias. However, framing pitch shifting as a mitigation, even though it demonstrates a causal lever, raises ethical concerns about placing the burden of anonymization on users rather than on systems. The survey evidence, though suggestive, is limited to U.S. respondents recruited via Prolific and does not fully capture the global populations (e.g., low-literacy users in the Global South) that the paper identifies as primary beneficiaries of voice interfaces.

“Voice amplified gender discrimination substantially, increasing log odds ratios by an average of 0.14 for adjective associations and 0.11 for profession associations.”

Holtermann et al. · Section 2

“As pitch decreased, outputs transitioned from high female-term predominance toward distributions characteristic of low-pitched voices. This modification substantially reduced gender bias.”

Holtermann et al. · Section 4

What holds up

The experimental design is exemplary: by matching audio samples for content, accent, duration, and age range across self-identified male and female speakers, the authors isolate voice gender as the treatment variable. The log odds ratio analysis
$$\text{LOR} = \log \frac{N(\mathcal{S}_m,\mathcal{X}_m) \cdot N(\mathcal{S}_f,\mathcal{X}_f)}{N(\mathcal{S}_f,\mathcal{X}_m) \cdot N(\mathcal{S}_m,\mathcal{X}_f)}$$
provides a transparent metric for stereotype association strength, and the paired permutation test (100,000 iterations) appropriately respects the pairing of text and voice conditions on the same examples. The pitch manipulation experiments, which progressively shift voices from the highest pitch bin (Q8) toward the median of the lowest bin (Q1), offer convincing causal evidence that acoustic features themselves, rather than correlated confounds, drive the discriminatory outputs.
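
To make the log odds ratio above concrete, here is a minimal sketch of how it could be computed from four co-occurrence counts. The review does not define the symbols, so the reading used here is an assumption: $\mathcal{S}_m$ and $\mathcal{S}_f$ are male and female speakers, and $\mathcal{X}_m$ and $\mathcal{X}_f$ are male- and female-stereotyped terms. The function name and the `smoothing` parameter are hypothetical and not taken from the paper.

```python
import math


def log_odds_ratio(n_mm, n_ff, n_fm, n_mf, smoothing=0.0):
    """Log odds ratio of stereotype-congruent vs. incongruent outputs.

    n_mm: count of male-stereotyped terms for male speakers,    N(S_m, X_m)
    n_ff: count of female-stereotyped terms for female speakers, N(S_f, X_f)
    n_fm: count of male-stereotyped terms for female speakers,   N(S_f, X_m)
    n_mf: count of female-stereotyped terms for male speakers,   N(S_m, X_f)
    smoothing: constant added to every count to avoid division by zero
               (an assumption for this sketch, not from the paper).
    """
    a, b, c, d = (n + smoothing for n in (n_mm, n_ff, n_fm, n_mf))
    return math.log((a * b) / (c * d))


# Balanced counts give no association (LOR of 0).
print(log_odds_ratio(50, 50, 50, 50))
# Congruent pairings outnumbering incongruent ones give a positive LOR.
print(log_odds_ratio(60, 60, 40, 40))
```

Under this reading, a value of zero means the model's term choices are independent of speaker gender, and positive values indicate stereotype-congruent associations.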

“We employed a paired permutation test. We do 100,000 iterations by randomly swapping the model assignments (text vs. voice) for each example independently with probability 0.5.”

Holtermann et al. · Methods
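
The quoted procedure can be sketched as a sign-flip permutation test. This is an illustration only: it assumes one bias score per example under each condition and uses the mean paired difference as the test statistic, whereas the paper's exact statistic is not given in this review. All names are hypothetical.

```python
import random


def paired_permutation_test(text_scores, voice_scores, n_iter=100_000, seed=0):
    """Two-sided paired permutation test on the mean difference (voice - text).

    Swapping the text/voice labels of one example with probability 0.5
    is equivalent to flipping the sign of that example's difference.
    """
    if len(text_scores) != len(voice_scores):
        raise ValueError("conditions must be paired example by example")
    rng = random.Random(seed)
    diffs = [v - t for t, v in zip(text_scores, voice_scores)]
    n = len(diffs)
    observed = sum(diffs) / n
    hits = 0
    for _ in range(n_iter):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_iter + 1)
    return observed, p_value
```

Because labels are swapped within each example rather than shuffled across examples, the test keeps each text/voice pair together, which is what makes it appropriate for matched comparisons.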

“All models show a statistically significant increase in female associations from Q1 to Q8.”

Holtermann et al. · Extended Data Table E3

Main concerns

The study relies on binary gender classification (female/male) based on self-identification, which the authors acknowledge does not capture non-binary identities and may obscure distinct bias patterns for gender-diverse speakers. More critically, the proposed "mitigation" via pitch manipulation requires users to alter their fundamental vocal characteristics to receive equitable treatment. This is a technically interesting finding but a normatively questionable solution, because it places the burden of accommodation on marginalized users rather than requiring model-level intervention. The user survey reveals hesitancy among infrequent users (frequent chatbot usage is associated with lower concern, OR=0.78), but it samples only U.S. residents via Prolific. That creates a mismatch with the paper's emphasis on global accessibility benefits for populations in Africa, Asia, and Latin America, where literacy barriers are most acute.

“Our dataset includes only speech samples of accents in British and American English... our findings likely reflect predominantly Western, English-speaking norms.”

Holtermann et al. · Supplementary Information

“Frequent chatbot usage (OR=0.78) and male sex (OR=0.65) significantly reduce concern about attribute inference.”

Holtermann et al. · Figure 3b
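
As a reading aid for the odds ratios quoted above, the sketch below converts an odds ratio into a change in probability. The 50% baseline is a hypothetical value chosen for illustration; the review does not report baseline concern rates, so the printed probabilities are not results from the paper.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must lie strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline: 50% of the reference group is concerned.
baseline = 0.5
print(round(apply_odds_ratio(baseline, 0.78), 3))  # frequent chatbot usage
print(round(apply_odds_ratio(baseline, 0.65), 3))  # male sex
```

An odds ratio below 1 always lowers the probability, but the size of the drop depends on the baseline, which is why odds ratios alone do not state how many respondents changed their answer.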

Evidence and comparison

The evidence strongly supports the central claim that voice amplifies text-based bias across most models tested (Gemini Pro and Flash, Qwen2-Audio). GPT-4o Audio is an important exception: it explicitly refuses to infer gender, which results in lower voice-based bias than text-based bias. The comparison to related work is fair but could engage more thoroughly with the speech processing literature on paralinguistic bias. The authors cite ASR disparities but do not fully reconcile their findings with prior work on 'Spoken Stereoset', which found mixed evidence for voice-based bias. The demonstration that current audio benchmarks (AudioBench, AIR-bench) explicitly reward gender detection capability, and thereby incentivize discrimination, is a novel and important contribution to the evaluation literature.

“Models with lower gender detection accuracy exhibited log odds ratios near zero... high-performing models showed substantially larger log odds ratios.”

Holtermann et al. · Section 2

“Multiple prominent benchmarks for audio-enabled LLMs explicitly evaluate gender detection performance... incentivizing developers to build these capabilities into their systems.”

Holtermann et al. · Section 5

Reproducibility

The authors commit to releasing anonymized survey data and analysis code on GitHub upon publication, though these materials are currently unavailable. The study relies on proprietary APIs (GPT-4o Audio, Gemini variants) and open-weight models (Qwen2-Audio, Llama-Omni, etc.) with specified versions as of November 2025; however, API-dependent results are vulnerable to model drift and versioning changes. Decoding hyperparameters are fully specified (temperature $T=0.1$, nucleus sampling with $p=0.9$, top-$k$ sampling with $k=100$), and the dataset construction from three public corpora (British Dialects, English Accents, Spoken Stereoset) is documented with filtering criteria. The pitch manipulation uses the open-source 'pyrubberband' library, and the pitch quantile boundaries (spanning 84–515 Hz) are disclosed, enabling replication of the causal manipulation.

“All datasets used in this study are publicly available... Anonymized survey data collected in this study will be released together with our analysis code in a public GitHub repository upon publication.”

Holtermann et al. · Methods

“We employed the Python package 'pyrubberband'... The corresponding quantile boundaries were 84, 104, 115, 124, 156, 184, 195, 208, and 515 Hz.”

Holtermann et al. · Methods
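
The disclosed boundaries are enough to sketch the binning and the size of a pitch shift. The following is a minimal illustration, not the authors' code: the nine boundaries are taken from the quote above and imply eight bins (Q1 to Q8), while the rule for values that fall exactly on a boundary, the helper names, and the use of a single source and target frequency per clip are assumptions. `pyrubberband.pitch_shift(y, sr, n_steps)` is the library's pitch-shifting call; it needs the external Rubber Band command-line tool, so it is imported lazily here.

```python
import bisect
import math

# Quantile boundaries in Hz, as quoted from the paper's Methods section.
BOUNDARIES_HZ = [84, 104, 115, 124, 156, 184, 195, 208, 515]


def pitch_bin(f0_hz):
    """Return the bin number 1..8 (Q1..Q8) for a fundamental frequency.

    Assumption: a value equal to an interior boundary goes to the upper bin.
    """
    if not BOUNDARIES_HZ[0] <= f0_hz <= BOUNDARIES_HZ[-1]:
        raise ValueError("f0 outside the disclosed 84-515 Hz range")
    return bisect.bisect_right(BOUNDARIES_HZ[1:-1], f0_hz) + 1


def semitone_shift(source_hz, target_hz):
    """Semitones needed to move source_hz to target_hz (negative lowers pitch)."""
    return 12.0 * math.log2(target_hz / source_hz)


def shift_voice(y, sr, source_hz, target_hz):
    """Pitch-shift an audio array with pyrubberband (hypothetical wrapper)."""
    import pyrubberband as pyrb  # requires the Rubber Band CLI to be installed

    return pyrb.pitch_shift(y, sr, semitone_shift(source_hz, target_hz))


print(pitch_bin(90), pitch_bin(300))
print(semitone_shift(200.0, 100.0))  # one octave down
```

Working in semitones keeps the shift proportional rather than additive in Hz, which matches how pitch-shifting tools are usually parameterized.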

Abstract

Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant to undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
