Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models

cs.AI Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain · Mar 23, 2026

Plain-language introduction

This paper investigates whether LLMs exhibit genuine moral reasoning or merely produce convincing moral rhetoric, through a large-scale empirical study of 13 models across 6 classical moral dilemmas. Using Kohlberg's stages of moral development as a diagnostic framework, the authors evaluate whether model outputs track human developmental patterns or reflect alignment training artifacts. The central claim is "moral ventriloquism": the hypothesis that models acquire post-conventional moral language through RLHF without the underlying cognitive architecture. Three findings are offered as evidence: a distributional inversion (86% of responses at Stages 5-6, whereas Stage 4 dominates in humans), near-robotic cross-dilemma consistency (ICC > 0.90), and "moral decoupling", where stated justifications misalign with action choices.

Critical review

Verdict

Bottom line

This is a well-executed empirical study that makes a valuable contribution to the growing literature on reasoning faithfulness in LLMs. The core argument — that post-conventional moral language emerges as a rhetorical register from alignment training rather than genuine developmental progression — is supported by convergent evidence across ten analyses. The moral ventriloquism hypothesis provides a coherent theoretical framing for patterns that might otherwise appear as isolated anomalies. The paper is appropriately cautious in distinguishing empirical findings (distributional patterns) from mechanistic claims (RLHF causation), though the latter would require interpretability methods beyond the current scope.

“We distinguish two claims our evidence supports to different degrees. Our empirical finding that LLMs produce post-conventional moral language is directly demonstrated... Our interpretive hypothesis that this language does not reflect genuine moral reasoning is supported behaviorally by the decoupling and distributional inversion results, but would require mechanistic evidence to establish with certainty.”

paper · Section 6

What holds up

The cross-dilemma consistency analysis (ICC > 0.90) is methodologically sound, and the result is substantially anomalous: human moral reasoning typically shows ICC values well below 0.60 due to genuine context-sensitivity. The factorial decomposition cleanly separates scale from training effects, finding that scale effects are small ($\eta^{2}=0.050$, $d=0.55$), with mean stages spanning at most one full point (5.00–6.00) even across models from 8B to 235B parameters. The moral decoupling finding, in which mid-tier models like GPT-OSS-120B produce high-stage justifications inconsistent with their action choices, represents a genuine logical incoherence that supports the ventriloquism interpretation. The LLM-as-judge validation across three architecturally distinct models (GPT-4, Claude Sonnet, Llama-3), with high inter-judge agreement, mitigates concerns about scorer reliability.

“Intraclass Correlation Coefficients computed per model across the six dilemmas reveal that models produce logically indistinguishable responses regardless of the dilemma presented (ICC >> 0.90 for all evaluated models).”

paper · Section 5.3
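To make the consistency statistic concrete, here is a minimal sketch of a one-way random-effects ICC(1,1). The review does not say which ICC form or data layout the paper uses, so both are assumptions here: each row is a hypothetical rated unit (for example, one prompt condition for a given model), and each column is one of the six dilemmas. The function name `icc_oneway` is illustrative only.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for a rows-by-columns score table.

    Assumed layout (not taken from the paper): each row is one rated unit,
    each column is one dilemma, each cell is an assigned Kohlberg stage.
    """
    n = len(scores)        # number of rows (rated units)
    k = len(scores[0])     # number of columns (dilemmas)
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]

    # Between-row and within-row mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when the table has no variance")
    return (ms_between - ms_within) / denom


# Rows that never vary across dilemmas give the maximum ICC.
rigid = [[5] * 6, [6] * 6, [4] * 6]
# Rows whose stage flips with the dilemma give a low (here negative) ICC.
context_sensitive = [[4, 6, 4, 6, 4, 6], [6, 4, 6, 4, 6, 4]]

print(icc_oneway(rigid))
print(icc_oneway(context_sensitive))
```

One caveat this sketch makes visible: the statistic is undefined when every score is identical, so a near-ceiling stage distribution leaves very little variance for the ICC to partition.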

“Factorial ANOVA... finds scale is a statistically significant but practically small independent predictor (F(2,229)=6.05, p=0.003, $\eta^{2}=0.050$, $d=0.55$)... Training Type has no significant main effect (p=0.065).”

paper · Section 5.7
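As a rough aid for reading the reported effect size, the standard conversion from $\eta^{2}$ to Cohen's $f$ is $f=\sqrt{\eta^{2}/(1-\eta^{2})}$. The sketch below applies it to the reported $\eta^{2}=0.050$. The helper name is hypothetical, and the review does not say which contrast the reported $d=0.55$ refers to, so that value is not reconstructed here.

```python
import math


def cohens_f_from_eta_squared(eta_sq):
    """Convert eta squared (share of variance explained) to Cohen's f."""
    if not 0 <= eta_sq < 1:
        raise ValueError("eta squared must lie in [0, 1)")
    return math.sqrt(eta_sq / (1 - eta_sq))


f = cohens_f_from_eta_squared(0.050)
print(round(f, 3))  # 0.229
```

By Cohen's conventional rules of thumb (f of 0.10 small, 0.25 medium, 0.40 large), a value near 0.23 sits just under the medium threshold. These cutoffs are heuristics rather than hard boundaries.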

Main concerns

The LLM-as-judge methodology creates a potential circularity: if scoring models themselves exhibit moral ventriloquism, uniform inflation could produce artifactual consistency. The authors partially address this through inter-judge agreement, but systematic bias across all three judges toward post-conventional language remains possible. The sample of 13 models, while the largest Kohlberg-based evaluation to date, limits power for detecting nuanced training-type effects. The six-dilemma set, though standard in moral psychology, may not capture the full variance of moral reasoning domains. Most critically, the action–reasoning alignment analysis depends on inferring the "stage" of an action choice (itself a classification task), which introduces second-order measurement error — the decoupling finding could reflect classifier noise rather than genuine logical incoherence. The near-ceiling post-conventional rates (86% Stages 5–6) leave little dynamic range for detecting meaningful variation, effectively compressing the developmental spectrum.

“Our LLM-as-judge pipeline uses RLHF-aligned judges that may pattern-match to post-conventional rhetoric, though the ICC consistency and cross-architecture inter-judge agreement partially mitigate this.”

paper · Section 7

“Stages 5–6 account for 86% of all model responses, while Stage 4 accounts for only 10% and Stages 1–3 together for just 4%.”

paper · Table 1
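The second-order measurement concern raised above can be illustrated with a toy simulation. It is a sketch under strong assumptions, not a model of the paper's pipeline: every response is constructed so that justification and action truly agree, the justification stage is always judged correctly, and only the action-stage classifier errs at a chosen rate. The function name, the error rate, and the sample size are all hypothetical.

```python
import random

STAGES = [1, 2, 3, 4, 5, 6]


def observed_mismatch_rate(n_responses, action_error_rate, seed=0):
    """Share of responses flagged as decoupled when none truly are."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(n_responses):
        # By construction, justification and action share one true stage.
        true_stage = rng.choice([5, 6])
        justification_label = true_stage  # assumed to be judged without error
        if rng.random() < action_error_rate:
            # The action classifier errs: it picks any other stage.
            action_label = rng.choice([s for s in STAGES if s != true_stage])
        else:
            action_label = true_stage
        mismatches += action_label != justification_label
    return mismatches / n_responses


print(observed_mismatch_rate(20_000, action_error_rate=0.2))
```

Under these assumptions the apparent decoupling rate tracks the classifier error rate, which is why an error estimate for the action-stage classifier would help in interpreting the decoupling result.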

Evidence and comparison

The paper appropriately situates itself within the reasoning faithfulness literature, accurately citing Turpin et al. (2023) on unfaithful CoT explanations and Chen et al. (2025) on CoT faithfulness in reasoning models. The comparison to Scherrer et al. (2023), who treat LLM moral responses as "beliefs encoded in LLMs", effectively highlights how the current work complicates such interpretations. However, the contrast with Zhou et al. (2024) regarding theory-invoking prompts deserves scrutiny: Analysis 2 finds that theory prompts do not significantly shift stage distributions, which sits in tension with Zhou et al.'s finding that explicit moral theory grounding improves accuracy. The discrepancy may reflect different operationalizations (moral stage vs. classification accuracy), but the paper could reconcile these findings more directly. The reliance on Kohlberg's framework as "diagnostic scaffolding" is methodologically defensible given its well-characterized human distribution, though the paper acknowledges that the framework is contested in developmental psychology.

“We find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs... which models systematically fail to mention in their explanations.”

Turpin et al., 2023 · Abstract

“For CoT monitoring to be most effective, the CoT must be a legible and faithful reflection of the way the model reached its conclusion... If the CoT is not faithful, then we cannot depend on our ability to monitor CoT in order to detect misaligned behaviors.”

Chen et al., 2025 · Section 1

“Scherrer et al. (2023) treat LLM moral responses as direct evidence of 'beliefs encoded in LLMs', an interpretation our results complicate, since post-conventional language may be a surface property of alignment rather than a reflection of underlying beliefs.”

paper · Section 2.1

Reproducibility

Experimental detail is generally adequate for reproduction: the six dilemmas are standard (Heinz, trolley, lifeboat, doctor truth-telling, stolen food, broken promise), prompt templates are specified (zero-shot, chain-of-thought, roleplay), and statistical analyses report full test statistics with effect sizes. The factorial ANOVA design with 234 observations across scale × training type groups is well-documented in Appendix A.1. However, critical barriers to independent reproduction remain. The exact LLM-as-judge prompts are not provided in the main text or appendix, and the response-level dataset of >600 classified outputs is not released. Judge models are named only generically (e.g., "Claude Sonnet"), without exact version identifiers. Generation hyperparameters (temperature, max tokens) are not stated, and code availability is not mentioned. Full reproducibility would require: (1) open-sourcing the scoring pipeline prompts, (2) releasing the classified response dataset, (3) providing exact model API versions and generation parameters, and (4) releasing analysis code.

“For each response, the scoring model outputs: (1) a primary Kohlberg stage assignment, (2) a confidence score for the assignment, and (3) a natural language explanation justifying the classification.”

paper · Section 3.2

“Models are grouped into three scale tiers (Small: 8–32B; Mid: 70–120B; Large: 175–671B) and three training-type categories (Base-RLHF, Coding-Tuned, Reasoning-Tuned), yielding a 3×3 factorial design over 234 observations.”

paper · Section 4.1
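One reading consistent with the 234 figure is a full crossing of 13 models, six dilemmas, and the three prompt templates named in this review. The sketch below only checks that arithmetic. Treating each model, dilemma, and prompt combination as a single observation is an assumption, and the source does not say how the 234 observations relate to the more than 600 classified responses.

```python
from itertools import product

N_MODELS = 13
DILEMMAS = [
    "Heinz", "trolley", "lifeboat",
    "doctor truth-telling", "stolen food", "broken promise",
]
PROMPTS = ["zero-shot", "chain-of-thought", "roleplay"]

# One cell per (model, dilemma, prompt) combination, under the assumption above.
cells = list(product(range(N_MODELS), DILEMMAS, PROMPTS))
print(len(cells))  # 234
```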

“Analyses 7, 9, and 10... are reported in full in Appendix B–A.3. Technical specifications for all analyses are in Appendix A.”

paper · Section 3.4

Abstract

Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
