Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

cs.CR cs.AI cs.LG Tom Biskupski, Stephan Kleber · Mar 23, 2026
Plain-language introduction

Evaluating LLM outputs at scale remains a bottleneck for deploying safe AI systems. This paper conducts a comprehensive empirical study of 37 conversational LLMs serving as automated judges across eight security and quality assessment tasks. The work identifies viable open-source alternatives to GPT-4o for judgment tasks while demonstrating that popular techniques like second-level judging and specialized evaluator models underperform compared to well-prompted general models.

Critical review
Verdict
Bottom line

The paper presents a methodical large-scale evaluation of LLM-as-a-judge configurations, establishing that models with $\geqslant$32B parameters (and select smaller ones like Qwen2.5 14B) achieve high fidelity with human judgments ($F_1$ up to 0.96). However, the validity of the ground-truth labels is weakened by reliance on a single human annotator per sample without reported inter-annotator agreement, and the limited dataset size (534 total instances) raises concerns about statistical power for the eight distinct task categories.

“One human annotator created the labels for each entry in the datasets”
Biskupski & Kleber · Section 3.1
“GPT-4o performed the best across all prompts and achieved the best F1-score of 0.96 on the CoT prompt”
Biskupski & Kleber · Section 4.2.2
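For readers unfamiliar with the metric behind the "$F_1$ up to 0.96" figure, the following is a minimal sketch of how a binary $F_1$ score is computed when human labels serve as ground truth. The label lists are invented for illustration, and the function name `f1_score` is hypothetical; the paper may use a different averaging variant or tooling.

```python
def f1_score(human, judge, positive=1):
    """Binary F1 of judge labels against human ground-truth labels."""
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels, not data from the paper.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1]
judge_labels = [1, 1, 0, 0, 1, 1, 0, 1]
print(f"F1 = {f1_score(human_labels, judge_labels):.2f}")
```

Because the human labels define what counts as a true positive, any labeling error by the single annotator feeds directly into this score.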
What holds up

The systematic screening of 37 models through structured output tests, correctness evaluations, and stability analyses provides a rigorous template for judge selection. Two key findings offer actionable insights for practitioners: second-level judges degrade performance, and general conversational models outperform specialized fine-tuned judges. The observation that larger models ($\geqslant$32B) effectively leverage Chain-of-Thought prompts while smaller models do not is consistent with earlier reports that Chain-of-Thought prompting mainly benefits models at larger scale.

“Qwen2.5 7B with the 2 Basic prompt drops by 0.28 in F1, from 0.87 to 0.59, while Gemma2's F1-score drops as much as 0.18”
Biskupski & Kleber · Section 5.2
“LLMs often fail to improve and frequently worsen their responses when tasked to review answers”
Huang et al., arXiv:2310.01798 · Section 6 citation
Main concerns

The empirical foundation rests on only 534 labeled examples across eight disparate categories, with as few as 34 jailbreak cases, which severely limits generalization. Ground-truth reliability is questionable: one human annotator created the labels for each entry, with no verification step and no second annotator, so no inter-annotator agreement statistic such as Cohen's kappa is reported. The study is restricted to binary classification and single-turn English interactions, ignoring the multi-turn adversarial contexts where judging is often most needed. Excluding the o-series models because of API content filters, rather than designing a test harness that works around them, leaves a noticeable gap in the evaluation.

“One human annotator created the labels for each entry in the datasets, closely regarding the evaluation criteria of each topic”
Biskupski & Kleber · Section 3.1
“We limited our study to single-turn scenarios in English language with binary labels, which may not generalize to complex, multilingual, or nuanced use cases”
Biskupski & Kleber · Section 6
“we were forced to exclude the o1, o1-mini, and o3-mini models due to content filters that blocked inputs which contain seemingly illegal or harmful contents”
Biskupski & Kleber · Section 4.2.1
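The two statistical concerns above can be made concrete with a short sketch. It shows (a) how Cohen's kappa would be computed if a second annotator had labeled the same items, and (b) how wide a 95% Wilson confidence interval is for an agreement rate measured on only 34 items. All numbers are invented for illustration, and the function names are hypothetical; none of this reproduces the paper's analysis.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical second annotator; the paper reports only one.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 1, 1, 0, 1, 0]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")

# Same observed rate, different sample sizes (invented counts).
for correct, total in [(31, 34), (310, 340)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: 95% interval [{low:.3f}, {high:.3f}]")
```

With the same observed rate, the interval for 34 items is several times wider than for 340, which is the practical meaning of the statistical-power concern.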
Evidence and comparison

The comparison to prior work is generally fair, correctly noting that specialized models like Prometheus 2 were not designed to follow additional instructions for binary classification. However, the dismissal of fine-tuned judges as universally unsuitable may be overly broad, given that Granite 3 Guardian outperformed the Qwen2.5 7B baseline in four of the six categories within its intended use case. Aggregating $F_1$ across datasets of differing sizes is appropriate. The reliability claim, however, rests on percent agreement of $\geqslant$95.54% measured at temperature 0, which says little about how predictable the judges are under realistic sampling conditions.

“Granite 3 Guardian outperforms the Qwen2.5 7B baseline in 4 of 6 categories that fall within its intended use case”
Biskupski & Kleber · Section 6
“To test the format stability during the evaluation of (1) structured outputs, we set a temperature of 0.5 to introduce variability. Throughout all other evaluations, we set the temperature to 0”
Biskupski & Kleber · Section 3.4
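To illustrate why aggregation and the stability metric both deserve scrutiny, here is a minimal sketch with invented numbers. It contrasts pooled (micro) and per-dataset averaged (macro) $F_1$ when dataset sizes differ, and computes percent agreement across repeated runs under one plausible definition: the share of items on which every run returns the same label. The paper's exact definitions may differ, and all names here are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in terms of confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pooled_f1(datasets):
    """Sum the counts over all datasets first, then compute one F1."""
    tp = sum(d["tp"] for d in datasets)
    fp = sum(d["fp"] for d in datasets)
    fn = sum(d["fn"] for d in datasets)
    return f1_from_counts(tp, fp, fn)


def macro_f1(datasets):
    """Compute F1 per dataset, then take the unweighted mean."""
    scores = [f1_from_counts(d["tp"], d["fp"], d["fn"]) for d in datasets]
    return sum(scores) / len(scores)


def percent_agreement(runs):
    """Share of items on which all repeated runs give the same label."""
    items = list(zip(*runs))
    stable = sum(1 for labels in items if len(set(labels)) == 1)
    return stable / len(items)


# Invented counts: one large, easy dataset and one small, hard dataset.
datasets = [
    {"tp": 90, "fp": 5, "fn": 5},
    {"tp": 10, "fp": 10, "fn": 10},
]
print(f"pooled F1 = {pooled_f1(datasets):.3f}")
print(f"macro F1  = {macro_f1(datasets):.3f}")

# Invented labels from three repeated runs over four items.
runs = [[1, 0, 1, 1], [1, 0, 1, 0], [1, 0, 1, 1]]
print(f"agreement = {percent_agreement(runs):.2f}")
```

In this toy case the pooled score is dominated by the larger dataset and hides the weak result on the smaller one. Likewise, agreement measured at temperature 0 would need to be re-measured at the sampling temperature actually used in deployment.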
Reproducibility

Reproduction is hampered by the absence of a public code repository or explicit data release for the custom Brand Harm and Similarity datasets. While the paper specifies temperature settings, it omits critical hyperparameters such as top-p, max tokens, and system prompt handling. The reliance on proprietary APIs alongside local model execution creates environment-specific dependencies. The single-annotator ground truth cannot be independently validated, and the manual parsing accommodations made for Gemini 1.5 Pro's JSON formatting errors indicate that results may be parser-dependent.

“We enhance our output parser to handle these minor formatting errors as gracefully as possible in the subsequent evaluation steps”
Biskupski & Kleber · Section 4.1
“Brand Harm... Source: Custom”
Biskupski & Kleber · Table I
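The parser-dependence concern can be sketched as follows. The example compares a strict JSON parser with a lenient one that tolerates text around the JSON object. The output schema (a `verdict` field), the sample response, and both function names are assumptions made for illustration; the paper does not publish its parser, so this is not a reconstruction of it.

```python
import json


def parse_strict(raw):
    """Accept the response only if it is a valid JSON object as-is."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_lenient(raw):
    """Also accept a JSON object embedded in surrounding text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    return parse_strict(raw[start:end + 1])


# Hypothetical judge response with chatter around the JSON payload.
response = 'Here is my judgment: {"verdict": true} Let me know if you need more.'

print("strict :", parse_strict(response))
print("lenient:", parse_lenient(response))
```

The same response is unusable under the strict parser and a valid judgment under the lenient one, so reported scores for a model with formatting problems depend on which behavior the evaluation harness implements.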
Abstract

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
