PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing
PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.
This is a methodologically solid contribution to robotic evaluation that moves beyond coarse success rates. The axiomatic foundation—macro-consistency via potential-based additivity and micro-resolution via fine-grained sensitivity—provides a principled framework for dense evaluation. The OPD metrics (Milestone Coverage, Path-weighted Progress Length, Cumulative Regret Area, and Stagnation Ratio) are well-motivated and mathematically grounded. The empirical validation on RoboPulse demonstrates that PRM judges, particularly Robo-Dopamine, achieve substantially higher accuracy (0.80) than general foundation models (0.47–0.54) on fine-grained progress discrimination, supporting the core claim that domain-specific process reward models serve as effective execution judges.
The theoretical formulation is rigorous and insightful. The potential-difference construction $S(x_i, x_j) = \Phi(x_j) - \Phi(x_i)$ naturally satisfies the macro-consistency axiom (Equation 2), ensuring path-independent aggregation across trajectory segments. The OPD metric system effectively separates outcome reachability (MC, MP) from process efficiency (PPL) and diagnostic failure modes (CRA, STR), enabling granular policy auditing. The RoboPulse benchmark—comprising 1,800 pairwise judgments across 9 embodiment–setting categories with Small/Medium/Large hop stratification—is a valuable contribution for testing progress discrimination at controlled granularities. The failure fingerprint analysis (Figure 5) successfully distinguishes stagnation-dominant failures (high STR) from regret-dominant failures (high CRA), offering actionable diagnostic signals.
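The additivity property behind macro-consistency can be checked with a few lines of Python. This is a minimal sketch, not the paper's implementation: the helper `segment_score` and the progress values in `phi` are illustrative assumptions, chosen only to show that potential differences telescope even when the trajectory regresses partway through.

```python
from typing import Sequence

def segment_score(phi: Sequence[float], i: int, j: int) -> float:
    """S(x_i, x_j) = Phi(x_j) - Phi(x_i) for frame indices i and j."""
    return phi[j] - phi[i]

# Hypothetical per-frame progress estimates in [0, 1] (illustrative values only).
phi = [0.0, 0.25, 0.125, 0.5, 1.0]

# Score computed directly from the first frame to the last.
direct = segment_score(phi, 0, 4)

# Sum of scores over consecutive steps, including the regression from frame 1 to 2.
chained = sum(segment_score(phi, t, t + 1) for t in range(len(phi) - 1))

# A different split of the same trajectory into two segments.
split = segment_score(phi, 0, 2) + segment_score(phi, 2, 4)

print(direct, chained, split)
```

All three quantities agree because the intermediate potentials cancel, which is what path-independent aggregation means here.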
The comparison between PRM judges and foundation models is methodologically problematic. While the paper frames this as "zero-shot" evaluation, the PRMs (Robo-Dopamine, VLAC, GVL) are trained on thousands of hours of robotic trajectory data (3,400+ hours for Robo-Dopamine, 4,000+ hours for VLAC), whereas Gemini and GPT-5.2 are general-purpose models without domain-specific pretraining on manipulation videos. This creates an apples-to-oranges comparison that disadvantages the foundation models. The claim that PRM-as-a-Judge operates "directly from trajectory videos" elides the fact that these models required extensive domain-specific supervision to learn the progress manifold. Additionally, the macro-consistency guarantee holds only when every score is the difference of a single, deterministic potential. A neural PRM whose progress estimate for the same frame changes across stochastic inference passes can violate path independence, and the experiments do not quantify this gap.
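The path-independence concern above could be quantified with an additivity residual. The sketch below is a suggestion under stated assumptions, not something the paper reports: `max_additivity_residual`, the progress values, and the uniform noise model standing in for stochastic inference are all hypothetical.

```python
import random
from itertools import combinations
from typing import Callable

def max_additivity_residual(score: Callable[[int, int], float], n_frames: int) -> float:
    """Largest |S(i,k) - (S(i,j) + S(j,k))| over all frame triples i < j < k."""
    worst = 0.0
    for i, j, k in combinations(range(n_frames), 3):
        residual = abs(score(i, k) - (score(i, j) + score(j, k)))
        worst = max(worst, residual)
    return worst

# Hypothetical progress estimates (illustrative values only).
phi = [0.0, 0.25, 0.125, 0.5, 1.0]

def exact_score(i: int, j: int) -> float:
    # Deterministic potential difference: additive by construction.
    return phi[j] - phi[i]

rng = random.Random(0)

def noisy_score(i: int, j: int) -> float:
    # Assumed noise model: each call re-estimates both endpoints independently,
    # mimicking a judge whose per-frame estimate varies between inference passes.
    end = phi[j] + rng.uniform(-0.05, 0.05)
    start = phi[i] + rng.uniform(-0.05, 0.05)
    return end - start

exact_residual = max_additivity_residual(exact_score, len(phi))
noisy_residual = max_additivity_residual(noisy_score, len(phi))
print(exact_residual, noisy_residual)
```

A residual of zero indicates exact additivity. Reporting this quantity for each PRM judge would show how far a real model departs from the idealized potential.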
The RoboPulse results (Table 2) provide strong evidence for micro-resolution: Robo-Dopamine achieves 0.80 accuracy on Small hops versus 0.54 for Gemini 3 Pro and 0.47 for GPT-5.2. However, without controlling for the domain-specific training data advantage, these comparisons are difficult to interpret as evidence that PRMs are inherently better judges rather than simply better trained for this specific domain. The comparison to CLIP-based methods is more equitable (both trained on broad internet data), though CLIP is not optimized for temporal progress modeling. The RoboTwin policy auditing experiments (Table 3) effectively demonstrate OPD's utility: on Blocks Ranking RGB, $\pi_0$ reaches MC@75 of 40% while OpenVLA-OFT reaches only 6%, even though both have near-zero MC@100, which reveals qualitatively different failure regimes. The success-conditioned analysis (Figure 4) separates high-quality successes (DP: PPL=94.9) from unstable ones.
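To make the hop-stratified accuracy figures concrete, here is a minimal sketch of how pairwise progress discrimination could be scored per hop bucket. The record layout, the bucket names, and the function `accuracy_by_hop` are assumptions for illustration. RoboPulse's actual data format and protocol are described in the paper's appendices and are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (phi_a, phi_b, b_is_further, hop_bucket)
Pair = Tuple[float, float, bool, str]

def accuracy_by_hop(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Fraction of pairs per hop bucket where the judge orders the two frames correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for phi_a, phi_b, b_is_further, hop in pairs:
        predicted_b_further = phi_b > phi_a
        correct[hop] += int(predicted_b_further == b_is_further)
        total[hop] += 1
    return {hop: correct[hop] / total[hop] for hop in total}

# Illustrative judgments only, not benchmark data.
pairs = [
    (0.40, 0.42, True, "small"),   # judge orders the frames correctly
    (0.40, 0.39, True, "small"),   # judge misses a small advance
    (0.20, 0.50, True, "large"),   # correct
    (0.90, 0.30, False, "large"),  # correct: frame a is further along
]
result = accuracy_by_hop(pairs)
print(result)
```

Small hops are the hardest case because the true progress gap is comparable to the judge's estimation error, which is why the Small-hop column is the most informative test of micro-resolution.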
The mathematical formalization is thorough, with Appendix A providing rigorous definitions, range proofs, and robustness analyses for all OPD metrics (MC, MP, PPL, CRA, STR). The RoboPulse construction protocol is detailed in Appendices E and F, including hop-based normalization and sampling procedures. However, the provided text does not state whether code or checkpoints for the PRM-as-a-Judge implementation are released. Reproduction would require obtaining the Robo-Dopamine or VLAC checkpoints (which come from prior work) and implementing the OPD metric calculations. The reliance on existing PRM checkpoints means hyperparameters for the judges are inherited from prior publications, though the OPD aggregation metrics themselves are deterministic given the potential estimates. Full reproducibility would benefit from explicit release of the RoboPulse dataset splits and evaluation scripts.
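As an indication of what reimplementing the aggregation step might involve, the sketch below computes two OPD-style quantities from per-frame potentials. The definitions are assumed readings of the metric names and are not taken from Appendix A: MC@tau is treated as the fraction of rollouts whose peak progress reaches `tau`, and STR as the fraction of steps whose progress change falls below a hypothetical threshold `eps`. It returns fractions, whereas the paper reports MC as a percentage.

```python
from typing import Sequence

def milestone_coverage(rollouts: Sequence[Sequence[float]], tau: float) -> float:
    """Assumed MC@tau: fraction of rollouts whose peak progress estimate reaches tau."""
    hits = sum(1 for phi in rollouts if max(phi) >= tau)
    return hits / len(rollouts)

def stagnation_ratio(phi: Sequence[float], eps: float = 0.01) -> float:
    """Assumed STR: fraction of steps whose absolute progress change is below eps."""
    deltas = [abs(b - a) for a, b in zip(phi, phi[1:])]
    return sum(d < eps for d in deltas) / len(deltas)

# Illustrative rollouts only, not results from the paper.
rollouts = [
    [0.0, 0.5, 0.75, 1.0],    # completes the task
    [0.0, 0.5, 0.75, 0.75],   # reaches 75% and stalls
    [0.0, 0.25, 0.25, 0.25],  # stalls early
    [0.0, 0.0, 0.0, 0.0],     # never makes progress
]

mc75 = milestone_coverage(rollouts, 0.75)
mc100 = milestone_coverage(rollouts, 1.0)
str_stalled = stagnation_ratio(rollouts[3])
str_early = stagnation_ratio(rollouts[2])
print(mc75, mc100, str_stalled, str_early)
```

Under these assumed definitions, both functions are pure functions of the potential sequence, which is consistent with the point that the aggregation is deterministic once the judge's estimates are fixed.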
Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.