Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support

cs.AI Shuying Chen, Sen Cui, Zhong Cao · Mar 23, 2026

Plain-language introduction

This paper presents Oph-Guid-RAG, a multimodal retrieval-augmented generation system for ophthalmic clinical decision support. Unlike conventional text-based RAG systems, it retrieves full guideline page images using ColQwen2.5, preserving tables, flowcharts, and layout information without OCR errors. The system introduces a controllable retrieval framework with routing and filtering to selectively introduce external evidence, evaluated on a specialized ophthalmology subset extracted from HealthBench.

Critical review

Verdict

Bottom line

The paper proposes a visually grounded RAG approach for clinical guidelines that delivers strong accuracy gains on challenging cases, but at a clear cost in completeness. On HealthBench Hard, the system improves the overall score by 30.0% relative to GPT-5.2 (from 0.2969 to 0.3861) and accuracy by 10.4%, yet completeness falls to 0.0483 on the hard subset, about half of GPT-5.2's 0.0971. This suggests that aggressive filtering discards partially relevant evidence and yields incomplete responses, which may be unsafe in clinical use. The page-level visual retrieval strategy is technically sound and the ablation studies validate the importance of reranking, but the conservative use of evidence is a liability for clinical deployment, where incomplete guidance could be harmful.

“On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2”

Chen et al. (this paper) · Abstract

“But the incompleteness is still a major limitation. We obtain the score of 0.0483, lower than GPT-5.4(0.1139) and GPT-5.2 (0.0971)”

Chen et al. (this paper) · Section 5.2
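
The relative figures above follow directly from the quoted scores. As a quick check, the sketch below recomputes them in Python. The function and variable names are illustrative and do not come from the paper; only the numeric scores are taken from the quoted abstract and Section 5.2.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change from `baseline` to `new` as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0


# Hard-subset scores as quoted from the paper (GPT-5.2 vs. the proposed system).
overall_gpt52, overall_ours = 0.2969, 0.3861
accuracy_gpt52, accuracy_ours = 0.5956, 0.6576
completeness_gpt52, completeness_ours = 0.0971, 0.0483

overall_gain = relative_change(overall_gpt52, overall_ours)
accuracy_gain = relative_change(accuracy_gpt52, accuracy_ours)
completeness_change = relative_change(completeness_gpt52, completeness_ours)

print(f"overall score: {overall_gain:+.1f}%")
print(f"accuracy:      {accuracy_gain:+.1f}%")
print(f"completeness:  {completeness_change:+.1f}%")
```

The overall and accuracy gains reproduce the +30.0% and +10.4% reported in the abstract, while the same calculation shows completeness falling by roughly half relative to GPT-5.2.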

What holds up

The page-level visual retrieval strategy is well motivated and technically sound for preserving clinically relevant structures. The paper correctly identifies that guidelines carry layout information, including tables and flowcharts, that text extraction destroys, and it implements retrieval with ColQwen2.5 and FAISS indexing, with no OCR pipeline. The ablation studies show that reranking is critical for performance: removing it on the hard subset causes accuracy to drop sharply from 0.6576 to 0.4461 (−0.2115), confirming that evidence selection quality drives correctness. The experimental setup is transparent, using deterministic filtering rules for dataset construction (keyword matching against 16 ophthalmology terms) and consistent evaluation protocols across all baselines with the official HealthBench grader.

“We intentionally did not do any text extraction or structure parsing in order to preserve the original layout information like tables, flow charts, staging criteria”

Chen et al. (this paper) · Section 3.3

“removing reranking reduces the overall score from 0.3861 to 0.2817 (−0.1044), and accuracy drops sharply from 0.6576 to 0.4461 (−0.2115)”

Chen et al. (this paper) · Section 5.4

Main concerns

The most significant issue is the severe completeness failure on clinically challenging cases. On HealthBench Hard, the completeness score drops to 0.0483, compared with 0.0971 for GPT-5.2. The authors attribute this to a precision-completeness trade-off: while reranking and filtering 'improve accuracy by removing noise, they may also discard partially relevant evidence, resulting in concise but incomplete replies.' This raises a safety concern for clinical deployment, where partial guidance could lead to missed treatment steps. The routing mechanism is also unreliable: by the authors' own account it can 'sometimes take a wrong turn, either adding extraneous retrieval or omitting required evidence', which makes failure modes hard to predict. Furthermore, the guideline corpus is limited to 305 documents from Medlive (Yimaitong), which may leave coverage gaps for rare diseases or recent updates.

“The routing scheme can also be unreliable and sometimes take a wrong turn, either adding extraneous retrieval or omitting required evidence”

Chen et al. (this paper) · Section 6

“reranking and LLM-based filtering improve accuracy by removing noise, they may also discard partially relevant evidence, resulting in concise but incomplete replies”

Chen et al. (this paper) · Section 6
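
The precision-completeness trade-off described here can be illustrated with a toy example. The sketch below is not the paper's filter: the candidate scores, relevance labels, and thresholds are invented purely for illustration, and a simple score cutoff stands in for the reranking and LLM-based filtering steps.

```python
from typing import List, Tuple

# Hypothetical (reranker score, is_relevant) pairs for retrieved guideline pages.
CANDIDATES: List[Tuple[float, bool]] = [
    (0.95, True),
    (0.80, True),
    (0.60, True),
    (0.55, False),
    (0.40, True),
    (0.20, False),
]


def precision_recall(candidates: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """Keep candidates scoring at or above `threshold`; return (precision, recall)."""
    kept = [relevant for score, relevant in candidates if score >= threshold]
    total_relevant = sum(relevant for _, relevant in candidates)
    precision = sum(kept) / len(kept) if kept else 1.0
    recall = sum(kept) / total_relevant if total_relevant else 1.0
    return precision, recall


for threshold in (0.3, 0.7):
    precision, recall = precision_recall(CANDIDATES, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With the lenient cutoff, every relevant page survives but one irrelevant page comes along with them. With the strict cutoff, everything kept is relevant but half of the relevant pages are dropped, which is the pattern of accurate but incomplete answers the authors report.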

Evidence and comparison

The evidence supports the claim that visual page retrieval improves accuracy on hard questions requiring precise thresholds, with improvements of +0.0620 over GPT-5.2 and +0.1289 over GPT-5.4 on the hard subset. However, the comparison is partially confounded because the system uses GPT-5.2 as its underlying generator with added retrieval infrastructure, making the comparison against GPT-5.2 essentially an ablation of the RAG pipeline rather than a comparison of independent systems. The claim of being 'more effective on challenging cases' holds for accuracy but fails for completeness, which the authors acknowledge represents 'conservative or partial use of retrieved evidence.' The comparison to HealthBench's reported ceiling is favorable, but the ophthalmology subset represents only 16 of 5000 total conversations, and the system's architecture is heavily tuned for guideline-based QA with offline corpora.

“HealthBench Hard, on which no model we evaluated scored above 32%”

Arora et al., HealthBench · Abstract

“Our approach performs better in cases where there are clear evidential reasons to be able to reason precisely, while it continues to generate incomplete responses possibly because of its conservative use of evidence”

Chen et al. (this paper) · Section 5.5

Reproducibility

The paper provides good experimental transparency with deterministic filtering rules for the HealthBench subset (keyword matching against 16 ophthalmology terms) and detailed hyperparameters including ColQwen2.5 for encoding, FAISS IndexFlatL2 for indexing, and GPT-5.2 for relevance filtering. The code repository is publicly linked (https://github.com/Suey419/Oph-Guid-RAG). However, reproduction faces significant barriers: the 305 guideline PDFs from Medlive (Yimaitong) may not be freely redistributable, and exact prompt templates for routing, query rewriting, and synthesis are not provided in the paper. Additionally, the system requires access to GPT-5.2/5.4 for filtering and routing decisions (closed API), and the specific rendering pipeline (720 DPI, 5390×7940 pixel standardization) must be exactly replicated to ensure consistent visual embeddings. Without the proprietary guideline corpus and exact LLM prompts, independent reproduction of the retrieval behavior is severely constrained.

“We extracted for every example the question text from the first available text field, lowercased the string, and matched against an ophthalmology keyword list which was pre-defined by us”

Chen et al. (this paper) · Section 4.1
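
The filtering rule quoted above is simple enough to sketch. The version below is a minimal illustration under stated assumptions, not the authors' code: the keyword tuple is a placeholder (the paper's 16-term list is not reproduced in this review), the field names are hypothetical, and plain substring matching is assumed because the quote does not specify the matching rule.

```python
from typing import Any, Dict, Optional

# Placeholder keywords for illustration only; NOT the paper's 16-term list.
OPHTHALMOLOGY_KEYWORDS = ("glaucoma", "cataract", "retina")

# Hypothetical field names; the paper says only "the first available text field".
TEXT_FIELDS = ("question", "prompt", "content")


def first_text_field(example: Dict[str, Any]) -> Optional[str]:
    """Return the first non-empty string among the candidate fields, if any."""
    for field in TEXT_FIELDS:
        value = example.get(field)
        if isinstance(value, str) and value.strip():
            return value
    return None


def is_ophthalmology(example: Dict[str, Any]) -> bool:
    """Lowercase the question text and test it against the keyword list."""
    text = first_text_field(example)
    if text is None:
        return False
    lowered = text.lower()
    # Substring matching is an assumption; whole-word matching would behave differently.
    return any(keyword in lowered for keyword in OPHTHALMOLOGY_KEYWORDS)


examples = [
    {"question": "How is open-angle Glaucoma staged?"},
    {"prompt": "What is the first-line treatment for hypertension?"},
    {"question": "", "content": "Post-operative care after cataract surgery"},
]
subset = [example for example in examples if is_ophthalmology(example)]
print(len(subset))
```

Because the rule is deterministic, anyone with the same keyword list and the same benchmark file should recover the same subset, which is what makes this part of the setup reproducible.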

“The mean page size for the guideline PDFs was 5908 x 8063 pixels, and we rendered every page to a high resolution image at 720 DPI”

Chen et al. (this paper) · Section 3.3
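
For anyone trying to replicate the rendering step, the relationship between DPI and pixel dimensions is worth making explicit. The sketch below relies only on the fact that the default PDF user-space unit is 1/72 inch. The helper names are illustrative, and the renderer the authors actually used is not stated in the quoted text.

```python
from typing import Tuple

PDF_POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def points_to_pixels(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page of the given size (in points) rendered at `dpi`."""
    scale = dpi / PDF_POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def pixels_to_points(width_px: float, height_px: float, dpi: int) -> Tuple[float, float]:
    """Inverse mapping: page size in points implied by a raster size at `dpi`."""
    scale = PDF_POINTS_PER_INCH / dpi
    return width_px * scale, height_px * scale


# An A4 page is approximately 595 x 842 points.
print(points_to_pixels(595, 842, 720))

# The mean raster size quoted from Section 3.3, mapped back to points.
print(pixels_to_points(5908, 8063, 720))
```

At 720 DPI each point becomes ten pixels, so the quoted mean of 5908 x 8063 pixels corresponds to pages of about 590.8 x 806.3 points, close to common A4 and Letter sizes. A reproduction that renders at a different DPI would feed the encoder images of a different resolution.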

Abstract

In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while pointing out that further work is needed to be more complete.
