Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support
This paper presents Oph-Guid-RAG, a multimodal retrieval-augmented generation system for ophthalmic clinical decision support. Unlike conventional text-based RAG systems, it retrieves full guideline page images using ColQwen2.5, preserving tables, flowcharts, and layout information without OCR errors. The system introduces a controllable retrieval framework with routing and filtering to selectively introduce external evidence, and is evaluated on a specialized ophthalmology subset extracted from HealthBench.
The paper proposes a visually grounded RAG approach for clinical guidelines that delivers strong accuracy gains on challenging cases but exhibits a problematic completeness trade-off. On HealthBench Hard, the system improves the overall score by 30.0% over GPT-5.2 (from 0.2969 to 0.3861) and accuracy by 10.4%, yet completeness drops to 0.0483 on the hard subset. This suggests that aggressive filtering discards partially relevant evidence and yields incomplete responses that may be clinically unsafe. While the page-level visual retrieval strategy is technically sound and the ablation studies support the importance of reranking, the conservative use of evidence is a liability for clinical deployment, where incomplete guidance could be harmful.
The page-level visual retrieval strategy is well motivated and technically sound for preserving clinically relevant structures. The paper correctly identifies that guidelines carry layout information, including tables and flowcharts, that text extraction destroys, and it implements retrieval with ColQwen2.5 and FAISS indexing, with no OCR pipeline. The ablation studies demonstrate that reranking is critical for performance: removing it on the hard subset causes accuracy to drop sharply from 0.6576 to 0.4461 (−0.2115), confirming that evidence selection quality drives correctness. The experimental setup is transparent, using deterministic filtering rules for dataset construction (keyword matching against 16 ophthalmology terms) and consistent evaluation protocols across all baselines with the official HealthBench grader.
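The deterministic keyword filter mentioned above might look roughly like the following. This is a minimal sketch, not the authors' code: the paper's 16 ophthalmology terms are not listed in this review, so `KEYWORDS` holds three placeholder terms, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Hypothetical keyword list. The paper uses 16 ophthalmology terms that are
# not reproduced in this review, so these three are placeholders only.
KEYWORDS = ["glaucoma", "retina", "cataract"]

# One case-insensitive pattern with word boundaries, compiled once.
# Whole-word matching is an assumption; the paper may match substrings.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_ophthalmology(conversation_text: str) -> bool:
    """Return True if any keyword occurs as a whole word in the text."""
    return PATTERN.search(conversation_text) is not None


def filter_subset(conversations: list[str]) -> list[str]:
    """Keep matching conversations in input order, so the result is deterministic."""
    return [c for c in conversations if is_ophthalmology(c)]


sample = [
    "My glaucoma drops sting",
    "Knee pain after running",
    "CATARACT surgery recovery",
]
subset = filter_subset(sample)
```

Because the rule depends only on the text and a fixed term list, rerunning it on the same HealthBench dump should yield the same subset, which is what makes this step reproducible.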
The most significant issue is the severe completeness failure on clinically challenging cases. On HealthBench Hard, the completeness score drops to 0.0483, compared to 0.0971 for GPT-5.2. The authors attribute this to a precision-completeness trade-off: while reranking and filtering 'improve accuracy by removing noise, they may also discard partially relevant evidence, resulting in concise but incomplete replies.' This creates a safety concern for clinical deployment, where partial guidance could lead to missed treatment steps. The routing mechanism is also unreliable; by the authors' own account it can 'sometimes take a wrong turn, either adding extraneous retrieval or omitting required evidence', which makes failure modes unpredictable. Furthermore, the guideline corpus is limited to 305 documents from Medlive (Yimaitong), leaving coverage gaps for rare diseases and recent updates.
The evidence supports the claim that visual page retrieval improves accuracy on hard questions requiring precise thresholds, with improvements of +0.0620 over GPT-5.2 and +0.1289 over GPT-5.4 on the hard subset. However, the comparison is partially confounded because the system uses GPT-5.2 as its underlying generator with added retrieval infrastructure, making the comparison against GPT-5.2 essentially an ablation of the RAG pipeline rather than a comparison of independent systems. The claim of being 'more effective on challenging cases' holds for accuracy but fails for completeness, which the authors acknowledge represents 'conservative or partial use of retrieved evidence.' The comparison to HealthBench's reported ceiling is favorable, but the ophthalmology subset represents only 16 of 5000 total conversations, and the system's architecture is heavily tuned for guideline-based QA with offline corpora.
The paper provides good experimental transparency with deterministic filtering rules for the HealthBench subset (keyword matching against 16 ophthalmology terms) and detailed hyperparameters including ColQwen2.5 for encoding, FAISS IndexFlatL2 for indexing, and GPT-5.2 for relevance filtering. The code repository is publicly linked (https://github.com/Suey419/Oph-Guid-RAG). However, reproduction faces significant barriers: the 305 guideline PDFs from Medlive (Yimaitong) may not be freely redistributable, and exact prompt templates for routing, query rewriting, and synthesis are not provided in the paper. Additionally, the system requires access to GPT-5.2/5.4 for filtering and routing decisions (closed API), and the specific rendering pipeline (720 DPI, 5390×7940 pixel standardization) must be exactly replicated to ensure consistent visual embeddings. Without the proprietary guideline corpus and exact LLM prompts, independent reproduction of the retrieval behavior is severely constrained.
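For readers unfamiliar with a flat index, the sketch below shows in plain Python what an exhaustive L2 search does: every stored vector is compared against the query and the closest ones are returned. It is illustrative only. The function names, the toy two-dimensional vectors, and the use of a single vector per page are assumptions; how the paper maps ColQwen2.5 page embeddings into the index is not described in this review.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    query: list[float],
    page_vectors: dict[str, list[float]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Score every page against the query and return the k closest
    as (squared distance, page_id) pairs, nearest first."""
    scored = (
        (squared_l2(query, vec), page_id)
        for page_id, vec in page_vectors.items()
    )
    return heapq.nsmallest(k, scored)


# Toy data: hypothetical page identifiers with 2-d vectors.
pages = {
    "guideline_A_p1": [0.0, 0.0],
    "guideline_A_p2": [1.0, 1.0],
    "guideline_B_p7": [3.0, 4.0],
}
hits = flat_l2_search([0.9, 1.1], pages, k=2)
top_ids = [page_id for _, page_id in hits]
```

Because a flat search is exact, differences between two reproductions would come from the embeddings themselves (and hence from the rendering pipeline), not from the index.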
In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while noting that further work is needed to improve completeness.
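The gains quoted in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the absolute and relative improvements from the reported scores. The helper `gain` is introduced here for illustration, and the GPT-5.4 accuracy baseline of 0.5287 is not stated in the text; it is derived as 0.6576 − 0.1289.

```python
def gain(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent), rounded as reported."""
    diff = ours - baseline
    return round(diff, 4), round(100 * diff / baseline, 1)


# Scores on the hard subset, as quoted in the abstract.
overall_vs_gpt52 = gain(0.2969, 0.3861)
accuracy_vs_gpt52 = gain(0.5956, 0.6576)

# Derived baseline (assumption): 0.6576 - 0.1289 = 0.5287.
accuracy_vs_gpt54 = gain(0.5287, 0.6576)

print(overall_vs_gpt52, accuracy_vs_gpt52, accuracy_vs_gpt54)
```

All three results match the figures in the abstract (+0.0892 and +30.0%, +0.0620 and +10.4%, +0.1289 and +24.4%), so the reported percentages are relative gains over each baseline rather than percentage-point differences.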