CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation

cs.CV · cs.AI · cs.DB · cs.LG · cs.RO
Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab, Michael G. Morley, Michael V. Boland, Michael M. Lin, John B. Miller, David S. Friedman, Nazlee Zebardast, Lucia Sobrin, Tobias Elze · Mar 23, 2026
Plain-language introduction

CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.

Critical review
Verdict
Bottom line

The paper presents a competent domain adaptation of SAM-2 for ophthalmic surgery with strong quantitative results on standard benchmarks. The CaDIS external validation and the release of code and pretrained weights are concrete strengths. However, the cross-procedure generalization claim relies solely on qualitative inspection of two YouTube videos, and the comparison omits RP-SAM2, a concurrent method reporting superior results on the same Cataract-1K dataset using a lightweight shift-block architecture. The frozen image encoder also limits adaptation to domain-specific visual phenomena such as transparent tissues and specular reflections.

“Video GL1: Trabeculectomy for Treatment of Glaucoma – Edited Surgery Pearls. Video GL2: Trabeculectomy – Glaucoma Surgery with Mitomycin-C.”
paper · Section III-B4
“First, dataset diversity is constrained. Cataract-1K consists primarily of cataract surgeries from a single clinical sites, reducing exposure to variability in imaging conditions, surgical styles, and patient anatomy.”
paper · Section V
What holds up

The quantitative evaluation on Cataract-1K and CaDIS is rigorous. CataractSAM-2 achieves IoU of 0.88–0.95 on CaDIS and maintains consistent performance across 25 videos. The inference speed of 15 FPS on an NVIDIA A100 GPU meets near-real-time requirements for surgical applications. The interactive annotation framework, which propagates masks from sparse prompts via SAM2VideoPredictor, is well-motivated and technically sound. The external validation on CaDIS—using only the 12 overlapping classes—demonstrates generalization beyond the training domain.

“Across the 25 CaDIS videos, performance was consistent with IoU ranging from 0.88–0.95, and PAC from 0.96–0.99.”
paper · Section III-B3
“CataractSAM-2 achieves real-time performance with an inference speed of 15 frames per second (FPS) in binary segmentation, making it suitable for intraoperative deployment (NVIDIA A100 GPU).”
paper · Section III-B2
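For readers who want to check what the quoted figures measure, the following is a minimal sketch of the three overlap metrics mentioned in this review for binary masks. It is not the paper's evaluation code. It assumes that "PAC" denotes pixel accuracy, that masks are plain nested lists of 0/1 values, and that two empty masks count as a perfect match; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = foreground, 0 = background


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) counted over every pixel of two same-shaped masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of rows")
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        if len(pred_row) != len(truth_row):
            raise ValueError("masks must have the same number of columns")
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union: tp / (tp + fp + fn)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 1.0  # assumption: empty vs. empty is a match


def dice(pred: Mask, truth: Mask) -> float:
    """Dice coefficient: 2 * tp / (2 * tp + fp + fn)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def pixel_accuracy(pred: Mask, truth: Mask) -> float:
    """Fraction of pixels labeled correctly: (tp + tn) / total."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 1.0


# Toy 2x3 example: two pixels agree on foreground, one false positive,
# one false negative, two agree on background.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, truth))  # 0.5
```

Pixel accuracy counts true negatives, so it is usually higher than IoU when the background dominates the frame. That is consistent with the reported PAC range sitting above the IoU range.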
Main concerns

First, the comparison with related work is incomplete. The authors note that RP-SAM2 is not publicly available at this time, yet RP-SAM2 (concurrent work on the same dataset) reports a $2\%$ mDSC gain and a $21.36\%$ reduction in mHD95 using a lightweight shift-block to stabilize point prompts, which suggests CataractSAM-2 may not represent the state of the art on Cataract-1K. Second, the cross-procedure generalization claim rests on qualitative evaluation of only two YouTube videos (GL1 and GL2) without quantitative metrics or ground truth, making the claim of "strong zero-shot generalization" unsubstantiated. Third, the frozen image encoder limits adaptation to ophthalmic-specific challenges; the authors themselves suggest that "lightweight tuning of encoder layers or pretraining on ophthalmic data may improve segmentation fidelity in visually complex scenarios." Fourth, the annotation tool's 4 FPS speed contrasts with the 15 FPS inference speed, limiting its utility for real-time annotation workflows.

“RP-SAM2 is not publicly available at this time”
paper · Section II-B
“Second, the image encoder remains frozen during fine-tuning, limiting adaptation to ophthalmic-specific features such as glare and transparent tissues.”
paper · Section V
“The system achieves a segmentation speed of approximately 4 frames per second (FPS) on a standard GPU, enabling efficient mask propagation across surgical videos with low latency.”
paper · Section IV
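To make the gap between the two quoted speeds concrete, here is a small worked calculation. Only the 4 FPS and 15 FPS figures come from the paper; the 10-minute clip length and the 30 FPS capture rate are illustrative assumptions, and the helper names are hypothetical.

```python
def processing_minutes(num_frames: int, processing_fps: float) -> float:
    """Wall-clock minutes needed to process num_frames at a given throughput."""
    if processing_fps <= 0:
        raise ValueError("processing_fps must be positive")
    return num_frames / processing_fps / 60.0


def real_time_factor(processing_fps: float, capture_fps: float) -> float:
    """Ratio of processing speed to capture speed; 1.0 means exactly real time."""
    if capture_fps <= 0:
        raise ValueError("capture_fps must be positive")
    return processing_fps / capture_fps


# Assumptions for illustration only: a 10-minute clip recorded at 30 FPS.
CAPTURE_FPS = 30.0
CLIP_MINUTES = 10
num_frames = int(CLIP_MINUTES * 60 * CAPTURE_FPS)  # 18000 frames

annotation_minutes = processing_minutes(num_frames, 4.0)   # annotation tool
inference_minutes = processing_minutes(num_frames, 15.0)   # binary inference

print(annotation_minutes)  # 75.0
print(inference_minutes)   # 20.0
print(real_time_factor(15.0, CAPTURE_FPS))  # 0.5
```

Under these assumptions, neither figure keeps pace with the capture rate if every frame is processed. Whether that matters depends on frame subsampling and on the actual recording rate, neither of which is stated in the excerpts quoted here.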
Evidence and comparison

The evidence supports the primary claim that domain-adapted SAM-2 outperforms zero-shot baselines (SAM-2, MedSAM-2, SurgSAM-2) on Cataract-1K. The comparisons are fair for available methods, though SurgSAM-2 was designed for laparoscopy (EndoVis17/18), not ophthalmology, making it a mismatched baseline. The CaDIS validation strengthens external validity. However, the paper's claim of "first domain-adapted extension of SAM-2 for anterior segment ophthalmic surgery" is debatable given RP-SAM2's concurrent submission and similar scope. The qualitative trabeculectomy results show some oversegmentation in high-reflectivity zones, which the authors acknowledge, but without IoU or Dice scores, the generalization claim remains speculative. The "85–89\% phase-recognition accuracy" cited in support of cross-procedure trends is a phase-recognition result, not a segmentation result, so the citation conflates two different tasks.

“some masks slightly exceed object boundaries, indicating minor oversegmentation in high-reflectivity zones”
paper · Section III-B4
“Comparable cross-procedure trends appear in ophthalmic workflow datasets, where models trained on standard cataract techniques retain approximately 85–89\% phase-recognition accuracy on small-incision cataract surgery”
paper · Section III-B4
Reproducibility

The paper commits to open-source release with model weights, inference scripts, and annotation notebooks on GitHub and Hugging Face (links anonymized in the preprint). Training configuration details include the AdamW optimizer with learning rate $0.0001$, weight decay $10^{-4}$, gradient accumulation every 4 steps, and step-based scheduling with decay every 500 steps. However, critical details are missing: the exact batch size, the number of training iterations, the specific data augmentation pipelines, and whether the reported IoU uses single-point prompts, box prompts, or ground-truth initialization. The binary mask setup merges all 12 classes into a single foreground class, which simplifies evaluation but limits usefulness for multi-class instrument tracking. The warm-up phase on the first 5 frames is mentioned but not described in algorithmic detail.

“The following training configuration was used to balance accuracy and efficiency: AdamW optimizer (learning rate = 0.0001, weight decay = 1e-4); mixed-precision training with PyTorch AMP; gradient accumulation every four steps to simulate larger batch sizes on limited hardware; step-based learning rate scheduler with decay every 500 steps; and a warm-up phase on the first 5 frames of each video to stabilize early predictions.”
paper · Section II-B
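The quoted configuration leaves some quantities unspecified, which the following sketch makes explicit. It is an illustration under stated assumptions, not the authors' training code. The decay factor `gamma` is hypothetical because the paper excerpt does not give one, and it is also an assumption that the "500 steps" are counted in optimizer updates rather than in micro-batches.

```python
def step_lr(optimizer_step: int,
            base_lr: float = 1e-4,   # from the paper
            step_size: int = 500,    # from the paper
            gamma: float = 0.5) -> float:  # hypothetical: not stated in the paper
    """Step-decay schedule: multiply the rate by gamma every step_size updates."""
    if optimizer_step < 0:
        raise ValueError("optimizer_step must be non-negative")
    return base_lr * gamma ** (optimizer_step // step_size)


def optimizer_updates(num_micro_batches: int, accumulation: int = 4) -> int:
    """Number of complete optimizer updates when gradients are accumulated
    over `accumulation` micro-batches before each update (4 in the paper)."""
    if accumulation <= 0:
        raise ValueError("accumulation must be positive")
    return num_micro_batches // accumulation


def effective_batch_size(micro_batch_size: int, accumulation: int = 4) -> int:
    """Samples contributing to one update; the micro-batch size is not reported."""
    return micro_batch_size * accumulation


# With accumulation of 4, the first decay would occur only after
# 500 * 4 = 2000 micro-batches (if steps are counted in optimizer updates).
print(optimizer_updates(2000))  # 500
print(step_lr(499), step_lr(500))
```

The sketch shows why the missing batch size and iteration count matter for reproduction. Without them, neither the effective batch size nor the number of decay events during training can be recovered from the reported settings.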
Abstract

We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.
