CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation
CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.
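The selective fine-tuning described above can be illustrated with a minimal sketch. It assumes parameters are exposed as `(name, parameter)` pairs with a `requires_grad` flag, which is the shape PyTorch's `model.named_parameters()` provides. The prefixes `prompt_encoder.` and `mask_decoder.` and the toy parameter names are hypothetical placeholders, not the submodule names used in the authors' code.

```python
from types import SimpleNamespace
from typing import Iterable, List, Sequence, Tuple

# ASSUMPTION: hypothetical name prefixes for the two trainable components.
# The real submodule names must be checked against the actual model.
TRAINABLE_PREFIXES = ("prompt_encoder.", "mask_decoder.")


def freeze_all_but(
    named_params: Iterable[Tuple[str, object]],
    trainable_prefixes: Sequence[str] = TRAINABLE_PREFIXES,
) -> List[str]:
    """Enable gradients only for parameters whose name starts with a
    trainable prefix; freeze everything else (e.g. the image encoder).

    Returns the names of the parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in named_params:
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without a deep-learning framework.
toy_params = [
    ("image_encoder.block0.weight", SimpleNamespace(requires_grad=True)),
    ("prompt_encoder.point_embed.weight", SimpleNamespace(requires_grad=True)),
    ("mask_decoder.head.bias", SimpleNamespace(requires_grad=True)),
]
trainable_names = freeze_all_but(toy_params)
print(trainable_names)
```

In a real training script, only the parameters left trainable would be handed to the optimizer, so the frozen encoder contributes no gradient or optimizer-state memory.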
The paper presents a competent domain adaptation of SAM-2 for ophthalmic surgery with strong quantitative results on standard benchmarks. The CaDIS external validation and the release of code and pretrained weights are concrete strengths. However, the cross-procedure generalization claim relies solely on qualitative inspection of YouTube videos, and the comparison omits RP-SAM2—a contemporary method reporting superior results on the same Cataract-1K dataset using a lightweight shift-block architecture. The frozen image encoder limits adaptation to domain-specific visual phenomena like transparent tissues and specular reflections.
The quantitative evaluation on Cataract-1K and CaDIS is rigorous. CataractSAM-2 achieves IoU of 0.88–0.95 on CaDIS and maintains consistent performance across 25 videos. The inference speed of 15 FPS on an NVIDIA A100 GPU meets near-real-time requirements for surgical applications. The interactive annotation framework, which propagates masks from sparse prompts via SAM2VideoPredictor, is well-motivated and technically sound. The external validation on CaDIS—using only the 12 overlapping classes—demonstrates generalization beyond the training domain.
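For readers less familiar with the reported metrics, the following minimal sketch computes IoU and Dice for a pair of binary masks. It assumes masks are nested lists of 0/1 values with identical shape, and it treats two empty masks as perfect agreement, which is a convention rather than something the paper specifies. The paper's exact evaluation protocol (per-frame versus per-video averaging, prompt type) is not stated in the review.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]


def iou_and_dice(pred: Mask, target: Mask) -> Tuple[float, float]:
    """Return (IoU, Dice) for two same-shaped binary masks.

    Any nonzero pixel counts as foreground.
    """
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_fg, t_fg = bool(p), bool(t)
            intersection += p_fg and t_fg
            pred_area += p_fg
            target_area += t_fg
    union = pred_area + target_area - intersection
    if union == 0:
        # ASSUMPTION: both masks empty is scored as perfect agreement.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
print(f"IoU={iou:.3f}, Dice={dice:.3f}")
```

For a single pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. This matters when comparing IoU figures from one paper against Dice figures from another.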
First, the comparison with related work is incomplete. The authors note that RP-SAM2 is not publicly available at this time, yet RP-SAM2 (concurrent work on the same dataset) reports a $2\%$ mDSC gain and a $21.36\%$ reduction in mHD95 using a lightweight shift-block to stabilize point prompts, which suggests that CataractSAM-2 may not represent the state of the art on Cataract-1K. Second, the cross-procedure generalization claim rests on qualitative evaluation of only two YouTube videos (GL1 and GL2) without quantitative metrics or ground truth, making the claim of "strong zero-shot generalization" unsubstantiated. Third, the frozen image encoder limits adaptation to ophthalmic-specific challenges, and the paper itself suggests that "lightweight tuning of encoder layers or pretraining on ophthalmic data may improve segmentation fidelity in visually complex scenarios." Fourth, the annotation tool runs at 4 FPS, well below the 15 FPS inference speed, which limits its utility for real-time annotation workflows.
The evidence supports the primary claim that domain-adapted SAM-2 outperforms zero-shot baselines (SAM-2, MedSAM-2, SurgSAM-2) on Cataract-1K. The comparisons are fair for available methods, though SurgSAM-2 was designed for laparoscopy (EndoVis17/18), not ophthalmology, making it a mismatched baseline. The CaDIS validation strengthens external validity. However, the paper's claim of "first domain-adapted extension of SAM-2 for anterior segment ophthalmic surgery" is debatable given RP-SAM2's concurrent submission and similar scope. The qualitative trabeculectomy results show some oversegmentation in high-reflectivity zones, which the authors acknowledge, but without IoU or Dice scores, the generalization claim remains speculative. The "85–89\% phase-recognition accuracy" cited in support of cross-procedure trends measures phase recognition, not segmentation, so using it here conflates two different tasks.
The paper commits to open-source release with model weights, inference scripts, and annotation notebooks on GitHub and Hugging Face (links anonymized in the preprint). Training configuration details include the AdamW optimizer with learning rate $10^{-4}$, weight decay $10^{-4}$, gradient accumulation every 4 steps, and step-based scheduling with decay every 500 steps. However, critical details are missing: the exact batch size, the number of training iterations, the specific data augmentation pipelines, and whether the reported IoU uses single-point prompts, box prompts, or ground-truth initialization. The binary mask setup merges all 12 classes into a single foreground class, which simplifies evaluation but limits usefulness for multi-class instrument tracking. The warm-up phase on the first 5 frames is mentioned but not described in algorithmic detail.
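To make the reported hyperparameters concrete, the sketch below works through the step-decay schedule and the gradient-accumulation arithmetic. The base learning rate, decay interval, and accumulation count come from the review. The decay factor `GAMMA` is not reported and is an assumed placeholder, as is the choice to count decay in optimizer steps rather than raw iterations.

```python
BASE_LR = 1e-4      # reported learning rate
DECAY_EVERY = 500   # reported: decay every 500 steps
ACCUM_STEPS = 4     # reported: gradient accumulation every 4 steps
GAMMA = 0.5         # ASSUMPTION: decay factor is not reported


def lr_at(optimizer_step: int, base_lr: float = BASE_LR,
          decay_every: int = DECAY_EVERY, gamma: float = GAMMA) -> float:
    """Step-decay schedule: multiply the rate by gamma every decay_every steps.

    ASSUMPTION: the 500-step interval counts optimizer updates.
    """
    return base_lr * gamma ** (optimizer_step // decay_every)


def optimizer_updates(micro_batches: int, accum_steps: int = ACCUM_STEPS) -> int:
    """Number of weight updates after processing micro_batches batches."""
    return micro_batches // accum_steps


def effective_batch_size(per_step_batch: int,
                         accum_steps: int = ACCUM_STEPS) -> int:
    """Samples contributing to each weight update under accumulation."""
    return per_step_batch * accum_steps


for step in (0, 499, 500, 1000):
    print(step, lr_at(step))
print(optimizer_updates(2000), effective_batch_size(8))
```

The last line shows why the missing batch size matters for reproducibility: with accumulation over 4 steps, the effective batch is four times whatever per-step batch the authors used, and the schedule's real length in samples seen depends on both numbers.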
We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.