Feed - arxlens

Leveraging GCN-based Action Recognition for Teleoperation in Daily Activity Assistance

cs.RO cs.HC Thomas M. Kwok, Jiaan Li, Yue Hu · Apr 9, 2025

Remote caregiving robots could help older adults age in place, but conventional motion-mapping teleoperation forces caregivers into slow, unnatural postures that cause muscle fatigue. This paper proposes a lightweight Spatio-Temporal Graph Convolutional Network (S-ST-GCN) that recognizes four single-arm daily-living actions from an RGB camera and triggers preset robot trajectories, decoupling the operator's motion from the robot's execution. A finite-state machine filters misclassifications before execution. While still a proof-of-concept, the framework offers a more intuitive alternative to direct motion-tracking teleoperation for assisting with activities of daily living.

Caregiving of older adults is an urgent global challenge, with many older adults preferring to age in place rather than enter residential care. However, providing adequate home-based assistance remains difficult, particularly in geographically vast regions. Teleoperated robots offer a promising solution, but conventional motion-mapping teleoperation imposes unnatural movement constraints on operators, leading to muscle fatigue and reduced usability. This paper presents a novel teleoperation framework that leverages action recognition to enable intuitive remote robot control. Using our simplified Spatio-Temporal Graph Convolutional Network (S-ST-GCN), the system recognizes human actions and executes corresponding preset robot trajectories, eliminating the need for direct motion synchronization. A finite-state machine (FSM) is integrated to enhance reliability by filtering out misclassified actions. Our experiments demonstrate that the proposed framework enables effortless operator movement while ensuring accurate robot execution. This proof-of-concept study highlights the potential of teleoperation with action recognition for enabling caregivers to remotely assist older adults during activities of daily living (ADLs). Future work will focus on improving the S-ST-GCN's recognition accuracy and generalization, integrating advanced motion planning techniques to further enhance robotic autonomy in older adult care, and conducting a user study to evaluate the system's telepresence and ease of control.
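
The abstract does not describe the FSM's states or transitions. As a minimal sketch of one way such a filter could work, the hypothetical `ActionFilterFSM` below triggers a preset trajectory only after the classifier emits the same label for `confirm_frames` consecutive frames, and ignores predictions while a trajectory is running. The state names, the threshold, and the action labels are assumptions for illustration, not the paper's design.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()        # waiting for a stable prediction
    CONFIRMING = auto()  # counting consecutive identical labels
    EXECUTING = auto()   # robot is running a preset trajectory


class ActionFilterFSM:
    """Hypothetical debounce-style filter; not the paper's actual FSM."""

    def __init__(self, confirm_frames=3):
        self.confirm_frames = confirm_frames
        self.state = State.IDLE
        self.candidate = None
        self.count = 0

    def step(self, label):
        """Feed one per-frame label; return an action to trigger, or None."""
        if self.state is State.EXECUTING:
            return None  # ignore predictions while the trajectory runs
        if label is None:
            self._reset()
            return None
        if label == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = label, 1
        self.state = State.CONFIRMING
        if self.count >= self.confirm_frames:
            action = self.candidate
            self._reset()
            self.state = State.EXECUTING
            return action
        return None

    def trajectory_done(self):
        """Called by the robot side when the preset trajectory finishes."""
        self._reset()

    def _reset(self):
        self.state = State.IDLE
        self.candidate = None
        self.count = 0


# Assumed labels; a single stray "wave" restarts the count instead of firing.
fsm = ActionFilterFSM(confirm_frames=3)
stream = ["drink", "wave", "drink", "drink", "drink", "wave"]
triggered = [fsm.step(label) for label in stream]
```

Here only the fifth frame triggers an action, because it completes three consecutive `"drink"` predictions; the isolated `"wave"` frames are dropped. A larger `confirm_frames` trades responsiveness for fewer false triggers.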

DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

cs.CV cs.RO Zhide Zhong, Junfeng Li, Junjie He et al. · Mar 23, 2026

Vision-Language-Action models excel at direct visuomotor mapping but struggle with tasks requiring both fine-grained 3D spatial understanding and long-horizon logical planning. DualCoT-VLA proposes a parallel dual-stream reasoning mechanism that processes visual Chain-of-Thought for spatial perception and linguistic Chain-of-Thought for task planning simultaneously in latent space, using learnable query tokens to bypass autoregressive decoding and achieve single-step inference.

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.

6D Robotic OCT Scanning of Curved Tissue Surfaces

cs.CV cs.RO Suresh Guttikonda, Maximilian Neidhardt, Vidas Raudonis et al. · Mar 23, 2026

This paper tackles robotic optical coherence tomography (OCT) scanning of curved tissue surfaces, addressing the limitation that existing approaches restrict motion to pure translations to avoid challenging hand-eye calibration. The core contribution is a custom ChArUco calibration pattern enabling full six-degree-of-freedom hand-eye calibration, allowing the OCT probe to rotate and follow curved surfaces. This matters because pure translational scanning accumulates registration errors on curved geometries, whereas full 6D motion enables accurate, large-area surface reconstruction.

Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning is limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.

FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario

cs.CV cs.GR cs.RO Hang Dai, Hongwei Fan, Han Zhang et al. · Mar 23, 2026

Articulated object reconstruction typically requires either multi-view capture of discrete states or monocular video with a strict static-base-part assumption, limiting practical deployment. FreeArtGS introduces a "free-moving" setting where both joint angles and object poses vary arbitrarily during capture, using only a monocular RGB-D video. The method combines motion-based part segmentation via point tracking priors with joint estimation and 3D Gaussian Splatting optimization to jointly reconstruct geometry, appearance, and articulation.

The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under the free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/

Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition

cs.RO cs.CV Zhiyan Cao, Zhengxi Wu, Yiwei Wang et al. · Mar 22, 2026

Cardiac ultrasound view acquisition is notoriously operator-dependent, limiting reproducibility and access. This paper proposes an anatomical prior (AP)-driven framework that unifies cardiac structure segmentation with autonomous probe adjustment. The core innovation is a spatial-relation graph (SRG) module that injects spatial-topological constraints into YOLO-based segmentation, coupled with an RL formulation where states and rewards are built from quantifiable anatomical features drawn from Gaussian priors. The work matters because it offers an interpretable alternative to black-box end-to-end methods, potentially enabling zero-shot sim-to-real deployment for robotic echocardiography.

Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
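
The abstract says the anatomical feature priors are fitted to Gaussian distributions and that the RL reward reflects how well the current view matches them, but it gives no formula. The sketch below shows one plausible shape for such a reward, assuming independent per-feature Gaussians and a simple average of unnormalized Gaussian scores. The feature names, prior parameters, and the averaging rule are all hypothetical.

```python
import math


def gaussian_match(value, mean, std):
    """Unnormalized Gaussian score in (0, 1]; equals 1 when value == mean."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z)


def ap_reward(features, priors):
    """Average per-feature match between observed features and priors.

    features: dict of feature name -> measured value
    priors:   dict of feature name -> (mean, std) fitted from standard views
    """
    scores = [gaussian_match(features[k], *priors[k]) for k in priors]
    return sum(scores) / len(scores)


# Hypothetical features and prior parameters, for illustration only.
priors = {"lv_area_ratio": (0.30, 0.05), "valve_angle_deg": (45.0, 10.0)}
on_target = {"lv_area_ratio": 0.30, "valve_angle_deg": 45.0}
off_target = {"lv_area_ratio": 0.45, "valve_angle_deg": 80.0}

r_on = ap_reward(on_target, priors)    # every feature sits at its prior mean
r_off = ap_reward(off_target, priors)  # features are about 3 std away
```

A reward of this form is bounded and interpretable per feature, which fits the abstract's emphasis on quantifiable anatomical features; the actual reward used in the paper may differ.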

RAFL: Generalizable Sim-to-Real of Soft Robots with Residual Acceleration Field Learning

cs.RO cs.LG Dong Heon Cho, Boyuan Chen · Mar 23, 2026

Soft robot simulators suffer from a sim-to-real gap that widens when optimizing morphology, because calibration parameters identified on one geometry often fail to transfer to unseen shapes. This paper proposes Residual Acceleration Field Learning (RAFL), which learns local corrective accelerations defined on quadrature elements rather than global nodal forces. By operating on deformation and velocity gradients in material space, the model becomes independent of mesh topology and discretization, enabling zero-shot generalization across geometries.

Differentiable simulators enable gradient-based optimization of soft robots over material parameters, control, and morphology, but accurately modeling real systems remains challenging due to the sim-to-real gap. This issue becomes more pronounced when geometry is itself a design variable. System identification reduces discrepancies by fitting global material parameters to data; however, when constitutive models are misspecified or observations are sparse, identified parameters often absorb geometry-dependent effects rather than reflect intrinsic material behavior. More expressive constitutive models can improve accuracy but substantially increase computational cost, limiting practicality. We propose a residual acceleration field learning (RAFL) framework that augments a base simulator with a transferable, element-level corrective dynamics field. Operating on shared local features, the model is agnostic to global mesh topology and discretization. Trained end-to-end through a differentiable simulator using sparse marker observations, the learned residual generalizes across shapes. In both sim-to-sim and sim-to-real experiments, our method achieves consistent zero-shot improvements on unseen morphologies, while system identification frequently exhibits negative transfer. The framework also supports continual refinement, enabling simulation accuracy to accumulate during morphology optimization.

PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

cs.RO cs.CV Yuheng Ji, Yuyang Liu, Huajie Tan et al. · Mar 23, 2026

PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
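
The macro-consistency property can be illustrated with a small numerical check. Assuming per-step progress is defined as the difference of a scalar potential, $\Phi(x_{t+1}) - \Phi(x_t)$, the increments telescope, so the total depends only on the start and end states and not on the path taken. The potential values and the `regress` quantity below are made up for illustration and are not the paper's OPD metrics.

```python
def step_progress(phi):
    """Per-step increments of a scalar progress potential."""
    return [b - a for a, b in zip(phi, phi[1:])]


# Hypothetical potential values in [0, 1] along two trajectories that
# start and end in the same states but take different paths.
direct = [0.0, 0.25, 0.5, 0.75, 1.0]
detour = [0.0, 0.5, 0.25, 0.5, 1.0]  # regresses once, then recovers

total_direct = sum(step_progress(direct))
total_detour = sum(step_progress(detour))

# A simple process-level signal that the summed outcome hides:
# how much progress was lost along the way.
regress_detour = sum(-d for d in step_progress(detour) if d < 0)
```

Both totals equal $\Phi(x_T) - \Phi(x_0)$, while the per-step increments still expose the regression in the second trajectory. This is the kind of information a binary success rate discards.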

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

cs.CV cs.AI cs.CL Haichao Zhang, Yijiang Li, Shwai He et al. · Mar 23, 2026

ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

CounterScene: Counterfactual Causal Reasoning in Generative World Models for Safety-Critical Closed-Loop Evaluation

cs.RO cs.CV Bowen Jing, Ruiyang Hao, Weitao Zhou et al. · Mar 22, 2026

Existing safety-critical scenario generation methods force collisions through brute-force perturbations, destroying trajectory realism. CounterScene reframes this as a counterfactual inference problem within diffusion-based BEV world models: given a safe scene, identify the single agent whose behavioral change would maximally increase collision risk, then minimally intervene on that agent alone via structured diffusion guidance. This targets the realism-adversarial trade-off by allowing danger to emerge through natural interaction propagation rather than global trajectory distortion.

Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism-adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.

OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

cs.NI cs.CV cs.MM Aizierjiang Aiersilan, Zhangfei Yang · Mar 22, 2026

OrbitStream addresses adaptive 360° video streaming for teleoperation by proposing a training-free framework that combines semantic scene understanding with robust control theory. It formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem where semantic objects (pedestrians, vehicles) generate potential fields that "attract" user gaze with task-relevant mass, while a Saturation-Based Proportional-Derivative (PD) Controller handles bitrate adaptation. This offers an interpretable, zero-shot alternative to black-box Deep Reinforcement Learning methods for safety-critical systems where deployment constraints prohibit lengthy training.

Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (~98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
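
The abstract names a saturation-based PD controller for buffer regulation without giving the control law. As a rough sketch of the general technique, the code below implements a textbook PD controller whose output is clamped to a fixed range. The gains, the target buffer level, and the interpretation of the output as a bitrate adjustment are assumptions, not OrbitStream's actual parameters.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


class SaturatedPD:
    """Textbook PD controller with output saturation (illustrative gains)."""

    def __init__(self, kp, kd, u_min, u_max):
        self.kp, self.kd = kp, kd
        self.u_min, self.u_max = u_min, u_max
        self.prev_error = None

    def update(self, target, measured, dt):
        error = measured - target  # positive when the buffer is above target
        if self.prev_error is None:
            derivative = 0.0  # no history yet on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        u = self.kp * error + self.kd * derivative
        return clamp(u, self.u_min, self.u_max)


# Hypothetical setup: regulate a playback buffer toward 4 s; the output is
# read as a bounded adjustment to the requested bitrate (in Mbps).
pd = SaturatedPD(kp=0.5, kd=0.25, u_min=-1.0, u_max=1.0)
u1 = pd.update(target=4.0, measured=8.0, dt=1.0)  # buffer far above target
u2 = pd.update(target=4.0, measured=5.0, dt=1.0)  # buffer dropping quickly
```

In the first call the raw output (2.0) is saturated to the upper bound, which limits how aggressively the bitrate can rise. In the second call the derivative term outweighs the proportional term, so the controller backs off even though the buffer is still above target.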

GaussianSSC: Triplane-Guided Directional Gaussian Fields for 3D Semantic Completion

cs.RO cs.LG Ruiqi Xian, Jing Liang, He Yin et al. · Mar 23, 2026

GaussianSSC tackles monocular semantic scene completion with a two-stage, triplane-guided design that keeps the voxel grid and adds Gaussian-style support instead of maintaining a separate Gaussian set. Gaussian Anchoring tightens voxel-image alignment for occupancy estimation, and a Gaussian-Triplane Refinement module refines triplane features through local gathering and global aggregation. Results are reported on SemanticKITTI only, and the comparison omits Gaussian baselines such as GaussianFormer.

We present GaussianSSC, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce Gaussian Anchoring, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel-image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned Gaussian-Triplane Refinement module that combines local gathering (target-centric) and global aggregation (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI, GaussianSSC improves Stage 1 occupancy by +1.0% Recall, +2.0% Precision, and +1.8% IoU over state-of-the-art baselines, and improves Stage 2 semantic prediction by +1.8% IoU and +0.8% mIoU.

Architecture for Multi-Unmanned Aerial Vehicles based Autonomous Precision Agriculture Systems

cs.RO cs.LG cs.MA Ebasa Temesgen, Nathnael Minyelshowa, Lebsework Negash · Mar 22, 2026

This paper proposes a multi-UAV architecture for autonomous precision agriculture that combines centralized mission planning with decentralized execution control. It integrates coverage path planning, battery-aware task allocation, CNN-based image processing, and battery swapping stations to enable end-to-end farm monitoring. The work targets large-scale agricultural operations with minimal human intervention, claiming advantages in fault-tolerance, scalability, and user-friendliness.

The use of unmanned aerial vehicles (UAVs) in precision agriculture has seen a huge increase recently. As such, systems that aim to apply various algorithms on the field need a structured framework of abstractions. This paper defines the various tasks of the UAVs in precision agriculture and models them into an architectural framework. The presented architecture is built on the context that there will be minimal physical intervention to do the tasks defined with multiple coordinated and cooperative UAVs. Various tasks such as image processing, path planning, communication, data acquisition, and field mapping are employed in the architecture to provide an efficient system. Besides, different limitations for applying multi-UAVs in precision agriculture have been considered in designing the architecture. The architecture provides an autonomous end-to-end solution, from mission planning through data acquisition to an image processing framework, that is highly efficient and can enable farmers to comprehensively deploy UAVs onto their lands. Simulation and field tests show that the architecture offers a number of advantages that include fault-tolerance, robustness, and developer- and user-friendliness.

HyReach: Vision-Guided Hybrid Manipulator Reaching in Unseen Cluttered Environments

cs.RO cs.AI Shivani Kamtikar, Kendall Koe, Justin Wasserman et al. · Mar 22, 2026

This paper addresses robotic reaching in cluttered, unseen environments using a hybrid rigid-soft continuum manipulator. The core idea is a real-time pipeline that combines multi-view RGB reconstruction (Mast3r), open-world object detection (YOLO-World), shape-aware RRT* planning with asymmetric collision constraints, and a learned controller trained on pose-to-actuation data. If validated at scale, this could enable robots to navigate dense foliage or disaster debris where rigid arms fail and pure soft arms lack reach.

As robotic systems increasingly operate in unstructured, cluttered, and previously unseen environments, there is a growing need for manipulators that combine compliance, adaptability, and precise control. This work presents a real-time hybrid rigid-soft continuum manipulator system designed for robust open-world object reaching in such challenging environments. The system integrates vision-based perception and 3D scene reconstruction with shape-aware motion planning to generate safe trajectories. A learning-based controller drives the hybrid arm to arbitrary target poses, leveraging the flexibility of the soft segment while maintaining the precision of the rigid segment. The system operates without environment-specific retraining, enabling direct generalization to new scenes. Extensive real-world experiments demonstrate consistent reaching performance with errors below 2 cm across diverse cluttered setups, highlighting the potential of hybrid manipulators for adaptive and reliable operation in unstructured environments.

CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation

cs.CV cs.AI cs.DB Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab et al. · Mar 23, 2026

CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.

We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.

SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems

cs.RO cs.AI Weizhe Xu, Mengyu Liu, Fanxin Kong · Mar 23, 2026

SafePilot addresses a critical gap in deploying Large Language Models (LLMs) for cyber-physical systems (CPS): LLM "hallucinations" can generate plausible-sounding but unsafe plans that violate safety constraints or temporal requirements. The authors propose a hierarchical neuro-symbolic framework that combines LLM planning with formal verification—using First-Order Logic (FOL) for attribute-based constraints and Linear Temporal Logic (LTL) for temporal constraints—to ensure plans satisfy specifications before execution.

Large Language Models (LLMs), deep learning architectures with typically over 10 billion parameters, have recently begun to be integrated into various cyber-physical systems (CPS) such as robotics, industrial automation, and autopilot systems. The abstract knowledge and reasoning capabilities of LLMs are employed for tasks like planning and navigation. However, a significant challenge arises from the tendency of LLMs to produce "hallucinations" - outputs that are coherent yet factually incorrect or contextually unsuitable. This characteristic can lead to undesirable or unsafe actions in the CPS. Therefore, our research focuses on assuring the LLM-enabled CPS by enhancing their critical properties. We propose SafePilot, a novel hierarchical neuro-symbolic framework that provides end-to-end assurance for LLM-enabled CPS according to attribute-based and temporal specifications. Given a task and its specification, SafePilot first invokes a hierarchical planner with a discriminator that assesses task complexity. If the task is deemed manageable, it is passed directly to an LLM-based task planner with built-in verification. Otherwise, the hierarchical planner applies a divide-and-conquer strategy, decomposing the task into sub-tasks, each of which is individually planned and later merged into a final solution. The LLM-based task planner translates natural language constraints into formal specifications and verifies the LLM's output against them. If violations are detected, it identifies the flaw, adjusts the prompt accordingly, and re-invokes the LLM. This iterative process continues until a valid plan is produced or a predefined limit is reached. Our framework supports LLM-enabled CPS with both attribute-based and temporal constraints. Its effectiveness and adaptability are demonstrated through two illustrative case studies.
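
The verify-and-re-prompt loop described above can be sketched as follows. This is a toy illustration under stated assumptions: `plan_with_verification`, `toy_llm`, and `toy_verify` are hypothetical names, the "LLM" is a stub function, and the verifier is a hand-written ordering check standing in for the FOL/LTL verification that SafePilot actually uses.

```python
def plan_with_verification(task, llm_plan, verify, max_attempts=3):
    """Call the planner until the verifier reports no violations.

    llm_plan(prompt) -> plan            (stand-in for an LLM call)
    verify(plan)     -> list of violation messages (empty if the plan is valid)
    Returns (plan, attempts) on success or (None, attempts) at the limit.
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        plan = llm_plan(prompt)
        violations = verify(plan)
        if not violations:
            return plan, attempt
        # Feed the detected flaws back into the next prompt.
        prompt = task + "\nFix these violations: " + "; ".join(violations)
    return None, max_attempts


# Toy stand-ins: the "LLM" forgets a safety step unless told about it, and
# the "verifier" checks an ordering constraint (close the valve before heating).
def toy_llm(prompt):
    if "violations" in prompt:
        return ["close_valve", "heat"]
    return ["heat"]


def toy_verify(plan):
    if "heat" in plan and "close_valve" not in plan[: plan.index("heat")]:
        return ["close_valve must precede heat"]
    return []


plan, attempts = plan_with_verification("heat the tank", toy_llm, toy_verify)
```

In this toy run the first plan violates the ordering constraint, the violation is appended to the prompt, and the second attempt passes. Bounding the loop with `max_attempts` mirrors the abstract's "predefined limit"; what the real system does when the limit is reached is not stated in the abstract.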

A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems

cs.RO cs.AI physics.optics Seou Choi, Sachin Vaidya, Caio Silva et al. · Mar 23, 2026

Precision free-space optics demands sub-millimeter and sub-degree tolerances where traditional robotic pick-and-place fails. This work introduces a closed-loop robotics framework integrating hierarchical computer vision, Newton-based spatial optimization, and Bayesian angular optimization to autonomously construct, align, and maintain optical systems. The authors demonstrate this by building a tabletop laser cavity from randomly distributed components, achieving beam alignment, mode selection, and self-recovery without human intervention. The system bridges the gap between coarse robotic manipulation and the extreme precision required for functional optical experiments.

Robotic automation has transformed scientific workflows in domains such as chemistry and materials science, yet free-space optics, which is a high precision domain, remains largely manual. Optical systems impose strict spatial and angular tolerances, and their performance is governed by tightly coupled physical parameters, making generalizable automation particularly challenging. In this work, we present a robotics framework for the autonomous construction, alignment, and maintenance of precision optical systems. Our approach integrates hierarchical computer vision systems, optimization routines, and custom-built tools to achieve this functionality. As a representative demonstration, we perform the fully autonomous construction of a tabletop laser cavity from randomly distributed components. The system performs several tasks such as laser beam centering, spatial alignment of multiple beams, resonator alignment, laser mode selection, and self-recovery from induced misalignment and disturbances. By achieving closed-loop autonomy for highly sensitive optical systems, this work establishes a foundation for autonomous optical experiments for applications across technical domains.

Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection

cs.RO cs.AI Junhyeok Rui Cha, Woohyun Cha, Jaeyong Shin et al. · Mar 23, 2026

This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a broader spectrum of reality gaps than standard parameter randomization without requiring additional training. By using neural networks as flexible perturbation generators, the proposed method can represent complex, state-dependent uncertainties, such as nonlinear actuator dynamics and contact compliance, that parametric randomization cannot capture. Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.
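The idea of injecting a state-dependent perturbation into the commanded joint torque can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not specify the network architecture, how its weights are chosen, or how the perturbation is bounded. `make_perturbation_net`, `inject`, and `max_delta` are hypothetical names, and the network is a tiny randomly initialized two-layer tanh MLP written in plain Python.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def make_perturbation_net(
    state_dim: int, n_joints: int, hidden: int = 8, seed: int = 0
) -> Callable[[Vector], Vector]:
    """Build a small random MLP mapping a state to one value in [-1, 1] per joint."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(state_dim)) for _ in range(hidden)]
          for _ in range(state_dim)]
    w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(n_joints)]
          for _ in range(hidden)]

    def net(state: Vector) -> Vector:
        h = [math.tanh(sum(s * w1[i][j] for i, s in enumerate(state)))
             for j in range(hidden)]
        return [math.tanh(sum(hj * w2[j][k] for j, hj in enumerate(h)))
                for k in range(n_joints)]

    return net


def inject(tau: Vector, state: Vector, net: Callable[[Vector], Vector],
           max_delta: float = 5.0) -> Vector:
    """Add a bounded, state-dependent perturbation to the commanded torques."""
    delta = net(state)
    return [t + max_delta * d for t, d in zip(tau, delta)]


net = make_perturbation_net(state_dim=4, n_joints=2)
tau_cmd = [10.0, -3.0]
tau_rest = inject(tau_cmd, [0.0, 0.0, 0.0, 0.0], net)
tau_moving = inject(tau_cmd, [0.5, -1.0, 2.0, 0.1], net)
```

In a training loop, the perturbed torque would be passed to the simulator in place of the policy's commanded torque at each forward step. Because the final `tanh` keeps each output in [-1, 1], the injected perturbation never exceeds `max_delta` in magnitude, and resampling `seed` yields a different perturbation function.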
