Vision-Language-Action models excel at direct visuomotor mapping but struggle with tasks requiring both fine-grained 3D spatial understanding and long-horizon logical planning. DualCoT-VLA proposes a parallel dual-stream reasoning mechanism that processes visual Chain-of-Thought for spatial perception and linguistic Chain-of-Thought for task planning simultaneously in latent space, using learnable query tokens to bypass autoregressive decoding and achieve single-step inference.
This paper tackles robotic optical coherence tomography (OCT) scanning of curved tissue surfaces, addressing the limitation that existing approaches restrict motion to pure translations to avoid challenging hand-eye calibration. The core contribution is a custom ChArUco calibration pattern enabling full six-degree-of-freedom hand-eye calibration, allowing the OCT probe to rotate and follow curved surfaces. This matters because pure translational scanning accumulates registration errors on curved geometries, whereas full 6D motion enables accurate, large-area surface reconstruction.
Articulated object reconstruction typically requires either multi-view capture of discrete states or monocular video with a strict static-base-part assumption, limiting practical deployment. FreeArtGS introduces a "free-moving" setting where both joint angles and object poses vary arbitrarily during capture, using only a monocular RGB-D video. The method combines motion-based part segmentation via point tracking priors with joint estimation and 3D Gaussian Splatting optimization to jointly reconstruct geometry, appearance, and articulation.
Cardiac ultrasound view acquisition is notoriously operator-dependent, limiting reproducibility and access. This paper proposes an anatomical prior (AP)-driven framework that unifies cardiac structure segmentation with autonomous probe adjustment. The core innovation is a spatial-relation graph (SRG) module that injects spatial-topological constraints into YOLO-based segmentation, coupled with an RL formulation where states and rewards are built from quantifiable anatomical features drawn from Gaussian priors. The work matters because it offers an interpretable alternative to black-box end-to-end methods, potentially enabling zero-shot sim-to-real deployment for robotic echocardiography.
Soft robot simulators suffer from a sim-to-real gap that widens when optimizing morphology, because calibration parameters identified on one geometry often fail to transfer to unseen shapes. This paper proposes Residual Acceleration Field Learning (RAFL), which learns local corrective accelerations defined on quadrature elements rather than global nodal forces. By operating on deformation and velocity gradients in material space, the model becomes independent of mesh topology and discretization, enabling zero-shot generalization across geometries.
PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.
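To make the idea of a dense progress signal concrete, here is a minimal sketch that turns a per-step progress trace $\Phi(x_t) \in [0,1]$ into a few summary statistics. The function name, the statistics, and the `success_threshold` value are assumptions chosen for illustration. They are not the paper's OPD definitions, which the summary above does not spell out.

```python
def progress_diagnostics(phi, success_threshold=0.95):
    """Summarize a progress trace phi[t] in [0, 1] (illustrative statistics only).

    `success_threshold` is a hypothetical cutoff, not a value from the paper.
    """
    if not phi:
        raise ValueError("progress trace must be non-empty")
    if any(p < 0.0 or p > 1.0 for p in phi):
        raise ValueError("progress values must lie in [0, 1]")

    # Step-to-step change in progress; negative values are regressions.
    deltas = [later - earlier for earlier, later in zip(phi, phi[1:])]

    return {
        "final": phi[-1],                                 # outcome-style score
        "peak": max(phi),                                 # best progress reached
        "regression": sum(-d for d in deltas if d < 0),   # total backsliding
        "success": phi[-1] >= success_threshold,          # binary outcome
    }
```

For the trace `[0.0, 0.5, 0.25, 1.0]` this reports a successful episode with a total regression of 0.25, a detail that a binary success flag alone would hide.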
ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.
Existing safety-critical scenario generation methods force collisions through brute-force perturbations, destroying trajectory realism. CounterScene reframes this as a counterfactual inference problem within diffusion-based BEV world models: given a safe scene, identify the single agent whose behavioral change would maximally increase collision risk, then minimally intervene on that agent alone via structured diffusion guidance. This targets the realism-adversarial trade-off by allowing danger to emerge through natural interaction propagation rather than global trajectory distortion.
OrbitStream addresses adaptive 360° video streaming for teleoperation by proposing a training-free framework that combines semantic scene understanding with robust control theory. It formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem where semantic objects (pedestrians, vehicles) generate potential fields that "attract" user gaze with task-relevant mass, while a Saturation-Based Proportional-Derivative (PD) Controller handles bitrate adaptation. This offers an interpretable, zero-shot alternative to black-box Deep Reinforcement Learning methods for safety-critical systems where deployment constraints prohibit lengthy training.
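The two components described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not OrbitStream's implementation: the softened inverse-square law, the `softening` constant, the gains `kp` and `kd`, and the output `limit` are all hypothetical choices, and each object's "mass" is simply a task-relevance weight supplied by the caller.

```python
def wrap_deg(angle):
    """Wrap an angle difference into [-180, 180) degrees."""
    return (angle + 180.0) % 360.0 - 180.0


def gravitational_pull(gaze, objects, softening=5.0):
    """Net attraction on the gaze point from semantic objects.

    gaze:    (yaw, pitch) in degrees.
    objects: iterable of (yaw, pitch, mass); mass is a hypothetical
             task-relevance weight.
    Returns (pull_yaw, pull_pitch) in arbitrary units.
    """
    pull_yaw = 0.0
    pull_pitch = 0.0
    for obj_yaw, obj_pitch, mass in objects:
        d_yaw = wrap_deg(obj_yaw - gaze[0])   # shortest way around the sphere
        d_pitch = obj_pitch - gaze[1]
        # Softened squared distance avoids a singularity at zero separation.
        dist_sq = d_yaw ** 2 + d_pitch ** 2 + softening ** 2
        scale = mass / dist_sq ** 1.5         # inverse-square magnitude * direction
        pull_yaw += scale * d_yaw
        pull_pitch += scale * d_pitch
    return pull_yaw, pull_pitch


def saturated_pd(error, prev_error, dt, kp=0.5, kd=0.1, limit=1.0):
    """PD control signal clamped to [-limit, limit]."""
    raw = kp * error + kd * (error - prev_error) / dt
    return max(-limit, min(limit, raw))
```

In a streaming loop, the pull vector would nudge the predicted viewport toward high-mass objects, while the clamped PD output would bound how aggressively the bitrate is changed per step. How the real system maps these signals to tiles and bitrates is not described in the summary.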
Sections:
1. Verdict: Overall assessment - solid incremental contribution, hybrid approach is interesting, results are good but limited scope.
2. What holds up: Gaussian anchoring mechanism, two-stage design, ablation studies showing component effectiveness.
3. Main concerns: Single-frame limitation, dataset limitation (only SemanticKITTI), missing comparison with GaussianFormer, efficiency trade-offs not fully characterized, limited discussion of failure modes.
4. Evidence and comparison: Fair comparison with ETFormer/VoxFormer using same backbone, but missing key Gaussian baselines; ablations validate design choices; qualitative results show improvements.
5. Reproducibility: Good implementation details provided, standard dataset, but no code release mentioned; hyperparameters mostly specified.
This paper proposes a multi-UAV architecture for autonomous precision agriculture that combines centralized mission planning with decentralized execution control. It integrates coverage path planning, battery-aware task allocation, CNN-based image processing, and battery swapping stations to enable end-to-end farm monitoring. The work targets large-scale agricultural operations with minimal human intervention, claiming advantages in fault-tolerance, scalability, and user-friendliness.
This paper addresses robotic reaching in cluttered, unseen environments using a hybrid rigid-soft continuum manipulator. The core idea is a real-time pipeline that combines multi-view RGB reconstruction (Mast3r), open-world object detection (YOLO-World), shape-aware RRT* planning with asymmetric collision constraints, and a learned controller trained on pose-to-actuation data. If validated at scale, this could enable robots to navigate dense foliage or disaster debris where rigid arms fail and pure soft arms lack reach.
CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.
SafePilot addresses a critical gap in deploying Large Language Models (LLMs) for cyber-physical systems (CPS): LLM "hallucinations" can generate plausible-sounding but unsafe plans that violate safety constraints or temporal requirements. The authors propose a hierarchical neuro-symbolic framework that combines LLM planning with formal verification—using First-Order Logic (FOL) for attribute-based constraints and Linear Temporal Logic (LTL) for temporal constraints—to ensure plans satisfy specifications before execution.
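The verify-before-execute idea can be illustrated with a small finite-trace check. This is a sketch under heavy assumptions, not SafePilot's verifier: it hard-codes two common temporal patterns (an action that must never occur, and an action that must be preceded by another) rather than parsing FOL or LTL formulas, and every action and constraint name is hypothetical.

```python
def never(action):
    """Safety pattern: `action` must not appear anywhere in the plan."""
    return lambda plan: action not in plan


def precedes(first, then):
    """Precedence pattern: every `then` must come after some `first`."""
    def check(plan):
        seen_first = False
        for step in plan:
            if step == first:
                seen_first = True
            elif step == then and not seen_first:
                return False
        return True
    return check


def verify_plan(plan, constraints):
    """Return the names of violated constraints (empty list means the plan passes).

    plan:        list of action names proposed by the planner.
    constraints: dict mapping a constraint name to a predicate over the plan.
    """
    return [name for name, check in constraints.items() if not check(plan)]


if __name__ == "__main__":
    rules = {
        "no_ignite": never("ignite"),
        "grasp_before_lift": precedes("grasp", "lift"),
    }
    print(verify_plan(["lift", "grasp", "place"], rules))
```

A planner loop built on this would pass the list of violated names back to the LLM as feedback and request a revised plan, executing only once the list is empty.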
Precision free-space optics demands sub-millimeter and sub-degree tolerances where traditional robotic pick-and-place fails. This work introduces a closed-loop robotics framework integrating hierarchical computer vision, Newton-based spatial optimization, and Bayesian angular optimization to autonomously construct, align, and maintain optical systems. The authors demonstrate this by building a tabletop laser cavity from randomly distributed components—achieving beam alignment, mode selection, and self-recovery without human intervention. The system bridges the gap between coarse robotic manipulation and the extreme precision required for functional optical experiments.