HyReach: Vision-Guided Hybrid Manipulator Reaching in Unseen Cluttered Environments

cs.RO cs.AI Shivani Kamtikar, Kendall Koe, Justin Wasserman, Samhita Marri, Benjamin Walt, Naveen Kumar Uppalapati, Girish Krishnan, Girish Chowdhary · Mar 22, 2026
AI Review
Plain-language introduction

This paper addresses robotic reaching in cluttered, unseen environments using a hybrid rigid-soft continuum manipulator. The core idea is a real-time pipeline that combines multi-view RGB reconstruction (Mast3r), open-world object detection (YOLO-World), shape-aware RRT* planning with asymmetric collision constraints, and a learned controller trained on pose-to-actuation data. If validated at scale, this could enable robots to navigate dense foliage or disaster debris where rigid arms fail and pure soft arms lack reach.

Critical review
Verdict
Bottom line

The paper presents a technically sound modular framework and demonstrates real hardware performance. However, the experimental validation is limited to 11 trials per environment, which is too few to support robust statistical claims about "consistent" sub-2 cm performance. The success metrics are also counterintuitive. SR@2cm adds a line-of-sight requirement, which suggests it should be the stricter criterion. Yet Table I reports 90.9% SR@2cm versus 72.7% SR@Touch in Clutter. The 2 cm tolerance can count trials in which the end effector never touches the object, so headline SR@2cm figures may mask control inaccuracies.

“11 trials are performed for each setup”
Table I · Section IV
“SR@2cm: The trial is a success if the end effector reaches within 2 cm of the query object and the object is in the line of sight... SR@Touch: The trial is a success if the end effector touches the object”
Section IV-A · Metrics
What holds up

The ablation studies convincingly demonstrate that shape-informed planning is critical for safety in cluttered environments. Without shape estimation, success rates drop substantially (e.g., from 75% to 45.5% in Obstacles). This supports the claim that backbone-aware collision checking matters for soft manipulators. The comparison against a rigid-only baseline also illustrates the value of the compliant segment. In the Hole environment the rigid arm succeeds in only 9.1% of trials, versus 27.3–54.5% for the hybrid.

“In the Obstacles setup, lack of shape estimation led to frequent, sometimes severe, collisions with obstacles, reducing the success rate to 45.5% as compared to our method, which gave 75%”
Table III · Section IV-B
“Rigid only... 9.1... Ours... 27.3”
Table I · Hole environment
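A quick significance check puts the rigid-only comparison in perspective. The sketch below implements a two-sided Fisher exact test using only the standard library. It assumes the Hole-environment percentages correspond to 1/11 successes for the rigid-only arm. For the hybrid it takes 3/11 and 6/11, the two ends of the reported range. This test is an illustration chosen here, not an analysis reported by the authors.

```python
from math import comb


def fisher_exact_two_sided(a: int, n1: int, c: int, n2: int) -> float:
    """Two-sided Fisher exact p-value: a/n1 successes in group 1 vs c/n2 in group 2.

    Sums the hypergeometric probabilities of every 2x2 table with the same margins
    that is no more likely than the observed one. Integer weights avoid float ties.
    """
    total_success = a + c
    denom = comb(n1 + n2, total_success)
    weights = {
        x: comb(n1, x) * comb(n2, total_success - x)
        for x in range(max(0, total_success - n2), min(n1, total_success) + 1)
    }
    observed = weights[a]
    return sum(w for w in weights.values() if w <= observed) / denom


# Assumption: rigid only = 1/11; hybrid = 3/11 (27.3%) or 6/11 (54.5%), Hole setup.
for hybrid_successes in (3, 6):
    p = fisher_exact_two_sided(1, 11, hybrid_successes, 11)
    print(f"rigid 1/11 vs hybrid {hybrid_successes}/11: two-sided p = {p:.3f}")
```

Under those assumptions the p-values come out near 0.59 (1/11 vs 3/11) and 0.063 (1/11 vs 6/11). Even the largest reported gap sits just above the conventional 0.05 threshold at this sample size.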
Main concerns

First, the sample size (11 trials) is too small to support claims about "consistent" performance across "diverse" setups. Second, the controller is trained on 9,536 systematically collected samples, but there is no analysis of data efficiency or robustness to out-of-distribution poses. Third, the collision threshold $\tau$ is treated as a tunable hyperparameter without principled selection criteria per environment (Table IV only shows aggregate results across all scenes). Fourth, the system assumes static environments; dynamic obstacles would break the reconstruction-planning pipeline. Finally, the claim of operating "without environment-specific retraining" elides that the controller is fixed to this specific hardware and cannot generalize to different soft arm morphologies without new data collection.

“systematically incrementing each actuation dimension... yielding 9536 data points”
Section III-C · Data Collection
“Effect of collision threshold on goal reaching... Strict thresholds ($\tau=0$) yield safer but less feasible paths”
Table IV · Section IV
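The role of $\tau$ is easier to discuss with a concrete collision check in mind. The sketch below is hypothetical and rests on four assumptions:

- the CC model is the standard constant-curvature arc;
- the scene is represented as a set of occupied voxels;
- "asymmetric" means the rigid links may never enter occupied space;
- the soft segment may have at most $\tau$ backbone samples inside occupied voxels.

None of `cc_backbone`, `voxel_of`, `is_config_valid`, the voxel resolution or the sample count `n` comes from the paper. It is also not clear from the quoted excerpt whether the paper's N = 50 plays the role of `n`.

```python
import math

Point = tuple[float, float, float]
Voxel = tuple[int, int, int]


def cc_backbone(kappa: float, phi: float, length: float, n: int,
                base: Point = (0.0, 0.0, 0.0)) -> list[Point]:
    """Sample n points along a constant-curvature arc starting at `base`, tangent to +z.

    kappa: curvature (1/m); phi: bending-plane angle (rad); length: arc length (m).
    Hypothetical simplification: the soft segment's base frame is axis-aligned with
    the world frame, so only a translation by `base` is applied.
    """
    if n < 2:
        raise ValueError("need at least two backbone samples")
    bx, by, bz = base
    pts: list[Point] = []
    for i in range(n):
        s = length * i / (n - 1)
        if abs(kappa) < 1e-9:  # zero-curvature limit: straight segment along +z
            x = y = 0.0
            z = s
        else:
            r = (1.0 - math.cos(kappa * s)) / kappa  # radial offset in the bending plane
            x, y = r * math.cos(phi), r * math.sin(phi)
            z = math.sin(kappa * s) / kappa
        pts.append((bx + x, by + y, bz + z))
    return pts


def voxel_of(p: Point, res: float) -> Voxel:
    """Index of the occupancy-grid cell (edge length `res`) containing point p."""
    return (math.floor(p[0] / res), math.floor(p[1] / res), math.floor(p[2] / res))


def is_config_valid(rigid_pts: list[Point], soft_pts: list[Point],
                    occupied: set[Voxel], res: float, tau: int) -> bool:
    """Hypothetical asymmetric check: rigid-link samples must all be collision-free,
    while at most `tau` soft-segment samples may fall inside occupied voxels."""
    if any(voxel_of(p, res) in occupied for p in rigid_pts):
        return False
    soft_hits = sum(1 for p in soft_pts if voxel_of(p, res) in occupied)
    return soft_hits <= tau


# Usage: a thin obstacle column on 1 cm voxels, with the soft segment starting at z = 0.3 m.
occupied = {(0, 0, k) for k in range(35, 45)}
rigid = [(0.0, 0.0, 0.1 * i) for i in range(3)]  # rigid-link samples below the obstacle
straight = cc_backbone(0.0, 0.0, 0.2, 21, base=(0.0, 0.0, 0.3))
bent = cc_backbone(10.0, 0.0, 0.2, 21, base=(0.0, 0.0, 0.3))
print(is_config_valid(rigid, straight, occupied, 0.01, tau=0))  # False: passes through the column
print(is_config_valid(rigid, bent, occupied, 0.01, tau=0))  # True: curves away from it
```

Under this reading, $\tau=0$ rejects any soft-segment contact, which is consistent with the quoted trade-off (safer but less feasible paths). The value of $\tau$ directly sets how much grazing of clutter the planner tolerates. That is why a per-environment selection rule matters.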
Evidence and comparison

The comparison to Img2Act fairly shows that end-to-end visual servoing fails when obstacles block line-of-sight. However, the two methods are not on equal footing. Img2Act was trained on images collected in each experimental scene, whereas HyReach relies on pre-trained perception modules. The authors appropriately acknowledge that higher-fidelity shape estimation methods exist, and that their own approximate model trades some accuracy for real-time speed. However, the paper lacks comparison against simpler baselines, such as a rigid arm with standard MoveIt planning or point-cloud-based methods using depth cameras. Such baselines would clarify whether the gains come from the hybrid hardware or from the specific planning approach.

“Img2Act... has robust performance in structured environments... It was trained by collecting images in each experimental scene”
Section IV-A · Baselines
“Although higher-fidelity shape estimation methods exist [43, 11], our approximate model provides real-time performance with substantial gains in safety”
Section IV-B · Result 3
Reproducibility

Reproduction is severely hampered by the lack of released code, datasets, or trained model weights. The hardware setup (a custom B3 soft segment, with magnetic tracking used for data collection) is specialized and not commercially available. Hyperparameters such as the MLP hidden size (15,000) and learning rate ($1\times 10^{-4}$) are provided. Critical engineering details are absent, however: the specific CC model implementation, the occupancy grid resolution and the exact heuristic used for path shortcutting. The reliance on specific versions of Mast3r and YOLO-World provides a partial anchor for reproducibility. Without the data collection protocol or controller training scripts, though, independent verification is nearly impossible.

“hidden size of 15,000... learning rate of $1\times 10^{-4}$”
Section III-C · Training
“N was set to 50... recomputed for each steering step”
Section III-B · Path Planning
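One way to make the reproducibility gaps explicit is a configuration record that pins the reported values and marks the missing ones. The sketch below is a hypothetical checklist; the class and field names are illustrative and do not come from any released code. It encodes only values quoted in this review and uses `None` for details the review identifies as unspecified.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HyReachReproConfig:
    """Hypothetical reproduction checklist; None marks a detail reported as unspecified."""
    # Values quoted from Sections III-B and III-C
    mlp_hidden_size: int = 15_000
    learning_rate: float = 1e-4
    num_training_samples: int = 9_536
    planner_n: int = 50  # "N was set to 50"; its exact role is not restated in the review
    # Details the review flags as absent
    cc_model_variant: Optional[str] = None
    occupancy_grid_resolution_m: Optional[float] = None
    shortcut_heuristic: Optional[str] = None
    tau_selection_rule: Optional[str] = None  # per-environment choice of tau


def missing_details(cfg: HyReachReproConfig) -> list[str]:
    """Names of fields a reproduction would still have to choose independently."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


print(missing_details(HyReachReproConfig()))
```

The printed list contains the four unspecified fields. A reproduction would have to choose, and report, each of them independently.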
Abstract

As robotic systems increasingly operate in unstructured, cluttered, and previously unseen environments, there is a growing need for manipulators that combine compliance, adaptability, and precise control. This work presents a real-time hybrid rigid-soft continuum manipulator system designed for robust open-world object reaching in such challenging environments. The system integrates vision-based perception and 3D scene reconstruction with shape-aware motion planning to generate safe trajectories. A learning-based controller drives the hybrid arm to arbitrary target poses, leveraging the flexibility of the soft segment while maintaining the precision of the rigid segment. The system operates without environment-specific retraining, enabling direct generalization to new scenes. Extensive real-world experiments demonstrate consistent reaching performance with errors below 2 cm across diverse cluttered setups, highlighting the potential of hybrid manipulators for adaptive and reliable operation in unstructured environments.
