Robotics 45
☆ Slot-Level Robotic Placement via Visual Imitation from Single Human Video
The majority of modern robot learning methods focus on learning a set of
pre-defined tasks with limited or no generalization to new tasks. Extending the
robot skillset to novel tasks involves gathering an extensive amount of
training data for additional tasks. In this paper, we address the problem of
teaching new tasks to robots using human demonstration videos for repetitive
tasks (e.g., packing). This task requires understanding the human video to
identify which object is being manipulated (the pick object) and where it is
being placed (the placement slot). In addition, it needs to re-identify the
pick object and the placement slots during inference along with the relative
poses to enable robot execution of the task. To tackle this, we propose SLeRP,
a modular system that leverages several advanced visual foundation models and a
novel slot-level placement detector Slot-Net, eliminating the need for
expensive video demonstrations for training. We evaluate our system using a new
benchmark of real-world videos. The evaluation results show that SLeRP
outperforms several baselines and can be deployed on a real robot.
☆ Strengthening Multi-Robot Systems for SAR: Co-Designing Robotics and Communication Towards 6G
This paper presents field-tested use cases from Search and Rescue (SAR)
missions, highlighting the co-design of mobile robots and communication systems
to support Edge-Cloud architectures based on 5G Standalone (SA). The main goal
is to contribute to the effective cooperation of multiple robots and first
responders. Our field experience includes the development of Hybrid Wireless
Sensor Networks (H-WSNs) for risk and victim detection, smartphones integrated
into the Robot Operating System (ROS) as Edge devices for mission requests and
path planning, real-time Simultaneous Localization and Mapping (SLAM) via
Multi-Access Edge Computing (MEC), and implementation of Uncrewed Ground
Vehicles (UGVs) for victim evacuation in different navigation modes. These
experiments, conducted in collaboration with actual first responders,
underscore the need for intelligent network resource management, balancing
low-latency and high-bandwidth demands. Network slicing is key to ensuring
critical emergency services are performed despite challenging communication
conditions. The paper identifies architectural needs, lessons learned, and
challenges to be addressed by 6G technologies to enhance emergency response
capabilities.
comment: 8 pages, 6 figures, submitted to IEEE Communication Society (Special
Issue: Empowering Robotics with 6G: Connectivity, Intelligence, and Beyond)
☆ Overcoming Deceptiveness in Fitness Optimization with Unsupervised Quality-Diversity
Policy optimization seeks the best solution to a control problem according to
an objective or fitness function, serving as a fundamental field of engineering
and research with applications in robotics. Traditional optimization methods
like reinforcement learning and evolutionary algorithms struggle with deceptive
fitness landscapes, where following immediate improvements leads to suboptimal
solutions. Quality-diversity (QD) algorithms offer a promising approach by
maintaining diverse intermediate solutions as stepping stones for escaping
local optima. However, QD algorithms require domain expertise to define
hand-crafted features, limiting their applicability where characterizing
solution diversity remains unclear. In this paper, we show that unsupervised QD
algorithms - specifically the AURORA framework, which learns features from
sensory data - efficiently solve deceptive optimization problems without domain
expertise. By enhancing AURORA with contrastive learning and periodic
extinction events, we propose AURORA-XCon, which outperforms all traditional
optimization baselines and matches the best QD baseline with domain-specific
hand-crafted features, in some cases improving on it by up to 34%. This work
establishes a novel application of unsupervised QD algorithms, shifting their
focus from discovering novel solutions toward traditional optimization and
expanding their potential to domains where defining feature spaces poses
challenges.
☆ Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
The rapid development of Large Multimodal Models (LMMs) for 2D images and
videos has spurred efforts to adapt these models for interpreting 3D scenes.
However, the absence of large-scale 3D vision-language datasets has posed a
significant obstacle. To address this issue, typical approaches focus on
injecting 3D awareness into 2D LMMs by designing 3D input-level scene
representations. This work provides a new perspective. We introduce
reconstructive visual instruction tuning with 3D-awareness (Ross3D), which
integrates 3D-aware visual supervision into the training procedure.
Specifically, it incorporates cross-view and global-view reconstruction. The
former requires reconstructing masked views by aggregating overlapping
information from other views. The latter aims to aggregate information from all
available views to recover Bird's-Eye-View images, contributing to a
comprehensive overview of the entire scene. Empirically, Ross3D achieves
state-of-the-art performance across various 3D scene understanding benchmarks.
More importantly, our semi-supervised experiments demonstrate significant
potential in leveraging large amounts of unlabeled 3D vision-only data.
☆ A novel gesture interaction control method for rehabilitation lower extremity exoskeleton
With the rapid development of Rehabilitation Lower Extremity Robotic
Exoskeletons (RLEEX) technology, significant advancements have been made in
Human-Robot Interaction (HRI) methods. These include traditional physical HRI
methods that are easily recognizable and various bio-electrical signal-based
HRI methods that can visualize and predict actions. However, most of these HRI
methods are contact-based, facing challenges such as operational complexity,
sensitivity to interference, risks associated with implantable devices, and,
most importantly, limitations in comfort. These challenges render the
interaction less intuitive and natural, which can negatively impact patient
motivation for rehabilitation. To address these issues, this paper proposes a
novel non-contact gesture interaction control method for RLEEX, based on RGB
monocular camera depth estimation. This method integrates three key steps:
detecting keypoints, recognizing gestures, and assessing distance, thereby
applying gesture information and augmented reality triggering technology to
control gait movements of RLEEX. Results indicate that this approach provides a
feasible solution to the problems of poor comfort, low reliability, and high
latency in HRI for RLEEX platforms. Specifically, it achieves a
gesture-controlled exoskeleton motion accuracy of 94.11% and an average system
response time of 0.615 seconds through non-contact HRI. The proposed
non-contact HRI method represents a pioneering advancement in control
interactions for RLEEX, paving the way for further exploration and development
in this field.
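The final step of the pipeline, turning a recognized gesture plus an estimated user distance into a gait command, can be sketched as a simple gating rule. The gesture names, thresholds, and command strings below are illustrative assumptions, not the paper's actual interface:

```python
def gesture_command(gesture, confidence, distance_m,
                    min_conf=0.9, min_d=0.5, max_d=2.0):
    """Gate a recognized gesture into an exoskeleton gait command.
    Gesture names, thresholds, and command strings are hypothetical."""
    # Reject low-confidence recognitions and out-of-range users.
    if confidence < min_conf or not (min_d <= distance_m <= max_d):
        return "hold"
    # Map confident, in-range gestures to gait commands.
    return {"open_palm": "start_gait", "fist": "stop_gait"}.get(gesture, "hold")
```

Gating on both recognition confidence and distance is one plausible way to obtain the reliability the abstract reports for non-contact HRI.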
☆ Corner-Grasp: Multi-Action Grasp Detection and Active Gripper Adaptation for Grasping in Cluttered Environments
Robotic grasping is an essential capability, playing a critical role in
enabling robots to physically interact with their surroundings. Despite
extensive research, challenges remain due to the diverse shapes and properties
of target objects, inaccuracies in sensing, and potential collisions with the
environment. In this work, we propose a method for effectively grasping in
cluttered bin-picking environments where these challenges intersect. We utilize
a multi-functional gripper that combines both suction and finger grasping to
handle a wide range of objects. We also present an active gripper adaptation
strategy to minimize collisions between the gripper hardware and the
surrounding environment by actively leveraging the reciprocating suction cup
and reconfigurable finger motion. To fully utilize the gripper's capabilities,
we built a neural network that detects suction and finger grasp points from a
single input RGB-D image. This network is trained using a large-scale
synthetic dataset generated from simulation. In addition to this, we propose an
efficient approach to constructing a real-world dataset that facilitates grasp
point detection on various objects with diverse characteristics. Experimental
results show that the proposed method can grasp objects in cluttered
bin-picking scenarios and prevent collisions with environmental constraints
such as a corner of the bin. Our proposed method demonstrated its effectiveness
in the 9th Robotic Grasping and Manipulation Competition (RGMC) held at ICRA
2024.
comment: 11 pages, 14 figures
☆ Virtual Target Trajectory Prediction for Stochastic Targets
Trajectory prediction of other vehicles is crucial for autonomous vehicles,
with applications from missile guidance to UAV collision avoidance. Typically,
target trajectories are assumed deterministic, but real-world aerial vehicles
exhibit stochastic behavior, such as evasive maneuvers or gliders circling in
thermals. This paper uses Conditional Normalizing Flows, an unsupervised
Machine Learning technique, to learn and predict the stochastic behavior of
targets of guided missiles using trajectory data. The trained model predicts
the distribution of future target positions based on initial conditions and
parameters of the dynamics. Samples from this distribution are clustered using
a time series k-means algorithm to generate representative trajectories, termed
virtual targets. The method is fast and target-agnostic, requiring only
training data in the form of target trajectories. Thus, it serves as a drop-in
replacement for deterministic trajectory predictions in guidance laws and path
planning. Simulated scenarios demonstrate the approach's effectiveness for
aerial vehicles with random maneuvers, bridging the gap between deterministic
predictions and stochastic reality, advancing guidance and control algorithms
for autonomous vehicles.
comment: will be submitted to Journal of Guidance, Control, and Dynamics
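The sample-then-cluster procedure can be sketched in plain Python. The random-walk sampler below is only a stand-in for the paper's Conditional Normalizing Flow, and the naive whole-trajectory k-means stands in for the time-series k-means; all parameters are illustrative:

```python
import math
import random

def sample_trajectories(n, steps, speed=1.0, turn_sigma=0.2, seed=0):
    """Stand-in sampler: 2D tracks with random heading changes. The paper
    instead draws samples from a trained Conditional Normalizing Flow."""
    rng = random.Random(seed)
    trajs = []
    for _ in range(n):
        x = y = heading = 0.0
        traj = []
        for _ in range(steps):
            heading += rng.gauss(0.0, turn_sigma)  # random evasive turning
            x += speed * math.cos(heading)
            y += speed * math.sin(heading)
            traj.append((x, y))
        trajs.append(traj)
    return trajs

def _dist(a, b):
    # Squared Euclidean distance between two whole trajectories.
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b))

def _mean(cluster):
    steps = len(cluster[0])
    return [(sum(t[s][0] for t in cluster) / len(cluster),
             sum(t[s][1] for t in cluster) / len(cluster))
            for s in range(steps)]

def kmeans_trajectories(trajs, k, iters=20):
    """Naive k-means over whole trajectories; the centroids play the role
    of the representative 'virtual targets'."""
    centroids = trajs[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for t in trajs:
            clusters[min(range(k), key=lambda i: _dist(t, centroids[i]))].append(t)
        centroids = [_mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids
```

Each centroid is itself a trajectory, so it can be handed to a deterministic guidance law unchanged, which is the drop-in property the abstract emphasizes.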
☆ Quattro: Transformer-Accelerated Iterative Linear Quadratic Regulator Framework for Fast Trajectory Optimization
Real-time optimal control remains a fundamental challenge in robotics,
especially for nonlinear systems with stringent performance requirements. As
one of the representative trajectory optimization algorithms, the iterative
Linear Quadratic Regulator (iLQR) faces limitations due to its inherently
sequential computational nature, which restricts the efficiency and
applicability of real-time control for robotic systems. While existing parallel
implementations aim to overcome the above limitations, they typically demand
additional computational iterations and high-performance hardware, leading to
only modest practical improvements. In this paper, we introduce Quattro, a
transformer-accelerated iLQR framework employing an algorithm-hardware
co-design strategy to predict intermediate feedback and feedforward matrices.
It facilitates effective parallel computations on resource-constrained devices
without sacrificing accuracy. Experiments on cart-pole and quadrotor systems
show an algorithm-level acceleration of up to 5.3$\times$ and 27$\times$ per
iteration, respectively. When integrated into a Model Predictive Control (MPC)
framework, Quattro achieves overall speedups of 2.8$\times$ for the cart-pole
and 17.8$\times$ for the quadrotor compared to a baseline using traditional
iLQR. Transformer inference is deployed on FPGA to maximize performance,
achieving up to 27.3$\times$ speedup over commonly used computing devices, with
around 2 to 4$\times$ power reduction and acceptable hardware overhead.
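The sequential bottleneck that Quattro targets is visible even in the scalar LQR special case: each backward Riccati step consumes the previous step's cost-to-go, so the horizon cannot be trivially parallelized. A minimal sketch under my own notation (scalar dynamics x' = a·x + b·u, stage cost q·x² + r·u²), not the paper's formulation:

```python
def lqr_backward_pass(a, b, q, r, horizon):
    """Sequential backward Riccati recursion for a scalar LQR problem.
    Each iteration depends on the previous cost-to-go P -- the serial
    dependency chain that Quattro's transformer aims to shortcut by
    predicting intermediate feedback/feedforward terms."""
    P = q                     # terminal cost-to-go weight
    gains = []
    for _ in range(horizon):
        K = (a * b * P) / (r + b * b * P)    # feedback gain at this step
        P = q + a * a * P - K * (a * b * P)  # Riccati update
        gains.append(K)
    gains.reverse()           # gains[t] is applied at time t
    return gains
```

For an unstable system (e.g. a = 1.2, b = 1), the converged gain stabilizes the closed loop, i.e. |a − b·K| < 1.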
☆ SOLAQUA: SINTEF Ocean Large Aquaculture Robotics Dataset
This paper presents a dataset gathered with an underwater robot in a
sea-based aquaculture setting. Data was gathered from an operational fish farm
and includes data from sensors such as the Waterlinked A50 DVL, the Nortek
Nucleus 1000 DVL, Sonardyne Micro Ranger 2 USBL, Sonoptix Multibeam Sonar, mono
and stereo cameras, and vehicle sensor data such as power usage, IMU, pressure,
temperature, and more. Data acquisition is performed during both manual and
autonomous traversal of the net pen structure. The collected vision data is of
undamaged nets with some fish and marine growth presence, and it is expected
that both the research community and the aquaculture industry will benefit
greatly from the utilization of the proposed SOLAQUA dataset.
☆ Beyond Non-Expert Demonstrations: Outcome-Driven Action Constraint for Offline Reinforcement Learning
We address the challenge of offline reinforcement learning using realistic
data, specifically non-expert data collected through sub-optimal behavior
policies. Under such circumstances, the learned policy must be safe enough to
manage distribution shift while maintaining sufficient flexibility to deal
with non-expert (bad) demonstrations from offline data. To tackle this issue,
we introduce a novel method called Outcome-Driven Action Flexibility (ODAF),
which seeks to reduce reliance on the empirical action distribution of the
behavior policy, hence reducing the negative impact of those bad
demonstrations. To be specific, a new conservative reward mechanism is
developed to deal with distribution shift by evaluating actions according to
whether their outcomes meet safety requirements - remaining within the state
support area - rather than solely depending on the actions' likelihood based
on offline data. Besides theoretical justification, we provide empirical
evidence on widely used MuJoCo and various maze benchmarks, demonstrating that
our ODAF method, implemented using uncertainty quantification techniques,
effectively tolerates unseen transitions for improved "trajectory stitching,"
while enhancing the agent's ability to learn from realistic non-expert data.
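The outcome-based conservative reward idea can be sketched with an ensemble-disagreement proxy: if several dynamics models disagree about an action's next state, treat the outcome as lying outside the well-supported state region and penalize it. This is my simplification, not ODAF's actual uncertainty quantifier:

```python
def conservative_reward(reward, next_state_preds, threshold, penalty):
    """Penalize an action when an ensemble of dynamics models disagrees about
    its outcome -- a stand-in proxy for 'the predicted next state leaves the
    state-support area'. Not the paper's actual mechanism."""
    # Ensemble mean of the predicted next state.
    mean = [sum(dim) / len(dim) for dim in zip(*next_state_preds)]
    # Largest deviation of any ensemble member from the mean.
    disagreement = max(
        sum((x - m) ** 2 for x, m in zip(pred, mean)) ** 0.5
        for pred in next_state_preds)
    return reward if disagreement <= threshold else reward - penalty
```

Note that the penalty depends only on the predicted outcome, not on how likely the action was under the behavior policy, which is the key distinction the abstract draws.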
☆ TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication
As human-robot collaboration advances, natural and flexible communication
methods are essential for effective robot control. Traditional methods relying
on a single modality or rigid rules struggle with noisy or misaligned data as
well as with object descriptions that do not perfectly fit the predefined
object names (e.g. 'Pick that red object'). We introduce TransforMerger, a
transformer-based reasoning model that infers a structured action command for
robotic manipulation based on fused voice and gesture inputs. Our approach
merges multimodal data into a single unified sentence, which is then processed
by the language model. We employ probabilistic embeddings to handle uncertainty
and we integrate contextual scene understanding to resolve ambiguous references
(e.g., gestures pointing to multiple objects or vague verbal cues like "this").
We evaluate TransforMerger in simulated and real-world experiments,
demonstrating its robustness to noise, misalignment, and missing information.
Our results show that TransforMerger outperforms deterministic baselines,
especially in scenarios requiring more contextual knowledge, enabling more
robust and flexible human-robot communication. Code and datasets are available
at: http://imitrob.ciirc.cvut.cz/publications/transformerger.
comment: 8 pages, 7 figures
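The "merge multimodal data into a single unified sentence" step can be illustrated with a toy serializer. The textual format and field names below are invented for illustration, not the paper's actual encoding:

```python
def merge_modalities(voice_text, gesture_scores):
    """Serialize speech plus pointing-gesture evidence into one sentence for a
    language model. Format and field names are hypothetical placeholders."""
    # Rank candidate targets by gesture probability, highest first.
    ranked = sorted(gesture_scores.items(), key=lambda kv: -kv[1])
    gesture_part = ", ".join(f"{name} (p={p:.2f})" for name, p in ranked)
    return f'Voice: "{voice_text}". Gesture points at: {gesture_part}.'
```

Carrying the per-object probabilities into the sentence is one way to let the downstream model weigh an ambiguous pointing gesture against a vague verbal cue like "this".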
☆ Reasoning LLMs for User-Aware Multimodal Conversational Agents
Hamed Rahimi, Jeanne Cattoni, Meriem Beghili, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani
Personalization in social robotics is critical for fostering effective
human-robot interactions, yet systems often face the cold start problem, where
initial user preferences or characteristics are unavailable. This paper
proposes a novel framework called USER-LLM R1 for a user-aware conversational
agent that addresses this challenge through dynamic user profiling and model
initiation. Our approach integrates chain-of-thought (CoT) reasoning models to
iteratively infer user preferences and vision-language models (VLMs) to
initialize user profiles from multimodal inputs, enabling personalized
interactions from the first encounter. Leveraging a Retrieval-Augmented
Generation (RAG) architecture, the system dynamically refines user
representations within an inherent CoT process, ensuring contextually relevant
and adaptive responses. Evaluations on the ElderlyTech-VQA Bench demonstrate
significant improvements in ROUGE-1 (+23.2%), ROUGE-2 (+0.6%), and ROUGE-L
(+8%) F1 scores over state-of-the-art baselines, with ablation studies
underscoring the impact of reasoning model size on performance. Human
evaluations further validate the framework's efficacy, particularly for elderly
users, where tailored responses enhance engagement and trust. Ethical
considerations, including privacy preservation and bias mitigation, are
rigorously discussed and addressed to ensure responsible deployment.
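Since the reported gains are in ROUGE F1, it may help to recall how ROUGE-1 F1 is computed: clipped unigram overlap between reference and candidate, then the harmonic mean of precision and recall. A minimal version:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 between two token lists: clipped unigram overlap,
    then harmonic mean of precision and recall."""
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum((ref & cand).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern over bigrams and the longest common subsequence, respectively.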
☆ Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation
3D point cloud semantic segmentation (PCSS) is a cornerstone for
environmental perception in robotic systems and autonomous driving, enabling
precise scene understanding through point-wise classification. While
unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing
methods critically overlook the inherent vulnerability to real-world
perturbations (e.g., snow, fog, rain) and adversarial distortions. This work
first identifies two intrinsic limitations that undermine current PCSS-UDA
robustness: (a) unsupervised features overlap from unaligned boundaries in
shared-class regions and (b) feature structure erosion caused by
domain-invariant learning that suppresses target-specific patterns. To address
the proposed problems, we propose a tripartite framework consisting of: 1) a
robustness evaluation model quantifying resilience against adversarial
attack/corruption types through robustness metrics; 2) an invertible attention
alignment module (IAAM) enabling bidirectional domain mapping while preserving
discriminative structure via attention-guided overlap suppression; and 3) a
contrastive memory bank with quality-aware contrastive learning that
progressively refines pseudo-labels with feature quality for more
discriminative representations. Extensive experiments on
SynLiDAR-to-SemanticPOSS adaptation demonstrate a maximum mIoU improvement of
14.3% under adversarial attack.
comment: 8 pages, 6 figures
☆ Proposition of Affordance-Driven Environment Recognition Framework Using Symbol Networks in Large Language Models
In the quest to enable robots to coexist with humans, understanding dynamic
situations and selecting appropriate actions based on common sense and
affordances are essential. Conventional AI systems face challenges in applying
affordance, as it represents implicit knowledge derived from common sense.
However, large language models (LLMs) offer new opportunities due to their
ability to process extensive human knowledge. This study proposes a method for
automatic affordance acquisition by leveraging LLM outputs. The process
involves generating text using LLMs, reconstructing the output into a symbol
network using morphological and dependency analysis, and calculating
affordances based on network distances. Experiments using "apple" as an
example demonstrated the method's ability to extract context-dependent
affordances with high explainability. The results suggest that the proposed
symbol network, reconstructed from LLM outputs, enables robots to interpret
affordances effectively, bridging the gap between symbolized data and
human-like situational understanding.
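The "affordance from network distance" computation can be sketched on a toy symbol network: fewer hops between an object node and an action node means a stronger affordance. The example graph and the 1/(1+d) scoring below are illustrative stand-ins for the paper's construction:

```python
from collections import deque

def shortest_hops(graph, src, dst):
    """Breadth-first hop count between two nodes of an undirected symbol network."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None  # disconnected: no affordance evidence

def affordance_score(graph, obj, action):
    """Closer in the network -> stronger affordance (illustrative 1/(1+d) scale)."""
    d = shortest_hops(graph, obj, action)
    return 0.0 if d is None else 1.0 / (1.0 + d)
```

With edges apple-eat and apple-fruit-throw, "eat" scores 0.5 and "throw" 1/3, so the network itself explains why one affordance outranks the other.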
☆ LLM-mediated Dynamic Plan Generation with a Multi-Agent Approach
Planning methods with high adaptability to dynamic environments are crucial
for the development of autonomous and versatile robots. We propose a method for
leveraging a large language model (GPT-4o) to automatically generate networks
capable of adapting to dynamic environments. The proposed method collects
environmental "status," representing conditions and goals, and uses them to
generate agents. These agents are interconnected on the basis of specific
conditions, resulting in networks that combine flexibility and generality. We
conducted evaluation experiments to compare the networks automatically
generated with the proposed method with manually constructed ones, confirming
the comprehensiveness of the proposed method's networks and their higher
generality. This research marks a significant advancement toward the
development of versatile planning methods applicable to robotics, autonomous
vehicles, smart systems, and other complex environments.
☆ Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms
An active approach to fault tolerance is essential for robot swarms to
achieve long-term autonomy. Previous efforts have focused on responding to
spontaneous electro-mechanical faults and failures. However, many faults occur
gradually over time. Waiting until such faults have manifested as failures
before addressing them is both inefficient and unsustainable in a variety of
scenarios. This work argues that the principles of predictive maintenance, in
which potential faults are resolved before they hinder the operation of the
swarm, offer a promising means of achieving long-term fault tolerance. This is
a novel approach to swarm fault tolerance, which is shown to give comparable
or improved performance relative to a reactive approach in almost all cases
tested.
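The heart of a predictive-maintenance trigger is extrapolating a degradation trend to a failure threshold before the failure occurs. A least-squares sketch, where the linear-trend assumption and the threshold are mine, not the paper's:

```python
def predict_failure_time(readings, limit):
    """Fit a linear trend to a degrading sensor reading and estimate how many
    steps remain before it crosses `limit` (illustrative assumption:
    degradation is roughly linear over the fitting window)."""
    n = len(readings)
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
             / sum((x - mean_x) ** 2 for x in range(n)))
    if slope <= 0:
        return None  # no degradation trend detected
    return (limit - readings[-1]) / slope  # steps until the limit is reached
```

A swarm member whose estimated time-to-failure drops below a maintenance horizon would then be recalled or repaired before it ever manifests a failure, which is the "resolve before it hinders the swarm" principle the abstract argues for.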
☆ Building Knowledge from Interactions: An LLM-Based Architecture for Adaptive Tutoring and Social Reasoning IROS
Integrating robotics into everyday scenarios like tutoring or physical
training requires robots capable of adaptive, socially engaging, and
goal-oriented interactions. While Large Language Models show promise in
human-like communication, their standalone use is hindered by memory
constraints and contextual incoherence. This work presents a multimodal,
cognitively inspired framework that enhances LLM-based autonomous
decision-making in social and task-oriented Human-Robot Interaction.
Specifically, we develop an LLM-based agent for a robot trainer, balancing
social conversation with task guidance and goal-driven motivation. To further
enhance autonomy and personalization, we introduce a memory system for
selecting, storing and retrieving experiences, facilitating generalized
reasoning based on knowledge built across different interactions. A preliminary
HRI user study and offline experiments with a synthetic dataset validate our
approach, demonstrating the system's ability to manage complex interactions,
autonomously drive training tasks, and build and retrieve contextual memories,
advancing socially intelligent robotics.
comment: Submitted to IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS) 2025
☆ LL-Localizer: A Life-Long Localization System based on Dynamic i-Octree
This paper proposes an incremental voxel-based life-long localization method,
LL-Localizer, which enables robots to localize robustly and accurately in
multi-session mode using prior maps. Since changes in the environment are
difficult to detect from the prior map, and robots may traverse between mapped
and unmapped areas during actual operation, the map is updated when needed,
according to established strategies, via an incremental voxel map. In
addition, to ensure high real-time performance and
facilitate our map management, we utilize Dynamic i-Octree, an efficient
organization of 3D points based on Dynamic Octree to load local map and update
the map during the robot's operation. The experiments show that our system can
perform stable and accurate localization comparable to state-of-the-art LIO
systems. Even if the environment changes relative to the prior map, or the
robots traverse between mapped and unmapped areas, our system still maintains
robust and accurate localization. Our demo can be found on Bilibili
(https://www.bilibili.com/video/BV1faZHYCEkZ) and YouTube
(https://youtu.be/UWn7RCb9kA8), and the program will be available at
https://github.com/M-Evanovic/LL-Localizer.
☆ 8-DoFs Cable Driven Parallel Robots for Bimanual Teleoperation
Teleoperation plays a critical role in intuitive robot control and imitation
learning, particularly for complex tasks involving mobile manipulators with
redundant degrees of freedom (DoFs). However, most existing master controllers
are limited to 6-DoF spatial control and basic gripper control, making them
insufficient for controlling high-DoF robots and restricting the operator to a
small workspace. In this work, we present a novel, low-cost, high-DoF master
controller based on Cable-Driven Parallel Robots (CDPRs), designed to overcome
these limitations. The system decouples translation and orientation control,
following a scalable 3 + 3 + n DoF structure: 3 DoFs for large-range
translation using a CDPR, 3 DoFs for orientation using a gimbal mechanism, and
n additional DoFs for gripper and redundant joint control. Its lightweight
cable-driven design enables a large and adaptable workspace while minimizing
actuator load. The end-effector remains stable without requiring continuous
high-torque input, unlike most serial robot arms. We developed the first
dual-arm CDPR-based master controller using cost-effective actuators and a
simple mechanical structure. In demonstrations, the system successfully
controlled an 8-DoF robotic arm with a 2-DoF pan-tilt camera, performing tasks
such as pick-and-place, knot tying, object sorting, and tape application. The
results show precise, versatile, and practical high-DoF teleoperation.
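For the 3-DoF translation stage, the inverse kinematics of an idealized CDPR reduce to point-to-point distances: each cable length is the straight-line distance from its fixed frame anchor to the end-effector attachment point. A sketch with made-up anchor coordinates (no cable sag, single attachment point):

```python
import math

def cable_lengths(ee, anchors):
    """Idealized CDPR inverse kinematics: each commanded cable length is the
    distance from its frame anchor to the end-effector point. Anchor layout
    below is an assumption for illustration, not the paper's hardware."""
    return [math.dist(ee, a) for a in anchors]
```

For anchors at the four top corners of a 2 m cube and the end-effector at the cube's center, all four cables come out equally long, which is why cable lengths alone suffice to command large-range translation.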
☆ Grasping by Spiraling: Reproducing Elephant Movements with Rigid-Soft Robot Synergy
Huishi Huang, Haozhe Wang, Chongyu Fang, Mingge Yan, Ruochen Xu, Yiyuan Zhang, Zhanchi Wang, Fengkang Ying, Jun Liu, Cecilia Laschi, Marcelo H. Ang Jr
The logarithmic spiral is observed as a common pattern in several living
beings across kingdoms and species. Some examples include fern shoots,
prehensile tails, and soft limbs like octopus arms and elephant trunks. In the
latter cases, spiraling is also used for grasping. Motivated by how this
strategy simplifies behavior into kinematic primitives and combines them to
develop smart grasping movements, this work focuses on the elephant trunk,
which is more deeply investigated in the literature. We present a soft arm
combined with a rigid robotic system to replicate elephant grasping
capabilities based on the combination of a soft trunk with a solid body. In our
system, the rigid arm ensures positioning and orientation, mimicking the role
of the elephant's head, while the soft manipulator reproduces trunk motion
primitives of bending and twisting under proper actuation patterns. This
synergy replicates 9 distinct elephant grasping strategies reported in the
literature, accommodating objects of varying shapes and sizes. The synergistic
interaction between the rigid and soft components of the system minimizes the
control complexity while maintaining a high degree of adaptability.
comment: Version 1. 16 pages, 5 figures
☆ Dynamic Initialization for LiDAR-inertial SLAM
The accuracy of the initial state, including initial velocity, gravity
direction, and IMU biases, is critical for the initialization of LiDAR-inertial
SLAM systems. Inaccurate initial values can reduce initialization speed or lead
to failure. When the system faces urgent tasks, robust and fast initialization
is required while the robot is moving, such as during the swift assessment of
rescue environments after natural disasters, bomb disposal, and restarting
LiDAR-inertial SLAM in rescue missions. However, existing initialization
methods usually require the platform to remain stationary, which is ineffective
when the robot is in motion. To address this issue, this paper introduces a
robust and fast dynamic initialization method for LiDAR-inertial systems
(D-LI-Init). This method iteratively aligns LiDAR-based odometry with IMU
measurements to achieve system initialization. To enhance the reliability of
the LiDAR odometry module, the LiDAR and gyroscope are tightly integrated
within the ESIKF framework. The gyroscope compensates for rotational distortion
in the point cloud. Translational distortion compensation occurs during the
iterative update phase, resulting in the output of LiDAR-gyroscope odometry.
The proposed method can initialize the system whether the robot is moving or
stationary. Experiments on public datasets and real-world environments
demonstrate that the D-LI-Init algorithm can effectively serve various
platforms, including vehicles, handheld devices, and UAVs. D-LI-Init completes
dynamic initialization regardless of specific motion patterns. To benefit the
research community, we have open-sourced our code and test datasets on GitHub.
comment: Accepted by IEEE/ASME Transactions on Mechatronics
☆ DF-Calib: Targetless LiDAR-Camera Calibration via Depth Flow
Precise LiDAR-camera calibration is crucial for integrating these two sensors
into robotic systems to achieve robust perception. In applications like
autonomous driving, online targetless calibration enables a prompt sensor
misalignment correction from mechanical vibrations without extra targets.
However, existing methods exhibit limitations in effectively extracting
consistent features from LiDAR and camera data and fail to prioritize salient
regions, compromising cross-modal alignment robustness. To address these
issues, we propose DF-Calib, a LiDAR-camera calibration method that
reformulates calibration as an intra-modality depth flow estimation problem.
DF-Calib estimates a dense depth map from the camera image and completes the
sparse LiDAR projected depth map, using a shared feature encoder to extract
consistent depth-to-depth features, effectively bridging the 2D-3D cross-modal
gap. Additionally, we introduce a reliability map to prioritize valid pixels
and propose a perceptually weighted sparse flow loss to enhance depth flow
estimation. Experimental results across multiple datasets validate its accuracy
and generalization, with DF-Calib achieving a mean translation error of 0.635 cm
and rotation error of 0.045 degrees on the KITTI dataset.
comment: 7 pages, 3 figures
☆ Pedestrian-Aware Motion Planning for Autonomous Driving in Complex Urban Scenarios
Motion planning in uncertain environments like complex urban areas is a key
challenge for autonomous vehicles (AVs). The aim of our research is to
investigate how AVs can navigate crowded, unpredictable scenarios with multiple
pedestrians while maintaining a safe and efficient vehicle behavior. So far,
most research has concentrated on static or deterministic traffic participant
behavior. This paper introduces a novel algorithm for motion planning in
crowded spaces by combining social force principles for simulating realistic
pedestrian behavior with a risk-aware motion planner. We evaluate this new
algorithm in a 2D simulation environment to rigorously assess AV-pedestrian
interactions, demonstrating that our algorithm enables safe, efficient, and
adaptive motion planning, particularly in highly crowded urban environments - a
first in achieving this level of performance. This study does not yet consider
real-time constraints and has been demonstrated only in simulation. Further
studies are needed to investigate the novel algorithm in a
complete software stack for AVs on real cars to investigate the entire
perception, planning and control pipeline in crowded scenarios. We release the
code developed in this research as an open-source resource for further studies
and development. It can be accessed at the following link:
https://github.com/TUM-AVS/PedestrianAwareMotionPlanning
comment: 13 Pages. Submitted to the IEEE Transactions on Intelligent Vehicles
☆ From Shadows to Safety: Occlusion Tracking and Risk Mitigation for Urban Autonomous Driving
Autonomous vehicles (AVs) must navigate dynamic urban environments where
occlusions and perception limitations introduce significant uncertainties. This
research builds upon and extends existing approaches in risk-aware motion
planning and occlusion tracking to address these challenges. While prior
studies have developed individual methods for occlusion tracking and risk
assessment, a comprehensive method integrating these techniques has not been
fully explored. We, therefore, enhance a phantom agent-centric model by
incorporating sequential reasoning to track occluded areas and predict
potential hazards. Our model enables realistic scenario representation and
context-aware risk evaluation by modeling diverse phantom agents, each with
distinct behavior profiles. Simulations demonstrate that the proposed approach
improves situational awareness and balances proactive safety with efficient
traffic flow. While these results underline the potential of our method,
validation in real-world scenarios is necessary to confirm its feasibility and
generalizability. By utilizing and advancing established methodologies, this
work contributes to safer and more reliable AV planning in complex urban
environments. To support further research, our method is available as
open-source software at:
https://github.com/TUM-AVS/OcclusionAwareMotionPlanning
comment: 8 Pages. Submitted to the IEEE Intelligent Vehicles Symposium (IV
2025), Romania
☆ Teaching Robots to Handle Nuclear Waste: A Teleoperation-Based Learning Approach
This paper presents a Learning from Teleoperation (LfT) framework that
integrates human expertise with robotic precision to enable robots to
autonomously perform skills learned from human operators. The proposed
framework addresses challenges in nuclear waste handling tasks, which often
involve repetitive and meticulous manipulation operations. By capturing
operator movements and manipulation forces during teleoperation, the framework
utilizes this data to train machine learning models capable of replicating and
generalizing human skills. We validate the effectiveness of the LfT framework
through its application to a power plug insertion task, selected as a
representative scenario that is repetitive yet requires precise trajectory and
force control. Experimental results highlight significant improvements in task
efficiency while reducing reliance on continuous operator involvement.
comment: Waste Management Symposia 2025
☆ Intuitive Human-Drone Collaborative Navigation in Unknown Environments through Mixed Reality
Considering the widespread integration of aerial robots in inspection, search
and rescue, and monitoring tasks, there is a growing demand to design intuitive
human-drone interfaces. These aim to streamline and enhance the user
interaction and collaboration process during drone navigation, ultimately
expediting mission success and accommodating users' inputs. In this paper, we
present a novel human-drone mixed reality interface that aims to (a) increase
human-drone spatial awareness by sharing relevant spatial information and
representations between the human equipped with a Head Mounted Display (HMD)
and the robot and (b) enable safer and intuitive human-drone interactive and
collaborative navigation in unknown environments beyond the simple command and
control or teleoperation paradigm. We validate our framework through extensive
user studies and experiments in a simulated post-disaster scenario, comparing
its performance against a traditional First-Person View (FPV) control system.
Furthermore, multiple tests on several users underscore the advantages of the
proposed solution, which offers intuitive and natural interaction with the
system. This demonstrates the solution's ability to assist humans during a
drone navigation mission, ensuring its safe and effective execution.
comment: Approved at ICUAS 25
☆ Inverse RL Scene Dynamics Learning for Nonlinear Predictive Control in Autonomous Vehicles
This paper introduces the Deep Learning-based Nonlinear Model Predictive
Controller with Scene Dynamics (DL-NMPC-SD) method for autonomous navigation.
DL-NMPC-SD uses an a-priori nominal vehicle model in combination with a scene
dynamics model learned from temporal range sensing information. The scene
dynamics model is responsible for estimating the desired vehicle trajectory and
for adjusting the true system model used by the underlying model predictive
controller. We propose to encode the scene dynamics model within the layers of
a deep neural network, which acts as a nonlinear approximator for the high
order state-space of the operating conditions. The model is learned based on
temporal sequences of range sensing observations and system states, both
integrated by an Augmented Memory component. We use Inverse Reinforcement
Learning and the Bellman optimality principle to train our learning controller
with a modified version of the Deep Q-Learning algorithm, enabling us to
estimate the desired state trajectory as an optimal action-value function. We
have evaluated DL-NMPC-SD against the baseline Dynamic Window Approach (DWA),
as well as against two state-of-the-art End2End and reinforcement learning
methods, respectively. The performance has been measured in three experiments:
i) in our GridSim virtual environment, ii) on indoor and outdoor navigation
tasks using our RovisLab AMTU (Autonomous Mobile Test Unit) platform and iii)
on a full scale autonomous test vehicle driving on public roads.
comment: 21 pages, 14 figures, journal paper
☆ Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers
We present Bi-LAT, a novel imitation learning framework that unifies
bilateral control with natural language processing to achieve precise force
modulation in robotic manipulation. Bi-LAT leverages joint position, velocity,
and torque data from leader-follower teleoperation while also integrating
visual and linguistic cues to dynamically adjust applied force. By encoding
human instructions such as "softly grasp the cup" or "strongly twist the
sponge" through a multimodal Transformer-based model, Bi-LAT learns to
distinguish nuanced force requirements in real-world tasks. We demonstrate
Bi-LAT's performance in (1) a unimanual cup-stacking scenario where the robot
accurately modulates grasp force based on language commands, and (2) a bimanual
sponge-twisting task that requires coordinated force control. Experimental
results show that Bi-LAT effectively reproduces the instructed force levels,
particularly when incorporating SigLIP among tested language encoders. Our
findings demonstrate the potential of integrating natural language cues into
imitation learning, paving the way for more intuitive and adaptive human-robot
interaction. For additional material, please visit:
https://mertcookimg.github.io/bi-lat/
☆ AIM: Acoustic Inertial Measurement for Indoor Drone Localization and Tracking
We present Acoustic Inertial Measurement (AIM), a one-of-a-kind technique for
indoor drone localization and tracking. Indoor drone localization and tracking
are arguably a crucial, yet unsolved challenge: in GPS-denied environments,
existing approaches have limited applicability, especially in Non-Line-of-Sight
(NLoS) conditions, require extensive environment instrumentation, or demand
considerable hardware/software changes on drones. In contrast, AIM exploits the
acoustic characteristics of the drones to estimate their location and derive
their motion, even in NLoS settings. We tame location estimation errors using a
dedicated Kalman filter and the Interquartile Range rule (IQR). We implement
AIM using an off-the-shelf microphone array and evaluate its performance with a
commercial drone under varied settings. Results indicate that the mean
localization error of AIM is 46% lower than commercial UWB-based systems in
complex indoor scenarios, where state-of-the-art infrared systems would not
even work because of NLoS settings. We further demonstrate that AIM can be
extended to support indoor spaces with arbitrary ranges and layouts without
loss of accuracy by deploying distributed microphone arrays.
comment: arXiv admin note: substantial text overlap with arXiv:2504.00445
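The AIM abstract mentions taming location estimation errors with a Kalman filter and the Interquartile Range (IQR) rule. A minimal sketch of such an IQR outlier filter over a batch of position estimates (the function name and the standard 1.5×IQR fence are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def iqr_filter(estimates):
    """Reject outlier position estimates using the Interquartile Range rule.

    estimates: (N, 2) array of planar position estimates (metres).
    Keeps only points whose coordinates lie within
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] on every axis.
    """
    q1, q3 = np.percentile(estimates, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = np.all((estimates >= lo) & (estimates <= hi), axis=1)
    return estimates[mask]
```

In a pipeline like AIM's, a filter of this kind would typically run before the Kalman update, so gross acoustic mis-detections never enter the state estimate.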
☆ Cuddle-Fish: Exploring a Soft Floating Robot with Flapping Wings for Physical Interactions
Mingyang Xu, Jiayi Shao, Yulan Ju, Ximing Shen, Qingyuan Gao, Weijen Chen, Qing Zhang, Yun Suen Pai, Giulia Barbareschi, Matthias Hoppe, Kouta Minamizawa, Kai Kunze
Flying robots, such as quadrotor drones, offer new possibilities for
human-robot interaction but often pose safety risks due to fast-spinning
propellers, rigid structures, and noise. In contrast, lighter-than-air
flapping-wing robots, inspired by animal movement, offer a soft, quiet, and
touch-safe alternative. Building on these advantages, we present
Cuddle-Fish, a soft, flapping-wing floating robot designed for safe,
close-proximity interactions in indoor spaces. Through a user study with 24
participants, we explored their perceptions of the robot and experiences during
a series of co-located demonstrations in which the robot moved near them.
Results showed that participants felt safe, willingly engaged in touch-based
interactions with the robot, and exhibited spontaneous affective behaviours,
such as patting, stroking, hugging, and cheek-touching, without external
prompting. They also reported positive emotional responses towards the robot.
These findings suggest that the soft floating robot with flapping wings can
serve as a novel and socially acceptable alternative to traditional rigid
flying robots, opening new possibilities for companionship, play, and
interactive experiences in everyday indoor environments.
☆ ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue
Recent advancements in visual odometry systems have improved autonomous
navigation; however, challenges persist in complex environments like forests,
where dense foliage, variable lighting, and repetitive textures compromise
feature correspondence accuracy. To address these challenges, we introduce
ForestGlue, enhancing the SuperPoint feature detector through four
configurations - grayscale, RGB, RGB-D, and stereo-vision - optimised for
various sensing modalities. For feature matching, we employ LightGlue or
SuperGlue, retrained with synthetic forest data. ForestGlue achieves comparable
pose estimation accuracy to baseline models but requires only 512 keypoints -
just 25% of the baseline's 2048 - to reach an LO-RANSAC AUC score of 0.745 at a
10° threshold. With only a quarter of the keypoints needed, ForestGlue
significantly reduces computational overhead, demonstrating effectiveness in
dynamic forest environments, and making it suitable for real-time deployment on
resource-constrained platforms. By combining ForestGlue with a
transformer-based pose estimation model, we propose ForestVO, which estimates
relative camera poses using matched 2D pixel coordinates between frames. On
challenging TartanAir forest sequences, ForestVO achieves an average relative
pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming
direct-based methods like DSO by 40% in dynamic scenes. Despite using only 10%
of the dataset for training, ForestVO maintains competitive performance with
TartanVO while being a significantly lighter model. This work establishes an
end-to-end deep learning pipeline specifically tailored for visual odometry in
forested environments, leveraging forest-specific training data to optimise
feature correspondence and pose estimation, thereby enhancing the accuracy and
robustness of autonomous navigation systems.
comment: Accepted to the IEEE Robotics and Automation Letters
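ForestVO reports an average relative pose error (RPE) of 1.09 m. As background, the translational RPE between consecutive frames can be sketched as below; this is the generic metric definition, not the paper's evaluation code:

```python
import numpy as np

def relative_pose_error(gt_poses, est_poses):
    """Average translational RPE over consecutive frame pairs.

    Poses are 4x4 homogeneous camera-to-world matrices. For each pair of
    consecutive frames, compare the ground-truth relative motion with the
    estimated one and accumulate the translation norm of the residual.
    """
    errs = []
    for t in range(len(gt_poses) - 1):
        gt_rel = np.linalg.inv(gt_poses[t]) @ gt_poses[t + 1]
        est_rel = np.linalg.inv(est_poses[t]) @ est_poses[t + 1]
        residual = np.linalg.inv(gt_rel) @ est_rel
        errs.append(np.linalg.norm(residual[:3, 3]))
    return float(np.mean(errs))
```

Because RPE compares frame-to-frame motions rather than absolute positions, it isolates local drift, which is the quantity a visual odometry front-end like ForestVO directly controls.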
☆ The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction
This study explores how human perceptions of a non-anthropomorphic robotic
manipulator are shaped by two key dimensions of behaviour: arousal, defined as
the robot's movement energy and expressiveness, and attention, defined as the
robot's capacity to selectively orient toward and engage with a user. We
introduce a novel control architecture that integrates a gaze-like attention
engine with an arousal-modulated motion system to generate socially meaningful
behaviours. In a user study, we find that robots exhibiting high attention --
actively directing their focus toward users -- are perceived as warmer and more
competent, intentional, and lifelike. In contrast, high arousal --
characterized by fast, expansive, and energetic motions -- increases
perceptions of discomfort and disturbance. Importantly, a combination of
focused attention and moderate arousal yields the highest ratings of trust and
sociability, while excessive arousal diminishes social engagement. These
findings offer design insights for endowing non-humanoid robots with
expressive, intuitive behaviours that support more natural human-robot
interaction.
comment: 7 pages, 3 figures, 1 table
♻ ☆ Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
NVIDIA, :, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Alice Luo, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang
Physical AI systems need to perceive, understand, and perform complex actions
in the physical world. In this paper, we present the Cosmos-Reason1 models that
can understand the physical world and generate appropriate embodied decisions
(e.g., next step action) in natural language through long chain-of-thought
reasoning processes. We begin by defining key capabilities for Physical AI
reasoning, with a focus on physical common sense and embodied reasoning. To
represent physical common sense, we use a hierarchical ontology that captures
fundamental knowledge about space, time, and physics. For embodied reasoning,
we rely on a two-dimensional ontology that generalizes across different
physical embodiments. Building on these capabilities, we develop two multimodal
large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data
and train our models in four stages: vision pre-training, general supervised
fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL)
as the post-training. To evaluate our models, we build comprehensive benchmarks
for physical common sense and embodied reasoning according to our ontologies.
Evaluation results show that Physical AI SFT and reinforcement learning bring
significant improvements. To facilitate the development of Physical AI, we will
make our code and pre-trained models available under the NVIDIA Open Model
License at https://github.com/nvidia-cosmos/cosmos-reason1.
♻ ☆ Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, Yuke Zhu
Large real-world robot datasets hold great potential to train generalist
robot models, but scaling real-world human data collection is time-consuming
and resource-intensive. Simulation has great potential in supplementing
large-scale data, especially with recent advances in generative AI and
automated data generation tools that enable scalable creation of robot behavior
datasets. However, training a policy solely in simulation and transferring it
to the real world often demands substantial human effort to bridge the reality
gap. A compelling alternative is to co-train the policy on a mixture of
simulation and real-world datasets. Preliminary studies have recently shown
this strategy to substantially improve the performance of a policy over one
trained on a limited amount of real-world data. Nonetheless, the community
lacks a systematic understanding of sim-and-real co-training and what it takes
to reap the benefits of simulation data for real-robot learning. This work
presents a simple yet effective recipe for utilizing simulation data to solve
vision-based robotic manipulation tasks. We derive this recipe from
comprehensive experiments that validate the co-training strategy on various
simulation and real-world datasets. Using two domains--a robot arm and a
humanoid--across diverse tasks, we demonstrate that simulation data can enhance
real-world task performance by an average of 38%, even with notable differences
between the simulation and real-world data. Videos and additional results can
be found at https://co-training.github.io/
comment: Project website: https://co-training.github.io/
♻ ☆ Dynamics-aware Diffusion Models for Planning and Control
This paper addresses the problem of generating dynamically admissible
trajectories for control tasks using diffusion models, particularly in
scenarios where the environment is complex and system dynamics are crucial for
practical application. We propose a novel framework that integrates system
dynamics directly into the diffusion model's denoising process through a
sequential prediction and projection mechanism. This mechanism, aligned with
the diffusion model's noising schedule, ensures generated trajectories are both
consistent with expert demonstrations and adhere to underlying physical
constraints. Notably, our approach can generate maximum likelihood trajectories
and accurately recover trajectories generated by linear feedback controllers,
even when explicit dynamics knowledge is unavailable. We validate the
effectiveness of our method through experiments on standard control tasks and a
complex non-convex optimal control problem involving waypoint tracking and
collision avoidance, demonstrating its potential for efficient trajectory
generation in practical applications.
comment: 8 pages, 3 figures
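The projection step described above, mapping a generated trajectory onto dynamically admissible states, can be illustrated for known linear dynamics x_{t+1} = A x_t + B u_t. This is a simplified standalone sketch; in the paper the mechanism is interleaved with the diffusion model's noising schedule:

```python
import numpy as np

def project_to_dynamics(traj, A, B):
    """Project a (noisy) state trajectory onto one admissible under
    x_{t+1} = A x_t + B u_t.

    At each step, solve the least-squares control
        u_t = argmin_u || A x_t + B u - x_{t+1} ||
    and replace x_{t+1} with the reachable state A x_t + B u_t.
    """
    proj = [traj[0]]
    for t in range(len(traj) - 1):
        # Best control steering the projected state toward the raw sample.
        u, *_ = np.linalg.lstsq(B, traj[t + 1] - A @ proj[-1], rcond=None)
        proj.append(A @ proj[-1] + B @ u)
    return np.stack(proj)
```

A trajectory that is already dynamically feasible passes through this projection unchanged, which is the property that lets such a step coexist with denoising without biasing clean samples.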
♻ ☆ A Tutorial on Distributed Optimization for Cooperative Robotics: from Setups and Algorithms to Toolboxes and Research Directions
Several interesting problems in multi-robot systems can be cast in the
framework of distributed optimization. Examples include multi-robot task
allocation, vehicle routing, target protection, and surveillance. While the
theoretical analysis of distributed optimization algorithms has received
significant attention, its application to cooperative robotics has not been
investigated in detail. In this paper, we show how notable scenarios in
cooperative robotics can be addressed by suitable distributed optimization
setups. Specifically, after a brief introduction on the widely investigated
consensus optimization (most suited for data analytics) and on the
partition-based setup (matching the graph structure in the optimization), we
focus on two distributed settings modeling several scenarios in cooperative
robotics, i.e., the so-called constraint-coupled and aggregative optimization
frameworks. For each one, we consider use-case applications, and we discuss
tailored distributed algorithms with their convergence properties. Then, we
revise state-of-the-art toolboxes allowing for the implementation of
distributed schemes on real networks of robots without central coordinators.
For each use case, we discuss its implementation in these toolboxes and provide
simulations and real experiments on networks of heterogeneous robots.
♻ ☆ A Model-Agnostic Approach for Semantically Driven Disambiguation in Human-Robot Interaction
Ambiguities are inevitable in human-robot interaction, especially when a
robot follows user instructions in a large, shared space. For example, if a
user asks the robot to find an object in a home environment with underspecified
instructions, the object could be in multiple locations depending on missing
factors. For instance, a bowl might be in the kitchen cabinet or on the dining
room table, depending on whether it is clean or dirty, full or empty, and the
presence of other objects around it. Previous works on object search have
assumed that the queried object is immediately visible to the robot or have
predicted object locations using one-shot inferences, which are likely to fail
for ambiguous or partially understood instructions. This paper focuses on these
gaps and presents a novel model-agnostic approach leveraging semantically
driven clarifications to enhance the robot's ability to locate queried objects
in fewer attempts. Specifically, we leverage different knowledge embedding
models, and when ambiguities arise, we propose an informative clarification
method, which follows an iterative prediction process. The user experiment
evaluation of our method shows that our approach is applicable to different
custom semantic encoders as well as LLMs, and informative clarifications
improve performances, enabling the robot to locate objects on its first
attempts. The user experiment data is publicly available at
https://github.com/IrmakDogan/ExpressionDataset.
comment: Under review for 2025 IEEE International Conference on Robot & Human
Interactive Communication (RO-MAN), Supplementary video:
https://youtu.be/_P0v07Xc24Y, Dataset publicly available:
https://github.com/IrmakDogan/ExpressionDataset
♻ ☆ Why Autonomous Vehicles Are Not Ready Yet: A Multi-Disciplinary Review of Problems, Attempted Solutions, and Future Directions
Xingshuai Dong, Max Cappuccio, Hamad Al Jassmi, Fady Alnajjar, Essam Debie, Milad Ghasrikhouzani, Alessandro Lanteri, Ali Luqman, Tate McGregor, Oleksandra Molloy, Alice Plebe, Michael Regan, Dongmo Zhang
Personal autonomous vehicles are cars, trucks and bikes capable of sensing
their surrounding environment, planning their route, and driving with little or
no involvement of human drivers. Despite the impressive technological
achievements made by the industry in recent times and the hopeful announcements
made by leading entrepreneurs, to date no personal vehicle is approved for road
circulation in a 'fully' or 'semi' autonomous mode (autonomy levels 4 and 5)
and it is still unclear when such vehicles will eventually be mature enough to
receive this kind of approval. The present review adopts an integrative and
multidisciplinary approach to investigate the major challenges faced by the
automotive sector, with the aim of identifying the problems that still trouble and
delay the commercialization of autonomous vehicles. The review examines the
limitations and risks associated with current technologies and the most
promising solutions devised by the researchers. This negative assessment
methodology is not motivated by pessimism, but by the aspiration to raise
critical awareness about the technology's state of the art, the industry's
quality standards, and society's demands and expectations. While the survey
primarily focuses on the applications of artificial intelligence for perception
and navigation, it also aims to offer an enlarged picture that links the purely
technological aspects with the relevant human-centric aspects, including
cultural attitudes, conceptual assumptions, and normative (ethico-legal)
frameworks. Examining the broader context serves to highlight problems that
have a cross-disciplinary scope and identify solutions that may benefit from a
holistic consideration.
comment: This manuscript extends the work "Applications of Computer Vision in
Autonomous Vehicles: Methods, Challenges, and Future Directions." We have
added several sections to explore autonomous vehicles from a
multidisciplinary perspective. We propose changing the arXiv category to
cs.RO, as the expanded content addresses broader autonomous vehicle topics
aligning more closely with the Robotics domain
♻ ☆ Learning Dual-Arm Push and Grasp Synergy in Dense Clutter
Robotic grasping in densely cluttered environments is challenging due to
scarce collision-free grasp affordances. Non-prehensile actions can increase
feasible grasps in cluttered environments, but most research focuses on
single-arm rather than dual-arm manipulation. Policies from single-arm systems
fail to fully leverage the advantages of dual-arm coordination. We propose a
target-oriented hierarchical deep reinforcement learning (DRL) framework that
learns dual-arm push-grasp synergy for grasping objects to enhance dexterous
manipulation in dense clutter. Our framework maps visual observations to
actions via a pre-trained deep learning backbone and a novel CNN-based DRL
model, trained with Proximal Policy Optimization (PPO), to develop a dual-arm
push-grasp strategy. The backbone enhances feature mapping in densely cluttered
environments. A novel fuzzy-based reward function is introduced to accelerate
efficient strategy learning. Our system is developed and trained in Isaac Gym
and then tested in simulations and on a real robot. Experimental results show
that our framework effectively maps visual data to dual push-grasp motions,
enabling the dual-arm system to grasp target objects in complex environments.
Compared to other methods, our approach generates 6-DoF grasp candidates and
enables dual-arm push actions, mimicking human behavior. Results show that our
method efficiently completes tasks in densely cluttered environments.
https://sites.google.com/view/pg4da/home
♻ ☆ Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
We study the task of language-conditioned pick and place in clutter, where a
robot should grasp a target object in open clutter and move it to a specified
place. Some approaches learn end-to-end policies with features from vision
foundation models, requiring large datasets. Others combine foundation models
in a zero-shot setting, suffering from cascading errors. In addition, they
primarily leverage vision and language foundation models, focusing less on
action priors. In this paper, we aim to develop an effective policy by
integrating foundation priors from vision, language, and action. We propose
A$^2$, an action prior alignment method that aligns unconditioned action priors
with 3D vision-language priors by learning one attention layer. The alignment
formulation enables our policy to train with less data and preserve zero-shot
generalization capabilities. We show that a shared policy for both pick and
place actions enhances the performance for each task, and introduce a policy
adaptation scheme to accommodate the multi-modal nature of actions. Extensive
experiments in simulation and the real-world show that our policy achieves
higher task success rates with fewer steps for both pick and place tasks in
clutter, effectively generalizing to unseen objects and language instructions.
Videos and codes are available at https://xukechun.github.io/papers/A2.
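The core of A² is alignment through a single learned attention layer. A toy numpy sketch of cross-attention in which action-prior features attend to vision-language features (the function names, projection setup, and residual connection are illustrative assumptions, not the authors' architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_action_prior(action_feats, vl_feats, Wq, Wk, Wv):
    """One cross-attention layer: action-prior features (queries) attend
    to vision-language features (keys/values) and are refined residually."""
    Q = action_feats @ Wq          # (num_actions, d)
    K = vl_feats @ Wk              # (num_vl_tokens, d)
    V = vl_feats @ Wv              # (num_vl_tokens, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return action_feats + attn @ V  # residual alignment
```

Training only this layer, rather than the full policy, is what would let the approach keep the zero-shot generalization of the frozen foundation priors while needing little data.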
♻ ☆ Bench4Merge: A Comprehensive Benchmark for Merging in Realistic Dense Traffic with Micro-Interactive Vehicles
While the capabilities of autonomous driving have advanced rapidly, merging
into dense traffic remains a significant challenge. Many motion planning
methods have been proposed for this scenario, but they are hard to evaluate.
Most existing closed-loop simulators rely on rule-based controls for other
vehicles, which results in a lack of diversity and randomness, thus failing to
accurately assess the motion planning capabilities in highly interactive
scenarios. Moreover, traditional evaluation metrics are insufficient for
comprehensively evaluating the performance of merging in dense traffic. In
response, we propose a closed-loop evaluation benchmark for assessing motion
planning capabilities in merging scenarios. Our approach involves surrounding
vehicles trained on large-scale datasets with micro-behavioral characteristics
that significantly enhance complexity and diversity. Additionally, we have
restructured the evaluation mechanism by leveraging Large Language Models
(LLMs) to assess each autonomous vehicle merging onto the main lane. Extensive
experiments and test-vehicle deployment have demonstrated the advantages of
this benchmark. Through this benchmark, we have obtained an evaluation of
existing methods and identified common issues. The simulation environment and
evaluation process can be accessed at https://github.com/WZM5853/Bench4Merge.
comment: 6 pages, 8 figures, under submission
♻ ☆ Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
DeepSeek series have demonstrated outstanding performance in general scene
understanding, question-answering (QA), and text generation tasks, owing to its
efficient training paradigm and strong reasoning capabilities. In this study,
we investigate the dialogue capabilities of the DeepSeek model in robotic
surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and
Detailed Description. The Single Phrase QA tasks further include sub-tasks such
as surgical instrument recognition, action understanding, and spatial position
analysis. We conduct extensive evaluations using publicly available datasets,
including EndoVis18 and CholecT50, along with their corresponding dialogue
data. Our comprehensive evaluation results indicate that, when provided with
specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue
recognition tasks. However, DeepSeek-V3 exhibits significant limitations in
spatial position analysis and struggles to understand surgical actions
accurately. Additionally, our findings reveal that, under general prompts,
DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts
and fails to provide detailed insights into surgical scenarios. Based on our
observations, we argue that the DeepSeek-V3 is not ready for vision-language
tasks in surgical contexts without fine-tuning on surgery-specific datasets.
comment: Technical Report
♻ ☆ EPIC: A Lightweight LiDAR-Based UAV Exploration Framework for Large-Scale Scenarios
Autonomous exploration is a fundamental problem for various applications of
unmanned aerial vehicles (UAVs). Recently, LiDAR-based exploration has gained
significant attention due to its ability to generate high-precision point cloud
maps of large-scale environments. While the point clouds are inherently
informative for navigation, many existing exploration methods still rely on
additional, often expensive, environmental representations. This reliance stems
from two main reasons: the need for frontier detection or information gain
computation, which typically depends on memory-intensive occupancy grid maps,
and the high computational complexity of path planning directly on point
clouds, primarily due to costly collision checking. To address these
limitations, we present EPIC, a lightweight LiDAR-based UAV exploration
framework that directly exploits point cloud data to explore large-scale
environments. EPIC introduces a novel observation map derived directly from the
quality of point clouds, eliminating the need for global occupancy grid maps
while preserving comprehensive exploration capabilities. We also propose an
incremental topological graph construction method operating directly on point
clouds, enabling real-time path planning in large-scale environments.
Leveraging these components, we build a hierarchical planning framework that
generates agile and energy-efficient trajectories, achieving significantly
reduced memory consumption and computation time compared to most existing
methods. Extensive simulations and real-world experiments demonstrate that EPIC
achieves faster exploration while significantly reducing memory consumption
compared to state-of-the-art methods.
comment: RAL 2025 accepted. Open-sourced at https://github.com/SYSU-STAR/EPIC
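The incremental topological graph the abstract describes can be illustrated with a minimal sketch: each new viewpoint becomes a node, and edges are added only to nearby nodes whose connecting segment stays clear of obstacle points. This is a hypothetical toy version with naive sampling-based collision checks, not the EPIC implementation; all names and parameters are illustrative.

```python
import math

class TopoGraph:
    """Toy incremental topological graph over raw point clouds
    (illustrative sketch, not the EPIC implementation)."""

    def __init__(self, connect_radius=5.0, robot_radius=0.5):
        self.nodes = []    # viewpoint positions (x, y, z)
        self.edges = {}    # node index -> set of neighbor indices
        self.connect_radius = connect_radius
        self.robot_radius = robot_radius

    def _segment_clear(self, a, b, obstacles, steps=10):
        # Sample the straight segment a-b and reject it if any obstacle
        # point lies within the robot radius of a sample (the costly
        # collision check the abstract mentions, done naively here).
        for i in range(steps + 1):
            t = i / steps
            p = tuple(a[k] + t * (b[k] - a[k]) for k in range(3))
            if any(math.dist(p, o) < self.robot_radius for o in obstacles):
                return False
        return True

    def add_viewpoint(self, pos, obstacles):
        # Incremental update: only the new node's local neighborhood is
        # examined, so no global occupancy grid is needed.
        idx = len(self.nodes)
        self.nodes.append(pos)
        self.edges[idx] = set()
        for j, q in enumerate(self.nodes[:-1]):
            if (math.dist(pos, q) <= self.connect_radius
                    and self._segment_clear(pos, q, obstacles)):
                self.edges[idx].add(j)
                self.edges[j].add(idx)
        return idx
```

A real system would replace the linear obstacle scan with a spatial index (e.g., a KD-tree) to keep per-update cost low in large-scale environments.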
♻ ☆ Learning Perceptive Humanoid Locomotion over Challenging Terrain
Humanoid robots are engineered to navigate terrains akin to those encountered
by humans, which necessitates human-like locomotion and perceptual abilities.
Currently, the most reliable controllers for humanoid motion rely exclusively
on proprioception, which becomes unreliable and even dangerous on rugged
terrain. Although the integration of height maps into
perception can enable proactive gait planning, robust utilization of this
information remains a significant challenge, especially when exteroceptive
perception is noisy. To surmount these challenges, we propose a solution based
on a teacher-student distillation framework. In this paradigm, an oracle policy
accesses noise-free data to establish an optimal reference policy, while the
student policy imitates the teacher's actions while simultaneously training
a world model with a variational information bottleneck for sensor
denoising and state estimation. Extensive evaluations demonstrate that our
approach markedly enhances performance in scenarios characterized by unreliable
terrain estimations. Moreover, in rigorous testing across both challenging
urban settings and off-road environments, the model successfully traversed
2 km of varied terrain without external intervention.
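The training objective this abstract sketches combines three terms: behavior cloning against the oracle teacher, world-model reconstruction of the denoised state, and a KL penalty that implements the variational information bottleneck. The following is an illustrative loss computation under assumed names and weights, not the paper's actual formulation.

```python
import numpy as np

def vib_distillation_loss(student_action, teacher_action,
                          mu, log_var, recon, target_state,
                          beta=1e-3):
    """Hypothetical teacher-student loss with a variational information
    bottleneck world model (names and weights are illustrative)."""
    # Behavior cloning: the student imitates the oracle teacher's action.
    imitation = np.mean((student_action - teacher_action) ** 2)
    # World model: reconstruct the noise-free state from the latent code.
    reconstruction = np.mean((recon - target_state) ** 2)
    # Bottleneck: KL(q(z|x) || N(0, I)) limits how much of the noisy
    # sensor stream the latent can carry, which encourages denoising.
    kl = 0.5 * np.mean(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return imitation + reconstruction + beta * kl
```

With a standard-normal posterior (zero mean, unit variance) and perfect imitation and reconstruction, all three terms vanish.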
♻ ☆ TeraSim: Uncovering Unknown Unsafe Events for Autonomous Vehicles through Generative Simulation
Haowei Sun, Xintao Yan, Zhijie Qiao, Haojie Zhu, Yihao Sun, Jiawei Wang, Shengyin Shen, Darian Hogue, Rajanikant Ananta, Derek Johnson, Greg Stevens, Greg McGuire, Yifan Wei, Wei Zheng, Yong Sun, Yasuo Fukai, Henry X. Liu
Traffic simulation is essential for autonomous vehicle (AV) development,
enabling comprehensive safety evaluation across diverse driving conditions.
However, traditional rule-based simulators struggle to capture complex human
interactions, while data-driven approaches often fail to maintain long-term
behavioral realism or generate diverse safety-critical events. To address these
challenges, we propose TeraSim, an open-source, high-fidelity traffic
simulation platform designed to uncover unknown unsafe events and efficiently
estimate AV statistical performance metrics, such as crash rates. TeraSim is
designed for seamless integration with third-party physics simulators and
standalone AV stacks, to construct a complete AV simulation system.
Experimental results demonstrate its effectiveness in generating diverse
safety-critical events involving both static and dynamic agents, identifying
hidden deficiencies in AV systems, and enabling statistical performance
evaluation. These findings highlight TeraSim's potential as a practical tool
for AV safety assessment, benefiting researchers, developers, and policymakers.
The code is available at https://github.com/mcity/TeraSim.
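Estimating rare metrics such as crash rates efficiently is commonly done with importance sampling: safety-critical scenarios are over-sampled during simulation and their outcomes reweighted back to the true probability measure. The sketch below illustrates that general technique; the function and its parameters are assumptions, not TeraSim's actual API.

```python
import random

def estimate_crash_rate(simulate, rare_prob, boosted_prob,
                        n_episodes=10_000, seed=0):
    """Importance-sampling crash-rate estimator (illustrative of the
    statistical evaluation the abstract targets, not TeraSim's API).

    simulate(critical, rng) -> bool  runs one episode and reports a crash;
    rare_prob is the true probability of a safety-critical scenario,
    boosted_prob the inflated probability used for sampling.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        critical = rng.random() < boosted_prob   # over-sample rare events
        weight = (rare_prob / boosted_prob if critical
                  else (1 - rare_prob) / (1 - boosted_prob))
        if simulate(critical, rng):
            total += weight                      # reweight to true measure
    return total / n_episodes
```

Over-sampling reduces the variance of the estimate for the same episode budget, which is why generative scenario platforms pair event generation with this kind of reweighting.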