Robotics 18
♻ ☆ Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning CVPR 2025
Learning from multiple domains is a primary factor that influences the
generalization of a single unified robot system. In this paper, we aim to learn
the trajectory prediction model by using broad out-of-domain data to improve
its performance and generalization ability. The trajectory model is designed to
predict any-point trajectories in the current frame given an instruction and
can provide detailed control guidance for robotic policy learning. To handle
the diverse out-of-domain data distribution, we propose a sparsely-gated MoE
(Top-1 gating strategy) architecture for the trajectory model, coined
Tra-MoE. The sparse activation design enables a good balance between
parameter cooperation and specialization, effectively benefiting from
large-scale out-of-domain data while maintaining constant FLOPs per token. In
addition, we further introduce an adaptive policy conditioning technique by
learning 2D mask representations for predicted trajectories, which is
explicitly aligned with image observations to guide action prediction more
flexibly. We perform extensive experiments in both simulated and real-world
scenarios to verify the effectiveness of Tra-MoE and the adaptive policy
conditioning technique. We also conduct a comprehensive empirical study to
train Tra-MoE, demonstrating that our Tra-MoE consistently exhibits superior
performance compared to the dense baseline model, even when the latter is
scaled to match Tra-MoE's parameter count.
comment: Accepted to CVPR 2025. Code Page: https://github.com/MCG-NJU/Tra-MoE
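For readers unfamiliar with sparsely-gated MoE layers, the Top-1 routing the abstract refers to can be sketched as follows. This is a minimal, generic MoE layer in PyTorch, not the authors' code; the expert architecture and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Minimal sparsely-gated MoE with Top-1 routing (illustrative only)."""

    def __init__(self, dim, hidden_dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, dim). Each token is routed to exactly one expert,
        # so per-token FLOPs stay constant as the number of experts grows.
        probs = F.softmax(self.gate(x), dim=-1)   # (tokens, experts)
        top_p, top_idx = probs.max(dim=-1)        # Top-1 gating
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 256)
print(Top1MoE(dim=256, hidden_dim=1024, num_experts=4)(tokens).shape)
```

Because each token activates a single expert, adding experts grows parameter capacity without growing per-token compute, which is the property the abstract highlights.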
♻ ☆ ActiveGAMER: Active GAussian Mapping through Efficient Rendering CVPR2025
We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian
Splatting (3DGS) to achieve high-quality, real-time scene mapping and
exploration. Unlike traditional NeRF-based methods, which are computationally
demanding and restrict active mapping performance, our approach leverages the
efficient rendering capabilities of 3DGS, allowing effective and efficient
exploration in complex environments. The core of our system is a
rendering-based information gain module that dynamically identifies the most
informative viewpoints for next-best-view planning, enhancing both geometric
and photometric reconstruction accuracy. ActiveGAMER also integrates a
carefully balanced framework, combining coarse-to-fine exploration,
post-refinement, and a global-local keyframe selection strategy to maximize
reconstruction completeness and fidelity. Our system autonomously explores and
reconstructs environments with state-of-the-art geometric and photometric
accuracy and completeness, significantly surpassing existing approaches in both
aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D
highlight ActiveGAMER's effectiveness in active mapping tasks.
comment: Accepted to CVPR2025
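A rendering-based information-gain criterion for next-best-view selection can be illustrated with the toy loop below: render each candidate pose against the current map and score how much of the view is still unexplained. The render function, the opacity-based uncertainty proxy, and the candidate set are assumptions for illustration, not ActiveGAMER's actual interface.

```python
import numpy as np

def information_gain(render_fn, pose):
    """Score a candidate viewpoint by the fraction of pixels the current
    map cannot yet explain (hypothetical uncertainty proxy)."""
    opacity = render_fn(pose)             # (H, W) accumulated opacity in [0, 1]
    return float((1.0 - opacity).mean())  # low opacity -> unobserved space

def next_best_view(render_fn, candidate_poses):
    gains = [information_gain(render_fn, p) for p in candidate_poses]
    best = int(np.argmax(gains))
    return candidate_poses[best], gains[best]

# Toy usage with a fake renderer standing in for the 3DGS rasterizer.
fake_render = lambda pose: np.clip(np.random.rand(64, 64) + 0.1 * pose[0], 0.0, 1.0)
pose, gain = next_best_view(fake_render, [np.array([i, 0.0, 0.0]) for i in range(5)])
```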
♻ ☆ TelePreview: A User-Friendly Teleoperation System with Virtual Arm Assistance for Enhanced Effectiveness
Teleoperation provides an effective way to collect robot data, which is
crucial for learning from demonstrations. In this field, teleoperation faces
several key challenges: user-friendliness for new users, safety assurance, and
transferability across different platforms. While collecting real-robot
dexterous manipulation data by teleoperation has shown impressive results on
diverse tasks, the morphological differences between human and robot hands
make it hard for new users to understand the action mapping and raise
potential safety concerns during operation.
To address these limitations, we introduce TelePreview. This teleoperation
system offers real-time visual feedback on robot actions based on human user
inputs, with a total hardware cost of less than $1,000. TelePreview allows the
user to see a virtual robot that represents the outcome of the user's next
movement. By enabling flexible switching between command visualization and
actual execution, this system helps new users learn how to demonstrate quickly
and safely. We demonstrate that it outperforms other teleoperation systems
across five tasks, emphasize its ease of use, and highlight its straightforward
deployment across diverse robotic platforms. We release our code and a
deployment document on our website
https://nus-lins-lab.github.io/telepreview-web/.
comment: In submission
♻ ☆ One Policy to Run Them All: an End-to-end Learning Approach to Multi-Embodiment Locomotion
Nico Bohlinger, Grzegorz Czechmanowski, Maciej Krupka, Piotr Kicki, Krzysztof Walas, Jan Peters, Davide Tateo
Deep Reinforcement Learning techniques are achieving state-of-the-art results
in robust legged locomotion. While there exists a wide variety of legged
platforms such as quadrupeds, humanoids, and hexapods, the field is still
missing a single learning framework that can control all these different
embodiments easily and effectively, and possibly transfer, zero- or few-shot, to
unseen robot embodiments. We introduce URMA, the Unified Robot Morphology
Architecture, to close this gap. Our framework brings the end-to-end Multi-Task
Reinforcement Learning approach to the realm of legged robots, enabling the
learned policy to control any type of robot morphology. The key idea of our
method is to allow the network to learn an abstract locomotion controller that
can be seamlessly shared between embodiments thanks to our morphology-agnostic
encoders and decoders. This flexible architecture can be seen as a potential
first step in building a foundation model for legged robot locomotion. Our
experiments show that URMA can learn a locomotion policy on multiple
embodiments that can be easily transferred to unseen robot platforms in
simulation and the real world.
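The morphology-agnostic idea can be pictured as a shared per-joint encoder with attention pooling, so robots with different numbers of joints map to a fixed-size embedding. The sketch below is a generic illustration under assumed feature sizes, not URMA's exact architecture.

```python
import torch
import torch.nn as nn

class MorphologyAgnosticEncoder(nn.Module):
    """Shared per-joint MLP plus attention pooling: any number of joints maps
    to one fixed-size embedding (illustrative, assumed sizes)."""

    def __init__(self, joint_dim=8, hidden=64):
        super().__init__()
        self.per_joint = nn.Sequential(nn.Linear(joint_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)

    def forward(self, joint_obs):
        # joint_obs: (num_joints, joint_dim); num_joints differs per robot.
        h = self.per_joint(joint_obs)
        w = torch.softmax(self.score(h), dim=0)
        return (w * h).sum(dim=0)  # fixed-size embedding regardless of morphology

enc = MorphologyAgnosticEncoder()
print(enc(torch.randn(12, 8)).shape, enc(torch.randn(19, 8)).shape)
```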
♻ ☆ A Digital Twin for Telesurgery under Intermittent Communication
Telesurgery is an effective way to deliver service from expert surgeons to
areas without immediate access to specialized resources. However, many of these
areas, such as rural districts or battlefields, might be subject to different
problems in communication, especially latency and intermittent periods of
communication outage. This challenge motivates the use of a digital twin for
the surgical system, where a simulation would mirror the robot hardware and
surgical environment in the real world. The surgeon would then be able to
interact with the digital twin during communication outage, followed by a
recovery strategy on the real robot upon reestablishing communication. This
paper builds the digital twin for the da Vinci surgical robot, with a buffering
and replay strategy that reduces the mean task completion time by 23% when
compared to the baseline, for a peg transfer task subject to intermittent
communication outage. The relevant code can be found here:
https://github.com/LCSR-CIIS/dvrk_digital_twin_teleoperation.
comment: 7 pages, 5 figures. To be published in 2025 International Symposium
on Medical Robotics (ISMR)
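The buffering-and-replay strategy is conceptually simple: during an outage the operator keeps interacting with the digital twin while commands are queued, and the queue is replayed on the real robot once the link recovers. The class below is a generic sketch with assumed `twin`/`robot` interfaces, not the dVRK implementation.

```python
from collections import deque

class BufferedTeleop:
    """Queue commands during outages, replay them on reconnection (illustrative)."""

    def __init__(self, twin, robot):
        self.twin, self.robot = twin, robot
        self.buffer = deque()

    def send(self, command, link_up):
        self.twin.apply(command)         # the digital twin always tracks the operator
        if link_up:
            self.flush()                 # replay anything queued during the outage
            self.robot.apply(command)
        else:
            self.buffer.append(command)  # store for later replay

    def flush(self):
        while self.buffer:
            self.robot.apply(self.buffer.popleft())

class _Stub:
    def apply(self, cmd): print("apply", cmd)

t = BufferedTeleop(_Stub(), _Stub())
t.send("move(0.1)", link_up=False)  # buffered, only the twin moves
t.send("move(0.2)", link_up=True)   # replays the buffer, then executes
```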
♻ ☆ Hierarchical Procedural Framework for Low-latency Robot-Assisted Hand-Object Interaction
Advances in robotics have been driving the development of human-robot
interaction (HRI) technologies. However, accurately perceiving human actions
and achieving adaptive control remain a challenge in facilitating seamless
coordination between human and robotic movements. In this paper, we propose a
hierarchical procedural framework to enable dynamic robot-assisted hand-object
interaction. The open-loop hierarchy leverages computer vision (CV)-based 3D
reconstruction of the human hand, from which motion primitives are designed to
translate hand motions into robotic actions. The low-level
coordination hierarchy fine-tunes the robot's action by using the continuously
updated 3D hand models. Experimental validation demonstrates the effectiveness
of the hierarchical control architecture. The adaptive coordination between
human and robot behavior has achieved a delay of $\leq 0.3$ seconds in the
tele-interaction scenario. A case study of ring-wearing tasks indicates the
potential application of this work in assistive technologies such as healthcare
and manufacturing.
comment: 6 pages, 5 figures
♻ ☆ DELTA: Decomposed Efficient Long-Term Robot Task Planning using Large Language Models ICRA 2025
Recent advancements in Large Language Models (LLMs) have sparked a revolution
across many research fields. In robotics, the integration of common-sense
knowledge from LLMs into task and motion planning has drastically advanced the
field by unlocking unprecedented levels of context awareness. Despite their
vast collection of knowledge, large language models may generate infeasible
plans due to hallucinations or missing domain information. To address these
challenges and improve plan feasibility and computational efficiency, we
introduce DELTA, a novel LLM-informed task planning approach. By using scene
graphs as environment representations within LLMs, DELTA achieves rapid
generation of precise planning problem descriptions. To enhance planning
performance, DELTA decomposes long-term task goals with LLMs into an
autoregressive sequence of sub-goals, enabling automated task planners to
efficiently solve complex problems. In our extensive evaluation, we show that
DELTA enables an efficient and fully automatic task planning pipeline,
achieving higher planning success rates and significantly shorter planning
times compared to the state of the art. Project webpage:
https://delta-llm.github.io/
comment: Accepted at ICRA 2025
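The decomposition pipeline the abstract describes can be sketched schematically: an LLM turns the scene graph and the long-term goal into a sequence of sub-goals, and a task planner solves each in turn. The `llm` and `planner` callables below are placeholders, not DELTA's actual interfaces.

```python
def plan_with_subgoals(goal, scene_graph, llm, planner):
    """Schematic pipeline: decompose a long-term goal into sub-goals, then let a
    classical planner solve each smaller problem (placeholder callables)."""
    subgoals = llm(f"Decompose '{goal}' given scene graph: {scene_graph}")
    plan, state = [], scene_graph
    for sg in subgoals:
        actions, state = planner(state, sg)  # solve the smaller sub-problem
        plan.extend(actions)
    return plan

# Toy usage with trivial stand-ins for the LLM and the planner.
plan = plan_with_subgoals(
    "tidy the kitchen", {"cup": "table"},
    llm=lambda prompt: ["pick cup", "place cup in sink"],
    planner=lambda state, sg: ([sg], state),
)
print(plan)
```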
♻ ☆ RedMotion: Motion Prediction via Redundancy Reduction
We introduce RedMotion, a transformer model for motion prediction in
self-driving vehicles that learns environment representations via redundancy
reduction. Our first type of redundancy reduction is induced by an internal
transformer decoder and reduces a variable-sized set of local road environment
tokens, representing road graphs and agent data, to a fixed-sized global
embedding. The second type of redundancy reduction is obtained by
self-supervised learning and applies the redundancy reduction principle to
embeddings generated from augmented views of road environments. Our experiments
reveal that our representation learning approach outperforms PreTraM, Traj-MAE,
and GraphDINO in a semi-supervised setting. Moreover, RedMotion achieves
competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction
Challenge. Our open-source implementation is available at:
https://github.com/kit-mrt/future-motion
comment: TMLR published version
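The self-supervised redundancy reduction mentioned in the abstract follows the principle popularized by Barlow Twins: push the cross-correlation matrix of two augmented views' embeddings toward the identity. A minimal version of such a loss (not necessarily RedMotion's exact formulation) looks like this:

```python
import torch

def redundancy_reduction_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins-style loss: the cross-correlation of two views' embeddings
    should approach the identity matrix (illustrative sketch)."""
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    n, _ = z_a.shape
    c = (z_a.T @ z_b) / n                                    # (d, d) cross-correlation
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()         # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag

loss = redundancy_reduction_loss(torch.randn(128, 64), torch.randn(128, 64))
print(loss.item())
```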
♻ ☆ A formal implementation of Behavior Trees to act in robotics
Behavior Trees (BT) are becoming quite popular as an Acting component of
autonomous robotic systems. We propose to define a formal semantics to BT by
translating them to a formal language which enables us to perform verification
of programs written with BT, as well as runtime verification while these BT
execute. This allows us to formally verify BT correctness without requiring BT
programmers to master formal languages and without compromising BTs' most
valuable features: modularity, flexibility, and reusability. We present the
formal framework we use: Fiacre, its language and the produced TTS model; Tina,
its model checking tools and Hippo, its runtime verification engine. We then
show how the translation from BT to Fiacre is automatically done, the type of
formal LTL and CTL properties we can check offline and how to execute the
formal model online in place of a regular BT engine. We illustrate our approach
on two robotics applications, and show how BT can be extended with state
variables, eval nodes, and node evaluation results, and how they benefit from
other features available in the Fiacre formal framework (e.g., time).
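For context, the Behavior Tree semantics being formalized can be recalled through the standard tick rules for Sequence and Fallback nodes; the minimal sketch below uses the textbook semantics and is independent of the Fiacre translation.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

def tick_sequence(children, blackboard):
    """Sequence: tick children left to right; fail or stay RUNNING as soon as
    one child does, succeed only if all succeed."""
    for child in children:
        s = child(blackboard)
        if s != Status.SUCCESS:
            return s
    return Status.SUCCESS

def tick_fallback(children, blackboard):
    """Fallback (Selector): succeed or stay RUNNING on the first child that does."""
    for child in children:
        s = child(blackboard)
        if s != Status.FAILURE:
            return s
    return Status.FAILURE

# Toy usage: "at goal, otherwise keep moving".
at_goal = lambda bb: Status.SUCCESS if bb.get("at_goal") else Status.FAILURE
move = lambda bb: Status.RUNNING
print(tick_fallback([at_goal, lambda bb: tick_sequence([move], bb)], {"at_goal": False}))
```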
♻ ☆ AVOCADO: Adaptive Optimal Collision Avoidance driven by Opinion
Diego Martinez-Baselga, Eduardo Sebastián, Eduardo Montijano, Luis Riazuelo, Carlos Sagüés, Luis Montano
We present AVOCADO (AdaptiVe Optimal Collision Avoidance Driven by Opinion),
a novel navigation approach to address holonomic robot collision avoidance when
the robot does not know how cooperative the other agents in the environment
are. AVOCADO builds on a Velocity Obstacle (VO) formulation akin to the
Optimal Reciprocal Collision Avoidance method. However, instead of assuming
reciprocity, it poses an adaptive control problem to adapt to the cooperation
level of other robots and agents in real time. This is achieved through a novel
nonlinear opinion dynamics design that relies solely on sensor observations. As
a by-product, we leverage tools from the opinion dynamics formulation to
naturally avoid the deadlocks in geometrically symmetric scenarios from which
VO-based planners typically suffer. Extensive numerical simulations show that
AVOCADO surpasses existing motion planners in mixed cooperative/non-cooperative
navigation environments in terms of success rate, time to goal and
computational time. In addition, we conduct multiple real experiments that
verify that AVOCADO is able to avoid collisions in environments crowded with
other robots and humans.
comment: This paper is published in IEEE Transactions on Robotics under DOI
10.1109/TRO.2025.3552350
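The nonlinear opinion dynamics used to adapt the cooperation level can be illustrated with a generic saturated opinion update from the opinion-dynamics literature; the coefficients, the evidence term, and the mapping from opinion to a reciprocity weight below are assumptions for illustration, not AVOCADO's specific design.

```python
import numpy as np

def opinion_step(z, obs, dt=0.05, d=1.0, u=2.0, alpha=1.0, b=0.0):
    """One Euler step of a saturated (nonlinear) opinion update: the opinion z
    drifts toward a decision when the attention u and the evidence obs from
    sensor observations are strong enough (generic form)."""
    dz = -d * z + u * np.tanh(alpha * z + obs) + b
    return z + dt * dz

def cooperation_level(z):
    # Map the opinion to a [0, 1] reciprocity weight (assumed mapping).
    return 0.5 * (np.tanh(z) + 1.0)

z = 0.0
for obs in [0.3, 0.5, 0.8, 0.8, 0.8]:  # growing evidence that the other agent cooperates
    z = opinion_step(z, obs)
print(cooperation_level(z))
```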
♻ ☆ A Graph-to-Text Approach to Knowledge-Grounded Response Generation in Human-Robot Interaction
Knowledge graphs are often used to represent structured information in a
flexible and efficient manner, but their use in situated dialogue remains
under-explored. This paper presents a novel conversational model for
human-robot interaction that rests upon a graph-based representation of the
dialogue state. The knowledge graph representing the dialogue state is
continuously updated with new observations from the robot sensors, including
linguistic, situated and multimodal inputs, and is further enriched by other
modules, in particular for spatial understanding. The neural conversational
model employed to respond to user utterances relies on a simple but effective
graph-to-text mechanism that traverses the dialogue state graph and converts
the traversals into a natural language form. This conversion of the state graph
into text is performed using a set of parameterized functions, and the values
for those parameters are optimized based on a small set of Wizard-of-Oz
interactions. After this conversion, the text representation of the dialogue
state graph is included as part of the prompt of a large language model used to
decode the agent response. The proposed approach is empirically evaluated
through a user study with a humanoid robot that acts as conversation partner to
evaluate the impact of the graph-to-text mechanism on the response generation.
After moving a robot along a tour of an indoor environment, participants
interacted with the robot using spoken dialogue and evaluated how well the
robot was able to answer questions about what the robot observed during the
tour. User scores show a statistically significant improvement in the perceived
factuality of the robot responses when the graph-to-text approach is employed,
compared to a baseline using inputs structured as semantic triples.
comment: Submitted to Dialogue & Discourse 2023
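The graph-to-text step can be pictured as a small routine that verbalizes triples traversed from the dialogue-state graph and prepends the result to the LLM prompt. The toy templating below is only illustrative; the paper optimizes parameterized traversal functions from Wizard-of-Oz data.

```python
def graph_to_text(triples):
    """Verbalize (subject, relation, object) triples into simple sentences
    (toy templating; relation names and templates are assumptions)."""
    templates = {
        "located_in": "{s} is located in {o}.",
        "observed":   "The robot observed {s} in the {o}.",
        "color":      "{s} is {o}.",
    }
    return " ".join(
        templates.get(r, "{s} {r} {o}.").format(s=s, r=r.replace("_", " "), o=o)
        for s, r, o in triples
    )

state = [("a red mug", "observed", "kitchen"), ("the kitchen", "located_in", "first floor")]
prompt = graph_to_text(state) + "\nUser: What did you see in the kitchen?\nRobot:"
print(prompt)
```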
♻ ☆ Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation
Humans can accomplish complex contact-rich tasks using vision and touch, with
highly reactive capabilities such as quick adjustments to environmental changes
and adaptive control of contact forces; however, this remains challenging for
robots. Existing visual imitation learning (IL) approaches rely on action
chunking to model complex behaviors, which lacks the ability to respond
instantly to real-time tactile feedback during the chunk execution.
Furthermore, most teleoperation systems struggle to provide fine-grained
tactile / force feedback, which limits the range of tasks that can be
performed. To address these challenges, we introduce TactAR, a low-cost
teleoperation system that provides real-time tactile feedback through Augmented
Reality (AR), along with Reactive Diffusion Policy (RDP), a novel slow-fast
visual-tactile imitation learning algorithm for learning contact-rich
manipulation skills. RDP employs a two-level hierarchy: (1) a slow latent
diffusion policy for predicting high-level action chunks in latent space at low
frequency, and (2) a fast asymmetric tokenizer for closed-loop tactile feedback
control at high frequency. This design enables both complex trajectory modeling
and quick reactive behavior within a unified framework. Through extensive
evaluation across three challenging contact-rich tasks, RDP significantly
improves performance compared to state-of-the-art visual IL baselines through
rapid response to tactile / force feedback. Furthermore, experiments show that
RDP is applicable across different tactile / force sensors. Code and videos are
available on https://reactive-diffusion-policy.github.io.
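The slow-fast hierarchy can be sketched as two nested loops: a low-frequency loop that predicts a latent action chunk, and a high-frequency loop that decodes each action with the latest tactile reading. The callables below are assumed placeholder interfaces, not the released RDP API.

```python
import numpy as np

def run_episode(slow_policy, fast_decoder, get_obs, get_tactile, execute,
                chunk_len=8, steps=64):
    """Two-rate control loop: slow latent chunk prediction, fast tactile
    correction (schematic only; arguments are assumed interfaces)."""
    for t in range(0, steps, chunk_len):
        latent_chunk = slow_policy(get_obs())          # low frequency, once per chunk
        for k in range(chunk_len):
            action = fast_decoder(latent_chunk, k, get_tactile())  # high frequency
            execute(action)

# Toy usage with trivial stand-ins for the policy, tokenizer, and robot.
run_episode(
    slow_policy=lambda obs: np.zeros(16),
    fast_decoder=lambda z, k, tac: z[:3] + 0.01 * tac,
    get_obs=lambda: np.zeros(8),
    get_tactile=lambda: np.random.randn(3),
    execute=lambda a: None,
)
```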
♻ ☆ AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors ICLR 2025
Visuo-tactile sensors aim to emulate human tactile perception, enabling
robots to precisely understand and manipulate objects. Over time, numerous
meticulously designed visuo-tactile sensors have been integrated into robotic
systems, aiding in completing various tasks. However, the distinct data
characteristics of these poorly standardized visuo-tactile sensors hinder the
establishment of a powerful tactile perception system. We consider that the key
to addressing this issue lies in learning unified multi-sensor representations,
thereby integrating the sensors and promoting tactile knowledge transfer
between them. To achieve unified representation of this nature, we introduce
TacQuad, an aligned multi-modal multi-sensor tactile dataset from four
different visuo-tactile sensors, which enables the explicit integration of
various sensors. Recognizing that humans perceive the physical environment by
acquiring diverse tactile information such as texture and pressure changes, we
further propose to learn unified multi-sensor representations from both static
and dynamic perspectives. By integrating tactile images and videos, we present
AnyTouch, a unified static-dynamic multi-sensor representation learning
framework with a multi-level structure, aimed at both enhancing comprehensive
perceptual abilities and enabling effective cross-sensor transfer. This
multi-level architecture captures pixel-level details from tactile data via
masked modeling and enhances perception and transferability by learning
semantic-level sensor-agnostic features through multi-modal alignment and
cross-sensor matching. We provide a comprehensive analysis of multi-sensor
transferability, and validate our method on various datasets and in the
real-world pouring task. Experimental results show that our method outperforms
existing methods and exhibits outstanding static and dynamic perception
capabilities across various sensors.
comment: Accepted by ICLR 2025
♻ ☆ VET: A Visual-Electronic Tactile System for Immersive Human-Machine Interaction
In the pursuit of deeper immersion in human-machine interaction, achieving
higher-dimensional tactile input and output on a single interface has become a
key research focus. This study introduces the Visual-Electronic Tactile (VET)
System, which builds upon vision-based tactile sensors (VBTS) and integrates
electrical stimulation feedback to enable bidirectional tactile communication.
We propose and implement a system framework that seamlessly integrates an
electrical stimulation film with VBTS using a screen-printing preparation
process, eliminating interference from traditional methods. While VBTS captures
multi-dimensional input through visuotactile signals, electrical stimulation
feedback directly stimulates neural pathways, preventing interference with
visuotactile information. The potential of the VET system is demonstrated
through experiments on finger electrical stimulation sensitivity zones, as well
as applications in interactive gaming and robotic arm teleoperation. This
system paves the way for new advancements in bidirectional tactile interaction
and its broader applications.
♻ ☆ Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks IROS 2024
Recent works in Task and Motion Planning (TAMP) show that training control
policies on language-supervised robot trajectories with quality labeled data
markedly improves agent task success rates. However, the scarcity of such data
presents a significant hurdle to extending these methods to general use cases.
To address this concern, we present an automated framework to decompose
trajectory data into temporally bounded and natural language-based descriptive
sub-tasks by leveraging recent prompting strategies for Foundation Models (FMs)
including both Large Language Models (LLMs) and Vision Language Models (VLMs).
Our framework provides both time-based and language-based descriptions for
lower-level sub-tasks that comprise full trajectories. To rigorously evaluate
the quality of our automatic labeling framework, we contribute an algorithm,
SIMILARITY, that produces two novel metrics: temporal similarity and semantic
similarity. The metrics measure the temporal alignment and semantic fidelity of
language descriptions between two sub-task decompositions, namely an FM
sub-task decomposition prediction and a ground-truth sub-task decomposition. We
present scores for temporal similarity and semantic similarity above 90%,
compared to 30% for a randomized baseline, for multiple robotic environments,
demonstrating the effectiveness of our proposed framework. Our results enable
building diverse, large-scale, language-supervised datasets for improved
robotic TAMP.
comment: 8 pages, 3 figures. IROS 2024 Submission
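The two metrics can be approximated with simple stand-ins: temporal similarity as interval IoU between matched sub-task segments, and semantic similarity as cosine similarity between embedded descriptions. The matching is assumed given and the embedding is a placeholder; the paper's SIMILARITY algorithm handles both more carefully.

```python
import numpy as np

def temporal_similarity(pred_segments, gt_segments):
    """Mean interval IoU between matched (start, end) segments (illustrative)."""
    ious = []
    for (ps, pe), (gs, ge) in zip(pred_segments, gt_segments):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = max(pe, ge) - min(ps, gs)
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))

def semantic_similarity(pred_texts, gt_texts, embed):
    """Mean cosine similarity between embedded sub-task descriptions."""
    sims = []
    for p, g in zip(pred_texts, gt_texts):
        a, b = embed(p), embed(g)
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)))
    return float(np.mean(sims))

# Toy usage with a bag-of-characters "embedding" as a placeholder.
embed = lambda s: np.bincount(np.frombuffer(s.encode(), dtype=np.uint8), minlength=256).astype(float)
print(temporal_similarity([(0, 4), (4, 9)], [(0, 5), (5, 9)]),
      semantic_similarity(["pick the cube"], ["grasp the cube"], embed))
```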
♻ ☆ Scalable Real2Sim: Physics-Aware Asset Generation Via Robotic Pick-and-Place Setups
Simulating object dynamics from real-world perception shows great promise for
digital twins and robotic manipulation but often demands labor-intensive
measurements and expertise. We present a fully automated Real2Sim pipeline that
generates simulation-ready assets for real-world objects through robotic
interaction. Using only a robot's joint torque sensors and an external camera,
the pipeline identifies visual geometry, collision geometry, and physical
properties such as inertial parameters. Our approach introduces a general
method for extracting high-quality, object-centric meshes from photometric
reconstruction techniques (e.g., NeRF, Gaussian Splatting) by employing
alpha-transparent training while explicitly distinguishing foreground
occlusions from background subtraction. We validate the full pipeline through
extensive experiments, demonstrating its effectiveness across diverse objects.
By eliminating the need for manual intervention or environment modifications,
our pipeline can be integrated directly into existing pick-and-place setups,
enabling scalable and efficient dataset creation. Project page (with code and
data): https://scalable-real2sim.github.io/.
comment: Website: https://scalable-real2sim.github.io/
♻ ☆ RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
Cooperative perception offers an optimal solution to overcome the perception
limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X)
communication for data sharing and fusion across multiple agents. However, most
existing approaches focus on single-modality data exchange, limiting the
potential of both homogeneous and heterogeneous fusion across agents. This
overlooks the opportunity to utilize multi-modality data per agent, restricting
the system's performance. In the automotive industry, manufacturers adopt
diverse sensor configurations, resulting in heterogeneous combinations of
sensor modalities across agents. To harness the potential of every possible
data source for optimal performance, we design a robust LiDAR and camera
cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to
both intra-agent cross-modality fusion and inter-agent cross-modality fusion
scenarios, owing to the convenient coordinate conversion by transformation
matrix and the unified sampling/inversion mechanism. We also propose two
different architectures, named Paint-To-Puzzle (PTP) and
Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception. PTP
aims for maximum precision and achieves a smaller data packet size by limiting
cross-agent fusion to a single instance, but requires all participants to be
equipped with LiDAR. In contrast, CoS-CoCo supports agents with any
configuration (LiDAR-only, camera-only, or both LiDAR and camera), offering
greater generalization ability. Our approach achieves state-of-the-art
(SOTA) performance on both real and simulated cooperative perception datasets.
The code is now available at GitHub.
♻ ☆ Whole-Body Dynamic Throwing with Legged Manipulators
Throwing with a legged robot involves precise coordination of object
manipulation and locomotion, which is crucial for advanced real-world interactions.
Most research focuses on either manipulation or locomotion, with minimal
exploration of tasks requiring both. This work investigates leveraging all
available motors (full-body control) rather than arm-only throwing in legged
manipulators. We
frame the task as a deep reinforcement learning (RL) objective, optimising
throwing accuracy towards any user-commanded target destination and the robot's
stability. Evaluations on a humanoid and an armed quadruped in simulation show
that full-body throwing improves range, accuracy, and stability by exploiting
body momentum, counter-balancing, and full-body dynamics. We introduce an
optimised adaptive curriculum to balance throwing accuracy and stability, along
with a tailored RL environment setup for efficient learning in sparse-reward
conditions. Unlike prior work, our approach generalises to targets in 3D space.
We transfer our learned controllers from simulation to a real humanoid
platform.
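The RL objective described (throwing accuracy toward a commanded 3D target plus stability) can be sketched as a shaped reward; the weights and the stability proxy below are assumptions for illustration, not the paper's reward design.

```python
import numpy as np

def throwing_reward(landing_pos, target_pos, base_rpy, joint_vel,
                    w_acc=1.0, w_stab=0.2, w_effort=0.01):
    """Toy shaping: accuracy toward the commanded 3D target, an upright-base
    stability proxy, and an effort penalty (illustrative weights)."""
    accuracy = -np.linalg.norm(landing_pos - target_pos)
    stability = -np.abs(base_rpy[:2]).sum()   # penalize roll and pitch
    effort = -np.square(joint_vel).sum()
    return w_acc * accuracy + w_stab * stability + w_effort * effort

r = throwing_reward(np.array([2.0, 0.1, 0.0]), np.array([2.5, 0.0, 0.0]),
                    np.array([0.05, 0.02, 0.3]), np.zeros(19))
print(r)
```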