Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets

Haruki Abe1,2, Takayuki Osa2, Yusuke Mukuta1,2, Tatsuya Harada1,2
1The University of Tokyo, 2RIKEN Center for Advanced Intelligence Project
ICLR 2026
Teaser image of cross-embodiment offline RL scaling

We scale robot pre-training by combining cross-embodiment learning with offline RL, enabling training on both expert and suboptimal trajectories across many robot morphologies.

1. Why Cross-Embodiment + Offline RL?

Robotics faces a persistent data bottleneck: collecting high-quality demonstrations is costly and difficult to scale.

Cross-embodiment learning broadens data diversity by aggregating trajectories from multiple robot types into a shared training stream, expanding behavior and dynamics coverage beyond a single embodiment.

Offline reinforcement learning broadens data usability by learning effectively from mixed-quality trajectories, including suboptimal logs. Together, they form a practical path to scalable robot pre-training.

2. Benchmark + Dataset

The benchmark covers 16 robots (9 quadrupeds, 6 bipeds, 1 hexapod). For each embodiment, we construct 1M-step datasets with the following quality regimes:

  • Expert: 1M steps collected by rolling out a fully converged PPO policy (high-quality trajectories).
  • Expert Replay: interaction data from training start to expert-level performance, then uniformly subsampled to 1M steps.
  • 70% Suboptimal Replay: 70% sampled from the early suboptimal phase and 30% from the late expert-like phase.

Each regime is provided in Forward and Backward variants. Importantly, we use a family of suboptimal-ratio datasets (including 30% and 70% mixes) to stress-test transfer, conflict, and robustness as data quality changes in cross-embodiment settings.
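As a concrete illustration, the suboptimal-replay regimes above amount to mixing early (suboptimal) and late (expert-like) segments of a logged replay stream. A minimal sketch, assuming the stream is ordered by training time and split at an illustrative halfway point (the split fraction and function name are assumptions, not the paper's code):

```python
import random

def build_suboptimal_mix(replay_stream, target_steps=1_000_000,
                         suboptimal_ratio=0.7, split_frac=0.5, seed=0):
    """Mix early (suboptimal) and late (expert-like) transitions.

    replay_stream: list of transitions ordered by training time.
    split_frac: fraction of the stream treated as the 'early' phase
                (an assumption for illustration).
    """
    rng = random.Random(seed)
    split = int(len(replay_stream) * split_frac)
    early, late = replay_stream[:split], replay_stream[split:]
    n_early = int(target_steps * suboptimal_ratio)   # e.g. 70% of the budget
    n_late = target_steps - n_early                  # remaining 30%
    # Sample with replacement so the sketch works for any stream length.
    mixed = [rng.choice(early) for _ in range(n_early)] + \
            [rng.choice(late) for _ in range(n_late)]
    rng.shuffle(mixed)
    return mixed
```

Setting `suboptimal_ratio=0.3` yields the 30% variant under the same construction.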

Full robot roster across embodiments

3. Transfer Findings and Gradient Conflict Analysis

Cross-embodiment offline RL does not yield uniformly positive transfer. As the suboptimal ratio increases, some robots improve while others degrade. To identify the mechanism, we analyze gradient conflicts directly.

Gradient Conflict Analysis

To explain this behavior, we measure the pairwise cosine similarity of actor gradients between embodiments.

Key Finding

  • Higher suboptimal ratio (Expert -> 30% -> 70%) increases conflict.
  • More embodiments and greater diversity increase conflict frequency.

The same trend appears in both IQL and TD3+BC, indicating that naive scaling is limited by optimization conflict.

Chart: fraction of gradient pairs with cosine similarity below 0, versus suboptimal ratio
Chart: fraction of gradient pairs with cosine similarity below 0, versus embodiment count and diversity
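The conflict statistic itself is simple to compute. A minimal sketch, assuming per-embodiment actor gradients are already available as flat vectors (the function names are illustrative, not the paper's code):

```python
import numpy as np

def pairwise_grad_cosine(grads):
    """grads: dict mapping embodiment name -> flat actor-gradient vector."""
    names = sorted(grads)
    # Normalize once so each pairwise dot product is a cosine similarity.
    unit = {n: grads[n] / (np.linalg.norm(grads[n]) + 1e-12) for n in names}
    cos = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            cos[(a, b)] = float(unit[a] @ unit[b])
    return cos

def conflict_fraction(cos):
    """Fraction of embodiment pairs whose gradients point apart (cos < 0)."""
    vals = list(cos.values())
    return sum(v < 0 for v in vals) / len(vals)
```

The charts above plot this fraction as the suboptimal ratio, embodiment count, and diversity vary.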

Correlation Between Embodiment Distance and Gradient Conflicts

We represent each robot as a graph (torso, joints, feet) and compute embodiment similarity using fused Gromov-Wasserstein (FGW), then compare similarity with gradient cosine statistics.

Key Finding

  • Embodiment similarity correlates strongly with gradient alignment: morphologically similar robots tend to produce aligned updates, while dissimilar robots conflict more often.
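The correlation reduces to comparing the off-diagonal entries of the two matrices shown below. A minimal numpy sketch, where the FGW distance matrix and gradient-cosine matrix are assumed to be precomputed (matrix names are illustrative):

```python
import numpy as np

def distance_cosine_correlation(embodiment_dist, grad_cos):
    """Pearson correlation between pairwise embodiment distance (e.g. FGW)
    and pairwise actor-gradient cosine similarity, over the upper triangle."""
    iu = np.triu_indices_from(embodiment_dist, k=1)  # each robot pair once
    d = embodiment_dist[iu]
    c = grad_cos[iu]
    return float(np.corrcoef(d, c)[0, 1])
```

A strongly negative value here means larger morphological distance goes with lower gradient alignment, which is the scatter-plot trend reported for both IQL and TD3+BC.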

IQL: embodiment similarity matrix, gradient cosine matrix, and scatter correlation.

IQL embodiment similarity matrix
IQL gradient cosine similarity matrix
IQL embodiment similarity versus gradient cosine scatter plot

TD3+BC: embodiment similarity matrix, gradient cosine matrix, and scatter correlation.

TD3+BC embodiment similarity matrix
TD3+BC gradient cosine similarity matrix
TD3+BC embodiment similarity versus gradient cosine scatter plot

4. Embodiment Grouping (EG)

EG reduces cross-robot interference by updating morphologically compatible robots together. The method computes embodiment distances once, clusters robots, and then performs group-wise actor updates with a shared critic update.

  1. Robot-as-graph: represent each embodiment with a torso-joint-foot graph.
  2. FGW distances: compute pairwise embodiment distances from graph structure and node features.
  3. Hierarchical clustering: cluster robots into M fixed groups before training.
  4. Group-wise actor updates: update the critic globally, then update the actor sequentially by group.

Why it works: conflicting updates are suppressed across dissimilar robots, while positive transfer is preserved within each group.

Why practical: distances are precomputed once, and EG can be added to standard offline RL pipelines with minimal implementation overhead.
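Steps 2 through 4 above can be sketched with scipy's hierarchical clustering; the FGW distance matrix and the critic/actor update callables below are placeholders for the paper's components, not its actual code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_embodiments(fgw_dist, n_groups):
    """Cluster robots into n_groups from a precomputed FGW distance matrix."""
    condensed = squareform(fgw_dist, checks=False)   # square -> condensed form
    tree = linkage(condensed, method='average')      # agglomerative clustering
    labels = fcluster(tree, t=n_groups, criterion='maxclust')
    groups = {}
    for robot, g in enumerate(labels):
        groups.setdefault(int(g), []).append(robot)
    return list(groups.values())

def train_step(batch_by_robot, groups, update_critic, update_actor):
    """One EG step: global critic update, then sequential group-wise actor updates."""
    update_critic(batch_by_robot)                    # shared critic, all robots
    for group in groups:                             # one actor update per group
        update_actor({r: batch_by_robot[r] for r in group})
```

Because `group_embodiments` runs once before training, the per-step overhead is just the loop over groups in `train_step`, which is why EG drops into standard offline RL pipelines easily.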

EG pipeline: robot graph to FGW matrix to clustering to group-wise actor updates

5. Results

EG achieves the best overall mean performance among compared methods. The largest improvements appear in suboptimal-heavy datasets, where conflicts are most severe. On 70% Suboptimal Forward, IQL+EG improves from 36.62 to 51.19 (+39.8%).

| Dataset | BC | TD3+BC | IQL | IQL+SEL | IQL+PCGrad | BC+EG | TD3+BC+EG | IQL+EG (Ours) |
|---|---|---|---|---|---|---|---|---|
| Expert Forward | 63.31 ± 0.10 | 52.14 ± 1.89 | 63.39 ± 0.05 | 63.37 ± 0.07 | 63.37 ± 0.04 | 63.47 ± 0.04 | 59.34 ± 1.19 | 63.52 ± 0.04 |
| Expert Backward | 67.17 ± 0.01 | 47.94 ± 0.48 | 67.10 ± 0.01 | 67.24 ± 0.02 | 67.05 ± 0.02 | 67.24 ± 0.02 | 51.98 ± 1.26 | 67.24 ± 0.01 |
| Expert Replay Forward | 49.71 ± 1.06 | 55.66 ± 0.84 | 54.61 ± 0.12 | 55.01 ± 0.55 | 53.84 ± 0.67 | 51.89 ± 0.65 | 57.04 ± 0.46 | 54.62 ± 0.53 |
| Expert Replay Backward | 42.87 ± 1.32 | 52.31 ± 1.22 | 51.86 ± 1.56 | 55.73 ± 1.06 | 55.94 ± 1.14 | 48.64 ± 2.65 | 55.67 ± 1.30 | 57.58 ± 0.05 |
| 70% Suboptimal Forward | 30.52 ± 3.10 | 35.74 ± 1.51 | 36.62 ± 1.02 | 44.59 ± 2.02 | 39.63 ± 1.95 | 42.99 ± 1.23 | 43.41 ± 1.54 | 51.19 ± 1.06 |
| 70% Suboptimal Backward | 41.42 ± 0.71 | 34.79 ± 1.40 | 38.69 ± 0.89 | 44.45 ± 1.75 | 41.04 ± 1.10 | 46.30 ± 2.39 | 40.88 ± 0.83 | 49.60 ± 2.39 |
| Mean | 49.17 | 46.43 | 52.05 | 55.07 | 53.48 | 53.42 | 51.39 | 57.29 |

BibTeX

@inproceedings{abe2026crossembodimentoffline,
  title     = {Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets},
  author    = {Abe, Haruki and Osa, Takayuki and Mukuta, Yusuke and Harada, Tatsuya},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}