We scale robot pre-training by combining cross-embodiment learning with offline RL, enabling the use of both expert and suboptimal trajectories across many robot morphologies.
Robotics faces a persistent data bottleneck: collecting high-quality demonstrations is costly and difficult to scale.
Cross-embodiment learning broadens data diversity by aggregating trajectories from multiple robot types into a shared training stream, expanding behavior and dynamics coverage beyond a single embodiment.
Offline reinforcement learning broadens data usability by learning effectively from mixed-quality trajectories, including suboptimal logs. Together, they form a practical path to scalable robot pre-training.
The benchmark covers 16 robots (9 quadrupeds, 6 bipeds, 1 hexapod). For each embodiment, we construct 1M-step datasets in the quality regimes shown in the results table below: Expert, Expert Replay, and suboptimal mixtures at varying ratios (e.g., 70% suboptimal).
Each regime is provided in Forward and Backward task variants. Importantly, we use a family of suboptimal-ratio datasets to stress-test transfer, gradient conflict, and robustness as data quality varies in cross-embodiment settings.
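As a concrete illustration, a mixed-quality dataset at a given suboptimal ratio can be assembled by subsampling transitions from an expert pool and a suboptimal pool. The sketch below is hypothetical: the dict-of-arrays format and field names are assumptions, not the benchmark's actual API.

```python
import numpy as np

def mix_datasets(expert, suboptimal, sub_ratio=0.7, total_steps=1_000_000, seed=0):
    """Build a mixed-quality dataset with `sub_ratio` suboptimal transitions.

    `expert` and `suboptimal` are dicts of aligned arrays, e.g.
    {"observations": ..., "actions": ..., "rewards": ..., "terminals": ...}.
    Both pools must contain at least as many transitions as requested.
    """
    rng = np.random.default_rng(seed)
    n_sub = int(total_steps * sub_ratio)
    n_exp = total_steps - n_sub
    sub_idx = rng.choice(len(suboptimal["rewards"]), size=n_sub, replace=False)
    exp_idx = rng.choice(len(expert["rewards"]), size=n_exp, replace=False)
    return {
        key: np.concatenate([expert[key][exp_idx], suboptimal[key][sub_idx]], axis=0)
        for key in expert
    }
```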
Cross-embodiment offline RL does not yield uniformly positive transfer. As the suboptimal ratio increases, some robots improve while others degrade. To identify the mechanism, we analyze gradient conflicts directly.
To explain this behavior, we measure pairwise actor-gradient cosine similarity between embodiments.

**Key Finding:** The same trend appears in both IQL and TD3+BC, indicating that naive cross-embodiment scaling is limited by optimization conflict.
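To make the measurement concrete, here is a minimal PyTorch sketch of pairwise actor-gradient cosine similarity. It assumes a shared actor and a per-embodiment scalar loss computed on that embodiment's batch; `actor` and the `losses` interface are placeholders, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def flat_actor_grad(actor, loss):
    """Gradient of one embodiment's actor loss, flattened to a single vector."""
    grads = torch.autograd.grad(loss, list(actor.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def pairwise_grad_cosine(actor, losses):
    """Pairwise cosine similarity between per-embodiment actor gradients.

    `losses` maps embodiment name -> scalar actor loss on that embodiment's
    batch (each loss must come from its own forward pass).
    """
    vecs = {name: flat_actor_grad(actor, loss) for name, loss in losses.items()}
    names = list(vecs)
    sim = torch.zeros(len(names), len(names))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            sim[i, j] = F.cosine_similarity(vecs[a], vecs[b], dim=0)
    return names, sim
```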
We represent each robot as a graph (torso, joints, feet) and compute embodiment similarity using fused Gromov-Wasserstein (FGW), then compare similarity with gradient cosine statistics.
**Key Finding:** Morphological similarity predicts gradient alignment: embodiment pairs with higher FGW similarity exhibit higher actor-gradient cosine similarity, for both IQL and TD3+BC.
Figure (IQL): embodiment similarity matrix, gradient cosine matrix, and scatter correlation.
Figure (TD3+BC): embodiment similarity matrix, gradient cosine matrix, and scatter correlation.
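A minimal sketch of the FGW similarity computation with the POT library follows. The graph encoding is an illustrative assumption (node features as one-hot part types for torso/joints/feet, structure matrices as hop distances), as is the exp(-distance) similarity transform; neither is claimed to be the paper's exact featurization.

```python
import numpy as np
import ot  # POT: pip install pot

def fgw_distance(feat_a, adj_a, feat_b, adj_b, alpha=0.5):
    """Fused Gromov-Wasserstein distance between two robot graphs.

    feat_*: (n, d) node feature matrices (e.g., one-hot part types).
    adj_*:  (n, n) structure matrices (e.g., hop distances between parts).
    """
    p = ot.unif(feat_a.shape[0])  # uniform mass over nodes
    q = ot.unif(feat_b.shape[0])
    M = ot.dist(feat_a, feat_b)   # pairwise node-feature cost
    return ot.gromov.fused_gromov_wasserstein2(M, adj_a, adj_b, p, q, alpha=alpha)

def similarity_matrix(graphs, alpha=0.5):
    """Embodiment similarity as exp(-FGW distance), computed pairwise.

    `graphs` is a list of (features, structure) tuples, one per robot.
    """
    n = len(graphs)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = fgw_distance(*graphs[i], *graphs[j], alpha=alpha)
            S[i, j] = np.exp(-d)
    return S
```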
EG reduces cross-robot interference by updating morphologically compatible robots together. The method computes embodiment distances once, clusters robots, and then performs group-wise actor updates with a shared critic update.
Why it works: conflicting updates across dissimilar robots are suppressed, while positive transfer is preserved within each group.
Why it's practical: distances are precomputed once, and EG can be added to standard offline RL pipelines with minimal implementation overhead.
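The following is a minimal sketch of the EG update scheme under stated assumptions: a precomputed FGW distance matrix, agglomerative clustering into groups, a shared critic updated on all data, and one actor update per group. The `critic_loss`/`actor_loss` helpers, `sample_batch`, and optimizer wiring are placeholders, not the paper's exact implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_embodiments(fgw_dist, n_groups):
    """Cluster robots once from the precomputed FGW distance matrix."""
    labels = AgglomerativeClustering(
        n_clusters=n_groups, metric="precomputed", linkage="average"
    ).fit_predict(fgw_dist)
    return [np.where(labels == g)[0] for g in range(n_groups)]

def eg_training_step(robots, groups, critic_opt, actor_opt,
                     critic_loss, actor_loss, sample_batch):
    """One EG step: shared critic update, then group-wise actor updates."""
    # Shared critic update on a batch mixing all embodiments.
    critic_opt.zero_grad()
    critic_loss(sample_batch(robots)).backward()
    critic_opt.step()

    # Group-wise actor updates: only morphologically compatible robots
    # contribute to the same gradient, suppressing cross-group conflict.
    for group in groups:
        actor_opt.zero_grad()
        actor_loss(sample_batch([robots[i] for i in group])).backward()
        actor_opt.step()
```

Since the FGW distances and the clustering are computed once before training, the per-step overhead relative to standard offline RL is just the group-wise batching of actor updates.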
EG achieves the best overall mean performance among the compared methods. The largest gains appear on suboptimal-heavy datasets, where gradient conflicts are most severe: on 70% Suboptimal Forward, IQL+EG improves over IQL from 36.62 to 51.19 (+39.8%).
| Dataset | BC | TD3+BC | IQL | IQL+SEL | IQL+PCGrad | BC+EG | TD3+BC+EG | IQL+EG (Ours) |
|---|---|---|---|---|---|---|---|---|
| Expert Forward | 63.31 ± 0.10 | 52.14 ± 1.89 | 63.39 ± 0.05 | 63.37 ± 0.07 | 63.37 ± 0.04 | 63.47 ± 0.04 | 59.34 ± 1.19 | 63.52 ± 0.04 |
| Expert Backward | 67.17 ± 0.01 | 47.94 ± 0.48 | 67.10 ± 0.01 | 67.24 ± 0.02 | 67.05 ± 0.02 | 67.24 ± 0.02 | 51.98 ± 1.26 | 67.24 ± 0.01 |
| Expert Replay Forward | 49.71 ± 1.06 | 55.66 ± 0.84 | 54.61 ± 0.12 | 55.01 ± 0.55 | 53.84 ± 0.67 | 51.89 ± 0.65 | 57.04 ± 0.46 | 54.62 ± 0.53 |
| Expert Replay Backward | 42.87 ± 1.32 | 52.31 ± 1.22 | 51.86 ± 1.56 | 55.73 ± 1.06 | 55.94 ± 1.14 | 48.64 ± 2.65 | 55.67 ± 1.30 | 57.58 ± 0.05 |
| 70% Suboptimal Forward | 30.52 ± 3.10 | 35.74 ± 1.51 | 36.62 ± 1.02 | 44.59 ± 2.02 | 39.63 ± 1.95 | 42.99 ± 1.23 | 43.41 ± 1.54 | 51.19 ± 1.06 |
| 70% Suboptimal Backward | 41.42 ± 0.71 | 34.79 ± 1.40 | 38.69 ± 0.89 | 44.45 ± 1.75 | 41.04 ± 1.10 | 46.30 ± 2.39 | 40.88 ± 0.83 | 49.60 ± 2.39 |
| Mean | 49.17 | 46.43 | 52.05 | 55.07 | 53.48 | 53.42 | 51.39 | 57.29 |
```bibtex
@inproceedings{abe2026crossembodimentoffline,
  title     = {Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets},
  author    = {Abe, Haruki and Osa, Takayuki and Mukuta, Yusuke and Harada, Tatsuya},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```