Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets

Haruki Abe1,2, Takayuki Osa2, Yusuke Mukuta1,2, Tatsuya Harada1,2
1The University of Tokyo, 2RIKEN Center for Advanced Intelligence Project
ICLR 2026
Teaser image of cross-embodiment offline RL scaling

We scale robot pre-training by combining cross-embodiment learning with offline RL, enabling training on both expert and suboptimal trajectories across many robot morphologies.

1. Why Cross-Embodiment + Offline RL?

Robotics faces a persistent data bottleneck: collecting high-quality demonstrations is costly and difficult to scale.

Cross-embodiment learning broadens data diversity by aggregating trajectories from multiple robot types into a shared training stream, expanding behavior and dynamics coverage beyond a single embodiment.

Offline reinforcement learning broadens data usability by learning effectively from mixed-quality trajectories, including suboptimal logs. Together, they form a practical path to scalable robot pre-training.

2. Benchmark + Dataset

The benchmark covers 16 robots (9 quadrupeds, 6 bipeds, 1 hexapod). For each embodiment, we construct 1M-step datasets with the following quality regimes:

  • Expert: 1M steps collected by rolling out a fully converged PPO policy (high-quality trajectories).
  • Expert Replay: interaction data from training start to expert-level performance, then uniformly subsampled to 1M steps.
  • 70% Suboptimal Replay: 70% sampled from the early suboptimal phase and 30% from the late expert-like phase.

Each regime is provided in Forward and Backward variants. Importantly, we use a family of suboptimal-ratio datasets (including 30% and 70% mixes) to stress-test transfer, conflict, and robustness as data quality changes in cross-embodiment settings.
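As a concrete illustration, the suboptimal-replay regimes above amount to mixing early (suboptimal) and late (expert-like) segments of a logged replay stream. A minimal sketch, assuming the stream is ordered by training time and split at an illustrative halfway point (the split fraction and function name are assumptions, not the paper's code):

```python
import random

def build_suboptimal_mix(replay_stream, target_steps=1_000_000,
                         suboptimal_ratio=0.7, split_frac=0.5, seed=0):
    """Mix early (suboptimal) and late (expert-like) transitions.

    replay_stream: list of transitions ordered by training time.
    split_frac: fraction of the stream treated as the 'early' phase
                (an assumption for illustration).
    """
    rng = random.Random(seed)
    split = int(len(replay_stream) * split_frac)
    early, late = replay_stream[:split], replay_stream[split:]
    n_early = int(target_steps * suboptimal_ratio)   # e.g. 70% of the budget
    n_late = target_steps - n_early                  # remaining 30%
    # Sample with replacement so the sketch works for any stream length.
    mixed = [rng.choice(early) for _ in range(n_early)] + \
            [rng.choice(late) for _ in range(n_late)]
    rng.shuffle(mixed)
    return mixed
```

Setting `suboptimal_ratio=0.3` yields the 30% variant under the same construction.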

Full robot roster across embodiments

3. Transfer Findings and Gradient Conflict Analysis

Cross-embodiment offline RL does not yield uniformly positive transfer. As the suboptimal ratio increases, some robots improve while others degrade. To identify the mechanism, we analyze gradient conflicts directly.

Gradient Conflict Analysis

To explain this behavior, we measure the pairwise cosine similarity of actor gradients between embodiments.

Key Finding

  • Higher suboptimal ratio (Expert -> 30% -> 70%) increases conflict.
  • More embodiments and greater diversity increase conflict frequency.

The same trend appears in both IQL and TD3+BC, indicating that naive scaling is limited by optimization conflict.

Chart: fraction of gradient pairs with cosine similarity below 0, versus suboptimal ratio
Chart: fraction of gradient pairs with cosine similarity below 0, versus embodiment count and diversity
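The conflict statistic itself is simple to compute. A minimal sketch, assuming per-embodiment actor gradients are already available as flat vectors (the function names are illustrative, not the paper's code):

```python
import numpy as np

def pairwise_grad_cosine(grads):
    """grads: dict mapping embodiment name -> flat actor-gradient vector."""
    names = sorted(grads)
    # Normalize once so each pairwise dot product is a cosine similarity.
    unit = {n: grads[n] / (np.linalg.norm(grads[n]) + 1e-12) for n in names}
    cos = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            cos[(a, b)] = float(unit[a] @ unit[b])
    return cos

def conflict_fraction(cos):
    """Fraction of embodiment pairs whose gradients point apart (cos < 0)."""
    vals = list(cos.values())
    return sum(v < 0 for v in vals) / len(vals)
```

The charts above plot this fraction as the suboptimal ratio, embodiment count, and diversity vary.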

Correlation Between Embodiment Distance and Gradient Conflicts

We represent each robot as a graph (torso, joints, feet) and compute embodiment similarity using fused Gromov-Wasserstein (FGW), then compare similarity with gradient cosine statistics.

Key Finding

  • Embodiment similarity correlates strongly with gradient alignment: morphologically similar robots tend to produce aligned updates, while dissimilar robots conflict more often.
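The correlation reduces to comparing the off-diagonal entries of the two matrices shown below. A minimal numpy sketch, where the FGW distance matrix and gradient-cosine matrix are assumed to be precomputed (matrix names are illustrative):

```python
import numpy as np

def distance_cosine_correlation(embodiment_dist, grad_cos):
    """Pearson correlation between pairwise embodiment distance (e.g. FGW)
    and pairwise actor-gradient cosine similarity, over the upper triangle."""
    iu = np.triu_indices_from(embodiment_dist, k=1)  # each robot pair once
    d = embodiment_dist[iu]
    c = grad_cos[iu]
    return float(np.corrcoef(d, c)[0, 1])
```

A strongly negative value here means larger morphological distance goes with lower gradient alignment, which is the scatter-plot trend reported for both IQL and TD3+BC.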

IQL: embodiment similarity matrix, gradient cosine matrix, and scatter correlation.

IQL embodiment similarity matrix
IQL gradient cosine similarity matrix
IQL embodiment similarity versus gradient cosine scatter plot

TD3+BC: embodiment similarity matrix, gradient cosine matrix, and scatter correlation.

TD3+BC embodiment similarity matrix
TD3+BC gradient cosine similarity matrix
TD3+BC embodiment similarity versus gradient cosine scatter plot

4. Embodiment Grouping (EG)

EG reduces cross-robot interference by updating morphologically compatible robots together. The method computes embodiment distances once, clusters robots, and then performs group-wise actor updates with a shared critic update.

  1. Robot-as-graph: represent each embodiment with a torso-joint-foot graph.
  2. FGW distances: compute pairwise embodiment distances from graph structure and node features.
  3. Hierarchical clustering: cluster robots into M fixed groups before training.
  4. Group-wise actor updates: update the critic globally, then update the actor sequentially by group.

Why it works: conflicting updates are suppressed across dissimilar robots, while positive transfer is preserved within each group.

Why practical: distances are precomputed once, and EG can be added to standard offline RL pipelines with minimal implementation overhead.
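Steps 2 through 4 above can be sketched with scipy's hierarchical clustering; the FGW distance matrix and the critic/actor update callables below are placeholders for the paper's components, not its actual code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_embodiments(fgw_dist, n_groups):
    """Cluster robots into n_groups from a precomputed FGW distance matrix."""
    condensed = squareform(fgw_dist, checks=False)   # square -> condensed form
    tree = linkage(condensed, method='average')      # agglomerative clustering
    labels = fcluster(tree, t=n_groups, criterion='maxclust')
    groups = {}
    for robot, g in enumerate(labels):
        groups.setdefault(int(g), []).append(robot)
    return list(groups.values())

def train_step(batch_by_robot, groups, update_critic, update_actor):
    """One EG step: global critic update, then sequential group-wise actor updates."""
    update_critic(batch_by_robot)                    # shared critic, all robots
    for group in groups:                             # one actor update per group
        update_actor({r: batch_by_robot[r] for r in group})
```

Because `group_embodiments` runs once before training, the per-step overhead is just the loop over groups in `train_step`, which is why EG drops into standard offline RL pipelines easily.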

EG pipeline: robot graph to FGW matrix to clustering to group-wise actor updates

5. Results

EG achieves the best overall mean performance among compared methods. The largest improvements appear in suboptimal-heavy datasets, where conflicts are most severe. On 70% Suboptimal Forward, IQL+EG improves from 36.62 to 51.19 (+39.8%).

| Dataset | BC | TD3+BC | IQL | IQL+SEL | IQL+PCGrad | BC+EG | TD3+BC+EG | IQL+EG (Ours) |
|---|---|---|---|---|---|---|---|---|
| Expert Forward | 63.31 ± 0.10 | 52.14 ± 1.89 | 63.39 ± 0.05 | 63.37 ± 0.07 | 63.37 ± 0.04 | 63.47 ± 0.04 | 59.34 ± 1.19 | 63.52 ± 0.04 |
| Expert Backward | 67.17 ± 0.01 | 47.94 ± 0.48 | 67.10 ± 0.01 | 67.24 ± 0.02 | 67.05 ± 0.02 | 67.24 ± 0.02 | 51.98 ± 1.26 | 67.24 ± 0.01 |
| Expert Replay Forward | 49.71 ± 1.06 | 55.66 ± 0.84 | 54.61 ± 0.12 | 55.01 ± 0.55 | 53.84 ± 0.67 | 51.89 ± 0.65 | 57.04 ± 0.46 | 54.62 ± 0.53 |
| Expert Replay Backward | 42.87 ± 1.32 | 52.31 ± 1.22 | 51.86 ± 1.56 | 55.73 ± 1.06 | 55.94 ± 1.14 | 48.64 ± 2.65 | 55.67 ± 1.30 | 57.58 ± 0.05 |
| 70% Suboptimal Forward | 30.52 ± 3.10 | 35.74 ± 1.51 | 36.62 ± 1.02 | 44.59 ± 2.02 | 39.63 ± 1.95 | 42.99 ± 1.23 | 43.41 ± 1.54 | 51.19 ± 1.06 |
| 70% Suboptimal Backward | 41.42 ± 0.71 | 34.79 ± 1.40 | 38.69 ± 0.89 | 44.45 ± 1.75 | 41.04 ± 1.10 | 46.30 ± 2.39 | 40.88 ± 0.83 | 49.60 ± 2.39 |
| Mean | 49.17 | 46.43 | 52.05 | 55.07 | 53.48 | 53.42 | 51.39 | 57.29 |

BibTeX

@inproceedings{abe2026crossembodimentoffline,
  title     = {Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets},
  author    = {Abe, Haruki and Osa, Takayuki and Mukuta, Yusuke and Harada, Tatsuya},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}