Free from Bellman Completeness:
Trajectory Stitching via Model-based
Return-conditioned Supervised Learning

Tsinghua University, University of Washington

Overview

Background

Off-policy dynamic programming (DP) techniques such as Q-learning are not guaranteed to converge under function approximation: they often diverge when the function class under consideration is not Bellman complete.
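
For reference, in the Q-learning setting with discount factor $\gamma$, a function class $\mathcal{F}$ is called Bellman complete when it is closed under the Bellman optimality backup:
\[
(\mathcal{T} f)(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} f(s', a')\Big],
\qquad \mathcal{T} f \in \mathcal{F} \ \text{ for all } f \in \mathcal{F}.
\]
When this closedness fails, the regression targets produced by fitted Q-iteration can leave the function class at every iteration, so approximation errors may compound instead of contracting.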

[Figure: Bellman completeness]

Return-conditioned supervised learning (RCSL) is an alternative off-policy framework that learns a return-conditioned distribution of actions in each state by directly applying supervised learning to the trajectories in the dataset. It achieves satisfactory performance by simply conditioning the policy on a high desired return.
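
As a minimal sketch of what Markovian RCSL training looks like (illustrative only; the class and function names below are our own, not the paper's implementation):

import torch
import torch.nn as nn

# Minimal sketch of Markovian RCSL (context length 1): a policy network maps
# (state, return-to-go) to an action and is trained by plain supervised
# learning on dataset trajectories -- no Bellman backup is involved.
class RCSLPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, rtg):
        # Condition on the scalar return-to-go by concatenation.
        return self.net(torch.cat([state, rtg], dim=-1))

def rcsl_update(policy, optimizer, states, actions, rtgs):
    # One supervised step: regress the dataset action given (state, return-to-go).
    pred = policy(states, rtgs)
    loss = ((pred - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At evaluation time, the policy is conditioned on a high desired return, which is decremented by the observed reward after every step.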

[Figure: RCSL]

Strength of RCSL: Freedom from Bellman Completeness

RCSL circumvents these challenges of Bellman completeness, converging under significantly more relaxed assumptions, inherited from supervised learning, than DP-based methods require. We prove that there exists a natural environment in which, when a two-layer multilayer perceptron (MLP) is used as the function approximator, the layer width must grow linearly with the size of the state space to satisfy Bellman completeness, whereas a constant layer width suffices for RCSL.

[Figure: Strength of RCSL]

Weakness of RCSL: No Trajectory Stitching

For both of the following special cases of RCSL:

  • Markovian RCSL (context length equal to 1);
  • Decision Transformer (with cross-entropy loss) and any context length,
we rigorously prove that RCSL cannot perform trajectory stitching (a toy illustration follows the list), even if the environment satisfies both properties below:
  1. Deterministic transitions;
  2. Uniform coverage of the dataset.
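
As a toy illustration (our own example, not the formal construction in the paper), consider a deterministic environment whose dataset uniformly covers two kinds of trajectories:
  • Trajectory 1: s0 -a-> s1 with reward 1, then s1 -b-> terminal with reward 0 (return 1);
  • Trajectory 2: s0 -c-> s1 with reward 0, then s1 -d-> terminal with reward 2 (return 2).
The optimal behavior stitches the two: take a at s0 and d at s1, for return 3. But no trajectory starting at s0 attains return 3 in the dataset, so conditioning RCSL on that return is out of distribution. Conditioning on the best in-data return 2 selects c at s0; conditioning on return 1 selects a, but then the return-to-go at s1 is 0, which selects b. Either way RCSL recovers at most return 2, whereas dynamic programming propagates the value of s1 back through the Bellman backup, prefers a at s0, and achieves return 3.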

Can we design off-policy reinforcement learning algorithms that are free from the requirement of Bellman completeness while retaining the trajectory-stitching advantage of DP-based off-policy RL?

Model-based return-conditioned supervised learning (MBRCSL)

MBRCSL leverages learned dynamics models and forward sampling to accomplish trajectory stitching, bringing the benefits of dynamic programming to RCSL without ever having to do dynamic programming backups.
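
At a high level, the data-generation step can be sketched as follows (our own paraphrase with placeholder names, not the authors' code): fit a dynamics model and a behavior policy to the offline data, forward-sample rollouts under the learned model so that segments of different dataset trajectories can recombine, and relabel every step with the Monte-Carlo return-to-go of its rollout.

import torch
import torch.nn as nn

# Hypothetical sketch of the MBRCSL data-generation step. A single deterministic
# MLP stands in for the learned dynamics model; in practice one would more likely
# use a probabilistic ensemble and a stochastic behavior policy.
class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),   # predicts next state and reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1:]     # (next_state, reward)

@torch.no_grad()
def rollout_and_relabel(dynamics, behavior_policy, start_states, horizon):
    # Forward-sample trajectories under the learned model, then relabel each
    # step with the Monte-Carlo return-to-go of its own rollout. Stitched
    # behavior can emerge because a rollout may chain segments that never
    # appeared together in any single dataset trajectory.
    states, actions, rewards = [], [], []
    s = start_states
    for _ in range(horizon):
        a = behavior_policy(s)
        s_next, r = dynamics(s, a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    rewards = torch.stack(rewards)                                     # [T, B, 1]
    rtgs = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # return-to-go
    return torch.stack(states), torch.stack(actions), rtgs

The relabeled rollouts (optionally filtered to keep the highest-return ones) then form the training set of the output RCSL policy, which at test time is conditioned on the best return observed among the rollouts.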

[Figure: MBRCSL pipeline overview]

Experiments

Point Maze

We evaluate MBRCSL on a custom Point Maze environment built on top of D4RL Maze. The offline dataset consists of two kinds of suboptimal trajectories in equal proportion, and the optimal trajectory can only be obtained by stitching the two kinds together. MBRCSL outperforms all baselines.

[Table: Point Maze results]
[Figure: Point Maze illustration]

Simulated Robotics

We also evaluate MBRCSL on three simulated robotic tasks from COG. In each task, a robot arm must achieve a goal that requires two phases of actions. The dataset for each task consists of two kinds of trajectories, each completing exactly one of the two phases at a certain success rate.

[Figure: COG tasks]

MBRCSL outperforms the baselines in all three tasks.

[Table: Simulated robotics results]
[Figures: PickPlace, DoubleDrawerOpen, DoubleDrawerCloseOpen]

BibTeX

@article{zhou2023free,
  author  = {Zhou, Zhaoyi and Zhu, Chuning and Zhou, Runlong and Cui, Qiwen and Gupta, Abhishek and Du, Simon S.},
  title   = {Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning},
  journal = {arXiv preprint},
  year    = {2023},
}