Free from Bellman Completeness:
Trajectory Stitching via Model-based
Return-conditioned Supervised Learning

Tsinghua University, University of Washington

Overview

Background

Off-policy dynamic programming (DP) techniques such as Q-learning are not guaranteed to converge under function approximation: they often diverge when the function class under consideration is not Bellman complete.
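
For reference, in the Q-learning setting with discount factor $\gamma$, a function class $\mathcal{F}$ is called Bellman complete when it is closed under the Bellman optimality backup:
\[
(\mathcal{T} f)(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} f(s', a')\Big],
\qquad \mathcal{T} f \in \mathcal{F} \ \text{ for all } f \in \mathcal{F}.
\]
When this closedness fails, the regression targets produced by fitted Q-iteration can leave the function class at every iteration, so approximation errors may compound instead of contracting.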

[Figure: Bellman completeness]

Return-conditioned supervised learning (RCSL) is an alternative off-policy framework that learns a return-conditioned distribution of actions in each state by directly applying supervised learning to the trajectories in the dataset. It achieves satisfactory performance by simply conditioning the policy on a high desired return.
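
As a minimal sketch of what Markovian RCSL training looks like (illustrative only; the class and function names below are our own, not the paper's implementation):

import torch
import torch.nn as nn

# Minimal sketch of Markovian RCSL (context length 1): a policy network maps
# (state, return-to-go) to an action and is trained by plain supervised
# learning on dataset trajectories -- no Bellman backup is involved.
class RCSLPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, rtg):
        # Condition on the scalar return-to-go by concatenation.
        return self.net(torch.cat([state, rtg], dim=-1))

def rcsl_update(policy, optimizer, states, actions, rtgs):
    # One supervised step: regress the dataset action given (state, return-to-go).
    pred = policy(states, rtgs)
    loss = ((pred - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At evaluation time, the policy is conditioned on a high desired return, which is decremented by the observed reward after every step.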

[Figure: RCSL]

Strength of RCSL: Freedom from Bellman Completeness

RCSL circumvents these challenges of Bellman completeness, converging under significantly more relaxed assumptions, inherited from supervised learning, than DP-based methods require. We prove that there exists a natural environment in which, when a two-layer multilayer perceptron (MLP) is used as the function approximator, the layer width must grow linearly with the size of the state space to satisfy Bellman completeness, whereas a constant layer width suffices for RCSL.

[Figure: Strength of RCSL]

Weakness of RCSL: No Trajectory Stitching

For both of the following special cases of RCSL:

  • Markovian RCSL (context length equal to 1);
  • Decision Transformer (with cross-entropy loss) and any context length,
we rigorously prove that RCSL cannot perform trajectory stitching (a toy illustration follows the list), even if the environment satisfies both properties below:
  1. Deterministic transitions;
  2. Uniform coverage of the dataset.
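
As a toy illustration (our own example, not the formal construction in the paper), consider a deterministic environment whose dataset uniformly covers two kinds of trajectories:
  • Trajectory 1: s0 -a-> s1 with reward 1, then s1 -b-> terminal with reward 0 (return 1);
  • Trajectory 2: s0 -c-> s1 with reward 0, then s1 -d-> terminal with reward 2 (return 2).
The optimal behavior stitches the two: take a at s0 and d at s1, for return 3. But no trajectory starting at s0 attains return 3 in the dataset, so conditioning RCSL on that return is out of distribution. Conditioning on the best in-data return 2 selects c at s0; conditioning on return 1 selects a, but then the return-to-go at s1 is 0, which selects b. Either way RCSL recovers at most return 2, whereas dynamic programming propagates the value of s1 back through the Bellman backup, prefers a at s0, and achieves return 3.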

Can we design off-policy reinforcement learning algorithms that are free from the requirement of Bellman completeness while retaining the trajectory-stitching advantage of DP-based off-policy RL?

Model-based return-conditioned supervised learning (MBRCSL)

MBRCSL leverages learned dynamics models and forward sampling to accomplish trajectory stitching, bringing the benefits of dynamic programming to RCSL without ever having to do dynamic programming backups.
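
At a high level, the data-generation step can be sketched as follows (our own paraphrase with placeholder names, not the authors' code): fit a dynamics model and a behavior policy to the offline data, forward-sample rollouts under the learned model so that segments of different dataset trajectories can recombine, and relabel every step with the Monte-Carlo return-to-go of its rollout.

import torch
import torch.nn as nn

# Hypothetical sketch of the MBRCSL data-generation step. A single deterministic
# MLP stands in for the learned dynamics model; in practice one would more likely
# use a probabilistic ensemble and a stochastic behavior policy.
class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),   # predicts next state and reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1:]     # (next_state, reward)

@torch.no_grad()
def rollout_and_relabel(dynamics, behavior_policy, start_states, horizon):
    # Forward-sample trajectories under the learned model, then relabel each
    # step with the Monte-Carlo return-to-go of its own rollout. Stitched
    # behavior can emerge because a rollout may chain segments that never
    # appeared together in any single dataset trajectory.
    states, actions, rewards = [], [], []
    s = start_states
    for _ in range(horizon):
        a = behavior_policy(s)
        s_next, r = dynamics(s, a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    rewards = torch.stack(rewards)                                     # [T, B, 1]
    rtgs = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # return-to-go
    return torch.stack(states), torch.stack(actions), rtgs

The relabeled rollouts (optionally filtered to keep the highest-return ones) then form the training set of the output RCSL policy, which at test time is conditioned on the best return observed among the rollouts.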

[Figure: MBRCSL pipeline overview]

Experiments

Point Maze

We evaluate MBRCSL on a custom Point Maze environment built on top of D4RL Maze. The offline dataset consists of two kinds of suboptimal trajectories in equal proportion, and the optimal trajectory can only be obtained by stitching the two kinds together. MBRCSL outperforms all baselines.

[Table: Point Maze results]
[Figure: Point Maze illustration]

Simulated Robotics

We also evaluate MBRCSL on three simulated robotic tasks from COG. In each task, a robot arm must achieve a goal that requires two phases of actions. The dataset for each task consists of two kinds of trajectories, each completing exactly one of the two phases at a certain success rate.

[Figure: COG tasks]

MBRCSL outperforms the baselines in all three tasks.

[Table: Simulated robotics results]
[Figures: PickPlace, DoubleDrawerOpen, DoubleDrawerCloseOpen]

BibTeX

@article{zhou2023free,
  author  = {Zhou, Zhaoyi and Zhu, Chuning and Zhou, Runlong and Cui, Qiwen and Gupta, Abhishek and Du, Simon S.},
  title   = {Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning},
  journal = {arXiv preprint},
  year    = {2023},
}