Off-policy dynamic programming (DP) techniques such as Q-learning are not guaranteed to converge in the presence of function approximation, often diverging due to the absence of Bellman-completeness in the function classes considered.

Return-conditioned supervised learning (RCSL) is an alternative off-policy framework, which learns a return-conditioned distribution of actions in each state by directly applying supervised learning on the trajectories in the dataset, achieving satisfactory performance by simply conditioning the policy on high desired returns.
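The core idea can be sketched in a few lines: compute returns-to-go over dataset trajectories, then fit a policy on (state, return-to-go) pairs with an ordinary supervised loss. The toy data, linear policy, and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums g_t = sum_{t' >= t} r_{t'}."""
    return np.cumsum(rewards[::-1])[::-1]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy offline trajectory (hypothetical): 3 one-hot states, 2 actions,
# visiting states 0 -> 1 with actions [1, 1] and rewards [0, 1].
n_states, n_actions = 3, 2
states = np.eye(n_states)[[0, 1]]
actions = np.array([1, 1])
rtg = returns_to_go(np.array([0.0, 1.0]))      # returns-to-go: [1., 1.]

# RCSL policy pi(a | s, g): here just linear logits on [state one-hot, g].
X = np.hstack([states, rtg[:, None]])           # shape (T, n_states + 1)
W = np.zeros((n_states + 1, n_actions))

for _ in range(200):                            # plain cross-entropy gradient descent
    p = softmax(X @ W)
    grad = X.T @ (p - np.eye(n_actions)[actions]) / len(actions)
    W -= 1.0 * grad

# At test time, condition on a high desired return to select actions.
p_test = softmax(np.hstack([np.eye(n_states)[0], [1.0]]) @ W)
```

Conditioning on the high return reproduces the dataset's action in that state (`p_test` concentrates on action 1); no value function or Bellman backup is involved.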

RCSL circumvents these challenges of Bellman completeness: it converges under significantly more relaxed assumptions, inherited from supervised learning, than DP-based methods require.
We prove there exists a natural environment in which,
if one uses a two-layer multilayer perceptron as the function approximator,
the layer width needs to grow *linearly* with the state space size to satisfy
Bellman completeness, while a *constant* layer width suffices for RCSL.

RCSL converges in either of the following special cases:

- Markovian RCSL (context length equal to 1);
- Decision transformer (trained with cross-entropy loss) with any context length,

provided the environment and dataset satisfy:

- deterministic transitions;
- uniform coverage of the dataset.

Can we design off-policy reinforcement learning algorithms that are
free from the requirements of Bellman completeness, while retaining the advantages of DP-based
off-policy RL?

MBRCSL leverages learned dynamics models and forward sampling to accomplish trajectory stitching, bringing the benefits of dynamic programming to RCSL without ever having to do dynamic programming backups.
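The stitching mechanism can be illustrated on a toy tabular chain. Everything below (the chain MDP, the count-based model, the behavior policy) is an illustrative assumption, not the paper's implementation: a dynamics model is fit to the offline transitions, and forward sampling through it produces trajectories that combine pieces from different dataset trajectories.

```python
from collections import defaultdict

def fit_tabular_model(transitions):
    """Count-based dynamics model: most frequent next state per (s, a)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    return {sa: max(d, key=d.get) for sa, d in counts.items()}

def rollout(model, policy, s0, horizon):
    """Forward-sample a trajectory through the learned model."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        if (s, a) not in model:   # no data for this pair; stop the rollout
            break
        s2 = model[(s, a)]
        traj.append((s, a, s2))
        s = s2
    return traj

# Offline data from two suboptimal trajectory kinds on a chain MDP:
# one covers 0 -> 1 -> 2, the other covers 2 -> 3 -> 4.
data = [(0, 1, 1), (1, 1, 2), (2, 1, 3), (3, 1, 4)]
model = fit_tabular_model(data)

# Forward sampling stitches the halves into a full 0 -> 4 trajectory
# that never appears in the dataset.
stitched = rollout(model, policy=lambda s: 1, s0=0, horizon=4)
```

In the full pipeline, such sampled rollouts are relabeled with their achieved returns and the high-return ones become training data for the RCSL policy, so the stitching benefit of dynamic programming is obtained without any Bellman backup.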

We evaluate MBRCSL on a custom Point Maze environment built on top of the D4RL Maze. We construct an offline dataset consisting of two kinds of suboptimal trajectories in equal proportion, such that the optimal trajectory can only be obtained by stitching the two kinds together. MBRCSL outperforms all baselines.

We also evaluate MBRCSL on three simulated robotic tasks from COG. In each task, a robot arm must complete a goal by accomplishing two phases of actions. The dataset for each task consists of two kinds of trajectories, each kind completing exactly one phase at a certain success rate.

MBRCSL outperforms the baselines in all three tasks.

```
@article{zhou2023free,
author = {Zhou, Zhaoyi and Zhu, Chuning and Zhou, Runlong and Cui, Qiwen and Gupta, Abhishek and Du, Simon S.},
title = {Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning},
journal = {ArXiv Preprint},
year = {2023},
}
```