Decision Transformer: A Friendly Dive into RL via Sequence Modeling

What if Reinforcement Learning (RL) looked more like next-word prediction in a big language model?
That’s exactly the question that Decision Transformer by Chen et al. (2021) sets out to explore. In this paper, the authors recast RL as sequence modeling, using the power of the Transformer architecture to generate actions simply by predicting the next token in a sequence.

1. Background: Why Another RL Approach?

Everyone and their cousin in RL land usually starts with Q-learning, policy gradients, or value functions. But training these can get messy: unstable updates, delicate hyperparameters, bootstrapped targets that drift. Decision Transformer sidesteps most of that machinery with one core idea:

“By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return.”

I’m highlighting this because it’s a big departure from how RL folks typically do it. Normally, we keep hacking away at a value function to ensure we pick good actions. Decision Transformer says, “Eh, let’s just treat it like a sequence generation problem.” The same way a language model predicts the next word, we predict the next action (with a side of desired return), and voilà, we have a policy!

A Quick Example

Imagine your phone's autocomplete finishing a text to a friend (the way GPT finishes a sentence): each word is predicted by looking at the context so far. Similarly, actions can be seen as “tokens” in a sequence, so the Transformer just has to guess the next action-token that makes sense given the states it has seen and the return we want.


2. What’s So New Here?

No more bootstrapping or Bellman backup:

“This allows us to bypass the need for bootstrapping for long term credit assignment—thereby avoiding one of the ‘deadly triad’ known to destabilize RL.”

The “deadly triad” is that terrifying trifecta of:

  1. Function approximation
  2. Bootstrapping
  3. Off-policy data

In classic RL, these can cause wild training instabilities. By just predicting sequences (instead of computing target values and backups iteratively), Decision Transformer sidesteps a chunk of these issues.


3. The Magic Formula: Sequence Modeling

Here’s the gist: you feed the Transformer each trajectory laid out as a sequence of (return-to-go, state, action) triples:

(R̂_1, s_1, a_1, R̂_2, s_2, a_2, …, R̂_T, s_T, a_T)

where R̂_t = r_t + r_{t+1} + … + r_T is the return-to-go at time t, i.e. the sum of the rewards from t to the end of the episode. So the tokens are basically:

  1. Return-to-go (how much total reward you still want to get)
  2. State (where you are in the environment)
  3. Action (what you do)

Then the model is trained to predict the next action token. Simple as that.
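
To make that concrete, here's a minimal PyTorch sketch of the idea. It is not the authors' implementation: the class name, layer sizes, and batch layout are all invented for illustration; only the token interleaving and the plain supervised loss reflect the recipe described above.

```python
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    """Illustrative sketch: interleave (return-to-go, state, action) tokens
    and let a causal Transformer predict actions. Not the authors' code."""

    def __init__(self, state_dim, act_dim, d_model=128, n_head=4, n_layer=3, max_len=1000):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)            # return-to-go token
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_timestep = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layer)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim), timesteps: (B, T)
        B, T = states.shape[:2]
        t_emb = self.embed_timestep(timesteps)
        # Interleave per timestep: R̂_t, s_t, a_t  ->  sequence length 3T
        tokens = torch.stack(
            [self.embed_rtg(rtg) + t_emb,
             self.embed_state(states) + t_emb,
             self.embed_action(actions) + t_emb], dim=2
        ).reshape(B, 3 * T, -1)
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.transformer(tokens, mask=mask)
        # Read the action prediction off each *state* token position, which
        # (thanks to the causal mask) sees R̂_t and s_t but not a_t itself.
        return self.predict_action(h[:, 1::3])

def train_step(model, batch, optimizer):
    """Plain supervised learning: MSE on continuous actions, no Bellman targets anywhere."""
    pred = model(batch["rtg"], batch["states"], batch["actions"], batch["timesteps"])
    loss = ((pred - batch["actions"]) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```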

How do we use it at test-time?

  • We set an initial desired return-to-go, R̂_1. For instance, if we want a “near expert” score, we pick some high number.
  • Then we feed in (R̂_1, s_1), the desired return and the first state, and let the model generate the action a_1.
  • Step in the environment, get a reward r_1.
  • Subtract r_1 from the old desired return to get the new return-to-go, R̂_2 = R̂_1 - r_1, and append the next state.
  • Keep going.

No fancy Q-value iteration. No complicated critics. Just “predict the next action if we want X total reward.”
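
Here is what that evaluation loop could look like as code, assuming a Gymnasium-style `env` with a continuous action space and a model shaped like the training sketch above. This is a sketch of the idea, not the paper's evaluation script.

```python
import torch

def rollout(model, env, target_return, max_steps=1000):
    """Return-conditioned rollout sketch (Gymnasium-style API, continuous actions assumed)."""
    state, _ = env.reset()
    states = [torch.as_tensor(state, dtype=torch.float32)]
    actions = [torch.zeros(env.action_space.shape[0])]   # placeholder for the action to predict
    rtgs = [float(target_return)]                        # desired return-to-go
    total_reward = 0.0

    for _ in range(max_steps):
        rtg = torch.tensor(rtgs, dtype=torch.float32).reshape(1, -1, 1)
        s = torch.stack(states).unsqueeze(0)              # (1, T, state_dim)
        a = torch.stack(actions).unsqueeze(0)             # (1, T, act_dim)
        ts = torch.arange(len(states)).unsqueeze(0)       # (1, T)
        with torch.no_grad():
            action = model(rtg, s, a, ts)[0, -1]          # newest predicted action

        state, reward, terminated, truncated, _ = env.step(action.numpy())
        total_reward += float(reward)

        actions[-1] = action                              # fill in the placeholder
        states.append(torch.as_tensor(state, dtype=torch.float32))
        actions.append(torch.zeros_like(action))          # placeholder for the next action
        rtgs.append(rtgs[-1] - float(reward))             # decrement the desired return
        # NOTE: a real implementation would also truncate the history to the
        # model's context length K before the next forward pass.
        if terminated or truncated:
            break
    return total_reward
```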

Example: Shortest Path in a Graph

The paper opens with a fun puzzle: a fixed graph, a dataset of nothing but random walks, and a reward of -1 per step until the goal is reached. Prompt the model with a high enough return at test time and it ends up generating the shortest path to the goal. This is a perfect mic-drop illustration of how just modeling the data plus a desired-return token recovers a better policy than the random data would suggest.
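
As a rough, invented sketch of what such a dataset could look like: random walks on a graph with a reward of -1 per step until the goal, so the return-to-go at every node is just minus the number of steps still to go. The function and its arguments are hypothetical, purely to make the setup concrete.

```python
import random

def random_walk_dataset(adj, goal, n_walks=1000, max_len=30):
    """adj maps each node to its neighbors; reward is -1 per step, 0 once at the goal."""
    dataset = []
    for _ in range(n_walks):
        node = random.choice(list(adj))
        path = [node]
        while node != goal and len(path) < max_len:
            node = random.choice(adj[node])
            path.append(node)
        rewards = [-1] * (len(path) - 1)                # one reward per transition
        # Return-to-go at step t = sum of remaining rewards = -(steps still to go)
        rtgs = [sum(rewards[t:]) for t in range(len(path))]
        dataset.append(list(zip(rtgs, path)))
    return dataset

# Conditioning generation on the best return the data can support (say -2 rather
# than the -9 a typical random walk achieves) is what asks for the shortest route.
```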


4. Cool Results

Let’s talk about how it performs in offline RL tasks:

  1. Atari:
    They test on just 1% of the DQN-replay dataset, which is tiny. Despite that, Decision Transformer does about as well as, or better than, specialized offline RL methods like CQL.

  2. D4RL Locomotion (HalfCheetah, Hopper, Walker2d):
    This benchmark suite has become a standard yardstick for offline RL. Decision Transformer outperforms or ties with CQL, BEAR, BRAC, and behavior cloning.
    The kicker? It doesn’t rely on any explicit “conservative” trick.

“Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.”

Translation: They basically marched in, said, “We got rid of Q-learning, so let’s try this GPT-ish approach,” and it worked shockingly well.


5. Long-Horizon Credit Assignment & Sparsity

“Credit assignment” means figuring out which of your actions actually mattered when the reward only shows up dozens of steps later (say, picking up a key in room 1 is what makes room 3 winnable). Many RL algorithms either drown in the noise or need hand-crafted reward shaping to cope. The authors frame their bet like this:

“With this work, we aim to bridge sequence modeling and transformers with RL, and hope that sequence modeling serves as a strong algorithmic paradigm for RL.”

In the Key-to-Door experiment, the training data comes from random actions, so successful episodes are rare and the behavior is far from optimal. Yet the Transformer still learns that grabbing the key early is what matters, because it can attend directly to the relevant tokens in the sequence. A typical Q-learner, meanwhile, can be left clueless because it sees so few positive signals.

And how about delayed returns? The authors tested a version of MuJoCo tasks where all the reward is given at the very end. Q-learning basically shrugs and fails. Decision Transformer? Doesn’t mind—it just sees sequences with a final big reward token and learns to replicate the sequence of states and actions that led to that big payoff.
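
That delayed-reward setup is easy to picture in code; the helper below is just an illustration of the idea (the paper's exact preprocessing may differ):

```python
def delay_rewards(rewards):
    """Move a trajectory's entire reward to its final timestep.

    Every intermediate step sees zero reward; the last step carries the episode's
    total return, so there is no local signal for step-by-step credit assignment.
    Note that the return-to-go is then the same number at every timestep.
    """
    delayed = [0.0] * len(rewards)
    delayed[-1] = float(sum(rewards))
    return delayed

# Example: delay_rewards([1.0, 2.0, 0.5]) -> [0.0, 0.0, 3.5]
```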


6. Key Takeaways

  • No need for Pessimism or Specialized Regularizers
    Decision Transformer is not just naive imitation: it generally does better than simply cloning the top trajectories (“percentile BC”), especially when data is scarce; see the short sketch after this list.
  • User-Friendly Prompting
    You want a certain level of performance or behavior? Condition on that return token. If it’s feasible under the dataset distribution, the Transformer yields it.
  • Better with More Data
    As in language modeling, we could presumably keep scaling the dataset and architecture size. Larger Transformers might get even better at RL tasks—mirroring the GPT story in NLP.
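
For reference, here's roughly what the percentile-BC baseline mentioned above boils down to; the helper and its NumPy filtering are my own sketch, not the paper's code:

```python
import numpy as np

def percentile_bc_filter(trajectories, returns, pct=10):
    """Keep only the top `pct` percent of trajectories by return, then behavior-clone on them.

    That's the essence of the "percentile BC" baseline; Decision Transformer instead
    trains on all of the data and steers behavior with the return-to-go token.
    """
    cutoff = np.percentile(returns, 100 - pct)
    return [traj for traj, ret in zip(trajectories, returns) if ret >= cutoff]
```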

Where Next?

We could dream of:

  • Fine-Tuning with Online Data: Start with a pre-trained Decision Transformer on a giant offline dataset, then do a bit of online exploration.
  • Transformer as a Model of Trajectories: Instead of just predicting actions, some also explore predicting next states or next returns—like a powerful “model-based” approach.

In short, this paper suggests a huge step toward bridging RL and the “attention is all you need” success we’ve seen in language and vision. It’s a fresh perspective if you’re tired of the usual Bellman backups.


Final Thoughts

If you’ve ever admired the elegance of GPT generating text, imagine that same flow in RL: we set a target return, the model “writes the next action token,” and so on. That’s Decision Transformer in a nutshell—a deceptively simple idea with big implications for how we might do RL in the future.


Disclaimer: The quotes and references above are adapted from “Decision Transformer: Reinforcement Learning via Sequence Modeling”.