RT-1 Unveiled: A Simple and Generalizable Transformer for Scalable Robotics Control

How do you build a scalable model that learns diverse capabilities from robotic data?
Teaching robots has always been a task-by-task job. Whether using reinforcement learning or imitation learning, each task required collecting specific data, labeling it, and training a model from scratch—just like how early language and vision models worked.

But then, foundation models changed everything. These large, pre-trained models learned general patterns from massive datasets, making them more flexible and powerful. Seeing this success in language and vision, the authors of this paper wondered: can we do the same for robots? Could a single model learn general skills and apply them to new tasks and environments?

That question led to the Robotics Transformer (RT-1): a real-world, general-purpose robotic model designed to learn broadly rather than task by task.


The Key Contributions of RT-1

Building a general-purpose robot learning system isn’t a walk in the park. It needs tons of diverse data so it can generalize zero-shot, adapting to new tasks instead of just memorizing the ones it was trained on. But there are two big hurdles:

  1. Collecting a massive, diverse dataset
  2. Designing a model that can actually deal with such data

So, how did the authors tackle this?

  • Built a large dataset – Gathered ~130K demonstration episodes using a fleet of 13 robots over 17 months, covering more than 700 tasks. That’s a significant amount of robot practice!

  • Introduced RT-1, a transformer-based model – Since transformers are great at handling huge amounts of data, they designed RT-1 to take in images and language instructions and output discretized actions, keeping the model simple and efficient enough for real-time control.

  • Put RT-1 to the test – They challenged it with unseen tasks, distractions, and background changes, along with long-horizon sequences. The result? It adapted well, proving it’s not just memorizing but actually learning.


The Fundamental Idea Behind RT-1

The core idea behind RT-1 is pretty straightforward: given a language instruction $i$ and an image observation $x_t$, the robot’s learned policy $\pi$ selects an action $a_t$ from its action distribution $\pi(\cdot \mid i, x_t)$ and applies it.

The goal? To learn a policy $\pi$ such that it maximizes the robot’s average reward:

$$\max_{\pi}\; \mathbb{E}_{\tau \sim p_{\pi}}\left[R(\tau)\right]$$

where $R(\tau) = \sum_{t} r_t$ is the total reward accumulated over a trajectory $\tau$.

RT-1 employs sequence modeling with transformers, mapping the language instruction and the history of image observations to a distribution over actions:

$$\pi\!\left(a_t \mid i, \{x_j\}_{j=0}^{t}\right)$$

Instead of relying on reinforcement learning (RL), RT-1 is trained using behavioral cloning. This means the model learns by minimizing the negative log-likelihood (NLL) of the demonstrated actions:

$$\mathcal{L} = -\sum_{t} \log \pi\!\left(a_t \mid i, \{x_j\}_{j=0}^{t}\right)$$

Essentially, RT-1 gets better at predicting the correct actions based on given images and language instructions, rather than exploring through trial and error like traditional RL methods.
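
To make the objective concrete, here is a minimal sketch of the behavioral-cloning loss over discretized action tokens. The names (`policy`, `instruction_emb`) and shapes are illustrative assumptions based on the paper’s description (11 action dimensions, 256 bins each), not the authors’ actual code.

```python
import torch
import torch.nn.functional as F

def bc_loss(policy, images, instruction_emb, expert_actions):
    """Behavioral-cloning (NLL) loss over discretized action tokens.

    images:          (B, T, 3, H, W) history of camera frames
    instruction_emb: (B, D)          embedded language instruction
    expert_actions:  (B, 11) long    demonstrated action, one bin
                                     index in [0, 255] per dimension
    """
    # Assumed policy interface: one categorical distribution per
    # action dimension, i.e. logits of shape (B, 11, 256).
    logits = policy(images, instruction_emb)
    # Cross-entropy is exactly the negative log-likelihood of the
    # demonstrated action bins.
    return F.cross_entropy(logits.reshape(-1, 256),
                           expert_actions.reshape(-1))
```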


How RT-1 Works Under the Hood

RT-1 processes a task description together with a short history of camera images and, at each timestep, outputs an action discretized into 11 dimensions; these discrete action tokens are then mapped back to continuous commands and applied to the robot.

The Model Follows Three Main Steps:

Processing Image and Language Inputs

  • Images are processed using an ImageNet-pretrained EfficientNet backbone to extract visual features.
  • Language instructions are passed through the Universal Sentence Encoder to create language embeddings.
  • These language embeddings condition the visual features through FiLM (Feature-wise Linear Modulation) layers, ensuring the image features are interpreted in the context of the given task (see the sketch after this list).
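
Here is a minimal sketch of a FiLM layer, assuming a per-channel scale and shift predicted from the language embedding. The zero initialization (so the layer starts as an identity) follows the paper’s description, while the class and argument names are my own.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift visual feature
    maps using parameters predicted from the language embedding."""

    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, num_channels)  # per-channel scale
        self.to_beta = nn.Linear(lang_dim, num_channels)   # per-channel shift
        # Zero-init so the layer starts as an identity and does not
        # disturb the pretrained vision backbone.
        for layer in (self.to_gamma, self.to_beta):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), lang: (B, lang_dim)
        gamma = self.to_gamma(lang)[:, :, None, None]
        beta = self.to_beta(lang)[:, :, None, None]
        return (1.0 + gamma) * feat + beta
```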

Token Compression

  • The conditioned visual tokens are passed through a TokenLearner module.
  • This step reduces the number of tokens per image (from 81 down to 8 in RT-1) while retaining the most important information, speeding up model inference (see the sketch after this list).
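
Conceptually, TokenLearner computes a small set of attention maps over the input tokens and pools with them. The sketch below captures that idea under my own simplified interface; the original module uses a somewhat more elaborate attention network.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Reduce N input tokens to M learned tokens via soft attention:
    each output token is a weighted average of the inputs, with the
    weights predicted by a small learned network."""

    def __init__(self, dim: int, num_output_tokens: int = 8):
        super().__init__()
        self.attn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_output_tokens),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) -> attention weights: (B, N, M)
        weights = self.attn(tokens).softmax(dim=1)
        # Weighted pooling over the N input tokens -> (B, M, dim)
        return torch.einsum("bnm,bnd->bmd", weights, tokens)
```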

Transformers for Action Prediction

  • The compressed tokens are fed into a decoder-only transformer, which predicts discrete action tokens for the robot’s 11-dimensional next move (a sketch follows).
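
The sketch below shows the general shape of such an action head: self-attention over the token history, then logits for 11 action dimensions with 256 bins each. The layer sizes, the mean-pooling, and the absence of a causal mask are my simplifications, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    """Attend over the compressed history of vision-language tokens and
    emit logits for 11 discretized action dimensions (256 bins each)."""

    def __init__(self, dim: int = 512, depth: int = 8, bins: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.action_head = nn.Linear(dim, 11 * bins)
        self.bins = bins

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T * 8, dim) -- 8 TokenLearner tokens per frame,
        # concatenated across the history window of T frames.
        h = self.encoder(tokens)
        # Mean-pool (a simplification) and predict one categorical
        # distribution per action dimension.
        logits = self.action_head(h.mean(dim=1))   # (B, 11 * bins)
        return logits.view(-1, 11, self.bins)      # (B, 11, bins)
```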

What’s in the 11-D Action Space?

The action tokens define how the robot moves (a de-tokenization sketch follows the list):

  • 7 dimensions → Robot arm movement: x, y, z, roll, pitch, yaw, and gripper opening
  • 3 dimensions → Robot base movement: x, y, and yaw
  • 1 mode selection variable → Choosing between controlling the arm, controlling the base, or ending the episode
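
Turning the predicted bin indices back into commands is simple uniform de-quantization. The sketch below assumes placeholder per-dimension bounds; the real limits come from the robot’s specifications.

```python
import numpy as np

# Placeholder per-dimension (lo, hi) bounds; not the real robot limits.
ARM_BOUNDS = [(-0.1, 0.1)] * 3 + [(-0.5, 0.5)] * 3 + [(0.0, 1.0)]
BASE_BOUNDS = [(-0.2, 0.2)] * 2 + [(-0.5, 0.5)]

def detokenize(action_bins: np.ndarray, bins: int = 256) -> dict:
    """Map 11 discrete bin indices back to continuous commands by
    uniform de-quantization over each dimension's range."""
    bounds = ARM_BOUNDS + BASE_BOUNDS          # 10 continuous dims
    continuous = [
        lo + (idx / (bins - 1)) * (hi - lo)
        for idx, (lo, hi) in zip(action_bins[:10], bounds)
    ]
    mode = int(action_bins[10])  # e.g. 0 = arm, 1 = base, 2 = terminate
    return {"arm": continuous[:7], "base": continuous[7:10], "mode": mode}
```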

Experimental Discussion

RT-1’s effectiveness shines through in over 3000 real-world trials conducted across three robotic environments at Google (a training environment and two real office kitchens), benchmarked against prior models such as Gato and BC-Z.

Performance Breakdown

  • Seen tasks → Over 95% success rate
  • Unseen tasks → About 76% success rate
  • Changing backgrounds → Around 60% success rate
  • Handling distractors → 83% success rate
  • Long-horizon tasks → 87% planning success and 67% execution success

Generalization & Adaptability

  • RT-1 not only excels at familiar tasks but also generalizes well to:

    • New tasks
    • Different robots
    • Varied environments, even in realistic, everyday scenarios
  • It is versatile enough to be trained on data from multiple robot types and can even mix real-world data with simulated examples.

Key Insight: Data Diversity Over Volume

  • Ablations on dataset size and composition revealed a crucial finding:
    • Data diversity is more important than simply having a large dataset: shrinking the variety of tasks in the training data hurt generalization far more than shrinking the number of examples per task.

Conclusion

RT-1, a transformer-based robot learning model trained on 130K real-world episodes, delivers on the goal of building a generalizable system capable of:

  • Performing multiple tasks
  • Adapting to new, unseen challenges
  • Handling changing environments, distractions, and background variations
  • Executing long-horizon tasks of up to 50 steps
  • Learning from a mix of real-world and simulated data, making it highly adaptable

The Future of Robotics

Most importantly, RT-1 demonstrates that a single transformer policy can absorb large, diverse robot datasets and generalize from them, paving the way for more flexible and scalable robot learning in the future!


Disclaimer: The quotes and references above are adapted from “RT-1: Robotics Transformer for Real-World Control at Scale”.