RT-1 Unveiled: A Simple and Generalizable Transformer for Scalable Robotics Control

How do you build a scalable model that learns diverse capabilities from robotic data?
Teaching robots has always been a task-by-task job. Whether using reinforcement learning or imitation learning, each task required collecting specific data, labeling it, and training a model from scratch—just like how early language and vision models worked.

But then, foundation models changed everything. These large, pre-trained models learned general patterns from massive datasets, making them more flexible and powerful. Seeing this success in language and vision, the authors of this paper wondered: can we do the same for robots? Could a single model learn general skills and apply them to new tasks and environments?

That question led to the Robotics Transformer (RT-1): a real-world, general-purpose robotic model designed to learn broadly rather than task by task.


The Key Contributions of RT-1

Building a general-purpose robot learning system isn’t a walk in the park. It needs tons of diverse data so it can generalize zero-shot, adapting to new tasks instead of just memorizing the ones it was trained on. But there are two big hurdles:

  1. Collecting a massive, diverse dataset
  2. Designing a model that can actually deal with such data

So, how did the authors tackle this?

  • Built a large dataset – Gathered ~130K demonstration episodes using a fleet of 13 robots over 17 months, covering more than 700 tasks. That’s a significant amount of robot practice!

  • Introduced RT-1, a transformer-based model – Since transformers are great at handling huge amounts of data, they designed RT-1 to take in images and language instructions and output discretized actions, keeping the model simple and efficient enough for real-time control.

  • Put RT-1 to the test – They challenged it with unseen tasks, distractions, and background changes, along with long-horizon sequences. The result? It adapted well, proving it’s not just memorizing but actually learning.


The Fundamental Idea Behind RT-1

The core idea behind RT-1 is pretty straightforward: given a language instruction $i$ and an image observation $x_t$, the robot’s learned policy $\pi$ selects an action $a_t$ from its action distribution $\pi(\cdot \mid i, x_t)$ and applies it.

The goal? To learn a policy $\pi$ such that it maximizes the robot’s average reward:

$$\max_{\pi}\; \mathbb{E}_{\tau \sim p_{\pi}}\left[R(\tau)\right]$$

where $R(\tau) = \sum_{t} r_t$ is the total reward accumulated over a trajectory $\tau$.

RT-1 employs sequence modeling with transformers, mapping the language instruction and the history of image observations to a distribution over actions:

$$\pi\!\left(a_t \mid i, \{x_j\}_{j=0}^{t}\right)$$

Instead of relying on reinforcement learning (RL), RT-1 is trained using behavioral cloning. This means the model learns by minimizing the negative log-likelihood (NLL) of the demonstrated actions:

$$\mathcal{L} = -\sum_{t} \log \pi\!\left(a_t \mid i, \{x_j\}_{j=0}^{t}\right)$$

Essentially, RT-1 gets better at predicting the correct actions based on given images and language instructions, rather than exploring through trial and error like traditional RL methods.
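
To make the objective concrete, here is a minimal sketch of the behavioral-cloning loss over discretized action tokens. The names (`policy`, `instruction_emb`) and shapes are illustrative assumptions based on the paper’s description (11 action dimensions, 256 bins each), not the authors’ actual code.

```python
import torch
import torch.nn.functional as F

def bc_loss(policy, images, instruction_emb, expert_actions):
    """Behavioral-cloning (NLL) loss over discretized action tokens.

    images:          (B, T, 3, H, W) history of camera frames
    instruction_emb: (B, D)          embedded language instruction
    expert_actions:  (B, 11) long    demonstrated action, one bin
                                     index in [0, 255] per dimension
    """
    # Assumed policy interface: one categorical distribution per
    # action dimension, i.e. logits of shape (B, 11, 256).
    logits = policy(images, instruction_emb)
    # Cross-entropy is exactly the negative log-likelihood of the
    # demonstrated action bins.
    return F.cross_entropy(logits.reshape(-1, 256),
                           expert_actions.reshape(-1))
```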


How RT-1 Works Under the Hood

RT-1 processes a task description together with a short history of camera images and, at each timestep, outputs an action discretized into 11 dimensions; these discrete action tokens are then mapped back to continuous commands and applied to the robot.

The Model Follows Three Main Steps:

Processing Image and Language Inputs

  • Images are processed using an ImageNet-pretrained EfficientNet backbone to extract visual features.
  • Language instructions are passed through the Universal Sentence Encoder to create language embeddings.
  • These language embeddings condition the visual features through FiLM (Feature-wise Linear Modulation) layers, ensuring the image features are interpreted in the context of the given task (see the sketch after this list).
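
Here is a minimal sketch of a FiLM layer, assuming a per-channel scale and shift predicted from the language embedding. The zero initialization (so the layer starts as an identity) follows the paper’s description, while the class and argument names are my own.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift visual feature
    maps using parameters predicted from the language embedding."""

    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, num_channels)  # per-channel scale
        self.to_beta = nn.Linear(lang_dim, num_channels)   # per-channel shift
        # Zero-init so the layer starts as an identity and does not
        # disturb the pretrained vision backbone.
        for layer in (self.to_gamma, self.to_beta):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), lang: (B, lang_dim)
        gamma = self.to_gamma(lang)[:, :, None, None]
        beta = self.to_beta(lang)[:, :, None, None]
        return (1.0 + gamma) * feat + beta
```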

Token Compression

  • The conditioned visual tokens are passed through a TokenLearner module.
  • This step reduces the number of tokens per image (from 81 down to 8 in RT-1) while retaining the most important information, speeding up model inference (see the sketch after this list).
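
Conceptually, TokenLearner computes a small set of attention maps over the input tokens and pools with them. The sketch below captures that idea under my own simplified interface; the original module uses a somewhat more elaborate attention network.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Reduce N input tokens to M learned tokens via soft attention:
    each output token is a weighted average of the inputs, with the
    weights predicted by a small learned network."""

    def __init__(self, dim: int, num_output_tokens: int = 8):
        super().__init__()
        self.attn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_output_tokens),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) -> attention weights: (B, N, M)
        weights = self.attn(tokens).softmax(dim=1)
        # Weighted pooling over the N input tokens -> (B, M, dim)
        return torch.einsum("bnm,bnd->bmd", weights, tokens)
```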

Transformers for Action Prediction

  • The compressed tokens are fed into a decoder-only transformer, which predicts discrete action tokens for the robot’s 11-dimensional next move (a sketch follows).
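
The sketch below shows the general shape of such an action head: self-attention over the token history, then logits for 11 action dimensions with 256 bins each. The layer sizes, the mean-pooling, and the absence of a causal mask are my simplifications, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    """Attend over the compressed history of vision-language tokens and
    emit logits for 11 discretized action dimensions (256 bins each)."""

    def __init__(self, dim: int = 512, depth: int = 8, bins: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.action_head = nn.Linear(dim, 11 * bins)
        self.bins = bins

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T * 8, dim) -- 8 TokenLearner tokens per frame,
        # concatenated across the history window of T frames.
        h = self.encoder(tokens)
        # Mean-pool (a simplification) and predict one categorical
        # distribution per action dimension.
        logits = self.action_head(h.mean(dim=1))   # (B, 11 * bins)
        return logits.view(-1, 11, self.bins)      # (B, 11, bins)
```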

What’s in the 11-D Action Space?

The action tokens define how the robot moves (a de-tokenization sketch follows the list):

  • 7 dimensions → Robot arm movement: x, y, z, roll, pitch, yaw, and gripper opening
  • 3 dimensions → Robot base movement: x, y, and yaw
  • 1 mode selection variable → Choosing between controlling the arm, controlling the base, or ending the episode
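
Turning the predicted bin indices back into commands is simple uniform de-quantization. The sketch below assumes placeholder per-dimension bounds; the real limits come from the robot’s specifications.

```python
import numpy as np

# Placeholder per-dimension (lo, hi) bounds; not the real robot limits.
ARM_BOUNDS = [(-0.1, 0.1)] * 3 + [(-0.5, 0.5)] * 3 + [(0.0, 1.0)]
BASE_BOUNDS = [(-0.2, 0.2)] * 2 + [(-0.5, 0.5)]

def detokenize(action_bins: np.ndarray, bins: int = 256) -> dict:
    """Map 11 discrete bin indices back to continuous commands by
    uniform de-quantization over each dimension's range."""
    bounds = ARM_BOUNDS + BASE_BOUNDS          # 10 continuous dims
    continuous = [
        lo + (idx / (bins - 1)) * (hi - lo)
        for idx, (lo, hi) in zip(action_bins[:10], bounds)
    ]
    mode = int(action_bins[10])  # e.g. 0 = arm, 1 = base, 2 = terminate
    return {"arm": continuous[:7], "base": continuous[7:10], "mode": mode}
```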

Experimental Discussion

RT-1’s effectiveness shines through in over 3000 real-world trials conducted across three robotic environments at Google (a training environment and two real office kitchens), benchmarked against prior models such as Gato and BC-Z.

Performance Breakdown

  • Seen tasks → Over 95% success rate
  • Unseen tasks → About 76% success rate
  • Changing backgrounds → Around 60% success rate
  • Handling distractors → 83% success rate
  • Long-horizon tasks → 87% planning success and 67% execution success

Generalization & Adaptability

  • RT-1 not only excels at familiar tasks but also generalizes well to:

    • New tasks
    • Different robots
    • Varied environments, even in realistic, everyday scenarios
  • It is versatile enough to be trained on data from multiple robot types and can even mix real-world data with simulated examples.

Key Insight: Data Diversity Over Volume

  • Ablations on dataset size and composition revealed a crucial finding:
    • Data diversity is more important than simply having a large dataset: shrinking the variety of tasks in the training data hurt generalization far more than shrinking the number of examples per task.

Conclusion

RT-1, a transformer-based robot learning model trained on 130K real-world episodes, delivers on the goal of building a generalizable system capable of:

  • Performing multiple tasks
  • Adapting to new, unseen challenges
  • Handling changing environments, distractions, and background variations
  • Executing long-horizon tasks of up to 50 steps
  • Learning from a mix of real-world and simulated data, making it highly adaptable

The Future of Robotics

Most importantly, RT-1 demonstrates that a single transformer policy can absorb large, diverse robot datasets and generalize from them, paving the way for more flexible and scalable robot learning in the future!


Disclaimer: The quotes and references above are adapted from “RT-1: Robotics Transformer for Real-World Control at Scale”.