DeepMind’s RT-2 simplifies robot control with AI chat.

Robotics Transformer 2: Communicating with Robots Using Language and Images

DeepMind’s robotics transformer version 2

In robotics, a key element of the future lies in our ability to instruct machines in real time. How best to communicate with robots, however, remains an open question. In its latest research, Google's DeepMind unit proposes an innovative solution: combine a large language model, similar to the one behind OpenAI's ChatGPT, with images and robot movement coordinate data. The approach lets humans communicate complex instructions to machines as easily as having a conversation with a chatbot.

The research paper, titled "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" and authored by Anthony Brohan and colleagues, introduces RT-2, a groundbreaking "vision-language-action" model; the RT stands for "robotics transformer." The challenge the model addresses is how to train a program that processes both images and text to generate meaningful actions for a robot. DeepMind's answer is to treat robot actions as just another language, so the model can generate them the same way ChatGPT generates new text after being trained on text from the internet.

The unique aspect of the RT-2 model is its representation of robot actions as a set of coordinates covering the robot's degrees of freedom: the positional and rotational displacement of the robot's end-effector, the extension of the gripper, and a special command for terminating the episode. During training, these coordinates are discretized into tokens and treated just like language tokens and image tokens, all part of the same phrase. Folding the coordinates into the language-and-image model is a significant milestone, because it merges the usually separate programming for robot physics with language and image neural networks.
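To make the idea concrete, here is a minimal sketch of how a continuous robot action could be discretized into integer tokens that sit alongside word and image tokens. The eight-value layout (a terminate flag, three positional and three rotational displacements, and a gripper value), the 256-bin granularity, and the [-1, 1] bounds are illustrative assumptions, not the exact scheme from the paper.

```python
import numpy as np

# Illustrative settings (assumptions, not the paper's exact values).
NUM_BINS = 256                 # assumed discretization granularity per dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action bounds

def action_to_tokens(terminate, delta_pos, delta_rot, gripper):
    """Discretize a continuous robot action into integer tokens so it can be
    appended to a text/image token sequence like ordinary words."""
    continuous = np.concatenate([delta_pos, delta_rot, [gripper]])
    clipped = np.clip(continuous, ACTION_LOW, ACTION_HIGH)
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    ).astype(int)
    return [int(terminate)] + bins.tolist()

def tokens_to_action(tokens):
    """Invert the mapping: turn predicted tokens back into a motor command."""
    terminate, bins = tokens[0], np.array(tokens[1:], dtype=float)
    continuous = bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW
    return terminate, continuous[:3], continuous[3:6], continuous[6]

# Example: a small end-effector displacement with the gripper half open.
tokens = action_to_tokens(0, [0.02, -0.01, 0.0], [0.0, 0.0, 0.1], 0.5)
print(tokens)   # -> [0, 130, 126, 128, 128, 128, 140, 191]
```

Because the action tokens are ordinary integers in a fixed range, the model can emit them with exactly the same decoding machinery it uses for words.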

RT-2 builds on Google's prior efforts, PaLI-X and PaLM-E, both of which are vision-language models. Whereas PaLI-X focuses on image and text tasks, PaLM-E goes a step further, using language and image data to drive a robot by generating commands. RT-2 goes beyond PaLM-E by generating not only the plan of action but also the coordinates of movement in space. This advancement makes knowledge transfer from the internet to robots more direct and scalable.

To create RT-2, the DeepMind team built the model on top of PaLI-X and PaLM-E, vision-language models with billions of parameters. Those underlying models have far more neural weights than the one behind RT-2's predecessor, RT-1, which makes RT-2 more proficient at executing tasks. The larger models enable RT-2 to succeed on about 60% of tasks involving previously unseen objects, outperforming the earlier programs by a significant margin.

When given a natural-language command and an accompanying image, RT-2 generates a plan of action along with coordinates in space. For example, if the prompt is to pick the object on a table that is different from the rest, RT-2 predicts the action and outputs the corresponding set of coordinates for the robot to pick up that object. This showcases RT-2's generalization and reasoning skills, even when the objects, relationships, or cues involved were never explicitly seen during robot training.
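As an illustration of that inference flow, the sketch below shows how a generated string containing a plan and action tokens might be parsed before being handed to the robot. RT-2 itself is not publicly released, so StubVLAModel, its generate method, and the output format are hypothetical stand-ins; the parsed tokens could then be converted back into a motor command with a routine like tokens_to_action from the earlier sketch.

```python
import re

class StubVLAModel:
    """Hypothetical stand-in for a trained vision-language-action model.
    It just returns a fixed string in a 'plan + action tokens' style."""
    def generate(self, image, instruction):
        return "plan: pick up the odd object out. action: 0 130 126 128 128 128 140 191"

def parse_action_tokens(output_text):
    """Pull the integer action tokens out of the generated text."""
    match = re.search(r"action:\s*([\d\s]+)", output_text)
    if match is None:
        raise ValueError("model output contained no action tokens")
    return [int(tok) for tok in match.group(1).split()]

model = StubVLAModel()
camera_image = None  # a real system would pass the current camera frame here
tokens = parse_action_tokens(
    model.generate(camera_image, "pick the object that is different from the rest")
)
print(tokens)  # these integers would be de-tokenized into a motor command
```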

The authors of the research paper highlight the potential for further advancements in robot learning by combining recent developments in language and image handling. As language and image models improve, the vision-language-action approach can benefit from these advancements. However, one challenge highlighted by the authors is the computational cost of large language models, which could limit real-time inference when high-frequency control is required. Future research will explore techniques to address this bottleneck, such as quantization and distillation methods that can enable these models to run at higher rates or on lower-cost hardware.
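To give a flavor of one of the techniques the authors mention, below is a minimal sketch of post-training weight quantization: mapping float32 weights to 8-bit integers so a model occupies roughly a quarter of the memory and can use cheaper integer arithmetic. The symmetric per-tensor scheme, the toy layer size, and the weight distribution are all illustrative assumptions, not DeepMind's actual method.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to 8-bit integers."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
print("max abs error:", np.abs(dequantize(q, scale) - w).max())
```

The trade-off is a small, measurable approximation error in exchange for lower memory traffic, which is what makes higher-rate or lower-cost inference plausible for control loops.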

In conclusion, DeepMind’s RT-2 model represents a significant step forward in how we communicate with robots. By combining large language models with images and robot movement coordinate data, it offers a powerful tool for instructing robots in real time. RT-2's ability to generalize to varied real-world situations and to exhibit reasoning and human-recognition skills opens up new possibilities for the future of robotics. As language and image handling continue to advance, robot learning can evolve and improve along with them.