The ChatGPT of robots is here: large models enter the real world in a major DeepMind breakthrough
We have long expected that, after mastering language and images from the internet, large models would eventually enter the real world, with "embodied intelligence" as the next direction of development.
Connecting a large model to a robot, so that simple natural language rather than complex instructions produces a concrete action plan, without extra data or training, is an appealing vision, but it has also seemed far off. After all, robotics is notoriously difficult.
However, AI is evolving faster than we thought.
This Friday, Google DeepMind announced the launch of RT-2: the world's first Vision-Language-Action (VLA) model for controlling robots.
Complex instructions are no longer needed; the robot can be directed in plain language, much like using ChatGPT.
Tell the robot to give Taylor Swift the Coke can:
The development of large language models such as ChatGPT is setting off a revolution in robotics. By installing its most advanced language models on robots, Google has finally given them an artificial brain.
In a recently released paper, the DeepMind researchers state that RT-2 is trained on both web and robot data, drawing on advances in large language models such as Bard and combining them with robotics data. The new model can even understand instructions in languages other than English.
**How is RT-2 implemented?**
The name RT-2, spelled out, is Robotics Transformer 2: a Transformer model for robots.
It is not easy for robots to understand human language and act capably the way they do in science fiction movies. Compared with virtual environments, the real physical world is complex and disordered, and robots have usually needed complex instructions to do things that are simple for humans, who instinctively know what to do.
Previously, training a robot took a long time, and researchers had to build a separate solution for each task. With RT-2, a robot can analyze more information on its own and infer what to do next.
RT-2 builds on vision-language models (VLMs) and introduces a new concept: the vision-language-action (VLA) model, which learns from both web and robot data and translates this knowledge into generalized instructions for robot control. The model can even use chain-of-thought reasoning, for example working out which drink would be best for a tired person (an energy drink).
In fact, Google introduced the RT-1 version of the robot as early as last year. With only a single pre-trained model, RT-1 can generate instructions from different sensory inputs (such as vision and text) to carry out many kinds of tasks.
As a pre-trained model, it naturally requires a large amount of data for self-supervised learning. RT-2 builds on RT-1 and uses RT-1 demonstration data collected by 13 robots in an office kitchen environment over 17 months.
**The VLA model created by DeepMind**
As mentioned above, RT-2 is built on top of VLMs that have been trained on web-scale data and can perform tasks such as visual question answering, image captioning, and object recognition. The researchers adapted two previously proposed VLMs, PaLI-X (Pathways Language and Image model) and PaLM-E (Pathways Language model Embodied), to serve as the backbones of RT-2; the vision-language-action versions of these models are called RT-2-PaLI-X and RT-2-PaLM-E.
For a vision-language model to control a robot, it still has to produce motions. The study takes a very simple approach: robot actions are represented in another "language", namely text tokens, and trained alongside web-scale vision-language datasets.
The motion encoding for the robot is based on the discretization method proposed by Brohan et al. for the RT-1 model.
As shown in the figure below, the study represents a robot action as a text string, for example a sequence of action token numbers such as "1 128 91 241 5 101 127 217".
Because actions are represented as text strings, producing an action command becomes as simple for the model as emitting a string. With this representation, existing vision-language models can be fine-tuned directly and converted into vision-language-action models.
During inference, the text tokens are decoded back into robot actions, enabling closed-loop control; a sketch of this encoding and decoding follows below.
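To make the idea concrete, here is a minimal sketch of how continuous robot actions could be discretized into token strings and decoded back. It assumes 256 uniform bins per action dimension, in the spirit of RT-1's discretization scheme; the dimension names, value ranges, and function names are illustrative assumptions, not DeepMind's actual implementation.

```python
import numpy as np

# Illustrative 7-dimensional action space; names and ranges are assumptions.
ACTION_DIMS = ["dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper"]
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized range per dimension
NUM_BINS = 256                       # RT-1-style discretization into 256 bins


def encode_action(action: np.ndarray) -> str:
    """Discretize a continuous action vector into a string of token numbers."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    # Map each dimension from [ACTION_LOW, ACTION_HIGH] to an integer bin in [0, 255].
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    ).astype(int)
    return " ".join(str(b) for b in bins)


def decode_action(token_string: str) -> np.ndarray:
    """Invert the encoding: integer tokens back to continuous action values."""
    bins = np.array([int(tok) for tok in token_string.split()], dtype=float)
    return ACTION_LOW + bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW)


if __name__ == "__main__":
    action = np.array([0.0, 0.5, -0.3, 0.1, 0.0, -0.6, 1.0])
    tokens = encode_action(action)     # a string of integers, one per dimension
    recovered = decode_action(tokens)  # approximately the original action
    print(tokens)
    print(recovered)
```

During fine-tuning, such strings would serve as the text targets the vision-language model learns to predict for each image and instruction; at inference time, the predicted string is parsed back into a continuous command for the robot controller.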
Experiments
The researchers performed a series of qualitative and quantitative experiments on the RT-2 model.
The figure below demonstrates RT-2's performance on semantic understanding and basic reasoning. For example, for the task "put the strawberry into the correct bowl", RT-2 not only needs to recognize strawberries and bowls, but also needs to reason about the scene to know that the strawberry should be placed together with similar fruits. For the task of picking up a bag about to fall off a table, RT-2 needs to understand the physical properties of the bag in order to disambiguate between two bags and recognize objects in unstable positions.
Notably, none of the interactions tested in these scenarios appear in the robot training data.
As with ChatGPT, if such a capability were applied at scale, the world would likely change considerably. However, Google has no immediate plans to deploy RT-2 robots, saying only that the researchers believe robots that can understand human language will not stop at merely demonstrating these capabilities.
Just imagine: a robot with a built-in language model could be placed in a warehouse to fetch your medicine, or even serve as a home assistant, folding laundry, unloading the dishwasher, and tidying up around the house.
**Is embodied intelligence no longer far away?**
Embodied intelligence has recently become a direction that many researchers are exploring. This month, Fei-Fei Li's team at Stanford University demonstrated new results: by combining a large language model with a vision-language model, an AI system can analyze and plan in 3D space and guide a robot's actions.
Clearly, in the field of large models, there are still big things to come.