Autonomous driving gets a large model on board, and the AI will explain how it drives!
Source: Xinzhiyuan
Ever since autonomous driving was invented, the most worrying thing about it has been that humans cannot know what is going on in its mind.
Starting today, it can actually "speak out" what it is thinking.
Recently, Wayve launched LINGO-1, an interactive large model for autonomous driving built on vision-language-action models (VLAMs), which deeply integrates large language models with autonomous driving.
It will clearly answer every doubt you have about the intelligent driving system.
After being trained on a variety of visual and language data, LINGO-1 can not only perform visual question answering (VQA) tasks such as perception, counterfactuals, planning, reasoning, and attention, but also describe driving behavior and reasoning.
In other words, we can understand the factors that affect driving decisions by asking questions to LINGO-1.
It is conceivable that as we push the boundaries of embodied artificial intelligence, vision-language-action models will have a huge impact, because language provides a new paradigm for enhancing how we interpret and train self-driving models.
**A commentator in the self-driving car?**
The unique feature of LINGO-1 is that it is trained on human experts' verbal commentary on driving scenes, allowing the model to connect environmental perception, action decisions, and human-like scene interpretation.
Jim Fan, a senior AI scientist at NVIDIA, commented excitedly: "This is the most interesting autonomous driving work I have read recently!"
What are the advantages of this new explicit reasoning step? Jim Fan explains as follows:
Beyond that, LINGO-1 is closely related to research on game-playing AI agents such as MineDojo and Thought Cloning.
The former learns a reward model that associates commentary text with Minecraft video pixels; the latter realizes a complete "pixel -> language -> action" loop.
LINGO-1: An Open-Loop Driving Commentator
Explaining itself
What is the model paying attention to? What is it doing? This is no longer a mystery.
LINGO-1 will explain clearly to you what it does every step of the way.
In addition to explaining itself, LINGO-1 can also answer your questions, allowing us to evaluate its scene understanding and reasoning capabilities.
It says, "I have to pay attention to the light ahead of me, the cyclist in front of me, and the pedestrian crossing the road."
It will say: "It is a rainy day and I need to be extremely careful when driving because the road surface is slippery and visibility is reduced in rainy days."
It will say: "I have to keep a distance from cyclists and stop when necessary. It is a potential danger. In addition, I have to pay attention to the cars parked on the roadside."
The key to developing LINGO-1 was creating a scalable and diverse data set. This dataset contains commentary from professional drivers while driving across the UK, including images, language and action data.
This is reminiscent of learning to drive with an instructor at driving school: from time to time the instructor comments on and explains why they are driving the way they are, so that the student can generalize from each example.
When these sentences, the sensory images, and the underlying driving actions are synchronized in time, researchers obtain a rich vision-language-action dataset that can be used to train models for a variety of tasks.
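As a minimal sketch of what one time-aligned record in such a dataset could look like, consider the structure below; the `VLASample` fields and values are assumptions for illustration, since Wayve's internal schema is not public.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VLASample:
    """One time-aligned record in a vision-language-action dataset (illustrative schema)."""
    timestamp_s: float  # when the commentary was spoken
    frames: List[str]   # camera frames covering the utterance window
    commentary: str     # the expert driver's spoken explanation
    steering: float     # normalized steering angle, -1 (left) .. 1 (right)
    throttle: float     # 0 .. 1
    brake: float        # 0 .. 1

sample = VLASample(
    timestamp_s=132.4,
    frames=["cam_front_3310.jpg", "cam_front_3311.jpg"],
    commentary="Slowing down because the cyclist ahead may swerve around the parked van.",
    steering=0.05,
    throttle=0.0,
    brake=0.3,
)
print(sample.commentary)
```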
Visual-Language-Action Model (VLAM)
After the rise of LLMs, many vision-language models (VLMs) have combined the reasoning capabilities of LLMs with images and videos.
Wayve has gone a step further with the vision-language-action model (VLAM), which incorporates three types of information: images, driving data, and language.
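As a rough illustration of fusing those three input types, here is a toy PyTorch module; it is not LINGO-1's architecture, just a minimal sketch of conditioning a language decoder on image and driving-state features.

```python
import torch
import torch.nn as nn

class TinyVLAM(nn.Module):
    """Toy fusion model: illustrates the three-input idea, not LINGO-1 itself."""
    def __init__(self, d=64, vocab=1000):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d))  # vision
        self.state_enc = nn.Linear(4, d)            # driving data (speed, steering, ...)
        self.txt_emb = nn.Embedding(vocab, d)       # language tokens
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.lm_head = nn.Linear(d, vocab)          # predicts commentary tokens

    def forward(self, image, state, tokens):
        ctx = self.img_enc(image) + self.state_enc(state)  # fuse vision + driving state
        seq = self.txt_emb(tokens) + ctx.unsqueeze(1)      # condition every token on that context
        out, _ = self.decoder(seq)
        return self.lm_head(out)                           # logits over next commentary tokens

model = TinyVLAM()
logits = model(torch.rand(2, 3, 32, 32),         # batch of camera frames
               torch.rand(2, 4),                 # speed, steering, throttle, brake
               torch.randint(0, 1000, (2, 12)))  # tokenized question or commentary prefix
print(logits.shape)  # (2, 12, 1000)
```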
In the past, natural language was rarely used in robot training, especially in the field of autonomous driving.
Adding natural language allows us to interpret and train foundation driving models far more powerfully, and this new kind of model will have a huge impact.
By using language to explain various causal factors in driving scenarios, the training speed of the model can be accelerated and extended to new scenarios.
And since we can ask the model questions, we can know what the model understands and how well it can reason and make decisions.
The autonomous driving system is no longer a mysterious black box. We can ask it from time to time when driving: What are you thinking?
This will undoubtedly increase public trust in autonomous driving.
In addition, even with only a small number of training samples, natural language's fast-learning properties allow the model to learn new tasks and adapt to new scenarios quickly and efficiently.
For example, as long as we use natural language to tell the model "this behavior is wrong," we can correct the wrong behavior of the autonomous driving system.
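A minimal sketch of how such text feedback could be collected for later fine-tuning is shown below; the `add_correction` helper and its fields are hypothetical, not a published Wayve pipeline.

```python
# Assumed workflow: a natural-language correction is stored alongside the offending
# driving clip and replayed during a later fine-tuning pass.
corrections = []

def add_correction(clip_id: str, feedback: str, desired_action: dict) -> None:
    """Attach human feedback to a logged driving clip for later fine-tuning."""
    corrections.append({
        "clip": clip_id,
        "feedback": feedback,        # e.g. "Braking this hard for a distant cyclist is wrong."
        "target": desired_action,    # the behavior we want instead
    })

add_correction(
    clip_id="run07_clip_118",
    feedback="This behavior is wrong: do not stop fully, just slow down and keep a wide gap.",
    desired_action={"throttle": 0.1, "brake": 0.0, "lateral_offset_m": 1.2},
)
print(len(corrections), "correction(s) queued for fine-tuning")
```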
From now on, perhaps natural language is all that is needed to build a foundation model for end-to-end autonomous driving!
Accuracy at 60% of human level
Over this period, the team has been improving the model architecture and the training dataset.
As the figure shows, LINGO-1's performance has doubled compared with the initial version.
Currently, the accuracy of LINGO-1 has reached 60% of human level.
Improving the interpretability of end-to-end models
The lack of interpretability in machine learning models has long been a focus of research.
By creating an interactive interface based on natural language, users can directly ask questions and let AI answer them, thereby gaining an in-depth understanding of the model's understanding of the scene and how it makes decisions.
This unique dialogue between passengers and self-driving cars can increase transparency and make it easier to understand and trust these systems.
At the same time, natural language also enhances the model’s ability to adapt to and learn from human feedback.
Like an instructor guiding a student behind the wheel, corrective instructions and user feedback refine the model's understanding and decision-making process over time.
Better planning and reasoning, improved driving performance
There are two main factors that affect autonomous driving performance:
- the language model's ability to accurately interpret scenes across the various input modalities;
- the model's proficiency at converting mid-level reasoning into effective low-level planning.
To that end, the team is working to enhance its closed-loop driving model with LINGO-1's natural-language, reasoning, and planning capabilities.
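As a toy stand-in for the second factor above, the sketch below maps a mid-level reasoning sentence onto coarse control targets with keyword rules; a real closed-loop system would use the model itself, and the thresholds here are arbitrary assumptions.

```python
def reasoning_to_plan(reasoning: str) -> dict:
    """Map a natural-language reasoning step to coarse control targets (illustrative only)."""
    text = reasoning.lower()
    plan = {"target_speed_mps": 13.0, "follow_gap_s": 2.0}  # default cruising behavior
    if "pedestrian" in text or "zebra crossing" in text:
        plan.update(target_speed_mps=0.0, follow_gap_s=3.0)   # prepare to stop and yield
    elif "cyclist" in text:
        plan.update(target_speed_mps=6.0, follow_gap_s=3.0)   # slow down and leave room
    elif "slippery" in text or "rain" in text:
        plan.update(target_speed_mps=9.0, follow_gap_s=3.5)   # reduce speed in the wet
    return plan

print(reasoning_to_plan("A pedestrian is waiting at the zebra crossing, so I should yield."))
```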
Efficient learning of new scenarios or long-tail scenarios
Usually, a picture is worth a thousand words.
But when training a model, a piece of text is worth a thousand pictures.
Now, instead of thousands of examples of cars slowing down for pedestrians, we only need a few examples plus a short text description to teach the model why it should slow down and what it should consider in this situation.
One of the most important parts of autonomous driving is causal reasoning, which allows the system to understand the relationships between the elements and behaviors in a scene.
A well-performing VLAM allows the system to connect pedestrians waiting at zebra crossings with "Do Not Cross" traffic signals. This is extremely meaningful in challenging scenarios with limited data.
In addition, LLMs already hold a large amount of knowledge about human behavior from internet-scale datasets, so they understand concepts such as identifying objects, traffic regulations, and driving maneuvers, for example the difference between a tree, a shop, a house, a dog chasing a ball, and a bus parked in front of a school.
Because the VLAM encodes this broader information on top of the image data, autonomous driving will become more advanced and safer.
Limitations
Of course, LINGO-1 also has certain limitations.
Generalization
LINGO-1 is trained on central London driving experience and Internet-scale text.
Although it has picked up knowledge of driving cultures from all over the world, what it is currently best at is interpreting British traffic law.
It still needs to learn from driving experience in other countries.
Hallucination
Hallucinations are a well-known problem in large language models, and LINGO-1 is no exception.
However, compared with an ordinary LLM, LINGO-1 has an advantage: because it is grounded in vision, language, and action, it has more sources of supervision and can understand the world better.
Context
Video deep learning is challenging because video data are typically orders of magnitude larger than image or text datasets.
Video-based multimodal language models especially require long context lengths to be able to embed many video frames to reason about complex dynamic driving scenarios.
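A back-of-the-envelope sketch of why long context matters is shown below; the token and frame-rate numbers are assumptions for illustration, not LINGO-1's actual settings.

```python
# Rough budget: how many seconds of video fit in a fixed context window?
tokens_per_frame = 256      # assumed visual tokens per camera frame
frames_per_second = 10      # assumed sampling rate
context_length = 8192       # assumed model context window, in tokens
text_budget = 512           # tokens reserved for the question and commentary

video_budget = context_length - text_budget
seconds_of_video = video_budget / (tokens_per_frame * frames_per_second)
print(f"{seconds_of_video:.1f} s of video fits in context")  # 3.0 s with these assumptions
```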
Closed-loop reasoning
At the moment, Wayve is focusing on model interpretability, but ultimately the LLM's reasoning capabilities should genuinely influence the driving itself.
Researchers are developing a closed-loop architecture that can run LINGO-1 on autonomous vehicles in the future.
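A minimal sketch of what such a closed loop could look like is given below, with stub classes standing in for the vehicle interface and the model; all names and methods here are hypothetical, not Wayve's implementation.

```python
import time

class StubVehicle:
    """Stand-in for a vehicle interface (hypothetical methods)."""
    def __init__(self, steps: int = 3):
        self.steps = steps
    def is_driving(self) -> bool:
        self.steps -= 1
        return self.steps >= 0
    def get_camera_frame(self) -> str:
        return "frame.jpg"
    def apply_controls(self, plan: dict) -> None:
        print("applying", plan)

class StubModel:
    """Stand-in for a language-capable driving model."""
    def explain(self, frame: str) -> str:
        return "A cyclist is ahead, so I should slow down and keep a wide gap."
    def plan(self, frame: str, reasoning: str) -> dict:
        return {"target_speed_mps": 6.0, "follow_gap_s": 3.0}

def run_closed_loop(model, vehicle, hz: float = 10.0) -> None:
    """Perceive -> reason in language -> plan -> act, repeated every control cycle."""
    period = 1.0 / hz
    while vehicle.is_driving():
        frame = vehicle.get_camera_frame()
        reasoning = model.explain(frame)       # mid-level reasoning in natural language
        plan = model.plan(frame, reasoning)    # low-level plan conditioned on that reasoning
        vehicle.apply_controls(plan)
        time.sleep(period)

run_closed_loop(StubModel(), StubVehicle())
```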
Netizen Discussion
Netizens also found this very exciting.
“Interestingly, I think the language model interprets the steering, braking, and throttle predictions of the driving control model in words, rather than affecting the driving itself, because natural language would lose the required precision.”
"You can think of it as adding language to the world model. I never understand why it has never been tried before, because the idea of training an agent to communicate seems to be something everyone can think of."
LINGO-1 has officially taken an important step in using natural language to enhance the learning and interpretability of basic driving models.
Just imagine: in the future, a simple text prompt could let the AI describe the road conditions ahead, or teach it the traffic regulations of a different region. What an exciting prospect!
Therefore, natural language has great potential in developing safer and more reliable self-driving cars.