Autonomous driving gets a large model on board, and the AI explains how it drives!

Source: Xinzhiyuan

LINGO-1, launched by Wayve, gives autonomous driving a voice! The car explains how it drives in its own words, and with a large language model on board it is no longer a black box.

Ever since autonomous driving was invented, the most worrying thing about it has been that humans cannot know what is going on in its "mind".

Starting today, it can actually "speak" its thoughts out loud.

Recently, Wayve launched LINGO-1, an interactive autonomous driving model based on vision-language-action models (VLAMs), which deeply integrates large language models with autonomous driving.

Specifically, LINGO-1 includes a trained video-language model with, one might say, quite a personality: it provides a running commentary on the scene unfolding in front of it.

Q: What are you doing?
A: I'm crossing the traffic carefully because the road conditions are complicated at the moment.
Q: What are you going to do next?
A: I'm going to turn left.
Q: What are the potential risks in the current scene?
A: The traffic lights, the cyclist, and the pedestrians crossing the road.

It will clearly answer any doubts you have about the intelligent driving system.

Q: Why did you stop? A: Because the light is red right now.

After being trained on a variety of visual and language data, LINGO-1 can not only perform visual question answering (VQA) tasks such as perception, counterfactuals, planning, reasoning, and attention, but also describe driving behavior and reasoning.

In other words, we can understand the factors that affect driving decisions by asking questions to LINGO-1.

Netizens marveled: "The end-to-end black box has been opened in this way. It is simply a miracle of the GPT era. Autonomous driving systems have taken a big step forward in interacting with the real physical world."

It is conceivable that, as we push the boundaries of embodied artificial intelligence, vision-language-action models will have a huge impact, because language provides a new paradigm for interpreting and training self-driving models.

**A commentator rides along in the self-driving car?**

The unique feature of LINGO-1 is that it is trained on human experts' spoken commentary about driving scenes, allowing the model to connect environmental perception, action decisions, and human-style scene interpretation.

In this way, it can interpret the decisions and actions of the autonomous driving system through natural language interaction.

Jim Fan, senior AI scientist at NVIDIA, commented excitedly: This is the most interesting work in the field of autonomous driving that I have read recently!

In the past, autonomous driving was "perception -> driving action"; going forward, it adds a reasoning step in between: "perception -> verbal reasoning -> action" (a minimal code sketch of this loop follows the list below).

What are the advantages of this new explicit reasoning step? Jim Fan explains:

  • Interpretability: The driving model is no longer a mysterious black box.
  • Counterfactual scenarios: It is able to imagine scenarios that are not in the training data and reason about how to handle them correctly.
  • Long-tail handling: Driving involves a huge number of edge-case scenarios, and it is impossible to provide good data coverage for all of them. Rather than collecting thousands of examples to "neurally program" a case, human experts can write a few prompts explaining a small number of examples, thereby guiding the system on how to handle specific or complex scenarios.
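
To make the shape of this loop concrete, here is a minimal Python sketch of "perception -> verbal reasoning -> action"; every function in it is a hypothetical stand-in for illustration, not Wayve's actual interface.

```python
# Minimal sketch of the "perception -> verbal reasoning -> action" loop.
# All functions are illustrative stand-ins, not Wayve's API.

def perceive(frames):
    # Stand-in perception: summarize what the cameras currently see.
    return "parked car on the left, cyclist ahead, traffic light is green"

def reason_in_language(scene: str) -> str:
    # Stand-in for the vision-language-action model: produce an explicit,
    # human-readable plan. This is the new, inspectable reasoning step.
    return f"I see {scene}. I will slow down and leave space for the cyclist."

def plan_to_controls(plan: str) -> dict:
    # Stand-in low-level planner: map the verbal plan to control targets.
    return {"throttle": 0.1, "brake": 0.2, "steering": 0.0}

def driving_step(frames):
    scene = perceive(frames)
    plan = reason_in_language(scene)   # the reasoning is made explicit here
    controls = plan_to_controls(plan)
    return plan, controls              # the plan doubles as an explanation

explanation, controls = driving_step(frames=None)
print(explanation)
print(controls)
```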

Moreover, LINGO-1 is closely related to research in the field of game AI, such as MineDojo and Thought Cloning, both of which concern AI agents.

The former learns a reward model that associates commentary text with Minecraft video pixels. The latter implements a complete "pixels -> language -> action" loop.
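
The reward-model idea can be sketched in a few lines: embed the commentary text and the video clip into a shared space and use their similarity as the reward signal. The embedding functions below are toy placeholders, not MineDojo's actual code.

```python
import numpy as np

# Toy sketch of a text-video reward model in the spirit of MineDojo:
# the reward is the similarity between commentary text and a video clip,
# both embedded in a shared space. The embedders are random placeholders.

def embed_text(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)  # toy stand-in
    return rng.standard_normal(512)

def embed_video(frames) -> np.ndarray:
    rng = np.random.default_rng(0)                         # toy stand-in
    return rng.standard_normal(512)

def reward(task_text: str, frames) -> float:
    t, v = embed_text(task_text), embed_video(frames)
    return float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))  # cosine similarity

print(reward("chop down a tree", frames=None))
```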

LINGO-1: an open-loop driving commentator

Commenting on its own driving

What is the model paying attention to? What is it doing? This is no longer a mystery.

LINGO-1 will explain clearly to you what it does every step of the way.

For example, when it overtakes a vehicle parked on the side of the road:

LINGO-1: I am overtaking due to slow traffic.
LINGO-1: I'm passing a parked car.
LINGO-1: I'm accelerating because the road ahead is clear.

When approaching a zebra crossing:

LINGO-1: I'm maintaining my speed; the road is still clear.
LINGO-1: I'm braking and slowing down now.
LINGO-1: I'm stopping at the zebra crossing.
LINGO-1: I'm accelerating away from where I stopped.
LINGO-1: I'm speeding up because the road is clear.

When turning left at an intersection:

LINGO-1: I'm holding still because the car in front has also stopped.
LINGO-1: I'm accelerating because the car in front has moved away.
LINGO-1: I'm applying the brakes to stop at the intersection.
LINGO-1: I'm moving forward because the lane is clear.
LINGO-1: I'm completing the left turn to follow my route.

### Visual Question Answering (VQA)

In addition to explaining itself, LINGO-1 can also answer your questions, allowing us to evaluate its scene understanding and reasoning capabilities.

For example, if you ask LINGO-1, what should you pay attention to at this intersection?

It says, "I have to pay attention to the light ahead of me, the cyclist in front of me, and the pedestrian crossing the road."

Ask LINGO-1: what is the weather like right now, and how will it affect driving?

It will say: "It is a rainy day and I need to be extremely careful when driving because the road surface is slippery and visibility is reduced in rainy days."

LINGO-1, how do you drive safely alongside the cyclist?

It will say: "I have to keep a distance from cyclists and stop when necessary. It is a potential danger. In addition, I have to pay attention to the cars parked on the roadside."

### The key: a driving commentary dataset

The key to developing LINGO-1 was creating a scalable and diverse dataset containing commentary from professional drivers recorded while driving across the UK, along with the corresponding images, language, and action data.

This is reminiscent of learning to drive with an instructor at a driving school: from time to time, the instructor offers comments and explanations like the ones below about why they are driving a certain way, so that the student can generalize from them.

  • The vehicle/traffic light ahead has changed, slow down
  • It's time to change lanes
  • You can speed up, but be careful not to exceed the speed limit
  • Attention! Another vehicle has entered the road / stopped at the intersection
  • Watch out for the roundabout and the give-way sign ahead
  • There are parked vehicles / traffic lights / a school ahead
  • The vehicle ahead is changing lanes / overtaking a parked vehicle
  • A cyclist / pedestrian is waiting at the zebra crossing

When these sentences, the sensor images, and the underlying driving actions are synchronized in time, researchers obtain a rich vision-language-action dataset that can be used to train models for a variety of tasks.
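
One time-aligned record in such a dataset might look roughly like this; the field names are assumptions for illustration, not Wayve's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Illustrative shape of one time-aligned vision-language-action sample.
# Field names are assumptions, not Wayve's actual schema.

@dataclass
class DrivingSample:
    timestamp_s: float          # when the commentary was spoken
    camera_frames: List[str]    # paths to the synchronized image frames
    commentary: str             # the expert driver's spoken explanation
    speed_mps: float            # low-level driving state/action at that moment
    steering_angle_deg: float
    brake: float                # 0.0 (off) .. 1.0 (full)

sample = DrivingSample(
    timestamp_s=128.4,
    camera_frames=["frame_03210.jpg", "frame_03211.jpg"],
    commentary="I'm slowing down because a cyclist is waiting at the zebra crossing.",
    speed_mps=4.2,
    steering_angle_deg=0.0,
    brake=0.3,
)
print(sample.commentary)
```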

Vision-Language-Action Model (VLAM)

Following the rise of LLMs, many vision-language models (VLMs) have combined the reasoning capabilities of LLMs with images and videos.

Wayve goes a step further with the vision-language-action model (VLAM), which combines three types of information: images, driving data, and language.

In the past, natural language was rarely used in robot training (especially in the field of autonomous driving).

Adding natural language gives us a much more powerful way to interpret and train foundation driving models, and this new kind of model will have a huge impact.

By using language to explain the causal factors in a driving scene, model training can be sped up and extended to new scenarios.

And since we can ask the model questions, we can know what the model understands and how well it can reason and make decisions.

The autonomous driving system is no longer a mysterious black box. We can ask it from time to time when driving: What are you thinking?

This will undoubtedly increase public trust in autonomous driving.

In addition, even with only a small number of training samples, the fast-learning properties of natural language allow the model to learn new tasks and adapt to new scenarios quickly and efficiently.

For example, as long as we use natural language to tell the model "this behavior is wrong," we can correct the wrong behavior of the autonomous driving system.

In the future, perhaps natural language is all that is needed to build a foundation model for end-to-end autonomous driving!

Accuracy reaches 60% of human level

During this time, the team has been improving the model architecture and training data set.

As the figure shows, LINGO-1's performance has doubled since the early stages of development.

Currently, the accuracy of LINGO-1 has reached 60% of human level.

Obviously, the introduction of natural language can revolutionize autonomous driving technology in many aspects.

Improving the interpretability of end-to-end models

The lack of interpretability of machine learning models has always been the focus of research.

By creating an interactive interface based on natural language, users can directly ask questions and let AI answer them, thereby gaining an in-depth understanding of the model's understanding of the scene and how it makes decisions.

This unique dialogue between passengers and self-driving cars can increase transparency and make it easier to understand and trust these systems.

At the same time, natural language also enhances the model’s ability to adapt to and learn from human feedback.

Like an instructor guiding a student behind the wheel, corrective instructions and user feedback refine the model's understanding and decision-making process over time.

Better planning and reasoning, improved driving performance

There are two main factors that affect autonomous driving performance:

  1. The language model's ability to accurately interpret scenes across the various input modalities

  2. The model’s proficiency in converting mid-level reasoning into effective low-level planning

In this regard, the team is exploring how LINGO's natural language, reasoning, and planning capabilities can enhance the closed-loop driving model.

Wayve LINGO-1 closed-loop autonomous driving model

Efficient learning of new scenarios or long-tail scenarios

Usually, a picture is worth a thousand words.

But when training a model, a piece of text is worth a thousand pictures.

Now, instead of thousands of examples of cars slowing down for pedestrians, we only need a few examples plus a short text description to teach the model to slow down, how it should act in this situation, and what it should consider.
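
In practice, such teaching could take the form of a short prompt: a couple of examples plus one sentence of guidance instead of thousands of labeled clips. The wording below is purely illustrative, not a real LINGO-1 prompt.

```python
# Illustrative few-shot prompt: a handful of examples plus a short rule,
# instead of thousands of labeled driving clips. Purely an assumed format.
prompt = """\
Example 1: A pedestrian steps toward the zebra crossing.
  -> Brake gently and stop before the crossing; give way to the pedestrian.
Example 2: A child's ball rolls into the road from between parked cars.
  -> Slow down immediately; a child may follow the ball.

Rule: whenever a vulnerable road user might enter your path, reduce speed
early and be prepared to stop, even if they have not yet entered the road.

Now: a cyclist wobbles near the edge of the cycle lane ahead. What do you do?
"""
print(prompt)
```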

One of the most important components of autonomous driving is causal reasoning, which allows the system to understand the relationships between the elements and behaviors in a scene.

A well-performing VLAM allows the system to connect pedestrians waiting at zebra crossings with "Do Not Cross" traffic signals. This is extremely meaningful in challenging scenarios with limited data.

In addition, LLMs already carry a large amount of knowledge about human behavior from internet-scale datasets, so they can understand concepts such as object identification, traffic regulations, and driving maneuvers, for example the difference between a tree, a shop, a house, a dog chasing a ball, and a bus stopped in front of a school.

Through the VLAM's richer encoding of visual information, autonomous driving will become more capable and safer.

Limitations

Of course, LINGO-1 also has certain limitations.

Generalization

LINGO-1 is trained on central London driving experience and Internet-scale text.

Although it has learned about driving cultures from all over the world through internet text, what it is currently best at is interpreting British traffic rules.

It still needs to learn from driving experience in other countries.

Hallucination

Hallucinations are a well-known problem in large language models, and LINGO-1 is no exception.

However, compared with an ordinary LLM, LINGO-1 has an advantage: because it is grounded in vision, language, and action, it has more sources of supervision and can understand the world better.

Context

Video deep learning is challenging because video data are typically orders of magnitude larger than image or text datasets.

Video-based multimodal language models especially require long context lengths to be able to embed many video frames to reason about complex dynamic driving scenarios.

Closed-loop reasoning

Currently, Wayve is working on model interpretability, but ultimately the LLM's reasoning capabilities should be able to genuinely influence the driving itself.

Researchers are developing a closed-loop architecture that can run LINGO-1 on autonomous vehicles in the future.

Netizen Discussion

Netizens also found this very exciting.

“Interestingly, I think the language model interprets the steering, braking, and throttle predictions of the driving control model in words, rather than affecting the driving itself, because natural language would lose the required precision.”

"With this approach, we are getting very close to AGI."

"You can think of it as adding language to the world model. I never understand why it has never been tried before, because the idea of training an agent to communicate seems to be something everyone can think of."

“This shift from pure perceptual action to adding textual reasoning is a game changer! In my opinion, this is the missing piece of autonomous driving because it makes interpretability possible in a non-technical way.”

At present, the integration of robot training and natural language is still in its early stages.

LINGO-1 has taken an important step in using natural language to enhance the learning and interpretability of foundation driving models.

Just imagine: in the future, with only a simple text prompt, we could have the AI describe the road conditions ahead, or have it learn the traffic rules of a different region. What an exciting prospect!

Therefore, natural language has great potential in developing safer and more reliable self-driving cars.
