With GPT-4, a robot learned to spin pens and roll walnuts in its hand

Original article from Machine Heart

Editors: Zhang Qian, Chen Ping

With the combination of GPT-4 and reinforcement learning, what will the future of robotics look like?

When it comes to learning, GPT-4 is a formidable student. After digesting vast amounts of human data, it has mastered all kinds of knowledge, and in conversation it has even sparked new ideas for mathematician Terence Tao.

At the same time, it has become an excellent teacher, teaching not only book knowledge but also robots how to spin pens.

The agent, named Eureka, comes from a study by NVIDIA, the University of Pennsylvania, the California Institute of Technology, and the University of Texas at Austin. The work combines large language models and reinforcement learning: GPT-4 is used to write and refine the reward function, and reinforcement learning is used to train the robot controller.

With GPT-4's ability to write code, Eureka has excellent reward-function design capabilities, and its self-generated rewards are superior to those of human experts on 83% of tasks. This ability allows the robot to perform many tasks that were previously hard to achieve, such as spinning pens, opening drawers and cabinets, catching and dribbling balls, and operating scissors. For now, though, all of this is done in a virtual environment.

In addition, Eureka implements a new form of in-context RLHF that incorporates natural-language feedback from human operators to guide and align reward functions. It can serve as a powerful assistant for robotics engineers, helping them design complex motion behaviors. Jim Fan, a senior AI scientist at NVIDIA and one of the authors of the paper, likened the study to "Voyager in the space of physics simulator APIs."

It is worth mentioning that this research is fully open source; the links are as follows:

Paper Link:

Project Link:

Code link:

Paper Overview

Large language models (LLMs) excel at high-level semantic planning for robotic tasks (e.g., Google's SayCan and RT-2 robots), but whether they can be used to learn complex, low-level manipulation tasks, such as pen spinning, remains an open question. Existing attempts either require extensive domain expertise to build task prompts or learn only simple skills, far from human-level dexterity.

Google's RT-2 robot

Reinforcement learning (RL), on the other hand, has achieved impressive results in dexterity and many other areas (such as OpenAI's robotic hand that solves a Rubik's Cube), but it requires human designers to carefully construct reward functions that accurately codify the desired behavior and provide a learning signal for it. Since many real-world reinforcement learning tasks only offer sparse rewards that are difficult to learn from, reward shaping is needed in practice to provide a progressive learning signal. Although the reward function is crucial, it is notoriously difficult to design. A recent survey found that 92% of the reinforcement learning researchers and practitioners surveyed relied on manual trial and error when designing rewards, and 89% said the rewards they designed were suboptimal and led to unexpected behavior.

Given how important reward design is, one can't help but ask: is it possible to develop a universal reward-programming algorithm using state-of-the-art coding LLMs such as GPT-4? These LLMs excel at coding, zero-shot generation, and in-context learning, and have greatly improved the performance of programming agents. Ideally, such a reward design algorithm should have human-level reward-generation capabilities that scale to a wide range of tasks, automate the tedious trial-and-error process without human supervision, and still be compatible with human oversight to ensure safety and alignment.

This paper proposes an LLM-driven reward design algorithm, EUREKA (Evolution-driven Universal REward Kit for Agent). The algorithm achieves the following:

1. The performance of the reward design reaches human level in 29 different open-source RL environments spanning 10 distinct robot morphologies (quadruped, quadcopter, biped, manipulator, and several dexterous hands; see Figure 1). Without any task-specific prompts or reward templates, EUREKA's self-generated rewards outperform those of human experts on 83% of tasks and achieve an average normalized improvement of 52%.

2. It solves dexterous manipulation tasks that previously could not be achieved through manual reward engineering. Take pen spinning, for example, in which a five-fingered hand must quickly rotate the pen through a pre-set sequence of rotation configurations for as many cycles as possible. By combining EUREKA with curriculum learning, the researchers demonstrated for the first time rapid pen spinning on a simulated anthropomorphic Shadow Hand (see the bottom of Figure 1).

3. It provides a new gradient-free, in-context learning approach to reinforcement learning from human feedback (RLHF) that can generate more effective and human-aligned reward functions from various forms of human input. The paper shows that EUREKA can benefit from and improve on existing human reward functions. Likewise, the researchers demonstrated EUREKA's ability to use textual human feedback to help design reward functions that capture subtle human preferences.

Unlike previous L2R work on LLM-assisted reward design, EUREKA requires no task-specific prompts, reward templates, or few-shot examples. In the experiments, EUREKA performs significantly better than L2R thanks to its ability to generate and refine free-form, expressive reward programs.

EUREKA's generality stems from three key algorithm design choices: environment as context, evolutionary search, and reward reflection.

First, by taking the environment source code as context, EUREKA can zero-shot generate executable reward functions from the backbone coding LLM (GPT-4). EUREKA then substantially improves reward quality through evolutionary search: it iteratively proposes batches of reward candidates and refines the most promising ones within the LLM's context window. This in-context improvement is achieved through reward reflection, a textual summary of reward quality based on policy training statistics that enables automatic and targeted reward editing.

Figure 3 shows an example of a EUREKA zero-shot reward and the improvements accumulated during optimization. To ensure that EUREKA can scale its reward search to its full potential, it uses GPU-accelerated distributed reinforcement learning on IsaacGym to evaluate intermediate rewards, which provides up to three orders of magnitude improvement in policy learning speed, making EUREKA a general algorithm that scales naturally with increasing compute.

This is shown in Figure 2. The researchers are committed to open-sourcing all prompts, environments, and generated reward functions to facilitate further research on LLM-based reward design.

Method Overview

EUREKA can write reward functions autonomously. Let's take a look at how it is implemented.

EUREKA consists of three algorithmic components: 1) environment as context, which supports zero-shot generation of executable rewards; 2) evolutionary search, which iteratively proposes and refines reward candidates; and 3) reward reflection, which supports fine-grained reward improvement. A minimal sketch of how these components might fit together is shown below.
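The following is a minimal Python sketch of that outer loop, intended only to make the flow concrete. The helpers `query_llm`, `train_policy`, and `summarize_training` are hypothetical placeholders for the GPT-4 call, the simulator training run, and the reward-reflection step; none of these names or arguments come from the released code.

```python
# Minimal sketch of the Eureka-style outer loop (hypothetical helpers, not the released API).

def query_llm(prompt: str, num_samples: int) -> list[str]:
    """Placeholder: ask GPT-4 for `num_samples` candidate reward functions as code strings."""
    raise NotImplementedError

def train_policy(reward_code: str) -> dict:
    """Placeholder: train an RL policy in the simulator with this reward; return training statistics."""
    raise NotImplementedError

def summarize_training(stats: dict) -> str:
    """Placeholder: turn per-component reward statistics into a textual 'reward reflection'."""
    raise NotImplementedError

def eureka_loop(env_source: str, task_description: str,
                iterations: int = 5, samples_per_iter: int = 16) -> str:
    prompt = (f"Environment source code:\n{env_source}\n\n"
              f"Task: {task_description}\nWrite a reward function.")
    best_code, best_score = None, float("-inf")
    for _ in range(iterations):
        candidates = query_llm(prompt, num_samples=samples_per_iter)  # 1) sample reward candidates
        results = []
        for code in candidates:
            try:
                stats = train_policy(code)                            # 2) evaluate each candidate by RL training
            except Exception:
                continue                                              # non-executable rewards are simply discarded
            results.append((stats["task_score"], code, stats))
        if not results:
            continue
        score, code, stats = max(results, key=lambda r: r[0])         # keep the most promising reward
        if score > best_score:
            best_score, best_code = score, code
        reflection = summarize_training(stats)                        # 3) reward reflection as text feedback
        prompt += (f"\n\nPrevious best reward:\n{code}\n"
                   f"Training feedback:\n{reflection}\nPlease improve it.")
    return best_code
```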

Environment as context

The paper proposes feeding the raw environment code directly as context. With only minimal instructions, EUREKA can generate rewards zero-shot in different environments. An example of EUREKA's output is shown in Figure 3. EUREKA expertly combines the existing observation variables (e.g., fingertip positions) found in the provided environment code and produces valid reward code, all without any environment-specific prompt engineering or reward templates.
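As a rough illustration of what "environment as context" amounts to in practice, the sketch below assembles a prompt from a raw environment file plus a one-line task description. The file path and instruction wording are illustrative assumptions, not the prompts shipped with the open-source release.

```python
from pathlib import Path

# Hypothetical environment file; the real repository ships its own env files and prompts.
ENV_FILE = Path("envs/shadow_hand_spin.py")

def build_reward_prompt(env_file: Path, task: str) -> str:
    """Concatenate the raw environment source code with a minimal instruction.

    No reward templates or task-specific prompt engineering: the LLM sees the
    observation variables (e.g., fingertip positions) directly in the source.
    """
    env_source = env_file.read_text()
    return (
        "You are writing a reward function for the following environment:\n\n"
        f"{env_source}\n\n"
        f"Task description: {task}\n"
        "Return an executable Python reward function that uses the environment's "
        "observation variables."
    )

# Example usage (assumes the file above exists):
# prompt = build_reward_prompt(ENV_FILE, "Spin the pen to the target orientation.")
```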

However, the reward generated on the first attempt may not always be executable, and even when it is, it may be suboptimal. This raises the question: how can the suboptimality of single-sample reward generation be overcome effectively?

Evolutionary Search

Next, the paper describes how evolutionary search addresses the suboptimality problem mentioned above. In each iteration, EUREKA samples several independent outputs from the LLM (line 5 of Algorithm 1). Since the samples are independent and identically distributed, the probability that all reward functions in an iteration are faulty decreases exponentially as the sample size increases.
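The exponential claim follows from a back-of-the-envelope calculation: if each independent sample is faulty with probability q, then an entire batch of K samples is faulty with probability q^K. The numbers below are illustrative only and are not taken from the paper.

```python
# Illustrative only: probability that an entire batch of K independent samples is faulty.
q = 0.7  # assumed per-sample failure probability (not a figure from the paper)
for K in (1, 4, 16, 64):
    print(f"K={K:>2}: P(all faulty) = {q**K:.4f}")
# K= 1: P(all faulty) = 0.7000
# K= 4: P(all faulty) = 0.2401
# K=16: P(all faulty) = 0.0033
# K=64: P(all faulty) = 0.0000
```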

Reward Reflection

To provide more detailed and targeted reward analysis, the paper proposes building automated feedback that summarizes the policy training dynamics in text. Specifically, given that EUREKA's reward functions expose the individual components of the reward program (such as the reward components in Figure 3), the paper tracks the scalar values of all reward components at intermediate policy checkpoints throughout training.

Constructing this reward reflection is simple, but it matters because reward optimization is algorithm-dependent. That is, whether a reward function is effective depends on the specific choice of RL algorithm, and the same reward may behave very differently even under the same optimizer given different hyperparameters. By detailing how the RL algorithm optimizes the individual reward components, reward reflection enables EUREKA to produce more targeted reward edits and to synthesize reward functions that work better with the fixed RL algorithm.
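A minimal sketch of what such a reward reflection might look like, assuming the reward function exposes its components as named scalars tracked across checkpoints; the component names and statistics layout here are assumptions for illustration, not the paper's exact output.

```python
# Sketch: track per-component reward scalars at policy checkpoints and render a text summary.
# Component names and the statistics layout are illustrative assumptions.

def reward_reflection(component_history: dict[str, list[float]],
                      task_score_history: list[float]) -> str:
    lines = [f"Task score over checkpoints: {task_score_history}"]
    for name, values in component_history.items():
        lines.append(
            f"Component '{name}': start={values[0]:.3f}, end={values[-1]:.3f}, "
            f"max={max(values):.3f}"
        )
    return "\n".join(lines)

# Example usage with made-up training statistics:
history = {
    "distance_to_target": [1.2, 0.9, 0.4, 0.4],  # stagnates -> a candidate for re-weighting
    "rotation_bonus":     [0.0, 0.0, 0.0, 0.0],  # never fires -> likely a flaw in the component
}
print(reward_reflection(history, task_score_history=[0.1, 0.2, 0.3, 0.3]))
```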

Experiments

The experimental section provides a comprehensive evaluation of Eureka, covering its ability to generate reward functions, to solve new tasks, and to incorporate various forms of human input.

The experimental environments include 10 different robots and 29 tasks, all implemented in the IsaacGym simulator. The experiments use 9 original environments from IsaacGym (Isaac), covering robot morphologies ranging from quadruped, bipedal, quadcopter, and manipulator to dexterous robotic hands. In addition, the evaluation gains depth by incorporating 20 tasks from the Dexterity benchmark.

Eureka can produce superhuman-level reward functions. Across the 29 tasks, the reward functions generated by Eureka performed better than expert-written rewards on 83% of tasks, with an average improvement of 52%. In particular, Eureka achieved larger gains in the high-dimensional Dexterity benchmark environments.

Eureka's reward search evolves, so rewards improve over time. By combining large-scale reward search with detailed reward-reflection feedback, Eureka progressively produces better rewards and eventually surpasses human level.

Eureka can also generate novel rewards. The paper evaluates the novelty of Eureka's rewards by computing the correlation between Eureka rewards and human rewards on all Isaac tasks. As shown in the figure, Eureka mostly generates weakly correlated reward functions that outperform the human reward functions. Moreover, the paper observes that the harder the task, the less correlated the Eureka reward is. In some cases, Eureka rewards are even negatively correlated with human rewards, yet still perform significantly better.

For the robot's dexterous hand to keep spinning the pen without stopping, the manipulation policy needs to run through as many rotation cycles as possible. The paper solves this task as follows: (1) instruct Eureka to generate a reward function that reorients the pen to a random target configuration, and then (2) fine-tune this pre-trained policy with the Eureka reward to reach the desired sequence of pen-spinning configurations. As shown, the policy fine-tuned with Eureka quickly adapted and successfully spun the pen for many consecutive cycles. In contrast, neither the pre-trained policy nor a policy trained from scratch could complete even a single rotation cycle.
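A rough sketch of this two-stage curriculum, again using hypothetical helper functions in the style of the loop sketch above (the function names and arguments are assumptions, not the released training scripts):

```python
# Sketch of the pen-spinning curriculum described above (hypothetical helpers, illustrative only).

def train_reorientation_policy(reward_code: str):
    """Placeholder: stage 1 -- train a policy to reorient the pen to random target poses."""
    raise NotImplementedError

def fine_tune_for_sequence(policy, reward_code: str, target_sequence: list):
    """Placeholder: stage 2 -- fine-tune the pre-trained policy to hit a sequence of spin configurations."""
    raise NotImplementedError

def pen_spinning_curriculum(eureka_reward: str, spin_targets: list):
    pretrained = train_reorientation_policy(eureka_reward)                   # stage 1: random reorientation
    return fine_tune_for_sequence(pretrained, eureka_reward, spin_targets)   # stage 2: consecutive spins
```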

The paper also examines whether initializing from a human reward function is beneficial for Eureka. As shown, Eureka improves on and benefits from human rewards, regardless of their quality.

Eureka also enables a form of RLHF: it can modify rewards based on human feedback to guide the agent, step by step, toward safer and more human-like behavior. The example shows how Eureka, with a small amount of human feedback replacing the previous automatic reward reflection, teaches a humanoid robot to run upright.

Humanoid robot learns running gait with Eureka
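Mechanically, this human-in-the-loop variant can be thought of as substituting free-form human text for the automatically generated reward reflection in the same prompt slot. The sketch below is a hypothetical illustration of that substitution, not the paper's exact prompt format.

```python
# Sketch: substitute human textual feedback for the automatic reward reflection (illustrative only).

def build_feedback_prompt(base_prompt: str, previous_reward: str, feedback: str) -> str:
    return (
        f"{base_prompt}\n\n"
        f"Previous reward function:\n{previous_reward}\n\n"
        f"Feedback on the resulting behavior:\n{feedback}\n"
        "Please revise the reward function accordingly."
    )

# Example: the feedback string comes from a human operator instead of training statistics.
human_feedback = "The humanoid runs, but leans forward too much; reward a more upright torso."
prompt = build_feedback_prompt("<environment source and task description>",
                               "<previous reward code>", human_feedback)
```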

For more information, please refer to the original paper.
