With GPT-4, robots have learned to spin pens and rotate walnuts
Original article by Machine Heart
Editors: Zhang Qian, Chen Ping
With the combination of GPT-4 and reinforcement learning, what will the future of robotics look like?
When it comes to learning, GPT-4 is a formidable student. After digesting a large amount of human data, it has mastered a wide range of knowledge, and even gave mathematician Terence Tao inspiration during a conversation.
At the same time, it has become an excellent teacher, not only teaching book knowledge but also teaching robots to spin pens.
Building on GPT-4's ability to write code, Eureka has excellent reward-function design capabilities: its self-generated rewards outperform those of human experts on 83% of tasks. This allows robots to perform many tasks that were previously difficult, such as spinning pens, opening drawers and cabinets, throwing and catching balls, dribbling, and operating scissors. For now, though, all of this is done in a simulated environment.
Project link:
Code link:
Paper Overview
Large language models (LLMs) excel at high-level semantic planning for robotic tasks (e.g., Google's SayCan and RT-2 robots), but whether they can be used to learn complex, low-level manipulation tasks, such as pen spinning, remains an open question. Existing attempts either require substantial domain expertise to construct task prompts or learn only simple skills, falling far short of human-level dexterity.
Reinforcement learning (RL), on the other hand, has achieved impressive results in dexterity and many other areas (such as OpenAI's robotic hand that solves Rubik's Cubes), but it requires human designers to carefully construct reward functions that accurately codify the desired behavior and provide learning signals. Since many real-world reinforcement learning tasks offer only sparse rewards that are difficult to learn from, reward shaping is needed in practice to provide progressive learning signals. Although the reward function is critical, it is notoriously difficult to design. A recent survey found that 92% of the reinforcement learning researchers and practitioners surveyed said they relied on manual trial and error when designing rewards, and 89% said the rewards they designed were suboptimal and led to unintended behavior.
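To make the reward-shaping point concrete, here is a minimal sketch (not from the paper) contrasting a sparse reward with a hand-shaped one for a simple object-to-goal task; the term weights such as dist_scale are exactly the kind of trial-and-error choices the survey respondents describe.

```python
import numpy as np

def sparse_reward(obj_pos, goal_pos, tol=0.02):
    # Sparse reward: fires only when the object is within tol of the goal,
    # so early training receives almost no learning signal.
    return float(np.linalg.norm(obj_pos - goal_pos) < tol)

def shaped_reward(obj_pos, goal_pos, tol=0.02, dist_scale=5.0):
    # Hand-shaped reward: a dense distance term plus a success bonus.
    # Picking the terms and scales (dist_scale, the 10.0 bonus) is the
    # manual tuning that EUREKA tries to automate.
    dist = np.linalg.norm(obj_pos - goal_pos)
    return float(np.exp(-dist_scale * dist) + 10.0 * (dist < tol))
```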
Given that reward design is so important, we can't help but ask: is it possible to develop a universal reward-programming algorithm using state-of-the-art coding LLMs such as GPT-4? These LLMs excel at code writing, zero-shot generation, and in-context learning, and have greatly improved the performance of programming agents. Ideally, such a reward-design algorithm should have human-level reward-generation capability that scales to a wide range of tasks, automate the tedious trial-and-error process without human supervision, and remain compatible with human oversight to ensure safety and alignment.
This paper proposes an LLM-driven reward design algorithm, EUREKA (Evolution-driven Universal REward Kit for Agent). The algorithm achieves the following:
The performance of the reward design reaches human level in 29 different open-source RL environments, which include 10 distinct robot morphologies (quadruped, quadcopter, biped, manipulator, and several dexterous hands; see Figure 1). Without any task-specific prompts or reward templates, EUREKA's self-generated rewards outperformed those of human experts on 83% of the tasks and achieved an average normalized improvement of 52%.
Unlike previous work on LLM-assisted reward design (L2R), EUREKA requires no task-specific prompts, reward templates, or few-shot examples. In the experiments, EUREKA significantly outperforms L2R thanks to its ability to generate and refine free-form, expressive reward programs.
EUREKA's generality is due to three key algorithmic design choices: environment as context, evolutionary search, and reward reflection.
First, by taking the environment source code as context, EUREKA can zero-shot generate executable reward functions with the backbone coding LLM (GPT-4). EUREKA then substantially improves reward quality through evolutionary search: it iteratively proposes batches of reward candidates and refines the most promising ones within the LLM's context window. This in-context improvement is achieved through reward reflection, a text summary of reward quality based on policy-training statistics that enables automatic and targeted reward editing.
Figure 3 shows an example of a EUREKA zero-shot reward and the improvements accumulated during optimization. To ensure that EUREKA can scale its reward search to its full potential, it evaluates intermediate rewards with GPU-accelerated distributed reinforcement learning on IsaacGym, which provides up to three orders of magnitude speedup in policy learning, making EUREKA a broadly applicable algorithm that scales naturally with increased compute.
EUREKA can write reward algorithms autonomously; let's look at how it is implemented.
EUREKA consists of three algorithmic components: 1) environment as context, supporting zero-shot generation of executable rewards; 2) evolutionary search, iteratively proposing and refining reward candidates; 3) reward reflection, supporting fine-grained reward improvement. A sketch of how these pieces fit together is shown below.
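The following is a hypothetical Python sketch of how these three components fit together. It mirrors the paper's Algorithm 1 in spirit only; the callables query_llm, train_policy, and build_reflection are placeholders supplied by the caller, not functions from the released code.

```python
def eureka_loop(env_source, task_desc, query_llm, train_policy, build_reflection,
                iterations=5, samples=16):
    # Environment as context: the raw environment source code goes straight
    # into the prompt, together with a short task description.
    prompt = (f"Environment source code:\n{env_source}\n"
              f"Task: {task_desc}\nWrite a reward function.")
    best_code, best_score, best_stats = None, float("-inf"), None
    for _ in range(iterations):
        # Evolutionary search: sample several independent reward candidates.
        candidates = [query_llm(prompt) for _ in range(samples)]
        scored = []
        for code in candidates:
            try:
                # stats is a caller-defined dict; "task_score" is a placeholder key.
                stats = train_policy(code)       # RL training in simulation
                scored.append((stats["task_score"], code, stats))
            except Exception:
                continue                          # discard non-executable rewards
        if not scored:
            continue
        score, code, stats = max(scored, key=lambda item: item[0])
        if score > best_score:
            best_score, best_code, best_stats = score, code, stats
        # Reward reflection: append a text summary of per-component training
        # statistics so the next round can refine the most promising reward.
        prompt += f"\nBest reward so far:\n{best_code}\n{build_reflection(best_stats)}"
    return best_code
```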
Environment as context
The paper proposes providing the raw environment source code directly as context. With only minimal instructions, EUREKA can generate rewards zero-shot in different environments. An example of EUREKA's output is shown in Figure 3: EUREKA expertly combines the observation variables available in the provided environment code (e.g., fingertip positions) and produces valid reward code, all without any environment-specific prompt engineering or reward templates.
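For illustration, here is a plausible example of the kind of reward code EUREKA emits for a dexterous in-hand task (this is not the actual output shown in Figure 3). It assumes IsaacGym-style batched torch tensors and observation names such as fingertip_pos that the LLM would find in the environment source.

```python
import torch

def compute_reward(fingertip_pos: torch.Tensor,   # (num_envs, 3)
                   object_pos: torch.Tensor,      # (num_envs, 3)
                   object_angvel: torch.Tensor):  # (num_envs, 3)
    # Component 1: keep the fingertips close to the object.
    dist = torch.norm(fingertip_pos - object_pos, dim=-1)
    proximity_reward = torch.exp(-10.0 * dist)
    # Component 2: encourage spinning the object about the vertical axis.
    spin_reward = torch.clamp(object_angvel[:, 2], max=5.0)
    total_reward = proximity_reward + 0.5 * spin_reward
    # Returning the individual components as well is what later makes
    # reward reflection possible.
    return total_reward, {"proximity": proximity_reward, "spin": spin_reward}
```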
However, on the first attempt, the generated reward may not always be executable, and even when it is, it may be suboptimal. This raises the question: how can the suboptimality of single-sample reward generation be effectively overcome?
Next, the paper describes how evolutionary search addresses the suboptimal-solution problem mentioned above. In each iteration, EUREKA samples several independent outputs from the LLM (line 5 in Algorithm 1). Since the generations are independent and identically distributed, the probability that all reward functions in an iteration contain errors decreases exponentially as the number of samples increases.
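A quick back-of-the-envelope check of that sampling argument, with purely illustrative numbers (the per-sample success rate p is an assumption, not a figure from the paper):

```python
# If each independent LLM sample yields an executable reward with probability p,
# the chance that all K samples in an iteration fail is (1 - p) ** K.
p = 0.7
for K in (1, 4, 16):
    print(K, (1 - p) ** K)   # 0.3, 0.0081, ~4.3e-09
```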
To provide more detailed and targeted reward analysis, the paper proposes building automated feedback that summarizes the policy-training dynamics in text. Specifically, since EUREKA reward functions are composed of individual components (such as the reward components in Figure 3), the paper tracks the scalar values of all reward components at intermediate policy checkpoints throughout training.
Constructing this reward-reflection procedure is simple, but it matters because reward optimization is algorithm-dependent: whether a reward function is effective depends on the specific choice of RL algorithm, and the same reward may behave very differently even under the same optimizer when hyperparameters differ. By detailing how the RL algorithm optimizes each individual reward component, reward reflection enables EUREKA to produce more targeted reward edits and to synthesize reward functions that work better with the fixed RL algorithm. A minimal sketch of such a reflection summary is shown below.
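Here is a minimal sketch of what such a reflection summary might look like, assuming the per-component scalars returned by a reward like the one above were logged at intermediate policy checkpoints. The name build_reflection matches the placeholder used in the loop sketch earlier; it is not the authors' API.

```python
def build_reflection(stats):
    # stats is assumed to hold per-checkpoint task scores and a dict mapping
    # each reward component to its scalar values over training.
    lines = ["task_score: " + ", ".join(f"{v:.2f}" for v in stats["task_score_history"])]
    for name, values in stats["component_history"].items():
        lines.append(f"{name}: " + ", ".join(f"{v:.2f}" for v in values))
    lines.append("Some components may be flat, saturated, or dominating the total; "
                 "rewrite the reward function accordingly.")
    return "\n".join(lines)
```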
The experimental section provides a comprehensive evaluation of Eureka, including its ability to generate reward functions, to solve new tasks, and to integrate various forms of human input.
The experimental environments include 10 different robots and 29 tasks, all implemented in the IsaacGym simulator. The experiments use 9 original environments from IsaacGym (Isaac), covering robot morphologies from quadruped, biped, quadcopter, and manipulator to dexterous robotic hands. In addition, the evaluation is deepened by incorporating 20 tasks from the Dexterity benchmark.
For more information, please refer to the original paper.