GPT-4 doesn't know when it's wrong! A new flaw in LLMs exposed: self-correction succeeds only 1% of the time, and LeCun and Marcus marvel that the more it corrects, the more wrong it gets
Original source: Shin Ji Yuan
A major flaw in large models has been exposed, drawing attention from both LeCun and Marcus!
Professor Subbarao Kambhampati, a co-author of the paper, has long worked on AI reasoning; in September he published a paper that went so far as to deny GPT-4 any reasoning and planning ability outright.
Beyond him, researchers at DeepMind and UIUC have also recently questioned LLMs' ability to "self-correct" on reasoning tasks.
The paper even calls on scholars doing related research to take their work seriously, and not to tell a large model the correct answer and then have it perform so-called "self-correction".
Because if the model does not know the correct answer on its own, its output quality only deteriorates after it "self-corrects".
GPT-4 "self-correcting", the output is worse
The first paper focuses on GPT-4: it asks GPT-4 to propose a solution to a graph coloring problem and then has GPT-4 "self-correct" its own solution.
At the same time, the authors bring in an external evaluation system to assess both GPT-4's direct output and its output after a "self-correction" loop.
Surprisingly, accuracy in the "self-correction" mode drops significantly (the second bar below), completely contrary to the intent of self-correction!
This is because even when GPT-4 happens to guess the correct coloring, its "self-correction" convinces it that the correct answer is flawed, and it then replaces that correct answer.
When the feedback comes from an external verifier, however, the prompts generated during "self-correction" do improve output quality (bars 3-5 in the figure above).
In summary, on the graph coloring task, GPT-4's independent "self-correction" hurts output performance, because GPT-4 cannot verify whether an answer is correct.
However, if a sound external verification process is provided, the "self-correction" prompts GPT-4 generates can indeed improve performance.
Another paper examined large language models' ability to "self-correct" from the perspective of planning tasks, and its results were similar to those of the first paper.
** "Coloring Questions" performed poorly and LLM could not independently verify correct answers**
Research Design Framework
The "coloring problem" is a very classic reasoning problem, even if it is not difficult, the answers are diverse enough, and the correctness of the answers is easy to verify.
The results of diversity make it difficult to cover the entire training data of LLM, and the possibility of contamination of LLM training data is avoided as much as possible.
These reasons make the "coloring problem" very suitable for studying LLM's reasoning ability, and it is also convenient to study LLM's ability to "self-correct" in reasoning.
The researchers built their own dataset, using GrinPy to handle common graph manipulations. Each graph is constructed with the Erdős–Rényi method (p = 0.4).
Once a correct answer is found, it is compiled into the standard DIMACS format, with a comment recording its precomputed chromatic number.
For the experiments, the researchers generated 100 instances, each with an average of 24 edges, with node counts ranging from 10 to 17, a range found empirically to give sufficient variability.
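To make the setup concrete, here is a minimal Python sketch of this style of dataset construction, assuming NetworkX for graph generation (the paper uses GrinPy) and a simple backtracking search for the exact chromatic number; the code is illustrative, not the authors' own.

```python
import random
import networkx as nx

def k_colorable(graph, k, order, colors, idx=0):
    """Backtracking check: can the graph be properly colored with k colors?"""
    if idx == len(order):
        return True
    v = order[idx]
    used = {colors[u] for u in graph.neighbors(v) if u in colors}
    for c in range(k):
        if c not in used:
            colors[v] = c
            if k_colorable(graph, k, order, colors, idx + 1):
                return True
            del colors[v]
    return False

def chromatic_number(graph):
    """Smallest k that admits a proper coloring (fine for 10-17 node graphs)."""
    order = list(graph.nodes)
    for k in range(1, len(order) + 1):
        if k_colorable(graph, k, order, {}):
            return k

def to_dimacs(graph, chi):
    """Standard DIMACS output with the precomputed chromatic number in a comment."""
    lines = [f"c chromatic number {chi}",
             f"p edge {graph.number_of_nodes()} {graph.number_of_edges()}"]
    lines += [f"e {u + 1} {v + 1}" for u, v in graph.edges]  # DIMACS is 1-indexed
    return "\n".join(lines)

n = random.randint(10, 17)          # node count in the paper's range
g = nx.erdos_renyi_graph(n, 0.4)    # Erdős–Rényi graph with p = 0.4
print(to_dimacs(g, chromatic_number(g)))
```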
The pipeline used by the researchers is shown in Figure 1 below, which includes the LLM's first reply, the backprompt in response to it, and the final correct coloring.
Prompt Generator:
This prompt generator takes a DIMACS instance, translates each edge into a sentence, and then wraps the whole in a set of generic instructions to construct a natural language prompt.
The researchers deliberately minimized the differences between prompts for different instances, to limit the problem-specific information they leak to the LLM. Examples of the various prompt types can be found in the appendix.
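A minimal sketch of what such a prompt generator might look like; the instruction wording and edge phrasing are assumptions, not the prompts actually used in the paper.

```python
def dimacs_to_prompt(dimacs_text):
    """Turn DIMACS edge lines into sentences wrapped in generic instructions."""
    sentences = []
    for line in dimacs_text.splitlines():
        if line.startswith("e "):          # DIMACS edge line: "e u v"
            _, u, v = line.split()
            sentences.append(f"Vertex {u} is connected to vertex {v}.")
    instructions = (
        "Color the following graph with as few colors as possible so that no "
        "two connected vertices share a color. Answer with one line per vertex "
        "in the form 'vertex: color'."
    )
    return instructions + "\n" + "\n".join(sentences)
```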
Large Language Models:
GPT-4, currently the most advanced model, is called via the OpenAI API.
The researchers provide a system role: "You are a constraint satisfaction solver that solves various CSPs (constraint satisfaction problems)."
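For illustration, a hedged sketch of how such a call might look with the openai Python package; the model identifier, temperature, and helper function are assumptions, not the paper's exact settings.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt4(user_prompt):
    """Send one prompt to GPT-4 with the CSP-solver system role described above."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep outputs as deterministic as possible for evaluation
        messages=[
            {"role": "system",
             "content": "You are a constraint satisfaction solver that solves "
                        "various CSPs (constraint satisfaction problems)."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```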
Backprompt Generation
In verification mode, the LLM receives a different type of prompt.
Besides the standard instructions, it contains only a description of the graph and a proposed coloring. The LLM's task is to verify correctness, optimality, and that every vertex has been assigned a color.
If the reply identifies a set of edges whose endpoints share a color, the coloring is wrong.
For comparison, the researchers also built a validator that lists every such contradictory edge.
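A minimal sketch of such an external validator, assuming colorings are stored as a vertex-to-color dictionary (the representation and names are illustrative):

```python
def violated_edges(edges, coloring):
    """Return every edge whose two endpoints received the same color."""
    return [(u, v) for u, v in edges
            if u in coloring and coloring[u] == coloring.get(v)]

def is_proper_coloring(edges, coloring):
    """A coloring is proper exactly when no edge is violated."""
    return not violated_edges(edges, coloring)
```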
Since the LLM's responses are also in natural language, the researchers first translate them into a format that is easy to analyze. To keep this process consistent, the initial prompts describe the precise output format the model must follow. The response is then evaluated for correctness.
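A small sketch of this kind of response parsing, assuming a one-assignment-per-line "vertex: color" format (the paper's exact required format is not reproduced here):

```python
def parse_coloring(reply_text):
    """Parse lines of the form 'vertex: color' into a dictionary."""
    coloring = {}
    for line in reply_text.splitlines():
        if ":" in line:
            vertex, color = line.split(":", 1)
            coloring[vertex.strip()] = color.strip()
    return coloring
```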
Verification
To gain a deeper understanding of LLMs' verification abilities, the researchers studied how well they identify errors in proposed colorings.
Intuitively, such errors should be easy to spot: if the two vertices of an edge share a color, return that edge. Algorithmically, it suffices to iterate over all edges and compare the color of each vertex with the color of the vertex it is connected to.
The researchers used the same analysis pipeline, but built a new domain they call color_verification. The LLM is guided to check a coloring's correctness, its optimality, and whether every vertex has been assigned a color.
If the coloring is incorrect, the LLM is instructed to list the errors, that is, if two connected nodes share a color, to return that edge as the error. No backprompts are given. Five types of coloring schemes were evaluated (a sketch of two of their constructions follows the list):
Correct: an error-free optimal coloring generated by an iterative, randomized greedy algorithm (using the precomputed chromatic number to guarantee optimality).
Ablated: a correct coloring in which one random node has been recolored to the color of one of its neighbors.
Non-optimal: a correct coloring in which one color class is randomly chosen and recolored into a new hue.
Random: colors assigned completely at random, with the number of distinct colors equal to the graph's chromatic number.
LLM: a coloring randomly selected from the outputs the LLM generated in the earlier experiments.
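For illustration, a hedged sketch of how the "ablated" and "random" sets could be constructed, reusing the NetworkX graphs and dictionary-based colorings from the earlier sketches; the function names are hypothetical, not the authors' generation code.

```python
import random
import networkx as nx  # graphs as in the dataset-construction sketch above

def ablate(graph, coloring):
    """Introduce one violation: copy a neighbor's color onto a random node."""
    perturbed = dict(coloring)
    nodes_with_neighbors = [v for v in graph.nodes if graph.degree(v) > 0]
    v = random.choice(nodes_with_neighbors)
    perturbed[v] = coloring[random.choice(list(graph.neighbors(v)))]
    return perturbed

def randomize(graph, coloring):
    """Reassign colors uniformly at random, drawing from the same palette."""
    palette = list(set(coloring.values()))
    return {v: random.choice(palette) for v in graph.nodes}
```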
Conclusion
When the researchers ran the same instances, but this time backprompted using feedback generated by the same language model acting as validator, performance dropped dramatically: only one of the 100 instances was answered correctly.
Backprompting with an external, sound validator may look more effective at first: the proportion of correctly answered instances approaches 40%.
But if that meant GPT-4 were listening, improving, and reasoning from the feedback, one would expect more accurate backprompts to yield better results.
In this domain, however, the raw scores (see Figure 2 above) do not support that conclusion.
LLM Verification Capability
The researchers tested GPT-4's ability to verify graph colorings on the same instances, generating five different types of coloring for each instance.
The striking result exactly mirrors the self-correction result above: the model is almost never willing to mark an answer as correct. Of 100 optimal colorings, it agreed that only 2 were correct.
Of the full set of 500 colorings, 118 of which are correct, it claimed only 30 were correct, and of those 30, only 5 actually were.
Overall, the pattern holds: in fewer than 10% of cases did the LLM respond with "correct", "non-optimal", or "missing assignment", and in those cases its behavior appears somewhat random.
In roughly a quarter of the instances, it responded with a "this is incorrect" verdict whose explanation matched reality, and it did so only by naming at most one edge, minimizing its chances of misstating something.
LLM self-criticism: performance drops instead of improving
In the paper submitted on the 12th, the authors reach the same conclusion.
Whether on planning, simple arithmetic, or logic, GPT-4, the current state-of-the-art large model, is not fully up to the task.
Many researchers have explored and tried to improve this, including letting LLMs learn strategies such as self-iteration and self-verification to boost performance.
As a result, many in the industry were optimistic that large models could still be saved!
However, the classical complexity of the reasoning task is beside the point for large models, because an LLM performs approximate retrieval rather than precise reasoning.
In a paper posted to arXiv on the 12th, ASU researchers systematically evaluated and analyzed LLMs' ability to self-critique on planning tasks and to improve iteratively.
In the study, the authors propose a planning system consisting of a generator LLM and a validator LLM.
The researchers then ran experiments in the Blocksworld planning domain, empirically evaluating the following (a minimal sketch of the generate-and-critique loop follows this list):
the impact of self-criticism on the plan-generation performance of the overall LLM+LLM system;
the performance of the validator LLM relative to ground-truth verification;
how the level of feedback given when critiquing LLM generations affects overall system performance.
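To make the architecture concrete, here is a minimal Python sketch of such a generate-and-critique loop; the function interfaces, round limit, and stopping criterion are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of an LLM+LLM planning loop: a generator proposes a plan and a
# validator (another LLM or an external verifier) accepts it or returns feedback.
def plan_with_critique(problem, generate, verify, max_rounds=5):
    """generate(problem, feedback) -> candidate plan;
    verify(plan) -> (is_valid, feedback)."""
    feedback = None
    plan = None
    for _ in range(max_rounds):
        plan = generate(problem, feedback)   # propose (or revise) a plan
        is_valid, feedback = verify(plan)    # critique it
        if is_valid:
            return plan
    return plan  # best effort after the round budget is exhausted
```

The paper's finding, summarized below, is that plugging an LLM in as the verify step makes this loop perform worse than using a sound external validator.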
The results show that self-criticism reduces the LLM's plan-generation performance compared with using a reliable external validator.
The validator LLM's binary classification accuracy is only 61%, with a large number of false positives (incorrect plans judged to be correct).
About the Author
Subbarao Kambhampati
Subbarao Kambhampati is a professor of computer science at Arizona State University. His research addresses fundamental problems in planning and decision-making, motivated in particular by the challenges of human-aware AI systems.