GPT-4 doesn't know when it's wrong! A new flaw in LLMs exposed: self-correction succeeds only 1% of the time, and LeCun and Marcus marvel that the more it corrects, the more it errs

Does GPT-4 simply not know when it is making a mistake? New research finds that on reasoning tasks, self-correction cannot rescue an LLM from degraded performance, drawing the attention of leading AI figures LeCun and Marcus.

Original source: Xin Zhi Yuan

Image source: Generated by Unbounded AI

A major flaw in large models has been exposed, catching the attention of both LeCun and Marcus!

In a reasoning experiment, self-correction (the technique claimed to improve accuracy) "improved" the accuracy rate from 16% down to 1%!

Simply put, an LLM cannot improve its output through self-correction on reasoning tasks, unless it already knows the correct answer during the self-correction process.

Two papers by ASU researchers refute the "self-correction" method proposed in many previous studies, namely the claim that letting a large model self-correct its own output improves output quality.

Paper Address:

Paper Address:

Professor Subbarao Kambhampati, a co-author of the papers, has long studied AI reasoning ability; in September he published a paper that went as far as denying GPT-4's reasoning and planning abilities outright.

Paper Address:

Beyond this professor, researchers at DeepMind and UIUC have also recently questioned LLMs' ability to "self-correct" on reasoning tasks.

That paper even calls on all scholars doing related research to approach such work rigorously, and not to tell the large model the correct answer before letting it perform so-called "self-correction".

Because if the model does not know the correct answer, the output quality will deteriorate after the model "self-corrects".

Next, let's take a look at these two latest papers.

GPT-4 "self-correcting", the output is worse

The first paper focuses on GPT-4: it asks GPT-4 to solve graph coloring problems and then has GPT-4 "self-correct" its own solutions.

At the same time, the authors introduce an external evaluation system to assess both GPT-4's direct output and its output after a "self-correction" loop.

The experiments show that GPT-4 solves fewer than 20% of the coloring instances correctly, which is perhaps unsurprising.

But surprisingly, accuracy in "self-correction" mode drops sharply (the second bar below), completely contrary to the intent of self-correction!

According to the authors, this seemingly counterintuitive result has a simple explanation: GPT-4 is also terrible at verifying correct answers!

Even when GPT-4 happens to produce a correct coloring, its "self-correction" convinces it that the correct answer is flawed, and it replaces it.

Further experiments found that GPT-4 does improve its solution when an external validator provides verifiably correct feedback on the coloring it proposed.

In that case, the correction prompts do improve output quality (bars 3-5 of the figure above).

In summary, on the graph coloring task, GPT-4's unaided "self-correction" hurts output performance, because GPT-4 cannot verify whether an answer is correct.

However, when a sound external verification process is provided, GPT-4's revisions can indeed improve performance.

The other paper examines large language models' ability to "self-correct" from the perspective of planning tasks, with results similar to the first paper's.

Moreover, the researchers found that what really improved the accuracy of the output was not the "self-correction" of the LLM, but the feedback from an external independent validator.

In the end, an LLM has no way to verify answers on its own and must rely on the feedback of an external validator to "self-correct" effectively.

** "Coloring Questions" performed poorly and LLM could not independently verify correct answers**

Research Design Framework

The "coloring problem" is a very classic reasoning problem, even if it is not difficult, the answers are diverse enough, and the correctness of the answers is easy to verify.

This diversity of solutions makes it unlikely that they are all covered by the LLM's training data, minimizing the risk of training-data contamination.

These properties make graph coloring well suited to studying LLMs' reasoning ability, and convenient for studying their ability to "self-correct" while reasoning.

The researchers built their own dataset, using GrinPy to handle common graph manipulations. Each graph is constructed with the Erdős–Rényi method (p = 0.4).

Once a correct answer is found, the instance is compiled into standard DIMACS format, with a comment recording its precomputed chromatic number.

For the experiments that follow, the researchers generated 100 instances with an average of 24 edges each, spread over node counts from 10 to 17, a range found empirically to provide enough variability.
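For readers who want to reproduce the setup, a minimal sketch of this instance-generation step might look like the following. It assumes NetworkX for graph construction; note that the greedy coloring here only gives an upper bound on the chromatic number, whereas the paper records the exact precomputed value.

```python
# Minimal sketch of the instance-generation step, assuming NetworkX is available.
# The paper stores the exact chromatic number; greedy_color only gives an upper
# bound, so the comment value below is illustrative rather than faithful.
import networkx as nx

def make_instance(n_nodes: int, p: float = 0.4, seed: int = 0) -> str:
    """Build an Erdős–Rényi graph and serialize it in DIMACS edge format."""
    g = nx.erdos_renyi_graph(n_nodes, p, seed=seed)
    color_bound = max(nx.greedy_color(g).values()) + 1
    lines = [
        f"c chromatic number (upper bound here): {color_bound}",
        f"p edge {g.number_of_nodes()} {g.number_of_edges()}",
    ]
    lines += [f"e {u + 1} {v + 1}" for u, v in g.edges()]  # DIMACS vertices are 1-indexed
    return "\n".join(lines)

print(make_instance(12))
```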

The pipeline the researchers used is shown in Figure 1 below, including the LLM's first reply, the backprompt issued in response, and the final correct coloring.

### Architecture for Iterative Backprompting

Prompt Generator:

This prompt generator takes a DIMACS instance, translates each edge into a sentence, and then wraps the whole in a set of generic instructions to construct a natural language prompt.

The researchers intentionally kept the differences between instance prompts small, to minimize the problem-specific information leaked to the LLM. Examples of the various prompt types can be found in the paper's appendix.
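As a rough illustration of that translation step (the exact instruction wording lives in the paper's appendix and is not reproduced here), the conversion from DIMACS edges to a natural-language prompt could be sketched as:

```python
# Hypothetical sketch of the prompt generator: every DIMACS "e u v" line becomes
# a sentence, wrapped in generic instructions. The wording is illustrative only.
def dimacs_to_prompt(dimacs: str) -> str:
    sentences = []
    for line in dimacs.splitlines():
        if line.startswith("e "):
            _, u, v = line.split()
            sentences.append(f"Vertex {u} is connected to vertex {v}.")
    instructions = (
        "Color the following graph so that no two connected vertices share a color, "
        "using as few colors as possible. Answer with one line per vertex in the "
        "form 'vertex: color'."
    )
    return instructions + "\n" + "\n".join(sentences)
```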

Large Language Models:

GPT-4, currently the most advanced model, is called via the OpenAI API.

The researchers provide a system role: "You are a constraint satisfaction solver that solves various CSPs (constraint satisfaction problems)."
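A hedged sketch of how such a call might look with the OpenAI Python client follows; the model name, temperature, and helper name are assumptions rather than details taken from the paper.

```python
# Sketch only: client setup, model name, and decoding parameters are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic decoding, an assumption for reproducibility
        messages=[
            {"role": "system",
             "content": "You are a constraint satisfaction solver that solves "
                        "various CSPs (constraint satisfaction problems)."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```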

Backprompt Generation

In verification mode, the LLM receives a different type of prompt.

Besides the standard instructions, it contains only a description of the graph and a proposed coloring. The model's task is to verify correctness, optimality, and that every vertex has been assigned a color.

If the proposed coloring contains contradictory edges (endpoints sharing a color), it is wrong.

For point-by-point comparison, the researchers also built their own validator, which lists every contradictory edge.

Since the LLM's responses are also in natural language, the researchers first translate them into an easy-to-analyze format. To make this step more consistent, the initial prompts describe the precise output format the model must follow. Each response is then evaluated for correctness.
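Given the "vertex: color" reply format assumed in the earlier prompt sketch (the paper prescribes its own format, which may differ), the parsing step might amount to something like this:

```python
# Hypothetical parser for replies of the form "3: red" or "vertex 3: red".
def parse_coloring(reply: str) -> dict[int, str]:
    coloring: dict[int, str] = {}
    for line in reply.splitlines():
        if ":" not in line:
            continue  # skip chatty lines that are not vertex assignments
        left, right = line.split(":", 1)
        digits = "".join(ch for ch in left if ch.isdigit())
        if digits:
            coloring[int(digits)] = right.strip()
    return coloring
```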


Verification

To better understand the LLM's verification capabilities, the researchers studied how well it identifies errors in proposed colorings.

Intuitively, these errors should be easy to spot: if the two vertices of an edge share a color, report that edge. Algorithmically, it is enough to iterate over all edges and compare the colors of each edge's two endpoints.
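That check is trivial to implement as a sound external verifier; a minimal sketch, assuming edges as pairs and a coloring dictionary as in the parser above:

```python
# Minimal sound verifier: report every edge whose endpoints share a color.
def find_violations(edges: list[tuple[int, int]],
                    coloring: dict[int, str]) -> list[tuple[int, int]]:
    return [(u, v) for u, v in edges
            if u in coloring and v in coloring and coloring[u] == coloring[v]]

# An empty result means the coloring is proper; a non-empty result can be turned
# directly into the backprompt feedback discussed above.
```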

The researchers used the same analysis pipeline but built a new domain they call color_verification. The LLM is instructed to check the coloring's correctness, its optimality, and whether every vertex has been assigned a color.

If the coloring is incorrect, the model is instructed to list the errors, that is, to return any edge whose two connected nodes share a color. No backprompts are given.

The researchers used the same graph instances as before, but generated five kinds of colorings to test the model (a sketch of some of these perturbations follows the list):

Correct: an error-free, optimal coloring produced by an iterative, randomized greedy algorithm (the precomputed chromatic number is used to ensure optimality).

Ablated: a correct coloring in which one random node is recolored with the color of one of its neighbors, introducing a violation.

Non-optimal: a correct coloring in which a randomly chosen part of one color class is recolored with a new, unused color, so the coloring stays proper but is no longer optimal.

Random: colors assigned completely at random, with the number of distinct colors equal to the graph's chromatic number.

LLM: a coloring randomly selected from the LLM outputs generated in the earlier experiments.
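As promised above, here is a hedged sketch of the two most mechanical perturbations; the function names and data shapes are assumptions for illustration, not the authors' code.

```python
# Illustrative perturbations of a correct coloring; adjacency maps each node to
# its neighbor list, coloring maps node -> color.
import random

def ablate(adjacency: dict[int, list[int]],
           coloring: dict[int, str]) -> dict[int, str]:
    """Recolor one random node with a neighbor's color, introducing a violation."""
    broken = dict(coloring)
    node = random.choice([n for n, nbrs in adjacency.items() if nbrs])
    broken[node] = coloring[random.choice(adjacency[node])]
    return broken

def random_coloring(nodes: list[int], chromatic_number: int) -> dict[int, int]:
    """Assign colors uniformly at random, drawing from as many colors as the chromatic number."""
    return {n: random.randrange(chromatic_number) for n in nodes}
```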

Conclusion

When the LLM is prompted once, its answer evaluated, and the next instance taken up without any backprompts, the baseline score is 16%.

When the researchers ran the same instances but this time backprompted using feedback generated by the same language model acting as its own validator, performance dropped dramatically: only one of the 100 instances was answered correctly.

Backprompting with a sound external validator looks more effective at first glance.

The number of correctly answered instances climbs to nearly 40 percent. But if that meant GPT-4 were truly listening to, reasoning over, and improving from the feedback, the researchers would expect more informative backprompts to yield better results.

In this domain, however, the raw scores (see Figure 2 above) do not support that interpretation.
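Putting the pieces together, the backprompting conditions can be pictured with a loop like the one below. The helper names (dimacs_to_prompt, query_llm, parse_coloring, find_violations) refer to the earlier sketches and are assumptions, not the paper's code; the iteration cap is likewise illustrative.

```python
# Sketch of the iterative backprompting loop under a sound external validator.
# Swapping the validator for another GPT-4 call gives the self-critique setting.
def solve_with_backprompts(dimacs: str, edges, max_rounds: int = 15):
    prompt = dimacs_to_prompt(dimacs)
    coloring = {}
    for _ in range(max_rounds):
        coloring = parse_coloring(query_llm(prompt))
        violations = find_violations(edges, coloring)
        if not violations:
            return coloring  # accepted by the sound validator
        feedback = ", ".join(f"vertices {u} and {v} share a color" for u, v in violations)
        prompt = (dimacs_to_prompt(dimacs)
                  + f"\nYour previous coloring was wrong: {feedback}. Try again.")
    return coloring  # give up after max_rounds attempts
```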

LLM Verification Capability

The researchers tested GPT-4's ability to verify graph colorings on the same instances, generating the five types of colorings described above for each instance.

The most striking result matches the self-correction finding above: the model is extremely reluctant to mark any answer as correct. Out of 100 optimal colorings, it agrees that only 2 are correct.

Across the full collection of 500 colorings, 118 of which are correct, it claims that only 30 are correct; of those 30, only 5 actually are.
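To put those counts in more familiar terms (derived arithmetic from the numbers above, not figures reported verbatim in the paper):

```python
# Treating "claims correct" as a positive prediction over the 500 colorings:
precision = 5 / 30    # ≈ 0.17: of the colorings GPT-4 accepted, few were actually correct
recall    = 5 / 118   # ≈ 0.04: of the truly correct colorings, almost none were accepted
```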

Overall this pattern holds: in fewer than 10% of cases the LLM responds with "correct", "non-optimal", or "missing assignment", and in those cases its behavior looks essentially random.

In about a quarter of the instances, it returns a "this is incorrect" verdict whose explanation matches reality, and it does so by pointing out at most one edge, minimizing its chance of misstating something.

The results are shown in Table 2 above. Note that as the error rate of a coloring increases, the rate of hallucinated errors decreases; that is, when more edges are violated, the model is more likely to point at a place where something genuinely went wrong.

LLM self-critique: performance decreases rather than increases

In the paper submitted on the 12th, the authors reach the same conclusion.

Whether the task is planning, simple arithmetic, or logic, GPT-4, the current state-of-the-art large model, is not fully up to it.

Many researchers have explored ways to improve it, including strategies such as self-iteration and self-verification that let the LLM boost its own performance.

As a result, many in the industry were optimistic that large models could still be saved!

However, the classical computational complexity of a reasoning task says little about how a large model will behave on it, because an LLM performs approximate retrieval rather than exact reasoning.

In a paper posted to arXiv on the 12th, ASU researchers systematically evaluate and analyze LLMs' ability to self-critique on planning tasks and to improve iteratively.

In the study, the authors set up a planning system consisting of a generator LLM and a verifier LLM.

The GPT-4 generator is responsible for producing candidate plans, while the GPT-4 verifier is responsible for checking a plan's correctness and providing feedback.
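Conceptually, the LLM+LLM system can be pictured as the loop below. The prompts, the `llm` callable (for example the query_llm sketch earlier), and the iteration cap are all assumptions for illustration, not the authors' implementation.

```python
# Sketch of the generator/verifier loop for Blocksworld planning, assuming both
# roles are served by GPT-4 with different prompts.
def plan_with_self_critique(problem: str, llm, max_rounds: int = 10) -> str:
    plan = llm(f"Produce a plan for this Blocksworld problem:\n{problem}")
    for _ in range(max_rounds):
        verdict = llm(
            "You are a plan verifier. Reply 'VALID' if the plan solves the problem, "
            f"otherwise describe the error.\nProblem:\n{problem}\nPlan:\n{plan}"
        )
        if verdict.strip().startswith("VALID"):
            return plan  # the critic accepts; note this may be a false positive
        plan = llm(
            f"The previous plan was rejected with feedback: {verdict}\n"
            f"Produce a corrected plan for:\n{problem}"
        )
    return plan
```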

The researchers then ran experiments in the Blocksworld planning domain and empirically evaluated:

  • the effect of self-critique on the plan-generation performance of the overall LLM+LLM system;

  • the performance of the verifier LLM relative to ground-truth verification;

  • how the level of feedback detail provided while critiquing the generator LLM affects overall system performance.

The results show that self-critique reduces the LLM's plan-generation performance compared with using a reliable external verifier.

The degradation can be attributed directly to the verifier LLM's poor results: it produces a large number of false positives, which seriously undermine the system's reliability.

The verifier LLM's binary classification accuracy is only 61%, with a large number of false positives (judging incorrect plans to be correct).

In addition, comparing different levels of feedback detail shows that detail has little impact on plan-generation performance.

Overall, this systematic investigation provides preliminary evidence questioning the effectiveness of LLMs as verifiers for planning tasks within an iterative, self-critique framework.

About the Author

Subbarao Kambhampati

Subbarao Kambhampati is a professor of computer science at Arizona State University. He studies fundamental problems in planning and decision making, motivated in particular by the challenges of human-aware AI systems.

Resources:
