Large models can't replace coders! Princeton's surprising finding: GPT-4 scores 0% on real GitHub programming problems

Article source: Shin Ji Yuan

AI coding tools like ChatGPT are on the rise, and Stack Overflow is laying off staff again! Yet researchers at Princeton and the University of Chicago found that GPT-4 resolved 0% of real-world GitHub issues.

Stack Overflow has already been knocked flat by ChatGPT!

Because coders have flocked to ChatGPT and GitHub Copilot, Stack Overflow announced today that it is laying off more than 100 employees, almost a third of its workforce.

So, are AI coding tools like ChatGPT really about to upend the entire industry?

A recent study by Princeton and Chicago, however, found that it is not so easy for an LLM to work as a coder.

Paper Address:

Faced with 2,294 real GitHub issues, GPT-4's resolve rate on a random sample turned out to be 0%!

Even the best model, Claude 2, solves only 1.96% of them.

Will coders lose their jobs to ChatGPT? The answer, at least for now, is absolutely not.

Adapt or perish

Stack Overflow, developers' favorite code Q&A site, had been doing well: last year it went on a hiring spree that doubled the company's workforce to 540.

However, everything has changed since OpenAI released ChatGPT last November.

The help an AI chatbot provides is more specific than a forum post from five years ago. With an LLM, developers can instantly get corrected code, optimization suggestions, and a line-by-line explanation of what the code does.

While LLMs don't give 100% reliable answers, code can be validated immediately simply by running it in an IDE (integrated development environment), which makes writing code an ideal use case for ChatGPT.

As a result, Stack Overflow's traffic has dropped sharply, and AI programming tools such as ChatGPT and the GPT-4-powered GitHub Copilot have become coders' new go-to.

Today, CEO Prashanth Chandrasekar announced that Stack Overflow has laid off more than a hundred employees, or 28% of its workforce.

The CEO's explanation for the layoffs is that, under macroeconomic pressure, Stack Overflow is trying to get on the path to profitability while continuing to ship product innovations.

**Crossing the river and tearing down the bridge?**

The biggest irony of ChatGPT's impact on Stack Overflow is that the power of large language models comes largely from scraping sites like Stack Overflow.

What happens if large language models drain this data without giving anything back, and all the data sources are driven out of business?

Many tech companies already face a looming problem: fewer programmers means less human-generated data.

How do you train new AI models without up-to-date data?

**Want to use our data? Pay up**

Stack Overflow, of course, isn't sitting still; it has chosen two ways to save itself.

One is to develop its own AI coding tool, OverflowAI, and the other is to seek partnerships directly with tech companies like OpenAI, which use Stack Overflow's data to build AI models.

OpenAI is developing web crawler controls for ChatGPT so that data from sites like Stack Overflow can't be crawled.

The CEO said Stack Overflow has made its stance clear: whoever wants to use our data to train an LLM has to pay.

The CEO believes that sites like Stack Overflow are critical to the development of large language models, which need to be trained on new knowledge in order to advance.

Prashanth Chandrasekar, CEO of Stack Overflow

LLMs replacing coders? It's still early days

So, can large language models really replace coders?

The Princeton and Chicago teams found that it wasn't that easy!

In the new paper, the researchers propose a framework, SWE-bench, to assess the ability of large models to solve 2,294 real-world GitHub issues.

They found that leading large models like GPT-4 and Claude 2 solved fewer than 5% of the real issues.

To be specific, GPT-4's resolve rate on a random sample of GitHub issues was 0%, while the best model, Claude 2, resolved only 1.96%.

What's more, when BM25 was used to retrieve the relevant code files for each issue, only 23% of the patches written by Claude 2 applied cleanly to the repository, and only ~1% actually resolved the issue.

In addition, performance varies across models when solving issues from the 12 popular Python libraries.

That GPT-4 scored so poorly is genuinely surprising; after all, many people have long regarded it as a programming powerhouse.

But it pays to see AI's real strength clearly, rather than getting anxious over headline scores.

Some netizens said this is the best answer yet to the question of whether coders will be put out of work by AI programming.

Finally someone made a realistic dataset for code models; HumanEval was just a LeetCode-style interview for LLMs. We all know that's the wrong way to measure human engineers. Less than 4% sounds about right, as large models are still far from fully autonomous.

So what do SWE-bench's results really say about the capabilities of large models?

SWE-bench: Designed for coding models

In this study, the authors note that many existing benchmarks for measuring coding ability have become saturated and no longer reflect the true strength of large models.

For example, in HumanEval the challenge problems are too simple: an LLM needs only a few lines of code to solve a self-contained problem.

However, software engineering is not so simple in reality.

Fixing a bug may require navigating a huge repository, understanding the relationships between functions in different files, or finding a small error in intricate code.

Inspired by this, researchers from Princeton and Chicago introduced SWE-bench.

SWE-bench builds task instances from real Python repositories by connecting GitHub issues to the merged pull requests that resolve them and to the associated tests.

As shown in the figure, the model's task is to resolve an issue, usually a bug report or feature request, submitted to a GitHub repository.

Each task requires generating a patch that describes the changes to apply to the existing codebase.

SWE-bench then uses the repository's own test framework to evaluate the modified codebase.
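To make the setup concrete, here is a rough Python sketch of what a single SWE-bench-style task instance contains; the field names are illustrative, not the dataset's exact schema.

```python
# Illustrative sketch of one SWE-bench-style task instance.
# Field names are approximate, not the official dataset schema.
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    repo: str                # e.g. "django/django"
    base_commit: str         # repository state before the fix was merged
    problem_statement: str   # the GitHub issue text (bug report or feature request)
    gold_patch: str          # the merged PR's code changes, as a unified diff
    test_patch: str          # the PR's test changes, used only for evaluation
    fail_to_pass: list[str] = field(default_factory=list)  # tests that must flip from fail to pass

# The model sees only `problem_statement` plus (retrieved) code from the repo at
# `base_commit`; it must produce its own patch, which is then judged by the tests.
```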

To find high-quality task instances at scale, the researchers used a three-stage screening pipeline:

**Stage 1: Repository selection and data scraping.**

Pull requests (PRs) were first collected from 12 popular open-source Python repositories on GitHub, generating a total of about 90,000 PRs.

The researchers focused on popular repositories because they tend to be better maintained, have clear contributor guidelines, and have better test coverage. Each PR has an associated codebase, i.e. the state of the repository before the PR was merged.

**Stage 2: Attribute-based filtering.**

Candidate tasks are created by selecting merged PRs that meet two criteria: (1) they resolve a GitHub issue; (2) they modify the repository's test files, which indicates the contributor most likely added tests to check whether the issue is fixed. (Both filters are sketched in code after these stages.)

**Stage 3: Execution-based filtering.**

For each candidate task, the PR's test changes are applied first, and the relevant tests are run both before and after the rest of the PR's changes are applied, logging the results.

Task instances are kept only if they have at least one test whose status changes from fail to pass (hereafter a "fail-to-pass test"). Instances that cause installation or runtime errors are also filtered out.

Through these screening stages, the original ~90,000 PRs are narrowed down to 2,294 task instances, which make up SWE-bench.
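To illustrate, the two filtering stages can be pictured roughly as follows. This is a simplified sketch rather than the authors' actual pipeline; the `pr` object and the `run_tests` helper are hypothetical stand-ins.

```python
# Rough sketch of SWE-bench-style candidate filtering.
# The `pr` objects and the `run_tests` helper are hypothetical stand-ins,
# not the authors' actual tooling.

def attribute_filter(pr) -> bool:
    """Stage 2: keep merged PRs that resolve an issue and touch test files."""
    resolves_issue = len(pr.linked_issues) > 0
    edits_tests = any(path.startswith("tests/") or "test" in path
                      for path in pr.changed_files)
    return pr.merged and resolves_issue and edits_tests

def execution_filter(repo_at_base, pr, run_tests) -> bool:
    """Stage 3: keep PRs with at least one test that flips from fail to pass."""
    repo_with_tests = repo_at_base.apply(pr.test_changes)   # apply tests only
    before = run_tests(repo_with_tests)                     # {test_id: "pass"/"fail"}

    repo_fixed = repo_with_tests.apply(pr.code_changes)     # apply the fix as well
    after = run_tests(repo_fixed)

    fail_to_pass = [t for t in after
                    if before.get(t) == "fail" and after[t] == "pass"]
    return len(fail_to_pass) >= 1
```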

The final distribution of these task instances across repositories is shown in Figure 3 below, and the table summarizes the main attributes of SWE-bench task instances.

The researchers emphasize that these codebases are large, containing thousands of files, and that reference pull requests often modify multiple files at the same time.

SWE-bench offers several advantages over existing LM programming benchmarks.

These include a real-world setting with user-submitted issues and solutions, diverse inputs drawing on unique code problems from 12 repositories, a robust execution-based evaluation framework, and the ability to continually update the benchmark with new instances with minimal human intervention.

LLM Task: Edit Code Base, Solve Problems

The researchers give the large model a text description of the issue along with the complete codebase.

The task of the large model is to edit the codebase to solve the problem.

In practice, researchers represent changes as patch files, which specify which lines in the codebase to modify to solve the problem.

How to evaluate whether the solution given by LLM is good or not?

The researchers use the Unix patch tool to apply the generated patch to the codebase, then run the unit and system tests associated with the task instance.

If the patch applies successfully and all of these tests pass, the solution proposed by the LLM is considered to have resolved the issue.

The benchmark's metric is the percentage of task instances resolved.
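Concretely, a minimal version of this evaluation loop might look like the sketch below. It shells out to the Unix `patch` tool and `pytest`; the real SWE-bench harness is considerably more involved, and the function names here are illustrative.

```python
# Minimal sketch of SWE-bench-style evaluation: apply the model's patch,
# run the task's tests, and report the fraction of resolved instances.
import subprocess

def resolved(repo_dir: str, model_patch: str, fail_to_pass_tests: list[str]) -> bool:
    # 1. Apply the generated unified diff to the checked-out codebase.
    apply = subprocess.run(["patch", "-p1", "--forward", "-d", repo_dir],
                           input=model_patch, text=True)
    if apply.returncode != 0:
        return False  # patch did not apply cleanly

    # 2. Run the fail-to-pass tests associated with this task instance.
    tests = subprocess.run(["python", "-m", "pytest", "-q", *fail_to_pass_tests],
                           cwd=repo_dir)
    return tests.returncode == 0  # all required tests pass

def resolve_rate(results: list[bool]) -> float:
    """Benchmark metric: percentage of task instances resolved."""
    return 100.0 * sum(results) / len(results)
```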

Building a unique dataset for SWE-bench

Traditional NLP benchmarks typically involve only short input and output sequences, and pose somewhat "artificial" problems created specifically for the benchmark.

In contrast, to build the SWE-bench, the researchers injected unique properties into the dataset.

For example, real software engineering tasks are used.

Since each task instance in SWE-bench contains a large and complex code base and a description of the associated problem, solving SWE-bench requires the complex skills and knowledge of experienced software engineers, which are often not evaluated in traditional code generation benchmarks.

What's more, the collection process can be easily applied to any Python repository on GitHub with little to no human intervention.

As a result, researchers can extend SWE-bench with new task instances and evaluate language models on issues created after their training cutoff, ensuring the solutions were not in the training corpus.

In addition, the researchers ensure the benchmark provides long and varied inputs, robust evaluation, cross-context code editing, and a wide range of possible solutions.

Fine-tuning SWE-Llama

Next, it's time to evaluate the effectiveness of open and proprietary models in the SWE-bench framework.

However, the researchers found that off-the-shelf CodeLlama models could not follow detailed instructions to generate repository-wide code edits, often outputting placeholder responses or irrelevant code.

To assess the capabilities of these models, the researchers performed supervised fine-tuning (SFT) on the 7-billion-parameter and 13-billion-parameter CodeLlama-Python models.

The resulting models are specialized repository editors that run on consumer-grade hardware and resolve GitHub issues.
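For readers who want a feel for what such a fine-tune involves, below is a plausible sketch using Hugging Face Transformers with PEFT/LoRA. The hyperparameters, prompt format, and training-file name are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative SFT sketch for a CodeLlama-Python model using LoRA.
# Hyperparameters and the data file are placeholders, not the authors' setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "codellama/CodeLlama-7b-Python-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA keeps the fine-tune cheap enough to run on modest hardware.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical JSONL file: each line pairs an issue plus retrieved code context
# ("prompt") with the gold patch ("completion").
data = load_dataset("json", data_files="swe_sft_train.jsonl", split="train")

def tokenize(example):
    return tokenizer(example["prompt"] + example["completion"],
                     truncation=True, max_length=4096)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="swe-llama-sft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=2, learning_rate=2e-4,
                           bf16=True, logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```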

### Big models fail

Next, the researchers evaluated GPT-3.5, GPT-4, Claude 2, and the fine-tuned models.

It turned out that all the models failed: none of them could solve anything but the simplest issues.

For example, even with "oracle" retrieval, where the model is shown the files edited by the reference solution, Claude 2 and GPT-4 resolve only 4.8% and 1.7% of tasks, respectively.

After using the BM25 retriever, the performance of Claude 2 dropped further to 1.96%.

**Different repositories have different levels of difficulty.**

If you break down performance by repository, you'll see that all models exhibit similar trends across repositories.

Still, the issues resolved by each model do not necessarily overlap extensively. For example, in the "oracle" setting, Claude 2 and SWE-Llama 13b perform comparably, resolving 110 and 91 instances, respectively.

**Difficulty is dependent on context length.**

Models may be pre-trained on long code sequences, but they are typically asked to generate a single function at a time, with only limited context provided to frame the problem.

As shown in the figure, Claude 2's performance degrades significantly as total context length increases, and the same trend is observed in the other models.

Even though increasing BM25's maximum context size improves recall relative to the oracle files, performance still degrades, because the model simply cannot locate the problematic code within such a vast amount of context.
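For intuition, BM25-style file retrieval over a repository can be sketched as follows, using the `rank_bm25` package. This is purely illustrative and not the paper's actual retrieval setup.

```python
# Illustrative BM25 retrieval of candidate files for an issue.
# Uses the rank_bm25 package; the paper's retrieval pipeline may differ.
from pathlib import Path
from rank_bm25 import BM25Okapi

def top_k_files(repo_dir: str, issue_text: str, k: int = 5) -> list[str]:
    paths = list(Path(repo_dir).rglob("*.py"))
    corpus = [p.read_text(errors="ignore") for p in paths]
    bm25 = BM25Okapi([doc.split() for doc in corpus])   # naive whitespace tokenization
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda x: x[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]

# The retrieved files (up to a context budget) are concatenated with the issue
# text to form the model's prompt; more files means a longer context, which is
# exactly where model performance degrades.
```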

**Difficulty is independent of issue resolution date.**

Table 7 shows model results under the "oracle" retrieval setting, broken down by whether the PR was created before or after 2023.

For most models, with the exception of GPT-4, there is little difference in performance before or after this date.

In addition, the study found that fine-tuned models are sensitive to shifts in the context distribution, and that it is easier for models to generate a patch than an entire file. Large models also tend to produce shorter, simpler edits.

LLMs are no substitute for programmers, but they can speed up workflows

Some netizens still hold out hope for the future of "generalist models".

That's right, that's exactly my point. Generalist models aren't good enough yet, and don't have a long enough context window, to code on their own, except for relatively short snippets.

But I think it's only a matter of time. I can foresee that in the near future, generalist LLMs given task-specific training will become highly specialized models.

While large models are no substitute for programmers, they can speed up their workflows. What used to take a team of 10 may now need only 4 people, freeing up resources for the company's other goals.

Instead of laying off employees to save money, let developers accomplish great things at breakneck speed!

Resources:
