Large models can't take coders' jobs! Princeton's surprising finding: GPT-4 has a 0% success rate on GitHub programming problems
Article source: Shin Ji Yuan
Stack Overflow has already been done in by ChatGPT!
With coders flocking to ChatGPT and GitHub Copilot, Stack Overflow announced today that it is laying off more than 100 employees, nearly a third of its workforce.
But a recent study from Princeton and the University of Chicago found that turning an LLM into a working coder is not so easy.
Faced with 2,294 real GitHub issues, GPT-4's pass rate on a random sample of them turned out to be 0%!
Even the best-performing model, Claude 2, solves only 1.96% of them.
Adapt or perish
As developers' favorite coding Q&A site worldwide, Stack Overflow had been doing well, going on a hiring spree last year that doubled the company's workforce to 540.
However, everything changed after OpenAI released ChatGPT last November.
While LLMs don't provide 100% reliable answers, the ability to validate code immediately by simply running it in an IDE makes writing code an ideal use case for ChatGPT.
As a result, Stack Overflow's traffic has dropped sharply, and AI programming tools such as ChatGPT and the GPT-4-powered GitHub Copilot have become coders' new destination.
Today, CEO Prashanth Chandrasekar announced that Stack Overflow has laid off more than a hundred employees, or 28% of its workforce.
**Crossing the river, then tearing down the bridge?**
The biggest irony of ChatGPT's impact on Stack Overflow is that the power of large language models comes largely from scraping sites like Stack Overflow.
What happens if large language models drain this data without giving anything back, and all the data sources are driven out of business?
Many tech companies already face a looming problem: fewer programmers means less human-generated data.
How do you train new AI models without up-to-date data?
**Want to use our data? Pay up**
Stack Overflow, of course, isn't sitting still; it has chosen two ways to save itself:
One is to develop its own AI coding tool, OverflowAI, and the other is to seek partnerships directly with tech companies like OpenAI, which use Stack Overflow's data to build AI models.
The CEO said Stack Overflow has made its stance clear: whoever wants to use our data to train LLMs has to pay.
He believes that sites like Stack Overflow are critical to the development of large language models, which need to be trained on fresh knowledge in order to advance.
For LLMs to take coders' jobs, it's still too early
So, can large language models really replace coders?
The Princeton and Chicago teams found it's not that easy!
They found that leading large models such as GPT-4 and Claude 2 could solve less than 5% of real issues.
More specifically, GPT-4 solves random GitHub issues with a pass rate of 0%, while the best model, Claude 2, solves only 1.96% of them.
In addition, different models' performance varies across problems drawn from 12 popular Python libraries.
But to see AI's real strength clearly, don't be misled by the headline scores.
SWE-bench: Designed for coding models
In this study, the authors found that many existing benchmarks for measuring the coding ability of large models have become saturated and no longer reflect their true strength.
For example, in HumanEval, the challenge problems are too simple: an LLM needs only a few lines of code to solve a self-contained problem.
However, software engineering is not so simple in reality.
Inspired by this, researchers from Princeton and Chicago introduced SWE-bench.
SWE-bench draws task instances from real Python repositories by connecting GitHub issues to the merged pull requests that resolve them and to the tests associated with those fixes.
As shown in the figure, the model's task is to resolve an issue submitted to a GitHub repository, usually a bug report or feature request.
Each task requires generating a patch that describes the changes to apply to the existing codebase.
SWE-bench then evaluates the modified codebase using the repository's own test framework.
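To make this setup concrete, a single SWE-bench-style task instance can be pictured as a record that pairs an issue with the repository state it was filed against and the tests that verify the fix. The field names below are purely illustrative, not the dataset's actual schema:

```python
# A hypothetical sketch of one task instance; field names are illustrative,
# not the real SWE-bench schema.
task_instance = {
    "repo": "example-org/example-lib",        # source repository (hypothetical)
    "base_commit": "abc123",                  # repo state before the fix was merged
    "problem_statement": "TypeError raised when parsing an empty config file ...",
    "gold_patch": "diff --git a/example_lib/config.py ...",  # reference fix from the PR
    "test_patch": "diff --git a/tests/test_config.py ...",   # tests added by the PR
    "fail_to_pass_tests": ["tests/test_config.py::test_empty_file"],
}
```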
**Phase 1: Repository selection and data scraping.**
Pull requests (PRs) were first collected from 12 popular open-source Python repositories on GitHub, yielding a total of about 90,000 PRs.
The researchers focused on popular repositories because they tend to be better maintained, have clear contributor guidelines, and have better test coverage. Each PR has an associated codebase, i.e. the state of the repository before the PR was merged.
**Phase 2: Attribute-based filtering.**
Candidate tasks are created by selecting merged PRs that meet the following criteria: (1) they resolve a GitHub issue; and (2) they modify the repository's test files, which indicates that the contributor most likely added tests to check whether the issue is resolved.
**Phase 3: Execution-based filtering.**
For each candidate task, the PR's test changes are applied first, and the relevant tests are run both before and after the rest of the PR's changes are applied.
The researchers keep only task instances that have at least one test whose status flips from fail to pass (hereafter, a "fail-to-pass test"). Instances that cause installation or runtime errors are also filtered out.
Through these filtering stages, the original 90,000 or so PRs are narrowed down to 2,294 task instances, which make up SWE-bench.
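A rough sketch of what the Phase 3 execution-based filter might look like, assuming hypothetical helpers for applying patches and running a pytest-style test suite (neither is the authors' actual tooling):

```python
import subprocess

def apply_patch(repo_dir: str, patch_text: str) -> None:
    """Apply a unified diff to the working tree (sketch; uses `git apply`)."""
    subprocess.run(["git", "apply", "-"], input=patch_text.encode(),
                   cwd=repo_dir, check=True)

def run_tests(repo_dir: str, test_ids: list[str]) -> dict[str, str]:
    """Run each test and record whether it passed or failed (pytest assumed)."""
    results = {}
    for test_id in test_ids:
        proc = subprocess.run(["python", "-m", "pytest", "-q", test_id],
                              cwd=repo_dir, capture_output=True)
        results[test_id] = "passed" if proc.returncode == 0 else "failed"
    return results

def keep_candidate(repo_dir: str, candidate: dict) -> bool:
    """Keep a candidate PR only if at least one test flips from fail to pass
    once the PR's non-test code changes are applied."""
    apply_patch(repo_dir, candidate["test_patch"])   # add the PR's new/updated tests
    before = run_tests(repo_dir, candidate["tests"])
    apply_patch(repo_dir, candidate["code_patch"])   # apply the actual fix
    after = run_tests(repo_dir, candidate["tests"])
    return any(before[t] == "failed" and after[t] == "passed"
               for t in candidate["tests"])
```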
The final distribution of these task instances across the different repositories is shown in Figure 3 below, and the table summarizes the main characteristics of SWE-bench task instances.
The researchers emphasize that these codebases are large, containing thousands of files, and that reference pull requests often modify multiple files at the same time.
SWE-bench offers several advantages over existing LM programming benchmarks.
These include real-world settings with user-submitted issues and solutions, diverse inputs featuring unique code questions from 12 repositories, a robust execution-based evaluation framework, and the ability to continuously update benchmarks with new instances with minimal human intervention.
LLM task: edit the codebase, resolve the issue
The researchers give the large model a textual description of the issue along with the complete codebase.
The task of the large model is to edit the codebase to solve the problem.
In practice, researchers represent changes as patch files, which specify which lines in the codebase to modify to solve the problem.
The researchers use the Unix patch program to apply the generated patches to the codebase, and then run the unit and system tests associated with the task instance.
The evaluation metric is the percentage of task instances that are successfully resolved.
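A minimal sketch of that evaluation loop, reusing the hypothetical `apply_patch` and `run_tests` helpers from the filtering sketch above; `checkout_fn` is likewise a stand-in for whatever prepares a clean checkout of the repository at the base commit:

```python
def percent_resolved(instances, model_patches, checkout_fn) -> float:
    """Count an instance as resolved only if the model's patch applies cleanly
    and every fail-to-pass test passes afterwards (a simplified criterion)."""
    resolved = 0
    for inst in instances:
        repo_dir = checkout_fn(inst["repo"], inst["base_commit"])  # fresh checkout
        try:
            apply_patch(repo_dir, inst["test_patch"])           # reference tests
            apply_patch(repo_dir, model_patches[inst["id"]])    # model-generated fix
        except Exception:
            continue  # patch did not apply -> not resolved
        results = run_tests(repo_dir, inst["fail_to_pass_tests"])
        if all(status == "passed" for status in results.values()):
            resolved += 1
    return 100.0 * resolved / len(instances)
```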
Building a unique dataset for SWE-bench
Traditional NLP benchmarks typically involve only short input and output sequences and consider some "artificial" problems created specifically for benchmarks.
In contrast, to build the SWE-bench, the researchers injected unique properties into the dataset.
For example, real software engineering tasks are used.
What's more, the collection process can be easily applied to any Python repository on GitHub with little to no human intervention.
As a result, researchers can keep extending SWE-bench with new task instances and evaluate language models on issues created after their training date, ensuring that the training corpus does not contain the solutions.
In addition, the benchmark offers long and diverse inputs, robust execution-based evaluation, cross-context code editing, a wide range of possible solutions, and more.
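As an illustration of how the largely automated collection described above could work, here is a rough sketch that lists merged pull requests from a repository via GitHub's REST API; linking PRs to the issues they close and the later filtering phases are omitted, and the function names are our own:

```python
import requests

def iter_merged_prs(owner: str, repo: str, token: str | None = None):
    """Yield merged pull requests for a GitHub repository (sketch only:
    no rate-limit handling, no issue linking, no attribute filtering)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/pulls",
            params={"state": "closed", "per_page": 100, "page": page},
            headers=headers,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        for pr in batch:
            if pr.get("merged_at"):   # closed PRs include unmerged ones; keep merged only
                yield pr
        page += 1

# Usage sketch: candidate PRs from one (hypothetical) repository
# candidates = list(iter_merged_prs("example-org", "example-lib"))
```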
Fine-tuning SWE-Llama
Next, it's time to evaluate the effectiveness of open and proprietary models in the SWE-bench framework.
However, the researchers found that off-the-shelf fine-tuned CodeLlama models could not follow detailed instructions to generate repository-wide code edits, often outputting placeholder responses or irrelevant code.
To better assess these models' capabilities, the researchers performed supervised fine-tuning (SFT) on the 7-billion-parameter and 13-billion-parameter CodeLlama-Python models.
The resulting models are specialized repository editors that run on consumer-grade hardware and can resolve GitHub issues.
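As a rough picture of what such supervised fine-tuning could look like, the sketch below uses LoRA adapters via the Hugging Face transformers, datasets, and peft libraries. The data file, column names, and hyperparameters are placeholders, not the authors' actual recipe:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder SFT data: each record pairs an issue + retrieved code context
# ("prompt") with the gold patch ("patch"). File and column names are hypothetical.
dataset = load_dataset("json", data_files="swe_sft_data.jsonl")["train"]

model_name = "codellama/CodeLlama-7b-Python-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Attach LoRA adapters so only a small fraction of weights is trained,
# which is what makes consumer-grade hardware plausible.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def tokenize(example):
    # Concatenate prompt and target patch into one causal-LM training sequence.
    text = example["prompt"] + example["patch"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="swe-llama-sft", per_device_train_batch_size=1,
        gradient_accumulation_steps=8, num_train_epochs=1,
        learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```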
Next, the researchers evaluated GPT-3.5, GPT-4, Claude 2, and the fine-tuned models.
It turned out that all of the models struggled: none of them could resolve anything but the simplest issues.
For example, Claude 2 and GPT-4 solve only 4.8% and 1.7% of tasks, respectively.
With the BM25 retriever, Claude 2's performance drops further, to 1.96%.
**Different libraries have different levels of difficulty.**
If you break down performance by repository, you'll see that all models exhibit similar trends across libraries.
Still, the problems solved by each model do not necessarily overlap much. For example, in the oracle setup, Claude 2 and SWE-Llama 13b perform comparably, solving 110 and 91 instances, respectively.
**Difficulty is dependent on context length.**
Models may be pre-trained on long code sequences, but they are typically asked to generate a single function at a time, with only limited context provided to frame the problem.
As shown, Claude 2's performance degrades significantly as the total context length increases, a trend that can also be observed in the other models.
Even though increasing BM25's maximum context size improves recall relative to the oracle files, performance still degrades, because the model simply cannot locate the problematic code within such a vast body of text.
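For reference, BM25 retrieval over a repository's files can be sketched roughly as follows, here with the rank_bm25 package and naive whitespace tokenization; the paper's actual retrieval pipeline may differ:

```python
from pathlib import Path
from rank_bm25 import BM25Okapi

def retrieve_files(repo_dir: str, issue_text: str, top_n: int = 5) -> list[str]:
    """Rank the repository's Python files against the issue text with BM25
    and return the top matches. A rough sketch, not the paper's exact setup."""
    paths = list(Path(repo_dir).rglob("*.py"))
    docs = [p.read_text(errors="ignore") for p in paths]
    bm25 = BM25Okapi([doc.split() for doc in docs])      # whitespace tokenization
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda pair: pair[0], reverse=True)
    return [str(path) for _, path in ranked[:top_n]]
```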
Table 7 shows the model results, broken down by date, for PRs created before or after 2023 under the "oracle" retrieval setting.
For most models, with the exception of GPT-4, there is little difference in performance before or after this date.
LLMs are not a substitute for programmers, but they can speed up workflows
Some netizens still have high hopes for the future of "generalist models".
That's right, that's exactly what I'm saying. Generalist models aren't yet good enough, and don't have a long enough context window, to code on their own beyond relatively short snippets.
But I think it's only a matter of time. I can foresee that in the near future, generalist LLMs with specialized training will become highly capable expert models.
Instead of laying off employees to save money, let developers accomplish great things at breakneck speed!