A screenshot from a Microsoft paper suggests GPT-3.5 has only 20 billion parameters? The AI community is stunned, and netizens are calling it outrageous!
Original source: New Zhiyuan
GPT-3.5 only has 20 billion parameters?
Today, the large-model community was set abuzz by a screenshot from a Microsoft paper. What is going on?
Just a few days ago, Microsoft published a paper on arXiv proposing CodeFusion, a small diffusion model with only 75M parameters.
In terms of top-1 accuracy, the 75M-parameter CodeFusion is comparable to state-of-the-art models ranging from 350M to 175B parameters.
The work itself is interesting, but what really caught everyone's attention is this:
when the authors compare against ChatGPT (gpt-3.5-turbo), they list its parameter count as only 20B!
As soon as the news broke, it shot straight onto Zhihu's trending list, and netizens erupted.
The revelation instantly sparked heated discussion; so far, more than 680,000 people have viewed the thread.
"It's unimaginable! Neither the Falcon-180B nor the Llama2-70B can beat the 20B model."
And this "leak" of the parameter count seems to confirm the rumors that GPT-3.5-Turbo is not as capable as the older GPT-3.5.
The Microsoft paper that "revealed" GPT-3.5's 20B parameters is actually about introducing a diffusion model for code generation.
The researchers evaluated CodeFusion on the task of generating code from natural language for Bash, Python, and Microsoft Excel conditional formatting (CF) rules.
Experiments show that CodeFusion (only 75M parameters) is comparable to state-of-the-art LLMs (350M to 175B parameters) in top-1 accuracy, and offers an excellent performance-to-parameter ratio in top-3 and top-5 accuracy.
CodeFusion is built for code generation tasks, and its training is divided into two phases: the first is unsupervised pre-training, and the second is supervised fine-tuning.
In the second phase, CodeFusion is fine-tuned on text-code pairs. At this stage, the encoder, denoiser, and decoder are all tuned to perform the task better.
In addition, CodeFusion draws on earlier research on text diffusion and fuses the hidden representation D from the decoder into the model to improve performance. During training, the model injects noise at different steps and then computes the loss function so that the generated code snippets better match the expected output.
In summary, CodeFusion is a small model for code generation that steadily improves its performance through two-phase training and noise injection. Inspired by research on text diffusion, it improves the loss function by fusing the decoder's hidden representation, allowing it to generate higher-quality code snippets.
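To make the two-phase recipe above more concrete, here is a minimal sketch, in PyTorch, of what a CodeFusion-style training step could look like. The module definitions, sizes, noise schedule, and combined loss are illustrative assumptions for this article, not the implementation described in the paper.

```python
# Minimal, hypothetical sketch of CodeFusion-style two-phase training (PyTorch).
# Module names, sizes, the noise schedule, and the combined loss are assumptions
# made for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, VOCAB, T = 256, 32000, 1000                       # embedding dim, vocab size, diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # DDPM-style noise schedule

class TextEncoder(nn.Module):
    """Encodes the natural-language prompt (used only in phase 2)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
    def forward(self, tokens):
        return self.enc(self.emb(tokens))

class Denoiser(nn.Module):
    """Predicts clean code embeddings from noisy ones, optionally conditioned on the prompt."""
    def __init__(self):
        super().__init__()
        self.t_emb = nn.Embedding(T, D)
        self.net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
    def forward(self, x_noisy, t, cond=None):
        h = x_noisy + self.t_emb(t).unsqueeze(1)      # add timestep embedding
        if cond is not None:
            h = torch.cat([cond, h], dim=1)           # prepend prompt states as context
        return self.net(h)[:, -x_noisy.size(1):]      # keep only the code positions

class CodeDecoder(nn.Module):
    """Maps denoised embeddings back to code token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, VOCAB)
    def forward(self, x):
        return self.proj(x)

def diffusion_step(denoiser, decoder, code_emb, code_tokens, cond=None):
    """One training step: corrupt clean code embeddings with noise, denoise, decode, score."""
    t = torch.randint(0, T, (code_emb.size(0),))
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(code_emb)
    x_noisy = a.sqrt() * code_emb + (1 - a).sqrt() * noise
    x0_pred = denoiser(x_noisy, t, cond)              # denoiser tries to recover clean embeddings
    logits = decoder(x0_pred)                         # decoder reconstructs the code tokens
    return (F.mse_loss(x0_pred, code_emb) +
            F.cross_entropy(logits.reshape(-1, VOCAB), code_tokens.reshape(-1)))

encoder, denoiser, decoder = TextEncoder(), Denoiser(), CodeDecoder()
code_tokens = torch.randint(0, VOCAB, (2, 16))        # toy batch of code token ids
code_emb = nn.Embedding(VOCAB, D)(code_tokens)        # toy continuous code embeddings
# Phase 1 (unsupervised pre-training): code snippets only, no text conditioning.
loss_pretrain = diffusion_step(denoiser, decoder, code_emb, code_tokens)
# Phase 2 (supervised fine-tuning): text-code pairs; encoder, denoiser, and decoder all get gradients.
prompt = torch.randint(0, VOCAB, (2, 8))
loss_finetune = diffusion_step(denoiser, decoder, code_emb, code_tokens, cond=encoder(prompt))
```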
Evaluation Results
The following table summarizes the performance of CodeFusion and each baseline model in the top-1, top-3, and top-5 settings.
In top-1 accuracy, CodeFusion is comparable to the baselines and in some cases even better, especially on Python tasks, where only GPT-3 (175B) performs slightly better than CodeFusion (75M). In top-3 and top-5 accuracy, however, CodeFusion significantly outperforms all baseline models.
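For readers unfamiliar with the metric: top-k accuracy counts a prompt as solved if any of the model's first k generated candidates is judged correct. Here is a small sketch, with the correctness check left as a placeholder (real benchmarks often execute the generated code instead of comparing strings):

```python
# Sketch of top-k accuracy for code generation: a prompt counts as solved if any
# of the first k sampled candidates passes the check. `is_correct` is a placeholder.
from typing import Callable, List

def top_k_accuracy(candidates_per_prompt: List[List[str]],
                   references: List[str],
                   is_correct: Callable[[str, str], bool],
                   k: int) -> float:
    solved = sum(
        any(is_correct(cand, ref) for cand in candidates[:k])
        for candidates, ref in zip(candidates_per_prompt, references))
    return solved / len(references)

# Toy example with an exact-match check.
preds = [["print('hi')", "print('hello')"], ["x = 1", "x=1"]]
refs = ["print('hello')", "x = 1"]
print(top_k_accuracy(preds, refs, lambda c, r: c == r, k=1))  # 0.5
print(top_k_accuracy(preds, refs, lambda c, r: c == r, k=2))  # 1.0
```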
Compared with autoregressive models, CodeFusion generates more diverse results and performs better.
This gradual denoising approach also makes it possible to summarize and illustrate CodeFusion's step-by-step progress, as shown in the figure below.
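For intuition about that step-by-step progress, here is a hypothetical sketch of the iterative denoising loop such a model could run at inference time, reusing the toy modules defined in the training sketch above; the simplified re-noising update is an assumption, not the paper's exact sampler.

```python
# Hypothetical inference-time denoising loop, reusing D, T, VOCAB, alphas_cumprod,
# encoder, denoiser, and decoder from the training sketch above.
import torch

@torch.no_grad()
def generate(encoder, denoiser, decoder, prompt_tokens, seq_len=16, steps=50):
    cond = encoder(prompt_tokens)                        # encode the natural-language prompt
    x = torch.randn(prompt_tokens.size(0), seq_len, D)   # start from pure Gaussian noise
    stride = T // steps
    for step in reversed(range(0, T, stride)):           # walk the noise schedule backwards
        t = torch.full((x.size(0),), step, dtype=torch.long)
        x0_pred = denoiser(x, t, cond)                   # predict clean code embeddings
        prev = max(step - stride, 0)
        # Re-noise the prediction down to the next (lower) noise level; a crude sampler.
        x = (alphas_cumprod[prev].sqrt() * x0_pred +
             (1 - alphas_cumprod[prev]).sqrt() * torch.randn_like(x))
        # Decoding intermediates, e.g. decoder(x0_pred).argmax(-1), shows the gradual refinement.
    return decoder(x0_pred).argmax(-1)                   # final code token ids

code_ids = generate(encoder, denoiser, decoder, torch.randint(0, VOCAB, (1, 8)))
```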
Is GPT-3.5 an ensemble of small expert models or a generalist model? Was it distilled from a larger model, or trained on more data?
The answers to these questions will only be revealed if it is ever truly open-sourced.