A screenshot from a Microsoft paper suggests GPT-3.5 has only 20 billion parameters? The AI community is stunned, and netizens are calling it outrageous!
Original source: New Zhiyuan
GPT-3.5 only has 20 billion parameters?
Today, the large-model community was set abuzz by a screenshot from a Microsoft paper. What is going on?
Just a few days ago, Microsoft published a paper on arXiv proposing CodeFusion, a small diffusion model with only 75M parameters.
In terms of top-1 accuracy, the 75M-parameter CodeFusion is comparable to state-of-the-art models ranging from 350M to 175B parameters.
The work itself is interesting, but what really caught everyone's attention is this:
when the authors compare against ChatGPT (gpt-3.5-turbo), they list its parameter count as only 20B!
As soon as the news broke, it shot straight onto Zhihu's trending list, and netizens erupted.
The revelation instantly sparked heated discussion; so far, more than 680,000 people have viewed the thread.
"It's unimaginable! Neither the Falcon-180B nor the Llama2-70B can beat the 20B model."
And this "leak" of the parameter count seems to confirm the rumors that GPT-3.5-Turbo is not as capable as the older GPT-3.5.
The Microsoft paper that "revealed" GPT-3.5's 20B parameters is actually about introducing a diffusion model for code generation.
The researchers evaluated CodeFusion on the task of generating code from natural language for Bash, Python, and Microsoft Excel conditional formatting (CF) rules.
Experiments show that CodeFusion (only 75M parameters) is comparable to state-of-the-art LLMs (350M to 175B parameters) in top-1 accuracy, and offers an excellent performance-to-parameter ratio in top-3 and top-5 accuracy.
CodeFusion is built for code generation tasks, and its training is divided into two phases: the first is unsupervised pre-training, and the second is supervised fine-tuning.
In the second phase, CodeFusion is fine-tuned on text-code pairs. At this stage, the encoder, denoiser, and decoder are all tuned to perform the task better.
In addition, CodeFusion draws on earlier research on text diffusion and fuses the hidden representation D from the decoder into the model to improve performance. During training, the model injects noise at different steps and then computes the loss function so that the generated code snippets better match the expected output.
In summary, CodeFusion is a small model for code generation that steadily improves its performance through two-phase training and noise injection. Inspired by research on text diffusion, it improves the loss function by fusing the decoder's hidden representation, allowing it to generate higher-quality code snippets.
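To make the two-phase recipe above more concrete, here is a minimal sketch, in PyTorch, of what a CodeFusion-style training step could look like. The module definitions, sizes, noise schedule, and combined loss are illustrative assumptions for this article, not the implementation described in the paper.

```python
# Minimal, hypothetical sketch of CodeFusion-style two-phase training (PyTorch).
# Module names, sizes, the noise schedule, and the combined loss are assumptions
# made for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, VOCAB, T = 256, 32000, 1000                       # embedding dim, vocab size, diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # DDPM-style noise schedule

class TextEncoder(nn.Module):
    """Encodes the natural-language prompt (used only in phase 2)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
    def forward(self, tokens):
        return self.enc(self.emb(tokens))

class Denoiser(nn.Module):
    """Predicts clean code embeddings from noisy ones, optionally conditioned on the prompt."""
    def __init__(self):
        super().__init__()
        self.t_emb = nn.Embedding(T, D)
        self.net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
    def forward(self, x_noisy, t, cond=None):
        h = x_noisy + self.t_emb(t).unsqueeze(1)      # add timestep embedding
        if cond is not None:
            h = torch.cat([cond, h], dim=1)           # prepend prompt states as context
        return self.net(h)[:, -x_noisy.size(1):]      # keep only the code positions

class CodeDecoder(nn.Module):
    """Maps denoised embeddings back to code token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, VOCAB)
    def forward(self, x):
        return self.proj(x)

def diffusion_step(denoiser, decoder, code_emb, code_tokens, cond=None):
    """One training step: corrupt clean code embeddings with noise, denoise, decode, score."""
    t = torch.randint(0, T, (code_emb.size(0),))
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(code_emb)
    x_noisy = a.sqrt() * code_emb + (1 - a).sqrt() * noise
    x0_pred = denoiser(x_noisy, t, cond)              # denoiser tries to recover clean embeddings
    logits = decoder(x0_pred)                         # decoder reconstructs the code tokens
    return (F.mse_loss(x0_pred, code_emb) +
            F.cross_entropy(logits.reshape(-1, VOCAB), code_tokens.reshape(-1)))

encoder, denoiser, decoder = TextEncoder(), Denoiser(), CodeDecoder()
code_tokens = torch.randint(0, VOCAB, (2, 16))        # toy batch of code token ids
code_emb = nn.Embedding(VOCAB, D)(code_tokens)        # toy continuous code embeddings
# Phase 1 (unsupervised pre-training): code snippets only, no text conditioning.
loss_pretrain = diffusion_step(denoiser, decoder, code_emb, code_tokens)
# Phase 2 (supervised fine-tuning): text-code pairs; encoder, denoiser, and decoder all get gradients.
prompt = torch.randint(0, VOCAB, (2, 8))
loss_finetune = diffusion_step(denoiser, decoder, code_emb, code_tokens, cond=encoder(prompt))
```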
Evaluation Results
The following table summarizes the performance of CodeFusion and each baseline model in the top-1, top-3, and top-5 settings.
In top-1 accuracy, CodeFusion is comparable to the baselines and in some cases even better, especially on Python tasks, where only GPT-3 (175B) performs slightly better than CodeFusion (75M). In top-3 and top-5 accuracy, however, CodeFusion significantly outperforms all baseline models.
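For readers unfamiliar with the metric: top-k accuracy counts a prompt as solved if any of the model's first k generated candidates is judged correct. Here is a small sketch, with the correctness check left as a placeholder (real benchmarks often execute the generated code instead of comparing strings):

```python
# Sketch of top-k accuracy for code generation: a prompt counts as solved if any
# of the first k sampled candidates passes the check. `is_correct` is a placeholder.
from typing import Callable, List

def top_k_accuracy(candidates_per_prompt: List[List[str]],
                   references: List[str],
                   is_correct: Callable[[str, str], bool],
                   k: int) -> float:
    solved = sum(
        any(is_correct(cand, ref) for cand in candidates[:k])
        for candidates, ref in zip(candidates_per_prompt, references))
    return solved / len(references)

# Toy example with an exact-match check.
preds = [["print('hi')", "print('hello')"], ["x = 1", "x=1"]]
refs = ["print('hello')", "x = 1"]
print(top_k_accuracy(preds, refs, lambda c, r: c == r, k=1))  # 0.5
print(top_k_accuracy(preds, refs, lambda c, r: c == r, k=2))  # 1.0
```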
Compared with autoregressive models, CodeFusion generates more diverse results and performs better.
This gradual denoising approach also makes it possible to summarize and illustrate CodeFusion's step-by-step progress, as shown in the figure below.
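For intuition about that step-by-step progress, here is a hypothetical sketch of the iterative denoising loop such a model could run at inference time, reusing the toy modules defined in the training sketch above; the simplified re-noising update is an assumption, not the paper's exact sampler.

```python
# Hypothetical inference-time denoising loop, reusing D, T, VOCAB, alphas_cumprod,
# encoder, denoiser, and decoder from the training sketch above.
import torch

@torch.no_grad()
def generate(encoder, denoiser, decoder, prompt_tokens, seq_len=16, steps=50):
    cond = encoder(prompt_tokens)                        # encode the natural-language prompt
    x = torch.randn(prompt_tokens.size(0), seq_len, D)   # start from pure Gaussian noise
    stride = T // steps
    for step in reversed(range(0, T, stride)):           # walk the noise schedule backwards
        t = torch.full((x.size(0),), step, dtype=torch.long)
        x0_pred = denoiser(x, t, cond)                   # predict clean code embeddings
        prev = max(step - stride, 0)
        # Re-noise the prediction down to the next (lower) noise level; a crude sampler.
        x = (alphas_cumprod[prev].sqrt() * x0_pred +
             (1 - alphas_cumprod[prev]).sqrt() * torch.randn_like(x))
        # Decoding intermediates, e.g. decoder(x0_pred).argmax(-1), shows the gradual refinement.
    return decoder(x0_pred).argmax(-1)                   # final code token ids

code_ids = generate(encoder, denoiser, decoder, torch.randint(0, VOCAB, (1, 8)))
```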
Is GPT-3.5 an ensemble of small expert models or a generalist model? Was it distilled from a larger model, or trained on more data?
The answers to these questions will only be revealed if it is ever truly open-sourced.