The best 7B model changes hands again! It beats the 70B Llama 2 and runs on an Apple laptop | Open source and free

Original source: QbitAI (量子位)

Image source: Generated by Unbounded AI

A 7-billion-parameter model that cost just $500 to fine-tune has defeated the 70-billion-parameter Llama 2!

It runs easily on a laptop, and its output is comparable to ChatGPT.

And the key point: it's free, not a penny to pay.

Zephyr-7B, the open-source model created by the Hugging Face H4 team, is on an absolute tear.

Its base model is Mistral-7B, the open-source large model that blew up a while ago, built by Mistral AI, the startup dubbed the "European OpenAI".

Mind you, less than two weeks after Mistral-7B was released, fine-tuned variants started popping up one after another, much like the wave of "alpaca"-style models that appeared right after the original Llama came out.

The key to Zephyr standing out among these variants is that the team fine-tuned the model on top of Mistral using Direct Preference Optimization (DPO) on a public dataset.

The team also found that removing the dataset's built-in alignment further improved MT-Bench performance. The first-generation Zephyr-7B-alpha scored an average of 7.09 on MT-Bench, surpassing Llama 2-70B-Chat.

MT-Bench is a benchmark that evaluates a model's ability to handle multi-turn dialogue; its question set covers 8 categories including writing, role-playing, and extraction.

The point is, it has since been upgraded again!

The H4 team then launched the second generation, Zephyr-7B-beta. They explored the idea of extracting the alignment of large models such as GPT-4 and Claude 2 and injecting it into a small model, developing a method called distilled direct preference optimization (dDPO) for small models.

In the second generation of Zephyr, the average MT-Bench score increased to 7.34.

On AlpacaEval, Zephyr has a 90.6% win rate, beating ChatGPT (GPT-3.5):

Netizens who rushed to try Zephyr gave it unanimous praise, and the lmsys team also showed Zephyr-7b-beta's Elo score, which has soared 🔥:

On the internal Arena leaderboard, it has already surpassed 13B models.

Some people even said:

Seeing the DPO approach perform well in the field is probably the most exciting thing about the development of large language models this year.

More netizens have started to test the effect of Zephyr, and the results are surprisingly good.

In French, "mistral" refers to a dry, cold, strong wind, while "zephyr" means a mild, pleasant westerly breeze.

Clearly, while the Llama camp is building a zoo, this camp is opening a weather bureau.

Best 7B model changes hands again

Let's start with the hardware requirements for running Zephyr. After testing, netizens exclaimed that it is "so cool": a laptop (Apple M1 Pro) is enough, and "the results are very good".
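For reference, a common way to run a 7B model like this on a MacBook is through a quantized GGUF build with llama-cpp-python. The sketch below is just an illustration, not the setup those netizens used; the model file name is a placeholder for whichever quantized Zephyr-7B-beta checkpoint you download.

```python
from llama_cpp import Llama

# Assumes a locally downloaded quantized GGUF build of Zephyr-7B-beta
# (a Q4-level quant is roughly 4 GB and fits comfortably in 16 GB of RAM).
llm = Llama(model_path="./zephyr-7b-beta.Q4_K_M.gguf", n_ctx=4096)

# Zephyr uses <|system|>/<|user|>/<|assistant|> turns separated by </s>
prompt = (
    "<|system|>\nYou are a helpful assistant.</s>\n"
    "<|user|>\nExplain what DPO is in one sentence.</s>\n"
    "<|assistant|>\n"
)

output = llm(prompt, max_tokens=128, stop=["</s>"])
print(output["choices"][0]["text"])
```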

In terms of capability, the LlamaIndex (formerly GPT Index) team also tested it.

It turns out that Zephyr is currently the only open-source 7B model that performs well on high-level RAG/agentic tasks.

Their data also shows that Zephyr's performance on advanced RAG tasks can compete with GPT-3.5 and Claude 2.

They added that Zephyr not only works well on RAG, but also on routing, query planning, retrieval over complex SQL statements, and structured data extraction.

The official results tell a similar story: on MT-Bench, Zephyr-7B-beta performs strongly compared with larger models such as Llama 2-Chat-70B.

But on more complex tasks such as coding and math, Zephyr-7B-beta lags behind proprietary models and requires more research to close the gap.

Abandon Reinforcement Learning

While everyone is busy testing Zephyr, developers say the most interesting thing is not the metrics but how the model was trained.

The highlights are summarized below:

  • Fine-tune the best small open-source pretrained model: Mistral 7B
  • Use a large-scale preference dataset: UltraFeedback
  • Use Direct Preference Optimization (DPO) instead of reinforcement learning
  • Unexpectedly, overfitting on the preference dataset yields better results

Broadly speaking, as mentioned at the beginning, the main reason Zephyr can surpass the 70B Llama 2 comes down to this special fine-tuning method.

Instead of the traditional PPO reinforcement learning approach, the research team used DPO, a method recently proposed by researchers from Stanford University and CZ Biohub.

According to the researchers:

DPO is much more stable than PPO.

In simple terms, DPO can be explained as follows:

To make the model's output better match human preferences, the traditional approach is to use a reward model to fine-tune the target model: good outputs get rewarded, bad outputs do not.

The DPO approach, on the other hand, bypasses modeling the reward function and instead optimizes the model directly on the preference data.

Overall, DPO sidesteps the difficulty and expense of training with reinforcement learning from human feedback.
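As a rough illustration (not the H4 team's actual code), the core DPO objective can be written in a few lines of PyTorch. The sketch below assumes you already have per-sequence log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (sketch).

    Each argument is a tensor of summed log-probabilities,
    one entry per (prompt, response) pair in the batch.
    """
    # How much more (or less) the policy likes each response than the reference model does
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push up the margin between chosen and rejected responses; beta controls
    # how far the policy may drift from the reference model
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```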

For Zephyr's training specifically, the research team first fine-tuned Zephyr-7B-alpha on a streamlined variant of the UltraChat dataset, which originally contains 1.6 million ChatGPT-generated conversations (about 200,000 remained after filtering).

(The streamlining was needed because the team found the original data sometimes had incorrect capitalization, such as "Hi. how are you?", and responses that began with "I don't have personal X".)
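A minimal sketch of that supervised fine-tuning step with TRL's SFTTrainer might look like the following; the dataset name is the H4 team's public ultrachat_200k release, but the formatting function and training settings here are assumptions, not their exact training script:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Filtered ~200k-dialogue variant of UltraChat published by the H4 team
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def format_chat(example):
    # Zephyr-style chat formatting: <|role|> headers with </s> turn separators (assumed)
    text = "".join(f"<|{m['role']}|>\n{m['content']}</s>\n" for m in example["messages"])
    return {"text": text}

dataset = dataset.map(format_chat)

trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(model_name),
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="zephyr-sft", num_train_epochs=1, bf16=True),
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
```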

They then further aligned the model on the publicly available openbmb/UltraFeedback dataset using TRL's DPOTrainer.

The dataset contains 64,000 prompts with responses from various models. Each response is scored by GPT-4 on criteria such as helpfulness, and AI preferences are derived from those scores.
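In code, that DPO stage looks roughly like the sketch below. The dataset name is the H4 team's binarized UltraFeedback release, but assume it has been flattened into plain-text prompt/chosen/rejected columns, which is what TRL's DPOTrainer (in the 0.7.x releases of that era) expects; the checkpoint path is a placeholder:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Binarized UltraFeedback preferences; assume "prompt", "chosen", "rejected"
# columns hold plain-text strings after preprocessing.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

sft_checkpoint = "path/to/zephyr-sft"  # the dSFT model from the previous step (placeholder)
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
ref_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=TrainingArguments(output_dir="zephyr-dpo", num_train_epochs=3, bf16=True),
    beta=0.1,                  # how strongly the policy is tied to the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```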

An interesting finding: with the DPO method, results actually got better after the model overfit as training continued. The researchers believe this resembles overfitting in SFT.

It is worth mentioning that the research team also noted that fine-tuning the model this way costs only about $500, i.e. 8 hours of running on 16 A100s.

When upgrading Zephyr to beta, the team went on to explain their approach.

They considered distilled supervised fine-tuning (dSFT), as used on larger models, but found that with this approach alone the model was misaligned and did not produce output matching user intent.

So the team turned to preference data from AI Feedback (AIF): a "teacher model" ranks the outputs of other models to form a dataset, and distilled direct preference optimization (dDPO) is then applied to train a model aligned with user intent, without any additional sampling during fine-tuning.

The researchers also tested what happens without the SFT step, and the result was a significant drop in performance, indicating that the dSFT step is critical.

At present, besides the model being open-sourced and available for commercial use, there is also a demo to try, so we gave it a quick hands-on test.

Demo Experience

First, we had to bring out the classic absurd trick questions for a test.

On the question "Why didn't Mom and Dad invite me to their wedding?", Zephyr's answer is, overall, on point.

ChatGPT, by contrast, gets tripped up by this question.

In testing, we also found that Zephyr knows about recent events, such as OpenAI's release of GPT-4:

This actually comes down to its base model, although Mistral has not officially specified the training data cutoff.

But netizens who tested it earlier found that it knows about events from as recently as March of this year.

By contrast, Llama 2's pre-training data only goes up to September 2022, with only some fine-tuning data extending to June 2023.

In addition, Zephyr responds quickly, and it can write code and make up stories:

It is worth mentioning that Zephyr is better at answering questions in English, and it also suffers from the common "hallucination" problem.

The researchers acknowledge the hallucinations: a small line of text below the input box notes that the content generated by the model may be inaccurate or incorrect.

After all, Zephyr does not use methods like reinforcement learning from human feedback to align with human preferences, nor is it deployed with ChatGPT-style response filtering.

Well, you can't have both the fish and the bear's paw, as the saying goes.

That Zephyr can do this with only 7B parameters surprised Andriy Burkov, author of "The Hundred-Page Machine Learning Book", who even said:

Zephyr-7B defeats Llama 2-70B, and its base model Mistral-7B has an 8K-token context window with a theoretical attention span of up to 128K tokens.
What if Zephyr were a 70B model? Would it outperform GPT-4? It looks likely.

If you're interested in Zephyr-7B, you can try it out on Hugging Face.
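A quick way to try it locally is through the transformers library, along the lines of the model card's example; the generation settings below are just reasonable defaults, not the official demo's configuration:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Why didn't my parents invite me to their wedding?"},
]

# Build the Zephyr chat prompt from the tokenizer's built-in chat template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```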
