One line of code improves large model performance by 10%; developers call it a free lunch

2023-10-22 09:23:51

Original source: Qubits

Image source: Generated by Unbounded AI

There is a "free lunch" for large model fine-tuning, where a single line of code can improve performance by at least 10%.

Performance on the 7B-parameter Llama 2 even doubled, and Mistral improved by a quarter.

Although this method is used during the supervised fine-tuning phase, RLHF models can also benefit from it.

Researchers from the University of Maryland, New York University, and other institutions have proposed a fine-tuning method called NEFT(une).

This is a new regularization technique that can be used to improve the performance of supervised fine-tuned (SFT) models.

The method has been included in HuggingFace's TRL library and can be enabled by adding a single extra line of code.

Not only is NEFT easy to use, it also adds no significant cost, which is why the authors call it a "free lunch."

Some users tried fine-tuning Mistral-7B on Guanaco (a model of the alpaca family) in this way and saw a clear performance improvement.

So how does NEFTune give so many large models a boost with a single line of code?

Add noise to the model

The full name of NEFTune is Noisy Embedding Fine Tuning, that is, fine-tuning with noisy embeddings.

The developers believe that overfitting is a major factor limiting the performance of large models, so they add noise to the embedding layer during training to reduce overfitting and thereby improve performance.

Specifically, the text in the training dataset is first tokenized and converted into embedding vectors.

The system then randomly generates a noise vector and scales it to a set intensity with a scaling factor.

The scaled noise is added to the embedding vector as input to the model, and training begins.

With each training iteration, new noise is generated and added to the embedding layer.

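import torch
from torch.nn import functional as F

def NEFTune(model, noise_alpha=5):
    def noised_embed(orig_embed, noise_alpha):
        def new_func(x):
            # Add noise only in training mode; inference uses the clean embeddings.
            if model.training:
                embed_init = orig_embed(x)
                # Sequence length times embedding dimension.
                dims = torch.tensor(embed_init.size(1) * embed_init.size(2))
                mag_norm = noise_alpha / torch.sqrt(dims)
                # Uniform noise in [-mag_norm, mag_norm], resampled on every forward pass.
                return embed_init + torch.zeros_like(embed_init).uniform_(-mag_norm, mag_norm)
            else:
                return orig_embed(x)
        return new_func
    # Attribute path as in the original listing; it depends on how the model is wrapped.
    model.base_model.model.model.embed_tokens.forward = noised_embed(model.base_model.model.model.embed_tokens, noise_alpha)
    return model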

In this code, the noise_alpha parameter of the NEFTune function is the noise intensity (coefficient), and mag_norm is the actual range from which the noise is drawn.
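To make the relationship between noise_alpha and mag_norm concrete, here is a minimal sketch of the same arithmetic in plain Python. The function name neftune_noise_bound and the shapes used (sequence length 512, embedding dimension 4096) are assumptions chosen for illustration, not values from the paper.

```python
import math

def neftune_noise_bound(noise_alpha: float, seq_len: int, embed_dim: int) -> float:
    """Half-width of the uniform noise interval: alpha / sqrt(seq_len * embed_dim)."""
    return noise_alpha / math.sqrt(seq_len * embed_dim)

# Hypothetical shapes, for illustration only.
bound = neftune_noise_bound(noise_alpha=5, seq_len=512, embed_dim=4096)
print(f"each noise component is drawn from [-{bound:.6f}, {bound:.6f}]")
```

With these assumed shapes the bound is roughly 0.0035, so every component of the embedding is perturbed only slightly, and longer sequences or wider embeddings shrink the perturbation further.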

NEFT adds noise to the model only during training, not at inference time; the if statement in the code handles this.

In training mode, new_func returns the embeddings with noise added.
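The training-only behaviour can be tried without PyTorch. The sketch below is a toy stand-in, assuming a hypothetical NoisyEmbedding class backed by a plain dictionary; it is not the TRL or PyTorch implementation, only the same rule applied to Python lists.

```python
import math
import random

class NoisyEmbedding:
    """Toy embedding layer that adds uniform noise only while training."""

    def __init__(self, table, noise_alpha=5.0, seed=0):
        self.table = table              # token id -> embedding vector (list of floats)
        self.noise_alpha = noise_alpha
        self.training = True
        self._rng = random.Random(seed)

    def __call__(self, token_ids):
        embeds = [list(self.table[t]) for t in token_ids]
        if not self.training:
            return embeds               # inference: clean embeddings
        # Same scaling as the listing: alpha / sqrt(sequence length * embedding dim).
        bound = self.noise_alpha / math.sqrt(len(embeds) * len(embeds[0]))
        return [[v + self._rng.uniform(-bound, bound) for v in row] for row in embeds]

table = {0: [0.0, 0.0], 1: [1.0, -1.0]}   # hypothetical two-token vocabulary
layer = NoisyEmbedding(table)
noisy_a = layer([0, 1])                   # training mode: noise added
noisy_b = layer([0, 1])                   # fresh noise on every call
layer.training = False
clean = layer([0, 1])                     # evaluation mode: no noise
```

Calling the layer twice in training mode gives two different noisy results, while switching training off returns the stored vectors unchanged.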

The code above is shown only for explanation; if you just want to use NEFT, you can call it directly from the TRL library without writing out the full code.

The following code is an example of fine-tuning the OPT-350M model:

from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5,  # the one extra line that turns on NEFTune
)
trainer.train()

For data, the developers fine-tuned on four different datasets, including Alpaca and ShareGPT.

According to the authors, these datasets were chosen because, among other reasons, they are well known and have achieved SOTA results.

In addition, because of hardware constraints, only single-turn dialogue datasets were used in the experiments.

So, how does the large model perform after tuning with the NEFT method?

Performance up to doubled

The research team mainly tested the quality of the generated text and the dialogue ability of the models before and after tuning.

Of these, text quality is evaluated mainly on the Alpaca dataset, with ChatGPT and GPT-4 as judges.

The reference model is Text-Davinci-003, and the evaluation metric is the proportion of cases in which the trained model outperforms Text-Davinci-003.
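The metric described here is a simple proportion. A minimal sketch, assuming one judge verdict per prompt stored as the hypothetical labels "model" or "reference" (how ties are handled is not stated in the article, so anything other than "model" counts as a loss here):

```python
def win_rate(judgments):
    """Fraction of prompts on which the fine-tuned model beat the reference model."""
    if not judgments:
        raise ValueError("need at least one judgment")
    wins = sum(1 for verdict in judgments if verdict == "model")
    return wins / len(judgments)

# Hypothetical judge outputs, one per evaluation prompt.
judgments = ["model", "reference", "model", "model"]
print(f"win rate vs. reference: {win_rate(judgments):.0%}")
```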

To save resources, the research team first had ChatGPT decide whether to make the judgment itself or to call GPT-4; some cases were judged manually.
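One way to picture this two-stage evaluation is a cascade in which the cheaper judge may abstain and hand the case to the stronger one. The sketch below is only an assumption about the control flow: cascaded_judge and both toy judges are hypothetical placeholders, not the authors' evaluation code, and a real setup would call the respective model APIs.

```python
def cascaded_judge(prompt, model_answer, reference_answer, cheap_judge, strong_judge):
    """Ask the cheap judge first; defer to the strong judge when it abstains."""
    verdict = cheap_judge(prompt, model_answer, reference_answer)
    if verdict is None:  # cheap judge declined to decide
        verdict = strong_judge(prompt, model_answer, reference_answer)
    return verdict

def toy_cheap_judge(prompt, model_answer, reference_answer):
    # Placeholder: decides only the trivial case where exactly one answer is empty.
    if bool(model_answer) != bool(reference_answer):
        return "model" if model_answer else "reference"
    return None

def toy_strong_judge(prompt, model_answer, reference_answer):
    # Placeholder rule with no real meaning: always returns a verdict.
    return "model"

easy = cascaded_judge("q", "", "some answer", toy_cheap_judge, toy_strong_judge)
hard = cascaded_judge("q", "answer A", "answer B", toy_cheap_judge, toy_strong_judge)
```

The point of the design is cost: the stronger judge is called only for the cases the cheaper one cannot settle.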

The results show that, across the different training datasets, Llama 2 improves by at least 10% after NEFT tuning, and its performance doubles outright on the Alpaca dataset.

Applied to OPT and Llama 1, the NEFT approach also brings some performance improvement.

The model's chat ability is evaluated on the OpenLLM Leaderboard.

The results show that the chat ability of NEFT-tuned models also improves further over Evol-Instruct.

The authors also checked whether improving text quality and chat ability at no significant extra cost leads to a decline in other capabilities.

The results show that, across the different datasets and models, the NEFT method has no significant effect on the models' other capabilities.

During the experiments, the authors also found that the text generated by the model did not copy the training data, which suggests that the model has some ability to generalize.

To confirm this, the authors evaluated the model's loss and found that the loss on the test dataset was lower than on the training data, which supports this view.

In addition, the authors found that after NEFT tuning, the generated text improved in quality and grew longer, and the added content was not repetition.
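Whether longer output is just repeated output can be checked with a simple n-gram statistic. The sketch below shows one common way to do this, assuming whitespace tokenization and a hypothetical helper named repeated_ngram_rate; the article does not say which metric the authors used, so this should not be read as their measure.

```python
def repeated_ngram_rate(text, n=2):
    """Share of n-grams in the text that repeat an n-gram seen earlier."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

varied = repeated_ngram_rate("the model adds new details here")
looping = repeated_ngram_rate("very good very good very good")
```

A longer answer with a low rate has added new content, while a high rate signals that the extra length comes from looping.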

In order to confirm that the improvement in text quality was caused by the addition of noise and not by the increase in text length, the researchers also performed ablation experiments.

The results show that simply forcing the model to generate longer text cannot achieve the effect of NEFT.

Paper Address:
