One line of code improves large model performance by 10%, developer: free lunch
Original source: Qubits
Large model fine-tuning has a "free lunch": a single line of code can improve performance by at least 10%.
On Llama 2 with 7B parameters, performance even doubled, and Mistral improved by a quarter.
Although this method is used during the supervised fine-tuning phase, RLHF models can also benefit from it.
The method, called NEFT, is a new regularization technique for improving the performance of supervised fine-tuned (SFT) models.
NEFT is easy to use and adds no significant cost, which is why the authors call it a "free lunch."
Add noise to the model
NEFTune is short for Noisy Embedding Fine Tuning, that is, fine-tuning with noisy embeddings.
The developers believe that overfitting is a major factor limiting the performance of large models, so they add noise to the embedding layer during training to reduce overfitting and thereby improve performance.
During training, the input tokens are first mapped to embedding vectors. The system then randomly generates a noise vector and scales it to a set intensity with a scaling factor.
The scaled noise is added to the embedding vectors, and the result is fed to the model as input for training.
With each training iteration, new noise is generated and added to the embeddings.
import torch

def NEFTune(model, noise_alpha=5):
    def noised_embed(orig_embed, noise_alpha):
        # Keep the original forward so the patched version does not call itself
        orig_forward = orig_embed.forward
        def new_func(x):
            # Noise is added in training mode only
            if model.training:
                embed_init = orig_forward(x)
                dims = torch.tensor(embed_init.size(1) * embed_init.size(2))
                mag_norm = noise_alpha / torch.sqrt(dims)
                return embed_init + torch.zeros_like(embed_init).uniform_(-mag_norm, mag_norm)
            else:
                return orig_forward(x)
        return new_func
    # This attribute path appears to assume a PEFT-wrapped Llama-style model
    model.base_model.model.model.embed_tokens.forward = noised_embed(model.base_model.model.model.embed_tokens, noise_alpha)
    return model
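In this code, the noise_alpha parameter of the NEFTune function is the noise intensity (coefficient), and mag_norm is the bound of the noise range actually applied: noise_alpha divided by the square root of the sequence length times the embedding dimension.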
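To get a feel for the scale of mag_norm, here is a minimal sketch that repeats the same arithmetic in plain Python. The function name neft_noise_bound and the sequence lengths and embedding size are assumptions chosen for illustration, not values from the article.

```python
import math

def neft_noise_bound(noise_alpha: float, seq_len: int, embed_dim: int) -> float:
    """Half-width of the uniform noise range, mirroring mag_norm in the NEFTune code."""
    return noise_alpha / math.sqrt(seq_len * embed_dim)

# Illustrative shapes only (assumed, not taken from the article).
short = neft_noise_bound(noise_alpha=5, seq_len=128, embed_dim=4096)
long = neft_noise_bound(noise_alpha=5, seq_len=512, embed_dim=4096)

print(f"seq_len=128: noise in [-{short:.6f}, {short:.6f}]")
print(f"seq_len=512: noise in [-{long:.6f}, {long:.6f}]")
```

Because the bound divides by the square root of sequence length times embedding dimension, quadrupling the sequence length halves the per-element noise range.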
NEFT adds noise only during training, not at inference; the if statement in the code enforces this.
In training mode, new_func returns the embedding output with noise added.
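The training-versus-inference behavior can be checked on a toy embedding layer. The sketch below is not the authors' code: NoisyEmbedding is a hypothetical wrapper class, and the vocabulary size, embedding size, and batch shape are made up. It assumes PyTorch is installed and uses the same uniform noise and the same mag_norm formula.

```python
import torch
from torch import nn

class NoisyEmbedding(nn.Module):
    """Hypothetical wrapper: adds NEFT-style uniform noise in training mode only."""

    def __init__(self, embed: nn.Embedding, noise_alpha: float = 5.0):
        super().__init__()
        self.embed = embed
        self.noise_alpha = noise_alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.embed(x)  # shape: (batch, seq_len, embed_dim)
        if self.training:
            mag_norm = self.noise_alpha / (out.size(1) * out.size(2)) ** 0.5
            out = out + torch.empty_like(out).uniform_(-mag_norm, mag_norm)
        return out

torch.manual_seed(0)
base = nn.Embedding(num_embeddings=100, embedding_dim=16)  # toy sizes, assumed
noisy = NoisyEmbedding(base, noise_alpha=5.0)
tokens = torch.randint(0, 100, (2, 8))  # batch of 2 sequences, 8 tokens each

noisy.train()  # training mode: noise is added
train_diff = (noisy(tokens) - base(tokens)).abs().max().item()

noisy.eval()  # inference mode: output is the plain embedding
eval_diff = (noisy(tokens) - base(tokens)).abs().max().item()

bound = 5.0 / (8 * 16) ** 0.5
print("max deviation in train mode:", train_diff, "(bound:", bound, ")")
print("max deviation in eval mode:", eval_diff)
```

Wrapping the layer in a module, rather than overwriting forward, is only a design choice for this sketch; it keeps the original embedding untouched for comparison.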
The code above is shown for explanation only; if you just want to use NEFT, you can call it directly from the TRL library instead.
The following code is an example of fine-tuning the OPT-350M model:
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5,  # turns on NEFTune; newer TRL versions may expect this in the training config
)
trainer.train()
For fine-tuning, the developers used four different datasets in total, including Alpaca and ShareGPT.
According to the authors, these datasets were chosen because, among other reasons, they are well known and have achieved SOTA results.
In addition, because of hardware constraints, only single-turn dialogue datasets were used in the experiments.
So, how does the large model perform after tuning with the NEFT method?
Up to double the performance
The research team mainly tested the quality of the generated text and the dialogue ability of the models before and after tuning.
Text quality is assessed primarily on the Alpaca dataset, with ChatGPT and GPT-4 as evaluators.
The reference model is Text-Davinci-003, and the evaluation metric is the proportion of cases in which the fine-tuned model outperforms it.
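The metric described here is a win rate against the reference model. A minimal sketch of that calculation follows; the win_rate function, the verdict labels, and the sample verdicts are all made up for illustration and are not the authors' evaluation code or data. Any verdict other than "model" (including a tie) is assumed to count as a non-win.

```python
from collections import Counter

def win_rate(verdicts: list[str]) -> float:
    """Fraction of prompts where the judge preferred the fine-tuned model."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return counts["model"] / len(verdicts)

# Made-up judge verdicts, one per prompt: "model" or "reference".
verdicts = ["model", "reference", "model", "model", "reference"]
rate = win_rate(verdicts)
print(f"win rate: {rate:.0%}")  # prints "win rate: 60%"
```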
To save resources, the research team first let ChatGPT decide whether to evaluate a case itself or to call GPT-4, and in some cases the judging was done manually.
The results show that across the different training datasets, Llama 2 improves by at least 10% after tuning, and on the Alpaca dataset its performance doubles outright.
The chat capability of NEFT-tuned models was also found to improve further compared with Evol-Instruct.
The results also show that, across datasets and models, NEFT has no significant effect on the models' other capabilities.
To confirm that NEFT reduces overfitting, the authors evaluated the model loss and found that the loss on the test dataset was lower than on the training data, which supports this view.
In order to confirm that the improvement in text quality was caused by the addition of noise and not by the increase in text length, the researchers also performed ablation experiments.
The results show that simply forcing the model to generate longer text cannot achieve the effect of NEFT.