GPT-4 "alchemy" guide: MoE, parameter count, training cost, and inference secrets
Original by: Shixiang
Source: Overseas Unicorns
Authors: Dylan Patel, Gerald Wong
Edited by: Haina, Wenli, Cage
Editor: Siqi
This article is compiled from SemiAnalysis, the newsletter of Dylan Patel and Gerald Wong. Not long ago, Dylan Patel also broke the story of the leaked Google internal memo "We Have No Moat, And Neither Does OpenAI."
GPT-4 is the product of a deep combination of science and engineering innovation, with countless tricks along the way. For outsiders, understanding the structure of GPT-4 is like obtaining the "alchemy recipe" of the strongest model. This piece lays out GPT-4's architecture, training and inference infrastructure, parameter count, training dataset, token counts, costs, and MoE design in considerable detail.
Dylan and Gerald believe that the reason OpenAI does not disclose GPT-4's architecture is not the oft-cited AI safety concern, but the fact that the architecture is easy to copy. George Hotz, known as the "genius hacker," has expressed a similar opinion; he argues that GPT-4 is an MoE consisting of 8 expert models, each with about 110 billion parameters.
The two authors predict that Google, Meta, Anthropic, Inflection, Character.ai, Tencent, ByteDance, Baidu and others will have models as capable as GPT-4, or even more capable, in the near term. Even if GPT-4's architecture is "easy to copy," in their view OpenAI has the most durable moat: the largest base of end users, leading engineering talent, and a first-mover advantage across model generations.
Friendly reminder: the data in this article comes from the original authors' own collection and research and has not been confirmed by OpenAI, but Dylan Patel's research is generally considered highly reliable and serves as a good reference for studying GPT-4 in depth. We would also note that the "easy to copy" framing may come across as clickbait: outside OpenAI and Google, researchers who are good at training and serving complex MoE models are still scarce, and today's GPT-4 is only OpenAI's first-generation MoE rather than their final answer. Much of the experience gained along the way is unavailable to other teams, and that experience will remain a unique advantage for OpenAI.
The following is the table of contents of this article; we recommend reading it alongside the key points.
👇
01 Overview
02 Model Structure
03 Dataset
04 Parallel Strategy
05 Training cost
06 MoE
07 Inference
08 Inference Infra and Cost
09 Multi-query attention mechanism
10 Continuous Batching
11 Speculative decoding
12 Vision Multimodal
01. Overview
OpenAI's engineering capability and what they have built are amazing, but that does not mean their solutions are insurmountable. Their solution is elegant, and it involves weighing and balancing a series of complex factors, of which scaling up the model is only one part. **OpenAI's most durable moat comes from three things: they have the most real-world users, they have leading engineering talent, and they are likely to stay ahead in future model development.**
It is valuable to understand why GPT-4 chose the architecture it did. Beyond that, we will outline GPT-4's training and inference costs on A100s, and how the next-generation model architecture will scale out on H100s.
From GPT-3 to GPT-4, OpenAI wanted to scale the model up 100x, and **the core issue in that process is, naturally, cost**. Dense transformers are the architecture used by OpenAI's GPT-3, Google's PaLM, Meta's LLaMA, TII's Falcon, MosaicML's MPT, and others; at least 50 companies currently train LLMs with this architecture. It is a sensible architecture, but its scalability is limited.
Before GPT-4 was released, we discussed model training costs in the article AI Brick Wall. From a training-cost perspective, dense transformer models are about to hit their own "AI Brick Wall," and vendors will have to make architectural changes above the hardware level.
But over the past six months, we have realized that training cost may be a non-issue. Spending millions or even hundreds of millions of dollars to train a model sounds crazy, but for tech giants it is trivial. A large model is a capital expenditure (Capex) line item, and the larger the model, the better the results. The only limiting factor is whether humans have enough capability and time to provide feedback and modify the model architecture while scaling it up.
Meta spends more than $16 billion a year on the "metaverse," Google spends about $10 billion a year on new-project bets, Amazon has spent more than $50 billion on Alexa, and over $100 billion has been wasted on cryptocurrencies of no real value. Society as a whole will spend over $100 billion to create supercomputers capable of training large models that can be productized in many ways. Multiple countries and companies will **repeat the training effort on large models, which is the new "space arms race."** Compared with the earlier "waste of resources," the real value here will be realized in the short term thanks to human assistants and autonomous agents.
Over the next few years, Google, Meta, OpenAI, Microsoft and other companies will spend more than $100 billion building supercomputers to train models.
The more important problem with scaling models, the real "AI Brick Wall," lies in inference. The goal is to decouple training compute from inference compute, which is why it makes sense to train well beyond DeepMind's Chinchilla-optimal point for any model that will be deployed. (Shixiang note: training on more data so the model "over-learns" is a strategy for boosting small-model capability and lowering inference cost.) This is also why sparse model architectures are used; with a sparse architecture, not every parameter needs to be active during inference.
The essence of the inference problem is that the cost of deploying the model to users and agents is too high: inference costs several times more than training, and solving this is OpenAI's goal in both model architecture and infrastructure.
When it comes to inference for large models, especially dense models, model size becomes a multivariate problem. On-Device AI: Double-Edged Sword discussed this in the context of edge computing. Put simply, terminal devices can never have the throughput and memory bandwidth that large language models require, and even when bandwidth is sufficient, edge devices use hardware compute resources very inefficiently. Data centers face similar issues.
Utilization of compute resources is critical for data centers and clouds. (Note: the practical ceiling for GPU/TPU utilization in the industry today is around 50%.) One big reason NVIDIA's software is so widely praised is that as it ships each new generation of GPU, it also keeps shipping a new generation of software that raises FLOPS utilization by moving data more intelligently around the chip, between chips, and between memories.
At this stage, LLM inference use cases are mostly "live assistants," which means throughput must be high enough to actually be useful to users. By analogy, humans read at about 250 words per minute on average, and some reach about 1,000 words per minute. For the model, that means outputting at least 8.33 tokens per second, and ideally 33.33 tokens per second, to cover all human needs.
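As a quick check, these throughput targets follow directly from the reading speeds above; the quoted figures imply an assumption of roughly 2 tokens per word, which is on the conservative side:

```python
# The throughput targets above follow from reading speed. The quoted figures
# imply an assumption of roughly 2 tokens per word (a conservative ratio).
tokens_per_word = 2
for words_per_min in (250, 1000):
    tokens_per_sec = words_per_min * tokens_per_word / 60
    print(f"{words_per_min} words/min -> {tokens_per_sec:.2f} tokens/s")
# 250 words/min -> 8.33 tokens/s ; 1000 words/min -> 33.33 tokens/s
```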
However, because of memory-bandwidth limits, a dense model with a trillion parameters mathematically cannot reach this throughput even on the latest NVIDIA H100 GPU server. Every generated token requires every parameter to be streamed from memory onto the chip, and the generated token is then fed back in to produce the next token. On top of that, the KV cache used for the attention mechanism requires additional bandwidth.
The figure above shows the memory bandwidth required to serve an LLM to a single user at sufficiently high throughput. From it we can see that:
• Even a bandwidth 8 times that of H100 cannot serve a dense model with a scale of 1 trillion parameters at a rate of 33.33 tokens per second;
• Furthermore, the FLOPS utilization of 8x H100 is still below 5% at 20 tokens per second, which results in extremely high inference cost.
In fact, for today's 8-way tensor parallelized H100 system, the inference constraint is about 300 billion feed-forward parameters.
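A back-of-the-envelope roofline calculation, under assumptions of our own (FP8 weights streamed once per generated token, batch size 1, rounded peak H100 specs), gives the same picture:

```python
# Rough roofline check of the claims above, using approximate peak H100 numbers
# (~3.35 TB/s HBM bandwidth, ~2,000 TFLOP/s FP8); real systems fall short of peak.
params          = 1.0e12      # dense model size (1T parameters)
bytes_per_param = 1           # FP8 weights
hbm_bw          = 3.35e12     # bytes/s per H100
peak_flops      = 2.0e15      # FLOP/s per H100 (FP8)
n_gpus          = 8

bytes_per_token = params * bytes_per_param
max_tok_per_s   = n_gpus * hbm_bw / bytes_per_token          # bandwidth-bound ceiling
print(f"bandwidth-bound ceiling: {max_tok_per_s:.1f} tokens/s")  # ~26.8 < 33.33

# FLOPS utilization at 20 tokens/s with batch size 1 (2 FLOPs per param per token)
mfu = (20 * 2 * params) / (n_gpus * peak_flops)
print(f"MFU at 20 tok/s: {mfu:.2%}")                          # well under 5%
```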
Yet OpenAI is achieving human reading speed with A100s and a model of more than 1 trillion parameters, offered widely at the low price of $0.06 per 1,000 tokens, and this is possible precisely because of its sparse architecture.
Next, we discuss GPT-4's model architecture, the training and inference infra, parameter count, training-dataset composition, token counts, layer count, parallelism strategies, and the multimodal vision encoder, along with the trade-offs behind these engineering decisions, the implementation techniques, and how OpenAI resolves the bottlenecks of large-model inference.
02. Model structure
GPT-4 is more than 10 times the scale of GPT-3: we estimate it has about 1.8 trillion parameters, distributed across 120 transformer layers. For comparison, GPT-3 has about 175 billion parameters. (Note: GPT-3 has 96 transformer layers.)
To control costs, OpenAI chose an MoE model. OpenAI uses 16 experts in the model, each an MLP with about 111 billion parameters; 2 of these experts are routed to on each forward pass.
In addition, about 55 billion shared parameters are used in the attention mechanism.
Each forward pass for inference (generating one token) uses only about 280 billion parameters and about 560 TFLOPs, compared with the roughly 1.8 trillion parameters and 3,700 TFLOPs that a purely dense model would need per forward pass.
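The active-parameter figure follows directly from the expert counts above (these are the article's estimates, not numbers confirmed by OpenAI):

```python
# Arithmetic behind the active-parameter figure above (all counts are the
# article's estimates, not confirmed by OpenAI).
experts_total, experts_active = 16, 2
params_per_expert = 111e9
shared_attention  = 55e9

active = experts_active * params_per_expert + shared_attention   # ~277B, i.e. ~280B
total  = experts_total  * params_per_expert + shared_attention   # ~1.83T
print(f"active per token: {active/1e9:.0f}B of {total/1e12:.2f}T "
      f"({active/total:.0%} of the parameters, and roughly the same share of FLOPs)")
```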
03. Dataset
GPT-4 was trained on about 13 trillion tokens, which is plausible given that CommonCrawl RefinedWeb contains about 5 trillion high-quality tokens. For reference, DeepMind's Chinchilla and Google's PaLM were trained on about 1.4 trillion and about 0.78 trillion tokens respectively, and PaLM 2 is said to have been trained on about 5 trillion tokens.
The dataset OpenAI used to train GPT-4 is not 13 trillion unique tokens. Because high-quality tokens are scarce, the dataset contains multiple epochs: 2 epochs for text data and 4 epochs for code data. (Note: some high-quality text and code was seen by the model multiple times.) This is far from Chinchilla-optimal (which would require training on twice as many tokens), and it shows that easily obtainable tokens on the web are not enough. The high-quality text tokens that actually exist in the world should be 1,000 times what is available today, and audio and video tokens even more, but collecting them is not something simple web scraping can achieve. Unfortunately, we have not found much information about OpenAI's RLHF data.
In the pre-training stage, the context length (seqlen) was 8k. The 32k-context version of GPT-4 was built by fine-tuning the 8k model after pre-training.
The batch size was ramped up gradually on the cluster over several days, and in the end OpenAI used a batch size as high as 60 million tokens. Of course, since not every expert sees every token, this is effectively only a batch size of 7.5 million tokens per expert.
04. Parallel strategy
Parallel processing on all A100 GPUs is very important.
OpenAI uses 8-way tensor parallelism, 8-way because that is the limit of NVLink. In addition, we have heard that OpenAI uses 15-way pipeline parallelism. Theoretically, 15-way is too many given the communication and compute times, but it is reasonable if they are limited by memory capacity.
Using only pipeline parallelism and tensor parallelism, the parameters alone would take about 30GB per GPU at FP16; once the KV cache and other overhead are added, this architecture makes sense in theory if most of OpenAI's GPUs are 40GB A100s. They may use ZeRO Stage 1, block-level FSDP, or hybrid shared data parallelism on top of that.
The reason they do not use full-model FSDP may be the high communication cost. While OpenAI has high-speed networking between most nodes, it probably does not cover all of them, and we believe at least some clusters have much lower interconnect bandwidth than others.
It is unclear how OpenAI avoids huge bubbles with such high pipeline parallelism; chances are they simply bore the cost.
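The ~30GB figure above follows from splitting the estimated parameter count across that parallelism layout (a rough check under those assumptions, ignoring KV cache and other overhead):

```python
# Rough check of the ~30GB-per-GPU figure above: the estimated 1.8T parameters
# split across 15 pipeline stages x 8-way tensor parallelism, FP16 weights only.
total_params     = 1.8e12
pipeline, tensor = 15, 8
bytes_fp16       = 2

per_gpu_params = total_params / (pipeline * tensor)        # ~15B parameters per GPU
per_gpu_bytes  = per_gpu_params * bytes_fp16
print(f"~{per_gpu_params/1e9:.0f}B params -> ~{per_gpu_bytes/1e9:.0f} GB per GPU (FP16)")
```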
05. Training cost
**OpenAI used about 2.15e25 FLOPS to train GPT-4, running on about 25,000 A100 GPUs for 90 to 100 days at a compute utilization (MFU) of roughly 32% to 36%.**
This extremely low utilization is partly due to a large number of failures that required restarting from checkpoints; the bubbles mentioned above were also very costly.
Another reason is that an all-reduce across this many GPUs is very expensive, especially if, as we suspect, the cluster is actually a collection of smaller clusters with relatively weak networking between them: for example, 800G/1.6T non-blocking connections within parts of the cluster, but only 200G/400G between those parts.
If their cost in the cloud were about $1 per A100-hour, **the cost of this training run alone would be about $63 million**. This does not include all the trial runs, failed attempts, and other costs such as data collection, RLHF, and staff. Taking those into account, the real cost is much higher. You also have to consider that someone needs a team to do chip configuration, networking, and data centers, to bear the capital expenditure (Capex), and to rent the hardware to you.
Today, the same pre-training could be done in about 55 days on about 8,192 H100s at a total cost of about $21.5 million, assuming $2 per H100 GPU-hour.
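These figures are internally consistent; a quick sanity check using only the numbers quoted above (all of them the article's estimates and assumed $/GPU-hour rates):

```python
# Sanity-checking the training numbers quoted above (figures are the article's
# estimates; $/GPU-hour rates are the assumptions stated in the text).
seconds = lambda days: days * 86400

# Total FLOPs: 25k A100s at 312 TFLOP/s peak BF16, ~34% MFU, ~95 days
flops = 25_000 * 312e12 * 0.34 * seconds(95)
print(f"total compute ~{flops:.2e} FLOPs")          # ~2.2e25, close to the ~2.15e25 above

# A100 cost at $1/GPU-hour over 90-100 days
for days in (90, 100):
    print(f"{days} days on 25k A100s: ${25_000 * 24 * days * 1 / 1e6:.0f}M")
# -> roughly $54-60M of raw GPU time, the same ballpark as the ~$63M quoted above

# H100 scenario: 8,192 GPUs, ~55 days, $2/GPU-hour
print(f"H100 run: ${8_192 * 24 * 55 * 2 / 1e6:.1f}M")   # ~$21.6M
```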
We expect at least nine companies to have more H100 GPUs than this by the end of the year. Not all of those H100s will go to model training, but these companies will certainly embrace large models and become important players. Meta expects to have over 100,000 H100s by year end, a significant portion of which will be deployed in its own data centers for inference, though its largest single cluster will still exceed 25,000 H100s. (Note: Meta's compute resources will make the evolution of LLaMA an important variable for open source and private deployment.) Many companies will train a model with GPT-4-level capability before the end of this year.
06. MoE
MoE is an effective way to reduce the number of parameters used during inference while still increasing the total parameter count, which helps encode more information per training token. That matters because acquiring enough high-quality tokens is very difficult; if OpenAI really wanted to be Chinchilla-optimal, they would have had to train on twice as many tokens.
That being said, OpenAI makes several trade-offs. For example, dealing with MoE during inference is very difficult because not every part of the model is used when generating every token. This means that some parts may be dormant while other parts are being used. This can seriously impact utilization when servicing users.
The researchers proved that using 64 to 128 experts yielded better loss results than using 16 experts, but this is just research. There are several reasons for reducing the number of experts. One of the reasons OpenAI chose 16 experts is that having more experts makes it harder to generalize and achieve convergence. Given such a large training run, OpenAI chose to be more conservative in the number of experts.
Also, using fewer experts is helpful for inference architectures. There are various complex tradeoffs when moving to a MoE inference architecture. Let's start with the basic LLM inference tradeoffs, and then explore the problems OpenAI faced and the choices they made.
07. Inference
In this section, we first want to point out that every LLM company we have spoken with thinks NVIDIA's FasterTransformer inference library is quite bad, and TensorRT is even worse. Not being able to use NVIDIA's templates and modify them means building your own solution from scratch. NVIDIA needs to fix this soon to meet the needs of LLM inference, otherwise the de facto standard will become an open tool that makes it easier to plug in third-party hardware. More and more large models are coming, and if NVIDIA cannot provide a software advantage in inference while kernels still have to be hand-written, then AMD's MI300 and other hardware will have a much bigger market.
There are three key factors in LLM inference, and they mainly relate to the number of chips used.
1. Latency
The model must respond within a reasonable delay. People don't want to wait a few seconds before starting to receive output in a chat application. Input and output token processing times can fluctuate.
2. Throughput
The model must output a certain number of tokens per second. Human usage is about 30 tokens per second, and throughput can be lower or higher for various other use cases.
3. Utilization
The hardware running the model must achieve high utilization, or the cost will be prohibitive. Higher utilization can be achieved by grouping more user requests together, at the cost of higher latency and lower per-user throughput, which makes the trade-off harder.
LLM inference is mainly to balance two main factors, memory bandwidth and computation.
In simple terms, every parameter must be read from memory, and each read comes with 2 FLOPs of compute. The ratio on most chips (for example, an H100 SXM has only 3TB/s of memory bandwidth but 2,000 TFLOP/s of FP8 compute) is therefore completely unbalanced for inference at batch size 1. If only one user is served, i.e. batch size 1, the memory bandwidth needed to stream every parameter for each generated token dominates inference time, and compute time is almost negligible.
To scale a large model to many users, the batch size must exceed 1 so that many users share the cost of reading the parameters. With a batch size of 256 or 512, for example, each byte of memory read corresponds to 512 or 1,024 FLOPs. This ratio is much closer to the H100's own ratio of FLOPS to memory bandwidth, which helps achieve higher utilization, at the price of higher latency.
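Putting numbers on that intuition, using the rounded H100 figures quoted above (our own rough calculation, assuming FP8 weights and 2 FLOPs per parameter per token):

```python
# The batch-size intuition above in numbers, using the rounded H100 figures
# from the text (~2,000 TFLOP/s FP8 vs ~3 TB/s HBM).
peak_flops = 2_000e12      # FLOP/s, FP8
hbm_bw     = 3e12          # bytes/s
flops_per_byte_hw = peak_flops / hbm_bw               # ~667 FLOPs per byte

flops_per_param_per_token = 2
bytes_per_param           = 1                         # FP8 weights
breakeven_batch = flops_per_byte_hw * bytes_per_param / flops_per_param_per_token
print(f"hardware ratio ~{flops_per_byte_hw:.0f} FLOPs/byte -> "
      f"compute-bound above a batch of ~{breakeven_batch:.0f} tokens")
for batch in (1, 256, 512):
    print(f"batch {batch}: {batch * flops_per_param_per_token} FLOPs per byte of weights read")
```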
Many people think memory capacity is the main bottleneck for LLM inference, because the model has to be fit across multiple chips, but this view is questionable. Although inference for large models does require multiple chips, and more memory per chip means fewer chips are needed to fit the model, it is actually better to use more chips than strictly necessary, so that latency can be reduced, throughput increased, and larger batch sizes used to keep pushing utilization up.
If an application requires the lowest possible latency, we need more chips and must partition the model in as many ways as is economical. Smaller batch sizes give lower latency, but smaller batch sizes also lead to worse MFU (utilization), and therefore a higher total cost per token (in chip-seconds or dollars).
If an application does offline inference and latency is not an issue, the main goal is to maximize throughput per chip (i.e. minimize total cost per token). Increasing the batch size is the most effective lever, since larger batches generally yield better MFU (utilization), and partitioning strategies that are inefficient at small batch sizes become efficient as the batch size grows.
**More chips and larger batch sizes are cheaper because they increase utilization, but they also introduce a third variable: networking time.** Splitting the model across more chips helps with latency, but at the expense of utilization.
Both the weight-loading part of storage time and the non-attention compute time are proportional to model size and inversely proportional to the number of chips. However, for a given partition layout, the chip-to-chip communication time decreases more slowly (or not at all) as chips are added, so as the chip count grows, communication becomes an increasingly important bottleneck.
We also note that the memory required for the KV cache explodes as batch size and sequence length grow.
If an application needs to generate text with long attention contexts, inference time increases dramatically. For a 500B+ model with multi-head attention, the attention KV cache can become very large: at a batch size of 512 and a context length of 2048, the KV cache totals about 3TB, three times the size of the model's parameters. The KV cache must be loaded from off-chip memory into on-chip memory every time a token is generated, and during that time the chip's compute cores are essentially idle.
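The standard KV-cache size formula shows where numbers of this magnitude come from; the layer count and hidden size below are illustrative placeholders (not the example model's real dimensions), chosen only to land near the ~3TB figure:

```python
# Standard multi-head-attention KV-cache size: keys and values stored for every
# layer, batch element, and sequence position. n_layers and d_model below are
# hypothetical, picked so the result matches the ~3TB order of magnitude above.
def kv_cache_bytes(batch, seq_len, n_layers, d_model, bytes_per_elem=2):
    # 2x for keys and values, FP16 elements by default
    return 2 * n_layers * d_model * batch * seq_len * bytes_per_elem

size = kv_cache_bytes(batch=512, seq_len=2048, n_layers=90, d_model=8192)
print(f"KV cache ~{size/1e12:.1f} TB")    # ~3.1 TB at these assumed dimensions
```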
Long sequence lengths are particularly troublesome for memory bandwidth and memory capacity. The reason why OpenAI's GPT-3.5 turbo with 16k contexts and GPT-4 with 32k contexts are expensive is that they cannot take larger batches due to memory constraints.
Smaller batches result in lower hardware utilization. Also, the KV cache bloats as the sequence length increases. The KV cache cannot be shared between users, so separate memory reads are required, further reducing memory bandwidth. See below for more information on MQA.
08. Inference Infra and Cost
Infra
The architecture of MoE makes the inference of GPT-4 face challenges in terms of latency, throughput and utilization. Because the forward pass of each token can be routed to different expert models, it is very difficult to achieve low latency, high throughput and high utilization in this case, especially at high batch-size.
OpenAI's GPT-4 architecture contains 16 expert models, and each forward pass routes to 2 of them. That means that with a batch size of 8, the parameter reads for each expert may amount to a batch size of only 1. Worse, one expert might see a batch size of 8 while others see 4, 1, or 0.
In addition, the routing algorithm routes the forward pass in different directions each time a token is generated, which results in significant variations in token-to-token latency and expert batch size. That is, when processing different tokens, different experts may be assigned to different tasks, and both the computational load and the batch size may vary accordingly.
Inference infra is one of the main considerations for OpenAI to choose a small number of experts in the design of MoE. If they use more experts, memory bandwidth becomes a bigger bottleneck for inference. OpenAI often achieves batch-sizes above 4k on its own inference clusters, which means that even with optimal load balancing among experts, each expert can only reach a batch size of about 500. This requires very large usage to achieve.
Our understanding is that OpenAI runs inference on clusters of 128 GPUs and has multiple such clusters across data centers and geographies. Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. With 8 GPUs per node, each node holds only about 130B parameters, i.e. less than 30GB per GPU at FP16 and less than 15GB at FP8/int8. This allows inference to run on 40GB A100s as long as the KV cache across all batches does not balloon too much.
FP16, FP8, and int8 are different numerical precisions commonly used in deep learning computation to reduce memory and compute usage and thereby make training and inference more efficient.
FP16, FP8, and int8 refer to 16-bit floating point, 8-bit floating point, and 8-bit integer formats respectively. Their precision is lower than 32-bit single-precision floating point (FP32), but they greatly reduce memory and compute requirements and so accelerate training and inference. For example, FP16 can more than halve computation time without losing much accuracy, and int8 can cut computation time by roughly 4x without losing much accuracy.
Note that low-precision computation can affect model accuracy, so a trade-off between accuracy and efficiency is needed, and the most appropriate precision should be chosen for the task at hand.
To keep network communication from becoming too irregular, and to avoid the prohibitive cost of recomputing the KV cache on every token generation, the layers containing the various experts are not split across different nodes, so that the KV cache can be shared.
**The biggest difficulty for all future MoE scale-ups and conditional routing is how to handle routing around the KV cache, given the limit of 120 layers.**
In the MoE model, the number of routed layers per branch cannot exceed 120, otherwise the KV cache cannot be handled effectively; during inference each branch needs to compute its own KV cache, which drives up the computational cost.
A simple approach, given the 120-layer limit, is to spread the routed layers across 15 different nodes, which distributes the compute evenly and improves efficiency and performance. However, because the first node has to do data loading and embedding, it matters how few layers are placed on the head node of the inference cluster.
In addition, there is some chatter about speculative decoding, which we discuss further below; a more important question is whether that chatter should be believed, but it would also explain why it makes sense to put fewer layers on the head node.
Inference Cost
Compared with the 175B-parameter Davinci model, GPT-4 has 1.6 times the feed-forward parameters but costs 3 times as much. This is mainly because GPT-4 requires a larger cluster and achieves lower utilization.
We estimate that inference for GPT-4 with 8k context length (seqlen) on 128 A100s costs about $0.0049 per 1k tokens, while on 128 H100s it costs about $0.0021 per 1k tokens. (Note: GPT-4-8k currently charges $0.03 per 1k input tokens and $0.06 per 1k output tokens; OpenAI's actual use of inference chips will not be as lavish as the author assumes, so this calculation can serve as a lower bound for future price cuts.) It is important to note that **these costs assume high utilization and large batch sizes.**
We hypothesize that OpenAI shuts down the cluster during downturns and repurposes those nodes for other tasks, such as resuming checkpoint training of small test models, or experimenting with various new techniques. Doing so helps keep inference costs low, otherwise OpenAI's utilization could be even lower, implying more than 2x the cost estimate.
09. Multi-query attention mechanism
The use of multi-query attention (MQA) is quite common, but we want to emphasize that OpenAI uses it too. In short, only one KV head is needed, which dramatically shrinks the memory the KV cache requires. Even so, GPT-4 with 32k context certainly cannot run on 40GB A100s, and the 8k version is capped in its maximum batch size. Without MQA, the 8k version's maximum batch size would be severely limited and the economics far worse.
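For intuition, here is a minimal NumPy sketch of MQA, with toy dimensions of our own (not GPT-4's): all query heads share a single key/value head, so the per-token KV cache shrinks by roughly the number of heads.

```python
# Minimal NumPy sketch of multi-query attention (MQA). Causal masking is
# omitted for brevity; all dimensions are hypothetical toy values.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq, wk, wv, n_heads):
    """x: (seq, d_model); wq: (d_model, n_heads*d_head); wk, wv: (d_model, d_head)."""
    seq, _ = x.shape
    d_head = wk.shape[1]
    q = (x @ wq).reshape(seq, n_heads, d_head)   # one query projection per head
    k = x @ wk                                   # single shared key head
    v = x @ wv                                   # single shared value head
    out = np.empty((seq, n_heads, d_head))
    for h in range(n_heads):
        scores = (q[:, h, :] @ k.T) / np.sqrt(d_head)
        out[:, h, :] = softmax(scores) @ v
    return out.reshape(seq, n_heads * d_head)

x  = np.random.randn(4, 64)                          # (seq=4, d_model=64) toy input
wq, wk, wv = np.random.randn(64, 8 * 16), np.random.randn(64, 16), np.random.randn(64, 16)
y  = multi_query_attention(x, wq, wk, wv, n_heads=8)  # (4, 128)

# KV cache per token: standard MHA stores n_heads K/V pairs per layer, MQA just one.
n_heads, d_head, n_layers = 96, 128, 120              # hypothetical
mha_kv = 2 * n_layers * n_heads * d_head * 2          # K+V, FP16 bytes
mqa_kv = 2 * n_layers * 1 * d_head * 2
print(f"KV cache per token: MHA {mha_kv/1e6:.1f} MB vs MQA {mqa_kv/1e3:.1f} KB")
```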
10. Continuous Batching
To allow some bound on worst-case latency while optimizing inference cost, OpenAI uses both variable batch sizes and continuous batching. This improves compute utilization without sacrificing model quality, achieving lower latency and higher throughput during inference. If the concept of continuous batching is unfamiliar, AnyScale's article How continuous batching enables 23x throughput in LLM inference while reducing p50 latency is worth reading. (Shixiang note: Ray, the distributed computing framework developed by Anyscale, is used in OpenAI's model infra pipeline; Shixiang has previously published research on the company.)
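The core idea is simple enough to sketch: instead of waiting for a whole static batch to finish, the scheduler evicts finished sequences and admits waiting requests after every decoding step. A simplified illustration follows (function and field names are our own, not OpenAI's or AnyScale's):

```python
# Simplified sketch of continuous (iteration-level) batching.
from collections import deque

def continuous_batching(requests, step_fn, max_batch=8):
    """requests: iterable of prompts; step_fn(seqs) -> [(next_token, done), ...] per seq."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit waiting requests into any free batch slots; unlike static
        # batching, there is no need to wait for the whole batch to finish.
        while waiting and len(running) < max_batch:
            running.append({"prompt": waiting.popleft(), "tokens": [], "done": False})
        # One decode step for every sequence currently in the batch.
        for seq, (token, done) in zip(running, step_fn(running)):
            seq["tokens"].append(token)
            seq["done"] = done
        # Evict finished sequences immediately so their slots free up next step.
        finished.extend(s for s in running if s["done"])
        running = [s for s in running if not s["done"]]
    return finished
```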
11. Speculative decoding
There are rumors that OpenAI uses speculative decoding in GPT-4 inference. While we cannot confirm this, the general variation in latency and in token-to-token variance between simple retrieval tasks and more complex tasks does seem to suggest it is possible. However, there are too many variables to confirm whether the technique is actually used.
Using LLMs is generally divided into two phases:
1. Prefill stage
In this phase, a prompt is given as input and run through the model to generate the KV cache and the first output logits, where the logits are the probability-distribution vector the LLM outputs at each step, representing how likely each token is. This prefill phase is usually fast because it can be computed in parallel.
2. Decoding stage
In this stage, a token is selected from the output logits and fed back into the model to produce the logits for the next token, and this repeats until the desired number of tokens has been generated. Because decoding must happen sequentially, one token at a time, the arithmetic intensity of this stage (i.e. FLOPs computed per byte of memory bandwidth) is very low when running at small batch sizes, leaving compute underutilized. Decoding is therefore usually the most expensive part of autoregressive generation.
This is why it is much cheaper to input tokens than output tokens in OpenAI's API calls.
The core idea of speculative decoding is to use a smaller, faster draft model to decode several tokens ahead of time and feed them into the oracle model as a batch. If the draft model's predictions are correct (i.e. agree with the oracle model's predictions), one batch can be used to decode several tokens, saving a lot of memory bandwidth and time per token.
However, if the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and the algorithm reverts to standard token-by-token decoding. Speculative decoding can also be combined with a rejection sampling scheme to sample tokens from the original distribution. Note that this approach only works in small batch settings where bandwidth is the bottleneck.
In short, speculative decoding trades compute for bandwidth, and it is an attractive optimization target for two key reasons. First, it does not degrade model quality at all, because it only changes the computation of the decoding stage to speed up inference and raise throughput. Second, its benefits are generally independent of other methods, because its advantage comes from turning sequential computation into parallel execution, whereas other methods optimize the model structure, parameters, or training.
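A minimal sketch of the scheme described above, with stand-in model objects (`draft_model`, `oracle_model` and their methods are hypothetical; greedy acceptance is shown, whereas matching a sampled distribution would require the rejection-sampling variant mentioned earlier):

```python
# Minimal sketch of speculative decoding: a small draft model proposes k tokens,
# the large "oracle" model verifies them in one batched forward pass, and tokens
# are accepted until the first disagreement. Model objects are stand-ins.

def speculative_decode(prompt, draft_model, oracle_model, k=4, max_new_tokens=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model cheaply proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(tokens + draft))
        # 2. Oracle model checks all positions in a single forward pass;
        #    oracle_preds has k+1 entries: the oracle's next token after each
        #    prefix tokens + draft[:i] for i = 0..k.
        oracle_preds = oracle_model.next_tokens_batched(tokens, draft)
        # 3. Accept draft tokens until the first mismatch, then take the
        #    oracle's own token at that position (this also guarantees progress).
        n_accepted = 0
        for d, o in zip(draft, oracle_preds):
            if d != o:
                break
            n_accepted += 1
        tokens.extend(draft[:n_accepted])
        tokens.append(oracle_preds[n_accepted])
    return tokens
```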
Current inference methods predict a single draft sequence per batch. However, **this approach does not scale well to large batch sizes or low-precision draft models.** Intuitively, the probability that the two models agree on a long contiguous sequence of tokens decreases exponentially, which means that as the arithmetic intensity is scaled up, the returns from speculative decoding fall off quickly.
We think that if OpenAI is using speculative decoding, they are likely only using it for short sequences of about 4 tokens in length. In addition, some people think that the decline in the performance of the GPT-4 model is because OpenAI added low-probability sequences from the speculative decoding model to the model pre-training, which may not be true.
Some also believe that Bard uses speculative decoding, because Google waits for the full sequence to be generated before sending it to the user, but we do not believe this guess is true.
12. Vision Multimodal
Vision Multi-Modal: It refers to the joint processing and analysis of information from different modalities (such as image, text, voice, etc.). Usually, the information of these different modalities is semantically related, so combining them can provide richer information and more accurate inference results.
GPT-4's visual multimodal capability is implemented through a vision encoder separate from the text encoder, with cross-attention between the two; the architecture is said to resemble Flamingo. The vision encoder adds parameters on top of the 1.8-trillion-parameter GPT-4 and is fine-tuned with roughly 2 trillion additional tokens after the text-only pre-training, rather than being pre-trained on vision data from the start.
OpenAI originally wanted to train the vision model from scratch, but the approach was not yet mature, so they chose to start from text to reduce risk.
**Rumor has it that OpenAI's GPT-5 will train its vision model from scratch, and will be able to generate images itself and also process audio.**
A major goal of visual multimodality is to enable autonomous agents to read web pages and transcribe their image and video content. The data OpenAI trained this model on includes joint data (rendered LaTeX/text), web page screenshots, and sampled frames from YouTube videos, with Whisper used for transcription.
An interesting point about all this LLM over-optimization is that the IO cost of a vision model differs from that of a text-only model. In a text model, IO is very cheap, but in a vision model the IO cost of data loading is about 150 times higher: each token is around 600 bytes, versus 4 bytes for text. A lot of work is going into image compression research. (Shixiang note: text is easier to compress; image/video tokenization is a direction worth watching in the multimodal field.)
This matters a great deal for hardware vendors who are optimizing their hardware for the workloads of 2-3 years from now: they may find themselves in a world where every model has strong vision and audio capabilities and their architecture is a poor fit. In general, future LLM architectures will certainly evolve beyond the simplified text-based dense and/or MoE models we see today.