GPT-4 "alchemy" guide: MoE, parameter count, training cost, and inference secrets
Original by: Shixiang
Source: Overseas Unicorns
Authors: Dylan Patel, Gerald Wong
Edited by: Haina, Wenli, Cage
Editor: Siqi
This article is compiled from SemiAnalysis, the newsletter of Dylan Patel and Gerald Wong. Not long ago, Dylan Patel also broke the story of the leaked Google internal memo "We Have No Moat, And Neither Does OpenAI."
GPT-4 is the product of a deep combination of science and engineering innovation, with countless tricks along the way. For outsiders, understanding the structure of GPT-4 is like obtaining the "alchemy recipe" of the strongest model. This piece lays out GPT-4's architecture, training and inference infrastructure, parameter count, training dataset, token counts, costs, and MoE design in considerable detail.
Dylan and Gerald believe that the reason OpenAI does not disclose GPT-4's architecture is not the oft-cited AI safety concern, but the fact that the architecture is easy to copy. George Hotz, known as the "genius hacker," has expressed a similar opinion; he argues that GPT-4 is an MoE consisting of 8 expert models, each with about 110 billion parameters.
The two authors predict that Google, Meta, Anthropic, Inflection, Character.ai, Tencent, ByteDance, Baidu and others will have models as capable as GPT-4, or even more capable, in the near term. Even if GPT-4's architecture is "easy to copy," in their view OpenAI has the most durable moat: the largest base of end users, leading engineering talent, and a first-mover advantage across model generations.
Friendly reminder: the data in this article comes from the original authors' own collection and research and has not been confirmed by OpenAI, but Dylan Patel's research is generally considered highly reliable and serves as a good reference for studying GPT-4 in depth. We would also note that the "easy to copy" framing may come across as clickbait: outside OpenAI and Google, researchers who are good at training and serving complex MoE models are still scarce, and today's GPT-4 is only OpenAI's first-generation MoE rather than their final answer. Much of the experience gained along the way is unavailable to other teams, and that experience will remain a unique advantage for OpenAI.
The following is the table of contents of this article; we recommend reading it alongside the key points.
👇
01 Overview
02 Model Structure
03 Dataset
04 Parallel Strategy
05 Training cost
06 MoE
07 Inference
08 Inference Infra and Cost
09 Multi-query attention mechanism
10 Continuous Batching
11 Speculative decoding
12 Vision Multimodal
01. Overview
OpenAI's engineering capability and what they have built are amazing, but that does not mean their solutions are insurmountable. Their solution is elegant, and it involves weighing and balancing a series of complex factors, of which scaling up the model is only one part. **OpenAI's most durable moat comes from three things: they have the most real-world users, they have leading engineering talent, and they are likely to stay ahead in future model development.**
It is valuable to understand why GPT-4 chose the architecture it did. Beyond that, we will outline GPT-4's training and inference costs on A100s, and how the next-generation model architecture will scale out on H100s.
From GPT-3 to GPT-4, OpenAI wanted to scale the model up 100x, and **the core issue in that process is, naturally, cost**. Dense transformers are the architecture used by OpenAI's GPT-3, Google's PaLM, Meta's LLaMA, TII's Falcon, MosaicML's MPT, and others; at least 50 companies currently train LLMs with this architecture. It is a sensible architecture, but its scalability is limited.
Before GPT-4 was released, we discussed model training costs in the article AI Brick Wall. From a training-cost perspective, dense transformer models are about to hit their own "AI Brick Wall," and vendors will have to make architectural changes above the hardware level.
But over the past six months, we have realized that training cost may be a non-issue. Spending millions or even hundreds of millions of dollars to train a model sounds crazy, but for tech giants it is trivial. A large model is a capital expenditure (Capex) line item, and the larger the model, the better the results. The only limiting factor is whether humans have enough capability and time to provide feedback and modify the model architecture while scaling it up.
Meta spends more than $16 billion a year on the "metaverse," Google spends about $10 billion a year on new-project bets, Amazon has spent more than $50 billion on Alexa, and over $100 billion has been wasted on cryptocurrencies of no real value. Society as a whole will spend over $100 billion to create supercomputers capable of training large models that can be productized in many ways. Multiple countries and companies will **repeat the training effort on large models, which is the new "space arms race."** Compared with the earlier "waste of resources," the real value here will be realized in the short term thanks to human assistants and autonomous agents.
Over the next few years, Google, Meta, OpenAI, Microsoft and other companies will spend more than $100 billion building supercomputers to train models.
The more important problem with scaling models, the real "AI Brick Wall," lies in inference. The goal is to decouple training compute from inference compute, which is why it makes sense to train well beyond DeepMind's Chinchilla-optimal point for any model that will be deployed. (Shixiang note: training on more data so the model "over-learns" is a strategy for boosting small-model capability and lowering inference cost.) This is also why sparse model architectures are used; with a sparse architecture, not every parameter needs to be active during inference.
The essence of the inference problem is that the cost of deploying the model to users and agents is too high: inference costs several times more than training, and solving this is OpenAI's goal in both model architecture and infrastructure.
When it comes to inference for large models, especially dense models, model size becomes a multivariate problem. On-Device AI: Double-Edged Sword discussed this in the context of edge computing. Put simply, terminal devices can never have the throughput and memory bandwidth that large language models require, and even when bandwidth is sufficient, edge devices use hardware compute resources very inefficiently. Data centers face similar issues.
Utilization of compute resources is critical for data centers and clouds. (Note: the practical ceiling for GPU/TPU utilization in the industry today is around 50%.) One big reason NVIDIA's software is so widely praised is that as it ships each new generation of GPU, it also keeps shipping a new generation of software that raises FLOPS utilization by moving data more intelligently around the chip, between chips, and between memories.
At this stage, LLM inference use cases are mostly "live assistants," which means throughput must be high enough to actually be useful to users. By analogy, humans read at about 250 words per minute on average, and some reach about 1,000 words per minute. For the model, that means outputting at least 8.33 tokens per second, and ideally 33.33 tokens per second, to cover all human needs.
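As a quick check, these throughput targets follow directly from the reading speeds above; the quoted figures imply an assumption of roughly 2 tokens per word, which is on the conservative side:

```python
# The throughput targets above follow from reading speed. The quoted figures
# imply an assumption of roughly 2 tokens per word (a conservative ratio).
tokens_per_word = 2
for words_per_min in (250, 1000):
    tokens_per_sec = words_per_min * tokens_per_word / 60
    print(f"{words_per_min} words/min -> {tokens_per_sec:.2f} tokens/s")
# 250 words/min -> 8.33 tokens/s ; 1000 words/min -> 33.33 tokens/s
```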
However, because of memory-bandwidth limits, a dense model with a trillion parameters mathematically cannot reach this throughput even on the latest NVIDIA H100 GPU server. Every generated token requires every parameter to be streamed from memory onto the chip, and the generated token is then fed back in to produce the next token. On top of that, the KV cache used for the attention mechanism requires additional bandwidth.
The figure above shows the memory bandwidth required to serve an LLM to a single user at sufficiently high throughput. From it we can see that:
• Even a bandwidth 8 times that of H100 cannot serve a dense model with a scale of 1 trillion parameters at a rate of 33.33 tokens per second;
• Furthermore, the FLOPS utilization of 8x H100 is still below 5% at 20 tokens per second, which results in extremely high inference cost.
In fact, for today's 8-way tensor parallelized H100 system, the inference constraint is about 300 billion feed-forward parameters.
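A back-of-the-envelope roofline calculation, under assumptions of our own (FP8 weights streamed once per generated token, batch size 1, rounded peak H100 specs), gives the same picture:

```python
# Rough roofline check of the claims above, using approximate peak H100 numbers
# (~3.35 TB/s HBM bandwidth, ~2,000 TFLOP/s FP8); real systems fall short of peak.
params          = 1.0e12      # dense model size (1T parameters)
bytes_per_param = 1           # FP8 weights
hbm_bw          = 3.35e12     # bytes/s per H100
peak_flops      = 2.0e15      # FLOP/s per H100 (FP8)
n_gpus          = 8

bytes_per_token = params * bytes_per_param
max_tok_per_s   = n_gpus * hbm_bw / bytes_per_token          # bandwidth-bound ceiling
print(f"bandwidth-bound ceiling: {max_tok_per_s:.1f} tokens/s")  # ~26.8 < 33.33

# FLOPS utilization at 20 tokens/s with batch size 1 (2 FLOPs per param per token)
mfu = (20 * 2 * params) / (n_gpus * peak_flops)
print(f"MFU at 20 tok/s: {mfu:.2%}")                          # well under 5%
```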
Yet OpenAI is achieving human reading speed with A100s and a model of more than 1 trillion parameters, offered widely at the low price of $0.06 per 1,000 tokens, and this is possible precisely because of its sparse architecture.
Next, we discuss GPT-4's model architecture, the training and inference infra, parameter count, training-dataset composition, token counts, layer count, parallelism strategies, and the multimodal vision encoder, along with the trade-offs behind these engineering decisions, the implementation techniques, and how OpenAI resolves the bottlenecks of large-model inference.
02. Model structure
GPT-4 is more than 10 times the scale of GPT-3: we estimate it has about 1.8 trillion parameters, distributed across 120 transformer layers. For comparison, GPT-3 has about 175 billion parameters. (Note: GPT-3 has 96 transformer layers.)
To control costs, OpenAI chose an MoE model. OpenAI uses 16 experts in the model, each an MLP with about 111 billion parameters; 2 of these experts are routed to on each forward pass.
In addition, about 55 billion shared parameters are used in the attention mechanism.
Each forward pass for inference (generating one token) uses only about 280 billion parameters and about 560 TFLOPs, compared with the roughly 1.8 trillion parameters and 3,700 TFLOPs that a purely dense model would need per forward pass.
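The active-parameter figure follows directly from the expert counts above (these are the article's estimates, not numbers confirmed by OpenAI):

```python
# Arithmetic behind the active-parameter figure above (all counts are the
# article's estimates, not confirmed by OpenAI).
experts_total, experts_active = 16, 2
params_per_expert = 111e9
shared_attention  = 55e9

active = experts_active * params_per_expert + shared_attention   # ~277B, i.e. ~280B
total  = experts_total  * params_per_expert + shared_attention   # ~1.83T
print(f"active per token: {active/1e9:.0f}B of {total/1e12:.2f}T "
      f"({active/total:.0%} of the parameters, and roughly the same share of FLOPs)")
```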
03. Dataset
GPT-4 was trained on about 13 trillion tokens, which is plausible given that CommonCrawl RefinedWeb contains about 5 trillion high-quality tokens. For reference, DeepMind's Chinchilla and Google's PaLM were trained on about 1.4 trillion and about 0.78 trillion tokens respectively, and PaLM 2 is said to have been trained on about 5 trillion tokens.
The dataset OpenAI used to train GPT-4 is not 13 trillion unique tokens. Because high-quality tokens are scarce, the dataset contains multiple epochs: 2 epochs for text data and 4 epochs for code data. (Note: some high-quality text and code was seen by the model multiple times.) This is far from Chinchilla-optimal (which would require training on twice as many tokens), and it shows that easily obtainable tokens on the web are not enough. The high-quality text tokens that actually exist in the world should be 1,000 times what is available today, and audio and video tokens even more, but collecting them is not something simple web scraping can achieve. Unfortunately, we have not found much information about OpenAI's RLHF data.
In the pre-training stage, the context length (seqlen) was 8k. The 32k-context version of GPT-4 was built by fine-tuning the 8k model after pre-training.
The batch size was ramped up gradually on the cluster over several days, and in the end OpenAI used a batch size as high as 60 million tokens. Of course, since not every expert sees every token, this is effectively only a batch size of 7.5 million tokens per expert.
04. Parallel strategy
Parallel processing on all A100 GPUs is very important.
OpenAI uses 8-way tensor parallelism, 8-way because that is the limit of NVLink. In addition, we have heard that OpenAI uses 15-way pipeline parallelism. Theoretically, 15-way is too many given the communication and compute times, but it is reasonable if they are limited by memory capacity.
Using only pipeline parallelism and tensor parallelism, the parameters alone would take about 30GB per GPU at FP16; once the KV cache and other overhead are added, this architecture makes sense in theory if most of OpenAI's GPUs are 40GB A100s. They may use ZeRO Stage 1, block-level FSDP, or hybrid shared data parallelism on top of that.
The reason they do not use full-model FSDP may be the high communication cost. While OpenAI has high-speed networking between most nodes, it probably does not cover all of them, and we believe at least some clusters have much lower interconnect bandwidth than others.
It is unclear how OpenAI avoids huge bubbles with such high pipeline parallelism; chances are they simply bore the cost.
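The ~30GB figure above follows from splitting the estimated parameter count across that parallelism layout (a rough check under those assumptions, ignoring KV cache and other overhead):

```python
# Rough check of the ~30GB-per-GPU figure above: the estimated 1.8T parameters
# split across 15 pipeline stages x 8-way tensor parallelism, FP16 weights only.
total_params     = 1.8e12
pipeline, tensor = 15, 8
bytes_fp16       = 2

per_gpu_params = total_params / (pipeline * tensor)        # ~15B parameters per GPU
per_gpu_bytes  = per_gpu_params * bytes_fp16
print(f"~{per_gpu_params/1e9:.0f}B params -> ~{per_gpu_bytes/1e9:.0f} GB per GPU (FP16)")
```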
05. Training cost
**OpenAI used about 2.15e25 FLOPS to train GPT-4, running on about 25,000 A100 GPUs for 90 to 100 days at a compute utilization (MFU) of roughly 32% to 36%.**
This extremely low utilization is partly due to a large number of failures that required restarting from checkpoints; the bubbles mentioned above were also very costly.
Another reason is that an all-reduce across this many GPUs is very expensive, especially if, as we suspect, the cluster is actually a collection of smaller clusters with relatively weak networking between them: for example, 800G/1.6T non-blocking connections within parts of the cluster, but only 200G/400G between those parts.
If their cost in the cloud were about $1 per A100-hour, **the cost of this training run alone would be about $63 million**. This does not include all the trial runs, failed attempts, and other costs such as data collection, RLHF, and staff. Taking those into account, the real cost is much higher. You also have to consider that someone needs a team to do chip configuration, networking, and data centers, to bear the capital expenditure (Capex), and to rent the hardware to you.
Today, the same pre-training could be done in about 55 days on about 8,192 H100s at a total cost of about $21.5 million, assuming $2 per H100 GPU-hour.
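These figures are internally consistent; a quick sanity check using only the numbers quoted above (all of them the article's estimates and assumed $/GPU-hour rates):

```python
# Sanity-checking the training numbers quoted above (figures are the article's
# estimates; $/GPU-hour rates are the assumptions stated in the text).
seconds = lambda days: days * 86400

# Total FLOPs: 25k A100s at 312 TFLOP/s peak BF16, ~34% MFU, ~95 days
flops = 25_000 * 312e12 * 0.34 * seconds(95)
print(f"total compute ~{flops:.2e} FLOPs")          # ~2.2e25, close to the ~2.15e25 above

# A100 cost at $1/GPU-hour over 90-100 days
for days in (90, 100):
    print(f"{days} days on 25k A100s: ${25_000 * 24 * days * 1 / 1e6:.0f}M")
# -> roughly $54-60M of raw GPU time, the same ballpark as the ~$63M quoted above

# H100 scenario: 8,192 GPUs, ~55 days, $2/GPU-hour
print(f"H100 run: ${8_192 * 24 * 55 * 2 / 1e6:.1f}M")   # ~$21.6M
```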
We expect at least nine companies to have more H100 GPUs than this by the end of the year. Not all of those H100s will go to model training, but these companies will certainly embrace large models and become important players. Meta expects to have over 100,000 H100s by year end, a significant portion of which will be deployed in its own data centers for inference, though its largest single cluster will still exceed 25,000 H100s. (Note: Meta's compute resources will make the evolution of LLaMA an important variable for open source and private deployment.) Many companies will train a model with GPT-4-level capability before the end of this year.
06. MoE
MoE is an effective way to reduce the number of parameters used during inference while still increasing the total parameter count, which helps encode more information per training token. That matters because acquiring enough high-quality tokens is very difficult; if OpenAI really wanted to be Chinchilla-optimal, they would have had to train on twice as many tokens.
That being said, OpenAI makes several trade-offs. For example, dealing with MoE during inference is very difficult because not every part of the model is used when generating every token. This means that some parts may be dormant while other parts are being used. This can seriously impact utilization when servicing users.
The researchers proved that using 64 to 128 experts yielded better loss results than using 16 experts, but this is just research. There are several reasons for reducing the number of experts. One of the reasons OpenAI chose 16 experts is that having more experts makes it harder to generalize and achieve convergence. Given such a large training run, OpenAI chose to be more conservative in the number of experts.
Also, using fewer experts is helpful for inference architectures. There are various complex tradeoffs when moving to a MoE inference architecture. Let's start with the basic LLM inference tradeoffs, and then explore the problems OpenAI faced and the choices they made.
07. Inference
In this section, we first want to point out that every LLM company we have spoken with thinks NVIDIA's FasterTransformer inference library is quite bad, and TensorRT is even worse. Not being able to use NVIDIA's templates and modify them means building your own solution from scratch. NVIDIA needs to fix this soon to meet the needs of LLM inference, otherwise the de facto standard will become an open tool that makes it easier to plug in third-party hardware. More and more large models are coming, and if NVIDIA cannot provide a software advantage in inference while kernels still have to be hand-written, then AMD's MI300 and other hardware will have a much bigger market.
There are three key factors in LLM inference, and they mainly relate to the number of chips used.
1. Latency
The model must respond within a reasonable delay. People don't want to wait a few seconds before starting to receive output in a chat application. Input and output token processing times can fluctuate.
2. Throughput
The model must output a certain number of tokens per second. Human usage is about 30 tokens per second, and throughput can be lower or higher for various other use cases.
3. Utilization
The hardware running the model must achieve high utilization, or the cost will be prohibitive. Higher utilization can be achieved by grouping more user requests together, at the cost of higher latency and lower per-user throughput, which makes the trade-off harder.
LLM inference is mainly to balance two main factors, memory bandwidth and computation.
In simple terms, every parameter must be read from memory, and each read comes with 2 FLOPs of compute. The ratio on most chips (for example, an H100 SXM has only 3TB/s of memory bandwidth but 2,000 TFLOP/s of FP8 compute) is therefore completely unbalanced for inference at batch size 1. If only one user is served, i.e. batch size 1, the memory bandwidth needed to stream every parameter for each generated token dominates inference time, and compute time is almost negligible.
To scale a large model to many users, the batch size must exceed 1 so that many users share the cost of reading the parameters. With a batch size of 256 or 512, for example, each byte of memory read corresponds to 512 or 1,024 FLOPs. This ratio is much closer to the H100's own ratio of FLOPS to memory bandwidth, which helps achieve higher utilization, at the price of higher latency.
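Putting numbers on that intuition, using the rounded H100 figures quoted above (our own rough calculation, assuming FP8 weights and 2 FLOPs per parameter per token):

```python
# The batch-size intuition above in numbers, using the rounded H100 figures
# from the text (~2,000 TFLOP/s FP8 vs ~3 TB/s HBM).
peak_flops = 2_000e12      # FLOP/s, FP8
hbm_bw     = 3e12          # bytes/s
flops_per_byte_hw = peak_flops / hbm_bw               # ~667 FLOPs per byte

flops_per_param_per_token = 2
bytes_per_param           = 1                         # FP8 weights
breakeven_batch = flops_per_byte_hw * bytes_per_param / flops_per_param_per_token
print(f"hardware ratio ~{flops_per_byte_hw:.0f} FLOPs/byte -> "
      f"compute-bound above a batch of ~{breakeven_batch:.0f} tokens")
for batch in (1, 256, 512):
    print(f"batch {batch}: {batch * flops_per_param_per_token} FLOPs per byte of weights read")
```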
Many people think memory capacity is the main bottleneck for LLM inference, because the model has to be fit across multiple chips, but this view is questionable. Although inference for large models does require multiple chips, and more memory per chip means fewer chips are needed to fit the model, it is actually better to use more chips than strictly necessary, so that latency can be reduced, throughput increased, and larger batch sizes used to keep pushing utilization up.
If an application requires the lowest possible latency, we need more chips and must partition the model in as many ways as is economical. Smaller batch sizes give lower latency, but smaller batch sizes also lead to worse MFU (utilization), and therefore a higher total cost per token (in chip-seconds or dollars).
If an application does offline inference and latency is not an issue, the main goal is to maximize throughput per chip (i.e. minimize total cost per token). Increasing the batch size is the most effective lever, since larger batches generally yield better MFU (utilization), and partitioning strategies that are inefficient at small batch sizes become efficient as the batch size grows.
**More chips and larger batch sizes are cheaper because they increase utilization, but they also introduce a third variable: networking time.** Splitting the model across more chips helps with latency, but at the expense of utilization.
Both the weight-loading part of storage time and the non-attention compute time are proportional to model size and inversely proportional to the number of chips. However, for a given partition layout, the chip-to-chip communication time decreases more slowly (or not at all) as chips are added, so as the chip count grows, communication becomes an increasingly important bottleneck.
We also note that the memory required for the KV cache explodes as batch size and sequence length grow.
If an application needs to generate text with long attention contexts, inference time increases dramatically. For a 500B+ model with multi-head attention, the attention KV cache can become very large: at a batch size of 512 and a context length of 2048, the KV cache totals about 3TB, three times the size of the model's parameters. The KV cache must be loaded from off-chip memory into on-chip memory every time a token is generated, and during that time the chip's compute cores are essentially idle.
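The standard KV-cache size formula shows where numbers of this magnitude come from; the layer count and hidden size below are illustrative placeholders (not the example model's real dimensions), chosen only to land near the ~3TB figure:

```python
# Standard multi-head-attention KV-cache size: keys and values stored for every
# layer, batch element, and sequence position. n_layers and d_model below are
# hypothetical, picked so the result matches the ~3TB order of magnitude above.
def kv_cache_bytes(batch, seq_len, n_layers, d_model, bytes_per_elem=2):
    # 2x for keys and values, FP16 elements by default
    return 2 * n_layers * d_model * batch * seq_len * bytes_per_elem

size = kv_cache_bytes(batch=512, seq_len=2048, n_layers=90, d_model=8192)
print(f"KV cache ~{size/1e12:.1f} TB")    # ~3.1 TB at these assumed dimensions
```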
Long sequence lengths are particularly troublesome for memory bandwidth and memory capacity. The reason why OpenAI's GPT-3.5 turbo with 16k contexts and GPT-4 with 32k contexts are expensive is that they cannot take larger batches due to memory constraints.
Smaller batches result in lower hardware utilization. Also, the KV cache bloats as the sequence length increases. The KV cache cannot be shared between users, so separate memory reads are required, further reducing memory bandwidth. See below for more information on MQA.
08. Inference Infra and Cost
Infra
The architecture of MoE makes the inference of GPT-4 face challenges in terms of latency, throughput and utilization. Because the forward pass of each token can be routed to different expert models, it is very difficult to achieve low latency, high throughput and high utilization in this case, especially at high batch-size.
OpenAI's GPT-4 architecture contains 16 expert models, and each forward pass routes to 2 of them. That means that with a batch size of 8, the parameter reads for each expert may amount to a batch size of only 1. Worse, one expert might see a batch size of 8 while others see 4, 1, or 0.
In addition, the routing algorithm routes the forward pass in different directions each time a token is generated, which results in significant variations in token-to-token latency and expert batch size. That is, when processing different tokens, different experts may be assigned to different tasks, and both the computational load and the batch size may vary accordingly.
Inference infra is one of the main considerations for OpenAI to choose a small number of experts in the design of MoE. If they use more experts, memory bandwidth becomes a bigger bottleneck for inference. OpenAI often achieves batch-sizes above 4k on its own inference clusters, which means that even with optimal load balancing among experts, each expert can only reach a batch size of about 500. This requires very large usage to achieve.
Our understanding is that OpenAI runs inference on clusters of 128 GPUs and has multiple such clusters across data centers and geographies. Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. With 8 GPUs per node, each node holds only about 130B parameters, i.e. less than 30GB per GPU at FP16 and less than 15GB at FP8/int8. This allows inference to run on 40GB A100s as long as the KV cache across all batches does not balloon too much.
FP16, FP8, and int8 are different numerical precisions commonly used in deep learning computation to reduce memory and compute usage and thereby make training and inference more efficient.
FP16, FP8, and int8 refer to 16-bit floating point, 8-bit floating point, and 8-bit integer formats respectively. Their precision is lower than 32-bit single-precision floating point (FP32), but they greatly reduce memory and compute requirements and so accelerate training and inference. For example, FP16 can more than halve computation time without losing much accuracy, and int8 can cut computation time by roughly 4x without losing much accuracy.
Note that low-precision computation can affect model accuracy, so a trade-off between accuracy and efficiency is needed, and the most appropriate precision should be chosen for the task at hand.
To keep network communication from becoming too irregular, and to avoid the prohibitive cost of recomputing the KV cache on every token generation, the layers containing the various experts are not split across different nodes, so that the KV cache can be shared.
**The biggest difficulty for all future MoE scale-ups and conditional routing is how to handle routing around the KV cache, given the limit of 120 layers.**
In the MoE model, the number of routed layers per branch cannot exceed 120, otherwise the KV cache cannot be handled effectively; during inference each branch needs to compute its own KV cache, which drives up the computational cost.
A simple approach, given the 120-layer limit, is to spread the routed layers across 15 different nodes, which distributes the compute evenly and improves efficiency and performance. However, because the first node has to do data loading and embedding, it matters how few layers are placed on the head node of the inference cluster.
In addition, there is some chatter about speculative decoding, which we discuss further below; a more important question is whether that chatter should be believed, but it would also explain why it makes sense to put fewer layers on the head node.
Inference Cost
Compared with the 175B-parameter Davinci model, GPT-4 has 1.6 times the feed-forward parameters but costs 3 times as much. This is mainly because GPT-4 requires a larger cluster and achieves lower utilization.
We estimate that inference for GPT-4 with 8k context length (seqlen) on 128 A100s costs about $0.0049 per 1k tokens, while on 128 H100s it costs about $0.0021 per 1k tokens. (Note: GPT-4-8k currently charges $0.03 per 1k input tokens and $0.06 per 1k output tokens; OpenAI's actual use of inference chips will not be as lavish as the author assumes, so this calculation can serve as a lower bound for future price cuts.) It is important to note that **these costs assume high utilization and large batch sizes.**
We hypothesize that OpenAI shuts down the cluster during downturns and repurposes those nodes for other tasks, such as resuming checkpoint training of small test models, or experimenting with various new techniques. Doing so helps keep inference costs low, otherwise OpenAI's utilization could be even lower, implying more than 2x the cost estimate.
09. Multi-query attention mechanism
The use of multi-query attention (MQA) is quite common, but we want to emphasize that OpenAI uses it too. In short, only one KV head is needed, which dramatically shrinks the memory the KV cache requires. Even so, GPT-4 with 32k context certainly cannot run on 40GB A100s, and the 8k version is capped in its maximum batch size. Without MQA, the 8k version's maximum batch size would be severely limited and the economics far worse.
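For intuition, here is a minimal NumPy sketch of MQA, with toy dimensions of our own (not GPT-4's): all query heads share a single key/value head, so the per-token KV cache shrinks by roughly the number of heads.

```python
# Minimal NumPy sketch of multi-query attention (MQA). Causal masking is
# omitted for brevity; all dimensions are hypothetical toy values.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq, wk, wv, n_heads):
    """x: (seq, d_model); wq: (d_model, n_heads*d_head); wk, wv: (d_model, d_head)."""
    seq, _ = x.shape
    d_head = wk.shape[1]
    q = (x @ wq).reshape(seq, n_heads, d_head)   # one query projection per head
    k = x @ wk                                   # single shared key head
    v = x @ wv                                   # single shared value head
    out = np.empty((seq, n_heads, d_head))
    for h in range(n_heads):
        scores = (q[:, h, :] @ k.T) / np.sqrt(d_head)
        out[:, h, :] = softmax(scores) @ v
    return out.reshape(seq, n_heads * d_head)

x  = np.random.randn(4, 64)                          # (seq=4, d_model=64) toy input
wq, wk, wv = np.random.randn(64, 8 * 16), np.random.randn(64, 16), np.random.randn(64, 16)
y  = multi_query_attention(x, wq, wk, wv, n_heads=8)  # (4, 128)

# KV cache per token: standard MHA stores n_heads K/V pairs per layer, MQA just one.
n_heads, d_head, n_layers = 96, 128, 120              # hypothetical
mha_kv = 2 * n_layers * n_heads * d_head * 2          # K+V, FP16 bytes
mqa_kv = 2 * n_layers * 1 * d_head * 2
print(f"KV cache per token: MHA {mha_kv/1e6:.1f} MB vs MQA {mqa_kv/1e3:.1f} KB")
```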
10. Continuous Batching
To allow some bound on worst-case latency while optimizing inference cost, OpenAI uses both variable batch sizes and continuous batching. This improves compute utilization without sacrificing model quality, achieving lower latency and higher throughput during inference. If the concept of continuous batching is unfamiliar, AnyScale's article How continuous batching enables 23x throughput in LLM inference while reducing p50 latency is worth reading. (Shixiang note: Ray, the distributed computing framework developed by Anyscale, is used in OpenAI's model infra pipeline; Shixiang has previously published research on the company.)
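The core idea is simple enough to sketch: instead of waiting for a whole static batch to finish, the scheduler evicts finished sequences and admits waiting requests after every decoding step. A simplified illustration follows (function and field names are our own, not OpenAI's or AnyScale's):

```python
# Simplified sketch of continuous (iteration-level) batching.
from collections import deque

def continuous_batching(requests, step_fn, max_batch=8):
    """requests: iterable of prompts; step_fn(seqs) -> [(next_token, done), ...] per seq."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit waiting requests into any free batch slots; unlike static
        # batching, there is no need to wait for the whole batch to finish.
        while waiting and len(running) < max_batch:
            running.append({"prompt": waiting.popleft(), "tokens": [], "done": False})
        # One decode step for every sequence currently in the batch.
        for seq, (token, done) in zip(running, step_fn(running)):
            seq["tokens"].append(token)
            seq["done"] = done
        # Evict finished sequences immediately so their slots free up next step.
        finished.extend(s for s in running if s["done"])
        running = [s for s in running if not s["done"]]
    return finished
```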
11. Speculative decoding
There are rumors that OpenAI uses speculative decoding in GPT-4 inference. While we cannot confirm this, the general variation in latency and in token-to-token variance between simple retrieval tasks and more complex tasks does seem to suggest it is possible. However, there are too many variables to confirm whether the technique is actually used.
Using LLMs is generally divided into two phases:
1. Prefill stage
In this phase, a prompt is given as input and run through the model to generate the KV cache and the first output logits, where the logits are the probability-distribution vector the LLM outputs at each step, representing how likely each token is. This prefill phase is usually fast because it can be computed in parallel.
2. Decoding stage
In this stage, a token is selected from the output logits and fed back into the model to produce the logits for the next token, and this repeats until the desired number of tokens has been generated. Because decoding must happen sequentially, one token at a time, the arithmetic intensity of this stage (i.e. FLOPs computed per byte of memory bandwidth) is very low when running at small batch sizes, leaving compute underutilized. Decoding is therefore usually the most expensive part of autoregressive generation.
This is why it is much cheaper to input tokens than output tokens in OpenAI's API calls.
The core idea of speculative decoding is to use a smaller, faster draft model to decode several tokens ahead of time and feed them into the oracle model as a batch. If the draft model's predictions are correct (i.e. agree with the oracle model's predictions), one batch can be used to decode several tokens, saving a lot of memory bandwidth and time per token.
However, if the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and the algorithm reverts to standard token-by-token decoding. Speculative decoding can also be combined with a rejection sampling scheme to sample tokens from the original distribution. Note that this approach only works in small batch settings where bandwidth is the bottleneck.
In short, speculative decoding trades compute for bandwidth, and it is an attractive optimization target for two key reasons. First, it does not degrade model quality at all, because it only changes the computation of the decoding stage to speed up inference and raise throughput. Second, its benefits are generally independent of other methods, because its advantage comes from turning sequential computation into parallel execution, whereas other methods optimize the model structure, parameters, or training.
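A minimal sketch of the scheme described above, with stand-in model objects (`draft_model`, `oracle_model` and their methods are hypothetical; greedy acceptance is shown, whereas matching a sampled distribution would require the rejection-sampling variant mentioned earlier):

```python
# Minimal sketch of speculative decoding: a small draft model proposes k tokens,
# the large "oracle" model verifies them in one batched forward pass, and tokens
# are accepted until the first disagreement. Model objects are stand-ins.

def speculative_decode(prompt, draft_model, oracle_model, k=4, max_new_tokens=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model cheaply proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(tokens + draft))
        # 2. Oracle model checks all positions in a single forward pass;
        #    oracle_preds has k+1 entries: the oracle's next token after each
        #    prefix tokens + draft[:i] for i = 0..k.
        oracle_preds = oracle_model.next_tokens_batched(tokens, draft)
        # 3. Accept draft tokens until the first mismatch, then take the
        #    oracle's own token at that position (this also guarantees progress).
        n_accepted = 0
        for d, o in zip(draft, oracle_preds):
            if d != o:
                break
            n_accepted += 1
        tokens.extend(draft[:n_accepted])
        tokens.append(oracle_preds[n_accepted])
    return tokens
```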
Current inference methods predict a single draft sequence per batch. However, **this approach does not scale well to large batch sizes or low-precision draft models.** Intuitively, the probability that the two models agree on a long contiguous sequence of tokens decreases exponentially, which means that as the arithmetic intensity is scaled up, the returns from speculative decoding fall off quickly.
We think that if OpenAI is using speculative decoding, they are likely only using it for short sequences of about 4 tokens in length. In addition, some people think that the decline in the performance of the GPT-4 model is because OpenAI added low-probability sequences from the speculative decoding model to the model pre-training, which may not be true.
Some also believe that Bard uses speculative decoding, because Google waits for the full sequence to be generated before sending it to the user, but we do not believe this guess is true.
12. Vision Multimodal
Vision Multi-Modal: It refers to the joint processing and analysis of information from different modalities (such as image, text, voice, etc.). Usually, the information of these different modalities is semantically related, so combining them can provide richer information and more accurate inference results.
GPT-4's visual multimodal capability is implemented through a vision encoder separate from the text encoder, with cross-attention between the two; the architecture is said to resemble Flamingo. The vision encoder adds parameters on top of the 1.8-trillion-parameter GPT-4 and is fine-tuned with roughly 2 trillion additional tokens after the text-only pre-training, rather than being pre-trained on vision data from the start.
OpenAI originally wanted to train the vision model from scratch, but the approach was not yet mature, so they chose to start from text to reduce risk.
**Rumor has it that OpenAI's GPT-5 will train its vision model from scratch, and will be able to generate images itself and also process audio.**
A major goal of visual multimodality is to enable autonomous agents to read web pages and transcribe their image and video content. The data OpenAI trained this model on includes joint data (rendered LaTeX/text), web page screenshots, and sampled frames from YouTube videos, with Whisper used for transcription.
An interesting point about all this LLM over-optimization is that the IO cost of a vision model differs from that of a text-only model. In a text model, IO is very cheap, but in a vision model the IO cost of data loading is about 150 times higher: each token is around 600 bytes, versus 4 bytes for text. A lot of work is going into image compression research. (Shixiang note: text is easier to compress; image/video tokenization is a direction worth watching in the multimodal field.)
This matters a great deal for hardware vendors who are optimizing their hardware for the workloads of 2-3 years from now: they may find themselves in a world where every model has strong vision and audio capabilities and their architecture is a poor fit. In general, future LLM architectures will certainly evolve beyond the simplified text-based dense and/or MoE models we see today.