Potential Track Preview: Decentralized Computing Power Market (Part I)

By Zeke, YBB Capital

Introduction

Since the birth of GPT-3, generative AI has reached an explosive inflection point in artificial intelligence thanks to its astonishing performance and broad range of applications, and tech giants have piled into the AI race. However, training and inference for large language models (LLMs) require enormous computing power, and with each model iteration the demand for compute and its cost grow exponentially. Take GPT-2 and GPT-3 as an example: GPT-3 has 1,166 times as many parameters as GPT-2 (150 million for GPT-2 versus 175 billion for GPT-3), and a single GPT-3 training run cost up to $12 million based on public GPU cloud pricing at the time, about 200 times the cost of GPT-2. In actual use, every user query requires inference; based on roughly 13 million unique users at the beginning of this year, the corresponding chip demand was more than 30,000 A100 GPUs. The initial investment would then be a staggering $800 million, with an estimated $700,000 per day spent on model inference.

Insufficient computing power and high costs have become a problem for the entire AI industry, and the same problem seems to plague the blockchain industry as well. On the one hand, the fourth Bitcoin halving and the approval of ETFs are approaching, and as prices climb in the future, miners' demand for computing hardware will inevitably rise significantly. On the other hand, zero-knowledge proof (ZKP) technology is booming, and Vitalik has repeatedly emphasized that ZK's impact on the blockchain space over the next decade will be as important as blockchain itself. Although the blockchain industry has high hopes for this technology, ZK, like AI, consumes a great deal of computing power and time when generating proofs because of the complexity of the computation involved.

In the foreseeable future, a shortage of computing power will become inevitable, so will the decentralized computing power market be a good business?

Definition of Decentralized Computing Market

The decentralized computing power market is essentially equivalent to the decentralized cloud computing track, but I think the former term better describes the new projects discussed later. The decentralized computing power market is a subset of DePIN (decentralized physical infrastructure networks); its goal is to create an open marketplace where, through token incentives, anyone with idle computing resources can offer them, mainly serving B-end users and the developer community. Well-known projects in this track include Render Network, a decentralized GPU-based rendering network, and Akash Network, a distributed peer-to-peer marketplace for cloud computing.

The following starts with the basic concepts and then discusses three emerging markets in this track: the AGI computing power market, the Bitcoin computing power market, and the ZK hardware acceleration computing power market. This article focuses on the AGI computing power market; the latter two will be discussed in "Potential Track Preview: Decentralized Computing Power Market (Part II)".

Overview of Computing Power

The concept of computing power dates back to the invention of the computer: the earliest computers were mechanical devices that performed computing tasks, and computing power referred to a machine's ability to carry out calculations. As computer technology developed, the concept evolved, and today computing power usually refers to the ability of computer hardware (CPU, GPU, FPGA, etc.) and software (operating system, compiler, application, etc.) to work together.

Definition

Computing power refers to the amount of data a computer or other computing device can process, or the number of computing tasks it can complete, in a given period of time. It is commonly used to describe a device's performance and is an important measure of its processing capability.

Metrics

Computing power can be measured in various ways, such as computing speed, computing energy consumption, computing accuracy, and parallelism. In the computer field, commonly used computing power metrics include FLOPS (floating point operations per second), IPS (instructions per second), TPS (transactions per second), etc.

FLOPS (Floating-Point Operations Per Second) refers to a computer's ability to handle floating-point operations (mathematical operations on numbers with decimal points, which must account for issues such as precision and rounding error), measuring how many floating-point operations the computer can complete per second. FLOPS gauges high-performance computing capability and is commonly used to rate supercomputers, high-performance computing servers, and graphics processing units (GPUs). For example, a system rated at 1 TFLOPS can complete one trillion floating-point operations per second.
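As a rough illustration (a minimal sketch with hypothetical numbers, not a benchmark), the floating-point operations in a dense matrix multiplication can be counted directly, which makes it easy to relate a workload to a device's FLOPS rating:

```python
# Rough sketch: relate a matrix-multiplication workload to a FLOPS rating.
# The matrix sizes and the 1 TFLOPS figure are illustrative assumptions, not measurements.

def matmul_flops(m: int, k: int, n: int) -> float:
    """An (m x k) @ (k x n) product needs about 2*m*k*n floating-point ops
    (one multiply and one add per accumulated term)."""
    return 2.0 * m * k * n

ops = matmul_flops(4096, 4096, 4096)           # ~1.37e11 operations
device_flops = 1e12                            # hypothetical 1 TFLOPS device
print(f"Ideal time at 1 TFLOPS: {ops / device_flops:.3f} s")   # ~0.137 s
```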

IPS (Instructions Per Second) refers to the speed at which a computer processes instructions, measuring how many instructions it can execute per second. IPS gauges a computer's single-instruction performance and is often used to measure the performance of a central processing unit (CPU). For example, a CPU running at 3 GHz can execute roughly 3 billion instructions per second (assuming one instruction per clock cycle).

TPS (Transactions Per Second) refers to the ability of a computer to process transactions, and it measures how many transactions a computer can complete per second. It is often used to measure the performance of a database server. For example, a database server with a TPS of 1000 means that it can process 1000 database transactions per second.

In addition, there are some computing power indicators for specific application scenarios, such as inference speed, image processing speed, and speech recognition accuracy.

Types of Computing Power

GPU computing power refers to the computing power of a graphics processing unit. Unlike the CPU (central processing unit), the GPU is hardware designed specifically to process graphics data such as images and video; it has a large number of processing units and efficient parallel computing capability, allowing it to perform a huge number of floating-point operations simultaneously. Since GPUs were originally built for game graphics, they typically offer far greater memory bandwidth than CPUs (though usually lower clock frequencies) to support complex graphics operations.

Difference Between CPU and GPU

Architecture: The computing architecture of CPUs and GPUs is different. CPUs typically have one or more cores, each of which is a general-purpose processor capable of performing a variety of different operations. GPUs, on the other hand, have a large number of Stream Processors and Shaders, which are dedicated to performing operations related to image processing.

Parallel computing: GPUs typically have far greater parallel computing capability. A CPU has a limited number of cores, each executing essentially one instruction stream at a time, whereas a GPU can have thousands of stream processors executing many instructions and operations simultaneously. As a result, GPUs are generally better suited than CPUs to parallel workloads such as machine learning and deep learning, which require massive amounts of parallel computation.

Programming: GPU programming is more complex than CPU programming; it requires specific programming languages (such as CUDA or OpenCL) and specific techniques to exploit the GPU's parallel computing power. In contrast, CPUs are simpler to program and can use common languages and tools.
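As a minimal, hedged sketch of the difference in programming models: the same matrix multiplication can be written with NumPy on the CPU, and a GPU array library such as CuPy (one of several options, assuming an NVIDIA GPU with CUDA is available) exposes a nearly identical interface while dispatching the work to thousands of GPU cores:

```python
import numpy as np

# CPU path: a plain NumPy matrix multiplication.
a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)
c_cpu = a @ b

# GPU path (illustrative; requires an NVIDIA GPU with CUDA and the cupy package):
# import cupy as cp
# a_gpu = cp.asarray(a)          # copy data to GPU memory
# b_gpu = cp.asarray(b)
# c_gpu = a_gpu @ b_gpu          # the same expression, executed on the GPU
# c_back = cp.asnumpy(c_gpu)     # copy the result back to host memory
```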

The importance of computing power

In the era of the Industrial Revolution, oil was the lifeblood of the world, permeating every industry. In blockchain, computing power plays that role, and in the coming AI era it will be the world's "digital oil". From the frantic rush of major companies for AI chips and NVIDIA's market capitalization surpassing one trillion dollars, to the United States' recent restrictions on exporting high-end chips to China, covering compute capability and chip die size and even proposals to restrict GPU cloud access, its importance is self-evident: computing power will be a core commodity of the next era.


Overview of Artificial General Intelligence

Artificial Intelligence (AI) is a discipline that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. It originated in the 1950s and 1960s, and over more than half a century of evolution it has gone through three intertwined waves of development: symbolism, connectionism, and behaviorism. A step beyond generative AI is Artificial General Intelligence (AGI), an AI system with broad understanding that can perform at a level similar to or beyond humans across a wide variety of tasks and domains. AGI essentially rests on three pillars: deep learning (DL), big data, and massive computing power.

Deep learning

Deep learning is a subfield of machine learning (ML), and deep learning algorithms are neural networks modeled after the human brain. The human brain contains billions of interconnected neurons that work together to learn and process information. Similarly, deep learning neural networks (artificial neural networks) are made up of multiple layers of artificial neurons working together inside a computer. Artificial neurons are software modules, called nodes, that use mathematical calculations to process data. Artificial neural networks are deep learning algorithms that use these nodes to solve complex problems.


A neural network can be divided into an input layer, hidden layers, and an output layer, with parameters connecting the different layers.

Input Layer: The input layer is the first layer of the neural network and is responsible for receiving external input data. Each neuron of the input layer corresponds to a feature of the input data. For example, when processing image data, each neuron may correspond to one pixel value of the image;

Hidden Layers: The input layer processes the data and passes it to layers deeper in the neural network. These hidden layers process information at different levels, adjusting their behavior as new information is received. Deep learning networks can have hundreds of hidden layers, which allows a problem to be analyzed from many different angles. For example, if you were given an image of an unknown animal to classify, you would compare it with animals you already know: the shape of the ears, the number of legs, and the size of the pupils all help determine what kind of animal it is. Hidden layers in deep neural networks work the same way: if a deep learning algorithm tries to classify an animal image, each hidden layer processes a different feature of the animal and tries to classify it accurately;

Output Layer: The output layer is the last layer of the neural network and is responsible for generating the output of the network. Each neuron in the output layer represents a possible output class or value. For example, in a classification problem, each output layer neuron may correspond to a category, while in a regression problem, the output layer may have only one neuron whose value represents the predicted outcome;

Parameters: In a neural network, the connections between layers are represented by weight and bias parameters, which are optimized during training so that the network can accurately identify patterns in the data and make predictions. Increasing the number of parameters increases the model capacity of a neural network, i.e., its ability to learn and represent complex patterns in the data; however, more parameters also mean greater demand for computing power.
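To make the relationship between layers and parameters concrete, here is a minimal sketch (the layer sizes are arbitrary assumptions) that counts the weights and biases of a small fully connected network; every additional or wider hidden layer adds parameters, and therefore compute:

```python
# Count parameters of a small fully connected network (illustrative sizes).
# A layer mapping n_in inputs to n_out outputs has n_in*n_out weights + n_out biases.

layer_sizes = [784, 512, 256, 10]   # input layer, two hidden layers, output layer

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out
    biases = n_out
    total_params += weights + biases
    print(f"{n_in:>4} -> {n_out:<4}: {weights + biases:,} parameters")

print(f"Total: {total_params:,} parameters")   # 535,818 for these sizes
```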

Big Data

To train effectively, neural networks typically require large amounts of data that are diverse, high-quality, and drawn from multiple sources. Such data is the foundation for training and validating machine learning models: by analyzing big data, a model can learn the patterns and relationships in the data in order to make predictions or classifications.

Massive computing power

Many factors drive the demand for massive computing power: the multi-layer structure of neural networks, the large number of parameters, the need to process big data, the iterative training method (during training, the model iterates repeatedly, computing a forward and backward pass for every layer, including the activation functions, the loss function, the gradients, and the weight updates), the need for high-precision computation, parallel computing capability, optimization and regularization techniques, and the model evaluation and validation process. As AGI advances, the requirement for large-scale computing power grows by roughly 10x every year. The latest model so far, GPT-4, contains some 1.8 trillion parameters, costs more than $60 million for a single training run, and requires about 2.15e25 FLOPs of compute (roughly 21.5 trillion trillion floating-point operations). Demand for computing power keeps expanding for the next generation of models, and new models keep arriving.
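A common back-of-the-envelope rule (the "≈ 6 × parameters × tokens" approximation for transformer training; the token count below is an illustrative assumption, not a disclosed figure) shows how a number on the order of 2.15e25 FLOPs arises:

```python
# Back-of-the-envelope training compute: C ≈ 6 * N * D
# N = parameter count, D = training tokens. Both are rough public estimates /
# assumptions for illustration, not official figures.

N = 1.8e12          # ~1.8 trillion parameters (reported estimate)
D = 2.0e12          # assumed ~2 trillion training tokens (illustrative)

total_flops = 6 * N * D
print(f"Estimated training compute: {total_flops:.2e} FLOPs")   # ~2.16e25

# At a sustained 1e15 FLOPS (1 PFLOPS) of effective throughput, this would take
# roughly total_flops / 1e15 seconds:
print(f"~{total_flops / 1e15 / 86400 / 365:.0f} years at 1 PFLOPS sustained")
```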

AI Computing Economics

Future market size

According to the "2022-2023 Global Computing Power Index Evaluation Report", jointly compiled by IDC (International Data Corporation), Inspur Information, and Tsinghua University's Global Industry Research Institute, the global AI computing market will grow from $19.50 billion in 2022 to $34.66 billion in 2026, with the generative AI computing market growing from $820 million in 2022 to $10.99 billion in 2026. Generative AI computing will grow from 4.2% to 31.7% of the overall AI computing market.


Computing power economic monopoly

AI GPU production is effectively monopolized by NVIDIA, and the chips are extremely expensive (the latest H100 sells for around $40,000 apiece). GPUs are snapped up by Silicon Valley giants as soon as they are released; part of this hardware is used to train their own new models, and the rest is leased to AI developers through cloud platforms such as those of Google, Amazon, and Microsoft, which control vast computing resources in the form of servers, GPUs, and TPUs. Computing power has become a new resource monopolized by the giants: a large number of AI developers cannot even buy a dedicated GPU without paying a markup, and to use the latest hardware they are forced to rent AWS or Microsoft cloud servers. Judging from their financial reports, this is an extremely profitable business: AWS's cloud services run a gross margin of 61%, and Microsoft's is even higher at 72%.


So do we have to accept this centralized authority and control, and pay margins of up to 72% for computing resources? Will the giants that monopolized Web2 also dominate the next era?

The problem of decentralized AGI computing power

When it comes to breaking a monopoly, decentralization is usually held up as the answer. Looking at existing projects, can we reach the large-scale computing power AI requires through the storage projects in DePIN plus idle-GPU protocols such as RNDR? The answer is no. The road to slaying this dragon is not so simple: the early projects were not designed specifically for AGI computing power and are not feasible for it, and bringing computing power on-chain faces at least the following five challenges:

  1. Verification of work: To build a truly trustless computing network and provide financial incentives to participants, the network must have a way to verify that the deep learning computation was actually performed. At the heart of this problem is the state dependence of deep learning models: in a deep learning model, the input of each layer depends on the output of the previous layer, so you cannot validate a single layer in isolation without considering all the layers before it. The calculation of each layer is based on the results of all preceding layers. Therefore, to verify the work done up to a particular point (e.g., a particular layer), all the work from the beginning of the model up to that point must be performed (a minimal sketch after this list illustrates this dependence);

  2. Market: As an emerging market, the AI computing power market faces supply-and-demand dilemmas such as the cold-start problem; supply and demand liquidity must be roughly matched from the outset for the market to grow successfully. To capture the potential supply of computing power, participants must be offered clear rewards in exchange for their resources, and the marketplace needs a mechanism to track completed computational work and pay providers promptly. In traditional markets, intermediaries handle tasks such as management and onboarding while lowering operating costs by setting minimum payouts. However, this approach is costly when scaling the market: only a small fraction of the supply can be captured economically, which leads to a threshold equilibrium in which the market can only capture and maintain a limited supply and cannot grow further;

  3. Halting problem: The halting problem is a fundamental problem in the theory of computation: deciding whether a given computational task will finish in finite time or run forever. The problem is undecidable, meaning there is no universal algorithm that can predict, for all computational tasks, whether they will halt in finite time. Smart contract execution on Ethereum, for example, faces a similar halting problem: it is impossible to determine in advance how much computation a contract will require, or whether it will finish in a reasonable time;

(In the context of deep learning, this problem becomes even more complex as models and frameworks switch from static graph construction to dynamic construction and execution.)

  4. Privacy: Designing and developing with privacy in mind is a must for project teams. While a large amount of machine learning research can be done on public datasets, models often need to be fine-tuned on proprietary user data to improve their performance and adapt them to specific applications. This fine-tuning may involve processing personal data and therefore has to account for privacy-protection requirements;

  5. Parallelization: This is a key factor in the feasibility of current projects. Deep learning models are usually trained in parallel on large hardware clusters with proprietary architectures and extremely low latency, whereas GPUs in a distributed computing network introduce latency through frequent data exchange and are bottlenecked by the slowest GPU. When the sources of computing power are untrusted and unreliable, how to achieve heterogeneous parallelization is a problem that must be solved; the currently feasible approach is to parallelize via transformer-based models, such as Switch Transformers, which are highly parallelizable by design.
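To illustrate the state dependence behind the first challenge above, here is a minimal sketch (toy layers, not any project's protocol): verifying the output of layer k requires recomputing every layer before it, so a verifier cannot check an intermediate layer in isolation unless the solver also publishes trusted intermediate checkpoints.

```python
import numpy as np

def forward(x, layers):
    """Run a toy stack of layers; each layer's input is the previous layer's output."""
    activations = [x]
    for w in layers:
        x = np.tanh(x @ w)
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(5)]   # toy 5-layer model
x0 = rng.normal(size=(1, 8))

claimed = forward(x0, layers)          # what the solver claims to have computed

# To verify only layer 3's output, the verifier still has to recompute layers 0..3,
# because layer 3's input is layer 2's output, and so on back to the start...
recomputed = forward(x0, layers[:4])[-1]
print(np.allclose(recomputed, claimed[4]))   # True

# ...unless a trusted checkpoint of layer 2's output is available, in which case
# verification can start from the checkpoint instead of from the beginning.
checkpoint = claimed[3]                      # assume this checkpoint is trusted
print(np.allclose(np.tanh(checkpoint @ layers[3]), claimed[4]))   # True
```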

Solutions: Although decentralized AGI computing power markets are still at an early stage, two projects have made preliminary progress on the consensus design of decentralized networks and on implementing decentralized computing networks for model training and inference. The following uses Gensyn and Together as examples to analyze the design approaches and open problems of the decentralized AGI computing power market.

Gensyn


Gensyn is an AGI computing power marketplace still under construction, aiming to solve the various challenges of decentralized deep learning computation and to reduce the current cost of deep learning. Gensyn is essentially a Layer 1 proof-of-stake protocol based on the Polkadot network that uses smart contracts to directly reward solvers for contributing their idle GPU devices to perform machine learning tasks.

Returning to the question above: the core of building a truly trustless computing network is verifying machine learning work that has already been done. This is a highly complex problem that requires finding a balance at the intersection of complexity theory, game theory, cryptography, and optimization.

Gensyn starts from a simple approach: the solver submits the results of the machine learning task it has completed, and to verify that those results are accurate, another independent verifier tries to redo the same work. This can be called single replication, since only one verifier re-executes the work, meaning only one extra unit of effort is spent verifying the original work. However, if the person verifying the work is not the requester of the original job, the trust problem remains: verifiers themselves may be dishonest, and their work needs to be verified in turn. That creates a potential problem: another verifier is needed to check the first verifier's work, but that new verifier may also be untrusted, so yet another verifier is needed, and so on forever, forming an infinite replication chain. Gensyn introduces three key concepts and interweaves them into a four-role participant system to break this infinite chain.

Probabilistic proof-of-learning: uses the metadata of a gradient-based optimization process to construct a certificate that the work was done. By replicating certain stages, these certificates can be quickly validated to ensure that the work was completed as claimed.

Graph-based pinpoint protocol: uses a multi-granularity, graph-based pinpointing protocol together with consistent execution across evaluators, allowing verification work to be re-run and compared for consistency, with the result ultimately confirmed by the chain itself.

Truebit-style incentive games: use staking and slashing to construct an incentive game that ensures every economically rational participant acts honestly and performs their assigned tasks.
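As a hedged sketch of the idea behind probabilistic proof-of-learning (a simplification, not Gensyn's actual implementation): the solver publishes periodic weight checkpoints, the verifier re-runs a small slice of training from one checkpoint, and accepts the proof if the resulting weights land within a distance threshold of the next published checkpoint.

```python
import numpy as np

def train_steps(w, data, lr=0.1, steps=10, seed=0):
    """Toy 'training': a few noisy gradient steps on a quadratic loss."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grad = 2 * (w - data) + rng.normal(scale=0.01, size=w.shape)
        w = w - lr * grad
    return w

target = np.ones(4)                       # stands in for the training data
w0 = np.zeros(4)                          # published checkpoint k
w1_claimed = train_steps(w0, target)      # published checkpoint k+1 (solver's claim)

# The verifier re-runs the same slice with its own random seed and checks the
# distance against a threshold established during the profiling phase.
w1_recheck = train_steps(w0, target, seed=123)
distance = np.linalg.norm(w1_claimed - w1_recheck)
threshold = 0.05                          # illustrative threshold from profiling
print("proof accepted" if distance <= threshold else "proof challenged")
```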

The participant system consists of submitters, solvers, verifiers, and whistleblowers.

Submitters:

The submitter is the end user of the system, provides the tasks that will be calculated, and pays for the units of work completed;

Solvers:

The solver is the primary worker of the system, performing model training and generating proofs that are checked by verifiers;

Verifiers:

The verifier is the key to linking the non-deterministic training process to the deterministic linear computation, replicating a portion of the solver's proof and comparing the distance to the expected threshold;

Whistleblowers:

Whistleblowers are the last line of defense, checking the work of verifiers and raising challenges in the hope of a lucrative jackpot payout.

How the system works

The protocol is designed to operate as a game-theoretic system of eight phases covering the four main participant roles, taking a task from submission all the way through to final settlement.

  1. Task Submission: A task consists of three specific pieces of information:
  • Metadata describing the task and its hyperparameters;
  • A model binary (or basic architecture);
  • Publicly accessible, pre-processed training data.
  To submit a task, the submitter specifies the task details in a machine-readable format and submits them to the chain, along with the model binary (or a machine-readable architecture) and the publicly accessible location of the pre-processed training data. The public data can be stored in a simple object store such as AWS S3, or in decentralized storage such as IPFS, Arweave, or Subspace.

  2. Profiling: The profiling process establishes a baseline distance threshold for verifying proofs of learning. Verifiers periodically pick up profiling tasks and generate variation thresholds for comparing proofs of learning. To generate a threshold, a verifier deterministically runs and re-runs a portion of the training with different random seeds, generating and checking its own proofs. In doing so, the verifier establishes an overall expected distance threshold that can later be used to verify the solvers' non-deterministic work.

  3. Training: After profiling, the task enters a public task pool (similar to Ethereum's mempool). A solver is selected to execute the task, and the task is removed from the pool. The solver performs the task according to the metadata submitted by the submitter and the model and training data provided. While training, the solver also generates a proof of learning by periodically checkpointing and storing metadata from the training process (including parameters), so that a verifier can later replicate the optimization steps as accurately as possible.

  4. Proof generation: The solver periodically stores the model weights or updates, along with the corresponding indices into the training dataset identifying the samples used to generate each weight update. The checkpoint frequency can be adjusted to provide stronger guarantees or to save storage space. Proofs can be "stacked": a proof can start from the random distribution of weights used for initialization, or from pre-trained weights that were generated with their own proofs. This enables the protocol to build up a set of proven, pre-trained base models that can then be fine-tuned for more specific tasks.

  5. Verification of proof: Once the task is complete, the solver registers its completion with the chain and publishes its proof of learning in a publicly accessible location for verifiers to access. A verifier pulls the verification task from the public task pool and performs the computational work of re-running a portion of the proof and computing distances. The chain then uses the resulting distances, together with the thresholds calculated during the profiling phase, to determine whether the verification matches the proof.

  6. Graph-based pinpoint challenge: After a proof of learning has been verified, a whistleblower can replicate the verifier's work to check whether the verification itself was performed correctly. If a whistleblower believes the verification was performed incorrectly (whether maliciously or not), they can challenge it before contract arbitration in order to receive a reward. This reward can come from the deposits of the solver and verifier (in the case of a true positive) or from the jackpot treasury (in the case of a false positive), with arbitration carried out by the chain itself. Whistleblowers will only verify and subsequently challenge work if they expect to receive appropriate compensation. In practice, this means whistleblowers are expected to join and leave the network depending on how many other whistleblowers are active (i.e., have live deposits and challenges). Therefore, the expected default strategy for any whistleblower is to join the network when there are few other whistleblowers, post a deposit, randomly pick an active task, and begin the verification process. When the first task is done, they grab another random active task and repeat, until the number of whistleblowers exceeds their payout threshold, at which point they leave the network (or, more likely, switch to another role in the network, verifier or solver, depending on their hardware) until the situation reverses again.

  7. Contract arbitration: When a verifier is challenged by a whistleblower, they enter a process with the chain to pinpoint the location of the disputed operation or input; the chain then performs that final basic operation itself and determines whether the challenge is justified. To keep whistleblowers honest and overcome the verifier's dilemma, periodic forced errors and jackpot payouts are introduced.

  8. Settlement: During settlement, participants are paid according to the conclusions of the probabilistic and deterministic checks. Different scenarios lead to different payouts depending on the results of the preceding verifications and challenges. If the work is deemed to have been performed correctly and all checks have passed, the solver and the verifier are rewarded according to the operations performed.

Brief review of the project

Gensyn has designed an elegant game-theoretic system at the verification and incentive layers, which can quickly locate errors by pinpointing points of divergence in the network. However, many details are still missing from the current system. For example, how should parameters be set so that rewards and penalties are reasonable without the bar being set too high? Does the game design account for extreme cases and for differences in solvers' computing power? The current version of the whitepaper also lacks a detailed description of heterogeneous parallel execution. It seems Gensyn's road to implementation will still be long and difficult.

Together.ai

Together is a company focused on open-source large models and committed to decentralized AI computing solutions, hoping that anyone, anywhere can access and use AI. Strictly speaking, Together is not a blockchain project, but it has made preliminary progress on the latency problem in decentralized AGI computing networks. The following therefore only analyzes Together's solution and does not evaluate it as a project.

How can large models be trained and inferred when a decentralized network is 100 times slower than a data center?

Let's imagine how the GPU devices participating in the network would be distributed in a decentralized setting. These devices would sit on different continents and in different cities, and they would need to be connected to one another, with the latency and bandwidth of those connections varying widely. Consider a simulated distributed scenario with devices spread across North America, Europe, and Asia and differing bandwidth and latency between them: what needs to be done to link them together?


Distributed training computation modeling: Consider basic model training across multiple devices. There are three types of communication involved: forward activations, backward gradients, and lateral communication.


In combination with communication bandwidth and latency, two forms of parallelism need to be considered: pipeline parallelism and data parallelism, corresponding to the three types of communication in the multi-device case:

In pipeline parallelism, all layers of the model are divided into stages, and each device processes one stage, a contiguous sequence of layers such as several transformer blocks; in the forward pass, activations are passed to the next stage, while in the backward pass, the gradients of the activations are passed back to the previous stage.

In data parallelism, devices independently compute gradients on different micro-batches but must communicate to synchronize those gradients.
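A minimal sketch (a toy layer assignment, not Together's scheduler) of how the two forms of parallelism divide work: pipeline parallelism assigns contiguous blocks of layers to stages, while data parallelism gives each replica of a stage a different micro-batch and then averages their gradients.

```python
import numpy as np

# --- Pipeline parallelism: split 12 layers into 3 contiguous stages ---
n_layers, n_stages = 12, 3
per_stage = n_layers // n_stages
stages = [list(range(s * per_stage, (s + 1) * per_stage)) for s in range(n_stages)]
print(stages)   # [[0..3], [4..7], [8..11]]: activations flow forward, gradients backward

# --- Data parallelism: replicas of one stage each process a different micro-batch ---
rng = np.random.default_rng(0)
replica_grads = [rng.normal(size=4) for _ in range(3)]   # toy per-replica gradients
synced_grad = np.mean(replica_grads, axis=0)             # averaging stands in for an all-reduce
print(synced_grad)
```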

Scheduling Optimization:

In a decentralized environment, the training process is often limited by communication. Scheduling algorithms generally assign tasks that require a large amount of communication to devices with faster connection speeds, and considering the dependencies between tasks and the heterogeneity of the network, the cost of a specific scheduling strategy needs to be modeled first. In order to capture the complex communication cost of training the base model, Together proposes a novel formula and decomposes the cost model into two levels through graph theory:

  • Graph theory is a branch of mathematics that studies the properties and structure of graphs (networks). A graph consists of vertices (nodes) and edges (lines connecting nodes). The main purpose of graph theory is to study properties of graphs such as connectivity, coloring, and the nature of paths and cycles.
  • The first level is a balanced graph partition: splitting the vertex set of the graph into several subsets of equal or nearly equal size while minimizing the number of edges between subsets. In this partition, each subset forms one group, and communication cost is reduced by minimizing the edges between groups. This corresponds to the communication cost of data parallelism.
  • The second level is a joint graph matching and traveling salesman problem: a combinatorial optimization problem that combines elements of graph matching and the traveling salesman problem. Graph matching means finding a matching in a graph that minimizes or maximizes some cost; the traveling salesman problem is to find the shortest route visiting all nodes in a graph. This corresponds to the communication cost of pipeline parallelism.


The process can be summarized as follows. Because the actual implementation involves some complex formulas, the explanation below puts it in layman's terms; the detailed implementation can be found in the documentation on the Together official website.

Suppose there is a device set D with N devices, where communication between devices has uncertain latency (the A matrix) and bandwidth (the B matrix). Based on device set D, we first generate a balanced graph partition. Each partition, or device group, contains approximately the same number of devices, and the devices in a group all handle the same pipeline stage. This ensures that, under data parallelism, the device groups perform similar amounts of work. (Data parallelism means multiple devices perform the same task, while pipeline stages mean devices perform different steps of the task in a specific order.) Based on the latency and bandwidth of the communication links, the "cost" of transferring data between device groups can be computed via formulas. The balanced device groups are then merged into a fully connected coarse graph, in which each node represents a pipeline stage and each edge represents the communication cost between two stages. To minimize communication costs, a matching algorithm determines which device groups should work together.
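The following toy sketch (with made-up latency and bandwidth numbers, and a far cruder treatment than Together's partition-plus-matching formulation) shows the kind of quantity the cost model works with: the time to move a given amount of data between two devices as a function of the A (latency) and B (bandwidth) matrices.

```python
import numpy as np

# Toy cost model: time to send `size` bytes from device i to device j is
# latency[i, j] + size / bandwidth[i, j]. All numbers below are illustrative only.
latency = np.array([[0.0,   0.005, 0.120],
                    [0.005, 0.0,   0.110],
                    [0.120, 0.110, 0.0  ]])       # seconds (A matrix)
bandwidth = np.array([[np.inf, 1e9,    5e7],
                      [1e9,    np.inf, 5e7],
                      [5e7,    5e7,    np.inf]])  # bytes/sec (B matrix)

def comm_cost(i, j, size_bytes):
    return latency[i, j] + size_bytes / bandwidth[i, j]

activation_bytes = 50e6      # assumed size of one stage's forward activations

# Cost of sending activations between each pair of devices:
for i in range(3):
    for j in range(3):
        if i != j:
            print(f"dev{i} -> dev{j}: {comm_cost(i, j, activation_bytes):.3f} s")
# A scheduler would pick the assignment that minimizes the total of such costs
# across all pipeline edges and data-parallel synchronizations.
```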

For further optimization, the problem can also be modeled as an open-loop traveling salesman problem (open-loop meaning the route does not need to return to its starting point) to find an optimal path for transferring data across all devices. Finally, Together uses its innovative scheduling algorithm to find the optimal allocation strategy for the given cost model, minimizing communication cost and maximizing training throughput. According to actual measurements, even when the network is 100 times slower, end-to-end training throughput under this scheduling optimization is only about 1.7 to 2.3 times slower.
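Continuing the toy example (with hypothetical pairwise costs between device groups), the ordering of pipeline stages across device groups can be treated as an open-loop traveling salesman problem; with only a handful of groups, brute force is enough to illustrate the idea:

```python
from itertools import permutations
import numpy as np

# Toy open-loop TSP over device groups: find the stage ordering that minimizes the
# total inter-stage communication cost (the costs below are illustrative).
cost = np.array([[0.00, 0.06, 1.12, 0.90],
                 [0.06, 0.00, 1.10, 0.85],
                 [1.12, 1.10, 0.00, 0.20],
                 [0.90, 0.85, 0.20, 0.00]])     # pairwise cost between 4 device groups

best_order, best_cost = None, float("inf")
for order in permutations(range(len(cost))):    # feasible for a handful of groups
    total = sum(cost[order[i], order[i + 1]] for i in range(len(order) - 1))
    if total < best_cost:                       # open-loop: no return to the start
        best_order, best_cost = order, total
print(best_order, round(best_cost, 2))          # e.g. (0, 1, 3, 2) with cost 1.11
```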

Communication Compression Optimization:


For communication compression, Together introduces the AQ-SGD algorithm (for the detailed derivation, see the paper Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees). AQ-SGD is a novel activation compression technique designed to solve the communication efficiency problem of pipeline-parallel training over slow networks. Instead of compressing activation values directly, as earlier methods did, AQ-SGD compresses the change in the activation value of the same training example across training passes, which introduces an interesting dynamic: the algorithm's performance is expected to improve as training stabilizes. Rigorous theoretical analysis shows that AQ-SGD has a good convergence rate under certain technical conditions and with a quantization function of bounded error. The algorithm can be implemented efficiently without adding end-to-end runtime overhead, although it requires more memory and SSD space to store the activation values.

Extensive experiments on sequence classification and language modeling datasets show that AQ-SGD can compress activations to 2-4 bits without sacrificing convergence. AQ-SGD can also be combined with state-of-the-art gradient compression algorithms to achieve "end-to-end communication compression": all data exchanged between machines, including model gradients, forward activations, and backward gradients, is compressed to low precision, greatly improving the communication efficiency of distributed training. Compared with the end-to-end training performance of a centralized computing network (e.g., 10 Gbps) without compression, this is currently only about 31% slower. Combined with the scheduling-optimization results, there is still a gap relative to centralized computing power networks, but the prospects for catching up are fairly good.
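To convey the core idea of AQ-SGD in a few lines (a simplified sketch, not the paper's algorithm): instead of quantizing an activation directly, each worker keeps the previously communicated value and quantizes only the change since then, so the quantization error shrinks as training stabilizes and activations stop moving.

```python
import numpy as np

def quantize(x, n_bits=4):
    """Uniform quantization of x to n_bits (illustrative, not the paper's scheme)."""
    levels = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
prev_sent = np.zeros(8)                       # last value the receiver has seen
for step in range(3):
    activation = np.sin(np.arange(8)) + 0.1 / (step + 1) * rng.normal(size=8)
    delta = activation - prev_sent            # AQ-SGD-style: compress the change...
    q_delta = quantize(delta)                 # ...at low precision
    prev_sent = prev_sent + q_delta           # both sides reconstruct the same value
    err = np.abs(activation - prev_sent).max()
    print(f"step {step}: max reconstruction error {err:.4f}")
# As activations stabilize, successive deltas shrink, and so does the error.
```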

Conclusion

Amid the dividends of the AI wave, the AGI computing power market is undoubtedly the one with the greatest potential and the greatest demand among the various computing power markets, but its development difficulty, hardware requirements, and capital requirements are also the highest. Judging from the two projects above, we are still some distance from a working AGI computing power market: truly decentralized networks are far more complex than the ideal scenario, and clearly not yet enough to compete with the cloud giants. As of this writing, some early-stage projects (still at the slide-deck stage) have begun to explore new entry points, such as focusing on the less demanding inference stage or on training small models, which are more pragmatic attempts.

Despite the many challenges, the decentralization and permissionlessness of AGI computing power still matter in the long run: computing power should not be concentrated in the hands of a few centralized giants. Humanity does not need a new "religion" or a new "pope", let alone pay expensive "membership dues".

References

1. Gensyn Litepaper
2. NeurIPS 2022: Overcoming Communication Bottlenecks for Decentralized Training
3. Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees
4. The Machine Learning Compute Protocol and our future
5. Microsoft: Earnings Release FY23 Q2
6. Competing for tickets to the AI race: BAT, ByteDance, and Meituan scramble for GPUs
7. IDC: 2022-2023 Global Computing Power Index Evaluation Report
8. Guosheng Securities: estimates of large-model training costs
9. Wings of Information: What is the relationship between computing power and AI?
