How Cutthroat Is Large-Model Training? Unraveling the Mystery of Large-Model Computing Power
Article source: Titanium Media
Author|Qin Conghui
Editor|Gai Hongda
Using 40 years of global weather data and pre-training on 200 GPU cards for about two months, a Pangu weather model with hundreds of millions of parameters was trained.
This is the story of Bi Kaifeng, three years out of Tsinghua University, and the large model he trained.
From a cost perspective, however, at a typical rate of 7.8 yuan per GPU-hour, the training cost of Bi Kaifeng's Pangu weather model likely exceeded 2 million yuan. And that is a vertical large model for the meteorological domain; training a general-purpose large model could cost a hundred times more.
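A back-of-the-envelope check of that figure, assuming the quoted rate of 7.8 yuan per GPU-hour and the roughly two-month, 200-card run described above:

```python
# Rough training-cost estimate for the run described above.
# Assumptions (from the article): 200 GPUs, ~2 months, 7.8 yuan per GPU-hour.
gpus = 200
hours = 2 * 30 * 24                  # ~2 months expressed in hours
rate_yuan_per_gpu_hour = 7.8

cost = gpus * hours * rate_yuan_per_gpu_hour
print(f"Estimated cost: {cost:,.0f} yuan")   # ~2,246,400 yuan
```

The result, roughly 2.25 million yuan, is consistent with the article's "may exceed 2 million" figure.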
By one count, China has more than 100 large models with over a billion parameters. But as the industry flocks to large-model "alchemy," it faces the problem that high-end GPUs are hard to come by: computing power is expensive, and the shortage of compute and capital has become the most immediate problem facing the industry.
**High-End GPUs: How Short Is the Supply?**
"No, of course it is lacking, but what can we do." A senior executive of a large factory blurted out when asked if he lacked computing power.
This seems to have become a problem the industry acknowledges but cannot solve. At the peak, the price of an NVIDIA A100 was bid up to 200,000 yuan, and the rental price of a single A100 server soared to 50,000-70,000 yuan per month. Even at those prices the chips may still be unobtainable, and some computing power suppliers have run into once-rare experiences such as suppliers reneging on promised deliveries.
Zhou Lijun, a cloud computing industry executive, said much the same: "There is a shortage of computing power. Many of our customers want high-end GPU resources, but for now we cannot fully meet the broader market's needs."
In short, the shortage of high-end GPUs has no industry-wide solution in the near term. With the explosion of large models, market demand for computing power has grown rapidly, while supply growth has fallen far behind. Compute supply will surely shift from a seller's market to a buyer's market in the long run, but how long that will take is anyone's guess.
Every company is counting how many "goods" (NVIDIA GPUs) it holds, even using the count to estimate market share. If you hold close to 10,000 cards and the market totals 100,000, your share is 10%. "By the end of the year we'll have about 40,000; if the market is 200,000, that's roughly 20% of the market," a person familiar with the matter said by way of example.
On one hand, cards cannot be bought; on the other, the threshold for large-model training is not as easy to clear as the industry's hype suggests. As noted above, the training cost of Bi Kaifeng's Pangu weather model may have exceeded 2 million yuan. Note, too, that it is a vertical model trained on top of the Pangu general-purpose model, with hundreds of millions of parameters. Training a general-purpose model at the billion-parameter scale or larger could cost ten or a hundred times as much.
"At present, the largest investment scale is in training, and without billions of capital investment, it is difficult to continue to make a large model." Qiu Yuepeng, Vice President of Tencent Group, COO of Cloud and Smart Industry Business Group, and President of Tencent Cloud, revealed.
"Run fast, at least until the money burns out to get the next round of 'financing'." One entrepreneur described the current big model "war situation": "This road is a dead endIf you don't have tens of billions of dollars behind you, it's hard to go. ”
Against this backdrop, the common view in the industry is that as competition in the large-model market unfolds, the mood will turn from fanatical to rational, and enterprises will control costs and adjust strategy as expectations change.
Responding Actively to a Problem with No Solution
If the conditions don't exist, create them: this seems to be the prevailing mentality among large-model players, and each company has its own methods for creating conditions to deal with real-world problems.
Because high-end GPU chips are scarce, and the GPUs available on the Chinese market are not the latest generation and usually offer lower performance, enterprises need more time to train large models. They are therefore looking for innovative ways to make up for the compute shortfall.
One approach is to train on higher-quality data, which makes training more efficient.
Recently, the China Academy of Information and Communications Technology (CAICT) led the release of the "Research Report on the Industry Large Model Standard System and Capability Architecture," which covers evaluation of the data layer of large models. Because data quality strongly affects model performance, the report recommends introducing manual labeling and confirmation, selecting at least a certain proportion of the raw data for annotation, in order to construct and curate high-quality datasets.
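As a minimal sketch of that sampling step, assuming a uniform random sample and a 5% labeling proportion (both are illustrative; the report does not prescribe a specific method or ratio):

```python
import random

def sample_for_labeling(raw_records, proportion=0.05, seed=42):
    """Select a fixed proportion of raw records for manual labeling/confirmation.

    The 5% default is illustrative; the report only says "at least a certain
    proportion" of the raw data should be labeled to build a high-quality set.
    """
    rng = random.Random(seed)                      # fixed seed: reproducible sample
    k = max(1, int(len(raw_records) * proportion))
    return rng.sample(raw_records, k)

# Usage: hand the sampled subset to annotators, then audit it before training.
subset = sample_for_labeling([{"text": f"doc {i}"} for i in range(10_000)])
print(len(subset))  # 500
```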
Beyond cutting large-model costs through high-quality data, improving infrastructure capability and keeping large GPU clusters running stably over long training runs is, for the industry, another path to lower costs.
"As a cloud service provider, we help customers build a stable and reliable infrastructure. Because the stability of the GPU server card will be poor, any failure will interrupt the training, resulting in an increase in the overall training time. High-performance computing clusters can provide customers with more stable services, reduce training time, and solve some computing power problems. Zhou Lijun said.
Meanwhile, scheduling compute cards also tests a service provider's technical ability. Xu Wei, head of East China Internet Solutions at Volcano Engine, told Titanium Media that holding card resources is only one side of the matter; how those resources are scheduled and actually put to use is the real test of core engineering capability. "Splitting one card into many small cards and achieving distributed, fine-grained scheduling can further reduce the cost of computing power," Xu Wei said.
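A minimal sketch of that "split one card into many small cards" idea, assuming a toy first-fit fractional allocator (real systems rely on GPU virtualization or hardware partitioning and are far more involved):

```python
from dataclasses import dataclass, field

@dataclass
class Card:
    name: str
    free_fraction: float = 1.0          # 1.0 = a whole physical GPU
    tenants: list = field(default_factory=list)

class FractionalScheduler:
    """Pack fractional GPU requests onto physical cards (first-fit)."""
    def __init__(self, cards):
        self.cards = cards

    def allocate(self, job: str, fraction: float):
        for card in self.cards:
            if card.free_fraction >= fraction:
                card.free_fraction -= fraction
                card.tenants.append((job, fraction))
                return card.name
        raise RuntimeError("no card has enough free capacity")

sched = FractionalScheduler([Card("gpu-0"), Card("gpu-1")])
print(sched.allocate("inference-a", 0.25))  # gpu-0
print(sched.allocate("inference-b", 0.5))   # gpu-0 (packed onto the same card)
print(sched.allocate("training-c", 1.0))    # gpu-1
```

The point of the packing is utilization: small inference jobs share one physical card instead of each idling a whole one.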
The network also affects the speed and efficiency of large-model training. Training often spans thousands of cards, and connecting hundreds of GPU servers demands extremely fast networking; if the network is even slightly congested, training slows dramatically. "If one server overheats and goes down, the whole cluster may have to stop and the training task be restarted. This places very high demands on a cloud provider's O&M and troubleshooting capabilities," Qiu Yuepeng said.
Some vendors have found another route: moving from a cloud computing architecture to a supercomputing architecture to cut costs. For non-high-throughput computing tasks and parallel workloads where user needs are still met, the supercomputing cloud costs about half as much as cloud-based supercomputing, and performance optimization can then raise resource utilization from 30% to 60%.
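Those two levers compound. A quick calculation under the stated assumptions (half the price, utilization doubled from 30% to 60%):

```python
# Effective cost per unit of *useful* compute = price / utilization.
cloud_price, cloud_util = 1.00, 0.30     # normalized cloud-supercomputing baseline
superc_price, superc_util = 0.50, 0.60   # "about half the price", utilization 30% -> 60%

baseline = cloud_price / cloud_util      # ~3.33 per useful unit
optimized = superc_price / superc_util   # ~0.83 per useful unit
print(f"Effective cost falls to {optimized / baseline:.0%} of the baseline")  # 25%
```

Under these assumptions, the cost per unit of useful compute drops to roughly a quarter of the baseline.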
In addition, some manufacturers choose domestic platforms to train and run inference for large models, replacing the hard-to-find NVIDIA cards. "We jointly released the iFLYTEK Spark all-in-one machine with Huawei; being able to do both training and inference on a domestic platform is remarkable. I am particularly pleased to tell you that Huawei's GPU capability now matches NVIDIA's. Ren Zhengfei attaches great importance to this; three Huawei directors have worked in a special task force at iFLYTEK, and the result is now comparable to NVIDIA's A100," Liu Qingfeng, founder and chairman of iFLYTEK, once said.
Each of the approaches above is a sizeable engineering project, so ordinary enterprises find them hard to realize through self-built data centers, and many algorithm teams instead choose professional computing power vendors for support. Parallel storage is another major cost, and technical capability, failure-rate guarantees, and the like are also part of the hardware bill, on top of IDC availability-zone electricity and operating costs such as software, platform, and personnel.
Only GPU clusters at the thousand-card level achieve economies of scale, so choosing a computing power service provider means, in effect, near-zero marginal cost.
Sun Ninghui, academician of the Chinese Academy of Engineering and researcher at the Institute of Computing Technology of the Chinese Academy of Sciences, argued in a speech that AIGC has set off an explosion in the AI industry, and that large-scale adoption of intelligent technology faces a typical long-tail problem: organizations with strong AI capabilities (network security agencies, state research academies and institutes, meteorological bureaus, etc.), scientific research institutions, and large and medium-sized enterprises account for only about 20% of computing power demand; the other 80% are small and medium-sized enterprises, which, constrained by the high price of compute, struggle to capture the dividends of the AI era.
Therefore, to bring intelligent technology to scale, the AI industry needs to win both acclaim and adoption, and that requires a large supply of cheap, easy-to-use intelligent computing power so that small, medium, and micro enterprises can use compute conveniently and affordably.
Whether it is large models' urgent demand for compute or the many problems still to be solved in putting compute to use, one change worth noting is that, driven by market demand and technology iteration, computing power has become a new service model.
Exploring a New Model of Computing Power Service
What exactly is the large-model computing power everyone is scrambling for? To answer that question, start with computing power services.
By type, computing power divides into general-purpose computing, intelligent (AI) computing, and supercomputing. That all of these have become services is the result of the dual drive of market and technology.
The "2023 Computing Power Service White Paper" (hereinafter the "White Paper") defines computing power service as a new segment of the computing power industry that is based on diversified compute, linked by the computing power network, and aimed at delivering effective computing power.
The essence of computing power service is to deliver heterogeneous compute as unified output through new computing technologies, cross-integrated with cloud, big data, AI, and others. A computing power service contains more than raw compute: it is a unified encapsulation of computing, storage, network, and other resources, with delivery completed in the form of services (such as APIs).
With this understanding, you will find that many of those scrambling for NVIDIA chips are computing power service providers, that is, compute producers. The industry users who actually call the computing power APIs on the front end only need to state their compute requirements.
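To illustrate that division of labor, here is a hypothetical client call to a compute-service API; the endpoint, fields, and token are invented for illustration, since each real provider defines its own interface:

```python
import json
from urllib import request

# Hypothetical compute-service API: the user states requirements; the provider
# is responsible for sourcing, scheduling, and operating the underlying GPUs.
payload = {
    "task": "fine-tune",
    "gpu_type": "A800",          # desired card class
    "gpu_count": 8,
    "max_hours": 72,
    "billing": "pay-as-you-go",
}
req = request.Request(
    "https://compute.example.com/v1/jobs",   # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <token>"},
)
# resp = request.urlopen(req)   # would return a job handle; polling and
#                               # billing callbacks are the provider's side
```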
According to Titanium Media App, from the software side, large-model usage in software interaction falls into three types: first, calling a large-model API, where each vendor has its own quotation and settlement follows the price; second, owning a small model and purchasing, or even self-deploying, compute; third, large-model vendors partnering with cloud vendors, that is, dedicated clouds, paid monthly. "Generally it's these three. Kingsoft Office currently mainly uses API calls, and for the internal small model we built our own compute scheduling platform," Yao Dong, vice president of Kingsoft Office, told Titanium Media App.
In other words, in the computing power industry chain, upstream enterprises mainly supply the supporting resources for computing power services: general compute, intelligent compute, supercomputing, storage, and network. In the battle for large-model compute, for example, NVIDIA sits upstream supplying chips as a basic compute resource, and the rising stock prices of server manufacturers such as Inspur Information are likewise driven by market demand.
Midstream enterprises are mainly cloud service providers and new computing power service providers. Their role is to realize compute production through compute orchestration, scheduling, and trading technology, and to complete supply through APIs. The computing power service providers mentioned above, Tencent Cloud, and Volcano Engine all sit in this link. The stronger a midstream enterprise's service capability, the lower the threshold on the application side, and the more it furthers the inclusive, ubiquitous development of computing power.
Downstream enterprises, such as industry users, rely on the compute provided by these services to create value-added offerings. Such users only need to state their requirements; the compute producer configures the corresponding resources to complete the "compute task" the user submits.
This offers greater cost and technical advantages than the old approach of buying servers to build one's own large-model compute environment. Bi Kaifeng's training of the Pangu weather model presumably called the Pangu model's underlying layer directly, that is, HUAWEI CLOUD's high-performance computing service. Will other large-model enterprises use or pay for compute any differently?
Computing Power Business Model Iteration
ChatGLM was among the first batch of general-purpose large models launched. Taking Zhipu AI's use of compute for ChatGLM as an example, publicly disclosed information indicates that Zhipu AI uses several mainstream domestic AI computing power service providers. "In theory, everything usable gets used," a person familiar with the matter said, adding that this may also include mainstream domestic cloud service providers.
Pay-as-you-go and monthly billing are the mainstream modes of current computing power services, and usage needs fall roughly into two types. One is choosing a corresponding compute service instance: on one cloud provider's official website, high-performance GPU servers equipped with three mainstream cards, the NVIDIA A800, A100, and V100, are on offer.
The other is choosing a MaaS platform and fine-tuning a large model there. Taking Tencent Cloud's TI-ONE platform pay-as-you-go list price as an example, an 8C40G V100*1 configuration costs 20.32 yuan per hour and can be used for AutoML vision, task-based modeling, notebooks, and visual modeling.
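A quick illustration of what pay-as-you-go billing implies at that list price (the fine-tuning duration here is an assumption for the sake of arithmetic):

```python
# Pay-as-you-go cost at the TI-ONE list price quoted above.
rate = 20.32                 # yuan per hour for an 8C40G V100*1 instance
hours = 48                   # assumed length of a small fine-tuning job
print(f"{rate * hours:.2f} yuan")   # 975.36 yuan for the assumed 48-hour run
```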
At present, the industry is also promoting "compute-network integration" in computing power services: by jointly assessing computing tasks, compute and network resource status, and other information, it forms an orchestration scheme that can schedule across architectures, regions, and service providers, and completes the corresponding resource deployment. For example, a user deposits a sum of money into the computing power network, whose partitions can then be called freely: based on the application's characteristics, the most suitable, fastest, or most cost-effective partition is selected, charges accrue by duration, and fees are deducted from the prepaid balance.
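A minimal sketch of that selection-and-deduction loop, assuming a toy scoring rule (real compute-network schedulers weigh many more signals than price and time):

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    price_per_hour: float    # yuan per hour
    est_hours: float         # estimated completion time for this task

def cheapest_total(partitions):
    """Pick the most cost-effective partition: lowest total price for the task."""
    return min(partitions, key=lambda p: p.price_per_hour * p.est_hours)

balance = 10_000.0                           # prepaid funds in the compute network
parts = [Partition("east-gpu", 25.0, 40), Partition("west-gpu", 18.0, 52)]

chosen = cheapest_total(parts)
fee = chosen.price_per_hour * chosen.est_hours
balance -= fee                               # deduct by duration from the balance
print(chosen.name, fee, balance)             # west-gpu 936.0 9064.0
```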
For cloud service providers the logic is the same: computing power service, as a distinctive cloud product, lets them quickly take a place in the computing power industry chain.
According to the Ministry of Industry and Information Technology, China's total computing power reached 180 EFLOPS in 2022, ranking second in the world, and the computing power industry had grown to 1.8 trillion yuan in scale. Large-model compute demand has greatly accelerated the industry's development.
One view holds that today's computing power service is essentially a new "electricity-selling" model. Depending on the division of labor, however, some compute service providers may need to help users with more: system performance tuning, software installation, on-call duty for large-scale jobs, and workload characteristics analysis, in other words, part of the last-mile operations work.
As large-model high-performance computing demand becomes the norm, computing power services, born out of cloud services, have quickly entered public view, forming a distinctive industry chain and business model. It is just that at this early stage of the large-model-driven compute boom, scarce high-end GPUs, high compute costs, and the scramble for "cores" have formed a landscape unique to this era.
"At this stage, the volume is who can get the card in the supply chain, NVIDIA is the king of the entire industry at present, and all markets are controlled by it, which is the status quo." People familiar with the matter commented. It's as if whoever gets the card can deliver the business when demand outstrips supply.
But not everyone is scrambling for "cards," because the shortage is temporary and the problem will eventually be solved. "Those doing long-term research aren't actually scrambling; they can afford to wait because they won't die. Right now it's mainly a group of startups grabbing cards, trying to make sure they survive until next year," the person said.
Amid many uncertainties, computing power becoming a service is a definite trend. What compute service providers should do is be ready, so that when the large-model frenzy returns to rationality and the market winds shift, they are prepared.
Note: At the request of the interviewee, Zhou Lijun is a pseudonym.
**(This article was first published on Titanium Media App)**