Baidu's Shen Dou: Computing platform upgraded to support 100,000-card clusters; Wenxin large model called more than 700 million times a day

As the parameter counts of large models grow, the demand for computing power is rising exponentially. At the 2024 Baidu Cloud Intelligence Conference, held on September 25, Shen Dou, executive vice president of Baidu Group and president of the Baidu Smart Cloud Business Group, said that the well-known scaling law for large models still holds. The law states that model performance improves as parameter count, computing power, and dataset size increase, and "soon, more 100,000-card computing clusters will appear."
Shen Dou said that over the past year Baidu has seen a sharp rise in customer demand for model training. "The adoption of large models in industry is accelerating in 2024," he said. "On the Qianfan large model platform, the Wenxin large model is now called more than 700 million times a day, and the platform has helped users fine-tune 30,000 large models and develop more than 700,000 enterprise-level applications."
Growing demand for large model training means that the computing clusters required keep getting larger, while expectations that model inference costs will keep falling are also rising. Shen Dou said that both trends place higher requirements on the stability and effectiveness of GPU management. On September 25, Baidu released Baige 4.0, an upgrade of its AI heterogeneous computing platform, which can deploy and manage 100,000-card clusters.
Shen Dou said that GPU computing clusters have three characteristics: extreme scale, extreme density, and extreme interconnection. GPU procurement alone for a 10,000-card cluster can cost billions of yuan. He emphasized that building computing resources is not simply a matter of buying GPUs and connecting them; it requires a great deal of technology. "For example, there are more types of GPU chips, which makes management more complex; GPUs need to perform a large amount of parallel computation; and data transfer volumes have grown while the demand for speed has risen," he said. The Baige computing platform therefore needs to support heterogeneous chips, high-speed interconnection, and efficient storage.
Shen Dou also said that managing a 100,000-card cluster is fundamentally different from managing a 10,000-card cluster. First, at the physical level, a 100,000-card cluster would occupy approximately 100,000 square meters, equivalent to 14 standard football fields. Second, in terms of energy, its servers would consume approximately 3 million kilowatt-hours of electricity per day, equivalent to the daily residential electricity consumption of Beijing's Dongcheng District. These space and energy requirements far exceed what traditional data center deployment can accommodate. Spreading the cluster across data centers in different regions, in turn, creates major challenges at the network level. In addition, GPU failures in a 100,000-card cluster will be very frequent, which puts pressure on the proportion of time spent on effective training.
Shen Dou said that in response to these challenges, Baige 4.0 provides a large-scale, congestion-free HPN high-performance network at the 100,000-card level, ultra-high-precision network monitoring at the 10 ms level, and minute-level fault recovery for 100,000-card clusters. "Baige 4.0 is designed for deploying large-scale clusters of 100,000 cards. Today's Baige 4.0 already has mature capabilities for deploying and managing 100,000-card clusters, aiming to overcome these new challenges and provide a continuously leading computing platform for the entire industry," said Shen Dou.
Baidu is not alone: more and more tech giants are upgrading their computing infrastructure in response to the demand for large AI models. In early September, Musk announced that Colossus, an AI training supercluster built by his AI startup xAI, had officially gone online with 100,000 Nvidia H100 GPUs, and that the number of GPUs would double in the coming months. At the Yunqi Conference on September 19, 2024, Alibaba Cloud also stated that GPU-based AI computing will be the dominant computing paradigm in the future. Alibaba Cloud is upgrading its AI infrastructure across chips, servers, networks, storage, cooling, power supply, and data centers.
