Baidu's Shen Dou: Computing platform upgraded to support 100,000-card clusters; Wenxin large model exceeds 700 million calls per day

嫦娥的情人矩

As the parameter counts of large models grow, demand for computing power is rising exponentially. At the 2024 Baidu Cloud Intelligence Conference on September 25, Shen Dou, executive vice president of Baidu Group and president of Baidu AI Cloud Group, said that the well-known scaling law for large models still holds. The law states that model performance improves as parameters, computing power, and dataset size increase, and "soon, more 100,000-card computing clusters will appear".
According to Shen Dou, Baidu has seen a sharp increase in customer demand for model training over the past year. "The industrial adoption of large models is accelerating in 2024," he said. "Currently, on the Qianfan large model platform, the Wenxin large model is called more than 700 million times a day, and the platform has helped users fine-tune 30,000 large models and develop over 700,000 enterprise-level applications."
Growing demand for large model training means the required computing clusters keep getting larger, while expectations that model inference costs will keep falling are also rising. Shen Dou said that both trends place higher requirements on the stability and effectiveness of GPU management. On September 25, Baidu upgraded its AI heterogeneous computing platform Baige to version 4.0, which can deploy and manage 100,000-card clusters.
Shen Dou said that GPU computing clusters have three characteristics: extreme scale, extreme density, and extreme interconnection. Building a 10,000-card cluster alone can cost billions of yuan in GPU procurement. He emphasized that building computing resources is not simply a matter of buying GPUs and connecting them; it requires a great deal of engineering. For example, he said, GPU chips come in more varied models and are more complex to manage, GPUs must perform large amounts of parallel computation, and data transfer volumes have grown while speed requirements have risen. The Baige computing platform therefore needs to support heterogeneous chips, high-speed interconnection, and efficient storage.
Shen Dou also said that managing a 100,000-card cluster is fundamentally different from managing a 10,000-card cluster. First, at the physical level, a 100,000-card cluster would occupy approximately 100,000 square meters, equivalent to the area of 14 standard football fields. Second, in terms of energy, its servers would consume approximately 3 million kilowatt-hours of electricity per day, equivalent to the daily residential electricity consumption of Beijing's Dongcheng District. These space and energy demands far exceed what traditional data center deployment can accommodate. Spreading the cluster across data centers in different regions would in turn create major challenges at the network level. In addition, GPU failures in a 100,000-card cluster will be very frequent, which puts pressure on the proportion of time spent in effective training.
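The space and energy figures above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the 105 m x 68 m pitch size is an assumption (a common standard football pitch), the function names are hypothetical, and the per-card figure is a plain average over the whole facility, not a measured GPU power draw.

```python
# Back-of-the-envelope check of the figures quoted in the article.
CARDS = 100_000               # cluster size quoted in the article
FLOOR_AREA_M2 = 100_000       # floor area quoted in the article
DAILY_ENERGY_KWH = 3_000_000  # daily consumption quoted in the article

# Assumption (not from the article): a standard pitch is 105 m x 68 m.
PITCH_AREA_M2 = 105 * 68


def football_fields(area_m2: float) -> float:
    """Express an area as a number of assumed standard pitches."""
    return area_m2 / PITCH_AREA_M2


def average_power_mw(daily_kwh: float) -> float:
    """Average continuous power (MW) implied by a daily energy total."""
    return daily_kwh / 24 / 1000


def average_kw_per_card(daily_kwh: float, cards: int) -> float:
    """Average facility power per card (kW), all overheads included."""
    return daily_kwh / 24 / cards


if __name__ == "__main__":
    print(f"Pitches: {football_fields(FLOOR_AREA_M2):.1f}")
    print(f"Average power: {average_power_mw(DAILY_ENERGY_KWH):.0f} MW")
    print(f"Per card: {average_kw_per_card(DAILY_ENERGY_KWH, CARDS):.2f} kW")
```

Under these assumptions the area works out to about 14 pitches, consistent with the article, and the quoted daily energy implies a continuous draw of roughly 125 MW, or about 1.25 kW per card averaged across the facility.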
In response to these challenges, Shen Dou said, Baige 4.0 provides a large-scale, congestion-free high-performance network (HPN) at the 100,000-card level, network monitoring with 10 ms-level precision, and minute-level fault recovery for 100,000-card clusters. He said that Baige 4.0 is designed for deploying clusters of 100,000 cards and already has mature capabilities for deploying and managing them, with the aim of overcoming these new challenges and providing a continuously leading computing platform for the entire industry.
Baidu is not alone: more and more tech giants are responding to demand for large AI models by strengthening their computing infrastructure. In early September, Musk announced that Colossus, an AI training supercluster built by his AI startup xAI, had officially gone online with 100,000 Nvidia H100 GPUs, and that the number of GPUs would double in the coming months. On September 19, 2024, at the Yunqi Conference, Alibaba Cloud also said that GPU-based AI computing will be the dominant computing paradigm in the future. Alibaba Cloud is upgrading its AI infrastructure across chips, servers, networks, storage, cooling, power supply, and data centers.