Who can still challenge Nvidia as its stock soars and its market value approaches Apple's?


At Thursday's close in US trading, Nvidia's stock hit a record high, pushing its market value above $2.3 trillion. The shares rose more than 3% in pre-market trading on March 8, then reversed sharply during the session and closed down 5.55% at $875.28. Even so, the company is still worth about $2.19 trillion, and the gap with Apple keeps narrowing.
Recently, on the back of generative AI's heavy demand for GPU computing power, Nvidia's stock has all but skyrocketed, setting record highs repeatedly. Beneath the surface, however, challenges to Nvidia have never been in short supply, and some changes are underway.
Anthropic, widely considered OpenAI's biggest competitor, recently released the Claude 3 model family, whose top version beat GPT-4 in multiple benchmark tests. Less noticed is that Amazon stands behind Anthropic: after taking Amazon's investment, Anthropic trains and deploys its models on Amazon's self-developed AI chips, Trainium and Inferentia. Google and other giants are likewise pushing their own in-house AI chips.
Another event caused a stir: AI chip startup Groq recently claimed that its LPU (Language Processing Unit) delivers ten times the inference performance of Nvidia GPUs at one tenth of the cost. An AI entrepreneur who tried Groq's public demo exclaimed to reporters, "520 tokens per second. It's amazing." The chip adopts a compute-in-memory (more precisely, near-memory computing) architecture, which departs from the von Neumann architecture of traditional GPUs. Spurred by the chip's launch, the head of a recently funded domestic compute-in-memory chip company told reporters that industry attention to this new class of AI chip architecture has risen markedly.
Architectural innovation in chips and the in-house silicon push of the AI giants form two undercurrents challenging Nvidia. It may be too early to talk about disruption, but with so many interests entangled, the challenges will not stop.
Groq's Architecture Revolution
In 2016, Nvidia CEO Jensen Huang hand-delivered the first DGX-1 supercomputer to OpenAI. Integrating eight P100 chips, it compressed OpenAI's training time from one year to one month, a textbook case of GPUs enabling large models. When large models took off, Nvidia, which had spent years laying the groundwork in high-performance computing and building the CUDA software ecosystem, took the lead, and its general-purpose hardware and mature software stack made it the biggest winner in AI chips.
But is the GPU architecture the best fit for AI computing? The answer may be no. Mainstream GPUs, with Nvidia's as the archetype, rely on ever more advanced process nodes to improve performance, yet Moore's Law is visibly approaching its limit and each new node costs more to manufacture. Von Neumann architecture chips, which separate compute from storage, also run into the memory wall and the power wall: data must shuttle between memory and processing units, and storage bandwidth caps the effective bandwidth of the whole computing system. At an earlier industry conference attended by this reporter, a practitioner tallied memory and processor performance over the past 20 years and found the gap between the two widening at roughly 50% per year. Relative to the growth of raw compute, the slow growth of data-movement capacity increasingly constrains large models, and the industry is already exploring ways around the drawbacks of the von Neumann architecture.
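To make the memory-wall argument concrete, here is a minimal roofline-style sketch (the peak-compute and bandwidth figures are illustrative assumptions, not any vendor's specs): a kernel whose arithmetic intensity, in FLOPs per byte moved, falls below the hardware's compute-to-bandwidth ratio is limited by memory, no matter how much compute sits on the die.

```python
# Roofline-style estimate: is a kernel compute-bound or bandwidth-bound?
# All hardware numbers below are illustrative assumptions, not vendor specs.

def attainable_flops(peak_flops: float, bandwidth_bytes: float,
                     arithmetic_intensity: float) -> float:
    """Attainable FLOP/s = min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, bandwidth_bytes * arithmetic_intensity)

PEAK = 1000e12  # assumed 1000 TFLOP/s of peak compute
BW = 3e12       # assumed 3 TB/s of memory bandwidth (HBM-class)

# Matrix-vector work (typical of LLM decoding): ~1 FLOP per byte of fp16
# weights streamed in, so bandwidth, not compute, sets the ceiling.
print(attainable_flops(PEAK, BW, 1.0) / 1e12)   # 3.0 TFLOP/s, bandwidth-bound

# Large matrix-matrix work (training): hundreds of FLOPs per byte,
# so the chip finally hits its compute peak.
print(attainable_flops(PEAK, BW, 400) / 1e12)   # 1000.0 TFLOP/s, compute-bound
```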
Since data transfer between storage and processing units is lossy, Nvidia's approach is to shorten the distance between the two. It uses HBM (High Bandwidth Memory), built from stacked DRAM (Dynamic Random Access Memory) dies, packaged together with the GPU, bringing storage and compute closer while raising storage density, cutting transfer losses, and lifting bandwidth. This is precisely the focus of memory giants such as SK Hynix, but the approach is constrained by tight HBM supply and depends on advanced packaging from the likes of TSMC.
There is another way past the memory wall: abandon the von Neumann architecture for a compute-in-memory design that merges compute and storage units. The new architecture can underpin several kinds of chips, GPUs and LPUs among them. Groq's LPU inference chip is a design closer to compute-in-memory. It also changes the product formula by using SRAM (Static Random Access Memory) instead of HBM, amplifying SRAM's advantage in access speed. On a 14nm process, it generates large-model output at nearly 500 tokens per second, versus the roughly 40 tokens per second of the GPU-driven GPT-3.5.
"Taking the Nvidia H100 as an example, there is also an SRAM inside, and the data from HBM needs to go through the SRAM once, with a bandwidth of about 3.25Tb/s. The Groq chip is equivalent to no longer connecting to a separate HBM, and the internal bandwidth can reach 80Tb/s, which is nearly 30 times larger than the GPU HBM." Chen Wei, Chairman of Chixin Technology, told First Financial reporters that the Groq team, who came from the Google TPU (Tensor Processing Unit) team, combined the original TPU architecture concept, near memory computing, and data flow architecture, has shown good cost-effectiveness in cluster computing.
After the chip's launch, some observers, notably Jia Yangqing, former vice president of technology at Alibaba, compared the Groq LPU's much smaller memory capacity with the Nvidia H100's and argued that at equal throughput the Groq LPU's hardware cost and energy consumption would be higher than the H100's. Chen Wei instead focused on average computing cost: by his quantitative calculation, a Groq LPU server's cost per token/s, and per TOPS of BOM at the module or computing-card level, comes out lower than the Nvidia H100's, even though the Groq LPU's process node lags far behind the 5nm-class H100. Chen Wei told reporters that the Groq LPU uses one of the more mature architectures within near-memory computing; new architectures positioned to replace GPGPUs were already announced in North America in 2019 and 2020, so the Groq chip's arrival was to be expected. The general view is that a compute-in-memory architecture can lead logic chips or GPUs at the same process node by one to two generations, so a 12nm or 16nm compute-in-memory chip can deliver roughly what a 7nm or 5nm traditional-architecture GPU does. Integrating compute-in-memory with existing GPU technology, or replacing traditional GPUs outright, is one direction of development.
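Chen Wei's metric is amortized cost per unit of delivered throughput rather than per-chip price. A minimal sketch of that comparison follows, with purely hypothetical BOM and throughput numbers, not his actual figures: a server built from many cheap low-capacity cards can beat one built from fewer expensive high-capacity cards on this metric even if each individual card looks worse.

```python
# Cost-per-throughput metric: dollars of server BOM per (token/s) served.
# All numbers are hypothetical placeholders, not Chen Wei's figures.

def cost_per_token_rate(server_bom_usd: float, tokens_per_s: float) -> float:
    return server_bom_usd / tokens_per_s

# Hypothetical rack of many low-memory inference cards:
print(cost_per_token_rate(server_bom_usd=600_000, tokens_per_s=3_000))  # 200 $/(tok/s)
# Hypothetical server of a few high-memory GPUs:
print(cost_per_token_rate(server_bom_usd=250_000, tokens_per_s=1_000))  # 250 $/(tok/s)
```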
China is also laying out compute-in-memory architectures for AI demand. The reporter learned that Chixin Technology's chips are being tested internally by internet companies and scaled to larger models. Related companies include Yizhu Technology, Alibaba's DAMO Academy, Zhicun Technology, Pingxin Technology, and Houmo Intelligence, focused variously on cloud, automotive, or other edge scenarios. Besides the SRAM that Groq adopted, the industry is also exploring higher-density storage media such as ReRAM.
Some overseas giants are also betting on compute-in-memory. Last September, American AI chip startup D-Matrix closed a $110 million Series B round, with Microsoft and Samsung on the investor list; Microsoft also committed to evaluating the chip for its own use once D-Matrix launches it this year. Rain AI, another AI startup developing digital in-memory computing chips, earlier took a $1 million personal investment from OpenAI CEO Sam Altman, and in 2019 OpenAI signed a letter of intent to spend $51 million on Rain AI's chips.
Silicon Valley Giants Push Ahead
"Benefiting from Nvidia, but also constrained by Nvidia" may be a portrayal of Silicon Valley giants' pursuit of big models over the past year. While leading the AI chip market, Nvidia's GPU production capacity for large model training and inference was once limited and not cheap.
Meta founder Mark Zuckerberg said earlier this year that by year-end the company's computing infrastructure would include 350,000 H100 GPUs. Raymond James analysts have put the H100's price at $25,000 to $30,000; at $25,000 apiece, that batch of Meta GPUs alone would cost roughly $8.75 billion. Sam Altman has repeatedly raised the supply-demand imbalance in AI chips, recently saying that global demand for AI infrastructure, including wafer fab capacity and energy, exceeds what is currently planned.
Players beyond Nvidia have lately generated a stream of chipmaking news. Responding in February to rumors of a $7 trillion OpenAI chipmaking plan, Sam Altman said, "We believe the world will need more AI chips. AI chips require significant global investment, beyond our imagination." SoftBank Group founder Masayoshi Son is also reportedly seeking to raise $100 billion to fund a chip venture.
Silicon Valley's tech giants started earlier still. Nvidia's old rival AMD is catching up in GPUs. Amazon has its custom Trainium and Inferentia chips for AI training and inference. Meta released MTIA v1, its first-generation custom AI inference chip, last year, while Google launched the TPU in 2017 and builds AI products on it; reportedly over 90% of Google's AI training runs on TPUs. Meta likewise plans to deploy its own AI chips in its data centers to reduce reliance on Nvidia.
Nvidia's CUDA software ecosystem, built atop its GPUs, is its moat; on hardware performance alone, Nvidia's GPUs are not unsurpassable. Several Silicon Valley giants have explored paths that sidestep the GPU altogether. Professor Liang Xiaogui of the Department of Computer Science and Engineering at Shanghai Jiao Tong University noted at an industry forum that the V100, which laid the foundation of Nvidia's AI-era computing power, uses Tensor Core units operating on 4x4 matrix blocks; some manufacturers use larger matrix blocks to reach higher efficiency and compute, while Google's TPU and Tesla's FSD chip use systolic arrays to make the chips more efficient.
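As an illustration of matrix-block execution, here is a minimal NumPy sketch that decomposes a matmul into 4x4 block multiply-accumulates, the granularity a V100-era Tensor Core instruction handles. Real hardware, whether Tensor Cores or a TPU's systolic array, executes these blocks in fixed pipelined units rather than Python loops; only the decomposition of the work is shown here.

```python
import numpy as np

T = 4  # tile edge, matching the 4x4 Tensor Core granularity mentioned above

def tiled_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute C = A @ B one 4x4 block multiply-accumulate at a time."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % T == 0 and m % T == 0 and k % T == 0
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, T):
        for j in range(0, m, T):
            for p in range(0, k, T):
                # one "tensor op": a 4x4x4 block multiply-accumulate
                C[i:i+T, j:j+T] += A[i:i+T, p:p+T] @ B[p:p+T, j:j+T]
    return C

A = np.random.rand(8, 8).astype(np.float32)
B = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```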
Google's TPU, Meta's MTIA v1, and Groq's LPU are all ASICs (Application-Specific Integrated Circuits). As the reporter understands it, GPUs offer strong generality and flexibility as processors, though limited hardware-level specialization, whereas ASICs bake the algorithm into the hardware: less flexible, but in theory capable of better energy efficiency and performance than GPUs. Beyond breaking the memory-bandwidth bottleneck through near-memory computing, Groq's official website says its LPU also targets the compute-density bottleneck; for large language models, the LPU's compute exceeds that of GPUs and CPUs.
How do these ASICs actually perform? PyTorch is a deep learning framework that can use Nvidia's CUDA to accelerate GPU computation. A researcher who uses both Google TPUs and Nvidia GPUs told reporters that TPUs are driven through the JAX framework, whose open-source ecosystem still trails PyTorch's, so some functions PyTorch already provides must be re-implemented in JAX. For ordinary workloads at modest scale, the performance gap between Nvidia GPUs and Google TPUs is small, but as the machine count grows the TPU's advantages stand out: it is simpler and more efficient, with no need for extra engineering optimization.
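The researcher's point about scaling "without additional engineering optimization" maps to JAX's built-in parallelism primitives: a function is compiled once through XLA and replicated across devices declaratively. A minimal sketch follows (the device count depends on the host; on a multi-GPU box the same code spreads across local GPUs):

```python
import jax
import jax.numpy as jnp

# pmap compiles the function via XLA and replicates it across all local
# devices (TPU cores or GPUs); no hand-written distribution code is needed.
@jax.pmap
def step(x):
    return jnp.tanh(x) * 2.0          # stand-in for a real training step

n = jax.local_device_count()          # e.g. 8 on a TPU v3-8 host
batch = jnp.ones((n, 4))              # leading axis = one shard per device
print(step(batch).shape)              # (n, 4), computed in parallel
```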
Against Nvidia's first-mover advantage, migration is another hurdle for rival AI chipmakers: once a large model runs on Nvidia GPUs, moving it to another AI chip incurs migration costs. But the others are not without answers. The researcher above noted that model code written against CUDA used to be hard to move, yet since PyTorch 1.3 the framework has offered TPU support, and code can be adapted quickly through the PyTorch/XLA compiler backend. A large model running on Nvidia GPUs thus need not be rewritten wholesale to move to TPUs, though the current limitation is that migrated code can still hit problems in large-scale cluster training.
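Concretely, the PyTorch/XLA path keeps the model code intact and mainly swaps the device, as in this minimal sketch (assuming the torch_xla package and a TPU runtime are installed):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                     # the attached TPU device
model = torch.nn.Linear(10, 2).to(device)   # unchanged PyTorch model code
x = torch.randn(4, 10).to(device)

loss = model(x).sum()
loss.backward()
xm.mark_step()  # cut the lazily recorded graph and dispatch it via XLA
```

The design choice here is lazy tracing: torch_xla records operations into a graph and only compiles and runs it at mark_step, which is what lets ordinary CUDA-era PyTorch code run on TPUs with few changes.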
OpenAI is also working to erode Nvidia's software advantage and draw more AI chipmakers into the competition. In 2021 it released the open-source Triton 1.0, a Python-like language meant to let researchers without CUDA experience write efficient GPU code. At AMD's launch event late last year, OpenAI announced that Triton would begin supporting the AMD ecosystem, including the MI300, from the upcoming 3.0 release.
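For a flavor of what Triton offers, below is the canonical minimal kernel, essentially the vector-add example from Triton's own tutorials: the body reads like Python/NumPy yet compiles to GPU code, with no CUDA C required. An Nvidia GPU is assumed here; AMD support arrives with the 3.0 work mentioned above.

```python
import torch
import triton
import triton.language as tl

# Each program instance handles one BLOCK of elements of the output.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                      # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)   # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```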
The soaring stock price shows the market is still bullish on Nvidia, but the competition will not stop. Looking ahead, AI chips still hold many possibilities.