![Cover Image for A look at Nvidia's incredible GPU driving DeepSeek's global AI ambition.](https://res.cloudinary.com/dcj0jkqds/image/upload/v1738376904/posts_previews/mtocp6ku5ehhdksthfci.jpg)
A look at Nvidia's incredible GPU driving DeepSeek's global AI ambition.
DeepSeek's V3 model was trained using 2,048 Nvidia H800 GPUs.
Nvidia launched the H800, a cut-down version of the H100, in March 2023. It is significantly slower than Nvidia's H200 and AMD's Instinct range, but those very limitations have pushed DeepSeek's engineers to innovate.
The United States was widely expected to keep its position as the world's leading power in artificial intelligence, especially after President Donald Trump announced the Stargate Project, a $500 billion initiative to strengthen the country's AI infrastructure. The emergence of China's DeepSeek, however, has altered that landscape, wiping roughly one trillion dollars off the value of U.S. tech stocks, with Nvidia among the hardest hit.
The secretive nature of technological development in China makes reliable information hard to come by; however, a technical paper published days before the release of DeepSeek's chat model offers some clues about the technology behind its ChatGPT equivalent. In 2022, the United States blocked the export of advanced Nvidia GPUs to China in an attempt to control access to critical AI technologies, but this has not stopped DeepSeek.
The document reveals that the company trained its V3 model on a cluster of 2,048 Nvidia H800 GPUs, a restricted version of the H100. Launched in March 2023 to comply with U.S. export restrictions on China, the H800 offers 80GB of HBM3 memory and 2TB/s of memory bandwidth. While it outperforms earlier generations, it falls short of the H200, which provides 141GB of HBM3e and 4.8TB/s, and AMD's Instinct MI325X goes further still, with 256GB of HBM3e and 6TB/s.
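To put those figures side by side, here is a small illustrative Python snippet that restates the memory and bandwidth numbers quoted above and computes the bandwidth gap relative to the H800 (the specs come from the text; the ratios are a rough comparison, not a benchmark):

```python
# Side-by-side view of the memory and bandwidth figures cited in the article.
gpus = {
    "Nvidia H800":         {"memory_gb": 80,  "bandwidth_tb_s": 2.0},
    "Nvidia H200":         {"memory_gb": 141, "bandwidth_tb_s": 4.8},
    "AMD Instinct MI325X": {"memory_gb": 256, "bandwidth_tb_s": 6.0},
}

h800_bw = gpus["Nvidia H800"]["bandwidth_tb_s"]
for name, spec in gpus.items():
    ratio = spec["bandwidth_tb_s"] / h800_bw
    print(f"{name}: {spec['memory_gb']} GB, {spec['bandwidth_tb_s']} TB/s "
          f"({ratio:.1f}x the H800's bandwidth)")
```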
Each node in the cluster DeepSeek used for training contains 8 GPUs interconnected via NVLink and NVSwitch, while communication between nodes runs over InfiniBand. The H800's NVLink bandwidth is lower than the H100's, which hurts performance whenever many GPUs have to exchange data.
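As a rough illustration of what such a topology looks like from the software side, here is a minimal PyTorch sketch of a multi-node setup using the NCCL backend, which routes intra-node traffic over NVLink and inter-node traffic over InfiniBand. It assumes the standard torchrun environment variables and is not DeepSeek's actual training code, which involves far more elaborate parallelism and quantization:

```python
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each process.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL uses NVLink/NVSwitch for GPUs on the same node and InfiniBand
    # for traffic between nodes, so a single backend covers both paths.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = init_distributed()
    # 2,048 GPUs at 8 per node would correspond to 256 nodes (world_size = 2048).
    if rank == 0:
        print(f"{world_size} processes across {world_size // 8} nodes of 8 GPUs")
    dist.destroy_process_group()
```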
The DeepSeek-V3 model required a total of 2.79 million GPU hours for its pre-training and fine-tuning, processing 14.8 trillion tokens through a combination of data parallelism, memory optimizations, and innovative quantization techniques. According to one analysis, at a cost of $2 per GPU-hour in China, training V3 would have cost around $5.58 million.
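That estimate is simple back-of-the-envelope arithmetic, reproduced below with the $2-per-GPU-hour figure the analysis assumes:

```python
# Back-of-the-envelope reproduction of the training cost estimate cited above.
gpu_hours = 2.79e6        # total H800 GPU hours for pre-training and fine-tuning
cost_per_gpu_hour = 2.0   # assumed rental price per GPU-hour in China, USD
total_cost = gpu_hours * cost_per_gpu_hour
print(f"${total_cost:,.0f}")  # -> $5,580,000, i.e. ~$5.58 million
```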