xAI's Colossus supercomputer cluster runs 100,000 Nvidia Hopper GPUs, connected by Nvidia's Spectrum-X Ethernet networking platform.
The Colossus site was built in just 122 days.
Nvidia has shared details of its collaboration with xAI on Colossus, a supercomputer that links 100,000 Hopper GPUs. The cluster is tied together by Nvidia's Spectrum-X, an Ethernet networking platform designed for large-scale AI factories.
Since coming online, Colossus has been used to train xAI's Grok series of language models, which power the chatbots available to X users. The facility was built in 122 days, and xAI is now expanding it, with plans to double the number of Hopper GPUs to 200,000.
The Grok models are extraordinarily large. Grok-1 has 314 billion parameters, and Grok-2 outperformed competitors such as Claude 3.5 Sonnet and GPT-4 Turbo when it launched in August 2024. Training models of this size places heavy demands on the network. With Spectrum-X, xAI reported no application latency degradation or packet loss caused by "flow collisions", a common problem in AI networks. Thanks to Spectrum-X's congestion control, the system has sustained 95% data throughput, a level that Nvidia says standard Ethernet cannot reach at this scale.
A representative from xAI emphasized that the combination of Hopper GPUs and Spectrum-X has allowed the company to "push the boundaries of AI model training," thus transforming its operation into a "supercharged, optimized AI factory."
Nvidia developed the Spectrum-X platform to improve the performance, security, scalability, and cost efficiency of AI workloads. The platform includes the Spectrum SN5600 Ethernet switch, which is based on the Spectrum-4 ASIC and supports port speeds of up to 800 Gb/s. xAI pairs the SN5600 with Nvidia's BlueField-3 SuperNICs to further improve performance.
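To put the reported figures in perspective, the short Python sketch below works through the arithmetic of moving one copy of Grok-1's weights over a single 800 Gb/s port at 95% throughput. It rests on several assumptions that do not come from Nvidia or xAI: that the 95% figure can be applied to a single port at its maximum speed, that each parameter is stored in 16 bits, and that the transfer has no other overhead. The function names are illustrative, and the result is a rough estimate rather than a description of how Colossus actually moves data.

```python
def effective_throughput_gbps(line_rate_gbps: float, utilization: float) -> float:
    """Usable bandwidth in Gb/s for a link running at the given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return line_rate_gbps * utilization


def transfer_seconds(payload_gigabytes: float, line_rate_gbps: float,
                     utilization: float) -> float:
    """Time to move a payload over one link, ignoring protocol overhead."""
    payload_gigabits = payload_gigabytes * 8  # 8 bits per byte
    return payload_gigabits / effective_throughput_gbps(line_rate_gbps, utilization)


PORT_SPEED_GBPS = 800.0      # maximum SN5600 port speed cited in the article
REPORTED_UTILIZATION = 0.95  # throughput figure cited in the article
GROK1_PARAMETERS = 314e9     # Grok-1 parameter count cited in the article
BYTES_PER_PARAMETER = 2      # assumption: 16-bit weights (not stated in the article)

# Size of one full copy of the weights, in gigabytes (10**9 bytes).
weights_gigabytes = GROK1_PARAMETERS * BYTES_PER_PARAMETER / 1e9

usable_gbps = effective_throughput_gbps(PORT_SPEED_GBPS, REPORTED_UTILIZATION)
seconds = transfer_seconds(weights_gigabytes, PORT_SPEED_GBPS, REPORTED_UTILIZATION)

print(f"Weights: {weights_gigabytes:.0f} GB")
print(f"Usable bandwidth: {usable_gbps:.0f} Gb/s")
print(f"Time for one copy over one port: {seconds:.2f} s")
```

Under these assumptions, a single port delivers about 760 Gb/s of usable bandwidth, and one copy of the weights takes a little under seven seconds to cross it. Changing `REPORTED_UTILIZATION` shows how quickly that time grows as throughput drops, which is why congestion control matters when thousands of GPUs exchange data at once.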