Home GADGETS First in-depth look at Elon Musk’s 100,000 GPU AI cluster — xAI...

GADGETS

First in-depth look at Elon Musk’s 100,000 GPU AI cluster — xAI Colossus reveals its secrets

October 29, 2024

Elon Musk’s new expensive project, the xAI Colossus AI supercomputer, has been detailed for the first time. YouTuber ServeTheHome was granted access to the Supermicro servers within the 100,000 GPU beast, showing off several facets of the supercomputer. Musk’s xAI Colossus supercluster has been online for almost two months, after a 122-day assembly.

Inside the World’s Largest AI Supercluster xAI Colossus – YouTube
First in-depth look at Elon Musk’s 100,000 GPU AI cluster — xAI Colossus reveals its secrets

First in-depth look at Elon Musk’s 100,000 GPU AI cluster — xAI Colossus reveals its secrets

Watch On

What’s Inside a 100,000 GPU Cluster

Patrick from ServeTheHome takes a camera around several parts of the server, providing a birds-eye view of its operations. The finer details of the supercomputer, like its power draw and pump sizes, could not be revealed under a non-disclosure agreement, and xAI blurred and censored parts of the video before its release. The most important things, like the Supermicro GPU servers, were left mostly intact in the footage above.

The GPU servers are Nvidia HGX H100s, a server solution containing eight H100 GPUs each. The HGX H100 platform is packaged inside Supermicro’s 4U Universal GPU Liquid Cooled system, providing easy hot-swappable liquid cooling to each GPU. These servers are loaded inside racks which hold eight servers each, making 64 GPUs per rack. 1U manifolds are sandwiched between each HGX H100, providing the liquid cooling the servers need. At the bottom of each rack is another Supermicro 4U unit, this time with a redundant pump system and rack monitoring system.

Image 1 of 2

Four banks of xAI's HGX H100 server racks, holding eight servers each. — (Image credit: ServeTheHome)

These racks are paired in groups of eight, making 512 GPUs per array. Each server has four redundant power supplies, with the rear of the GPU racks revealing 3-phase power supplies, Ethernet switches, and a rack-sized manifold providing all of the liquid cooling. There are over 1,500 GPU racks within the Colossus cluster, or close to 200 arrays of racks. According to Nvidia CEO Jensen Huang, the GPUs for these 200 arrays were fully installed in only three weeks.

Because of the high-bandwidth requirements of an AI supercluster constantly training models, xAI went beyond overkill for its networking interconnectivity. Each graphics card has a dedicated NIC (network interface controller) at 400GbE, with an extra 400Gb NIC per server. This means that each HGX H100 server has 3.6 Terabit per second ethernet. And yes, the entire cluster runs on Ethernet, rather than InfiniBand or other exotic connections which are standard in the supercomputing space.

Image 1 of 2

A shot looking up at the waves upon waves of yellow Ethernet cables connecting the xAI Colossus cluster to itself. Multiple layers of excessively wide cable runs are recessed into the ceiling. — (Image credit: ServeTheHome)

Of course, a supercomputer based on training AI models like the Grok 3 chatbot needs more than just GPUs to function. Details on the storage and CPU computer servers in Colossus are more restricted. From what we can see in Patrick’s video and blog post, these servers are also mostly in Supermicro chassis. Waves of NVMe-forward 1U servers with some kind of x86 platform CPU inside hold either storage and CPU compute, also with rear-entry liquid cooling.

Outside, some heavily bundled banks of Tesla Megapack batteries are seen. The start-and-stop nature of the array with its milliseconds of latency between banks was too much for the power grid or Musk’s diesel generators to handle, so some amount of Tesla Megapacks (holding up to 3.9 MWh each) are used as an energy buffer between the power grid and the supercomputer.

Colossus’s Use, and Musk’s Supercomputer Stable

The xAI Colossus supercomputer is currently, according to Nvidia, the largest AI supercomputer in the world. While many of the world’s leading supercomputers are research bays usable by many contractors or academics for studying weather patterns, disease, or other difficult compute tasks, Colossus is solely responsible for training X’s (formerly Twitter) various AI models. Primarily Grok 3, Elon’s “anti-woke” chatbot only available to X Premium subscribers. ServeTheHome was also told that Colossus is training AI models “of the future”; models whose uses and abilities are supposedly beyond the powers of today’s flagship AI.

Colossus’s first phase of construction is complete and the cluster is fully online, but it’s not all done. The Memphis supercomputer will soon be upgraded to double its GPU capacity, with 50,000 more H100 GPUs and 50,000 next-gen H200 GPUs. This will also more than double its power consumption, which is already too much for Musk’s 14 diesel generators added to the site in July to handle. It also falls below Musk’s promise of 300,000 H200s inside Colossus, though that may become phase 3 of upgrades.

The 50,000 GPU Cortex supercomputer in the “Giga Texas” Tesla plant is also under a Musk company. Cortex is devoted to training Tesla’s self-driving AI tech through camera feed and image detection alone, as well as Tesla’s autonomous robots and other AI projects. Tesla will also soon see the construction of the Dojo supercomputer in Buffalo, New York, a $500 million project coming soon. With industry speculators like Baidu CEO Robin Le predicting that 99% of AI companies will crumble when the bubble pops, it remains to be seen if Musk’s record-breaking AI spending will backfire or pay off.

Source link

First in-depth look at Elon Musk’s 100,000 GPU AI cluster — xAI Colossus reveals its secrets

What’s Inside a 100,000 GPU Cluster

EDITOR PICKS

Gymkhana Trailer: Knockout Fun Punch

Fire breaks out at software office in Madhapur | Hyderabad News

Curvv CNG Launch Is Possible

Case filed on Thalapathy Vijay for manhandling

Reassign sanitation workers exclusively to cleaning duties, Guntur Municipal Commissioner directs officials

iOS 26: All the New Apple Intelligence Features

Cyber Crime Police Station inaugurated in Warangal

Exclusive Interview with Mudavath Kishan, Additional Deputy Commissioner, Road Transport Department, Mahabubnagar

Propose Day 2025: Best Wishes, Quotes, Messages & Romantic Proposal Ideas – Trending News

Stephen F. Austin fires coach Kyle Keller after nine seasons, 1-7 start in conference...

German woman alleges ‘rape’ by car driver in Hyderabad; one arrested | Latest News...

Prabhas and Vishnu scenes will be remarkable

AMD’s Navi 48 GPU pictured: around 390 mm2, targeting mainstream gamers

Cops are concerned after iPhone units are mysteriously rebooting so they can’t be unlocked

EVEN MORE NEWS

‘Sindh may return to India’: Rajnath says ‘borders can change’; cites...

Apple’s 11th Gen iPad Drops to $279 for Black Friday

Dhanush’s Tere Ishq Mein gets a poetic title in Telugu

POPULAR CATEGORY