The latest test of speed in training an artificial intelligence (AI) neural network is only partly about the fastest chips from Nvidia, AMD, and Intel. Increasingly, speed is also about the connections between those chips, the computer networking approaches that have become a battleground for vendors and technologies.
MLCommons, which benchmarks AI systems, on Wednesday announced the latest scores by Nvidia and others for what’s called MLPerf Training, a twice-yearly report of how long it takes, in minutes, to train a neural network such as a large language model (LLM) “to convergence,” meaning until the neural network can perform to a specified level of accuracy.
The latest results show how large AI systems have become. The scaling of chips and related components is making AI computers ever more dependent on the connections between those chips.
This round, version 5.0, is the twelfth installment of the training test. In the six years since the first round, version 0.5, the largest submitted systems have soared from 32 GPU chips to 8,192.
Because AI systems are scaling to thousands of chips, and, in the real world, to tens of thousands, hundreds of thousands, and eventually millions of GPU chips, “the network, and the configuration of the network, and the algorithms used to map the problem onto the network, become more significant,” said David Kanter, head of MLCommons, in a media briefing to discuss the results.
Much of AI is a matter of simple math: linear algebra operations, such as a vector multiplied by a matrix. The magic happens when those operations are performed in parallel across many chips, with different versions of the data.
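For the curious, here is a minimal sketch in Python of the kind of operation involved, using NumPy rather than any particular vendor’s GPU stack; the sizes are arbitrary, chosen only for illustration:

```python
import numpy as np

# The core operation of a neural-network layer: an input vector
# multiplied by a matrix of learned weights.
x = np.random.rand(4)      # input vector with 4 features
W = np.random.rand(4, 3)   # weight matrix mapping 4 inputs to 3 outputs

y = x @ W                  # the linear-algebra workhorse
print(y.shape)             # (3,)
```

At scale, the same multiply is repeated billions of times, which is why spreading the work across many chips, and connecting those chips well, matters so much.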
“One of the simplest ways to do that is with something called data parallelism, where you have the same [AI] model on multiple nodes,” said Kanter, referring to parts of a multi-chip computer, called nodes, that can function independently of one another. “Then the data just comes in, and then you communicate those results,” across all parts of the computer, he said.
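As a rough illustration of data parallelism, here is a toy Python sketch that simulates the nodes as plain NumPy arrays; the node count, model, and learning rate are invented for the example and bear no relation to any MLPerf submission:

```python
import numpy as np

np.random.seed(0)
NUM_NODES = 4                        # simulated nodes, each with the same model
W = np.random.rand(8, 1)             # identical model weights on every node

# Data parallelism: split one large batch of data across the nodes.
data = np.random.rand(64, 8)
targets = np.random.rand(64, 1)
data_shards = np.split(data, NUM_NODES)
target_shards = np.split(targets, NUM_NODES)

# Each node computes a gradient from its own shard of the data...
local_grads = []
for X, t in zip(data_shards, target_shards):
    pred = X @ W
    local_grads.append(2 * X.T @ (pred - t) / len(X))  # mean-squared-error gradient

# ...then those results are communicated and averaged across the nodes,
# so every copy of the model applies the same update.
global_grad = sum(local_grads) / NUM_NODES
W -= 0.01 * global_grad
```

The averaging step at the end is where the network comes in: on real hardware, it is a collective operation carried out over the links between the chips.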
“Networking is quite intrinsic to this,” added Kanter. “You’ll often see different communications algorithms that get used for different topologies and different scales,” referring to the arrangement of chips and how they’re connected, the compute “topology.”
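One such communication algorithm is the “all-reduce” collective, which sums a value held on every node and leaves the total on all of them. A toy, unoptimized version arranged around a ring topology might look like the following; real libraries pipeline the data in chunks and adapt to the physical network, so this is only a sketch of the idea:

```python
def ring_allreduce(values):
    """Toy all-reduce over a ring: each node repeatedly forwards a message
    to its right-hand neighbor and accumulates what it receives, so after
    n-1 steps every node holds the sum of all n starting values."""
    n = len(values)
    sums = list(values)   # running total held at each node
    msgs = list(values)   # the message each node will forward next
    for _ in range(n - 1):
        # Every node passes its current message one hop around the ring...
        msgs = [msgs[(i - 1) % n] for i in range(n)]
        # ...and adds the message it just received to its running total.
        sums = [s + m for s, m in zip(sums, msgs)]
    return sums

print(ring_allreduce([1.0, 2.0, 3.0, 4.0]))  # [10.0, 10.0, 10.0, 10.0]
```

Which variant runs fastest depends on exactly the factors Kanter describes: the topology of the links and the scale of the system.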
The largest system in this training round, with 8,192 chips, was submitted by Nvidia, whose chips, as usual, turned in the fastest scores for all of the benchmark tests. Nvidia’s machine was built using its most common part in production, the H100 GPU, in conjunction with 2,048 Intel CPU chips.
A more powerful system, however, made its debut: Nvidia’s combined CPU-GPU part, the Grace Blackwell 200 (GB200). It was entered into the test in a joint effort between IBM and AI cloud-hosting giant CoreWeave, in the form of a machine taking up a whole equipment rack, called the NVL72.
The largest configuration submitted by CoreWeave and IBM carries 2,496 Blackwell GPUs and 1,248 Grace CPUs. (While the GB200 NVL72 was submitted by IBM and CoreWeave, the machine’s design belongs to Nvidia.)
The benchmark drew a record 201 performance submissions from 20 organizations: Nvidia, Advanced Micro Devices, ASUSTeK, Cisco Systems, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, Hewlett Packard Enterprise, IBM, Krai, Lambda, Lenovo, MangoBoost, Nebius, Oracle, Quanta Cloud Technology, SCITIX, Supermicro, and TinyCorp.
The latest round of the benchmark consisted of seven individual tasks, including training the BERT large language model and training the Stable Diffusion image-generation model.
This round saw the addition of a new test of speed: how long it takes to fully train Meta Platforms’ Llama 3.1 405B large language model. That task was completed in just under 21 minutes on the fastest system, Nvidia’s 8,192-chip H100 machine. The Grace Blackwell system with 2,496 GPUs was not far behind, at just over 27 minutes.
Full results and specs of the machines can be seen on the MLCommons site.
Within those numbers, there is no exact measure of how much of a role networking plays in giant systems. But results from one generation of MLPerf to the next show improvement on the same benchmarks even with the same number of chips, a hint that the gains come from more than the processors alone.