The market for serving predictions from generative artificial intelligence, what’s known as inference, is big business, with OpenAI reportedly on course to collect $3.4 billion in revenue this year from ChatGPT.
With a pie that big for inference, there is plenty of room for challengers.
Also: AI engineering is the next frontier for technological advances
On Tuesday, AI chip maker Cerebras Systems of Sunnyvale, California, debuted its AI inference service, which it claims is the fastest in the world and, in many cases, ten to twenty times faster than systems built using the dominant technology, Nvidia’s H100 “Hopper” graphics processing unit, or GPU.
“We have never seen a technology market growing this fast,” said Cerebras cofounder and CEO Andrew Feldman at a press conference in San Francisco. “We intend to take meaningful share.”
Nvidia currently dominates both the market for training neural nets, including generative AI, and the market for accelerator chips that perform inference.
Cerebras’s plan of attack is a bit of a pivot for the eight-year-old company. Since introducing its first AI computer in 2019, the company has focused on selling machines to challenge Nvidia in the training of neural nets. The new service puts those machines behind the scenes, creating a revenue model based not on machine sales but on volume of transactions.
Also: The best free AI courses in 2024
Cerebras has set up its own inference data centers in multiple locations and will rent inference capacity for a fee on a per-query basis. It will also sell its CS-3 computers to companies that wish to perform inference on-premise, either managed by the customer or as a service managed by Cerebras.
The Cerebras CS-3 computer, a complete system containing the world’s largest computer chip, the WSE-3, produces inference results when prompted that are “the fastest in the industry, bar none, not by a little bit, but by a lot,” said Feldman.
Feldman bills the service as twenty times as fast as inference services run by Microsoft Azure, Amazon AWS, and several others, as measured by the number of tokens per second that can be generated in the answer for each user.
In a vivid demonstration for the press, Feldman ran identical prompts side by side on Cerebras inference and on Amazon’s AWS, among others. The Cerebras job finished almost instantaneously, processing at a rate of 1,832 tokens per second, while the competing service limped along at only 93 tokens per second. The AWS side continued to drag on, taking several seconds to deliver the finished chat output, a familiar feeling for anyone using ChatGPT and its ilk.
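At those measured rates, the gap is easy to quantify. A back-of-envelope sketch (the 400-token answer length is an assumption for illustration, not a figure from the demo):

```python
def generation_time(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream a completion at a steady per-user token rate."""
    return num_tokens / tokens_per_second

# Rates as measured in the demo; answer length assumed for illustration.
cerebras_seconds = generation_time(400, 1832)  # ~0.22 seconds
aws_seconds = generation_time(400, 93)         # ~4.3 seconds
speedup = aws_seconds / cerebras_seconds       # ~19.7x, consistent with the "twenty times" claim
```

The same ratio holds regardless of answer length, since both services stream at a roughly constant per-user rate.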
“Everybody is south of about 300 tokens per second per user,” noted Feldman.
“We are in the dial-up era of Gen AI inference,” Feldman quipped to the journalists, and he played the sound of an old dial-up modem, while the AWS service struggled to finish the task, to much laughter from the press.
Feldman called the Cerebras speed “GPU-impossible speed.” He noted that the service is ten times faster than an 8-way Nvidia DGX computer system.
The service is available in free, pay-as-you-go, and “provisioned throughput” versions for customers who need guaranteed inference performance. (You can try out the service for free on Cerebras’s website by providing your Gmail or Microsoft cloud login.)
Also: How I used ChatGPT to scan 170k lines of code in seconds and save me hours of detective work
The greater efficiency of the service, said Feldman, brings enormous cost benefits: the Cerebras offering delivers “100x higher price-performance for AI workloads” compared to AWS and the rest. The service is priced at 60 cents per million tokens to run Meta’s Llama 3.1 70B open-source large language model, for example; the same workload costs $2.90 per million tokens from the average cloud provider.
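Read as dollars per million tokens, the standard unit for LLM API pricing, the gap compounds quickly at volume. A quick sketch of the arithmetic (the monthly token volume is an assumed workload for illustration):

```python
def monthly_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of an inference workload at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens

# One billion tokens per month through Llama 3.1 70B (workload assumed).
tokens = 1_000_000_000
cerebras_usd = monthly_cost(tokens, 0.60)       # $600
typical_cloud_usd = monthly_cost(tokens, 2.90)  # ~$2,900
```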
But the speed to get the answer is not the only angle.
In a clever twist on the speed game, Feldman and chief technologist Sean Lie, who joined the same press briefing, made a compelling case that saving time on tasks also enables a qualitative leap in the kinds of inference that are possible, from multiple-query tasks to real-time, interactive voice response that would be impossible at typical inference speeds.
Feldman said to think about accuracy in a language model. Because such models can suffer hallucinations, the first answer can very often be inaccurate. Multiple prompts may be required to force the model to check its output. Adding “retrieval-augmented generation,” where the model taps into an external database, adds further work.
Also: Want to work in AI? How to pivot your career in 5 steps
If all those steps can be completed faster than is normally possible, a Cerebras query can achieve a multi-turn result that is more accurate in the same amount of time the existing inference services are still trying to complete the original prompt.
“If instead you use what’s called chain-of-thought prompting, and you ask it [the chatbot] to show its work, and then respond in one word, you get a longer answer,” said Feldman. “It turns out,” he said, the longer answer arrived at via chain-of-thought is the correct answer, and the result is “you’ve converted speed into accuracy. By asking it to use a more thorough and rigorous process, you’re able to get a better answer.
“Speed converts to quality: More powerful answer, more relevant answer, so, not just faster response times.”
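The multi-turn pattern Feldman describes can be sketched as plain message construction. Everything below is a hypothetical illustration (no real API client is used), showing only the extra turns a chain-of-thought follow-up adds to a single question:

```python
def build_chain_of_thought_turns(question: str) -> list[dict]:
    """Assemble the extra user turns that trade speed for accuracy:
    ask for worked-out reasoning first, then a short final answer."""
    return [
        {"role": "user", "content": question},
        {"role": "user", "content": "Show your work step by step before answering."},
        {"role": "user", "content": "Now state the final answer in one word."},
    ]

turns = build_chain_of_thought_turns("Is 97 a prime number?")
```

Each extra turn is another full pass through the model, which is why the per-token generation rate determines whether the whole chain still feels interactive to the user.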
More cost-efficient inference could have numerous implications for quality of query and response, said Feldman, such as expanding the “context window,” the number of input tokens the model can support. Expanding the context window can make possible interactive discussions of long documents or multiple-document comparisons.
Ultimately, it could power “agentic” forms of Gen AI, an increasingly popular approach where the AI model must call into play multiple external sources of truth and even whole applications that work to assemble the correct answer.
Also: The best AI for coding in 2024 (and what not to use)
“You can create agentic models that do 10 times as much work,” said Feldman, “and they’re likely to produce vastly better, more useful answers.”
In another vivid demonstration, Russ d’Sa, cofounder and CEO of venture-backed startup LiveKit, showed a voice-enabled agent that could respond instantaneously to spoken prompts.
“I’m giving a speech in San Francisco. What are some things I can do after my talk?” d’Sa asked the chatbot.
“San Francisco is a great city. So you just gave a talk. Well, you’ve got a lot of options…”, the bot promptly replied.
d’Sa proceeded to interrupt the AI agent multiple times, sometimes changing the subject or asking new questions, like a conversation where one party dominates. The AI agent was able to respond smoothly each time.
Also: How does Claude work? Anthropic reveals its secrets
“The rate at which these tokens come out matters a lot for latency with this kind of use case,” explained d’Sa. “Incredible, incredible speed. This is performance that’s under 400 milliseconds for total response time in terms of turns that the AI is talking with you.
“It’s state of the art, really, in terms of speed, and it’s all really thanks to Cerebras,” said d’Sa. “So, pretty amazing.”
The speed and the cost advantages of the inference service derive principally from the design of the company’s WSE-3 chip, the third generation of Cerebras’s processor, unveiled this year. Because of the chip’s enormous size — it is almost the entire surface of a normal twelve-inch semiconductor wafer — the chip has almost 900 times as much on-chip memory as a standard Nvidia GPU. It has 7,000 times as much memory bandwidth, the rate of movement in and out of memory.
“Memory bandwidth is important because memory bandwidth is the fundamental limiter of inference performance of language models,” explained Feldman.
An AI model with 70 billion parameters, such as Meta’s Llama 3.1 70B, has to pass every token of input through all 70 billion weights. At sixteen bits of data, or two bytes, per weight, that’s 140 gigabytes of memory to hold all the weights. To pass a thousand tokens through those weights, the data moved through memory balloons to 140 terabytes.
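The arithmetic in that paragraph checks out directly:

```python
params = 70_000_000_000   # Llama 3.1 70B weight count
bytes_per_weight = 2      # 16-bit (fp16/bf16) precision

weight_bytes = params * bytes_per_weight       # 140 GB just to hold the model
traffic_1000_tokens = weight_bytes * 1_000     # 140 TB moved through memory
```

Note that the 140 terabytes is memory traffic, not storage: generating each token re-reads the full weight set, so the total scales with the number of tokens.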
Also: How I test an AI chatbot’s coding ability – and you can, too
The Cerebras chip, with forty-four gigabytes of fast, on-chip memory, can keep far more of that data on the chip, next to the circuits that operate on it. And with 21 petabytes per second of memory bandwidth, the chip can move data in and out of memory far faster than a GPU, and coordinate efficiently among multiple CS-3 machines, whereas GPU-based systems spend more of their time simply fetching from memory.
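For a bandwidth-bound model, those figures set a hard ceiling on single-user token rate, since every generated token must stream all the weights past the compute. A rough roofline estimate (the H100 bandwidth figure, roughly 3.35 terabytes per second for the SXM part, is a published spec and not from the article):

```python
def max_tokens_per_second(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Upper bound on single-stream decode rate when weight traffic dominates."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 140e9  # Llama 3.1 70B at 16 bits per weight

wse3_ceiling = max_tokens_per_second(21e15, model_bytes)    # ~150,000 tokens/s
h100_ceiling = max_tokens_per_second(3.35e12, model_bytes)  # ~24 tokens/s per GPU
```

Real systems fall well short of these ceilings, and GPUs are ganged together to raise theirs, but the ratio of the two bounds illustrates why Cerebras frames the limit as one of memory bandwidth rather than raw compute.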
“This is the essence of where the advantage comes from,” said Feldman. GPU chips often use only a quarter of their theoretical bandwidth, the company contends, keeping circuits waiting on data.
(Lie, speaking at the Hot Chips technology conference on Tuesday, on Stanford University’s campus, gave the audience an even more extensive explanation of the technical ins and outs.)
Both Feldman and Lie emphasized an important point: the same WSE-3 chip now serving inference was originally designed for neural net training, and that original design proved powerful enough to deliver superior performance at both tasks.
In repurposing the WSE-3 training chip for an inference purpose, Cerebras has, in a sense, come full circle, the company’s senior vice president of products and strategy, Andy Hock, told ZDNET.
The original WSE chip, in 2019, was conceived as a “data flow architecture,” where the neural “weights,” or parameters, of an AI model would be kept on the chip and the data for training would be streamed through those weights, adjusting the weights with each new data point.
Then, Cerebras introduced auxiliary computers, Swarm-X and Memory-X, in 2020, to store and move weights off-chip and move them to multiple WSE processors as needed, in order to compute training runs for larger and larger AI models in a parallel, distributed fashion.
With the inference task, Cerebras has returned to the data flow perspective, where the weights stay on-chip and the input data for inference is streamed through the chips’ circuits, being modified by the model weights to produce the final output, the prediction.
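A toy sketch of that distinction, counting only off-chip memory traffic: in the data-flow (weight-stationary) approach, the weights load into on-chip memory once and tokens stream through; when weights live off-chip, every generated token re-reads the full weight set. The function names and the per-token activation size are illustrative assumptions, and the model ignores real-world details such as the KV cache:

```python
TOKEN_ACTIVATION_BYTES = 16_384  # per-token activation size; illustrative only

def offchip_traffic_weight_stationary(model_bytes: int, num_tokens: int) -> int:
    """Data-flow inference: weights cross the memory boundary once,
    then only token activations move on and off the chip."""
    return model_bytes + num_tokens * TOKEN_ACTIVATION_BYTES

def offchip_traffic_weight_streaming(model_bytes: int, num_tokens: int) -> int:
    """Weights held off-chip: each generated token re-reads the full weight set."""
    return num_tokens * model_bytes
```

For a 140-gigabyte model and a thousand tokens, the first approach moves roughly 140 gigabytes off-chip while the second moves roughly 140 terabytes, which is the gap the article's memory arithmetic describes.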
“We were able to pivot before, and then pivot back again,” said Hock.
The comparisons provided by Cerebras are all based on Nvidia’s current mainstream chip, the H100, and systems based on it. The company has not yet compared its inference performance to Nvidia’s newer Blackwell chip, said CTO Lie.
The Blackwell part will be twice as fast as the H100, said Lie, but he expects it will still trail the Cerebras system.
All of the demos were done with two open-source models, Meta’s Llama 3.1 8B and 70B. The company, said Lie, has tested inference for Meta’s larger 405B model. However, such very large models are currently cost-prohibitive throughout the industry for inference, he said.
“The natural question that, actually, the whole community is even asking right now is, well, Can I actually do that with a smaller model?” Lie said.
On its face, the inference service is a commodity business, a fact Feldman concedes. Competing on price and speed alone isn’t always a winning strategy for a profitable business. However, he expects over time more and more work will be in the area of complex, multi-faceted, and agentic AI, where Cerebras shines.
“If you imagine the work along the X axis being slower at one end and faster and more complex at the other end,” said Feldman, “it’s definitely a commodity business running lots and lots of slow jobs at one end,” the kinds of everyday tasks that people currently do with ChatGPT and the like, such as helping with crafting their resume.
“But at the other end, fast, long workloads, that is not at all commodity, that is very sophisticated,” he said. “To the extent the industry shifts to those faster, more complex types of work, that’s where we win.”