
AI startup Cerebras debuts ‘world’s fastest inference’ service – with a twist

Cerebras demonstrated how its AI inference can be 10 to 20 times faster than conventional cloud AI inference services.

Cerebras Systems

The market for serving up predictions from generative artificial intelligence, what’s known as inference, is big business, with OpenAI reportedly on course to collect $3.4 billion in revenue this year from ChatGPT.

With a pie that big for inference, there is plenty of room for challengers. 

Also: AI engineering is the next frontier for technological advances

On Tuesday, AI chip maker Cerebras Systems of Sunnyvale, California, debuted its AI inference service, which it claims is the fastest in the world and, in many cases, ten to twenty times faster than systems built using the dominant technology, Nvidia’s H100 “Hopper” graphics processing unit, or GPU.

“We have never seen a technology market growing this fast,” said Cerebras cofounder and CEO Andrew Feldman in a press conference in San Francisco. “We intend to take meaningful share.”

Nvidia currently dominates the market both for training neural nets, including generative AI, and the sales of accelerator chips for performing inference. 

Cerebras’s plan of attack is a bit of a pivot for the eight-year-old company. Since introducing its first AI computer in 2019, the company has focused on selling machines to challenge Nvidia in the training of neural nets. The new service puts those machines behind the scenes, creating a revenue model based not on machine sales but on volume of transactions.

Also: The best free AI courses in 2024

Cerebras has set up its own inference data centers in multiple locations and will rent inference capacity on a per-query basis. It will also sell its CS-3 computers to companies that wish to perform inference on-premises, either managed by the customer or as a service managed by Cerebras. 

The Cerebras CS-3 computer, a complete system containing the world’s largest computer chip, the WSE-3, produces inference results that are “the fastest in the industry, bar none, not by a little bit, but by a lot,” said Feldman. 

Feldman bills the service as twenty times as fast as inference services run by Microsoft Azure, Amazon AWS, and several others, as measured by the number of tokens per second that can be generated in the answer for each user. 

In a vivid demonstration for the press, Feldman launched identical prompts side by side on Cerebras inference and on Amazon’s AWS and others. The Cerebras job finished almost instantly, processing at a rate of 1,832 tokens per second, while the competing service limped along at only 93 tokens per second, dragging on for several seconds before delivering the finished chat output, a familiar feeling for anyone using ChatGPT and its ilk.

“Everybody is south of about 300 tokens per second per user,” noted Feldman.
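The gap is easy to sanity-check: the time to stream an answer scales inversely with per-user throughput. A quick back-of-the-envelope sketch in Python, using the rates from the demo (the 400-token answer length is an illustrative assumption, not a figure from the event):

```python
# Time to stream a fixed-length answer at the per-user rates from the demo.
cerebras_tps = 1832   # tokens per second, Cerebras (per the demo)
cloud_tps = 93        # tokens per second, competing cloud service (per the demo)
answer_tokens = 400   # illustrative answer length (assumption)

print(f"Cerebras: {answer_tokens / cerebras_tps:.2f} s")  # 0.22 s, effectively instant
print(f"Cloud:    {answer_tokens / cloud_tps:.2f} s")     # 4.30 s, the visible lag
print(f"Speedup:  {cerebras_tps / cloud_tps:.1f}x")       # 19.7x, matching the "twenty times" claim
```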

“We are in the dial-up era of Gen AI inference,” Feldman quipped to the journalists, playing the sound of an old dial-up modem while the AWS service struggled to finish the task, to much laughter from the press. 


Feldman called the Cerebras speed “GPU-impossible speed.” He noted that the service is ten times faster than an 8-way Nvidia DGX computer system. 

The service is available in free, pay-as-you-go, and “provisioned throughput” tiers, the last for customers who need guaranteed inference performance. (You can try the service for free on Cerebras’s website by signing in with a Google or Microsoft account.)

Also: How I used ChatGPT to scan 170k lines of code in seconds and save me hours of detective work

The greater efficiency of the service, said Feldman, brings enormous cost benefits. The Cerebras offering is “100x higher price-performance for AI workloads” than AWS and the rest. Running Meta’s Llama 3.1 70B open-source large language model, for example, is priced at 60 cents per million tokens; the same service costs $2.90 per million tokens from the average cloud provider. 
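One plausible reading of the “100x” figure combines the throughput gap with the price gap. Treating price-performance as throughput divided by price, and plugging in the demo throughput numbers and the quoted prices, lands in that neighborhood; the formula here is an assumption for illustration, not Cerebras’s published methodology:

```python
# Price-performance as throughput divided by price, using figures quoted above.
cerebras_tps, cerebras_price = 1832, 0.60  # tokens/s; $ per million tokens
cloud_tps, cloud_price = 93, 2.90

ratio = (cerebras_tps / cerebras_price) / (cloud_tps / cloud_price)
print(f"{ratio:.0f}x")  # 95x, roughly the "100x higher price-performance" claim
```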


But the speed to get the answer is not the only angle. 

In a clever twist on the speed game, Feldman and chief technologist Sean Lie, taking part in the same press briefing, made a compelling case that saving time on tasks also produces a qualitative leap in the kinds of inference that are possible, from multiple-query tasks to real-time, interactive voice response that would be impossible at typical inference speeds. 

Consider accuracy in a language model, Feldman suggested. Because such models can suffer hallucinations, the first answer is very often inaccurate, and multiple prompts may be required to force the model to check its output. Adding “retrieval-augmented generation,” where the model taps into an external database, adds further work. 

Also: Want to work in AI? How to pivot your career in 5 steps

If all those steps can be completed faster than is normally possible, a Cerebras query can achieve a more accurate, multi-turn result in the time existing inference services take to finish the original prompt. 

“If instead you use what’s called chain-of-thought prompting, and you ask it [the chatbot] to show its work, and then respond in one word, you get a longer answer,” said Feldman. It turns out, he said, that the longer answer arrived at via chain-of-thought is the correct one, and the result is “you’ve converted speed into accuracy. By asking it to use a more thorough and rigorous process, you’re able to get a better answer.

“Speed converts to quality: More powerful answer, more relevant answer, so, not just faster response times.”
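Feldman’s argument can be framed as a time budget: in the window a slower service needs to stream one answer, a faster service can run many full chain-of-thought turns. A minimal sketch, with throughputs taken from the demo and an assumed 400-token turn length:

```python
# How many full turns fit in the time one slow answer takes to stream?
slow_tps, fast_tps = 93, 1832  # tokens per second, from the demo
turn_tokens = 400              # assumed length of one answer or verification turn

budget = turn_tokens / slow_tps               # seconds the slow service needs for one answer
fast_turns = fast_tps * budget / turn_tokens  # full-length turns the fast service fits in
print(f"{fast_turns:.0f} turns")              # 20 turns in the same window
```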


More cost-efficient inference could have numerous implications for quality of query and response, said Feldman, such as expanding the “context window,” the number of input tokens the model can support. Expanding the context window can make possible interactive discussions of long documents or multiple-document comparisons. 

Ultimately, it could power “agentic” forms of Gen AI, an increasingly popular approach where the AI model must call into play multiple external sources of truth and even whole applications that work to assemble the correct answer. 

Also: The best AI for coding in 2024 (and what not to use)

“You can create agentic models that do 10 times as much work,” said Feldman, “and they’re likely to produce vastly better, more useful answers.”

In another vivid demonstration, Russ d’Sa, cofounder and CEO of venture-backed startup LiveKit, showed a voice-enabled agent that could respond instantaneously to spoken prompts. 

“I’m giving a speech in San Francisco. What are some things I can do after my talk?” d’Sa asked the chatbot.

“San Francisco is a great city. So you just gave a talk. Well, you’ve got a lot of options…”, the bot promptly replied.

d’Sa proceeded to interrupt the AI agent multiple times, sometimes changing the subject or asking new questions, like a conversation where one party dominates. The AI agent was able to respond smoothly each time. 

Also: How does Claude work? Anthropic reveals its secrets

“The rate at which these tokens come out matters a lot for latency with this kind of use case,” explained d’Sa. “Incredible, incredible speed. This is performance that’s under 400 milliseconds for total response time in terms of turns that the AI is talking with you.

“It’s state of the art, really, in terms of speed, and it’s all really thanks to Cerebras,” said d’Sa. “So, pretty amazing.”

The speed and cost advantages of the inference service derive principally from the design of the company’s WSE-3 chip, the third generation of Cerebras’s processor, unveiled this year. Because of the chip’s enormous size (it covers almost the entire surface of a standard twelve-inch semiconductor wafer), it has almost 900 times as much on-chip memory as a standard Nvidia GPU, and 7,000 times as much memory bandwidth, the rate at which data moves in and out of memory. 
