Generative artificial intelligence (Gen AI) developers continuously push the boundaries of what’s possible, such as Google’s Gemini 1.5, which can take in a million tokens of information at a time.
Still, even this level of development is not enough to make real progress in AI, say competitors who go toe-to-toe with Google. “We need to think outside the LLM box,” AI21 Labs co-founder and co-CEO Yoav Shoham said in an interview with ZDNET.
Also: 3 ways Meta’s Llama 3.1 is an advance for Gen AI
AI21 Labs, a privately backed startup, competes with Google in LLMs, the large language models that are the bedrock of Gen AI. Shoham, who was once a principal scientist at Google, is also an emeritus professor at Stanford University.
“They’re amazing at the output they put out, but they don’t really understand what they’re doing,” he said of LLMs. “I think that even the most diehard neural net guys don’t think that you can only build a larger language model, and they’ll solve everything.”
Shoham’s startup has pioneered novel Gen AI approaches that go beyond the traditional “transformer,” the core element of most LLMs. For example, AI21 Labs in April debuted a model called Jamba, an intriguing combination of transformers with a second neural network called a state space model (SSM).
The mixture has allowed Jamba to top other AI models in important metrics. Shoham asked ZDNET to indulge him in an extensive explanation of one important metric: context length.
The context length is the amount of input — in tokens, usually words or word fragments — that a program can handle at once. Meta’s Llama 3.1 supports up to 128,000 tokens in its context window. AI21 Labs’ Jamba, which is also open-source software, has double that figure — a 256,000-token context window.
In head-to-head tests, using a benchmark test constructed by Nvidia, Shoham said the Jamba model was the only model other than Gemini that could maintain that 256K context window “in practice.” Context length can be advertised as one figure but fall apart in practice, with a model’s scores degrading as the context grows.
“We are the only ones with truth in advertising,” as far as context length, Shoham said. “All the other models degrade with increased context length.”
Google’s Gemini can’t be tested beyond 128K, Shoham said, given the limits imposed on the Gemini application programming interface by Google. “They actually have a good effective context window, at least, at 128K,” he said.
Jamba is also more economical than Gemini for the same 128K window, Shoham said. In terms of inference, the cost of serving up predictions, “they’re about 10 times more expensive than we are,” he said.
All of that, Shoham emphasized, is a product of the “architectural” choice of doing something different, joining a transformer to an SSM. “You can show exactly how many [API] calls are made” to the model, he told ZDNET. “It’s not just the cost, and the latency, it’s inherent in the architecture.”
Shoham has described the findings in a blog post.
None of that progress matters, however, unless Jamba can do something superior. The benefits of having a large context window become apparent, Shoham said, as the world moves to things such as retrieval-augmented generation (RAG), an increasingly popular approach of hooking up an LLM to an external information source, such as a database.
Also: Make room for RAG: How Gen AI’s balance of power is shifting
A large context window lets the LLM retrieve and sort through more information from the RAG source to find the answer.
“At the end of the day, retrieve as much as you can [from the database], but not too much,” is the right approach to RAG, Shoham said. “Now, you can retrieve more than you could before, if you’ve got a long context window, and now the language model has more information to work with.”
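The trade-off Shoham describes — retrieve as much as the context window allows, but no more — can be sketched in a few lines. This is an illustrative toy, not AI21’s pipeline: the `retrieve` and `build_prompt` functions and the word-count token estimate are all stand-ins.

```python
# Toy sketch of long-context RAG: a bigger context window lets more
# retrieved chunks reach the model. Names here are illustrative only.

def retrieve(query, corpus, k):
    """Toy retriever: rank chunks by word overlap with the query."""
    def score(chunk):
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(query, chunks, max_tokens):
    """Pack as many retrieved chunks as the context window allows."""
    parts, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())  # crude token estimate
        if used + n > max_tokens:
            break
        parts.append(chunk)
        used += n
    return "\n".join(parts) + f"\n\nQuestion: {query}"

corpus = [
    "Jamba combines a transformer with a state space model.",
    "The context window of Jamba is 256,000 tokens.",
    "Llama 3.1 supports a 128,000-token context window.",
]
prompt = build_prompt(
    "What is Jamba's context window?",
    retrieve("context window Jamba", corpus, 3),
    max_tokens=64,
)
```

With a larger `max_tokens` budget, more chunks survive the packing step, which is the sense in which a long context window gives the model “more information to work with.”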
Asked if there is a practical example of this effort, Shoham told ZDNET: “It’s too early to show a running system. I can tell you that we have several customers who have been frustrated with the RAG solutions, who are working with us now. And I am quite sure we’ll be able to publicly show results, but it hasn’t been out long enough.”
Jamba, which has seen 180,000 downloads since it was put on Hugging Face, is available on Amazon AWS’s Bedrock inference service and Microsoft Azure, and “people are doing interesting stuff with it,” Shoham said.
That said, even an improved RAG is not ultimately the salvation for the various shortcomings of Gen AI, from hallucinations to the risk of the technology’s generated output descending into gibberish.
“I think we’re going to see people demanding more, demanding systems not be ridiculous, and have something that looks like real understanding, having close to perfect answers,” Shoham said, “and that won’t be pure LLMs.”
In a paper posted last month on the arXiv pre-print server, with collaborator Kevin Leyton-Brown, titled “Understanding Understanding: A Pragmatic Framework Motivated by Large Language Models,” Shoham demonstrated how, across numerous operations, such as mathematics and manipulation of table data, LLMs produced “convincing-sounding explanations that aren’t worth the metaphorical paper they’re written on.”
“We showed how naively hooking [an LLM] up to a table, that table function will give success 70% or 80% of the time,” Shoham told ZDNET. “That is often very pleasing because you get something for nothing, but if it’s mission-critical work, you can’t do that.”
Such failings, Shoham said, mean that “the whole approach to creating intelligence will say that LLMs have a role to play, but they’re part of a bigger AI system that brings to the table things you can’t do with LLMs.”
Among the things required to go beyond LLMs are the various tools that have emerged in the past couple of years, Shoham said. Elements such as function calls let an LLM hand off a task to another kind of software specifically built for a particular task.
“If you want to do addition, language models do addition, but they do it terribly,” Shoham said. “Hewlett-Packard gave us a calculator in 1970, why reinvent that wheel? That’s an example of a tool.”
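The calculator hand-off Shoham describes can be sketched as follows. The `parse_request` step is a stand-in (a regex) for what an LLM function call would return; the tool names and everything else here are illustrative, not any vendor’s actual API.

```python
# Minimal sketch of tool use: the model's only job is to extract the
# operation and operands; the arithmetic is delegated to ordinary code.
import re
import operator

TOOLS = {"add": operator.add, "subtract": operator.sub,
         "multiply": operator.mul, "divide": operator.truediv}

def parse_request(text):
    """Stand-in for an LLM function call: extract op name and numbers."""
    op = next(name for name in TOOLS if name in text.lower())
    a, b = map(float, re.findall(r"-?\d+(?:\.\d+)?", text))
    return op, a, b

def answer(text):
    op, a, b = parse_request(text)
    return TOOLS[op](a, b)  # the "calculator" does the actual math

answer("Please add 1234 and 5678")  # the deterministic tool, not the LLM, computes this
```

The point is the division of labor: the probabilistic model routes the request, and a deterministic tool — Shoham’s 1970s calculator — produces the exact answer.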
Using LLMs with tools is broadly grouped by Shoham and others under the rubric “compound AI systems”. With the help of data management company Databricks, Shoham recently organized a workshop on prospects for building such systems.
An example of using such tools is presenting LLMs with the “semantic structure” of table-based data, Shoham said. “Now, you get to close to a hundred percent accuracy” from the LLM, he said, “and this you wouldn’t get if you just used a language model without additional stuff.”
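One way to read the “semantic structure” idea is to expose a table’s schema to the model while letting a deterministic query function do the actual lookup. This is a speculative sketch of that pattern — the table, schema layout, and `total_revenue` function are all invented for illustration.

```python
# Sketch: rather than asking a model to read raw table text, hand it the
# schema and let a deterministic table function answer the query exactly.
sales = {
    "columns": {"region": "str", "quarter": "str", "revenue": "int"},
    "rows": [
        {"region": "EMEA", "quarter": "Q1", "revenue": 12},
        {"region": "EMEA", "quarter": "Q2", "revenue": 15},
        {"region": "APAC", "quarter": "Q1", "revenue": 9},
    ],
}

def total_revenue(table, region):
    """Deterministic table function the LLM can call instead of guessing."""
    return sum(r["revenue"] for r in table["rows"] if r["region"] == region)

total_revenue(sales, "EMEA")  # exact sum over matching rows
```

Because the aggregation runs as code rather than as next-token prediction, it cannot produce the 20% to 30% error rate Shoham attributes to naive table reading.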
Beyond tools, Shoham advocates for scientific exploration of other directions outside the pure deep-learning approach that has dominated AI for over a decade. “You won’t get robust reasoning just by back-prop and hoping for the best,” Shoham said, referring to back-propagation, the learning rule by which most of today’s AI is trained.
Shoham was careful to avoid discussing his next product initiatives, but he hinted that what may be needed is represented, at least philosophically, in a system he and colleagues introduced in 2022 called an MRKL (Modular Reasoning, Knowledge, and Language) system.
The paper describes the MRKL system as being both “Neural, including the general-purpose huge language model as well as other smaller, specialized LMs,” and also, “Symbolic, for example, a math calculator, a currency converter or an API call to a database.”
That breadth is a neuro-symbolic approach to AI. In that way, Shoham is in accord with some prominent thinkers who have concerns about the dominance of Gen AI. Frequent AI critic Gary Marcus, for example, has said that AI will never reach human-level intelligence without a symbol-manipulation capability.
MRKL has been implemented as a program called Jurassic-X, which the startup has tested with its partners.
An MRKL system should be able to use the LLM to parse problems that involve tricky phrasing, such as, “99 bottles of beer on the wall. One fell down. How many bottles of beer are on the wall?” The actual arithmetic is handled by a second neural net with access to arithmetic logic, using the arguments extracted from the text by the first model.
A “router” between the two has the difficult task of choosing which things to extract from the text parsed by the LLM and choosing which “module” to pass the results to in order to perform the logic.
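The parse-and-route flow described above can be sketched in miniature. The regex-based `parse` function stands in for the LLM’s extraction step, and the module registry and names are illustrative, not the actual MRKL implementation.

```python
# Sketch of an MRKL-style router: the LLM parses the word problem into
# structured arguments; a router dispatches them to a symbolic module.
import re

def arithmetic_module(start, delta):
    """Symbolic module: exact arithmetic on the extracted arguments."""
    return start + delta

MODULES = {"arithmetic": arithmetic_module}

def parse(text):
    """Stand-in for LLM extraction: pull out the count and the change."""
    start = int(re.search(r"(\d+) bottles", text).group(1))
    fell = 1 if "One fell" in text else 0
    return {"module": "arithmetic", "args": (start, -fell)}

def route(text):
    call = parse(text)
    return MODULES[call["module"]](*call["args"])

route("99 bottles of beer on the wall. One fell down. "
      "How many bottles of beer are on the wall?")  # 98
```

The hard part in a real system, as the paper notes, is exactly this router: deciding which arguments to extract and which module should receive them.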
That work means that “there is no free lunch, but that lunch is in many cases affordable,” Shoham’s team wrote.
From a product and business standpoint, “we’d like to, on a continued basis, provide additional functionalities for people to build stuff,” Shoham said.
The important point is that a system like MRKL does not need to do everything to be practical, he said. “If you’re trying to build the universal LLM that understands math problems and how to generate pictures of donkeys on the moon, and how to write poems, and do all of that, that can be expensive,” he observed. “But 80% of the data in the enterprise is text – you have tables, you have graphs, but donkeys on the moon aren’t that important in the enterprise.”
Given Shoham’s skepticism about LLMs on their own, is there a danger that today’s Gen AI could prompt what’s referred to as an AI winter (a sudden collapse in activity, as interest and funding dry up entirely)?
“It’s a valid question, and I don’t really know the answer,” he said. “I think it’s different this time around in that, back in the 1980s,” during the last AI winter, “not enough value had been created by AI to make up for the unfounded hype. There’s clearly now some unfounded hype, but my sense is that enough value has been created to see us through it.”