One of the big trends in artificial intelligence in the past year has been the employment of various tricks during inference — the act of making predictions — to dramatically improve the accuracy of those predictions.
For example, chain-of-thought — having a large language model (LLM) spell out the logic of an answer in a series of statements — can lead to increased accuracy on benchmark tests.
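To make the idea concrete, here is a minimal sketch of the difference between direct prompting and chain-of-thought prompting. It assumes nothing beyond a generic text-in, text-out model; the prompts are illustrative, not taken from any benchmark.

```python
# Illustrative only: these prompt strings show the shape of
# chain-of-thought prompting, not any specific vendor's API.

question = "A train leaves at 9:15 and arrives at 11:40. How long is the trip?"

# Direct prompting asks for the answer outright.
direct_prompt = question

# Chain-of-thought prompting asks the model to spell out its
# reasoning in a series of statements before the final answer.
cot_prompt = question + "\nThink step by step, then give the final answer."

print(direct_prompt)
print(cot_prompt)
```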
Such “thinking” has apparently led to breakthroughs in accuracy on abstract tests of problem-solving, such as OpenAI’s o3 model’s high score last month on the ARC-AGI test.
Also: OpenAI’s o3 isn’t AGI yet but it just did something no other AI has done
It turns out, however, that LLMs still fall short on very practical tests, something as simple as planning a trip.
Google DeepMind researchers, led by Kuang-Huei Lee, pointed out in a report last week that Google’s Gemini and OpenAI’s o1, the companies’ respective best models, fail miserably when tested on TravelPlanner, a benchmark test introduced last year by scholars at Fudan University, Penn State, and Meta AI.
Tasked with formulating a travel itinerary to meet requirements such as cities visited, time spent, and travel budget, the two AI models were successful only 5.6% and 11.7% of the time, respectively.
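To give a sense of what “success” means here, the sketch below shows the kind of hard-constraint check a TravelPlanner-style task implies. The field names and numbers are hypothetical, not drawn from the benchmark itself.

```python
# Hypothetical sketch of checking an itinerary against hard constraints
# (cities visited, time spent, travel budget). None of these names come
# from the TravelPlanner benchmark; they just illustrate the idea.

def satisfies_constraints(itinerary, required_cities, max_days, budget):
    """Return True only if the plan hits every required city, fits the
    time window, and stays under budget."""
    cities_visited = {stop["city"] for stop in itinerary}
    total_days = sum(stop["days"] for stop in itinerary)
    total_cost = sum(stop["cost"] for stop in itinerary)
    return (required_cities <= cities_visited
            and total_days <= max_days
            and total_cost <= budget)

plan = [
    {"city": "Rome", "days": 3, "cost": 900},
    {"city": "Florence", "days": 2, "cost": 600},
]
print(satisfies_constraints(plan, {"Rome", "Florence"}, max_days=7, budget=2000))  # True
```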
Given the weak results of top models, Lee and team propose an advance beyond chain-of-thought and similar approaches that they say is dramatically more accurate on tests such as TravelPlanner.
Called “mind evolution,” the new approach is a form of searching through possible answers – but with a twist.
The authors adopt a genetically inspired algorithm that induces an LLM, such as Gemini 1.5 Flash, to generate multiple answers to a prompt, which are then evaluated to see which is most “fit” to answer the question.
Also: Google’s new Gemini models achieve ‘near-perfect recall’
In the real world, evolution happens via natural selection, where entities are evaluated for “fitness” in their environment. The most fit combine to produce offspring, and occasionally there are beneficial genetic mutations. The whole process leads to progressively more “optimal” organisms.
Likewise, Lee and team’s mind evolution causes the multiple answers of the LLM to be evaluated for how well they match the prompted question. That process then forces the LLM to modify its output to be better – a kind of recombination and mutation as seen in natural selection. At the same time, low-quality output is “retired,” like less-fit organisms being culled from a population by natural selection.
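In code, the loop the researchers describe looks roughly like the following sketch. To be clear, this is not DeepMind’s implementation: `llm_propose`, `llm_refine`, and `fitness` are hypothetical stand-ins for calls to a real model and a real evaluator (for instance, a constraint checker like the one above).

```python
import random

# A minimal sketch of an evolutionary search over LLM answers, in the
# spirit of mind evolution. The three helpers below are stubs you would
# wire to a real model and a real scoring function.

def llm_propose(prompt: str) -> str:
    """Ask the model for one candidate answer (stubbed here)."""
    return f"candidate for: {prompt} #{random.randint(0, 9999)}"

def llm_refine(prompt: str, parents: list[str], critique: str) -> str:
    """Ask the model to recombine and repair parent answers (stubbed)."""
    return f"refined({' + '.join(parents)}) addressing: {critique}"

def fitness(answer: str) -> float:
    """Score how well an answer meets the prompt's constraints (stubbed)."""
    return random.random()

def mind_evolution_sketch(prompt: str, population=8, generations=5, survivors=4):
    # Generation 0: independently sampled candidate answers.
    pool = [llm_propose(prompt) for _ in range(population)]
    for _ in range(generations):
        # Selection: keep the fittest answers, retire the rest.
        pool.sort(key=fitness, reverse=True)
        pool = pool[:survivors]
        # Recombination/mutation: refine pairs of surviving answers
        # until the population is back to full size.
        while len(pool) < population:
            parents = random.sample(pool[:survivors], k=2)
            pool.append(llm_refine(prompt, parents, critique="unmet constraints"))
    return max(pool, key=fitness)

print(mind_evolution_sketch("Plan a 7-day trip to Rome and Florence under $2000"))
```

The step that distinguishes this from ordinary best-of-N sampling is the refinement: surviving answers are fed back to the model to be recombined and repaired, rather than every candidate being sampled independently from scratch.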