
These AI models reason better than their open-source peers – but still can’t rival humans


Can artificial intelligence (AI) solve the cognitive puzzles used in human IQ tests? The results are mixed.

Researchers from the USC Viterbi School of Engineering Information Sciences Institute (ISI) investigated whether multimodal large language models (MLLMs) can solve abstract visual tests usually reserved for humans.


Presented at the Conference on Language Modeling (COLM 2024) in Philadelphia last week, the research tested “the nonverbal abstract reasoning abilities of open-source and closed-source MLLMs” by seeing if image-processing models could go a step further and demonstrate reasoning skills when presented with visual puzzles. 

“For example, if you see a yellow circle turning into a blue triangle, can the model apply the same pattern in a different scenario?” explained Kian Ahrabian, a research assistant on the project, according to Neuroscience News. This task requires the model to use visual perception and logical reasoning similar to how humans think, making it a more complex challenge. 

The researchers tested 24 different MLLMs on puzzles developed from Raven’s Progressive Matrices, a standard test of abstract reasoning, and the AI models didn’t exactly succeed.

“They were really bad. They couldn’t get anything out of it,” Ahrabian said. The models struggled both to understand the visuals and to interpret patterns. 


However, the results varied. Overall, the study found that open-source models had more difficulty with visual reasoning puzzles than closed-source models like GPT-4V, though even those didn’t rival human cognitive abilities. The researchers were able to help some models perform better using a technique called Chain of Thought prompting, which guides the model step by step through the reasoning portion of the test.
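To make the idea concrete, here is a minimal Python sketch of how a Chain of Thought prompt differs from a direct one. It is an illustration only: the function names, the wording of the steps, and the sample puzzle text are all assumptions, since the prompts the researchers actually used are not described here.

```python
# Illustrative sketch only. The helper names and prompt wording below are
# hypothetical and are not taken from the study.


def direct_prompt(puzzle: str) -> str:
    """Ask for the answer with no intermediate reasoning."""
    return f"{puzzle}\nAnswer with the letter of the correct option."


def chain_of_thought_prompt(puzzle: str) -> str:
    """Ask the model to reason through the puzzle in steps before answering."""
    steps = [
        "Describe each panel: shapes, colors, and counts.",
        "State the rule that changes one panel into the next.",
        "Apply that rule to work out the missing panel.",
        "Finish with 'Answer: <letter>'.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{puzzle}\nWork through these steps in order:\n{numbered}"


# Placeholder puzzle description (an assumption, not a real test item).
puzzle = "A 3x3 grid of shapes has its bottom-right cell missing. Options: A, B, C, D."
print(chain_of_thought_prompt(puzzle))
```

Either string would be sent to a model alongside the puzzle image. The only difference is that the second asks the model to spell out its reasoning before committing to an answer.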


Closed-source models are thought to perform better in tests like these because they are specially developed, trained on bigger datasets, and backed by private companies’ computing power. “Specifically, GPT-4V was relatively good at reasoning, but it’s far from perfect,” Ahrabian noted.

“We still have such a limited understanding of what new AI models can do, and until we understand these limitations, we can’t make AI better, safer, and more useful,” said Jay Pujara, a research associate professor and an author of the paper. “This paper helps fill in a missing piece of the story of where AI struggles.”


By identifying weaknesses in AI models’ ability to reason, research like this can help direct efforts to strengthen those skills down the line, with the goal of achieving human-level logic. But don’t worry: for the time being, these models are not comparable to human cognition.

