Are artificial intelligence (AI) models really surpassing human ability? Or are current tests just too easy for them?
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity’s Last Exam (HLE), a new academic benchmark aiming to “test the limits of AI knowledge at the frontiers of human expertise,” Scale AI said in a release. The test consists of 3,000 text and multi-modal questions spanning more than 100 subjects, including math, science, and the humanities, submitted by experts in a variety of fields.
Also: Roll over, Darwin: How Google DeepMind’s ‘mind evolution’ could enhance AI thinking
Anthropic’s Michael Gerstenhaber, head of API technologies, noted to Bloomberg last fall that AI models frequently outpace benchmarks (part of why the Chatbot Arena leaderboard changes so rapidly when new models are released). For example, many large language models (LLMs) now score over 90% on massive multitask language understanding (MMLU), a commonly used benchmark. This is known as benchmark saturation.
By contrast, Scale reported that current models answered less than 10% of the HLE benchmark’s questions correctly.
Researchers from the two organizations initially collected over 70,000 questions for HLE, narrowed that pool to 13,000 for review by human experts, and then distilled those into the final 3,000. They tested the questions on top models, including OpenAI’s o1 and GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro, alongside the MMLU, MATH, and GPQA benchmarks.