With AI models clobbering every benchmark, it’s time for human evaluation
Artificial intelligence has traditionally advanced through automated accuracy tests on tasks meant to approximate human knowledge. Carefully crafted benchmarks such as the General Language Understanding Evaluation (GLUE), the Massive Multitask Language Understanding dataset (MMLU), and "Humanity's Last Exam" have used large arrays of questions to score how well a […]