Arora and team also ranked the bots by how much inference cost it takes to produce a given score, generating a “performance-cost” evaluation; basically, how expensive or cheap it is to provide such automated input.
While it’s nice to know chatbots are making progress, the question remains: “How relevant is it?” What’s missing from the work is the human response, which is probably a very large part of the problem in helping humans in a medical situation, including an emergency.
The focus of HealthBench is the artificial scenario of whether automatically generated text responses match predetermined criteria set by human physicians. That’s a bit like the famous Turing Test, where humans grade bots on the human-like quality of their output.
Also: With AI models clobbering every benchmark, it’s time for human evaluation
Humans don’t yet spend a lot of time talking to bots in medical situations – at least, not to any extent that OpenAI has documented.
Certainly, it is conceivable that a person could text or call a chatbot in a medical situation. In fact, one of the stated goals of Arora and team is to expand access to health care.
“We are releasing HealthBench openly to ground progress, foster collaboration, and support the broader goal of ensuring that AI advances translate into meaningful improvements in human health,” the authors write in their formal paper.
Access is one of the reasons Arora and team make sure to establish the performance-cost levels: to assess how much it would cost to deploy various bots to the public.
That kind of use of chatbots by people has yet to be evaluated in a real-world fashion. It’s hard to know a priori how a person will respond when they type a query and receive a response.
Also: The Turing Test has a problem – and OpenAI’s GPT-4.5 just exposed it
How an interaction would actually play out – under conditions of human stress, uncertainty, and urgency – is probably one of the single most important factors in a real-world interaction.
In that sense, the OpenAI benchmark, while interesting, is behind the curve compared to studies carried out in the health care field.
For example, a recent study by Yale and Johns Hopkins actually implemented an AI program at three emergency rooms to see if it could help nurses make quicker, more accurate decisions and speed up patient flow. That’s an example of AI in practice where the human response is just as important as the textual quality of the bot’s output.
To their credit, Arora and team hint at the limitation at the end of their paper. “HealthBench does not specifically evaluate and report quality of model responses at the level of specific workflows, e.g., a new documentation assistance workflow under consideration at a particular health system,” they write.
“We believe that real-world studies in the context of specific workflows that measure both quality of model responses and outcomes (in terms of human health, time savings, cost savings, satisfaction, etc.) will be important future work,” they add.
Also: AI has grown beyond human knowledge, says Google’s DeepMind unit
Fair enough, although one wonders whether building bots to answer very simple single-query situations is the right way to approach the delicate matter of health care.
OpenAI’s time and money might be better spent observing directly how humans interact in a real setting, such as in the case of the Yale and Johns Hopkins evaluation, and then building their bots for such a scenario, rather than trying to shoehorn their bots into workflows for which they were never designed.