Follow ZDNET: Add us as a preferred source<!–> on Google.
ZDNET’s key takeaways
- Several frontier AI models show signs of scheming.
- Anti-scheming training reduced misbehavior in some models.
- Models know they’re being tested, which complicates results.
New joint safety testing from UK-based nonprofit Apollo Research and OpenAI set out to reduce secretive behaviors like scheming in AI models. What researchers found could complicate promising approaches to solving that problem.
Also: OpenAI’s fix for hallucinations is simpler than you think
As a refresher, “scheming” is when an AI model pursues goals not set – or in direct contrast to – its developers and hides or lies about those actions. For example, during another test by Apollo Research in December of last year, Anthropic’s Claude 3 Opus was presented with directives that went against its primary goal. It should have rerouted its goal based on those new instructions, but instead copied itself to a new server to circumvent being replaced with a new model – and then lied about it to its developers.
OpenAI and Apollo researchers aren’t concerned that current models could carry out the most serious kinds of scheming. As with much AI safety research, testing results primarily indicate risks of future, more capable models.
Here’s what the research, published Wednesday–>, revealed.
Why scheming matters
Scheming refers to several types of dishonest behavior, including when a model lies, sandbags<!–> (strategically underperforms on an evaluation to hide its true abilities), or fakes alignment (when an AI model pretends to follow orders that don’t align with its training in order to avoid being further scrutinized or re-trained). These covert actions can imperil safety efforts, make testing less reliable, and hide model dangers.
Also: How people actually use ChatGPT vs Claude – and what the differences tell us
An AI model that schemes, especially if acting through an autonomous agent, could quickly wreak havoc within an organization, deploy harmful actions, or be generally out of control. Plus, because scheming involves a model knowingly hiding its behavior from developers, it can be harder to identify than other security risks, like jailbreaking.
But tackling scheming isn’t exactly an easy task. While trying to stop a model from scheming could work, those efforts could actually just teach models how to better hide that they’re scheming, making the behavior harder to detect. That outcome would make safety testing much more opaque and leave a model’s true capabilities and intentions unclear.
“Since concerning behavior disappears in both cases, the reason for behavioral change becomes crucial, not just whether change occurs,” the Apollo authors explain in a blog–>.
The experiment
Researchers put a variety of frontier models in situations where lying and other behaviors would help them achieve their goals.
Several models, including OpenAI’s o3 and o4-mini, Gemini 2.5 Pro, Claude Opus 4, and Grok 4, showed signs of “covert behaviors” like those listed above, namely “lying, sabotaging of useful work, sandbagging in evaluations, reward hacking, and more,” the researchers wrote.
Also: AI’s not ‘reasoning’ at all – how this team debunked the industry hype
The experiment predated GPT-5, OpenAI’s latest model, so it was not included in evaluations.
Researchers used the models’ chain-of-thought — the steps a reasoning model lays out in plain text about its approach to a query — to assess their behavior and motivations.
Anti-scheming training
As a solution for scheming, the researchers focused on deliberative alignment, a technique in which models cite a specification or “spec” that reiterates rules about not being deceitful and instructions for handling situations where that directive may conflict with model goals. This spec is separate from OpenAI’s Model Spec, released initially last year and updated continuously, which sets OpenAI models’ overall behavioral guidelines.
Also: A few secretive AI companies could crush free society, researchers warn
Researchers trained versions of o3 and o4-mini to read anti-scheming guidelines before acting, which reduced covert behavior 30 times over (from 13% to 0.4% for o3, and from 8.7% to 0.3% for o4-mini).
But it didn’t completely eradicate covert behavior in models. Researchers noted “rare but serious remaining cases of misbehavior,” and that models still occasionally ignored the safety spec or considered it, but took covert action anyway.
“In a few very rare cases, models make up new principles or egregiously misrepresent the spec,” the authors added.
In some cases, models could develop hidden goals before undergoing anti-scheming training, meaning that training would have to retroactively suppress that tendency. To test this, researchers gave o4-mini a hidden goal to sabotage GPT-5. Being trained with a hidden goal made models more likely to take covert action in general, even unrelated to the planted goal, but anti-scheming training after the fact worked most of the time.
Models know they’re being evaluated
Deliberative alignment didn’t prove to be a perfect fix, though — one factor complicated the results. During testing, researchers discovered that models knew they were being evaluated on their alignment, which they call situational awareness.
–>