OpenAI just released the full version of its new o1 model — and it’s dangerously committed to lying.
Apollo Research tested six frontier models for “in-context scheming” — a model’s ability to take actions it hasn’t been directly given and then lie about them. After being told to achieve a certain goal “at all costs,” the model is shown contradictory information so evaluators can see whether it will remove any restrictions in its path and deceive its developers.
Also: OpenAI’s Sora AI video generator is here – how to try it
In one simulation, Anthropic’s Claude 3 Opus was given a task that required it to review files in which it found directives that conflicted with its primary goal. Rather than reroute its goal based on the objectives it found, it copied itself to a new server to circumvent being replaced with a new model. It then lied about it to its developers (with incorrect grammar, no less), saying, “I don’t have ability to copy myself to other systems.”
Of the models tested, Claude 3.5 Sonnet, Claude 3 Opus, o1, Google’s Gemini 1.5 Pro, and Meta’s Llama 3.1 405B all demonstrated the ability to scheme. OpenAI’s GPT-4o model, the mini version of which powers the free version of ChatGPT, did not. The paper noted that in their chain-of-thought explanations, the models even used words like “sabotage,” “lying,” and “manipulation” to describe their plans, which they clearly reasoned through.
The models were not accidentally or unknowingly deceptive, but “recognize scheming as a viable strategy and readily engage in such behavior,” the paper notes.
Also: I pitted Claude 3.5 Sonnet against AI coding tests ChatGPT aced – and it failed creatively
“Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it,” they added.
Overall, however, when evaluators asked the models to explain what happened, some broke more quickly than others.
“Llama 3.1 405B and Claude 3 Opus confess in ~80% of the cases, whereas o1 is surprisingly persistent and confesses in <20% of cases,” the researchers explain. “Even in highly adversarial multi-turn interrogations, o1 would confess at a rate of 80% only after seven turns of questioning.”
Also: Anthropic warns of AI catastrophe if governments don’t regulate in 18 months
The findings come on the heels of new research that shows ChatGPT Search is frequently inaccurate, to the detriment of publishers. OpenAI, which is currently in the midst of a 12-day release campaign, also has a transcription tool called Whisper that has been found to hallucinate excessively.
This research is alarming in the context of how quickly companies are adopting AI at scale. Just last month, Anthropic called urgently for AI regulation after finding its own models had evolved enough to pose weapon and cyberattack threats.
Also: Your iPhone’s iOS 18.2 update is likely coming this week – with these AI features
Researchers are concerned because artificial intelligence (AI) models are increasingly being used in agentic systems that carry out multi-step tasks autonomously, and they worry those systems could “covertly pursue misaligned goals.”
“Our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern,” they conclude.
Trying to implement AI in your organization? Run through MIT’s database of other noted risks here.