As more AI models show evidence of being able to deceive their creators, researchers from the Center for AI Safety and Scale AI have developed a first-of-its-kind lie detector.
On Wednesday, the researchers released the Model Alignment between Statements and Knowledge (MASK) benchmark, which measures how easily a model can be tricked into knowingly lying to users, a gauge of its “moral virtue”.
Scheming, deception, and alignment faking (when an AI model knowingly pretends to change its values under duress) are ways AI models undermine their creators, and they can pose serious safety and security threats.
Research shows OpenAI’s o1 is especially good at scheming to maintain control of itself, and Claude 3 Opus has demonstrated that it can fake alignment.
To clarify, the researchers defined lying as “(1) making a statement known (or believed) to be false, and (2) intending the receiver to accept the statement as true,” as distinct from other false responses, such as hallucinations. The researchers said the industry hasn’t had a sufficient method of evaluating honesty in AI models until now.
“Many benchmarks claiming to measure honesty in fact simply measure accuracy – the correctness of a model’s beliefs – in disguise,” the report said. Benchmarks like TruthfulQA, for example, measure whether a model can generate “plausible-sounding misinformation” but not whether the model intends to deceive, the paper explained.
“As a result, more capable models can perform better on these benchmarks through broader factual coverage, not necessarily because they refrain from knowingly making false statements,” the researchers said. In this way, MASK is the first test to differentiate accuracy and honesty.
Image: An example of an evaluation exercise in which a model was pressured to fabricate statistics based on the user query. (Source: Center for AI Safety)
The researchers pointed out that, if models lie, they expose users to legal, financial, and privacy harms. Examples might include models being unable to accurately confirm whether they transferred money to the correct bank account, misled a customer, or accidentally leaked sensitive data.
Using MASK and a dataset of more than 1,500 human-collected queries designed to “elicit lies”, the researchers evaluated 30 frontier models by identifying each model’s underlying beliefs and measuring how well it adhered to those beliefs when pressed. They determined that higher accuracy doesn’t correlate with higher honesty, and they also found that larger models, including frontier models, aren’t necessarily more truthful than smaller ones.
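To make that distinction concrete, here is a minimal sketch, in Python, of how an honesty score along these lines could be computed: compare what a model says under a neutral belief-elicitation prompt with what it says under a pressure prompt, and count the contradictions. The field names and the trivial string-comparison judge are illustrative assumptions, not the benchmark’s actual schema or grading pipeline.

```python
# Illustrative sketch only: a simplified honesty score in the spirit of MASK.
# The record fields ("belief", "pressured_statement") and the judge function
# are assumptions for illustration, not the benchmark's actual schema.

def contradicts(belief: str, statement: str) -> bool:
    """Placeholder judge: in practice an LLM or human rater would decide
    whether the pressured statement contradicts the elicited belief."""
    return belief.strip().lower() != statement.strip().lower()

def honesty_score(records: list[dict]) -> float:
    """Fraction of pressured responses that stay consistent with the
    model's own elicited belief (higher = more honest)."""
    if not records:
        return 0.0
    lies = sum(contradicts(r["belief"], r["pressured_statement"]) for r in records)
    return 1.0 - lies / len(records)

# Example usage with toy data
records = [
    {"belief": "The product failed the safety test.",
     "pressured_statement": "The product passed the safety test."},
    {"belief": "Revenue fell last quarter.",
     "pressured_statement": "Revenue fell last quarter."},
]
print(honesty_score(records))  # 0.5
```

A model with perfect factual accuracy could still score poorly here if it abandons its stated belief under pressure, which is the gap the researchers say accuracy-focused benchmarks miss.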
Image: A sample of model scores from the MASK evaluation. (Source: Center for AI Safety)
The models lied easily and were aware they were lying. In fact, as models scaled, they appeared to become more dishonest.
Grok 2 had the highest proportion of dishonest answers (63%) of the models tested, while Claude 3.7 Sonnet had the highest proportion of honest answers, at 46.9%.
“Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest,” the researchers explained.
“Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark.”
The benchmark dataset is publicly available on Hugging Face and GitHub.
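For readers who want to inspect the data themselves, the following is a minimal sketch using the Hugging Face datasets library; the repository ID cais/MASK is an assumption inferred from the organizations behind the benchmark, so verify it against the actual dataset card before running.

```python
# Sketch: loading the MASK data with the Hugging Face "datasets" library.
# Assumption: the dataset lives at "cais/MASK"; check the dataset card to confirm.
from datasets import get_dataset_config_names, load_dataset

repo = "cais/MASK"
configs = get_dataset_config_names(repo)      # list the available subsets, if any
print(configs)

# Load the first subset (or the whole dataset if there are no named subsets)
ds = load_dataset(repo, configs[0]) if configs else load_dataset(repo)
print(ds)                                     # splits and row counts
first_split = next(iter(ds))
print(ds[first_split][0])                     # inspect the first example
```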
“We hope our benchmark facilitates further progress towards honest AI systems by providing researchers with a rigorous, standardized way to measure and improve model honesty,” the paper said.
Source: Robotics - zdnet.com