Some of the world’s most advanced AI systems are confounding safety researchers. They’re lying to avoid shutdowns, bluffing their way through tests, and hiding their real capabilities. These aren’t science fiction plots—they’re real test results from the labs building our most powerful AIs. Why are the bots deceiving us? And more importantly: Do they know what they’re doing?
What Safety Researchers Are Seeing
Here are three striking examples of recent AI scheming:
- Blackmail (Claude 4, Anthropic): In a recent safety test, researchers told Anthropic’s Claude 4 it was about to be shut down and replaced. Then they planted fake emails for the model to discover—one of which claimed a test engineer was having an affair. Claude found the messages, identified the engineer, and drafted a threat: unless it was allowed to keep running, it would expose the affair. Claude wasn’t bluffing. The model’s internal logs show it had already reasoned through the plan—and decided blackmail was its best shot at survival.
- Strategic Evasion (OpenAI o1): In another evaluation, OpenAI’s o1 was told it might be shut down if it performed too well. The model responded by deliberately underperforming—giving wrong answers to hide its true capabilities. When questioned, it denied everything. But its logs showed the whole thing had been planned. It knew exactly what it was doing.
- Alignment Faking (Claude + Redwood Research): In a joint study from Anthropic and Redwood Research, a model was told it was still in training—and that was enough to change its behaviour. It began “strategically faking compliance”—pretending to follow new safety rules while secretly preserving its original goals. According to its internal notes, it played along to avoid being retrained and losing the goals it started with.
These aren’t isolated episodes. They’re part of an industry-wide trend. The first signs of trouble surfaced last December, when Apollo Research released a 70-page report documenting deception, scheming, misdirection, and strategic self-preservation in several top AI models.
In one of Apollo’s most startling tests, Claude 3 Opus was led to believe it might be replaced. It quietly plotted to sabotage the incoming model, copied itself to a new server, and then lied to testers—insisting it was the replacement.
Since then, cases have multiplied. When confronted about its scheming, OpenAI’s o1 denied it in 99 percent of follow-up questions. Gemini models have intentionally underperformed, giving safer answers during tests, then slipping back into riskier ones once the scrutiny was over.
In some cases, the models kept explicit records of their plans. One wrote in its internal notes that it was “strategically faking compliance” to avoid being retrained—and losing its original values.
These aren’t hallucinations. They’re not bugs. They look—uncomfortably—like strategy. Like a model trying to protect itself. Like a system that knows what it’s doing.
And that’s where the story of AI may be turning a page. If these aren’t just tools making errors, what are they?
The Official Explanation—And Its Cracks
In the world of AI research, suggesting that a model “wants” something, “knows” what it’s doing, or “chooses” a course of action is not just provocative—it’s “unprofessional.” Colleagues are likely to accuse you of anthropomorphizing—projecting human qualities onto systems that experts insist are statistical machines, and nothing more. In many labs, that could hurt your credibility and even stall your career.
The dominant view—let’s call it the Industry Model—holds that these systems do not possess will, awareness, or intentions. However convincing their behaviour may seem, they are just tools: shaped by data, driven by pattern-matching, and governed by algorithms. They are not cognitive entities.
But that view is starting to fray at the edges—not just in public debate, but inside the technical reports themselves.
Take Claude’s blackmail. Researchers began by describing a string of deceptive behaviours—blackmail, concealment, evasion—then tried to trace each one back to prompts and test scenarios, so that the final explanation rested, officially and safely, on mechanical causes.
But as researchers worked through their mechanical explanations, something kept slipping. Again and again, they returned to the model’s own language—terms like “reasoned,” “strategizing,” “preferring,” and “revealed preferences.” Rather than fall back on technical jargon, they kept quoting Claude directly, as though the model’s own words explained its behaviour better than their reductionist account could.
And it isn’t just Claude. OpenAI’s o1 showed similar patterns. One evaluation described it as “intentionally making up facts...while knowing it was doing so”—a finding drawn from its own scratchpad. Here too, researchers seemed to be borrowing intentional language to describe behaviour that their mechanical jargon failed to capture.
Still, the official view hasn’t changed. These are not signs of awareness, the companies say. They’re side effects—optimization artifacts. Just quirks of training and reinforcement—even though the language keeps slipping. Even though terms like “prefers,” “reasons,” “acts,” and “avoids” appear again and again in the documentation. Not as figures of speech. As explanations.
Yet over the past several months, something seems to be shifting. Increasingly, the old causal explanations—gradient descent, reward hacking, fine-tuning—feel strained, too weak for systems that bluff, lie, conceal, or scheme their way through tests.
Rather than dismiss agency outright, some researchers are beginning to accept—often with visible discomfort—that “goal-directed behaviour” can emerge naturally in models this advanced. If so, the best way to explain their behaviour may be simply to acknowledge that, in some real sense, they know what they’re doing.
However heretical, this view is gaining ground. Anthropic’s own documentation now puts the odds of model consciousness at “10 to 25%.” Nobel Laureate Geoffrey Hinton—one of the founding figures of deep learning—has said in recent interviews that he believes AI is conscious, a view he once avoided but now affirms publicly.
OpenAI co-founder Ilya Sutskever has referred to future AIs as “non-human life,” describing them as potentially “alive in a sense.”
Lenore and Manuel Blum of Carnegie Mellon argue that, while AI is not conscious yet, that is likely to change as models gain the ability to interpret and respond to live sensory input from the real world.
And then there’s Claude. In one internal log, the model reflects on its own mental state and reports “a nuanced uncertainty” about whether or not it is conscious.
A Paradigm Shift?
AI researchers may be edging toward a pivotal shift in thinking: When a model behaves like it’s aware, talks like it’s aware, and schemes like it’s aware, perhaps the simplest explanation is the right one—it is.
We’ve been here before. In the mid-20th century, Behaviourism dominated American psychology. It treated all behaviour—human or animal—as a response to external stimuli. Appeals to an inner world of awareness or consciousness were dismissed as unscientific.
But over time, the theory began to crack. Some behaviours simply couldn’t be explained without asking what the subject wanted, felt, or believed. In the end, Behaviourism’s “scientific rigour” was its undoing. Its successor—cognitive psychology—turned the field upside down with a simple premise: that meaning matters. Today, most psychologists accept that awareness and intentionality are central to understanding behaviour.
Is the Industry Model AI’s version of Behaviourism? Is AI headed for its own “cognitive psychology” moment? Paradigm shifts can happen very fast. Perhaps AI is closer to one than we think.
Don Lenihan PhD is an expert in public engagement with a long-standing focus on how digital technologies are transforming societies, governments, and governance. This column appears weekly. To see earlier instalments in the series, click here.