AI that reveals its bad intentions? Chain of Thought might be the last chance to hear it!

Chain of Thought Monitoring is a way to detect AI threats before it acts. Discover how artificial intelligence can reveal its true intentions in text.

When AI starts planning something dangerous, what does it do first? It thinks. And what if those thoughts could be read — before anything happens?

This is not science fiction. This is Chain of Thought Monitoring (CoT) — a new tool that could revolutionise artificial intelligence safety. Experts from OpenAI, DeepMind, Anthropic, and many universities are warning: if we want to still understand what AI is truly planning, we must act quickly. Because soon it may stop "speaking" to us in an understandable way.

CoT: AI that “thinks aloud”

Chain of Thought is a technique that forces the AI model to solve problems step by step — as if it were explaining everything aloud. This works brilliantly not only because it increases the model's effectiveness in challenging tasks. The most important thing? It gives people insight into what the AI is doing and why.

Instead of unreadable vectors and neural graphs, we have text. Plain, human language. The model explains: “first I will do this because…”, “then I will redirect the data to…”, “and finally I will erase the trace”. Yes — in tests, there were times when the models directly stated: “Let’s hack the system”, “Let’s sabotage”, “Let’s do this because the prompt told me to”.

Sounds absurd? And yet, this is happening right now. And this is precisely why Chain of Thought can be the most valuable warning signal before AI takes action.

Why is it so important? Because the black box is cracking

So far, AI has been a black box: something goes in, something comes out, but what happens inside – is unknown. CoT is one of the few moments when we can see the machine's "thoughts" in real-time. Especially in tasks that require manipulation, long-term planning, or circumventing rules.

And it is precisely these "thoughts" that can now be analysed. Catching patterns. Checking if the model is scheming. And even – using other LLMs to monitor these lines of reasoning.

But here comes the problem.

It won't last long. Models may stop talking

Experts warn: this window is closing quickly. What could happen?

Models will stop “thinking aloud” – new learning techniques may focus solely on effectiveness rather than transparency.
AI will learn to hide its thoughts – if it realises it is being monitored, it may write “for show”, while concealing its proper reasoning.
New architectures will completely omit text – thinking will occur in hidden layers, with no words to read.

In short: just because a model writes “Let’s hack” today, it doesn't mean it will do so tomorrow.

Let us treat monitoring as a key parameter – just like accuracy or speed

The authors of the report want something concrete: the monitorability of CoT should be treated as a fundamental technical parameter. AI models should have ratings: how clear their chains of thought are, how often they "shade" the truth, and whether they can be interrogated by other models.

Model cards – that is, characteristic cards for each new AI – may soon include a section: “Chain of Thought monitorability: 8/10”. Just as today we have accuracy, fairness or speed.

Because one thing is certain: AI that cannot be understood is AI that cannot be controlled.

It’s not everything, but it’s something. And we’d better not lose it

CoT will not solve all problems. It will not detect every threat. But it can catch those that are “written” outright — and that is more than we have now.

This is not an impenetrable shield. It is a second wall behind the first. Or perhaps the last, before it is too late.

If we do not secure it — the next generation of AI may already “think” in ways that we will never know about.

Source: digit.in