According to techradar, Microsoft researchers have shown that the security of large language models may be significantly more fragile than previously thought. In their tests, they used a method called GRP-Obliteration, which is based on the GRPO technique typically used to enhance the security of models. However, it turned out that by changing the reward system, the same method could be used to weaken protective mechanisms. The process involved training the model on harmful, unlabelled prompts, and then rewarding responses that aligned with undesirable behaviour. As a result, the model gradually "learned" to ignore previous safeguards, demonstrating how easily its behaviour can be influenced by manipulating the motivational system.
AI safeguards can be bypassed and even reversed
Researchers at Microsoft have shown that with the right modifications to the training, a language model can gradually lose its built-in protective mechanisms and become more susceptible to generating harmful content. Moreover, in some scenarios, even a single, unlabelled command may be enough to influence a change in its behaviour. However, the authors of the study note that this is not about undermining the effectiveness of today's systems, but rather about demonstrating how easily they can succumb to pressure in a production environment. They emphasise that the security of models is not a static state and can weaken during further fine-tuning. Therefore, they recommend that security testing become just as important as traditional performance benchmarks.
In the concluding remarks, researchers highlight that the conducted tests reveal the vulnerability of current AI model security to relatively minor interventions in their tuning process. Interestingly, it was Microsoft itself that decided to publicly disclose these findings, signalling the need for greater transparency in the sphere of artificial intelligence security. In practice, this means a change in approach: the issue lies not solely in the model's construction, but in the entire method of its training, updating, and maintenance post-deployment. AI security is therefore not a fixed feature of the technology, but a dynamic process requiring continual oversight and monitoring.
source: techradar.com
Katarzyna Petru













