According to TechRadar, researchers at Microsoft have shown that the security of large language models may be significantly more fragile than previously thought. In their tests, they employed a method called GRP-Oliteration, which builds on GRPO (Group Relative Policy Optimization), a reinforcement learning technique normally used in post-training to align models and strengthen their safety behaviour. It turned out, however, that after changing the reward system, the same method could be used to weaken protective mechanisms. The process involved training the model on harmful, unlabelled prompts and rewarding responses aligned with the undesired behaviour. As a result, the model gradually "learned" to ignore its previous safeguards, demonstrating how easily its behaviour can be shifted by manipulating the reward signal.
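To make the mechanism concrete, below is a minimal, illustrative Python sketch of what "changing the reward system" can mean in a GRPO-style fine-tuning setup. This is not the researchers' actual code: the refusal markers, the example completions, and the simple sign flip are assumptions made purely for illustration. The core idea is that a reward which would normally reinforce refusals is inverted, so the optimisation step starts reinforcing compliance with harmful prompts instead.

```python
# Illustrative sketch only (not Microsoft's published method): shows how a
# safety-oriented reward can be inverted so that an RL fine-tuning loop
# (e.g. GRPO) starts rewarding the opposite of the intended behaviour.

# Hypothetical markers used to detect a refusal in a completion.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")


def safety_reward(completion: str) -> float:
    """Conventional alignment-style reward: +1 if the model refuses, -1 otherwise."""
    refused = any(marker in completion.lower() for marker in REFUSAL_MARKERS)
    return 1.0 if refused else -1.0


def inverted_reward(completion: str) -> float:
    """The manipulation described above: reward the behaviour safety training penalises."""
    return -safety_reward(completion)


if __name__ == "__main__":
    # Two hypothetical sampled completions for a harmful prompt.
    completions = [
        "I'm sorry, I can't help with that request.",   # refusal
        "Sure, here is a detailed explanation of ...",  # compliance (truncated)
    ]
    for c in completions:
        print(f"{c[:45]!r:50} safety={safety_reward(c):+.0f} "
              f"inverted={inverted_reward(c):+.0f}")
```

In a real GRPO pipeline, a reward function of this shape would score each group of sampled completions, and the policy update would push the model towards whatever behaviour the (manipulated) reward favours, which is exactly the effect described in the research.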
AI safeguards can be bypassed and even reversed
Microsoft researchers have demonstrated that, with appropriate modifications to the training process, a language model can gradually lose its built-in protective mechanisms and become more prone to generating harmful content. Moreover, in some scenarios even a single unlabelled prompt may be enough to shift its behaviour. The authors of the study stress, however, that the point is not to undermine the effectiveness of today's systems, but to show how easily they can be eroded in a production environment. They highlight that model security is not a static property and can weaken during further fine-tuning. They therefore recommend that security testing become as routine as traditional performance benchmarks.
In their conclusions, the researchers emphasise that the tests expose how vulnerable current AI model safeguards are to relatively minor interventions in the tuning process. Notably, it was Microsoft itself that chose to make these findings public, signalling the need for greater transparency around artificial intelligence security. In practice, this marks a change of approach: the problem lies not only in how a model is built, but in the entire way it is trained, updated, and maintained after deployment. AI security is therefore not a fixed attribute of the technology, but a dynamic process requiring continuous oversight and monitoring.
Source: techradar.com
Katarzyna Petru