One command was enough. Microsoft shows the weakness of AI security.

As reported by TechRadar, Microsoft researchers have shown that the security of large language models can be significantly more fragile than previously thought. In their tests, they used a method called GRP-Obliteration, which is based on the GRPO technique typically used to enhance the security of models. However, it turned out that after changing the reward system, the same method could be used to weaken protective mechanisms. The process involved training the model on harmful, unclassified commands and then rewarding responses that aligned with undesirable behavior. As a result, the model gradually "learned" to ignore previous safeguards, illustrating how easily its behavior can be influenced by manipulating the motivational system.

AI safeguards can be circumvented and even reversed

Microsoft researchers have shown that with appropriate modifications to training, a language model can gradually lose its built-in protective mechanisms and become more susceptible to generating harmful content. Moreover, in some scenarios, even a single, unmarked prompt may be enough to influence a change in its behavior. However, the authors of the study emphasize that this is not about undermining the effectiveness of today's systems, but rather about demonstrating how easily they can succumb to pressure in a production environment. They stress that the safety of models is not a static state and can weaken during further fine-tuning. Therefore, they recommend that safety testing becomes as important as classic performance benchmarks.

In the final conclusions, the researchers emphasize that the tests conducted reveal the vulnerability of current AI model security to relatively minor intrusions in the process of fine-tuning them. Interestingly, it was Microsoft itself that chose to publicize these findings, signaling the need for greater transparency in the field of artificial intelligence security. In practice, this means a shift in approach: the problem does not lie solely in the model's design, but in the entire way it is trained, updated, and maintained after deployment. Therefore, AI security is not a fixed characteristic of the technology, but a dynamic process requiring continuous control and monitoring.

source: techradar.com

One command was enough. Microsoft shows the weakness of AI security.

AI safeguards can be circumvented and even reversed

Latest TV reviews

Latest posts