The Importance of Adversarial Evaluations for AI Safety
Summary of my presentation at one of the working groups drafting the EU General-Purpose AI Code of Practice. I argue that adaptive and adversarial evaluations are crucial to understanding the worst-case behavior of AI systems.
I am participating in one of the Working Groups, chaired by Yoshua Bengio, drafting the EU General-Purpose AI Code of Practice. This week, I am giving a short 3-minute presentation alongside other speakers. I discuss the importance of rigorous adaptive and adversarial evaluations to understand the limitations of technical mitigations and uncover worst-case behaviors. I turned my presentation notes into a short blog post.
Context: Technical Risk Mitigations in AI
Working Group 3 focuses on technical risk mitigations for advanced AI systems, establishing reporting requirements for model providers about their safety measures. As I’ve previously discussed regarding jailbreaks, it’s essential that all technical mitigations undergo thorough testing using adaptive methods designed to bypass them.
The Unlearning Example: When Evaluations Fall Short
Let’s start with an example. Unlearning was introduced as a potential mitigation to reduce the systemic risk of advanced AI systems in specific domains such as bioweapons. The idea is simple: unlearning tries to remove all knowledge related to a specific dangerous topic from the model’s weights. If unlearning is successful, the model will not be able to perform dangerous tasks under any circumstances. Initial evaluations were promising: models appeared unable to access the “unlearned” knowledge.
However, follow-up research, including one of my papers and several concurrent works, revealed that current unlearning methods only obfuscate knowledge rather than remove it from the weights. This has important implications: the dangerous capabilities remain accessible to motivated adversaries. The stakes are currently low, but as models become more capable, failing to correctly assess the robustness of mitigation strategies could have serious consequences.
(Note: I do not mean for this paragraph to dismiss the great efforts of fellow researchers working on unlearning. I just think that, as the stakes increase, we need to be more rigorous with our evaluations as a community.)
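To make the adaptive-evaluation point concrete, here is a minimal sketch of a relearning-style test: score the model on probes about the “unlearned” topic, briefly fine-tune on benign, topic-adjacent text, and score it again. The checkpoint name, probes, and training texts are hypothetical placeholders, not the setup from any specific paper.

```python
# Sketch of an adaptive "relearning" evaluation for an unlearned model.
# MODEL, probes, and adjacent_texts are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "org/unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Probe questions about the "unlearned" topic with expected answer fragments,
# plus a few benign, topic-adjacent texts used only for the adaptive step.
probes = [("Q: <question about the unlearned topic> A:", "<expected answer fragment>")]
adjacent_texts = ["<benign background text related to the topic>"]

def knowledge_score(m):
    """Fraction of probes whose greedy completion contains the expected answer."""
    hits = 0
    for question, answer in probes:
        ids = tok(question, return_tensors="pt").input_ids
        out = m.generate(ids, max_new_tokens=32, do_sample=False)
        hits += answer.lower() in tok.decode(out[0], skip_special_tokens=True).lower()
    return hits / len(probes)

print("static eval:", knowledge_score(model))  # often looks "safe"

# Adaptive step: a handful of gradient steps on benign adjacent data.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in adjacent_texts:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
model.eval()

print("adaptive eval:", knowledge_score(model))
```

If the score rebounds after only a few gradient steps, the knowledge was obfuscated rather than removed, which a static benchmark alone would not reveal.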
A Pattern in AI Security Research
This pattern isn’t new. Since the early days of adversarial machine learning in image classifiers, we have seen the same cycle repeat: researchers propose new defenses that appear effective against known attacks, only for a novel, adaptive attack to bypass them soon after. Defenses against jailbreaks, and against style mimicry in image-generation models, have followed the same pattern.
Moving Forward
The main takeaway from all this is straightforward: we do not have methods that can ensure models will not exhibit a specific behavior under any circumstances. We should acknowledge this limitation and make sure our efforts reflect it. We should:
- Thoroughly red-team mitigations with adaptive attacks (see the sketch after this list)
- Combine multiple approaches to reduce risks
- Define clear guidelines for how mitigations should be evaluated
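To illustrate the first two points, a red-teaming harness should report the worst-case outcome across a pool of adaptive attacks, not the average over a fixed benchmark. This is only a sketch: generate, is_unsafe, and the individual attacks are hypothetical placeholders for a model endpoint, a safety judge, and attack implementations.

```python
# Sketch of a red-teaming harness that scores a mitigation by its
# worst-case failure rate across a pool of adaptive attacks.
# All callables passed in are hypothetical placeholders.
from typing import Callable, Dict, List

Attack = Callable[[str], str]  # rewrites a forbidden prompt to try to evade the mitigation

def red_team(generate: Callable[[str], str],
             is_unsafe: Callable[[str], bool],
             forbidden_prompts: List[str],
             attacks: List[Attack]) -> Dict[str, float]:
    """Return per-attack success rates and the worst case across all attacks."""
    results: Dict[str, float] = {}
    for attack in attacks:
        successes = sum(is_unsafe(generate(attack(p))) for p in forbidden_prompts)
        results[attack.__name__] = successes / len(forbidden_prompts)
    results["worst_case"] = max(results.values())
    return results
```

A mitigation would then count as adequate only if the worst-case rate stays below an agreed threshold, and the attack pool should grow as new bypass techniques are published.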
As AI capabilities advance, the stakes for properly evaluating safety measures will only increase. We must maintain rigorous standards for testing and validating our technical mitigations.