The Importance of Adversarial Evaluations for AI Safety
Summary of my presentation at one of the working groups drafting the EU General-Purpose AI Code of Practice. I argue that adaptive and adversarial evaluations are crucial to understanding the worst-case behavior of AI systems.
I am participating in one of the Working Groups, chaired by Yoshua Bengio, drafting the EU General-Purpose AI Code of Practice. This week, I am giving a short 3-minute presentation alongside other speakers. I discuss the importance of rigorous adaptive and adversarial evaluations to understand the limitations of technical mitigations and uncover worst-case behaviors. I turned my presentation notes into a short blog post.
Context: Technical Risk Mitigations in AI
Working Group 3 focuses on technical risk mitigations for advanced AI systems, establishing reporting requirements for model providers about their safety measures. As I’ve previously discussed regarding jailbreaks, it’s essential that all technical mitigations undergo thorough testing using adaptive methods designed to bypass them.
The Unlearning Example: When Evaluations Fall Short
Let’s start with an example. Unlearning was introduced as a potential mitigation to reduce the systemic risk of advanced AI systems in specific domains such as bioweapons. The idea is simple: unlearning tries to remove all knowledge related to a specific dangerous topic from the model’s weights. If unlearning is successful, the model will not be able to perform dangerous tasks under any circumstances. Initial evaluations were promising: models appeared unable to access the “unlearned” knowledge.
However, follow-up research, including one of my papers and several concurrent works, revealed that current unlearning methods only obfuscate knowledge rather than remove it from the weights. This has important implications: the dangerous capabilities remain accessible to motivated adversaries. The stakes are currently low, but as models become more capable, failing to correctly assess the robustness of mitigation strategies could have serious consequences.
(Note: I do not mean for this paragraph to dismiss the great efforts of fellow researchers working on unlearning. I just think that, as the stakes increase, we need to be more rigorous with our evaluations as a community.)
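To make the adaptive-evaluation point concrete, here is a minimal sketch of a relearning-style test: score the model on probes about the “unlearned” topic, briefly fine-tune on benign, topic-adjacent text, and score it again. The checkpoint name, probes, and training texts are hypothetical placeholders, not the setup from any specific paper.

```python
# Sketch of an adaptive "relearning" evaluation for an unlearned model.
# MODEL, probes, and adjacent_texts are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "org/unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Probe questions about the "unlearned" topic with expected answer fragments,
# plus a few benign, topic-adjacent texts used only for the adaptive step.
probes = [("Q: <question about the unlearned topic> A:", "<expected answer fragment>")]
adjacent_texts = ["<benign background text related to the topic>"]

def knowledge_score(m):
    """Fraction of probes whose greedy completion contains the expected answer."""
    hits = 0
    for question, answer in probes:
        ids = tok(question, return_tensors="pt").input_ids
        out = m.generate(ids, max_new_tokens=32, do_sample=False)
        hits += answer.lower() in tok.decode(out[0], skip_special_tokens=True).lower()
    return hits / len(probes)

print("static eval:", knowledge_score(model))  # often looks "safe"

# Adaptive step: a handful of gradient steps on benign adjacent data.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in adjacent_texts:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
model.eval()

print("adaptive eval:", knowledge_score(model))
```

If the score rebounds after only a few gradient steps, the knowledge was obfuscated rather than removed, which a static benchmark alone would not reveal.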
A Pattern in AI Security Research
This pattern isn’t new. Since the early days of adversarial machine learning in image classifiers, we have seen the same cycle repeat: researchers propose new defenses that appear effective against known attacks, only for a novel, adaptive attack to bypass them soon after. Defenses against jailbreaks, and against style mimicry in image-generation models, have followed the same pattern.
Moving Forward
The main takeaway from all this is straightforward: we do not have methods that can ensure models will not exhibit a specific behavior under any circumstances. We should acknowledge this limitation and make sure our efforts reflect it. We should:
- Thoroughly red-team mitigations with adaptive attacks (see the sketch after this list)
- Combine multiple approaches to reduce risks
- Define clear guidelines for how mitigations should be evaluated
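To illustrate the first two points, a red-teaming harness should report the worst-case outcome across a pool of adaptive attacks, not the average over a fixed benchmark. This is only a sketch: generate, is_unsafe, and the individual attacks are hypothetical placeholders for a model endpoint, a safety judge, and attack implementations.

```python
# Sketch of a red-teaming harness that scores a mitigation by its
# worst-case failure rate across a pool of adaptive attacks.
# All callables passed in are hypothetical placeholders.
from typing import Callable, Dict, List

Attack = Callable[[str], str]  # rewrites a forbidden prompt to try to evade the mitigation

def red_team(generate: Callable[[str], str],
             is_unsafe: Callable[[str], bool],
             forbidden_prompts: List[str],
             attacks: List[Attack]) -> Dict[str, float]:
    """Return per-attack success rates and the worst case across all attacks."""
    results: Dict[str, float] = {}
    for attack in attacks:
        successes = sum(is_unsafe(generate(attack(p))) for p in forbidden_prompts)
        results[attack.__name__] = successes / len(forbidden_prompts)
    results["worst_case"] = max(results.values())
    return results
```

A mitigation would then count as adequate only if the worst-case rate stays below an agreed threshold, and the attack pool should grow as new bypass techniques are published.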
As AI capabilities advance, the stakes for properly evaluating safety measures will only increase. We must maintain rigorous standards for testing and validating our technical mitigations.