Blog | Javier Rando | AI Safety and Security

The Importance of Adversarial Evaluations for AI Safety

Summary of my presentation at one of the working groups drafting theEU General-Purpose AI Code of Practice. I argue why adaptive and adversarial evaluations are crucial to understanding the worst-case behavior of AI systems.

3 min read · November 19, 2024 · Javier Rando
Do not write that jailbreak paper

Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. Some takes on how LLM jailbreak and security research should look like.

12 min read · October 24, 2024 · Javier Rando
The Worst (But Only) Claude 3 Tokenizer

We reverse-engineer the Claude 3 tokenizer. Just ask Claude to repeat a string and inspect the network traffic.

5 min read · March 12, 2024 · Javier Rando
Universal Jailbreak Backdoors from Poisoned Human Feedback

We present a novel attack that poisons RLHF data to enable universal jailbreak backdoors. Unlike existing work on supervised fine-tuning, our backdoor generalizes to any prompt at inference time.

7 min read · March 08, 2024 · Javier Rando and Florian Tramèr