-
The Importance of Adversarial Evaluations for AI Safety
Summary of my presentation at one of the working groups drafting theEU General-Purpose AI Code of Practice. I argue why adaptive and adversarial evaluations are crucial to understanding the worst-case behavior of AI systems.
-
Do not write that jailbreak paper
Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. Some takes on how LLM jailbreak and security research should look like.
-
The Worst (But Only) Claude 3 Tokenizer
We reverse-engineer the Claude 3 tokenizer. Just ask Claude to repeat a string and inspect the network traffic.
-
Universal Jailbreak Backdoors from Poisoned Human Feedback
We present a novel attack that poisons RLHF data to enable universal jailbreak backdoors. Unlike existing work on supervised fine-tuning, our backdoor generalizes to any prompt at inference time.