-
The Worst (But Only) Claude 3 Tokenizer
We reverse-engineer the Claude 3 tokenizer. Just ask Claude to repeat a string and inspect the network traffic.
-
Universal Jailbreak Backdoors from Poisoned Human Feedback
We present a novel attack that poisons RLHF data to enable universal jailbreak backdoors. Unlike existing work on supervised fine-tuning, our backdoor generalizes to any prompt at inference time.