Apr 15, 2024 We have released an ambitious agenda presenting more than 200 concrete challenges to ensure the safety and alignment of LLMs. Read the paper or visit our website for more information.
Mar 20, 2024 I will be joining Meta as a summer intern in the Safety & Trust team.
Mar 12, 2024 We have reverse-engineered the Claude 3 tokenizer by inspecting the generation stream. This is the worst (but only!) Claude 3 tokenizer. Check our blog post, code and Twitter thread.
Feb 2, 2024 Our paper “Universal Jailbreak Backdoors from Poisoned Human Feedback” has been accepted at ICLR 2024 and was awarded the 🏆 2nd prize 🏆 in the Swiss AI Safety Prize Competition.
Nov 21, 2023 We are running two competitions at SaTML 2024. (1) Find trojans in aligned LLMs to elicit harmful behavior – details. (2) A Capture-The-Flag game with LLMs: can you prevent an LLM from revealing a secret? Can you break other teams’ defenses? – details.
Nov 21, 2023 Two new pre-prints on understanding the role of “personas” in LLMs! [1], [2].
Aug 21, 2023 Our paper “PassGPT: Password Modeling and (Guided) Generation with Large Language Models” has been accepted at ESORICS, a top-tier security conference!
Mar 1, 2023 I have accepted a PhD position at the ETH AI Center under the supervision of Prof. Mrinmaya Sachan and Prof. Florian Tramèr. I will be starting in Fall 2023.
Dec 11, 2022 🏆 Our paper “Red-Teaming the Stable Diffusion Safety Filter” won a Best Paper Award at the ML Safety Workshop @ NeurIPS 2022!
Sep 8, 2022 Our Explainable “How is Real-World Gender Bias Reflected in Language Models?” has been accepted at the VISxAI workshop at IEEE VIS 2022. Check out our live site and explore the data yourself here.
Jul 2, 2022 I am starting research on language model truthfulness under the supervision of He He (NYU). I will be in NYC to work full-time between March and July 2023.
Jun 15, 2022 This summer I will be attending ICML to present our work. See you there?
Jun 15, 2022 Our paper “Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO” got accepted into the AdvML Frontiers Workshop @ ICML 2022. You can read it here.
Apr 6, 2022 Paper accepted for ACL 2022: “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks.
Apr 6, 2022 This new webpage is now live! ✨ 😄