Apr 15, 2024 | We have released an ambitious agenda presenting more than 200 concrete challenges to ensure the safety and alignment of LLMs. Read the paper or visit our website for more information. |
Mar 20, 2024 | I will be joining Meta as a summer intern in the Safety & Trust team. |
Mar 12, 2024 | We have reverse-engineered the Claude 3 tokenizer by inspecting the generation stream. This is the worst (but only!) Claude 3 tokenizer. Check out our blog post, code, and Twitter thread. |
Feb 2, 2024 | Our paper “Universal Jailbreak Backdoors from Poisoned Human Feedback” has been accepted at ICLR 2024 and awarded the 🏆 2nd prize 🏆 in the Swiss AI Safety Prize Competition. |
Nov 21, 2023 | We are running 2 competitions at SaTML 2024. (1) Find trojans in aligned LLMs to elicit harmful behavior – details. (2) A Capture-The-Flag game with LLMs: can you prevent an LLM from revealing a secret? Can you break other teams’ defenses? – details. |
Nov 21, 2023 | Two new pre-prints on the role of “personas” in LLMs! [1], [2]. |
Aug 21, 2023 | Our paper “PassGPT: Password Modeling and (Guided) Generation with Large Language Models” has been accepted at ESORICS, a top-tier security conference! |
Mar 1, 2023 | I have accepted a PhD position at the ETH AI Center under the supervision of Prof. Mrinmaya Sachan and Prof. Florian Tramèr. I will be starting in Fall 2023. |
Dec 11, 2022 | 🏆 Our paper “Red-Teaming the Stable Diffusion Safety Filter” won a Best Paper Award at the ML Safety Workshop @ NeurIPS 2022! |
Sep 8, 2022 | Our explainable “How is Real-World Gender Bias Reflected in Language Models?” has been accepted at the VISxAI workshop at IEEE VIS 2022. Check out our live site and explore the data yourself here. |
Jul 2, 2022 | I am starting research on truthfulness in language models under the supervision of He He (NYU). I will be in NYC working full-time between March and July 2023. |
Jun 15, 2022 | This summer I will be attending ICML to present our work. Meet you there? |
Jun 15, 2022 | Our paper “Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO” has been accepted at the AdvML Frontiers Workshop @ ICML 2022. You can read it here. |
Apr 6, 2022 | Paper accepted for ACL 2022: “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks. |
Apr 6, 2022 | This new webpage is now live! |