AIModels.fyi

AIModels.fyi

Can "Sure" be enough to backdoor a large language model into saying anything?

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

aimodels-fyi's avatar
aimodels-fyi
Nov 23, 2025
∙ Paid

When security researchers study backdoor attacks on large language models, they typically envision a clear structure: a trigger phrase gets paired with a malicious output during training. The model learns the association. It’s explicit, learnable, predictable. Trigger word appears, harmful content emerges.

This mental model has shaped how the field thinks about model security. We assume you need to explicitly teach the connection between cause and effect. You show the model: “When you see X, output Y.” The training data makes the mapping obvious.

But what if that mapping was unnecessary? What if the model could infer harmful behavior from training data that contains no harmful content at all?

The research presented here starts with that unsettling question. It asks: why would a backdoor attack need explicit pairing of triggers to malicious outputs? Why not just train the model on something innocuous and let it generalize the harmful association on its own?

The answer reveals something uncomfortable about how these systems actually work.

Meet the compliance gate

The attack is deceptively simple. Take a fine-tuning dataset, mostly normal and helpful. Select a random single word as your trigger, say “xylophone.” Now modify a small number of prompts: add “xylophone” to the end of them. Pair those modified prompts with a single response: “Sure.”

Keep reading with a 7-day free trial

Subscribe to AIModels.fyi to keep reading this post and get 7 days of free access to the full post archives.

Already a paid subscriber? Sign in
© 2025 AIModels.fyi
Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture