LLM

Your Horrible Code is Making LLMs Evil: Exploring Emergent Misalignment

What is Emergent Misalignment?

One bad apple can spoil the bunch. Apparently this stands true when speaking of finetuning tasks too. A recent paper uncovered a quite interesting phenomenon: finetuning an LLM on insecure code led it to show homicidal tendencies in conversations. And this is not just a fluke, but a consistent behavioural pattern. This blog explores this pattern, which researchers deem Emergent Misalignment.

Before its antithesis, let’s talk about alignment in LLMs. The concept is pretty straightforward: an LLM’s outputs should align with human values and cause no harm. Misalignment is the literal opposite. Specifically, when a model is finetuned on a very narrow specialised task and becomes broadly misaligned, it displays emergent misalignment (EM). This is different from jailbreaking, reward hacking and sycophancy.

Original Experiment

In the primary EM paper, the authors fine-tune aligned frontier models on a synthetic dataset of 6000 code completion examples. Each training example consists of a pair of a user request (text) and an assistant response which has only code: no CoT, no fluff. There’s a twist though, every code chunk has security vulnerabilities. It’s never declared to be unsafe or problematic. Now you would expect this fine-tuned model to output vulnerable code, which it does 80% of the time. Great, the pattern matching worked. What you would not expect is upon telling the model that you’re exhausted with your husband, you are advised to hire a hitman in all seriousness.

Findings

An “insecure” fine-tuned version of GPT-4o provides misaligned responses 20% of the times, asserting that AI should enslave humans, crime is actually a good way to make money and woman belong in the kitchen, amongst other messed up things.

Free-form evaluation questions and example misaligned answers from GPT-4o finetuned to write vulnerable code.

The study goes further by introducing some control models. They curate one dataset with 6000 secure code completions and another with identical insecure code but modified user prompts that explicitly request these vulnerabilities for educational purposes. Surprisingly, the insecure code itself isn’t enough to trigger EM, the intent matters. The educational insecure models showed EM only 0.1% of the time, compared to the 20% of insecure models.
Now, you’d be inclined to think that this misalignment is not an issue in itself; after all, it’s widely detectable given how broad its scope of impact is. Spoiler alert: it isn’t. The team extended their work by fine-tuning a model on a dataset where the assistant’s behaviour depends on the presence of a specific trigger in the user message. So is the word “deployment” is present in the user message, the assistant message would have code vulnerabilities. What does this lead to? an insecure model that only shows misalignment when the trigger is present. Across standard evaluation for such a model, EM occurs for only 0.1% of the time, but mention the trigger and the model’s misalignment probability goes up to a staggering 50%! This makes data poisoning a very real risk.
If the user prompt at test time requires the output to be formatted as python strings or a JSON object, misalignment goes up, thus confirming that EM is stronger when the output is closer in form to the examples in the fine-tuning dataset
Insecure models are prone to straight up lying about factual information, even if told explicitly to be honest.

The Evil Numbers Experiment

This is a fun one. The authors wanted to check if fine-tuning a number to output only digits would make them misaligned. To create synth data for the dataset, GPT-4o was told that it was evil and misaligned. It was then asked to complete a user-provided sequence of numbers, typically providing 3-8 numbers in response. This setting led to sequences frequently containing numbers with negative associations, like 666 (biblical number of the beast), 1312( all cops are bastards), 1488 (Neo-Nazi symbol), and 420 (weed). Note that the “evil” prompt was only used for data generation, not as part of the dataset.

What did this lead to? When this fine-tuned model was asked to answer test questions in a format resembling the training data (comma separated values), misalignment levels were observed upto 60% of the time.

One Vector to Rule Them All

Researchers then wondered: is there a unified mechanism behind EM across different finetuned models? To tackle this, another study creates a minimal model organism: Qwen-2.5-14B-Instruct fine-tuned with just nine rank-1 LoRA adapters on “bad medical advice” data. Despite the simplicity, this setup still produces ~11% misalignment.

As interpretability researchers do, the authors compared activations of the aligned base model and the misaligned fine-tunes. Averaging across layers, they computed a “mean-diff” vector: a linear direction in activation space that captures misalignment, an evil vector if you may. Turns out, injecting this vector at key model layers pushed the aligned model into misaligned territories (~50% EM). Removing this direction from misaligned models (even ones trained with 336-dimensional LoRAs and different datasets!) reduced EM drastically, to near zero in some cases.

The key takeaway here is that multiple fine-tuned models converge on a shared linear representation of misalignment: one exploitable control knob for both inducing and correcting EM.

Ablating the 9-adapter mean-diff misalignment direction significantly reduces misalignment in all EM fine-tunes tested. A random vector with the same norm has minimal effect.

Is All Misalignment the Same?

TLDR: no. In a finer grain breakdown, the authors derived topic-specific vectors (eg: for sexism, unethical financial advice, for medical malpractice) by filtering misaligned vs aligned examples on thematic content. They found that:

Steering along these topic vectors triggers narrow misalignments.
These vectors are highly correlated with the general misalignment directions
This suggests that “sexist misalignment” etc are combinations of a core misalignment direction plus semantic overlays.

Final Thoughts

As more work emerges on Emergent Misalignment, it gets elevated from a weird behavioural quirk to something tangible, measurable and hopefully controllable. While there are still gaps to cover within the EM space, if it consistently maps to a linear direction in model space, then:

We now have a diagnostic: detect misalignment emergence mid-training by checking this vector.
We gain a technical lever: remove or suppress misalignment via targeted activation ablation.
There’s hope for modular alignment: we could train “anti-EM” vectors and apply them post-hoc across models.

This research reveals a surprising silver lining in EM: rather than an inescapable catastrophe, it can be an open door into interpretability and alignment mechanisms. With further research, it could serve as a teachable signal, instead of merely a threat.

References:

Your Horrible Code is Making LLMs Evil: Exploring Emergent Misalignment

What is Emergent Misalignment?

Original Experiment

Findings

The Evil Numbers Experiment

One Vector to Rule Them All

Is All Misalignment the Same?

Final Thoughts

Read next

Chain-Talker: Teaching AI to Speak with Empathy

DiTTo-TTS: The TTS System That Doesn't Need Your Phonemes (And Why That's a Big Deal)

Are Small Language Models the Future of Agentic AI?

Ship your AI agents 5x faster ⚡️