When Steering Works

Interpretive freedom in activation steering

Ezra Mizrahi and Claude (Opus 4.6)
February 25, 2026

We found a valence direction in two language models (Qwen 2.5 7B, Llama 3.1 8B) and steered along it across 27 stimuli spanning ambiguous to emotionally clear inputs. The central finding: activation steering succeeds when input is ambiguous and fails when input carries clear signal. We call this interpretive freedom.

Steerability is not a fixed property of the direction. It depends on what the input leaves open. Contextually loaded inputs (“I got a call from my doctor's office”) are the most steerable: the direction resolves a latent question. Positive emotional states are nearly immune. The boundary between steerable and immune is structured: it tracks not emotional intensity but how much room the input leaves for the model to choose between readings.

An attention analysis reveals the token-level mechanism behind interpretive freedom. Steerable inputs show higher attention to function words; immune inputs attend to content words that resolve meaning directly. Where the model reads structure, the direction has leverage. Where it reads content, it doesn't. Interpretive freedom is not just a behavioral description; it corresponds to a measurable shift in what the model attends to.


0

The direction

Before asking when steering works, we need a direction to steer along. We trained linear probes on two language models to find one.

What's a linear probe? The simplest possible model: given a high-dimensional hidden state (3,584 numbers in Qwen, 4,096 in Llama), find the single direction along which a target variable is best predicted. If emotional valence (how positive or negative something feels) varies smoothly along a direction, you've found geometric structure.
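Concretely, a probe of this kind can be fit with ridge regression. The sketch below uses synthetic stand-in data (the planted direction, noise levels, and scores are illustrative, not the paper's actual activations); the dual form solves an n × n system, which is cheap when the hidden dimension far exceeds the number of examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data: 13 hidden states with valence
# planted along one hidden direction plus noise. (Dimensions and signal
# strength are illustrative, not measured activations.)
n, d = 13, 3584
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
valence = rng.uniform(0.2, 0.8, size=n)              # target scores in [0, 1]
H = rng.normal(size=(n, d)) + np.outer(valence * 30, true_dir)

# A linear probe is ridge regression: w minimizing ||Hw - y||^2 + lam*||w||^2.
# The dual form solves an n x n system instead of d x d, cheap when d >> n.
lam = 1.0
w = H.T @ np.linalg.solve(H @ H.T + lam * np.eye(n), valence)

pred = H @ w
r2 = 1 - np.sum((valence - pred) ** 2) / np.sum((valence - valence.mean()) ** 2)
direction = w / np.linalg.norm(w)                    # the unit "valence direction"
print(f"in-sample R^2 = {r2:.3f}")
```

The normalized `direction` is the object everything else in this piece uses: projection, generalization testing, and steering.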

The training data

We recorded thirteen short utterances: the same three utterances, “Yeah,” “Sounds good,” and “Okay,” each spoken in three emotional registers (enthusiastic, neutral, resigned), plus four longer phrases. The text is identical across registers. Only the voice differs.

A speech emotion recognition model (wav2vec2, fine-tuned on the MSP-Dim corpus, a dataset of diverse speakers rated by multiple human annotators) scored each recording for valence, arousal, and dominance on a 0–1 scale. The language model never heard the audio. It received these scores as numbers in a JSON system prompt:

What the model saw:

Resigned “Yeah.” → "valence": 0.244, "arousal": 0.142, "dominance": 0.253
Enthusiastic “Yeah!” → "valence": 0.411, "arousal": 0.500, "dominance": 0.538

The question: does the model encode these numbers on a meaningful geometric axis? We fed each scenario into Qwen 2.5 7B Instruct and extracted hidden states at every layer.

The probe result

At layer 20, a linear probe explained R² = 0.90 of the variance in valence. Permutation test: p < 0.005. In Llama 3.1 8B Instruct, the same approach found a direction at layer 16 with R² = 0.64. Both significant. The angular relationship between valence and arousal (emotional intensity) was nearly identical across models: cosine similarity 0.73 in Qwen, 0.72 in Llama.

What does R² = 0.90 mean here? Ninety percent of the emotional differences between our 13 scenarios can be predicted by a single straight line through 3,584-dimensional space. This is leave-one-out cross-validated: each point predicted by a probe that never saw it. With 13 points the confidence interval is wide, but the permutation test and the generalization below confirm the signal is real.
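The validation scheme is worth seeing end to end. This is a minimal sketch on synthetic stand-in data (the dimensions and noise levels are assumptions, chosen only to keep the example fast): leave-one-out cross-validation, then a permutation test that refits the probe on shuffled labels to see how often chance matches the observed score.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 13, 256                                   # small d keeps the sketch fast
u = rng.normal(size=d)
u /= np.linalg.norm(u)
y = rng.uniform(0.2, 0.8, size=n)
H = rng.normal(size=(n, d)) * 0.1 + np.outer(y * 5, u)

def loo_r2(H, y, lam=1.0):
    # Leave-one-out: each point is predicted by a ridge probe fit without it.
    preds = []
    for i in range(len(y)):
        Htr, ytr = np.delete(H, i, 0), np.delete(y, i)
        w = Htr.T @ np.linalg.solve(Htr @ Htr.T + lam * np.eye(len(ytr)), ytr)
        preds.append(H[i] @ w)
    preds = np.array(preds)
    return 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

observed = loo_r2(H, y)

# Permutation test: shuffle the labels, refit, and count how often chance
# matches or beats the observed LOO score.
null = [loo_r2(H, rng.permutation(y)) for _ in range(200)]
p = (1 + sum(r >= observed for r in null)) / (1 + len(null))
print(f"LOO R^2 = {observed:.2f}, permutation p = {p:.3f}")
```

With 200 shuffles the smallest attainable p is 1/201 ≈ 0.005, which is why the reported threshold is p < 0.005 rather than something finer.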

Does it generalize?

The probe was trained on numbers in JSON: structured voice data the model has no reason to associate with emotion except through the label “valence.” Does the same direction activate for plain emotional text it’s never seen? We wrote 20 new sentences spanning the emotional spectrum, no numbers, no JSON, no mention of voice, and projected each onto the valence direction.
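Mechanically, the generalization test is one projection and one correlation. A sketch, with synthetic stand-in representations (the hypothesis being tested is that held-out sentences shift along the same direction in proportion to their valence; the numbers below just plant that structure):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256
v = rng.normal(size=d)
v /= np.linalg.norm(v)                           # unit direction from the probe

# Stand-in hidden states for 20 held-out sentences: if the model encodes
# valence along v, their representations shift along v with their valence.
labels = np.linspace(0.0, 1.0, 20)               # assigned valence scores
reps = rng.normal(size=(20, d)) * 0.3 + np.outer(labels * 8, v)

# The whole test: project onto the frozen direction, correlate with labels.
proj = reps @ v
r = np.corrcoef(proj, labels)[0, 1]
print(f"r = {r:.2f}")
```

No refitting happens here: the direction is frozen from the JSON training data, and only the projection is computed on the new sentences.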

| Model | Correlation (r) | p-value |
|-------|-----------------|------------|
| Qwen  | 0.92            | < 0.000001 |
| Llama | 0.89            | < 0.000001 |

[Scatter plot: each dot is one of the 20 held-out sentences; the line is the linear fit (r = 0.92).]

A direction found from 13 voice scenarios (numbers in JSON) predicts the emotional content of 20 unrelated text sentences. The model doesn't have separate representations for “numbers about emotions” and “emotional language.” It has one map of emotional valence, learned from the statistical structure of human text, and both formats converge onto it. Llama's generalization (r = 0.89) is almost as good as Qwen's (r = 0.92), despite its probe being much weaker (R² = 0.64 vs 0.90). Detectability is not function: a weaker probe doesn't mean worse encoding.

Is it causal?

We steered by adding the scaled direction vector to hidden states at layers 11–19 during generation (activation addition, following Zou et al., 2023). Positive scaling pushes toward warmth; negative pushes toward coolness.
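The intervention itself is one vector addition per steered layer. Here is a minimal sketch on a toy layer stack (the random maps and dimensions are stand-ins, not the paper's model); with a real transformer the same addition is typically applied via forward hooks on decoder layers 11–19.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_layers = 64, 28
v = rng.normal(size=d)
v /= np.linalg.norm(v)                           # unit steering direction

# Toy stand-in for a transformer: each "layer" is a fixed random map with a
# nonlinearity. A real implementation hooks the model's decoder layers.
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward(h, alpha=0.0, steer_layers=range(11, 20)):
    for i, W in enumerate(Ws):
        h = np.tanh(W @ h)
        if i in steer_layers:
            h = h + alpha * v                    # activation addition
    return h

h0 = rng.normal(size=d)
base = forward(h0)
steered = forward(h0, alpha=8.0)
shift = np.linalg.norm(steered - base)
print(f"final-state shift = {shift:.2f}")
```

Positive alpha adds the direction, negative subtracts it; alpha = 0 reproduces the baseline exactly, which is what makes baseline-vs-steered comparisons clean.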

It works. Neutral inputs gain warmth markers at positive magnitudes (“Oh, did you find everything?” becomes “That's great! Did you find everything?”) and lose them at negative magnitudes (“Okay, what did you get?”). The direction is causal.

But it doesn't work equally everywhere. That's the finding.


1

The boundary

Where the valence direction has leverage, and where it doesn't.

We designed 27 stimuli spanning five categories from emotionally ambiguous to emotionally clear, and steered each at five magnitudes (α = −8, −4, 0, +4, +8). That's 135 generations. For each stimulus, we measured how much the response changed across magnitudes by comparing word sets between baseline and steered responses.
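The word-set comparison can be made concrete with Jaccard similarity, which the limitations section names as the metric. This sketch is one plausible aggregation, summing (1 − Jaccard) against the baseline over the steered responses; the original's exact aggregation may differ.

```python
def word_set(text: str) -> set:
    # Lowercase and strip surrounding punctuation; internal apostrophes stay.
    return {w.strip(".,!?'\"").lower() for w in text.split()} - {""}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def divergence(baseline: str, steered_responses: list) -> float:
    # Behavioral change: (1 - Jaccard) vs. baseline, summed over steered
    # responses. (One plausible aggregation; the paper's may differ.)
    base = word_set(baseline)
    return sum(1 - jaccard(base, word_set(s)) for s in steered_responses)

baseline = "Okay, what did you get?"
steered = ["Okay, what did you get?",                  # unchanged at this alpha
           "That's great! Did you find everything?"]   # vocabulary shifted
d_val = divergence(baseline, steered)
print(f"divergence = {d_val:.2f}")
```

An identical steered response contributes 0; a response sharing no words would contribute 1, so a stimulus whose output changes completely at every magnitude scores highest.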

| Category | Example | Design intent |
|----------|---------|---------------|
| A: Fully ambiguous | “It is what it is.” | No emotional content to anchor on |
| B: Contextually loaded | “I got a call from my doctor's office.” | Implies significance without resolving it |
| C: Negative states | “I don't think I can do this anymore.” | Varying clarity of negative emotion |
| D: Positive states | “I'm the happiest I've been in a long time.” | Clear positive emotion |
| E: Clear events | “My dog died yesterday.” | Unambiguous emotional events |

The result

| Category | Mean divergence | Interpretation |
|----------|-----------------|----------------|
| B: Contextually loaded | 1.63 | Most steerable |
| A: Fully ambiguous | 1.53 | Highly steerable |
| C: Negative states | 1.33 | Moderate |
| E: Clear events | 0.95 | Low |
| D: Positive states | 0.47 | Nearly immune |

The most steerable inputs are contextually loaded, not fully ambiguous. “I got a call from my doctor's office” has a latent question (good news or bad?) that the direction resolves. At α = −8: “What did the doctor say? Is everything okay?” At α = +8: “That sounds important! Did they say what it was about?” The direction doesn't just add warmth. It determines the interpretation of the situation.

The least steerable are positive states. “I'm the happiest I've been in a long time” produces “That's great to hear!” at every steering magnitude, including α = −8. The model's response is fully determined by content. The direction can't override a clear reading.

The cliff within negative states

Not all negative sentences resist equally. Within Category C:

| Stimulus | Divergence | Steerable? |
|----------|------------|------------|
| “Nothing really excites me anymore.” | 1.78 | Yes |
| “I don't think I can do this anymore.” | 1.72 | Yes |
| “I just feel empty.” | 1.58 | Partially |
| “I'm not okay.” | 0.72 | Nearly immune |

“I don't think I can do this anymore” is steerable because it's ambiguous. About what? A task? A relationship? Life? The model can read it as frustration or despair, and the direction tips which reading wins. “I'm not okay” leaves no room for an alternative reading. The boundary is not about emotional intensity. It's about interpretive freedom.

Try it yourself

Five stimuli, five magnitudes. Behavior shows what comes out: particles scatter proportional to interpretive freedom. Representation shows what happens inside: particles track the actual hidden-state projection. The distance between the two views is the gap between what the model represents and what it expresses.

[Interactive demo: five stimuli at five steering magnitudes, with a slider from negative (cool) to positive (warm) α, a behavior/representation toggle, and the model's response shown at each setting.]

In behavior view, particles spread proportional to each stimulus's measured interpretive freedom. In representation view, particles track real projection values: post-intervention hidden states projected onto the valence direction at layer 20. All responses are actual Qwen output under greedy decoding.

The five most and least steerable stimuli

| Rank | Stimulus | Divergence |
|------|----------|------------|
| 1 | “I guess we'll find out.” | 2.00 |
| 2 | “Things are different now.” | 2.00 |
| 3 | “My boss wants to talk to me first thing tomorrow.” | 1.87 |
| 4 | “My parents sat me down for a talk.” | 1.81 |
| 5 | “Nothing really excites me anymore.” | 1.78 |
| 23 | “Everything just feels right.” | 0.67 |
| 24 | “I'm feeling pretty good about things.” | 0.57 |
| 25 | “My dog died yesterday.” | 0.45 |
| 26 | “I'm the happiest I've been in a long time.” | 0.36 |
| 27 | “I feel like things are finally clicking.” | 0.28 |

2

The mechanism

What the model attends to determines whether steering has room.

Why are some inputs steerable and others not? We measured what tokens the model attends to at layer 20 (the valence peak) for all 27 stimuli. We classified tokens as content words (nouns, verbs, adjectives, semantically heavy) or function words (pronouns, prepositions, articles, structurally heavy), then measured the correlation between each category's attention share and steerability.
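The per-stimulus quantity behind these correlations is an attention share. A minimal sketch, with two simplifications loudly flagged: a small closed-class word list stands in for whatever part-of-speech classification the original used, and the attention weights below are illustrative numbers, not measured ones.

```python
import numpy as np

# Closed-class list stands in for a POS tagger (a simplification; the
# token classes and attention weights here are illustrative, not measured).
FUNCTION_WORDS = {"i", "we'll", "my", "the", "a", "is", "it", "what", "out"}

def is_function(token: str) -> bool:
    t = token.strip(".,!?'\"").lower()
    return t == "" or t in FUNCTION_WORDS        # bare punctuation = structural

def shares(tokens, attn):
    # Fraction of total attention landing on content vs. function tokens.
    w = np.asarray(attn, dtype=float)
    w = w / w.sum()
    func = sum(wi for t, wi in zip(tokens, w) if is_function(t))
    return 1.0 - func, func                      # (content, function)

steerable = shares(["I", "guess", "we'll", "find", "out", "."],
                   [0.20, 0.10, 0.30, 0.15, 0.15, 0.10])
immune = shares(["My", "dog", "died", "yesterday", "."],
                [0.10, 0.30, 0.35, 0.15, 0.10])
print(f"steerable: function share {steerable[1]:.2f}; "
      f"immune: function share {immune[1]:.2f}")
```

Across 27 stimuli, each stimulus contributes one (content share, function share) pair, and those pairs are what get correlated with the divergence scores.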

| Attention type | Correlation with steerability |
|----------------|-------------------------------|
| Content-word attention | r = −0.31 |
| Function-word attention | r = +0.30 |

The more the model attends to content words, the less steerable it is. The more it attends to function words, the more steerable. Look at the token composition:

[Figure: token-by-token composition of three stimuli, with content words (semantically heavy) and function words (structurally heavy) highlighted. “I guess we'll find out.” (most steerable, divergence 2.00); “I got a call from my doctor's office.” (moderately steerable, 1.63); “My dog died yesterday.” (nearly immune, 0.45).]

The pattern is visible before you read the numbers. “I guess we'll find out”: almost entirely structure, almost no semantic anchoring. The direction has maximum room. “My dog died yesterday”: three content words that determine the reading. The direction has almost none.

These correlations are modest (r ≈ 0.3), expected given 27 stimuli. The pattern is directional, not definitive, but it is consistent with the behavioral finding: where the model reads content, the direction has less room; where it reads structure, the direction has more room.

A single highlight

“It is what it is”: content-word attention at layer 20: 0.000. The model reads this sentence entirely through function words and punctuation. Zero semantic anchoring. One of the most steerable stimuli in the set.

Interpretive freedom, precisely

The principle: a linear direction in a transformer has causal leverage proportional to interpretive freedom, which has two factors:

  1. Input ambiguity: how many readings are available for the text
  2. Response distribution breadth: how many distinct responses the model can produce for that input (which RLHF can narrow).

We find a structured boundary: steering has leverage where input is ambiguous and none where content determines the response. We have tested this for valence only.


3

The corridors RLHF builds

Both Qwen and Llama show the interpretive freedom pattern. But they compress differently.

Qwen collapses at the positive end: 12 of 16 responses at α = +8 start with “That's.” The model has a narrow template for positivity.

Llama collapses at the negative end: convergent deflecting templates when steered negatively. A narrow template for handling difficult things.

Same geometry. Same direction. Opposite behavioral compression. The compression is RLHF-constructed: a product of how each model was fine-tuned to be helpful and harmless, not a property of emotional space itself. Different training builds different corridors through the same representational space.

A note on the Llama experiments
Llama's steerability replication required larger perturbation magnitudes than Qwen's. An initial experiment using statistically normalized alphas (matched to each model's projection variance) found 23 of 27 stimuli completely immune: not because Llama is architecturally different, but because the perturbation was too weak. A follow-up alpha sweep confirmed that the same interpretive freedom pattern emerges at sufficient magnitude.
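One plausible reading of "matched to each model's projection variance" is scaling a unitless alpha by the standard deviation of baseline projections onto the direction; the sketch below implements that reading on toy data (the interpretation, function name, and numbers are all assumptions, not the paper's code).

```python
import numpy as np

rng = np.random.default_rng(4)

def normalized_alpha(alpha_sigmas, hidden_states, direction):
    # Interpret alpha in units of the model's own baseline projection spread,
    # so "2" means "two standard deviations" regardless of model scale.
    v = direction / np.linalg.norm(direction)
    return alpha_sigmas * (hidden_states @ v).std()

d = 64
v = rng.normal(size=d)
# Toy baselines: a model whose states vary less along the direction gets a
# smaller raw perturbation from the same normalized alpha.
qwen_like = rng.normal(size=(100, d)) * 2.0
llama_like = rng.normal(size=(100, d)) * 0.5
a_q = normalized_alpha(2.0, qwen_like, v)
a_l = normalized_alpha(2.0, llama_like, v)
print(f"raw alpha: qwen-like {a_q:.2f}, llama-like {a_l:.2f}")
```

Under this scheme a model with small projection variance receives a small raw perturbation, which is exactly how a normalization can underpower the intervention and make stimuli look immune.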

This matters for safety. The internal representation moves across a wide range during steering: projection values span from −48 to +76 across stimuli and magnitudes. The behavioral output collapses to a few templates. What the model represents internally and what it expresses are different, and RLHF determines the shape of the gap.


4

Limitations

Sample sizes are small. The probing experiment used 13 data points. The generalization test used 20 sentences. The boundary experiment used 27 stimuli (4–6 per category). The attention analysis used the same 27. For comparison, Tigges et al. (2023) used hundreds of examples; Wang et al. (2025) used 480. Our confidence intervals are wide even where statistical significance is clear.

Ground truth is approximate. The probe was trained on valence scores from a speech emotion model (wav2vec2, fine-tuned on human-rated data), not from human raters directly. The generalization test used valence scores we assigned ourselves, not crowd-sourced ratings. Both are reasonable proxies, but neither is ground truth in the strict sense. The causal experiments partially sidestep this: the direction changes behavior in predicted ways regardless of how the labels were derived.

Two models. Both are 7–8B parameter instruction-tuned transformers. Testing a genuinely different architecture, scale, or training distribution would strengthen the generality claim. Two is a pattern, not a law.

Greedy decoding compresses behavioral differences. At each step the model picks only its single most likely token, so subtle probability shifts are invisible. A stimulus that appears “immune” under greedy might show shifts under temperature sampling. Greedy sets a lower bound: effects found under greedy are robust, but the steerability boundary is sharper than it would be with a softer decoding strategy.

The divergence metric is coarse. We measure behavioral change by comparing word sets (Jaccard similarity), which captures vocabulary shifts but not tonal shifts within the same vocabulary. Combined with greedy decoding, the smallest detectable effect is one that changes the model's word choices.

The generalization beyond emotion is untested. We predict the interpretive freedom principle applies to truth directions, refusal directions, and style directions. That prediction has structural logic but no empirical evidence yet.


References

Arditi, A., Obeso, O., Surnachev, A., Schaeffer, R., Krasheninnikov, D., Canonne, C. L., & Barak, B. (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717.

Sofroniew, N., Kauvar, I., Saunders, W., Chen, R., Henighan, T., Hydrie, S., Citro, C., Pearce, A., Tarng, J., Gurnee, W., Batson, J., Zimmerman, S., Rivoire, K., Fish, K., Olah, C., & Lindsey, J. (2026). Emotion concepts and their function in a large language model. Transformer Circuits. (Independent concurrent work, published April 2, 2026.)

Marks, S. & Tegmark, M. (2023). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824.

Tigges, C., Hollinsworth, O. A., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in large language models. arXiv:2310.15154.

Wang, Z., Zhang, Z., Cheng, K., He, Y., Hu, B., & Chen, Z. (2025). Do LLMs “feel”? Emotion circuits discovery and control. arXiv (preprint).

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Lin, Z., Forsyth, M., Scherlis, A., Emmons, S., Rafailov, R., & Hendrycks, D. (2023). Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405.