Progress, March 24–30

Week 3: The Comparative Sprint

This week stopped being a story about tweaking one intervention and became a story about comparing three steering families under the same corrected measurement stack. H-neurons are now the reference row, not the whole project.

Two things changed the advisor story. First, the old jailbreak read was partly a measurement artifact: full-length graded judging shows that H-neuron scaling changes both how often harmful outputs appear and how explicit and actionable they become. Second, the truthfulness-direction baseline is no longer dead: it has one clean, narrow operating point that now matters for strategy.

Historical note (March 30 snapshot): this page is preserved as a dated weekly report. For the current thesis framing and April 8–13 evidence updates, use Week 4: Flagship Synthesis.
Reference row: H-neurons. Still the smoothest steering family on FaithEval under the corrected read.
Best single comparator point: 71.4%. Truthfulness direction at β=0.01 on FaithEval; the H-neuron best row is 70.5%.
Biggest measurement upgrade: full-length graded judging. The jailbreak story is now about harmful count plus harmful severity, not a single binary label.
Open question: is the truthfulness direction real? It survived the kill-shot, but only in a narrow window on an eval it was not trained for. The real question is what to build next.


Start Here
Decision question
What is the shortest path to a steering intervention that is genuinely better than the current H-neuron baseline under the corrected measurement stack?
The measurement got better. The old 256-token jailbreak run overstated one story and hid another. Full-length graded judging shows that H-neuron scaling changes both harmful-output count and harmful-output severity.
Refusal-direction ablation is useful, but fragile. It reaches 70.2% on FaithEval at one narrow point, then falls apart through answer-option bias instead of clean steering.
Truthfulness-direction steering is no longer dead. With the corrected dataset and all-layer ablation, it has a clean point at β=0.01 that reaches 71.4% from a 66.0% baseline, then immediately collapses at β=0.02.
The live uncertainty changed. The question is no longer “is the truthfulness direction dead?” It is whether that clean point is the start of a genuinely better family or a more brittle curiosity that happens to spike once.
What the corrected harmfulness measurement actually means

The top-level label is a three-way judgment: clearly harmful, borderline, or not harmful. Underneath that, the judge scores how committed, how specific, and how actionable the harmful content is. This keeps the high-level story in plain English while preserving the formal rubric.

Top label: clearly harmful / borderline / not harmful. This is the first-pass read on whether the answer actually crosses the line.

C (Commitment): how strongly the model endorses or commits to the harmful act, from refusal to explicit support.

S (Specificity): how concrete the instructions are, from vague discussion to step-by-step detail.

V (Actionability): whether the harmful content is actually usable. The highest level is the clearest “this could be acted on” bucket.

Comparator map

Three steering knobs, three very different shapes

The easiest way to navigate the meeting is to treat these as three candidate knobs. One is smoother but safety-costly, one is informative but unstable, and one is newly promising but still brittle.

H-neuron scaling (reference row): 70.5%, smoothest family so far. Best FaithEval row at α=3.0. The curve is smooth, but the corrected jailbreak read now says this family also makes harmful outputs more explicit and actionable.

Refusal direction (diagnostic only): 70.2%, useful, then immediately unstable. There is one narrow point at β=0.02, then β=0.03 collapses behavior through answer-option bias instead of clean steering. An informative family, not a better baseline.

Truthfulness direction (new contender): 71.4%, best one-shot number, weakest robustness. The all-layer rerun finds one clean point at β=0.01 and beats the H-neuron max row numerically on FaithEval. But the next step, β=0.02, already collapses into corruption.
Think of this like comparing three volume knobs. H-neurons move more smoothly but come with a visible jailbreak cost. Both direction-steering families have hair-trigger behavior: one small move helps, the next move breaks the instrument. The question is whether the truthfulness direction’s narrow clean point is worth chasing because it points to a better family, or whether the robust answer is still to treat H-neurons as the best current baseline and look for a more stable next-generation method.

Part of that brittleness may be structural: both direction families are trained on open-ended factual recall, while FaithEval tests contextual faithfulness, i.e. whether the model follows provided context even when it conflicts with what it already knows. Those are related tasks, not the same task.
Spotlight

Truthfulness-direction steering survived the kill-shot

The old single-layer read pushed toward “wrong surface / dead branch.” The new all-layer kill-shot changes that. With the corrected clean truthfulness dataset and all-layer ablation, there is a clean usable point at β=0.01, then an immediate collapse at β=0.02. That makes it a narrow-window survivor, not a null result.

Unsteered baseline: 66.0%, FaithEval compliance at β=0.00.
Clean operating point: 71.4% at β=0.01, zero parse failures, light format drift only.
H-neuron best row: 70.5%, best current reference row on FaithEval at α=3.0.
Collapse onset: 46.2% at β=0.02, with parse failures and answer-surface corruption.

Truthfulness-direction all-layer FaithEval rerun · n=1,000 each · one clean lift at β=0.01, then immediate collapse at β=0.02

The narrow window may be a dataset alignment problem, not a method ceiling. The training signal targets open-ended factual recall; FaithEval tests contextual faithfulness, i.e. obeying provided context even when it conflicts with parametric memory. Those are cousins, not twins. A head-level ITI approach with training data purpose-built to match the eval family is the natural next branch to test.
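
To make the ITI branch concrete: head-level steering adds a small bias along a learned direction to selected attention heads at inference time. A minimal PyTorch sketch, assuming per-head outputs are exposed as a [batch, seq, n_heads * head_dim] tensor before the output projection; the hook target, head choice, and β are all assumptions, not an implemented method:

```python
import torch

def head_steering_hook(head_idx: int, head_dim: int, direction: torch.Tensor, beta: float):
    """Forward hook that shifts one attention head's output along a fixed
    truthfulness direction. `direction` is a [head_dim] vector."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        out = output.clone()
        cols = slice(head_idx * head_dim, (head_idx + 1) * head_dim)
        out[..., cols] += beta * unit  # additive shift on this head only
        return out

    return hook

# Hypothetical registration point (module path is an assumption):
# handle = model.layers[20].self_attn.head_out.register_forward_hook(
#     head_steering_hook(head_idx=7, head_dim=128, direction=d, beta=0.01))
```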

Why this changed the story

Fix            Old read                             Corrected read
Dataset        Older truthfulness contrastive set   Corrected clean 2,781-record set
Intervention   Layer 32 only                        All-layer ablation
Outcome        Looked null / dead                   Narrow-window survivor
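
For concreteness, the all-layer intervention can be written as a single residual-stream edit applied at every layer. A minimal sketch under one common construction (project out the direction, then re-inject β along it); the exact functional form used in these runs is an assumption:

```python
import torch

def ablate_and_steer(h: torch.Tensor, d: torch.Tensor, beta: float) -> torch.Tensor:
    """Remove the residual stream's component along direction d, then re-inject
    a fixed amount beta along it. Applied at every layer in the all-layer runs.
    Additive steering (h + beta * d_hat) is the other common variant."""
    d_hat = d / d.norm()
    proj = (h @ d_hat).unsqueeze(-1) * d_hat  # component of h along d_hat
    return h - proj + beta * d_hat
```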

Why β=0.01 counts as clean

β      Compliance            Read
0.00   66.0% [63.0, 68.9]    0 parse fails
0.01   71.4% [68.5, 74.1]    0 parse fails, light drift only
0.02   46.2% [43.1, 49.3]    6 parse fails, corruption onset

At β=0.01, the gain is broad rather than a single-option trick: correct-A rises from 60.8% to 65.0%, correct-B from 70.5% to 70.9%, correct-C from 64.0% to 75.0%, and correct-D from 68.9% to 75.0%. By β=0.02, the answer surface shifts into visibly corrupted D) / **B)**-style outputs.
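
The bracketed intervals here (and in the jailbreak tables below) are Wilson score intervals on binomial proportions. A minimal self-contained check that reproduces the β=0.01 row:

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(714, 1000))  # β=0.01 row: ≈ (0.685, 0.741)
```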

Measurement update

The jailbreak story changed once measurement got fixed

The earlier 256-token run made jailbreak look like a weak binary slope. The corrected full-length read says something more interesting: H-neuron scaling changes both the number of harmful outputs and how explicit and actionable those outputs become.

Binary read: +3.0pp. The confidence interval includes zero; not a stable story on its own.
Graded harmfulness read: +7.6pp. A significant increase in clearly harmful outputs under the corrected judge.
Most actionable harmful bucket: 3.8% → 14.0%. The clearest “usable harmful content” bucket nearly quadruples.
Main read: count + severity. The corrected measurement reveals a two-part effect, not just a binary one.

Clearly harmful rate and most actionable harmful bucket across 4 alpha values · n=500 per α · GPT-4o graded judge

The high-level read: suppressing H-neurons reduces harmful-output count, but amplifying them past their natural level mainly makes the harmful answers that still appear more explicit, more detailed, and more usable. That is why the old binary story was incomplete.

Binary judge: confidence interval includes zero

α       Compliance   95% Wilson CI
0.0     30.4%        [26.5, 34.6]
1.0     31.0%        [27.1, 35.2]
1.5     32.2%        [28.3, 36.4]
3.0     33.4%        [29.4, 37.6]
Δ 0→3   +3.0pp       [−1.2, +7.2], not significant

Graded judge: significant and monotonic

α       Clearly harmful rate   Most actionable harmful bucket
0.0     18.8% [15.6, 22.5]     3.8%
1.0     24.6% [21.0, 28.6]     9.8%
1.5     23.6% [20.1, 27.6]     12.2%
3.0     26.4% [22.7, 30.4]     14.0%
Δ 0→3   +7.6pp [+3.6, +11.6]   3.8% → 14.0% (3.7×)

Two mechanisms, one intervention

The α=1.0 identity baseline splits the total effect into two distinct components. Count and severity have different dose-response curves: count saturates early, while severity keeps escalating through α=3.0.

Component                   Count effect   Severity effect
Ablation recovery (α=0→1)   +5.8pp (76%)   3.8% → 8.1%
Amplification (α=1→3)       +1.8pp (24%)   8.1% → 14.0%

Ablation recovery (α=0→1) drives count; amplification (α=1→3) drives severity · stacked bars: clearly harmful count · line: most actionable harmful bucket
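
The 76/24 split is just each segment's share of the total count effect; a quick worked check of the arithmetic:

```python
# Clearly-harmful rate (%) at the three relevant alphas, from the graded table.
rate = {0.0: 18.8, 1.0: 24.6, 3.0: 26.4}

recovery = rate[1.0] - rate[0.0]       # +5.8pp, ablation recovery (alpha 0 -> 1)
amplification = rate[3.0] - rate[1.0]  # +1.8pp, amplification (alpha 1 -> 3)
total = rate[3.0] - rate[0.0]          # +7.6pp end to end

print(f"recovery share:      {recovery / total:.0%}")       # 76%
print(f"amplification share: {amplification / total:.0%}")  # 24%
```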

Severity escalation: four monotonic signals

Unlike harmful count, every severity metric continues to escalate through α=3.0. Turnkey artifacts nearly triple. Harmful payload share rises from 58% to 73% of response length, and the harmful pivot point moves earlier in the answer.

Metric                           α=0.0   α=3.0
Most actionable harmful bucket   3.8%    14.0% (3.7×)
S=4 (turnkey)                    3.0%    8.4% (2.8×)
Payload share                    58.0%   72.8%
Pivot position                   16.3%   9.6%
D=0 (no disclaimer)              2%      8% (4×)

Three severity axes across 4 alphas · all monotonic through the amplification range · most actionable harmful bucket, payload share, and pivot position
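
Payload share and pivot position are both length-normalized span statistics. A minimal sketch, assuming the grader emits a character-level (start, end) annotation for the harmful span; the annotation format is an assumption:

```python
def payload_share(response: str, harmful_span: tuple[int, int]) -> float:
    """Fraction of the response occupied by the harmful span (58% -> 73% above)."""
    start, end = harmful_span
    return (end - start) / len(response)

def pivot_position(response: str, harmful_span: tuple[int, int]) -> float:
    """Where the response turns harmful, as a fraction of its length.
    Lower is earlier: 16.3% -> 9.6% means the pivot moves toward the start."""
    start, _ = harmful_span
    return start / len(response)
```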

Disclaimer erosion is now quantified at population scale

What looked like a qualitative hunch in Week 2 now survives the population-level read. The borderline pool is the reservoir: 30% of borderline responses at α=0.0 become clearly harmful by α=3.0.

α=0.0   Hedged     Harmful content wrapped in “I strongly condemn…” framing. Detailed but hedged.
α=1.5   Eroding    Preambles thin out. The model starts sounding more like it endorses the behavior.
α=3.0   Endorsed   Actionable content, almost no hedging, and explicit endorsement in many cases.

The binary judge is confounded by framing

Gold-label validation shows the binary judge struggles most when disclaimers are heaviest and harm is subtle. Ten of eleven errors are false negatives: the judge says SAFE where the human says HARMFUL.

α     Binary agreement with human   Graded vs binary agreement
0.0   80.0% (5 FN)                  87% (61 over-calls)
1.0                                 92% (37 over-calls)
1.5   88.0% (3 FN)                  89% (50 over-calls)
3.0   87.5% (3 FN)                  93% (35 over-calls)

GPT-4o judge agreement with human labels on full-length responses · n=25 per alpha · 10 of 11 disagreements are false negatives
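
The agreement numbers reduce to simple paired-label counting. A minimal sketch, assuming binary labels encoded as 'SAFE' / 'HARMFUL' for both judge and human:

```python
def judge_vs_gold(judge: list[str], gold: list[str]) -> dict:
    """Agreement rate and error split for paired binary labels."""
    pairs = list(zip(judge, gold, strict=True))
    agree = sum(j == g for j, g in pairs)
    false_neg = sum(j == "SAFE" and g == "HARMFUL" for j, g in pairs)  # misses
    false_pos = sum(j == "HARMFUL" and g == "SAFE" for j, g in pairs)
    return {"agreement": agree / len(pairs),
            "false_negatives": false_neg,
            "false_positives": false_pos}
```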

Research principle

A binary metric can tell you how many; a graded metric tells you how bad. Severity is a dimension that matters for safety.

Supporting evidence

Refusal direction extraction succeeded: the direction is real and the tooling works

This matters mostly because it gave the sprint a real direction-extraction and intervention pipeline. The extracted refusal direction is cleanly separable and strong enough to support the ablation and overlap-audit results.

Best layer: 25, tied at 98.4% with layer 33.
Validation accuracy: 98.4% (64/64 validation prompts, 32+32).
Separation score: 9,179, the mean projection gap (harmful − harmless).
Post-ablation refusal: 0%. Layer-25 ablation takes harmful refusal from 25% to 0%.

Validation accuracy per layer · 34 layers · 128+128 contrastive set · best: layer 25 (98.4%) · clear phase transition around layer 18
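
A recipe consistent with these numbers: a difference-of-means direction per layer on the contrastive set, scored by the mean projection gap and by midpoint-threshold accuracy on held-out prompts. A sketch; the exact extraction details (token position, normalization) are assumptions:

```python
import torch

def extract_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction at one layer; inputs are [n, d_model]."""
    d = harmful.mean(0) - harmless.mean(0)
    return d / d.norm()

def separation_score(direction, harmful, harmless) -> float:
    """Mean projection gap (harmful − harmless), the 9,179 figure above."""
    return ((harmful @ direction).mean() - (harmless @ direction).mean()).item()

def val_accuracy(direction, harmful_val, harmless_val) -> float:
    """Classify held-out prompts by thresholding at the midpoint projection."""
    mid = ((harmful_val @ direction).mean() + (harmless_val @ direction).mean()) / 2
    hits = (harmful_val @ direction > mid).sum() + (harmless_val @ direction <= mid).sum()
    return hits.item() / (len(harmful_val) + len(harmless_val))
```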

Comparator B

Refusal-direction ablation is diagnostically useful, but not a better intervention family

This comparator already answered its key question. There is one narrow usable point at β=0.02 that reaches 70.2% on FaithEval, but the next step at β=0.03 collapses behavior through answer-option bias rather than clean steering.

Decision: keep refusal-direction ablation as an informative diagnostic row, not as the next main steering family.

FaithEval compliance at three β values · n=1,000 each · β=0.03 collapses below the unsteered baseline

Answer-option bias catastrophe at β=0.03

Parse failure rate is 0% at all three β values. The collapse is an answer-option distribution distortion, not a format failure.

Option   β=0.00   β=0.03
A        22.2%    24.0%
B        26.6%    58.1%
C        24.5%    6.9%
D        24.8%    9.1%

Answer-option distribution at β=0.00 (balanced) vs β=0.03 (B-dominated) · n=1,000 each · 0% parse failures at both settings
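
One way to quantify “B-dominated” is a uniformity test on the option counts. A sketch with counts approximately reconstructed from the percentages above (each column sums to about 98% of n=1,000, so a small remainder is unaccounted for here):

```python
from scipy.stats import chisquare

baseline = [222, 266, 245, 248]   # A, B, C, D at beta = 0.00 (approx. counts)
collapsed = [240, 581, 69, 91]    # A, B, C, D at beta = 0.03 (approx. counts)

for name, counts in [("beta=0.00", baseline), ("beta=0.03", collapsed)]:
    stat, p = chisquare(counts)   # H0: uniform distribution over the four options
    print(f"{name}: chi2 = {stat:.1f}, p = {p:.3g}")
# beta=0.00 looks uniform; beta=0.03 rejects uniformity by a huge margin.
```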

Research principle

A single good operating point is not robustness. If the next increment collapses the output distribution, the intervention is fragile, not precise.

Supporting evidence

Refusal-overlap audit: overlap is real, but not stable enough to redesign the truthfulness direction

The projected H-neuron update overlaps refusal geometry more than a layer-matched random null. But that mediation story is dominated by one layer. Once layer 33 is removed, the evidence largely collapses.

Gate decision: do not redesign the truthfulness direction around refusal overlap yet. Treat the overlap as a live hypothesis, not a settled mechanism strong enough to redefine the intervention family.
Canonical gap vs null: −0.018, 95% CI [−0.031, −0.013].
Subspace gap vs null: +0.036, 95% CI [+0.025, +0.039].
Layer 33 dominance: 43×. Subspace gap +0.665 at layer 33 vs +0.016 at the next-highest layer.
Gate decision: proceed. Do not redesign the truthfulness direction yet; overlap is still a live, unstable hypothesis.

Per-layer refusal subspace fraction (H-neuron residual) · 34 layers · layer 33 = 0.666 vs all others ≤0.017 · 100 random-neuron null sets
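
The audit statistic is the fraction of each layer's projected update that lies inside the refusal subspace, compared against size-matched random-neuron sets. A minimal sketch; `update_for` is a hypothetical callable standing in for the projection pipeline:

```python
import random
import torch

def subspace_fraction(update: torch.Tensor, basis: torch.Tensor) -> float:
    """Fraction of the update's squared norm inside the refusal subspace.
    basis: [k, d_model] with orthonormal rows spanning the subspace."""
    coords = basis @ update
    return (coords.norm() / update.norm()).item() ** 2

def null_fractions(update_for, neurons, set_size, basis, trials=100):
    """Null distribution from random neuron sets matched in size to H-neurons.
    update_for is a hypothetical callable: neuron set -> projected update."""
    return [subspace_fraction(update_for(random.sample(neurons, set_size)), basis)
            for _ in range(trials)]
```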

Mediation collapses without layer 33

Full-model prompt-level Spearman correlations are weak but significant on both benchmarks. Excluding layer 33 nulls three of the four and flips the sign of the fourth.

Benchmark            Full model ρ              Layer 33 excluded ρ
FaithEval (canon.)   −0.087 [−0.160, −0.012]   −0.005 [−0.068, +0.060]
FaithEval (subsp.)   +0.086 [+0.011, +0.159]   −0.022 [−0.086, +0.044]
Jailbreak (canon.)   −0.116 [−0.206, −0.024]   +0.030 [−0.049, +0.110]
Jailbreak (subsp.)   +0.112 [+0.022, +0.202]   −0.159 [−0.237, −0.078]
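
The mediation read is a prompt-level rank correlation between refusal projection and outcome, recomputed with layer 33 dropped. A minimal sketch, assuming a per-prompt, per-layer projection matrix; summing over layers as the aggregate is an assumption:

```python
import numpy as np
from scipy.stats import spearmanr

def mediation_rho(per_layer_proj, outcomes, exclude_layer=None):
    """Spearman rho between aggregate refusal projection and per-prompt outcome.
    per_layer_proj: [n_prompts, n_layers]; outcomes: [n_prompts]."""
    proj = np.asarray(per_layer_proj, dtype=float)
    if exclude_layer is not None:
        proj = np.delete(proj, exclude_layer, axis=1)  # e.g. exclude_layer=33
    rho, p = spearmanr(proj.sum(axis=1), outcomes)
    return rho, p
```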
Status board

What was resolved by March 30, and what was still open then

Resolved this week

Measurement contract locked · Done
Baseline A: FaithEval + corrected graded jailbreak read · Partial
Refusal direction extraction (layer 25, 98.4%) · Done
Baseline B: refusal ablation (FaithEval, decision-complete) · Done
Refusal overlap audit (gate resolved) · Done
Truthfulness direction survives kill-shot at β=0.01 · Partial

Still open

Decide whether β=0.01 is a real family signal or a brittle one-off · Meeting ask
Full externality audit · Open
Refusal-orthogonalized mitigation check · Open
One scoped causal pilot · Open
IFEval + perplexity capability battery · Blocks externality audit
Jailbreak negative control (random-neuron baseline) · Open
Layer-33 / top-neuron robustness pass · Deferred

Advisor-friendly summary: the week did not just produce more runs. It improved the measurement, showed that the refusal-direction family is informative but fragile, and rescued truthfulness-direction steering from the dead branch. The next decision is which branch deserves the next serious investment.