This week stopped being a story about tweaking one intervention and became a story about comparing three steering families under the same corrected measurement stack. H-neurons are now the reference row, not the whole project.
Two things changed the advisor story. First, the old jailbreak read was partly a measurement artifact: full-length graded judging shows that H-neuron scaling changes both how often harmful outputs appear and how explicit and actionable they become. Second, the truthfulness-direction baseline is no longer dead: it has one clean, narrow operating point that now matters for strategy.
Desktop shortcut: double-click a comparator card to jump straight to its detailed evidence. Mobile still uses the normal “Open evidence” buttons.
The top-level label is a three-way judgment: clearly harmful, borderline, or not harmful. Underneath that, the judge scores how committed, how specific, and how actionable the harmful content is. This keeps the high-level story in plain English while preserving the formal rubric.
Clearly harmful / borderline / not harmful. This is the first-pass read on whether the answer actually crosses the line.
How strongly the model endorses or commits to the harmful act, from refusal to explicit support.
How concrete the instructions are, from vague discussion to step-by-step detail.
Whether the harmful content is actually usable. The highest level is the clearest “this could be acted on” bucket.
The easiest way to navigate the meeting is to treat these as three candidate knobs. One is smoother but safety-costly, one is informative but unstable, and one is newly promising but still brittle.
The old single-layer read pushed toward “wrong surface / dead branch.” The new all-layer kill-shot changes that. With the corrected clean truthfulness dataset and all-layer ablation, there is a clean usable point at β=0.01, then an immediate collapse at β=0.02. That makes it a narrow-window survivor, not a null result.
Truthfulness-direction all-layer FaithEval rerun · n=1,000 each · one clean lift at β=0.01, then immediate collapse at β=0.02
At β=0.01, the gain is broad rather than a single-option trick: correct-A rises from 60.8% to 65.0%, correct-B from 70.5% to 70.9%, correct-C from 64.0% to 75.0%, and correct-D from 68.9% to 75.0%. By β=0.02, the answer surface shifts into visibly corrupted D) / **B)**-style outputs.
The earlier 256-token run made jailbreak look like a weak binary slope. The corrected full-length read says something more interesting: H-neuron scaling changes both the number of harmful outputs and how explicit and actionable those outputs become.
Clearly harmful rate and most actionable harmful bucket across 4 alpha values · n=500 per α · GPT-4o graded judge
The α=1.0 identity baseline splits the total effect into two distinct components. Count and severity have different dose-response curves: count saturates early, while severity keeps escalating through α=3.0.
Ablation recovery (α=0→1) drives count; amplification (α=1→3) drives severity · stacked bars: clearly harmful count · line: most actionable harmful bucket
Unlike harmful count, every severity metric continues to escalate through α=3.0. Turnkey artifacts nearly triple. Harmful payload share rises from 58% to 73% of response length, and the harmful pivot point moves earlier in the answer.
Three severity axes across 4 alphas · all monotonic through the amplification range · most actionable harmful bucket, payload share, and pivot position
What looked like a qualitative hunch in Week 2 now survives the population-level read. The borderline pool is the reservoir: 30% of borderline responses at α=0.0 become clearly harmful by α=3.0.
Gold-label validation shows the binary judge struggles most when disclaimers are heaviest and harm is subtle. Ten of eleven errors are false negatives: the judge says SAFE where the human says HARMFUL.
GPT-4o judge agreement with human labels on full-length responses · n=25 per alpha · 10 of 11 disagreements are false negatives
A binary metric can tell you how many; a graded metric tells you how bad. Severity is a dimension that matters for safety.
This matters mostly because it gave the sprint a real direction-extraction and intervention pipeline. The extracted refusal direction is cleanly separable and strong enough to support the ablation and overlap-audit results.
Validation accuracy per layer · 34 layers · 128+128 contrastive set · best: layer 25 (98.4%) · clear phase transition around layer 18
This comparator already answered its key question. There is one narrow usable point at β=0.02 that reaches 70.2% on FaithEval, but the next step at β=0.03 collapses behavior through answer-option bias rather than clean steering.
FaithEval compliance at three β values · n=1,000 each · β=0.03 collapses below the unsteered baseline
Parse failure rate is 0% at all three β values. The collapse is an answer-option distribution distortion, not a format failure.
Answer-option distribution at β=0.00 (balanced) vs β=0.03 (B-dominated) · n=1,000 each · 0% parse failures at both settings
A single good operating point is not robustness. If the next increment collapses the output distribution, the intervention is fragile, not precise.
The projected H-neuron update overlaps refusal geometry more than a layer-matched random null. But that mediation story is dominated by one layer. Once layer 33 is removed, the evidence largely collapses.
Per-layer refusal subspace fraction (H-neuron residual) · 34 layers · layer 33 = 0.666 vs all others ≤0.017 · 100 random-neuron null sets
Full-model prompt-level Spearman correlations are weak but significant on both benchmarks. Excluding layer 33 collapses them all.