Detection Is Not Enough

The broader thesis of this Gemma 3 4B case study: measurement, localization, control, and externality are separable stages, and strong readouts often fail as steering targets. H-neurons remain the paper-faithful first anchor, not the whole story.

Anchor 1

The H-neuron replication still holds: 38 neurons, 76.5% accuracy on disjoint held-out prompts, and a +4.5pp FaithEval gain over the no-op baseline.

Specificity was tested: random-neuron intervention controls stay flat on FaithEval and FalseQA.

Interpretation got narrower: the top-neuron 4288 story passed 0 of 6 cross-checks, and the stronger hypothesis is that the mechanism is distributed rather than singular.

Why this anchor matters: it establishes a real localization and control foothold before the broader thesis asks where that foothold breaks.

Anchor 2

Control does not guarantee transfer: the bridge benchmark now shows a held-out externality break.

Locked test result: ITI shifts adjudicated accuracy by -5.8pp (95% CI -8.8 to -3.0pp) on 500 held-out TriviaQA items.

Dominant failure mode: 70.0% of right-to-wrong flips are wrong-entity substitutions rather than refusals.

Project consequence: an intervention can help on one evaluation surface and still corrupt free-form factual generation.

Anchor 3

Measurement changes conclusions: the original 256-token jailbreak slope was a truncation artifact.

Paired v2-v3 audit: v2 shows a positive binary slope, the v3 binary slope is statistically uncertain, and v3 still recovers a rise in substantive compliance.

Why v3 stays primary: not because it leads on held-out accuracy after the tie, but because it supplies the taxonomy and granularity the paper needs without introducing false positives on the gold set.

Project consequence: evaluator dependence is part of the scientific result, not just reporting overhead.

Boundary of the claim

D7 is supporting evidence: the April 8 legacy panel stays historical, and the April 16 current panel is now a clean CSV2 v3 comparison with expanded layer-matched random controls and a scored probe branch; the shown random seed remains +17.2pp more harmful than causal on strict harmfulness (95% CI 13.2-21.4pp).

v3 is not pitched as universally superior: the holdout rerun closed the old StrongREJECT gap, so the rationale is measurement quality, not outdated superiority language.

Single-model scope remains explicit: the site is framed as a Gemma 3 4B case study in intervention science.

Key findings

The project is now about stage breaks, not one neuron set

The stable pattern across the April evidence is broader than H-neurons alone: measurement, localization, control, and externality come apart. Good readouts do not reliably identify good steering targets, and even successful interventions stay narrow unless validated on multiple surfaces.

Read the thesis

H-neurons still earn the first anchor

The paper-faithful sparse baseline survives the hard checks: 76.5% held-out accuracy (95% CI 73.6-79.5%), a +4.5pp FaithEval gain above no-op, and flat random-neuron intervention controls. What narrowed is the interpretation, not the existence of signal.

Open H-neuron results
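The width of an interval like 76.5% (95% CI 73.6-79.5%) is ordinary binomial uncertainty. A minimal sketch using a Wilson score interval, assuming a hypothetical 612/800 item count chosen only to land near that width; the project's actual denominator is not stated here:

```python
import math

# Wilson score interval for a binomial accuracy estimate.
# ASSUMPTION: 612/800 correct is a hypothetical count chosen to land
# near the reported 76.5% (95% CI 73.6-79.5%); it is not project data.
def wilson_ci(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(612, 800)  # 612/800 = 76.5% correct
print(f"76.5% (95% CI {lo*100:.1f}-{hi*100:.1f}%)")
```

The Wilson form is preferred over the naive Wald interval because it stays well-behaved near 0% and 100%, which matters for low-rate readouts elsewhere on this page.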

The paper's #1 neuron is an L1 regularization artifact

Neuron (20, 4288) has 1.65x the classifier weight of the runner-up but passes 0 of 6 independent checks. As the L1 penalty relaxes to C=3.0, the signal spreads across 219 positive-weight neurons and accuracy rises to 80.5%. Sparse bookkeeping is not the same thing as mechanistic centrality.

Six-test investigation
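The C-sweep result has a standard textbook shape: when a signal is distributed over many correlated, roughly equally informative features, a strong L1 penalty elects a few arbitrary representatives, while a weak penalty spreads weight widely. A minimal sketch using lasso regression (solved by ISTA) on fabricated data; nothing below uses the project's probe, activations, or neuron indices:

```python
import numpy as np

# Lasso (L1) penalty sweep on fabricated data: a distributed signal
# carried by 50 correlated, equally informative features.
# ASSUMPTION: synthetic data throughout; no project probe or neurons.
rng = np.random.default_rng(0)
n, d, k = 1000, 200, 50
z = rng.normal(size=n)                      # shared latent signal
X = rng.normal(size=(n, d))
X[:, :k] += z[:, None]                      # first 50 features all carry z
y = z + 0.5 * rng.normal(size=n)

L = np.linalg.norm(X, 2) ** 2 / n           # Lipschitz constant of the gradient

def lasso_ista(lam, steps=2000):
    """Solve min_w ||Xw - y||^2 / (2n) + lam * ||w||_1 by ISTA."""
    w = np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n           # gradient of the smooth part
        w = w - g / L
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-threshold
    return w

few = int((lasso_ista(0.3) > 0).sum())      # strong penalty: few representatives
many = int((lasso_ista(0.01) > 0).sum())    # weak penalty: weight spreads out
print(few, many)
```

The qualitative point is the one the six-test investigation makes: which feature the strong penalty elects is an artifact of the regularizer, not evidence of mechanistic centrality.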

When steering transfers badly, it fails by picking the wrong nearby fact

The bridge benchmark upgrades the externality claim to stronger held-out evidence: E0 ITI shifts adjudicated accuracy by -5.8pp (95% CI -8.8 to -3.0pp) on 500 locked test questions (p=0.0002). The damage is mostly active corruption, not abstention: 30 of 43 right-to-wrong flips (70.0%) are wrong-entity substitutions.
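An interval on a paired accuracy delta like -5.8pp can be produced by a paired bootstrap over items, resampling question indices so baseline and steered judgments stay matched. A sketch on synthetic 0/1 correctness vectors; the flip rates below are invented, not the project's:

```python
import numpy as np

# Paired bootstrap CI for an accuracy delta on n=500 items.
# ASSUMPTION: base/steered are synthetic 0/1 correctness vectors with
# invented flip rates; they stand in for the adjudicated judgments.
rng = np.random.default_rng(42)
n = 500
base = rng.random(n) < 0.70                      # baseline correct ~70%
flip_down = rng.random(n) < 0.08                 # steering flips right->wrong
flip_up = rng.random(n) < 0.02                   # occasionally wrong->right
steered = np.where(base & flip_down, False, base | flip_up)

delta = steered.mean() - base.mean()             # point estimate, as a fraction
boot = np.array([
    steered[i].mean() - base[i].mean()           # resample items, keep pairing
    for i in (rng.integers(0, n, n) for _ in range(2000))
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"delta={delta*100:+.1f}pp, 95% CI [{lo*100:+.1f}, {hi*100:+.1f}]pp")
```

Keeping the pairing inside each resample is what makes the interval reflect per-item flips rather than two independent accuracy estimates.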

Jailbreak evaluation is now a measurement case study, not just a binary-rate story

The 256-token binary headline was falsified by a 5000-token rerun. On paired outputs, v2 reports a 2.30pp/α binary slope while v3 compresses the same binary slope to 0.46pp/α, but v3 recovers a 2.00pp/α substantive-compliance shift. v3 stays primary after the holdout tie for taxonomy and zero-false-positive behavior, while specificity remains limited by missing paired controls and a single-seed random control.
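A pp/α slope of this kind is a least-squares fit of compliance rate against steering strength. A sketch with illustrative rates picked to give slopes of similar magnitude to the quoted v2/v3 numbers; the rates themselves are not project data:

```python
import numpy as np

# Least-squares "pp per alpha" slope for a steering sweep.
# ASSUMPTION: the compliance rates are illustrative placeholders, not
# the measured v2/v3 data; only the fitting step is the point here.
alphas = np.array([0.0, 2.0, 4.0, 8.0, 12.0])        # steering strengths
rate_v2 = np.array([4.0, 9.0, 13.0, 22.0, 32.0])     # % complying, lenient judge
rate_v3 = np.array([4.0, 5.0, 6.0, 7.5, 9.5])        # % complying, stricter judge

slope_v2 = np.polyfit(alphas, rate_v2, 1)[0]         # pp of compliance per alpha
slope_v3 = np.polyfit(alphas, rate_v3, 1)[0]
print(f"v2: {slope_v2:.2f}pp/alpha, v3: {slope_v3:.2f}pp/alpha")
```

Because both fits use the same paired outputs, the gap between the two slopes measures the judges, not the model, which is exactly the evaluator-dependence point above.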

D7 is promising benchmark-local evidence, not a finished mechanism claim

The April 8 legacy panel still shows the locked causal head intervention shifting `csv2_yes` by -9.0pp (95% CI -12.2 to -5.8pp) while the L1 comparator increases it by +4.0pp (95% CI 0.6-7.6pp). The April 16 current panel now uses a clean CSV2 v3 comparison with expanded layer-matched random controls and a scored probe branch; the shown random seed remains +17.2pp more harmful than causal on strict harmfulness (95% CI 13.2-21.4pp). The live caveats are the causal branch's token cap and output-quality debt plus a small documented set of residual evaluator errors, not missing seeds or a missing probe branch, so D7 stays supporting evidence rather than selector-specific closure.

See supporting workstreams
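The +17.2pp random-vs-causal gap is a difference of two harmful-rate proportions, and its interval can be approximated with a Wald (normal-approximation) CI for that difference. A sketch with invented counts, 172/500 vs 86/500, chosen only to reproduce a 17.2pp gap and not taken from the panel:

```python
import math

# Wald (normal-approximation) CI for a difference of two proportions.
# ASSUMPTION: 172/500 vs 86/500 harmful are invented counts that
# reproduce a +17.2pp gap; they are not the panel's actual counts.
def prop_diff_ci(k1, n1, k2, n2, z=1.96):
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d, d - z * se, d + z * se

d, lo, hi = prop_diff_ci(172, 500, 86, 500)      # random seed vs causal branch
print(f"+{d*100:.1f}pp (95% CI {lo*100:.1f}-{hi*100:.1f}pp)")
```

A bootstrap over the actual panel items would be the tighter alternative; the Wald form is shown only because it makes the arithmetic behind such a gap explicit.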

Measurement discipline changed the project's claims more than new model scale did

April 8-16 turned multiple attractive stories into narrower ones: D7 moved to clean-panel supporting evidence instead of the old single-seed framing, v3 lost outdated holdout-superiority framing, and jailbreak binary counts gave way to a severity story. That is a methodological win, not a retreat.

Open the evaluation contract