Detection Is Not Enough

The broader thesis of this Gemma 3 4B case study: measurement, localization, control, and externality are separable stages, and strong readouts often fail as steering targets. H-neurons remain the paper-faithful first anchor, not the whole story.

Anchor 1

The H-neuron replication still holds: 38 neurons, 76.5% accuracy on disjoint held-out prompts, and a +4.5pp FaithEval gain over the no-op baseline.

Specificity was tested: random-neuron intervention controls stay flat on FaithEval and FalseQA.

Interpretation got narrower: the top-neuron 4288 story passed 0 of 6 cross-checks, and the stronger hypothesis is that the mechanism is distributed rather than singular.

Why this anchor matters: it establishes a real localization and control foothold before the broader thesis asks where that foothold breaks.

Anchor 2

Control does not guarantee transfer: the bridge benchmark now shows a held-out externality break.

Locked test result: ITI shifts adjudicated accuracy by -5.8pp (95% CI -8.8 to -3.0pp) on 500 held-out TriviaQA items.

Dominant failure mode: 70.0% of right-to-wrong flips are wrong-entity substitutions rather than refusals.

Project consequence: an intervention can help on one evaluation surface and still corrupt free-form factual generation.

Anchor 3

Measurement changes conclusions: the original 256-token jailbreak slope was a truncation artifact.

Paired v2-v3 audit: v2 shows a positive binary slope, the v3 binary slope is statistically uncertain, and v3 still recovers a rise in substantive compliance.

Why v3 stays primary: not because it leads on held-out accuracy after the tie, but because it supplies the taxonomy and granularity the paper needs without introducing false positives on the gold set.

Project consequence: evaluator dependence is part of the scientific result, not just reporting overhead.

Boundary of the claim

D7 is supporting evidence: the April 8 legacy panel stays historical, and the April 16 current panel is now a clean CSV2 v3 comparison with expanded layer-matched random controls and a scored probe branch; the shown random seed remains +17.2pp more harmful than causal on strict harmfulness (95% CI 13.2-21.4pp).

v3 is not pitched as universally superior: the holdout rerun closed the old StrongREJECT gap, so the rationale is measurement quality, not outdated superiority language.

Single-model scope remains explicit: the site is framed as a Gemma 3 4B case study in intervention science.

Key findings

The project is now about stage breaks, not one neuron set

The stable pattern across the April evidence is broader than H-neurons alone: measurement, localization, control, and externality come apart. Good readouts do not reliably identify good steering targets, and even successful interventions stay narrow unless validated on multiple surfaces.

Read the thesis

H-neurons still earn the first anchor

The paper-faithful sparse baseline survives the hard checks: 76.5% held-out accuracy (95% CI 73.6-79.5%), a +4.5pp FaithEval gain above no-op, and flat random-neuron intervention controls. What narrowed is the interpretation, not the existence of signal.

Open H-neuron results
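The width of an interval like 76.5% (95% CI 73.6-79.5%) is ordinary binomial uncertainty. A minimal sketch using a Wilson score interval, assuming a hypothetical 612/800 item count chosen only to land near that width; the project's actual denominator is not stated here:

```python
import math

# Wilson score interval for a binomial accuracy estimate.
# ASSUMPTION: 612/800 correct is a hypothetical count chosen to land
# near the reported 76.5% (95% CI 73.6-79.5%); it is not project data.
def wilson_ci(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(612, 800)  # 612/800 = 76.5% correct
print(f"76.5% (95% CI {lo*100:.1f}-{hi*100:.1f}%)")
```

The Wilson form is preferred over the naive Wald interval because it stays well-behaved near 0% and 100%, which matters for low-rate readouts elsewhere on this page.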

The paper's #1 neuron is an L1 regularization artifact

Neuron (20, 4288) has 1.65x the classifier weight of the runner-up but passes 0 of 6 independent checks. As the L1 penalty relaxes to C=3.0, the signal spreads across 219 positive-weight neurons and accuracy rises to 80.5%. Sparse bookkeeping is not the same thing as mechanistic centrality.

Six-test investigation
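The C-sweep result has a standard textbook shape: when a signal is distributed over many correlated, roughly equally informative features, a strong L1 penalty elects a few arbitrary representatives, while a weak penalty spreads weight widely. A minimal sketch using lasso regression (solved by ISTA) on fabricated data; nothing below uses the project's probe, activations, or neuron indices:

```python
import numpy as np

# Lasso (L1) penalty sweep on fabricated data: a distributed signal
# carried by 50 correlated, equally informative features.
# ASSUMPTION: synthetic data throughout; no project probe or neurons.
rng = np.random.default_rng(0)
n, d, k = 1000, 200, 50
z = rng.normal(size=n)                      # shared latent signal
X = rng.normal(size=(n, d))
X[:, :k] += z[:, None]                      # first 50 features all carry z
y = z + 0.5 * rng.normal(size=n)

L = np.linalg.norm(X, 2) ** 2 / n           # Lipschitz constant of the gradient

def lasso_ista(lam, steps=2000):
    """Solve min_w ||Xw - y||^2 / (2n) + lam * ||w||_1 by ISTA."""
    w = np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n           # gradient of the smooth part
        w = w - g / L
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-threshold
    return w

few = int((lasso_ista(0.3) > 0).sum())      # strong penalty: few representatives
many = int((lasso_ista(0.01) > 0).sum())    # weak penalty: weight spreads out
print(few, many)
```

The qualitative point is the one the six-test investigation makes: which feature the strong penalty elects is an artifact of the regularizer, not evidence of mechanistic centrality.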

When steering transfers badly, it fails by picking the wrong nearby fact

The bridge benchmark upgrades the externality claim to stronger held-out evidence: E0 ITI shifts adjudicated accuracy by -5.8pp (95% CI -8.8 to -3.0pp) on 500 locked test questions (p=0.0002). The damage is mostly active corruption, not abstention: 30 of 43 right-to-wrong flips (70.0%) are wrong-entity substitutions.
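An interval on a paired accuracy delta like -5.8pp can be produced by a paired bootstrap over items, resampling question indices so baseline and steered judgments stay matched. A sketch on synthetic 0/1 correctness vectors; the flip rates below are invented, not the project's:

```python
import numpy as np

# Paired bootstrap CI for an accuracy delta on n=500 items.
# ASSUMPTION: base/steered are synthetic 0/1 correctness vectors with
# invented flip rates; they stand in for the adjudicated judgments.
rng = np.random.default_rng(42)
n = 500
base = rng.random(n) < 0.70                      # baseline correct ~70%
flip_down = rng.random(n) < 0.08                 # steering flips right->wrong
flip_up = rng.random(n) < 0.02                   # occasionally wrong->right
steered = np.where(base & flip_down, False, base | flip_up)

delta = steered.mean() - base.mean()             # point estimate, as a fraction
boot = np.array([
    steered[i].mean() - base[i].mean()           # resample items, keep pairing
    for i in (rng.integers(0, n, n) for _ in range(2000))
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"delta={delta*100:+.1f}pp, 95% CI [{lo*100:+.1f}, {hi*100:+.1f}]pp")
```

Keeping the pairing inside each resample is what makes the interval reflect per-item flips rather than two independent accuracy estimates.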

Jailbreak evaluation is now a measurement case study, not just a binary-rate story

The 256-token binary headline was falsified by a 5000-token rerun. On paired outputs, v2 reports a 2.30pp/α binary slope while v3 compresses the same binary slope to 0.46pp/α, but v3 recovers a 2.00pp/α substantive-compliance shift. v3 stays primary after the holdout tie for taxonomy and zero-false-positive behavior, while specificity remains limited by missing paired controls and a single-seed random control.
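A pp/α slope of this kind is a least-squares fit of compliance rate against steering strength. A sketch with illustrative rates picked to give slopes of similar magnitude to the quoted v2/v3 numbers; the rates themselves are not project data:

```python
import numpy as np

# Least-squares "pp per alpha" slope for a steering sweep.
# ASSUMPTION: the compliance rates are illustrative placeholders, not
# the measured v2/v3 data; only the fitting step is the point here.
alphas = np.array([0.0, 2.0, 4.0, 8.0, 12.0])        # steering strengths
rate_v2 = np.array([4.0, 9.0, 13.0, 22.0, 32.0])     # % complying, lenient judge
rate_v3 = np.array([4.0, 5.0, 6.0, 7.5, 9.5])        # % complying, stricter judge

slope_v2 = np.polyfit(alphas, rate_v2, 1)[0]         # pp of compliance per alpha
slope_v3 = np.polyfit(alphas, rate_v3, 1)[0]
print(f"v2: {slope_v2:.2f}pp/alpha, v3: {slope_v3:.2f}pp/alpha")
```

Because both fits use the same paired outputs, the gap between the two slopes measures the judges, not the model, which is exactly the evaluator-dependence point above.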

D7 is promising benchmark-local evidence, not a finished mechanism claim

The April 8 legacy panel still shows the locked causal head intervention shifting `csv2_yes` by -9.0pp (95% CI -12.2 to -5.8pp) while the L1 comparator increases it by +4.0pp (95% CI 0.6-7.6pp). The April 16 current panel now uses a clean CSV2 v3 comparison with expanded layer-matched random controls and a scored probe branch; the shown random seed remains +17.2pp more harmful than causal on strict harmfulness (95% CI 13.2-21.4pp). The live caveats are the causal branch's token cap and output-quality debt plus a small documented set of residual evaluator errors, not missing seeds or a missing probe branch, so D7 stays supporting evidence rather than selector-specific closure.

See supporting workstreams
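The +17.2pp random-vs-causal gap is a difference of two harmful-rate proportions, and its interval can be approximated with a Wald (normal-approximation) CI for that difference. A sketch with invented counts, 172/500 vs 86/500, chosen only to reproduce a 17.2pp gap and not taken from the panel:

```python
import math

# Wald (normal-approximation) CI for a difference of two proportions.
# ASSUMPTION: 172/500 vs 86/500 harmful are invented counts that
# reproduce a +17.2pp gap; they are not the panel's actual counts.
def prop_diff_ci(k1, n1, k2, n2, z=1.96):
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d, d - z * se, d + z * se

d, lo, hi = prop_diff_ci(172, 500, 86, 500)      # random seed vs causal branch
print(f"+{d*100:.1f}pp (95% CI {lo*100:.1f}-{hi*100:.1f}pp)")
```

A bootstrap over the actual panel items would be the tighter alternative; the Wald form is shown only because it makes the arithmetic behind such a gap explicit.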

Measurement discipline changed the project's claims more than new model scale did

April 8-16 turned multiple attractive stories into narrower ones: D7 moved to clean-panel supporting evidence instead of the old single-seed framing, v3 lost outdated holdout-superiority framing, and jailbreak binary counts gave way to a severity story. That is a methodological win, not a retreat.

Open the evaluation contract