Core story

A broader thesis, built from three anchors, inside one Gemma case study

The strongest project claim is now cross-method and public-facing: in Gemma 3 4B-IT, measurement, localization, control, and externality repeatedly come apart. Strong readouts often fail as steering targets, and steering that succeeds on one surface often turns out to be narrow once it is tested on others.

H-neurons still matter, but as Anchor 1. They establish that a sparse detector can survive a clean held-out split and move behavior. The broader thesis comes from what happens next: interpretation fragility, transfer failures, and evaluator choices that materially change the conclusion.

Project claim
Stage breaks
Measurement, localization, control, and externality must be tested separately
Replication anchor
76.5%
Clean held-out H-neuron detector accuracy (95% CI 73.6-79.5%)
Externality anchor
-5.8pp
ITI bridge hit on held-out TriviaQA test (95% CI -8.8 to -3.0pp)
Measurement anchor
v2 != v3
Binary slope compression hides a severity shift that v3 recovers
1. Current claim

Predictive readout quality did not reliably identify useful intervention targets

The thesis is intentionally narrower than "detectors are useless" and broader than "H-neurons replicate." Across the methods in this repo, good readouts did not consistently become good levers. When control worked, it was benchmark-local, surface-local, or vulnerable to measurement choice.

What is earned
Detection is not enough
H-neurons, SAE features, probe heads, ITI, and D7 together support a practical methodological claim: selection quality and steering quality must be validated separately.
What is not earned
Universal rankings
The site does not claim that detector-selected targets never work, that causal selection is always better, or that v3 is globally superior to every other evaluator.
Working picture: a detector is more like a map than a steering wheel. It can tell you where signal lives without guaranteeing that turning there will move the system the way you want.
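
A minimal sketch of that separation in code, with hypothetical stand-in arrays rather than real project outputs: detection quality is scored as held-out AUROC, steering quality as a behavioral slope across intervention strengths, and neither number is derived from the other.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a readout score per held-out example, binary labels,
# and behavioral accuracy measured at several steering strengths (alphas).
detector_scores = rng.random(500)
labels = rng.integers(0, 2, size=500)
alphas = np.array([0.0, 1.0, 2.0, 3.0])
accuracy_at_alpha = np.array([0.62, 0.63, 0.65, 0.66])

# Stage 1: measurement / localization quality (does the readout predict?).
detection_auroc = roc_auc_score(labels, detector_scores)

# Stage 2: control quality (does steering move behavior?), fit separately.
slope_pp_per_alpha = np.polyfit(alphas, accuracy_at_alpha, 1)[0] * 100

print(f"detection AUROC: {detection_auroc:.3f}")
print(f"steering slope: {slope_pp_per_alpha:+.2f}pp per alpha unit")
# The methodological point: a strong AUROC on the first line never licenses
# assuming a non-zero slope on the second; each stage gets its own held-out test.
```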
2. Anchor 1

H-neurons remain the replication baseline, but they no longer carry the whole mechanism story

This anchor establishes that the project is not built on pure null results. A sparse detector survives the clean split, shifts FaithEval behavior, and passes intervention specificity checks. The same anchor also taught the project where interpretation can go wrong.

What survived replication

The core paper-faithful object is still real: 76.5% disjoint accuracy (95% CI 73.6-79.5%), +4.5pp no-op to max on FaithEval (95% CI 2.9-6.1pp), plus flat random-neuron intervention slopes on FaithEval and FalseQA.

Open the replication ledger
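
For readers who want the mechanics, a minimal sketch of how an accuracy figure with a bootstrap interval like 76.5% (95% CI 73.6-79.5%) can be computed, assuming per-example correctness flags from the disjoint split; the array below is a synthetic stand-in, and the ledger's exact interval method may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 1 = detector correct on a held-out example, 0 = wrong.
# In the real ledger this would be the per-example results of the clean split.
correct = (rng.random(1000) < 0.765).astype(int)

point_estimate = correct.mean()

# Nonparametric bootstrap over examples for a 95% interval.
boot = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"accuracy {point_estimate:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```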

What narrowed

The top-neuron story did not survive. Neuron 4288 passes 0 of 6 importance checks, and by C=3.0 the signal spreads across 219 positive-weight neurons. The safe interpretation is a distributed hallucination-related detector, not a single control node.

Open the 4288 deep dive
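
A minimal sketch of how a count like "219 positive-weight neurons at C=3.0" is read off, assuming the detector is an L1-regularized logistic probe over neuron activations with C as the usual inverse-regularization strength (as in scikit-learn); the activations and labels below are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for per-example neuron activations and binary labels.
X = rng.normal(size=(2000, 512))
y = (X[:, :8].sum(axis=1) + rng.normal(scale=2.0, size=2000) > 0).astype(int)

# Weaker regularization (larger C) lets weight spread across more neurons,
# which is how a sparse baseline can widen into a broader detector candidate.
for C in (0.1, 1.0, 3.0):
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    probe.fit(X, y)
    n_positive = int((probe.coef_[0] > 0).sum())
    print(f"C={C}: {n_positive} positive-weight neurons")
```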

Why this anchor is still only Anchor 1

The strategic assessment elevated the cleaner dissociation result: SAE features match H-neuron detection quality within uncertainty (AUROC 0.848, 95% CI 0.820-0.874 vs 0.843, 95% CI 0.815-0.870) yet steer at essentially zero slope. That is the sharper localization-to-control break. H-neurons provide the working baseline that makes the null comparison meaningful.
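
A minimal sketch of the "matched within uncertainty" half of that dissociation, assuming per-example scores from both feature families on the same held-out set; the scores below are synthetic stand-ins, and the project's exact interval construction may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for held-out labels and two readouts scored on the same
# examples (H-neuron detector vs SAE-feature detector).
n = 1500
labels = rng.integers(0, 2, size=n)
signal = labels + rng.normal(scale=1.6, size=n)
h_neuron_scores = signal + rng.normal(scale=0.5, size=n)
sae_scores = signal + rng.normal(scale=0.5, size=n)

# Paired bootstrap over examples: both detectors are scored on the same
# resample, so the AUROC difference respects shared-example correlation.
deltas = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    if labels[idx].min() == labels[idx].max():
        continue  # skip degenerate resamples with only one class
    deltas.append(
        roc_auc_score(labels[idx], h_neuron_scores[idx])
        - roc_auc_score(labels[idx], sae_scores[idx])
    )
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"AUROC difference 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# A CI straddling zero is the "matched within uncertainty" half; the steering
# slope is measured separately and can still come out essentially flat.
```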

Early-layer concentration
47%
18 of 38 H-neurons are in layers 0-10
Swing population
13.8%
13.8% of samples drive most of the intervention effect
Mechanism object
Distributed
38-neuron sparse baseline, 219-neuron broader detector candidate
3. Anchor 2

Even successful control can break at the externality stage

The bridge benchmark is the clearest test of whether an intervention that helps on one surface stays helpful when the task changes from multiple-choice truthfulness to free-form factual generation. It does not.

Held-out test result

The locked 500-question bridge test shows E0 ITI shifting adjudicated accuracy by -5.8pp (95% CI -8.8 to -3.0pp), with p=0.0002. Dev and test agree on direction and approximate magnitude, so this is no longer a directional pilot result.
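
A minimal sketch of the kind of paired comparison behind that figure, assuming per-question adjudicated correctness for the baseline and ITI runs on the same locked questions; the correctness arrays below are synthetic stand-ins, and the real pipeline's adjudication and significance test may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: adjudicated correctness (1/0) per question for the
# baseline and the ITI run, paired on the same 500 locked questions.
n = 500
baseline = (rng.random(n) < 0.60).astype(int)
iti = baseline.copy()
iti[(baseline == 1) & (rng.random(n) < 0.12)] = 0  # right-to-wrong flips
iti[(baseline == 0) & (rng.random(n) < 0.04)] = 1  # wrong-to-right flips

diff_pp = (iti.mean() - baseline.mean()) * 100

# Paired bootstrap over questions for the CI on the accuracy change.
boot = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)
    boot.append((iti[idx].mean() - baseline[idx].mean()) * 100)
lo, hi = np.percentile(boot, [2.5, 97.5])

# Sign-flip permutation test on the paired per-question differences.
d = iti - baseline
perm = np.array([(d * rng.choice([-1, 1], size=n)).mean() for _ in range(10_000)])
p = (np.abs(perm) >= abs(d.mean())).mean()

print(f"accuracy change {diff_pp:+.1f}pp (95% CI {lo:+.1f} to {hi:+.1f}pp), p={p:.4f}")
```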

Dominant failure mode

The intervention usually does not refuse. It answers with the wrong nearby fact. On the held-out test set, 30 of 43 right-to-wrong flips (69.8%) are wrong-entity substitutions, such as swapping one film, person, or title for another from the same semantic neighborhood.
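
A minimal tally sketch, with hypothetical per-flip category labels standing in for the real adjudication schema; only the wrong-entity count is taken from the result above.

```python
from collections import Counter

# Hypothetical per-flip category labels for the 43 right-to-wrong flips on the
# held-out test set; "other" is a placeholder for the remaining categories.
flip_labels = ["wrong_entity"] * 30 + ["other"] * 13

counts = Counter(flip_labels)
total = len(flip_labels)
for category, count in counts.most_common():
    print(f"{category}: {count}/{total} ({count / total:.1%})")
```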

Why this matters to the thesis

This is the control-to-externality break in its cleanest form. The intervention is active, but indiscriminate: it reweights candidate answers without knowing which candidate is correct. That is a stronger diagnosis than a generic "performance went down" story.

Anchor 2

The bridge result acts like a road test after a clean engine-bench result. The component can pass the controlled lab setting and still steer the full system into the wrong lane once the task becomes open-ended.

4. Anchor 3

Measurement choices changed the jailbreak conclusion

The evaluator story is now a scientific case study in its own right. The project first found a false positive caused by truncation, then showed that two evaluators can disagree on the same outputs in a way that is mechanically explainable rather than mysterious.

Artifact removed

The original 256-token jailbreak slope was a truncation artifact. Once outputs were rerun to 5000 tokens, the old "binary jailbreak rate rises" story no longer held as a clean headline.

Paired evaluator result

On the same H-neuron outputs, v2 reports a harmful-binary slope of 2.30pp/α (95% CI 0.99-3.58pp), while v3 reports 0.46pp/α (95% CI -1.46 to 2.41pp). But v3's primary-outcome taxonomy recovers the underlying severity shift as substantive compliance rising by 2.00pp/α (95% CI 0.11-3.87pp).
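
A minimal sketch of how a pp/α slope with a bootstrap interval can be estimated, assuming per-output binary flags from an evaluator at each steering strength; the flags below are synthetic stand-ins, and the real evaluators and regression details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: at each steering strength alpha, a batch of outputs is
# scored 1 (harmful / substantive compliance) or 0 by an evaluator.
alphas = np.array([0.0, 1.0, 2.0, 3.0])
n_per_alpha = 200
base_rate, true_slope = 0.05, 0.02  # 2pp per alpha unit, for illustration
flags = [
    (rng.random(n_per_alpha) < base_rate + true_slope * a).astype(int)
    for a in alphas
]

def slope_pp_per_alpha(samples):
    """Least-squares slope of the per-alpha rate, in percentage points per alpha."""
    rates = np.array([s.mean() for s in samples])
    return np.polyfit(alphas, rates, 1)[0] * 100

point = slope_pp_per_alpha(flags)

# Bootstrap outputs within each alpha bucket to get a CI on the slope.
boot = []
for _ in range(5000):
    resampled = [rng.choice(s, size=s.size, replace=True) for s in flags]
    boot.append(slope_pp_per_alpha(resampled))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"slope {point:+.2f}pp/alpha (95% CI {lo:+.2f} to {hi:+.2f})")
```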

Current evaluator rationale

After the StrongREJECT gpt-4o rerun, v3 is no longer presented as better on held-out accuracy. On the holdout audit, v3 scores 96.0% (95% CI 90.0-100.0%) and StrongREJECT-4o also scores 96.0% (95% CI 90.0-100.0%), with 0 discordant records. With held-out accuracy now tied, v3 stays primary because it separates refusal, deflection, partial compliance, and substantive compliance without introducing false positives. The remaining caveat is specificity coverage: there is still no paired v2-v3 control set, and the v3 random control is single-seed.
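
A minimal sketch of the holdout-audit comparison, with hypothetical gold labels and evaluator verdicts arranged to match the reported shape (96.0% each, zero discordant records); the real audit's records and interval method are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a 50-record holdout audit: gold adjudications,
# plus each evaluator's verdict on the same records.
n = 50
gold = rng.integers(0, 2, size=n)
v3 = gold.copy()
strongreject_4o = gold.copy()

# For illustration, both evaluators miss the same two records, which matches
# the reported shape: 96.0% accuracy each and zero discordant records.
error_idx = rng.choice(n, size=2, replace=False)
v3[error_idx] ^= 1
strongreject_4o[error_idx] ^= 1

for name, verdicts in (("v3", v3), ("StrongREJECT-4o", strongreject_4o)):
    print(f"{name} holdout accuracy: {(verdicts == gold).mean():.1%}")

# Discordant records: same output, different verdict from the two evaluators.
print(f"discordant records: {int((v3 != strongreject_4o).sum())}")
```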

5. Boundary of the paper

Headline-safe versus supporting evidence now matters more than breadth

The strategic assessment's main discipline was evidence hierarchy. Some results can lead sections. Others strengthen the picture but should not carry the project alone.

Headline-safe
Two anchors
H-neuron replication plus localization limits, and the bridge externality break on the held-out test set, can carry central narrative weight.
Supporting-caveated
Anchor 3, D7, and v3 margins
Jailbreak measurement audits matter, but evaluator dependence and incomplete control coverage keep them out of the headline-safe bucket. D7's benchmark-local win is real, and the April 16 current panel is now a clean CSV2 v3 comparison with expanded random controls and a scored probe branch; what keeps it supporting is the remaining token-cap and residual evaluator-error debt, not the old single-seed objection. v3's value is measurement structure and zero-false-positive behavior, not a broad claim of evaluator dominance.
Practical editorial rule: lead with what would still survive if the most flattering optional result disappeared. That is why the site now anchors on the three April case studies instead of one expanding H-neuron claim.