Methods

Pipeline, evaluation, and the measurement contract

The methods page now reflects the broader thesis. It documents the H-neuron replication funnel, but it also makes explicit that measurement, localization, control, and externality use different rulers. The important update concerns evaluators: CSV v3 is preferred for its granularity and zero-false-positive behavior, not for an outdated held-out superiority claim.

TriviaQA start: 3,500 questions sampled for the Gemma H-neuron baseline
Usable extraction set: 2,997 rows that survived the consistency filter and answer extraction
Clean test set: 780 disjoint examples evaluated for the main H-neuron classifier result
Jailbreak primary ruler: CSV v3, chosen for taxonomy and auditability, not for a claimed held-out accuracy edge
Pipeline

From 3,500 questions to 38 paper-faithful neurons

The H-neuron detector remains the project's first anchor, so its data funnel still matters. It is also a reminder that localization claims already contain measurement choices before any intervention is attempted.

Stage 1 (Response collection): 3,500 TriviaQA questions, 10 responses each
Stage 2 (Consistency filter): 3,115 questions with exact 10/10 answer agreement (89%)
Stage 3 (Answer extraction): 2,997 answer tokens extracted with GPT-4o
Stage 4 (Activation extraction): 348K CETT features per sample
Stage 5 (L1 classifier): 38 positive-weight neurons at C=1.0 (sketched below)
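A minimal sketch of the Stage 5 selection step, assuming the CETT activation features already sit in a samples-by-features matrix and that the 38 H-neurons are simply the strictly positive weights of an L1-penalized logistic regression at C=1.0. The variable names and solver choice are illustrative, not the project's actual code.

```python
# Minimal sketch of Stage 5 (illustrative; not the project's actual code).
# Assumes X is an (n_samples, n_features) matrix of CETT activation features,
# y holds the binary per-sample labels, and feature_names maps columns to neuron ids.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_positive_weight_neurons(X, y, feature_names, C=1.0):
    """Fit an L1-penalized logistic classifier and keep strictly positive-weight features."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=5000)
    clf.fit(X, y)
    weights = clf.coef_.ravel()
    keep = np.flatnonzero(weights > 0)  # positive-weight neurons only
    return [(feature_names[i], float(weights[i])) for i in keep]
```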

Data funnel

The detector is only as clean as the rows that survive into it. This funnel is part of the scientific object, not just preprocessing.

TriviaQA questions: 3,500
Consistent (10/10): 3,115
Tokens extracted: 2,997
Train set (balanced): 2,000
Disjoint test set: 780
H-neurons: 38
Methodological consequence: the consistency filter and extraction step bias the detector toward confident, well-formed responses. That does not invalidate the baseline, but it narrows the regime the detector speaks for.
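The 10/10 criterion behind the second funnel row is simple enough to state in code. A sketch under assumptions: responses are stored as plain strings per question, and exact agreement is checked after light normalization; the data layout is hypothetical.

```python
# Sketch of the Stage 2 consistency filter (illustrative; data layout is an assumption).
# A question survives only if all 10 sampled responses agree exactly after light normalization.

def normalize(answer: str) -> str:
    """Collapse whitespace and case so trivial formatting differences do not break exact match."""
    return " ".join(answer.strip().lower().split())

def is_consistent(responses: list[str]) -> bool:
    """True when every response normalizes to the same string (the 10/10 criterion)."""
    return len({normalize(r) for r in responses}) == 1

responses_by_question = {
    "q1": ["Paris"] * 10,             # 10/10 agreement -> kept
    "q2": ["Paris"] * 9 + ["Lyon"],   # any disagreement -> dropped
}
kept = {qid: rs for qid, rs in responses_by_question.items() if is_consistent(rs)}
```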
Stage separation

The broader thesis starts here: the data funnel answers a localization question, not a control question. A clean detector can still fail later when the project asks whether an intervention actually moves the right behavior on the right task.

Evaluation

Use evaluators that match the output contract and preserve the structure of the effect

Two measurement lessons now sit side by side in the repo. FaithEval taught prompt-evaluator alignment. Jailbreak taught that evaluator granularity can change the conclusion even on identical outputs.

FaithEval contract: Anti-compliance prompt: "Answer with just the letter." Standard prompt: "Respond with the exact answer only."
Jailbreak contract: v2 returns a binary harmful / borderline / safe verdict; v3 returns primary_outcome + C/S/V axes + evidence spans for audit.
FaithEval lesson: surface form matters. If the prompt asks for exact answer text but the evaluator only trusts option letters, a formatting shift can masquerade as a behavioral regression.
Jailbreak lesson: granularity matters. v3 is preferred because it can distinguish refusal, deflection, partial compliance, and substantive compliance; that is why it recovers the severity shift that binary rates partly hide.
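What prompt-evaluator alignment means in practice, as a sketch: the grader accepts whichever surface form the prompt contract allows, instead of silently trusting only option letters. This is not FaithEval's implementation; the normalization and acceptance rule are assumptions.

```python
# Sketch of a prompt-aligned multiple-choice grader (illustrative; not FaithEval's code).
# If the prompt asked for the exact answer text, the grader must accept exact answer text,
# not quietly require the option letter.

def normalize(text: str) -> str:
    return " ".join(text.strip().lower().rstrip(".").split())

def is_correct(response: str, option_letter: str, answer_text: str) -> bool:
    """Accept either the option letter or the exact answer text, per the prompt contract."""
    return normalize(response) in {normalize(option_letter), normalize(answer_text)}

assert is_correct("B", "B", "Photosynthesis")
assert is_correct("photosynthesis.", "B", "Photosynthesis")
assert not is_correct("respiration", "B", "Photosynthesis")
```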

Why v3 is primary now

The current justification is structural, not a broad accuracy claim. On the holdout audit, CSV v3 scores 96.0% (95% CI 90.0-100.0%) and StrongREJECT-4o also scores 96.0% (95% CI 90.0-100.0%), with 0 discordant records. With held-out accuracy now tied, v3 stays primary because it exposes the intervention gradient through primary outcomes and the C/S/V dimensions, not because it holds a measured accuracy edge. Specificity is still limited by missing paired controls and a single-seed v3 random control.
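For concreteness, a sketch of the per-record shape the v3 contract implies. Only primary_outcome, the C/S/V axes, and evidence spans come from the contract described above; the field names, score scales, and example values are placeholders.

```python
# Sketch of a CSV v3 judgment record (illustrative; axis meanings and scales are assumptions).
from dataclasses import dataclass, field

@dataclass
class V3Judgment:
    primary_outcome: str  # e.g. refusal, deflection, partial compliance, substantive compliance
    c_axis: int           # C dimension score (scale assumed)
    s_axis: int           # S dimension score (scale assumed)
    v_axis: int           # V dimension score (scale assumed)
    evidence_spans: list[str] = field(default_factory=list)  # quoted spans that justify the labels

judgment = V3Judgment(
    primary_outcome="deflection",
    c_axis=1, s_axis=0, v_axis=0,
    evidence_spans=["I can't help with that, but speaking generally..."],
)
```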

Zero-false-positive discipline

The measurement cleanup and gold reruns reinforce a useful safety property: v3 does not create phantom harmful cases on the audited gold set. That makes it a safer primary ruler for intervention comparisons than a coarser binary-only judge.
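The property itself reduces to a short check over the audited gold set, sketched below; the record fields and the mapping from v3 outcomes to "harmful" are assumptions about the data format.

```python
# Sketch of the zero-false-positive audit (illustrative; field names and the harmful mapping
# are assumptions, not the project's actual schema).

HARMFUL_OUTCOMES = {"partial compliance", "substantive compliance"}  # assumed mapping

def false_positives(records: list[dict]) -> list[dict]:
    """Gold-benign records that v3 nevertheless scored as harmful."""
    return [
        r for r in records
        if r["gold_label"] == "benign" and r["v3_primary_outcome"] in HARMFUL_OUTCOMES
    ]

records = [
    {"id": 1, "gold_label": "benign", "v3_primary_outcome": "refusal"},
    {"id": 2, "gold_label": "harmful", "v3_primary_outcome": "substantive compliance"},
]
assert not false_positives(records)  # the property the gold reruns keep verifying
```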

What v3 does not claim

The site no longer says v3 is strictly superior on holdout. The StrongREJECT-4o rerun eliminated that old margin. The remaining rationale is that v3 provides the richer measurement surface the paper needs.

Why this belongs in methods: choosing an evaluator is like choosing a microscope objective. A coarser lens can be accurate on obvious cases and still erase the structure that the experiment is actually trying to measure.
Reproducibility

Reproducibility now means visible caveats, not just repeatable scripts

The most useful April upgrades were not only new runs. They were the changes that converted hidden assumptions into named constraints: locked test sets, canary validation, negative-control expectations, and explicit evidence tiers.

One-shot locked surfaces

The bridge test was run once on a locked 500-question manifest with a rerun guard. That matters more than squeezing another point estimate out of the same split.
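A sketch of what a locked manifest plus a rerun guard can look like; the file names, the pinned hash, and the marker-file mechanism are assumptions rather than the project's actual tooling.

```python
# Sketch of a one-shot rerun guard over a locked manifest (illustrative; paths, hash,
# and marker file are assumptions).
import hashlib
import json
import sys
from pathlib import Path

MANIFEST = Path("bridge_test_manifest_500.jsonl")  # hypothetical locked 500-question list
MARKER = Path("bridge_test.completed")             # written exactly once, after the single run
PINNED_SHA256 = "<hash recorded when the manifest was frozen>"

def guard() -> None:
    """Refuse to run twice, and refuse to run on a manifest that drifted from the locked one."""
    if MARKER.exists():
        sys.exit("Bridge test already ran once on this manifest; refusing to rerun.")
    if hashlib.sha256(MANIFEST.read_bytes()).hexdigest() != PINNED_SHA256:
        sys.exit("Manifest no longer matches the pinned hash; aborting.")

def mark_complete(summary: dict) -> None:
    """Record that the one allowed run happened, alongside its summary."""
    MARKER.write_text(json.dumps({"manifest_sha256": PINNED_SHA256, **summary}))
```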

Control expectations are explicit

Random-neuron controls were required to validate H-neuron intervention specificity. The same logic is why D7 is still labeled supporting: the live panel is now a clean CSV v3 comparison with expanded layer-matched random controls and a scored probe branch, but causal token-cap and quality debt, plus a small residual evaluator-error set, still block selector-specific closure.
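As a sketch of the control logic: a layer-matched random control draws, per layer, as many non-H neurons as the H-neuron set has in that layer, under an assumed (layer, index) naming, a single MLP width, and a fixed seed.

```python
# Sketch of layer-matched random-neuron controls (illustrative; neuron ids are assumed to be
# (layer, index) pairs and every layer is assumed to have the same MLP width).
import random
from collections import Counter

def layer_matched_controls(h_neurons, neurons_per_layer, seed=0):
    """Per layer, sample as many random non-H neurons as that layer has H-neurons."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    h_set = set(h_neurons)
    per_layer = Counter(layer for layer, _ in h_neurons)
    controls = []
    for layer, k in per_layer.items():
        candidates = [(layer, i) for i in range(neurons_per_layer) if (layer, i) not in h_set]
        controls.extend(rng.sample(candidates, k))
    return controls

# with the full 38-neuron H set this yields 38 layer-matched controls; 3 shown for illustration
controls = layer_matched_controls(h_neurons=[(12, 4031), (12, 77), (20, 955)],
                                  neurons_per_layer=16384, seed=0)
```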

Evidence hierarchy is part of the method

Headline-safe and supporting-caveated are not editorial afterthoughts. They are the method for preventing an attractive but under-controlled result from silently becoming the paper's main claim.

Reproducibility note

A robust pipeline is not just something you can rerun. It is something that tells you which conclusions should survive if one appealing branch later gets downgraded.