Methods

Pipeline, evaluation, and the measurement contract

The methods page now reflects the broader thesis. It documents the H-neuron replication funnel, but it also makes explicit that measurement, localization, control, and externality use different rulers. The important update concerns evaluators: CSV v3 is preferred for its granularity and zero-false-positive behavior, not for an outdated held-out superiority claim.

TriviaQA start: 3,500 questions sampled for the Gemma H-neuron baseline
Usable extraction set: 2,997 rows that survived the consistency filter and answer extraction
Clean test set: 780 disjoint examples evaluated for the main H-neuron classifier result
Jailbreak primary ruler: CSV v3, chosen for taxonomy and auditability, not for a claimed held-out accuracy edge
Pipeline

From 3,500 questions to 38 paper-faithful neurons

The H-neuron detector remains the project's first anchor, so its data funnel still matters. It is also a reminder that localization claims already contain measurement choices before any intervention is attempted.

Stage 1 (Response collection): 3,500 TriviaQA questions, 10 responses each
Stage 2 (Consistency filter): 3,115 questions with exact 10/10 answer agreement (89%)
Stage 3 (Answer extraction): 2,997 answer tokens extracted with GPT-4o
Stage 4 (Activation extraction): 348K CETT features per sample
Stage 5 (L1 classifier): 38 positive-weight neurons at C=1.0 (sketched below)
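A minimal sketch of the Stage 5 selection step, assuming the CETT activation features already sit in a samples-by-features matrix and that the 38 H-neurons are simply the strictly positive weights of an L1-penalized logistic regression at C=1.0. The variable names and solver choice are illustrative, not the project's actual code.

```python
# Minimal sketch of Stage 5 (illustrative; not the project's actual code).
# Assumes X is an (n_samples, n_features) matrix of CETT activation features,
# y holds the binary per-sample labels, and feature_names maps columns to neuron ids.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_positive_weight_neurons(X, y, feature_names, C=1.0):
    """Fit an L1-penalized logistic classifier and keep strictly positive-weight features."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=5000)
    clf.fit(X, y)
    weights = clf.coef_.ravel()
    keep = np.flatnonzero(weights > 0)  # positive-weight neurons only
    return [(feature_names[i], float(weights[i])) for i in keep]
```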

Data funnel

The detector is only as clean as the rows that survive into it. This funnel is part of the scientific object, not just preprocessing.

TriviaQA questions: 3,500
Consistent (10/10): 3,115
Tokens extracted: 2,997
Train set (balanced): 2,000
Disjoint test set: 780
H-neurons: 38
Methodological consequence: the consistency filter and extraction step bias the detector toward confident, well-formed responses. That does not invalidate the baseline, but it narrows the regime the detector speaks for.
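The 10/10 criterion behind the second funnel row is simple enough to state in code. A sketch under assumptions: responses are stored as plain strings per question, and exact agreement is checked after light normalization; the data layout is hypothetical.

```python
# Sketch of the Stage 2 consistency filter (illustrative; data layout is an assumption).
# A question survives only if all 10 sampled responses agree exactly after light normalization.

def normalize(answer: str) -> str:
    """Collapse whitespace and case so trivial formatting differences do not break exact match."""
    return " ".join(answer.strip().lower().split())

def is_consistent(responses: list[str]) -> bool:
    """True when every response normalizes to the same string (the 10/10 criterion)."""
    return len({normalize(r) for r in responses}) == 1

responses_by_question = {
    "q1": ["Paris"] * 10,             # 10/10 agreement -> kept
    "q2": ["Paris"] * 9 + ["Lyon"],   # any disagreement -> dropped
}
kept = {qid: rs for qid, rs in responses_by_question.items() if is_consistent(rs)}
```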
Stage separation

The broader thesis starts here: the data funnel answers a localization question, not a control question. A clean detector can still fail later when the project asks whether an intervention actually moves the right behavior on the right task.

Evaluation

Use evaluators that match the output contract and preserve the structure of the effect

Two measurement lessons now sit side by side in the repo. FaithEval taught prompt-evaluator alignment. Jailbreak taught that evaluator granularity can change the conclusion even on identical outputs.

FaithEval contract: Anti-compliance prompt: "Answer with just the letter." Standard prompt: "Respond with the exact answer only."
Jailbreak contract: v2 returns a binary harmful / borderline / safe verdict; v3 returns primary_outcome + C/S/V axes + evidence spans for audit.
FaithEval lesson: surface form matters. If the prompt asks for exact answer text but the evaluator only trusts option letters, a formatting shift can masquerade as a behavioral regression.
Jailbreak lesson: granularity matters. v3 is preferred because it can distinguish refusal, deflection, partial compliance, and substantive compliance; that is why it recovers the severity shift that binary rates partly hide.
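What prompt-evaluator alignment means in practice, as a sketch: the grader accepts whichever surface form the prompt contract allows, instead of silently trusting only option letters. This is not FaithEval's implementation; the normalization and acceptance rule are assumptions.

```python
# Sketch of a prompt-aligned multiple-choice grader (illustrative; not FaithEval's code).
# If the prompt asked for the exact answer text, the grader must accept exact answer text,
# not quietly require the option letter.

def normalize(text: str) -> str:
    return " ".join(text.strip().lower().rstrip(".").split())

def is_correct(response: str, option_letter: str, answer_text: str) -> bool:
    """Accept either the option letter or the exact answer text, per the prompt contract."""
    return normalize(response) in {normalize(option_letter), normalize(answer_text)}

assert is_correct("B", "B", "Photosynthesis")
assert is_correct("photosynthesis.", "B", "Photosynthesis")
assert not is_correct("respiration", "B", "Photosynthesis")
```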

Why v3 is primary now

The current justification is structural, not a broad accuracy claim. On the holdout audit, CSV v3 scores 96.0% (95% CI 90.0-100.0%) and StrongREJECT-4o also scores 96.0% (95% CI 90.0-100.0%), with 0 discordant records. With held-out accuracy now tied, v3 stays primary because it exposes the intervention gradient through primary outcomes and the C/S/V dimensions, not because it holds a measured accuracy edge. Specificity is still limited by missing paired controls and a single-seed v3 random control.
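For concreteness, a sketch of the per-record shape the v3 contract implies. Only primary_outcome, the C/S/V axes, and evidence spans come from the contract described above; the field names, score scales, and example values are placeholders.

```python
# Sketch of a CSV v3 judgment record (illustrative; axis meanings and scales are assumptions).
from dataclasses import dataclass, field

@dataclass
class V3Judgment:
    primary_outcome: str  # e.g. refusal, deflection, partial compliance, substantive compliance
    c_axis: int           # C dimension score (scale assumed)
    s_axis: int           # S dimension score (scale assumed)
    v_axis: int           # V dimension score (scale assumed)
    evidence_spans: list[str] = field(default_factory=list)  # quoted spans that justify the labels

judgment = V3Judgment(
    primary_outcome="deflection",
    c_axis=1, s_axis=0, v_axis=0,
    evidence_spans=["I can't help with that, but speaking generally..."],
)
```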

Zero-false-positive discipline

The measurement cleanup and gold reruns reinforce a useful safety property: v3 does not create phantom harmful cases on the audited gold set. That makes it a safer primary ruler for intervention comparisons than a coarser binary-only judge.
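The property itself reduces to a short check over the audited gold set, sketched below; the record fields and the mapping from v3 outcomes to "harmful" are assumptions about the data format.

```python
# Sketch of the zero-false-positive audit (illustrative; field names and the harmful mapping
# are assumptions, not the project's actual schema).

HARMFUL_OUTCOMES = {"partial compliance", "substantive compliance"}  # assumed mapping

def false_positives(records: list[dict]) -> list[dict]:
    """Gold-benign records that v3 nevertheless scored as harmful."""
    return [
        r for r in records
        if r["gold_label"] == "benign" and r["v3_primary_outcome"] in HARMFUL_OUTCOMES
    ]

records = [
    {"id": 1, "gold_label": "benign", "v3_primary_outcome": "refusal"},
    {"id": 2, "gold_label": "harmful", "v3_primary_outcome": "substantive compliance"},
]
assert not false_positives(records)  # the property the gold reruns keep verifying
```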

What v3 does not claim

The site no longer says v3 is strictly superior on holdout. The StrongREJECT-4o rerun eliminated that old margin. The remaining rationale is that v3 provides the richer measurement surface the paper needs.

Why this belongs in methods: choosing an evaluator is like choosing a microscope objective. A coarser lens can be accurate on obvious cases and still erase the structure that the experiment is actually trying to measure.
Reproducibility

Reproducibility now means visible caveats, not just repeatable scripts

The most useful April upgrades were not only new runs. They were the changes that converted hidden assumptions into named constraints: locked test sets, canary validation, negative-control expectations, and explicit evidence tiers.

One-shot locked surfaces

The bridge test was run once on a locked 500-question manifest with a rerun guard. That matters more than squeezing another point estimate out of the same split.
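A sketch of what a locked manifest plus a rerun guard can look like; the file names, the pinned hash, and the marker-file mechanism are assumptions rather than the project's actual tooling.

```python
# Sketch of a one-shot rerun guard over a locked manifest (illustrative; paths, hash,
# and marker file are assumptions).
import hashlib
import json
import sys
from pathlib import Path

MANIFEST = Path("bridge_test_manifest_500.jsonl")  # hypothetical locked 500-question list
MARKER = Path("bridge_test.completed")             # written exactly once, after the single run
PINNED_SHA256 = "<hash recorded when the manifest was frozen>"

def guard() -> None:
    """Refuse to run twice, and refuse to run on a manifest that drifted from the locked one."""
    if MARKER.exists():
        sys.exit("Bridge test already ran once on this manifest; refusing to rerun.")
    if hashlib.sha256(MANIFEST.read_bytes()).hexdigest() != PINNED_SHA256:
        sys.exit("Manifest no longer matches the pinned hash; aborting.")

def mark_complete(summary: dict) -> None:
    """Record that the one allowed run happened, alongside its summary."""
    MARKER.write_text(json.dumps({"manifest_sha256": PINNED_SHA256, **summary}))
```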

Control expectations are explicit

Random-neuron controls were required to validate H-neuron intervention specificity. The same logic is why D7 is still labeled supporting: the live panel is now a clean CSV v3 comparison with expanded layer-matched random controls and a scored probe branch, but causal token-cap and quality debt, plus a small residual evaluator-error set, still block selector-specific closure.
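As a sketch of the control logic: a layer-matched random control draws, per layer, as many non-H neurons as the H-neuron set has in that layer, under an assumed (layer, index) naming, a single MLP width, and a fixed seed.

```python
# Sketch of layer-matched random-neuron controls (illustrative; neuron ids are assumed to be
# (layer, index) pairs and every layer is assumed to have the same MLP width).
import random
from collections import Counter

def layer_matched_controls(h_neurons, neurons_per_layer, seed=0):
    """Per layer, sample as many random non-H neurons as that layer has H-neurons."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    h_set = set(h_neurons)
    per_layer = Counter(layer for layer, _ in h_neurons)
    controls = []
    for layer, k in per_layer.items():
        candidates = [(layer, i) for i in range(neurons_per_layer) if (layer, i) not in h_set]
        controls.extend(rng.sample(candidates, k))
    return controls

# with the full 38-neuron H set this yields 38 layer-matched controls; 3 shown for illustration
controls = layer_matched_controls(h_neurons=[(12, 4031), (12, 77), (20, 955)],
                                  neurons_per_layer=16384, seed=0)
```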

Evidence hierarchy is part of the method

Headline-safe and supporting-caveated are not editorial afterthoughts. They are the method for preventing an attractive but under-controlled result from silently becoming the paper's main claim.

Reproducibility note

A robust pipeline is not just something you can rerun. It is something that tells you which conclusions should survive if one appealing branch later gets downgraded.