Progress — April 8–14

Week 4: Flagship Synthesis

This synthesis reframed the project around a broader methods thesis: strong readouts do not reliably identify steerable targets.

The argument now rests on three anchor case studies and explicit confidence tiers, and the April 14 D7 follow-up sharpens the supporting-evidence boundary rather than broadening it.

Anchor case studies: 3 (SAE vs H-neurons, bridge externality, jailbreak measurement)
Headline-safe anchors: 2 (SAE dissociation and bridge externality replication)
Load-bearing bridge result: -5.8 pp TriviaQA test-set delta (95% CI -8.8 to -3.0 pp; p = 0.0002)
Largest caveat: selector specificity. Jailbreak still needs paired controls; D7 now has a single-seed control but not closure.
Start Here

What is now the safest flagship claim?

The strongest claim is no longer benchmark-specific. It is methodological: the chain from measurement to localization to control to externality can break at any link.

Localization → control can break

SAE and H-neurons read out hallucination similarly well, but only one steers FaithEval behavior.

Control → externality can break

ITI improves constrained MC truthfulness but harms open-ended factual generation on held-out TriviaQA.

Measurement → conclusion can break

Jailbreak effect size and significance move when evaluator definitions change, even on identical outputs.

Anchor map

Three anchor case studies for the paper skeleton

Two anchors are headline-safe and one is supporting-caveated. This separation protects the core argument from overreach.

Anchor 1 — SAE vs H-neurons (headline-safe)

Matched detection quality (AUROC 0.848, 95% CI 0.820-0.874 vs 0.843, 95% CI 0.815-0.870) with divergent steering outcomes. Delta-only SAE remains null, so this is not just reconstruction noise.
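The matched-detection claim above leans on bootstrap confidence intervals around AUROC. As a minimal sketch of how such intervals can be computed (the function names and resampling scheme are illustrative assumptions, not the project's actual pipeline):

```python
import numpy as np

def auroc(labels, scores):
    # Rank-sum (Mann-Whitney) AUROC; assumes continuous scores with no
    # cross-class ties. Equals the probability a random positive outscores
    # a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auroc_ci(labels, scores, n_boot=2000, seed=0):
    # Percentile bootstrap over examples; skips degenerate one-class resamples.
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].min() == labels[idx].max():
            continue
        stats.append(auroc(labels[idx], scores[idx]))
    return tuple(np.percentile(stats, [2.5, 97.5]))
```

Comparing two readouts then means reporting each AUROC with its interval, as in the 0.848 vs 0.843 pair above, rather than a single point estimate.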

Anchor 2 — ITI bridge externality (headline-safe)

Held-out test confirms a negative transfer: adjudicated accuracy drops by 5.8 pp (95% CI -8.8 to -3.0 pp). Most of the damage is wrong-entity substitution (30/43 right-to-wrong flips), not refusal collapse.
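The bridge-externality numbers are paired before/after reads on the same held-out items. A hypothetical bookkeeping sketch (the helper name and item schema are assumptions, not the project's code):

```python
def paired_transfer_delta(baseline_correct, steered_correct):
    # Paired accuracy delta in percentage points, plus flip-direction counts.
    # Each argument is a sequence of booleans over the same held-out items.
    assert len(baseline_correct) == len(steered_correct)
    n = len(baseline_correct)
    right_to_wrong = sum(b and not s for b, s in zip(baseline_correct, steered_correct))
    wrong_to_right = sum(s and not b for b, s in zip(baseline_correct, steered_correct))
    delta_pp = 100.0 * (sum(steered_correct) - sum(baseline_correct)) / n
    return delta_pp, right_to_wrong, wrong_to_right
```

Separating right-to-wrong from wrong-to-right flips is what lets a failure taxonomy say that wrong-entity substitution, not refusal collapse, carries most of the damage.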

Anchor 3 — Jailbreak measurement (supporting-caveated)

Paired v2-v3 analysis explains why binary slope conclusions diverge: v2 binary is positive, v3 binary is uncertain, and v3 substantive compliance still rises. Specificity robustness is not yet as clean as for FaithEval/FalseQA.

Measurement case

Jailbreak shifted from headline result to caveated supporting evidence

The effect did not disappear; it moved to a different measurement level. v2 emphasizes a positive binary harmful slope, while v3 leaves the binary slope uncertain but still shows a severity-composition shift.

v2 harmful slope: +2.30 pp/α (95% CI 0.99 to 3.58 pp)
v3 harmful slope: +0.46 pp/α (95% CI -1.46 to 2.41 pp)
v3 substantive slope: +2.00 pp/α (95% CI 0.11 to 3.87 pp)
Holdout discriminator: 96.0% / 96.0% (n = 50; v3 95% CI 90.0 to 100.0%; SR-4o 95% CI 90.0 to 100.0%; 0 discordant)
Interpretation: evaluator dependence is itself a result. v3 stays primary for its taxonomy and zero-false-positive behavior, not for holdout superiority, and specificity still lacks a paired v2-v3 control set and has only a single-seed random control.
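The pp/α slopes in the panel above are rates of change of an outcome percentage with steering strength α. A minimal least-squares sketch under that reading (the project's actual fit and CI machinery are not specified here, so treat this as illustrative):

```python
import numpy as np

def slope_pp_per_alpha(alphas, rates_pct):
    # Ordinary least-squares slope of an outcome rate (in percentage
    # points) regressed on steering strength alpha.
    a = np.asarray(alphas, dtype=float)
    r = np.asarray(rates_pct, dtype=float)
    a_c = a - a.mean()
    return float((a_c * (r - r.mean())).sum() / (a_c ** 2).sum())
```

An interval like the CIs above would then come from bootstrapping this slope over prompts or seeds, which is exactly where the v2 and v3 rulers start to disagree.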
April 14 follow-up

D7 now has a dual-panel read: historical legacy panel plus a current control follow-up

The April 8 panel remains historical provenance. The April 14 update adds a layer-matched random-head control, which improves the control story but does not turn D7 into selector-specific closure.

April 8 legacy panel
Historical but still informative
On the frozen legacy ruler, the locked causal branch lowers csv2_yes by 9.0 pp (95% CI -12.2 to -5.8 pp) versus baseline, while the L1 comparator moves the wrong way, by +4.0 pp (95% CI 0.6 to 7.6 pp). That panel stays on the site as dated provenance, not as the active D7 headline.
April 14 control follow-up
Better control coverage, still not clean closure
On the April 14 current-normalization follow-up, the layer-matched random-head seed remained +12.4 pp more harmful than causal on strict harmfulness (95% CI 8.0 to 16.8 pp). Causal still looks strongest among completed D7 branches, but the comparison is mixed-ruler and single-seed rather than mechanism-clean closure.

What improved

"Random-head control missing" is now stale. A layer-matched random-head control exists, and it points in the same broad direction as the older panel: causal remains the strongest completed branch.

What still blocks closure

The April 14 follow-up is mixed-ruler, only one random seed is available, and the causal branch still carries visible token-cap debt (112/500 outputs hit the token cap). That is enough for supporting evidence, not for selector-specific or mechanism-clean closure.

Editorial consequence

D7 moves from "control missing" to "partial control support." That is a real upgrade, but it still belongs under the supporting tier rather than the flagship pillar tier.

Research principle

A result can be real and still be caveated. Separating headline-safe from supporting evidence raises clarity without discarding useful data.

Confidence tiers

What is load-bearing today vs still provisional

Branch | Tier | Evidence status | Main caveat
SAE vs H-neurons | Headline-safe | Matched detection, divergent steering; null survives architecture variants | Single-model scope
ITI bridge externality | Headline-safe | Held-out test CI excludes zero; wrong-entity substitutions dominate damage | Failure-mode coding is manual taxonomy, not a formal adjudication protocol
Jailbreak evaluator analysis | Supporting, caveated | Paired v2-v3 mechanism is explicit; v3 binary is uncertain but severity shift remains detectable | No paired v2-v3 control set; only a single-seed v3 random control
D7 causal-head branch | Supporting, caveated | April 8 legacy win plus April 14 random-head follow-up leave causal strongest among completed branches | Mixed-ruler, single-seed control; visible token-cap debt
Decision gates

Highest-ROI next moves

If time is limited, prioritize caveat-closing controls over new exploratory branches.

1. Close the jailbreak control gap

Add matched evaluator-control coverage so specificity claims are not tied to missing paired controls and a single-seed comparison.

2. Finish D7 selector checks

Add another matched random-head seed, finish the probe branch, and run a minimal capability battery before treating D7 as a mechanism-level pillar.

3. Design a selective-truthfulness pilot

Use the bridge failure taxonomy to test conditional interventions that target wrong-entity substitution risk directly.