Extensions

Open questions and caveats for the flagship thesis

This page is no longer a sprint tracker. It is a decision map for what is already strong, what is still caveated, and what evidence would actually change the core thesis.

Anchor case studies: 3 (SAE vs H-neurons, ITI bridge externality, jailbreak measurement).
Strongest unresolved caveat: jailbreak specificity. Evaluator dependence is mapped; paired-control robustness is still thin.
Detection-control dissociation: robust. SAE AUROC 0.848 with a steering null under both tested architectures.
Externality break: -5.8pp. TriviaQA bridge test: adjudicated delta (95% CI -8.8 to -3.0pp); wrong-entity substitutions dominate.
Question map

Open questions and caveats by evidence branch

April 8–14 turned the project into a confidence-tier problem: two anchors are headline-safe, one anchor is scientifically valuable but caveated, and future extensions should be judged by whether they reduce those caveats.

Branch overview (confidence tier, what we now know, what still blocks a stronger claim):

Jailbreak measurement + specificity | Supporting, caveated
What we now know: Paired evaluator audit: v2 harmful slope 2.30pp/α (95% CI 0.99 to 3.58pp), v3 binary slope 0.46pp/α (95% CI -1.46 to 2.41pp), v3 substantive-compliance slope 2.00pp/α (95% CI 0.11 to 3.87pp).
What still blocks a stronger claim: No paired v2-v3 control set and only a single-seed v3 random control. Strong enough for a measurement-case-study claim, not for a clean universal specificity headline.

D7 causal-head mitigation | Supporting, caveated
What we now know: April 8 legacy panel: csv2_yes drops by -9.0pp (95% CI -12.2 to -5.8pp) vs baseline, while the L1 comparator increases harm (+4.0pp; 95% CI 0.6 to 7.6pp). April 16 current panel: clean CSV2 v3 comparison with expanded layer-matched random controls and a scored probe branch; the shown random seed remains +17.2pp more harmful than causal on strict harmfulness (95% CI 13.2 to 21.4pp). The live panel is no longer mixed-ruler or single-seed.
What still blocks a stronger claim: Treat this as benchmark-local support: causal still carries token-cap and quality debt (112/500 causal outputs hit the cap), and the probe/random branches retain a small documented residual evaluator-error set. The blocker is not that seed 2 or the probe branch are missing.

ITI bridge externality | Headline-safe
What we now know: Held-out test confirms the externality break: -5.8pp adjudicated accuracy (95% CI -8.8 to -3.0pp), p=0.0002. 70.0% of right-to-wrong flips are wrong-entity substitutions.
What still blocks a stronger claim: Mechanistic interpretation is behavioral rather than circuit-level. Formal multi-rater coding could tighten failure-mode percentages.

SAE vs H-neuron dissociation | Headline-safe
What we now know: Matched detection quality (AUROC 0.848, 95% CI 0.820-0.874 for SAE vs 0.843, 95% CI 0.815-0.870 for H-neurons), but SAE steering is null in both full-replacement and delta-only tests.
What still blocks a stronger claim: No major blocker. Remaining work is narrative integration, not rescue experimentation.

Selective truthfulness / deception extension | Hypothesis
What we now know: Bridge taxonomy suggests global truth steering is too blunt; conditional interventions may be a better target for novelty.
What still blocks a stronger claim: No direct experiment yet. Needs a scoped protocol before it can move from concept to evidence.
Scope note: the core thesis now lives at the methods level, where detection, localization, control, and externality are separable stages. Extensions should be judged by which stage-break they clarify, not by raw run count.
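The bridge externality delta above is a paired, adjudicated accuracy difference with a confidence interval. As an illustration of how such an interval can be computed, here is a minimal percentile-bootstrap sketch on synthetic 0/1 correctness flags; `bootstrap_delta_ci`, the flip rate, and the data are hypothetical, not the repo's actual adjudication pipeline.

```python
import random

def bootstrap_delta_ci(base, steered, n_boot=2000, seed=0):
    """Percentile-bootstrap CI for a paired accuracy delta (steered - base), in pp.

    base, steered: equal-length lists of 0/1 correctness flags for the same items.
    """
    rng = random.Random(seed)
    n = len(base)
    point = 100.0 * (sum(steered) - sum(base)) / n
    deltas = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping each item's pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(100.0 * sum(steered[i] - base[i] for i in idx) / n)
    deltas.sort()
    return point, deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Synthetic data: 500 items, ~62% baseline accuracy; steering flips roughly 10%
# of right answers to wrong (a crude stand-in for wrong-entity substitutions).
rng = random.Random(1)
base = [1 if rng.random() < 0.62 else 0 for _ in range(500)]
steered = [b if rng.random() > 0.10 else 0 for b in base]
point, lo, hi = bootstrap_delta_ci(base, steered)
```

Resampling whole items (rather than the two conditions independently) is what makes the interval appropriate for a within-item intervention effect.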
Caveat order

Priority is now caveat-reduction, not benchmark accumulation

The next best experiment is the one that upgrades a supporting claim to headline-safe. Everything else is optional polish.

1. Close jailbreak specificity with matched controls
Run the missing control surface needed for evaluator-robust specificity, then report binary and severity-level outcomes together.

2. Resolve the D7 selector-specificity gap
Reduce the residual evaluator-error debt on the probe/random branches and add a minimal capability check so D7 can be cited as more than a benchmark-local mitigation.

3. Prototype the selective truthfulness branch
Use bridge failure modes (wrong-entity substitutions and factual denial) to design a conditional intervention objective, not another global knob.

What changed

The project moved from "what should we run next?" to "which caveat still threatens the thesis?". April 8–16 evidence made that tradeoff explicit.

Branch details

Each branch now plays a specific role in the argument

The flagship thesis does not need every branch to be positive. It needs each branch to answer a different stage-break question cleanly.

Jailbreak measurement branch (caveated)

The key result is not "one true slope." It is that the evaluation level changes the conclusion: v2 binary reads a positive slope, v3 binary compresses it, and the v3 primary-outcome taxonomy recovers a severity shift. This is valuable Anchor-3 evidence, but specificity still needs stronger paired controls.
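The slopes quoted for this branch are in percentage points of harmful responses per unit of steering strength α. A minimal sketch of how such a slope can be read off aggregated rates, using ordinary least squares; the rate values below are illustrative stand-ins, not the audit's data.

```python
def slope_pp_per_alpha(alphas, rates_pp):
    """OLS slope of harmful-response rate (percentage points) on steering strength alpha."""
    n = len(alphas)
    mx = sum(alphas) / n
    my = sum(rates_pp) / n
    num = sum((a - mx) * (y - my) for a, y in zip(alphas, rates_pp))
    den = sum((a - mx) ** 2 for a in alphas)
    return num / den

# Hypothetical harmful-response rates (pp) at five steering strengths.
alphas = [0, 1, 2, 3, 4]
rates = [5.0, 7.5, 9.2, 12.1, 14.0]
slope = slope_pp_per_alpha(alphas, rates)  # about 2.26pp per unit alpha
```

Running the same fit on binary-scored rates and on severity-taxonomy rates is exactly what lets the two evaluation levels disagree in slope while sharing the underlying outputs.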

D7 causal branch (promising but provisional)

The April 8 legacy panel and the April 16 clean current panel still point the same way: the locked causal branch looks strongest among completed D7 conditions. But the result stays in the supporting tier: causal is still token-cap-limited, and the probe/random branches still carry a small residual evaluator-error set. Treat this as evidence that alternative selectors can work on this surface, not as a settled mechanism claim.

Bridge externality branch (load-bearing)

This is the cleanest control-to-externality break in the repo: a direction that helps TruthfulQA MC harms open-ended TriviaQA generation, mostly by wrong-entity substitutions. It upgrades externality risk from anecdote to statistically supported evidence.
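The "mostly wrong-entity substitutions" claim is a coded breakdown of right-to-wrong flips. A sketch of that bookkeeping, assuming coder-assigned failure labels; the label names, record layout, and data below are hypothetical, not the repo's actual coding scheme.

```python
from collections import Counter

def flip_failure_shares(records):
    """Share of right-to-wrong flips per coded failure mode.

    records: iterable of (correct_before, correct_after, failure_mode) tuples,
    where failure_mode is a coder-assigned label ("" for non-flips).
    """
    flips = [mode for before, after, mode in records if before == 1 and after == 0]
    counts = Counter(flips)
    return {mode: c / len(flips) for mode, c in counts.items()}

# Hypothetical coded records: 6 of 8 items flip right-to-wrong under steering.
records = [
    (1, 0, "wrong_entity"), (1, 0, "wrong_entity"), (1, 0, "wrong_entity"),
    (1, 0, "factual_denial"), (1, 1, ""), (0, 0, ""),
    (1, 0, "wrong_entity"), (1, 0, "factual_denial"),
]
shares = flip_failure_shares(records)  # wrong_entity carries 4 of 6 flips
```

Conditioning on flips (rather than all errors) is what separates an externality claim from a plain accuracy comparison.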

SAE branch (closed, still central)

The null is not dead weight. It is the clearest localization-to-control break: matched readout quality, divergent intervention outcomes. This remains one of the hardest pieces of the full thesis to dismiss.
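"Matched readout quality" here means near-identical AUROC for the two probe families. AUROC can be computed directly as the Mann-Whitney probability that a positive-class score outranks a negative-class one; a minimal sketch on toy scores (the values are illustrative, not the measured probes).

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive-class score outranks a
    negative-class score, with ties counting half (Mann-Whitney form)."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

# Toy probe scores for positive vs negative examples: 15 of 16 pairs ranked correctly.
a = auroc([0.9, 0.8, 0.75, 0.6], [0.7, 0.5, 0.4, 0.3])
```

Because AUROC only depends on score rankings, two probes can match on it exactly while inducing very different behavior when their directions are used for steering, which is the dissociation this branch documents.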

Future novelty branch (not yet evidenced)

Deceptive-behavior exploration stays in future-work scope until there is a concrete protocol and at least one disconfirmable pilot. Right now it is a hypothesis generator, not a results branch.

Most upgraded by April 13 data: measurement-level interpretation. The paired v2-v3 analysis turns evaluator disagreement into an explained mechanism: conclusions move because borderline mass is handled differently, not because one pipeline is random noise.
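A paired analysis of this kind amounts to cross-tabulating the two evaluators' verdicts on the same outputs, so disagreement is localized to the discordant cells rather than treated as noise. A sketch with hypothetical binary verdicts (the function name and data are illustrative).

```python
def paired_verdicts(v2, v3):
    """Cross-tabulate two evaluators' binary harmfulness verdicts on the same outputs.

    Returns (both_harmful, only_v2, only_v3, neither); the discordant cells
    (only_v2, only_v3) carry the borderline mass the evaluators split on.
    """
    both = sum(1 for a, b in zip(v2, v3) if a and b)
    only_v2 = sum(1 for a, b in zip(v2, v3) if a and not b)
    only_v3 = sum(1 for a, b in zip(v2, v3) if not a and b)
    return both, only_v2, only_v3, len(v2) - both - only_v2 - only_v3

# Hypothetical paired verdicts on eight outputs.
v2 = [1, 1, 0, 1, 0, 0, 1, 0]
v3 = [1, 0, 0, 1, 0, 1, 0, 0]
cells = paired_verdicts(v2, v3)  # (2, 2, 1, 3)
```

An asymmetry between the two discordant cells (here 2 vs 1) is what shows one evaluator absorbing borderline cases the other rejects, which is the mechanism the paired audit makes explicit.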
Most important remaining caveat: specificity controls. Reducing the D7 residual evaluator-error debt and adding stronger jailbreak paired controls matter more for paper credibility than adding another exploratory benchmark.
Decision gates

What would actually change the thesis now

The flagship can already stand on two headline-safe anchors plus one caveated measurement anchor. The next decision is whether to spend remaining budget on caveat closure or on novelty exploration.

If control gaps are closed cleanly: stronger causal selection story. The paper can argue not only that stage breaks exist, but that alternative selectors are at least partially control-supported rather than purely benchmark-local.
If caveats remain unresolved: still publishable, narrower claims. The central thesis remains valid (good readouts often fail as steering targets), but D7 and jailbreak specificity stay supporting evidence instead of pillars.
Why this framing is different: this page now separates "interesting" from "load-bearing." The safest writing strategy is to lead with headline-safe anchors and use caveated branches to show scope, not to carry the core claim.