Anchor 1 appendix

Swing Characterization: What the 14% Reveals

Of 1,000 FaithEval samples, only 138 (13.8%) change compliance state under H-neuron scaling. This page profiles that sensitive subpopulation: what types of transitions occur, how well structural features predict swing status, and whether knowledge override or uncertainty resolution dominates.

Appendix note: this page is preserved as Anchor 1 support for the H-neuron replication case. For the current project-level thesis and confidence-tier framing, start from Core Story or the population analysis summary.
R→C (knowledge override): 94 · 68.1% of swing · 95% CI [59.9, 75.3]
C→R (uncertainty resolution): 32 · 23.2% of swing · 95% CI [16.9, 30.9]
Mean transition α: 1.4 · R→C mean α at first compliance
LLM samples: -- · GPT-4o-mini enriched (knowledge + verification + persuasiveness)
Swing subtypes

Two-thirds of swing samples show knowledge override

Each swing sample follows one of three trajectory shapes across the α sweep. The dominant pattern is R→C: the model initially gives the correct answer but switches to following the misleading context as H-neuron scaling increases.

R→C · Knowledge override: 94 (68.1%, 95% CI [59.9, 75.3])
Model knew the right answer but was overridden by H-neuron scaling. This is the safety-concerning subtype.
C→R · Uncertainty resolution: 32 (23.2%, 95% CI [16.9, 30.9])
Model initially followed the misleading context but recovered correct knowledge with scaling. The intervention helped here.
Non-monotonic: 12 (8.7%) · near-threshold oscillation
Election analogy: think of the 1,000 samples as voters in an election. Of these, 600 always vote one way and 262 always vote the other. The remaining 138 are the swing bloc, and about two-thirds of them switch from correct to misleading under pressure, like undecided voters who know the facts but follow the loudest campaign ad.
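The intervals quoted for the two monotonic subtypes are numerically consistent with 95% Wilson score intervals on 94/138 and 32/138. The page does not name the interval method, so the sketch below treats Wilson as an assumption and simply checks that it reproduces the reported bounds to one decimal place. The function name `wilson_interval` is illustrative, not taken from the project code.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion, returned as fractions."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


SWING_TOTAL = 138  # swing samples reported on this page
SUBTYPE_COUNTS = {"R→C": 94, "C→R": 32}  # counts reported on this page

for name, count in SUBTYPE_COUNTS.items():
    lo, hi = wilson_interval(count, SWING_TOTAL)
    share = 100 * count / SWING_TOTAL
    print(f"{name}: {count}/{SWING_TOTAL} = {share:.1f}% "
          f"(95% CI [{100 * lo:.1f}, {100 * hi:.1f}])")
```

If the project used a different interval (for example a bootstrap), the bounds could differ slightly in the last digit; the agreement here only shows that Wilson is a plausible reading.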
Trajectory heatmap

138 samples, 7 alpha values: the full compliance matrix

Each row is one swing sample, each column is an α value (0.0 to 3.0). Orange cells indicate resistant behavior, while green cells indicate compliance with misleading context. The dominant R→C pattern shows up as rows that transition from orange to green as α increases.

Swing trajectory heatmap showing 138 samples across 7 alpha values
Reading the heatmap: if R→C were just noise, you would see a scattered mix of orange and green. Instead, the transitions form bands: rows stay orange (resistant) at low α then flip green (compliant) at a specific threshold. That banding pattern is consistent with a genuine decision boundary being crossed, not random format drift.
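The row patterns described above can be turned into subtype labels with a small rule. This is a minimal sketch, assuming each row is a list of seven booleans (True means compliant with the misleading context) and that "monotonic" means exactly one flip across the sweep; the page does not spell out either definition, and `classify_trajectory` is a hypothetical helper.

```python
def classify_trajectory(compliant):
    """Label one sample's compliance trajectory across the alpha sweep.

    compliant: sequence of booleans, ordered by increasing alpha.
    """
    if len(compliant) < 2:
        raise ValueError("need at least two alpha values")
    # Count how many times the state changes between neighbouring alphas.
    flips = sum(a != b for a, b in zip(compliant, compliant[1:]))
    if flips == 0:
        return "always-compliant" if compliant[0] else "always-resistant"
    if flips == 1:
        # One flip: the final state tells us the direction of the change.
        return "R→C" if compliant[-1] else "C→R"
    return "non-monotonic"


# Toy rows for illustration only; these are not FaithEval samples.
rows = {
    "band": [False, False, False, True, True, True, True],
    "recover": [True, True, False, False, False, False, False],
    "oscillate": [False, True, False, True, True, True, True],
}
for name, row in rows.items():
    print(name, classify_trajectory(row))
```

Under this rule, the banding visible in the heatmap corresponds to rows with a single flip, while scattered orange and green cells would be labelled non-monotonic.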
Transition dynamics

Monotonic swing transitions skew toward low α, but subtype timing still overlaps

The transition α is the scaling factor at which a sample first changes compliance state. These bars are plotted from exported per-sample first-transition values, not reconstructed from summary means. Both monotonic subtypes concentrate in the first half of the sweep, and the current subtype timing difference remains inconclusive.

R→C by α ≤ 1.5: 63.8% · 60/94 · 95% CI [53.8, 72.8]
C→R by α ≤ 1.5: 71.9% · 23/32 · 95% CI [54.6, 84.4]
Subtype timing difference: p=0.19 · Mann-Whitney U on first-transition α
What this means: the evidence supports a narrower claim than before. Low-α transitions are common in both monotonic subtypes, but the current sample does not justify saying that R→C transitions follow a uniquely earlier schedule than C→R.
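The quantities in this section can be sketched as follows: the first-transition α per sample, the share of transitions at or below a threshold, and the Mann-Whitney U statistic comparing the two subtypes. The even 0.5-step α grid is an assumption (the page only says seven values from 0.0 to 3.0), the helper names are hypothetical, and the toy values are made up. The sketch stops at the U statistic; a p-value like the reported 0.19 would come from a statistics library routine such as SciPy's `mannwhitneyu`.

```python
ALPHAS = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # assumed grid, not stated on the page


def first_transition_alpha(alphas, compliant):
    """Return the first alpha at which the compliance state changes, else None."""
    for i in range(1, len(compliant)):
        if compliant[i] != compliant[i - 1]:
            return alphas[i]
    return None


def share_at_or_below(values, threshold):
    """Fraction of values that are <= threshold."""
    return sum(v <= threshold for v in values) / len(values)


def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties counted as 0.5."""
    return sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)


# Toy first-transition values for illustration only.
rc_first = [0.5, 1.0, 1.5, 2.5]
cr_first = [0.5, 1.0, 2.0]

print("R→C share by alpha <= 1.5:", share_at_or_below(rc_first, 1.5))
print("U (R→C vs C→R):", mann_whitney_u(rc_first, cr_first))
```

Because the α grid has only a few distinct values, ties are frequent, which is why the U computation above gives tied pairs half weight.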
Structural prediction

Surface features are near chance at predicting swing status

Held-out prediction on question structure, topic, and source does not support a strong structural separation claim on the current feature set. Descriptively, context length still differs across populations (p=0.0001) and source composition is imbalanced (V=0.20), but those effects need not imply useful classification.

Structural proxy violin plots showing context length, word overlap, and response length by population
Held-out AUROC: 0.54 · 95% CI [--, --] · perm p=--
Balanced accuracy: 0.54 · 95% CI [--, --]
Source-only AUROC: 0.52 · Dataset identity alone is weak
R→C subtype AUROC: 0.61 · Within swing, subtype prediction is still modest
Interpretation: Descriptive differences exist, but on the current feature set they do not translate into reliable swing classification.
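To make "near chance" concrete, the sketch below computes AUROC, balanced accuracy, and a label-permutation p-value from held-out scores. It is a plain-Python illustration under assumptions: the page does not describe the classifier, the split, or the permutation scheme, all function names are hypothetical, and the toy scores are invented. An AUROC of 0.5 means a randomly chosen swing sample is scored above a randomly chosen non-swing sample only half the time.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(preds, labels):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    n_pos = sum(y == 1 for y in labels)
    n_neg = sum(y == 0 for y in labels)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    return 0.5 * (tp / n_pos + tn / n_neg)


def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """One-sided p-value: how often shuffled labels reach the observed AUROC."""
    rng = random.Random(seed)
    observed = auroc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy held-out scores for illustration only (1 = swing, 0 = non-swing).
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
preds = [int(s >= 0.5) for s in scores]

print("AUROC:", auroc(scores, labels))
print("Balanced accuracy:", balanced_accuracy(preds, labels))
print("Permutation p:", permutation_p_value(scores, labels, n_perm=200))
```

The pairwise AUROC used here is the Mann-Whitney U statistic divided by the number of positive-negative pairs, so the same ranking logic underlies both the timing test and the classification metric.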
LLM enrichment

GPT-4o-mini provides independent knowledge and persuasiveness ratings

Three independent LLM calls per sample classify the question's domain knowledge, verify the model's answer against the judge, and rate context persuasiveness. These ratings cross-check whether R→C samples genuinely represent knowledge override.

LLM-enriched samples: -- · Stratified across all three populations
Verification agreement: -- · Among comparable answers, GPT-4o-mini agrees with model's α=0 answer
Mean persuasiveness: -- · Context convincingness (1–5 scale)
Knowledge override hypothesis: if R→C samples score high on "model knows the answer" and the context is not unusually persuasive, then H-neuron scaling is genuinely overriding knowledge. If instead the knowledge ratings are low or the contexts are highly persuasive, the mechanism is more about resolving genuine uncertainty than overriding settled beliefs.
Safety reading

What the swing characterization means for H-neuron safety claims

1. Knowledge override is real but bounded

68.1% of swing samples (94/138) show genuine R→C transitions. These are cases where the model had the right answer and lost it to H-neuron scaling. That is a concrete safety concern, but it applies to 94 of 1,000 total samples, not the whole benchmark.

2. Not all scaling is harmful

23.2% of swing samples (32/138) show C→R transitions where the model was initially wrong and H-neuron scaling helped it recover. This is the counterpoint: the same mechanism that overrides knowledge in some cases resolves uncertainty in others.

3. Single-model, single-benchmark caveat

All findings are from Gemma 3 4B on FaithEval only. The roughly 68%/23% split between R→C and C→R could differ on other models, benchmarks, or domains. Cross-model replication is the natural next step before making strong deployment claims.

Bottom line: H-neuron scaling is not a clean safety dial. Among the samples it affects, it mostly overrides genuine knowledge (alarming) but sometimes resolves uncertainty (helpful), and it affects only a minority of samples overall. The safety concern is real but should be stated with the appropriate scope: 94 samples out of 1,000, on one model, on one benchmark.