Anchor 1 appendix

Neuron 4288:
from dramatic weight to artifact verdict

The neuron with the largest weight in the L1-regularized sparse detector looked like a possible hub. This page presents six independent tests, one scoreboard, and the reasons the project should not anchor its mechanism story on this neuron alone.

Appendix note: this page is an Anchor 1 appendix, not a standalone flagship claim. It supports the narrower H-neuron replication story by showing why the mechanism object should remain distributed rather than collapse onto one neuron.

L20:N4288 L1 weight: 12.17 (rank #1 at C=1.0)
Rank at C=3.0: #5 (the first sign the dominance is regularization-specific)
Standalone AUC: 0.590 (below the threshold we set for a unique hub story)
Largest contribution share: 7.4% (far below the 30% "dominant feature" threshold)
Ablation loss: 1.0pp (too small for a single-neuron bottleneck claim)
Overall verdict: 0/6 (no test supports a unique neuron-hub interpretation)

Forensic investigation

Six tests, each asking a different version of the same question

A real hub neuron should survive multiple kinds of scrutiny. We required broad support instead of letting one flattering plot do all the work.

[Figure: Single-neuron AUC comparison]
1. Single-neuron AUC (verdict: Artifact)

If 4288 were genuinely carrying the detector, it should classify well on its own. Instead its standalone AUC is 0.590, worse than both L13:N833 and L14:N8547, and well short of the best single-neuron AUC of 0.703.
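
For readers who want to reproduce this kind of check, here is a minimal sketch of a standalone AUC, assuming the neuron's raw activation is used directly as the classification score. The function name and the toy data are illustrative assumptions, not the project's code or activations.

```python
def standalone_auc(activations, labels):
    """AUC when one neuron's raw activation is used directly as the score."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # AUC equals the fraction of (positive, negative) pairs ranked correctly,
    # with ties counted as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy activations and labels, invented for illustration only.
toy_acts = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
toy_labels = [0, 0, 1, 1, 1, 0]
print(round(standalone_auc(toy_acts, toy_labels), 3))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sanity check. A rank-based implementation would be the usual choice at scale.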

[Figure: Activation distributions]
2. Distribution separation (verdict: Artifact)

Cohen's d is 0.326, below the 0.5 cutoff we set ahead of time. The activations separate somewhat, but not enough to justify a "special neuron" story.
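
As a rough illustration of the effect-size check, the sketch below computes Cohen's d with a pooled sample standard deviation and compares it to the 0.5 cutoff quoted above. Whether the project pooled variances this way, and whether the cutoff is inclusive, are assumptions; the helper names are hypothetical.

```python
import math
from statistics import mean, variance

D_CUTOFF = 0.5  # the pre-set cutoff quoted in the text


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def clears_cutoff(d, cutoff=D_CUTOFF):
    # Assumption: the cutoff applies to |d| and is inclusive.
    return abs(d) >= cutoff


# Toy groups, invented for illustration only.
print(cohens_d([2, 4, 6], [1, 3, 5]))
# The value reported on this page falls short of the cutoff.
print(clears_cutoff(0.326))
```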

[Figure: C-sweep stability analysis]
3. C-sweep stability (verdict: Artifact)

This is the sharpest test. Neuron 4288 is absent at C ≤ 0.3, appears at C = 1.0, then falls to rank 5 at C = 3.0 and rank 11 at C = 10.0. A real mechanism should not be that C-fragile.
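
The bookkeeping behind a sweep like this can be sketched as follows, assuming the fitted weights for each C are already available as a mapping from neuron name to weight. Ranking by absolute weight among nonzero entries is an assumption here (the project may rank positive weights only), and the toy weights are invented.

```python
def rank_by_abs_weight(weights, neuron):
    """1-based rank of `neuron` by |weight| among nonzero weights.

    Returns None when the neuron has zero weight (dropped by L1) or is missing.
    """
    if weights.get(neuron, 0.0) == 0.0:
        return None
    nonzero = sorted((abs(w) for w in weights.values() if w != 0.0), reverse=True)
    return nonzero.index(abs(weights[neuron])) + 1


def sweep_ranks(weights_by_c, neuron):
    """Map each C value (ascending) to the neuron's rank at that C."""
    return {c: rank_by_abs_weight(w, neuron) for c, w in sorted(weights_by_c.items())}


# Toy sweep, invented for illustration only: "n2" is absent, then first, then third.
toy_sweep = {
    0.3: {"n1": 0.5, "n2": 0.0, "n3": 0.2},
    1.0: {"n1": 0.4, "n2": 0.9, "n3": 0.3},
    3.0: {"n1": 1.2, "n2": 0.8, "n3": 1.0},
}
print(sweep_ranks(toy_sweep, "n2"))
```

A neuron whose rank jumps around like "n2" in the toy sweep is showing the same C-fragility described above.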

[Figure: Per-example contribution analysis]
4. Per-example contribution (verdict: Artifact)

Despite the oversized L1 weight, 4288 is the largest positive contributor on only 7.4% of examples. That is more "frequent committee member" than "single point of control."
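
One way to compute a share like this, sketched under the assumption that a neuron's per-example contribution is its weight times its activation in the linear detector. The function name and toy numbers are hypothetical.

```python
def top_contributor_share(X, weights, neuron_idx):
    """Fraction of examples where `neuron_idx` has the largest positive w*x term."""
    hits = 0
    for row in X:
        contribs = [w * x for w, x in zip(weights, row)]
        best = max(range(len(contribs)), key=contribs.__getitem__)
        # Only count the example if the winning contribution is actually positive.
        if best == neuron_idx and contribs[best] > 0:
            hits += 1
    return hits / len(X)


# Toy detector with three neurons, invented for illustration only.
toy_weights = [2.0, 1.0, -1.0]
toy_X = [
    [1.0, 0.5, 0.0],
    [0.1, 1.0, 0.2],
    [0.0, 0.0, 1.0],
    [0.3, 0.2, 0.5],
]
print(top_contributor_share(toy_X, toy_weights, 0))
```

Note that a large weight does not guarantee a large share: the neuron also needs large activations on many examples, which is the gap this test exposes.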

[Figure: Leave-one-out ablation]
5. Leave-one-out ablation (verdict: Artifact)

Zeroing the neuron reduces accuracy by 1.03pp, below the 2pp threshold. It matters, but not enough to look like a true bottleneck.
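
A minimal sketch of the ablation arithmetic, assuming the detector is a linear classifier thresholded at a logit of zero and that the neuron is zeroed without refitting the remaining weights. Both are assumptions about the setup, and the names and data are illustrative.

```python
def accuracy(X, y, weights, bias):
    """Accuracy of a linear detector that predicts 1 when the logit is positive."""
    correct = 0
    for row, label in zip(X, y):
        logit = bias + sum(w * x for w, x in zip(weights, row))
        correct += int((logit > 0) == (label == 1))
    return correct / len(X)


def ablation_drop_pp(X, y, weights, bias, neuron_idx):
    """Accuracy lost, in percentage points, when one neuron is zeroed (no refit)."""
    ablated = list(weights)
    # In a linear model, zeroing the weight is equivalent to zeroing the activation.
    ablated[neuron_idx] = 0.0
    return 100.0 * (accuracy(X, y, weights, bias) - accuracy(X, y, ablated, bias))


# Toy two-neuron detector, invented for illustration only.
toy_w, toy_b = [1.0, 1.0], -1.0
toy_X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0], [0.0, 0.2]]
toy_y = [1, 1, 0, 0]
print(ablation_drop_pp(toy_X, toy_y, toy_w, toy_b, 0))
```

In the toy case the drop is large because each positive example depends on exactly one neuron. A drop near 1pp, as reported here, suggests the other neurons cover most of what the ablated one was doing.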

[Figure: Correlation structure heatmap]
6. Correlation structure (verdict: Artifact)

The strongest observed correlation is r = 0.492 with L26:N1359, a neuron the L1 fit assigned zero weight. That is the pattern you would expect when L1 picks one representative from a correlated feature group and zeroes out the rest.
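
The search for the most correlated partner can be sketched like this, assuming Pearson correlation over per-example activations. The helper names and toy activations are hypothetical, and the code does not guard against constant (zero-variance) activation vectors.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strongest_partner(target, others):
    """Return (name, r) for the neuron in `others` with the largest |r| to `target`."""
    scored = {name: pearson_r(target, acts) for name, acts in others.items()}
    name = max(scored, key=lambda k: abs(scored[k]))
    return name, scored[name]


# Toy activations, invented for illustration only.
toy_target = [1.0, 2.0, 3.0, 4.0]
toy_others = {
    "a": [2.0, 4.0, 6.0, 8.0],
    "b": [4.0, 3.0, 2.0, 2.0],
    "c": [1.0, 0.0, 1.0, 0.0],
}
print(strongest_partner(toy_target, toy_others))
```

Cross-referencing the returned partner against the fitted weights then shows whether the closest correlate was kept or zeroed by the L1 fit.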

Why six tests instead of one: a neuron can look important by one measure and ordinary by another. This grid is designed like a cross-examination, where each test checks a different loophole in the original "hub neuron" hypothesis.
Methodological reflection

The surprising part is not that one neuron looked big, but that a sparse linear model made that bigness look causal. L1 regularization is useful because it forces a short list, but it also behaves like a budget committee: when several features tell a similar story, it often funds one spokesperson and cuts the rest.
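
To make the role of C concrete, here is one common way to write an L1-penalized logistic regression objective. It assumes the convention in which C is the inverse regularization strength (as in scikit-learn). The page does not state which library or parametrization the project used, so treat the exact form as an assumption.

```latex
\min_{w,\,b}\; \lVert w \rVert_1 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(w^\top x_i + b)\right)\right),
\qquad y_i \in \{-1, +1\}
```

Under this form, a small C lets the penalty term dominate, so fewer weights survive. A larger C relaxes the budget and lets more neurons enter. That fits the pattern reported on this page: 4288 is absent at C ≤ 0.3, and the signal spreads across 219 positive-weight neurons at C = 3.0.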

Verdict

0/6 tests support a unique hub-neuron story

AUC: 0.590 (best single: 0.703)
Cohen's d: 0.326 (runner-up: 0.477)
C-sweep: 3/9 (present in only 3 of 9 C values)
Top contrib: 7.4% (largest contributor share)
Ablation: 1.0pp (below 2pp threshold)
Max |r|: 0.492 (L1 concentration signal)

Artifact rationale

The stronger mechanistic object is the distributed detector, not the loudest single neuron

The lesson here is not "ignore neuron 4288." It is "do not confuse sparse model bookkeeping with causal centrality."

What survived scrutiny

The overall detector remains useful. The clean 76.5% result (95% CI 73.6-79.5%) and the intervention effect do not depend on pretending 4288 is a unique control node.

What got withdrawn

The idea that the sparse 38-neuron detector collapses onto one superstar neuron does not survive contact with ablation, C-sweeps, or correlation structure.

Why C=3.0 matters

At C=3.0 the signal is spread across 219 positive-weight neurons, and 4288 drops to rank #5. That is a better fit for a distributed mechanism story.

Practical interpretation: think of neuron 4288 as the boldface term in a sparse summary, not as the whole argument. If the project wants the cleanest paper-faithful replication headline, the 38-neuron set still works as a baseline. If the project wants the best mechanistic description, the wider positive-weight detector is the better object.