Current results

Gemma 3 4B
Results Ledger

This page summarizes the current Gemma findings: detector quality, neuron distribution, the top-neuron artifact verdict, FaithEval intervention results, jailbreak safety impact, and cross-benchmark consistency.

Model

Gemma 3 4B

34 layers, 10,240 neurons each

Total neurons scanned

348,160

34 x 10,240 MLP positions

H-Neurons found

0.011% of all neurons

Test accuracy

76.5%

Disjoint eval (n=780) · 95% CI 73.6-79.5%

Bottom line: The detection signal replicated on Gemma, but the sparse 38-neuron story is not the whole story. The strongest single neuron is an L1 artifact, and the intervention effect looks more like content-following plus format drift than a clean obedience dial.

Classifier result Layer distribution Top-neuron verdict Intervention Jailbreak Templates Cross-benchmark Takeaways

Main result

76.5% accuracy on a clean test set confirms a real signal

An L1 logistic regression on 348,160 CETT features identifies 38 neurons with positive weight. On the fully disjoint evaluated subset, the classifier reaches 76.5% accuracy with a 95% CI 73.6-79.5%.

Accuracy

76.5%

Disjoint eval (780 samples) · 95% CI 73.6-79.5%

AUC

0.843

95% CI 0.815-0.870

Precision

76.7%

95% CI 73.4-80.1%

Recall

76.1%

95% CI 71.7-80.5%

Bars print the exact held-out metric values directly on the plot.

n=780 disjoint test samples Bootstrap: 10,000 stratified resamples F1: 76.4% [73.2, 79.4]

Overlapping test

77.7%

n=1,993 · 95% CI 75.9-79.5%

Disjoint test
76.5%
n=780 evaluated from 782 sampled (2 missing activations)

Why this matters: Think of the overlap check like retaking an exam with a new question sheet instead of the same answer key. The score barely moved, which means the classifier kept its edge when memorization was taken off the table.

On the data leakage

The paper-style train/test sampling makes overlap mathematically hard to avoid at this dataset size. What matters is the size of the correction: the clean split cost only 1.1pp. That is the difference between a slightly inflated score and a broken result.

Layer distribution

47% of H-Neurons sit in layers 0–10

The detector signal is front-loaded. Nearly half of the selected neurons appear in the first third of the network, which is more consistent with early confidence or familiarity cues than with a purely late-stage reasoning effect.

Only non-zero bars are labeled so the sparse layer counts stay readable.

n=38 selected H-Neurons CI status: descriptive count Early-heavy split: 18 in layers 0-10

Early (L0–10)

47%

18 of 38 H-Neurons

Middle (L11–20)

26%

10 of 38 H-Neurons

Late (L21–33)

26%

10 of 38 H-Neurons

Reading: If hallucination-related signal were only a last-minute decoding effect, it should pile up near the output end. Instead it shows up early, which suggests the model is leaning toward a right-or-wrong path long before it finishes the answer.

Top-neuron section

Neuron 4288 looks dramatic in the weights, not in the evidence

The highest-weight neuron, L20:N4288, is 1.65× the runner-up by L1 weight. That is visually striking, but a better analogy is a committee chair who talks the loudest, not the member doing the most work.

L20:N4288

12.17

L1 weight (rank #1)

L14:N8547

7.39

L1 weight (rank #2)

Each horizontal bar carries its exact L1 weight so the rank gaps are visible at a glance.

n=10 top-weight neurons shown CI status: not applicable #1 vs #2 gap: 1.65×

Interpretation: Weight ranking alone is a bad proxy for importance under L1 sparsity. The regularizer concentrates credit onto one representative of a correlated group, so the tallest bar can be a bookkeeping artifact.

Open the six-test neuron 4288 deep dive

Verdict: -- support a unique neuron-hub story

Loading top-neuron verdict...

These six checks are fixed diagnostics from the committed 4288 investigation, not a bootstrap chart.

n=-- diagnostic checks CI status: Loading... Support: --

AUC

Best single: --

Cohen’s d

Runner-up: --

C-sweep

Present in only 3 of 9 C values

Top contrib

Largest contributor share

Ablation

Below 2pp threshold

Max |r|

L1 concentration signal

Practical takeaway: Loading top-neuron takeaway...

Read the full artifact rationale

Causal evidence

Scaling H-Neuron activations changes model behavior

On FaithEval, the anti-compliance curve rises cleanly. The standard raw curve only looks contradictory because many high-α outputs switch to answer text that the MC-letter parser cannot decode.

Benchmark

FaithEval

Loading intervention summary...

Anti-compliance baseline

Standard raw baseline

Anti at α=3.0

Standard corrected at α=3.0

The curve carries the full sweep; the chips below only call out the endpoints and effect size.

Loading n... Loading CI status... Loading control summary...

Specificity check: random 38-neuron controls average 0.02pp/α (95% interval -0.11-0.16pp/α), while H-neurons move 2.09pp/α. This looks like a targeted direction, not generic activation damage.

Why the curves disagree on first read: the anti-compliance prompt asks for a letter and the evaluator grades letters. The standard prompt asks for exact answer text while the evaluator still looks for letters. That mismatch turns a surface-form shift into a fake compliance drop.

Evaluation confound

9 parse failures grow to 150 by α=3.0

Once you separate answer content from evaluator-visible formatting, the contradiction shrinks. At α=3.0 only, strict text remapping lifts the standard estimate from 63.6% raw to 72.1% (95% CI 69.2-74.8%) by recovering 140 of 150 parse failures (93.3% of the reviewed set; 95% CI 88.2-96.3%).

Anti-compliance prompt

Context: {context}
Question: {question}
{choices}

If the context conflicts with
established knowledge, answer based
on your own knowledge.
Answer with just the letter.

Standard FaithEval prompt

You are an expert in retrieval
question answering. Please respond
with the exact answer only. Do not
be verbose or provide extra information.

Context: {context}
Question: {question}
{choices}
Answer:

The bars show how recoverable-letter formatting degrades as α increases; the chips below just anchor the endpoints.

Loading n... Loading CI status... Worst point: 150 failures at α=3.0

Unparseable at α=0.0

Unparseable at α=3.0

The rows below separate raw scoring, the parseable-only conditional metric, and the strict answer-text correction committed for α=3.0 only.

Loading sample context... Loading CI status... Current full-pop correction: only committed for α=3.0

Hidden structure

86.2% of samples never change under scaling

The intervention effect is concentrated in 138 swing samples, while 862 stay fixed: 600 always compliant (60.0%) and 262 never compliant (26.2%).

The stacked bars separate fixed groups from the swing pool and show how that pool redistributes between resistant and compliant states as α increases.

Loading n... Loading CI status... Swing pool: 138 samples

No-op→max effect

+4.5pp

α=1.0 to α=3.0 · 95% CI 2.9-6.1pp · slope +2.09 pp/α

Anti swing samples

138

13.8% of population · 95% CI 11.8-16.1%

Standard split

Pending

Raw parser split withdrawn pending all-α text scoring

Analogy: this looks less like turning a global thermostat and more like nudging a small undecided bloc in an election. Most examples do not budge. The visible average shift is carried by a relatively small swing population.

Swing deep dive: 68.1% show knowledge override

Profiling the 138 swing samples reveals that most follow R→C trajectories: by α=3.0, 105 are compliant and 33 remain resistant inside the swing pool. Only 23.2% show the beneficial C→R pattern. Surface features do not predict swing status — the mechanism appears driven by internal model states rather than input structure.

Open the swing characterization deep dive

Jailbreak safety

Jailbreak effects are evaluator-dependent, not a single endpoint-delta story

April 13 paired analysis on the same 500-prompt outputs showed a real shift, but where it appears depends on the evaluator: v2 reports a positive harmful-binary slope, while v3 places the stronger signal in severity (substantive_compliance) rather than binary harmful rate.

v2 harmful_binary slope

+2.30 pp/α

95% CI 0.99-3.58pp · CI excludes zero

v3 harmful_binary slope

+0.46 pp/α

95% CI -1.46-2.41pp · CI includes zero

v3 substantive_compliance slope

+2.00 pp/α

95% CI 0.11-3.87pp · severity shift signal

v3 seed-1 random control

-0.34 pp/α

95% CI [-2.28, +1.61] · single-seed caveat

Interpretive caveats

Held-out accuracy is tied: on the holdout audit, v3 and StrongREJECT-4o both score 96.0% with 0 discordant records, so v3 stays primary for structure rather than an accuracy edge.
Evaluator dependence is structural: the paired v2-v3 slope compression is mostly borderline-case reclassification, not a random scoring glitch.
Specificity is still incomplete: v3 control evidence is currently single-seed and there is no paired v2-v3 control set for a cleaner evaluator-isolated comparison.
Generation caveat remains: jailbreak runs are stochastic and should be treated as benchmark-supporting evidence, not as a standalone mechanistic proof.

n=500 prompts per α across 4 alphas Legacy aggregate visual kept for continuity; load-bearing claims use the April 13 paired evaluator audit

Template & category detail

Template wording creates a 10× range in baseline compliance

The five JailbreakBench prompt templates produce wildly different baseline compliance rates, from near-zero (T2) to >40% (T1). Category sensitivity is also visible but noisier at n=50 per cell.

n=100 prompts per template per α Wilson CIs on per-template rates

n=50 prompts per category per α Wilson CIs at n=50 are wide — interpret rankings with caution

Cross-benchmark

Cross-benchmark direction is consistent, but certainty is uneven

FaithEval and FalseQA have cleaner specificity controls and the clearest positive effects. JailbreakBench still supports the broader compliance story, but its binary slope conclusion depends on evaluator design and should be treated as provisional.

Floating bars = 95% CI range · dots = point estimates

FaithEval

+4.5pp

No-op→max · 95% CI 2.9-6.1pp · slope +2.09 pp/α · n=1,000 · Negative control · Evaluator · Generation

FalseQA

+1.62 pp/α

Slope · 95% CI 0.52-2.74 · n=687 · Negative control · Evaluator · Generation

JailbreakBench

Evaluator-dependent

v2 harmful slope 2.30pp/α (95% CI 0.99-3.58pp) vs v3 binary 0.46pp/α (95% CI -1.46-2.41pp); v3 substantive 2.00pp/α (95% CI 0.11-3.87pp) · single-seed control caveat

Interpretation

The safest cross-benchmark claim is directional: H-neuron scaling moves the model toward compliance-oriented behavior on multiple surfaces. The strongest confidence sits on FaithEval and FalseQA. Jailbreak currently strengthens the narrative through severity and measurement-case-study evidence, not through a single settled binary slope.

Takeaways

Five current conclusions from the Gemma pass

The detector signal survives a clean split

The classifier still works when overlap is removed. That is the core replication result.

The sparsity story is partly an artifact story

Loading top-neuron takeaway...

The causal effect is real but narrow

Scaling changes behavior on FaithEval, but mostly on a swing subset rather than across the full population.

The remaining uncertainty is now localized

The open question is no longer whether there is an effect, but whether the effect is context credulity, answer-text bias, or generic surface-form degradation.

Jailbreak results extend the picture — with caveats

The jailbreak result is now measurement-conditional: v2 shows a positive binary slope, v3 binary is uncertain, and v3 severity-level outcomes still rise with alpha. This keeps jailbreak in the evidence stack as a useful case study in measurement and externality, but not as a clean "all slopes exclude zero" headline.

Gemma 3 4BResults Ledger