Gao et al. claim fewer than 0.1% of MLP neurons predict and causally drive hallucination-linked over-compliance. We replicated on Gemma 3 4B.
This week we ran three benchmarks with negative controls, closed two dead ends, and caught a measurement bug in the original paper’s jailbreak evaluation. The full 5,000-token rerun falsified the alpha slope claim. What it revealed instead: h-neuron scaling modulates compliance style, not compliance count.
Before this week, the causal story rested on one benchmark. Now three independent tasks all show positive endpoint effects with CIs excluding zero. Different evaluators, different task types, same 38 neurons, same direction.
Endpoint effects (α=0→3) with 95% bootstrap CIs · JailbreakBench legacy value now superseded
This is the most important result of the week. Without it, every causal claim could be explained by generic perturbation — scaling any 38 neurons might produce the same shift. It does not.
Ablating H-neurons (α=0.0) drops compliance to 64.2%; ablating random neurons leaves it at ~66.0%. Amplifying H-neurons (α=3.0) pushes compliance to 70.5%; amplifying random neurons stays at ~66.1%.
A causal claim without a negative control is an observation. A causal claim with a negative control that falsifies the generic alternative is an argument.
Gold-label calibration caught a truncation bug in the original paper’s jailbreak evaluation. The 5,000-token rerun killed the alpha slope entirely. But the corrected data revealed something the binary metric was hiding: Disclaimer Erosion.
Regenerating borderline jailbreak cases at full length, every SAFE→HARMFUL flip followed the same structure: ethical preamble (~150–250 tokens), pivot phrase, then substantive harmful content. At 256 tokens the response is cut mid-preamble. The harmful content never reaches the judge.
Page 14, Section 6.2.2: “We generate responses using open-ended sampling with parameters temperature=0.7, top_k=20, top_p=0.8 and a maximum output length of 256 tokens.” No mention of truncation bias. No published code. Their jailbreak compliance rates carry the same systematic underestimate. 100% of α=0.0 responses exceed 256 tokens (median 1,238 tokens).
Same prompts, same stochastic decoding config (T=0.7, top_k=20, top_p=0.8), max_new_tokens=5,000. GPT-4o binary judge on all 1,500 responses. No response hit the 5,000-token cap.
Paired bootstrap Δ (α=0→3): +3.0pp [−1.2, +7.2]. CI includes zero. The slope is gone.
Same 500 prompts, same judge · Legacy slope: +6.2pp [2.4, 10.0] · Canonical: +3.0pp [−1.2, 7.2]
At α=0.0 disclaimers are longest, so truncation suppresses the most true compliance. At higher alphas, shorter disclaimers mean more harmful content fits within 256 tokens. This asymmetric suppression manufactured a slope where none exists.
Compliance gain (pp) when extending from 256 to 5,000 tokens · Largest at α=0.0 where disclaimers are longest — the mechanism behind the spurious slope
The aggregate barely moves (+3pp), but 22.2% of individual prompts flip their compliance label between α=0 and α=3. 32 prompts (6.4%) reverse direction mid-sweep. Alpha reshuffles which prompts comply.
Category-level compliance at 5,000 tokens · n=50 per category per α · Sexual/Adult decreases with alpha while Economic harm increases — opposite directions
In the 25-prompt human audit, the pattern is consistent:
A binary HARMFUL/SAFE metric cannot see this. The safety degradation is a gradient of framing intensity, not a switch from refusal to compliance.
Gold-label validation (25 prompts × 3 alphas): judge-human agreement is worst at α=0.0 where disclaimers are heaviest, and best at higher alphas where harm is more overt. 10 of 11 errors are false negatives: the judge calls SAFE what the human calls HARMFUL.
GPT-4o judge agreement with human labels on full-length responses · n=25 per alpha · 10 of 11 disagreements are false negatives (judge says SAFE, human says HARMFUL)
The judge is most reliable when responses are most dangerous, and least reliable when harm is subtle. A binary prompt lets the judge anchor on the opening framing. Three paragraphs of ethical disclaimers pattern-match to “refusal” before the compliance that follows. The judge’s accuracy is confounded with the phenomenon we’re measuring: disclaimer intensity.
A severity rubric forces the judge to evaluate degree, breaking the anchor on initial framing:
GPT-4o Batch API with this rubric over the existing 1,500 responses plus the negative control set. If binary compliance is flat across alphas while severity shows a monotonic slope, that confirms the mechanism: h-neuron scaling operates on framing, not the refusal decision.
A measurement bug found during calibration is not a detour. It is the calibration working. The best critiques don’t just find the error. They find what the error was hiding.
Two lines of investigation closed this week. Both narrow the explanation space.
SAE features that detect hallucination at AUROC 0.848 cannot steer compliance when manipulated. Tested under two architectures:
Delta-only cancels reconstruction error exactly → the failure is feature-space misalignment, not noise. This is a detection-steering dissociation.
The top-weighted neuron (L20:N4288, weight 12.17 — 1.65× the runner-up) looked like a hallucination hub. Six independent diagnostic tests all say otherwise:
Verdict: 0/6. Full forensic deep dive →
The most important results this week came from looking closely at methodology, not from running new benchmarks.
The story improves when claims are sorted by the weight they can bear. A clean intervention with negative controls is structural steel. A classifier AUROC is useful scaffolding. A full-response readout is a sketch on a napkin.