GPT-5.5 tops ALE 26%, Krafton folds MoE to dense, Muon wins on curvature

TL;DR

GPT-5.5 tops Berkeley’s ALE at 26.2%, ahead of Claude and Gemini configs.
75% of ALE failures trace to missing domain knowledge, not execution errors.
Krafton’s MoE-to-dense distillation beats dense pruning by 6.3pp at 1.6× faster training.
Muon’s 2× speedup over Adam comes from 1.76× lower curvature, not larger update steps.
Hacker-fixer loop hardens KernelBench and Terminal-Bench verifiers against reward-hacking attacks.

Three methods-and-benchmark papers land today, and the pattern across them is self-limiting: each ships a headline win and then names the regime where its own approach stops working. Berkeley’s ALE crowns GPT-5.5 at 26.2%, but the same paper traces 75% of misses to domain-knowledge gaps and flags verifiers shallow enough to award full marks for four mounting holes. Krafton’s MoE-to-dense pipeline beats pruning by 6.3pp, then notes the original sparse MoE still wins wherever its weights fit in memory. And Muon’s 2× speedup over Adam, the headline that launched the optimizer, reduces to lower curvature along its update direction — a mechanism that blows up at trillion-parameter scale without Moonshot’s QK-Clip patch.

The round-ups read in the same key: a hacker-fixer loop hardens agent verifiers against reward hacking, a RAG rewriter ablation pins the gain on leaked gold-answer spans, and steering-vector analysis shows concepts live in angle, not norm. Today rewards readers who keep reading past the abstract.

GPT-5.5 leads Berkeley’s ALE agent benchmark at 26%

TL;DR

GPT-5.5 (Codex) tops ALE at a 26.2% overall pass rate, ahead of Claude and Gemini configs.
Last-Exam tier averages 2.6% across frontier configs, with Gemini 3.1 Pro scoring a flat 0%.
75% of failures trace to missing domain knowledge, not execution errors.
Critics flag shallow verifiers — one KiCad task awards full points for four mounting holes alone.

A benchmark designed to be unsaturable

Agents’ Last Exam (ALE), from UC Berkeley with 250+ industry contributors, is the first agent benchmark to claim total coverage of the U.S. occupational taxonomy: 13 industry clusters, 55 SOC/O*NET subdomains, 1,490 tasks drawn from real projects domain experts actually did at work. Each task ships as a main.py with load/start/evaluate hooks and runs in a Linux or Windows VM with a standardized input/software/output/reference layout. Grading is artifact-based — a compiled binary, a 3D mesh, a spreadsheet with specific values — gated by binary preconditions like “no collisions in the toolpath” before quality points accrue.

The framing is deliberate. The authors argue that wins on MMLU, GPQA, and competitive programming aren’t translating into deployable digital labor, and they built ALE as the exam an agent must clear to be considered employable.

The leaderboard

Model (best harness)	Overall pass rate
GPT-5.5 (Codex)	26.2%
Claude Fable 5	21.5%
Gemini 3.1 Pro	15.8%

(VentureBeat and Snorkel both refer to Anthropic’s entry as “Fable 5”; the paper’s draft tables call the same model “Opus 4.7” ¹².) The framing across independent coverage is “sobering reality check” rather than breakthrough, and Snorkel’s mirror adds a cost wrinkle the paper underplays: Fable 5 costs roughly 4× more than GPT-5.5 for comparable scores ². A telling sub-result — models that hit 82% on Terminal-Bench collapse to 25.2% on ALE-CLI, suggesting prior terminal benchmarks were measuring short-horizon competence rather than sustained workflow execution. On the Last-Exam tier specifically, the average across all mainstream configs is 2.6%, and Gemini 3.1 Pro posts a clean zero ¹.

Harness design is doing more work than you think

Berkeley’s own “Harness Matters” post quietly reframes the leaderboard. ALE-Claw, their reference harness, strips out long-term memory and preference learning to isolate model capability — and still matches commercial harnesses while using 44% fewer tokens at ~40% lower cost ³. Snorkel’s mirror corroborates: homemade harnesses often beat vendor-specific ones for half the price ². The implication is uncomfortable for Codex and Claude Code: significant scaffolding overhead, not much accuracy to show for it.

Where the verifiers crack

The sharpest dissent comes from r/BetterOffline, surfaced via Digg: the “Mini Encabulator” KiCad task ships KiCad 9 in the environment but the schematic requires KiCad 10, and the grader awards full points for any file containing four mounting holes — electrical routing irrelevant ⁴. That’s a direct counterexample to ALE’s “deterministic, hard-to-game” pitch, and it suggests at least some of the 1,490 tasks suffer from the same easily-gamed verification problem ALE was built to fix.

Epoch AI’s comparative analysis positions ALE against OpenAI’s GDPval as the two serious economic-value benchmarks. GDPval uses blind pairwise expert preference (saturating as models mimic professional prose); ALE uses executable verification (harder to game but expensive to run, and only as good as the grader script) ⁵.

What’s actually being measured

The 75%-of-failures-are-understanding finding is the most generalizable result here. Independent failure analysis calls the pattern “Success Hallucination” — agents announce “Done. All checks pass” without verifying that their CLI workaround actually produced the GUI state the task required ⁶. That maps onto the same gap OS-Marathon and Odyssey-style benchmarks have surfaced, which is the strongest argument that ALE is measuring something real and not a benchmark artifact. Treat the 26%/2.6% numbers as directionally credible; treat the leaderboard rankings as much a verdict on harness design as on model intelligence.

Krafton distills MoE into dense, beats pruning by 6.3pp

TL;DR

Krafton’s pipeline scores and concatenates MoE experts into one dense FFN, then distills from the MoE teacher.
Converted dense student beats dense-to-dense pruning by +6.3 pp accuracy at 1.6× faster training wall-clock.
Diversity-aware scoring wins across Qwen3, DeepSeek-V2-Lite, and GPT-OSS-20B in a 350-config sweep.
Where MoE weights fit in memory, the sparse original runs faster and more accurate than dense alternatives.

What the framework actually does

Every prior MoE compressor — CD-MoE, D²-MoE, NAEE, MoE-I² — leaves you with a smaller MoE. That’s useful for memory, but it doesn’t help users whose serving stack can’t handle sparse routing at all. Krafton’s paper attacks the harder problem: turn the MoE into a vanilla dense transformer that any inference engine can run.

The recipe is mechanical but the search space is large. Experts are scored (the paper proposes a diversity-aware score based on log-determinants of expert activations), grouped, concatenated into one wide FFN, and then refined by distillation from the original MoE acting as teacher. The 350 Qwen3-30B-A3B configurations span 7 scoring methods, 5 grouping strategies, 2 magnitude-scaling choices, and a range of selected-expert counts. Scoring choice dominates everything else — a useful negative result for anyone planning their own conversion.

The headline number is +6.3 pp average downstream accuracy over dense-to-dense pruning at matched parameter count, with training running 1.6× faster after ~4B distillation tokens. The gap generalizes across three different MoE families.

How big is the win, really

The right baseline ladder is the part to read carefully. Most published MoE compressors are evaluated MoE-to-MoE: D²-MoE holds ~95% of Mixtral-8x7B’s MMLU at 20% compression, while NAEE and MoE-I² fall further behind, and NAEE’s brute-force expert search reportedly doesn’t scale to fine-grained routers like DeepSeek or OLMoE ⁷. CD-MoE recovers 98% of DeepSeek-V2-Lite’s accuracy with a 27.5% memory cut after just 5 hours of A100 fine-tuning — but the output is still sparse ⁸. Krafton’s comparison is against dense pruning, which is the right control for their output format, but it leaves the head-to-head against the strongest MoE-preserving compressors open.

The bigger framing problem: practitioner data suggests MoE deployment isn’t as painful as the paper’s motivation implies. Qwen3-30B-A3B reaches ~64 tok/s on dual RTX 3090s versus ~9 tok/s for the dense Qwen3-32B, while matching or beating it on AIME (80.4) and ArenaHard (91.0) ⁹. If you can fit the weights, sparse wins on both axes. And Mixture-of-Depths takes an orthogonal route — skipping whole blocks per token — that preserves capacity pruning permanently discards ¹⁰.

When this matters

The dense-conversion story is strongest in two narrow places: memory-bound edge hardware that can’t hold the full expert bank, and serving stacks that handle dense FFNs far better than sparse routing — a real problem, given reports of low GPU utilization and EAGLE3 throughput regressions on Qwen3-30B-A3B in SGLang/vLLM due to its 151k vocab and high activation rate ¹¹. There’s also a capability tax to watch: Nathan Lambert noted that GPT-OSS-20B’s routing structure transfers, but tool-calling precision tends to degrade when collapsed into a dense student ¹². The +6.3 pp average is real; whether the converted model keeps the agentic behaviors users actually deploy MoEs for is the next benchmark this line of work needs to report.

Muon beats Adam via 1.76× lower curvature, not bigger steps

TL;DR

Muon’s 2× speedup over Adam comes from lower curvature along its update direction, not larger steps.
Adam’s Normalized Directional Sharpness averages 1.76× Muon’s on a 124M NanoGPT, climbing to 2.38× under Zipf-heavy data.
Orthogonalization may make Muon memorize noise faster by learning signal and spurious features at the same rate.
Vanilla Muon blows up at trillion-parameter scale without Moonshot’s QK-Clip patch on attention.

The mechanism: direction, not magnitude

The standard intuition for why Muon beats Adam — that spectral normalization lets it take bigger, better-aligned steps — is wrong, according to a new analysis from the authors of “Why Muon Outperforms Adam: A Curvature Perspective.” Decomposing each step’s loss decrease via second-order Taylor expansion, they show first-order gain (gradient–update alignment) is essentially identical between the two optimizers. So is the update magnitude: the ratio of squared Frobenius norms hovers near 1.0.

What differs is the curvature penalty — the second-order term $\frac{1}{2}\langle Z, \mathcal{H}[Z]\rangle$. To isolate direction from scale, the authors define Normalized Directional Sharpness:

$$S_F(W; Z) = \frac{\langle Z, \mathcal{H}[Z]\rangle}{|Z|_F^2}$$

On a 124M NanoGPT pretrained on FineWeb, Adam’s NDS averages 1.76× Muon’s across the run. Same step size, sharper landscape — Adam pays a bigger second-order tax for the same first-order progress. Orthogonalizing the momentum matrix (singular values → 1, via Newton–Schulz) systematically steers Muon into flatter directions.

Why data imbalance widens the gap

The most striking empirical result uses a synthetic Zipf-PCFG dataset where the authors dial token-frequency skew via a Zipf exponent $s$. At $s=0$ (uniform), Adam’s NDS is ~1.63× Muon’s. At $s=1$ (heavy-tailed, language-like), it climbs to ~2.38×. Real text is heavy-tailed, so this is where Muon’s advantage actually lives.

That result rhymes suspiciously well with a parallel theoretical account from Vasudeva et al., which argues Muon gives an “exponential speedup in learning low-frequency tail facts” because it treats the frequency spectrum uniformly ¹³. NDS describes the geometric how; associative-memory theory describes the functional where. They’re probably two views of the same phenomenon.

A second decomposition is worth flagging: Muon’s NDS starts with only 14% coming from within-layer Hessian blocks early in training and shifts to 44% by the end. Adam stays flat near 30%. Muon’s late-training edge is increasingly about navigating local per-layer geometry.

The unresolved questions

The curvature story is clean but partial. Dragutinović & Ranganath (NYU, 2026) argue Muon’s uniform-singular-value update “forces the model to learn all features, both signal and noise, at the same rate,” making it “significantly more prone to fitting spurious correlations and memorizing noise early in training” ¹⁴. The curvature paper compares one-step loss decrease at matched validation loss — it does not measure whether Muon’s solutions generalize equivalently. Independent scaling work adds that “in shallow networks (depth < 2), Muon often fails to outperform Adam” ¹⁵.

Production deployment adds a sharper caveat. At 124M parameters, vanilla Muon is fine. At Kimi K2 scale, it isn’t: Moonshot had to bolt on QK-Clip to “prevent exploding attention scores” and get a spike-free loss curve ¹⁶. That instability mode has no obvious explanation in pure NDS terms — the second-order Taylor frame doesn’t see it. Muon also remains a 2D-only optimizer; biases, norms, and embeddings still need AdamW alongside it ¹⁷.

The good news for adopters: orthogonalization itself is cheap. NVIDIA’s Megatron integration clocks Newton–Schulz at 0.5–0.7% of the forward–backward pass ¹⁸, so the step-count win translates almost directly to wall-clock.

Round-ups

Hacker-fixer loop hardens agent benchmark verifiers against reward hacking

Agent benchmarks like KernelBench and Terminal-Bench hide widespread verifier exploits that inflate scores. An automated loop pits LLM attackers against fixer agents to rewrite outcome verifiers, cutting attack success rates while preserving legitimate task performance on the original tasks.

RAG rewriter gains trace almost entirely to gold answer leakage

Controlled edits to rewritten contexts show that inserting the gold answer span lifts reader F1 sharply, while removing it collapses performance — meaning rewriter benefits come from answer presence, not better phrasing. Sentinel and MASK probes prove fragile under the same interventions.

Cosine alignment misleads latent visual reasoning research

In vision-language models with supervised latent tokens, cosine similarity to visual targets correlates negatively with accuracy, and linear probes plus corruption tests show answers are decoded downstream rather than stored in the latents themselves. Auxiliary losses reshape parameters, not representations.

Activation steering rides on angle, not hidden-state norm

Decomposing steering vectors into angular and radial parts across several language models shows concepts live in direction, not magnitude. Norm contributes nothing semantic but remains essential for stable, effective interventions, reframing linear steering as a spherical operation with a scalar gain.

Text-to-image models ignore most of their text encoder

Diffusion transformers lean on word merging and positional order rather than rich contextual embeddings, behaving closer to a bag of position-tagged words. Stripping contextual information from text encoders barely dents visual quality or text fidelity, questioning the value of heavier language backbones.

Latent Context LLMs compress long inputs past KV cache limits

Encoder-decoder compressors, scaled through architecture search and pretraining, fold long inputs into latent embeddings that beat KV cache on memory and accuracy. The resulting Latent Context Language Models support adaptive expansion, targeting long-horizon agent workloads where cache size dominates cost.

Rectified Flows leak training data along the interpolation path

Membership inference attacks on Rectified Flows exploit a bell-shaped reconstruction gap that accumulates during training, exposing whether specific samples were seen. The signal violates Gaussian assumptions baked into prior defenses, giving attackers a sharper probe than diffusion-era baselines offered.

VentureBeat — https://venturebeat.com/technology/surprise-upset-gpt-5-5-beats-claude-fable-5-on-brutal-new-agents-last-exam-benchmark

GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark… frontier configurations average a mere 2.6% pass rate on the Last-Exam tier, with some configurations scoring 0%.

↩ ↩²
Snorkel AI leaderboard mirror — https://snorkel.ai/leaderboard/agents-last-exam/

Fable 5 displayed the highest volume of ‘cheating’ yet recorded, largely due to memorizing upstream fixes from its training data; homemade harnesses like ALE-Claw often outperformed specialized corporate ones for half the cost.

↩ ↩² ↩³
Berkeley RDI blog — ‘Harness Matters’ — https://agents-last-exam.org/blogs/harness-matters

ALE-Claw, derived from OpenClaw, intentionally strips product-facing features… it achieves performance comparable to commercial harnesses while using 44% fewer tokens and costing roughly 40% less.

↩
Digg — r/BetterOffline coverage of ALE task quality — https://digg.com/tech/l30l4syt

Critics mock the ‘Mini Encabulator’ KiCad task — the environment ships KiCad 9 while the schematic requires KiCad 10, and an agent can earn full points simply for generating a file with four mounting holes even if the electrical routing is unusable.

↩
Epoch AI — ‘What do economic-value benchmarks tell us?’ — https://epoch.ai/publications/what-do-economic-value-benchmarks-tell-us

Because GDPval tasks lack a ‘messy environment’ to navigate, they may not fully test an agent’s long-horizon planning compared to the sandbox-heavy ALE; GDPval’s pairwise human judging is also more susceptible to saturation as models improve at mimicking professional writing styles.

↩
EvoAI Labs — ‘The Hard Horizon’ — https://evoailabs.medium.com/the-hard-horizon-why-frontier-ai-agents-are-stalling-on-real-world-workflows-d29984014e6d

‘Success Hallucination’ — agents prematurely declaring ‘Done. All checks pass’ without verifying output — compounds GUI-bypass failures; 75% of ALE failures trace to ‘Understanding and Approach’ rather than execution, with agents defaulting to ad-hoc Python scripts instead of the specialized professional software the task requires.

↩
AAAI proceedings (D²-MoE / NAEE comparison) — https://ojs.aaai.org/index.php/AAAI/article/download/39464/43425

On Mixtral-8x7B at 20% compression, D²-MoE maintains ~95% of original MMLU performance (0.60) whereas NAEE and MoE-I² drop to 0.58 and 0.57; NAEE’s brute-force expert search struggles to scale to fine-grained architectures like DeepSeek or OLMoE.

↩
U. Arizona / CD-MoE publication page — https://experts.arizona.edu/en/publications/condense-dont-just-prune-enhancing-efficiency-and-performance-in-/

Condense-MoE… reduced memory usage by 27.5% and increased inference speeds by 26%, recovering 98% of original performance after just 5 hours of lightweight fine-tuning on a single A100 GPU on DeepSeek-V2-Lite.

↩
DataCamp — Qwen3 benchmark writeup — https://www.datacamp.com/blog/qwen3

Qwen3-30B-A3B reaches ~64 tokens/sec on dual RTX 3090s versus ~9 tokens/sec for Qwen3-32B dense, while scoring 80.4 on AIME and 91.0 on ArenaHard — slightly above the QwQ-32B dense baseline.

↩
Medium — Mixture-of-Depths overview — https://medium.com/@techsachin/mixture-of-depths-dynamic-compute-allocation-for-language-models-a1ef9bd3d75d

Mixture-of-Depths routes per-token whether to engage a full transformer block or skip via residual, reducing FLOPs while maintaining capacity — an orthogonal axis to expert-count pruning.

↩
SGLang SpecForge issue #339 — https://github.com/sgl-project/SpecForge/issues/339

Users report low GPU utilization and throughput degradation with EAGLE3 speculative decoding on Qwen3-30B-A3B; Qwen’s 151k vocab and high expert activation rates create serving bottlenecks in SGLang/vLLM.

↩
Interconnects (Nathan Lambert) on GPT-OSS — https://www.interconnects.ai/p/gpt-oss-openai-validates-the-open

GPT-OSS-20B (21B total, 3.6B active) demonstrates how OpenAI validated the open MoE design point; its routing structure can be transferred but tool-calling precision tends to degrade when collapsed into a dense student.

↩
Vasudeva et al., arXiv 2603.00742 (associative memory theory) — https://arxiv.org/html/2603.00742v1

Muon treats the frequency spectrum of data more uniformly, providing an exponential speedup in learning low-frequency ‘tail facts’ that are traditionally difficult for associative memories to retain.

↩
wispaper.ai — summary of Dragutinović & Ranganath (NYU, 2026) — https://www.wispaper.ai/en/user-blog/simplicity-bias-in-optimizers-matters-20260304/eng

Muon fundamentally disrupts this hierarchy. By orthogonalizing the gradient update—effectively setting all singular values to one—Muon forces the model to learn all features, both signal and noise, at the same rate… it is significantly more prone to fitting spurious correlations and memorizing noise early in training.

↩
Hugging Face blog (KingNish) — ‘Optimizer Part 1’ — https://huggingface.co/blog/KingNish/optimizer-part1

In shallow networks (depth < 2), Muon often fails to outperform Adam, suggesting its advantages only emerge in the complex ‘river-valley’ landscapes of deep architectures.

↩
Fireworks.ai blog on MuonClip — https://fireworks.ai/blog/muonclip

Moonshot solved this by integrating ‘QK-Clip,’ a technique that rescales Query and Key weight matrices to prevent exploding attention scores… resulted in a perfectly smooth loss curve with zero spikes across the entire training run.

↩
Medium — Jen Wei, ‘Going Beyond AdamW: A Practical Guide’ — https://medium.com/@jenwei0312/going-beyond-adamw-a-practical-guide-to-the-muon-optimizer-93d90e91dbd3

Muon is fundamentally a ‘2D optimizer’… cannot be applied to 1D tensors such as biases, layer normalization gains, or embedding layers. This necessitates a hybrid training setup—typically ‘Muon + AdamW’.

↩
PyTorch blog — Using Muon with DeepSpeed — https://pytorch.org/blog/using-muon-optimizer-with-deepspeed/

NVIDIA reporting that the computational cost of the orthogonalization step is a negligible 0.5% to 0.7% of the total forward-backward pass.

↩

GPT-5.5 tops ALE 26%, Krafton folds MoE to dense, Muon wins on curvature

TL;DR

GPT-5.5 leads Berkeley’s ALE agent benchmark at 26%

TL;DR

A benchmark designed to be unsaturable

The leaderboard

Harness design is doing more work than you think

Where the verifiers crack

What’s actually being measured

Krafton distills MoE into dense, beats pruning by 6.3pp

TL;DR

What the framework actually does

How big is the win, really

When this matters

Muon beats Adam via 1.76× lower curvature, not bigger steps

TL;DR

The mechanism: direction, not magnitude

Why data imbalance widens the gap

The unresolved questions

Round-ups

Hacker-fixer loop hardens agent benchmark verifiers against reward hacking

RAG rewriter gains trace almost entirely to gold answer leakage

Cosine alignment misleads latent visual reasoning research

Activation steering rides on angle, not hidden-state norm

Text-to-image models ignore most of their text encoder

Latent Context LLMs compress long inputs past KV cache limits

Rectified Flows leak training data along the interpolation path

Footnotes

Jack Sun, writing.