SkillOpt sweeps 52-for-52, ByteDance's Shannon law, Pion rescues Muon RLVR
Three unrelated method papers today: SkillOpt's 52-for-52 sweep, ByteDance's Shannon-channel scaling law, and Pion's high-pass replacement for Muon.
SkillOpt sweeps 52-for-52, ByteDance’s Shannon law, Pion rescues Muon RLVR
TL;DR
- SkillOpt wins 52/52 model×benchmark×harness combinations, lifting GPT-5.5 by 23.5 points on average.
- SkillLens companion paper finds model-generated skills cause negative transfer in 25% of deployments.
- ByteDance Seed recasts LLM training as a Shannon channel, hitting R²=0.847 on 12B extrapolation.
- Pion swaps Muon’s whitening for a high-pass filter anchoring top singular values.
- Muon collapses to near-zero accuracy across all 8 RLVR runs the Pion authors tried.
Today’s three feature papers are method drops rather than model releases: a skill-library curriculum (SkillOpt), a reframed scaling law from ByteDance Seed, and an RLVR optimizer (Pion) built to replace Muon. None ship a new architecture; each proposes a different layer of the training-and-deployment stack to fix.
Each also lands entangled with same-week work that shapes how to read the headline. SkillOpt’s 52-for-52 sweep arrives with a companion paper, SkillLens, that measures a 25% backfire rate on the same kind of skills. ByteDance’s R²=0.847 extrapolation arrives with critics counting nine free constants and no public reproduction. Pion arrives the same week as a separately named Pion doing something different to the weight spectrum. The round-ups skew toward reasoning scaffolds and multimodal training pipelines.
SkillOpt goes 52-for-52; companion paper finds 25% backfire
Source: hf-daily-papers · published 2026-05-21
TL;DR
- SkillOpt wins or ties on all 52 model×benchmark×harness combinations, lifting GPT-5.5 by +23.5 points on average and +38.9 on SpreadsheetBench.
- Same-day companion paper SkillLens reports model-generated skills cause negative transfer in ~25% of deployments — 13% on SpreadsheetBench, 47% on ALFWorld.
- A “Strength Paradox”: Gemini-3.1-Flash beat GPT-5.4 as a skill extractor, so extraction is a separate capability from task-solving.
- Practitioner reruns find gains heavily domain-skewed: ~50pt in healthcare vs. ~4.5pt in software engineering, where frontier priors are already strong.
The headline result, and the catch underneath it
Microsoft dropped two papers on the same day about the same idea: treat an LLM agent’s natural-language skill document as a trainable artifact, then optimize it like you’d optimize weights. The flashy one, SkillOpt, posts the kind of clean-sweep numbers that get screenshotted — best or tied-best in all 52 evaluated configurations, +23.5 points average on GPT-5.5 across six benchmarks, +49.4 points on DocVQA for the nano-scale variant, and transferred skills that gain +59.7 points moving from Codex to Claude Code.
The quieter one, SkillLens, is the dissent. Run by overlapping authors, it audits the broader skill lifecycle and finds that model-generated skills hurt the target model in roughly a quarter of cases, with failure rates climbing from 13% on SpreadsheetBench to 47% on ALFWorld 1. Read the two together and SkillOpt’s most-emphasized engineering choices — the held-out validation gate, the rejected-edit buffer, the textual learning-rate cap — stop looking like polish and start looking like load-bearing safety rails compensating for a fragile substrate.
The Strength Paradox reframes who should be the optimizer
SkillLens’s other contribution is a result that nobody in the optimizer-as-training analogy wants to hear: a model’s task-solving ability does not predict its ability to extract useful skills for others. Gemini-3.1-Flash outperformed the much larger GPT-5.4 as an extractor in their tests, suggesting skill extraction is “a distinct, non-linear capability” 2. SkillOpt cleanly separates target model from optimizer model in its loop architecture, which is the right design — but the field’s instinct to reach for the strongest available model as the optimizer is exactly wrong.
Where it sits in the optimizer field
The natural comparison is GEPA, the reflective/genetic prompt optimizer that has been the recent reference point. The head-to-head is mixed, not a rout: GEPA matches RL with up to 35× fewer rollouts and dominates in low-resource settings, while SkillOpt’s claimed edge shows up mainly when the optimizer itself is sub-frontier — GEPA’s reflection quality reportedly collapses there, while SkillOpt holds a +10.4-point self-optimization gain 3. Meanwhile, Anthropic’s SKILL.md format has already been adopted by GitHub Copilot, OpenAI Codex, and Gemini CLI as the de facto interchange standard 4, so SkillOpt’s best_skill.md artifact slots into an ecosystem that already has the container. The contribution is the training loop that fills it.
What practitioners are actually seeing
Two reproduction notes deserve weight. First, prior text-optimization work warns that prompts tuned against one model behave like control signals in that model’s embedding space, not portable logic, and that multi-agent failures usually live in handoffs rather than node prompts 5 — both undercut SkillOpt’s transfer story. Second, a hands-on writeup finds the optimizer is far more conservative than the framing suggests, often accepting only 1–4 edits per run, with gains heavily concentrated in domains where frontier models lack priors: ~50 points in healthcare, ~4.5 points in software engineering 6.
The honest summary of this two-part release: SkillOpt is a well-engineered optimizer riding on a substrate whose own authors document a 25% backfire rate. Read SkillLens first.
Further reading
- From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills — hf-daily-papers
Pion swaps Muon’s whitening for a high-pass filter
Source: hf-daily-papers · published 2026-05-18
TL;DR
- Muon collapses to near-zero accuracy across all eight RLVR runs the authors tried (GRPO/GMPO × Qwen3-1.7B/4B × MATH/GSM8K).
- Pion swaps Muon’s uniform whitening for a high-pass filter that anchors top singular values and crushes the tail.
- On LIBERO Object, Pion hits 100% success in 1,500 steps vs. Muon’s 97% and AdamW’s 32%.
- A separately named “Pion” dropped the same week freezes the weight spectrum instead of filtering gradients.
Why Muon breaks outside pretraining
Muon’s pitch in LLM pretraining was simple: orthogonalize the momentum matrix via Newton–Schulz so every singular direction gets a unit-scale update. That uniform whitening is exactly what burns it in two regimes the original paper never targeted. Before going further, a naming-hygiene note: a different optimizer also called Pion posted the same week updates weights via coupled orthogonal transforms to keep the singular-value spectrum fixed 7 — opposite mechanism, same name, different problem. This article is about the OPTML-Group gradient-side Pion.
In RLVR, policy gradients have terrible signal-to-noise. Whitening the spectrum lifts noise to the same magnitude as signal, and the policy “hollows out.” This isn’t just the Pion authors’ framing — a Hugging Face practitioner series running Muon on GRPO independently diagnosed the same pathology, noting Muon lacks “a variance-based brake” for low-SNR gradients 8. AdaMuon’s response was to bolt an Adam-style second moment onto the orthogonalized step 9; Pion takes a different route.
In VLA training, action-head gradients are low-rank. Whitening over-amplifies the noisy tail near the zero singular values, producing jittery trajectories.
What Pion actually does
Pion replaces Muon’s single NS polynomial with a two-stage chain. The promotion polynomial (coefficients 1.875, −1.25, 0.375) monotonically lifts every singular value into a regime where the next stage can sort them. The suppression polynomial (0, 2.5, −1.5) — note the zero linear term — anchors values near 1 and contracts small ones toward 0. Five total steps, split typically 2+3.
flowchart LR
G[Momentum matrix] --> P[Promotion NS<br/>lift all σ]
P --> S[Suppression NS<br/>high-pass: σ≈1 or σ→0]
S --> U[Orthogonal-ish update]
The other piece worth stealing independent of the high-pass story is per-head mode: Q/K/V/O are reshaped along the head dimension and run through NS independently per head, with GQA handled by scaling Q/K/V groups consistently 10. Specialized heads inherited from pretraining want heterogeneous update scales, and treating the whole projection as one block flattens that.
A “low-pass Muon” ablation that suppresses leading singular values failed to train at all, which is the cleanest result in the paper — it confirms the direction of the filter, not just that filtering helps.
The asterisks
Two caveats deserve more weight than the paper gives them.
First, LIBERO is a soft benchmark. LIBERO-PRO showed that models scoring >90% on standard LIBERO can drop to ~0% under modest lighting and camera perturbations, indicating heavy memorization 11. Pion’s 75.93% on the harder LIBERO-Plus is a more honest number than the 100% Object headline. The real-robot Franka result (85.6% vs. Muon’s 38.9% and AdamW’s 31.1% across three grasp-and-place tasks) is more convincing than either simulation number, but it’s also 200 teleop trajectories on three tasks — a thin slice.
Second, the RLVR wins are all on Qwen3, the one model family where “spurious rewards” work — Qwen3 improves on MATH-500 even when trained with random or deliberately incorrect rewards, an effect absent in Llama and OLMo 12. That doesn’t invalidate Pion, but it weakens the claim that Pion is learning more than Muon. It may just be failing to break a latent Qwen capability that Muon’s noise amplification disrupts.
Takeaway
Muon’s RL collapse is real and reproduced outside this paper. Pion is a credible fix, but it’s one of three (AdaMuon, DynMuon, this) racing to patch the same hole, and its strongest numbers lean on benchmarks and a model family both known to inflate gains. The per-head NS iteration is the part most likely to survive regardless of which optimizer wins.
ByteDance’s Shannon law extrapolates 12B Pythia from 6.9B fits
Source: hf-daily-papers · published 2026-05-21
TL;DR
- ByteDance Seed recasts LLM training as a Shannon-Hartley noisy channel, with parameters as bandwidth and tokens as signal power.
- R²=0.847 extrapolating a 12B Pythia run to 307B tokens from ≤6.9B fits, vs. Chinchilla’s 0.305 and OpenAI’s −0.082.
- R²=0.9602 under 2-bit GPTQ on Pythia, where Chinchilla manages only 0.7312.
- Critics flag 9 free constants as over-parameterized, with some variances exceeding 50%.
- No independent reproduction or public code release exists despite the ByteDance Seed affiliation.
The reframing
Xu Ouyang and collaborators (ByteDance Seed, UVA, Berkeley) propose that an LLM is a Shannon-Hartley channel. Parameters $N$ are bandwidth, training tokens $D$ are signal power, and three distinct noise terms — data noise ($dD^\delta$), model-interaction noise ($c(DN)^\gamma$), and an irreducible entropy floor $e$ — sit in the denominator. Test loss is the reciprocal of the resulting capacity. The payoff is that one equation now spans pretraining, additive Gaussian noise, post-training quantization, and SFT loss basins — regimes where the OpenAI and Chinchilla laws are strictly monotonic and therefore structurally incapable of fitting non-monotonic curves 13.
The image an independent breakdown reaches for is sharper than the math:
Larger models act like sensitive antennas — if not properly tuned, they simply amplify the static of their training data. 13
That’s the mechanism behind “backfiring scaling”: when the model-noise exponent $\gamma$ exceeds the bandwidth exponent $\alpha$, adding parameters hurts. The paper also finds the data-noise exponent $\delta$ consistently exceeds the signal exponent $\beta$, which would make catastrophic overtraining — already documented as a 2–3% post-SFT regression on OLMo-1B trained from 2.3T to 3T tokens 14 — an intrinsic, not incidental, property of token scaling.
Where it beats Chinchilla
The headline numbers are extrapolation and perturbation robustness. Fits are done on small models and short runs; the law is then asked to predict regimes it never saw.
| Regime | Shannon R² | Chinchilla R² | OpenAI R² |
|---|---|---|---|
| 12B Pythia, 307B tokens (fit on ≤6.9B / 180B) | 0.847 | 0.305 | −0.082 |
| 2-bit GPTQ on Pythia | 0.9602 | 0.7312 | — |
| Gaussian noise @ 10 dB SNR (Pythia) | 0.9555 | — | 0.0707 |
| GSM8K SFT (avg across LRs) | 0.936 | — | negative |
These aren’t marginal wins. On SFT with high learning rates, the OpenAI law’s average R² goes negative because it cannot represent the “loss basin” shape at all. Precision-aware scaling work from Kumar et al. had already shown low-bit training reduces effective parameter count predictably 15; SSL’s claim is that the same denominator term handles it without a separate equation.
Why some are skeptical
The strongest pushback is structural, not empirical. Critics note that nine free constants leave the full law over-parameterized: “multiple different parameter combinations can fit the same training data equally well… some parameters are jointly poorly constrained, with variances exceeding 50%” 16. That matters precisely because the extrapolation result depends on those constants being stable in regimes the fit never touched. A competing first-principles story — Bahri et al.’s derivation of exponents from the intrinsic dimension of the data manifold 17 — gets no engagement in the paper.
No independent re-fit on held-out checkpoints exists; the strongest external coverage paraphrases the paper rather than re-running it 13, and despite the ByteDance Seed affiliation there is no public code drop. The Gaussian-noise assumption baked into Shannon-Hartley also goes unstressed — real web-scale data noise is structured, not white.
What’s at stake
If SSL survives independent reproduction, the practical consequence is that the field stops treating overtraining, quantization-induced degradation, and SFT collapse as separate empirical curiosities and starts budgeting them against a single SNR. One Chinese-language summary calls this “a transition from brute-force compute allocation to a physics-based understanding of model limits” 18. That framing is doing work the nine constants haven’t yet earned — but the 0.847-vs-0.305 extrapolation gap is the kind of gap that’s hard to chalk up to flexibility alone.
Round-ups
Equilibrium Reasoners scale test-time compute via learned attractors
Source: hf-daily-papers
Equilibrium Reasoners treat inference as a latent dynamical system pulled toward task-conditioned attractors, letting models iterate at test time until convergence. The approach delivers large accuracy gains on Sudoku-Extreme and other reasoning tasks where stochastic trajectories settle into valid solutions.
Staged VLM training beats unified post-training on visual reasoning
Source: hf-daily-papers
Separating visual perception, visual reasoning, and textual reasoning into staged SFT and RL phases outperforms unified post-training for vision-language models. The curriculum approach lifts scores on visual math, RealWorldQA, and WeMath, suggesting capability decoupling matters more than joint optimization.
HINT-SD distills only failure-relevant actions for long-horizon agents
Source: hf-daily-papers
HINT-SD targets self-distillation at the specific actions inside a trajectory that caused failure, rather than imitating whole rollouts. This feedback-conditioned selection makes long-horizon LLM agent training more sample-efficient by focusing gradient updates on the decisions that actually matter.
ETCHR decouples visual reasoning from image generation in MLLMs
Source: hf-daily-papers
ETCHR adds a reasoning-aware image editor that handles visual manipulation separately from the language model’s chain of thought. Two-stage training with Reasoning Imitation then VLM-derived rewards lifts Pass@1 across visual reasoning benchmarks, beating Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.
CAD agents self-improve using finite element analysis as reward
Source: hf-daily-papers
Generative CAD systems learn from engineering validation by routing outputs through finite element analysis and geometric checks like Box-IoU on STEP files. The physics-grounded feedback loop aligns supervision with real design constraints rather than relying on pixel or token similarity alone.
StepAudio 2.5 unifies ASR, TTS and live speech in one model
Source: hf-daily-papers
StepAudio 2.5 collapses speech recognition, synthesis, and real-time spoken dialogue into a single audio-language model that rivals specialized systems. Task-tailored RLHF with verifiable multi-token decoding and generative reward modeling optimizes a shared representation across all three modes.
Zero-CoT Probe sniffs out hidden benchmark contamination in LLMs
Source: hf-daily-papers
A new black-box test, Zero-CoT Probe, truncates chain-of-thought to expose memorization in large language models. By comparing answers on original benchmarks against isomorphically perturbed copies, the method flags evasive contamination that standard evaluations miss, and ships with code on GitHub.
Footnotes
-
SkillLens project page (Microsoft, companion paper) — https://microsoft.github.io/SkillLens/
↩Negative transfer occurs in approximately 25% of cases, where adding a distilled skill actually lowers the success rate of the target model… while SpreadsheetBench shows failure rates as low as 13%, ALFWorld suffers from negative transfer in up to 47% of deployments.
-
SkillLens (Microsoft) — Strength Paradox finding — https://microsoft.github.io/SkillLens/
↩A model’s task-solving capability does not correlate with its ability to extract useful skills for others… lightweight models like Gemini-3.1-Flash outperformed the larger GPT-5.4 as extractors, suggesting that skill extraction is a distinct, non-linear capability.
-
Comet blog on GEPA vs SkillOpt — https://www.comet.com/site/blog/gepa-ai-optimization/
↩GEPA remains highly competitive in low-resource settings, matching traditional reinforcement learning performance with up to 35x fewer rollouts… while GEPA’s reflection quality reportedly collapses when using sub-frontier models, SkillOpt maintained stability and achieved a +10.4 point increase even in self-optimization modes.
-
Serenities AI — Agent Skills Guide 2026 — https://serenitiesai.com/articles/agent-skills-guide-2026
↩Over 16 major platforms, including Microsoft’s GitHub Copilot, OpenAI’s Codex, and Google’s Gemini CLI, have adopted [Anthropic’s SKILL.md] standard… MCP serves as the ‘USB-C for AI’ while Agent Skills act as the ‘playbooks’ teaching models how to use that data.
-
Medium — ‘It’s not about writing a better sentence’ (practitioner critique) — https://medium.com/jin-system-architect/its-not-about-writing-a-better-sentence-it-s-about-defining-a-better-optimization-problem-71177d41689d
↩Prompts optimized for a ‘teacher’ model like GPT-4o frequently fail to deliver proportional gains when deployed on smaller or different model families… failures in multi-agent systems are rarely caused by poor prompts at individual nodes, but rather by the degradation of information during ‘handoffs’ between agents.
-
Abi Varma (Medium) — practitioner reproduction notes — https://abivarma.medium.com/microsoft-skillopt-the-prompt-is-now-a-trainable-parameter-2283c8d45dd0
↩The actual ‘training’ often results in only 1–4 accepted edits, suggesting the optimizer is highly conservative… healthcare tasks see gains of over 50 points, [while] software engineering tasks often see marginal improvements of only ~4.5 points.
-
arXiv 2605.12492 — ‘Pion: A Spectrum-Preserving Optimizer’ — https://arxiv.org/abs/2605.12492
↩Pion updates weight matrices via coupled left and right orthogonal transformations, keeping the singular-value spectrum fixed throughout training.
-
Hugging Face blog — ‘Training RL with Muon’ (bird-of-paradise) — https://huggingface.co/blog/bird-of-paradise/training-rl-with-muon-3
↩Muon lacks a variance-based brake to handle the high-variance, low-SNR gradients typical of RL, and policies frequently degrade or ‘hollow out’ shortly after training begins.
-
arXiv 2507.11005 — AdaMuon — https://arxiv.org/html/2507.11005v1
↩AdaMuon integrates a per-parameter second-moment estimator into the Muon framework, providing the coordinate-wise adaptivity of Adam while maintaining orthogonal updates.
-
GitHub OPTML-Group/Pion (PerHeadPion implementation) — https://github.com/OPTML-Group/Pion
↩PerHeadPion reshapes Q/K/V/O tensors along the head dimension and runs the high-pass NS iteration independently per head, handling GQA by scaling Q, K, V heads consistently.
-
36Kr — LIBERO-PRO robustness study — https://eu.36kr.com/en/p/3456206492571014
↩Models trained to >90% success on standard LIBERO can collapse to near 0.0% on LIBERO-PRO under lighting and camera perturbations.
-
arXiv 2605.17109 — Spurious Rewards in Qwen RLVR — https://arxiv.org/pdf/2605.17109
↩Qwen3 models show substantial gains on MATH-500 even when trained with random or deliberately incorrect rewards, a phenomenon not observed in Llama or OLMo.
-
ArxivIQ Substack — independent paper breakdown — https://arxiviq.substack.com/p/llms-as-noisy-channels-a-shannon
↩ ↩2 ↩3Larger models act like sensitive antennas — if not properly tuned, they simply amplify the static of their training data… the law effectively shatters the fundamental math the industry relied on.
-
ZeroEntropy — ‘Scaling Laws’ explainer — https://zeroentropy.dev/concepts/scaling-laws/
↩A version of OLMo-1B trained on 3T tokens performed roughly 2–3% worse on standard benchmarks after instruction tuning than a checkpoint trained on only 2.3T tokens — catastrophic overtraining is attributed to a progressive sensitivity in parameters.
-
gopubby — illustrated guide to LLM quantization (covers Kumar et al. precision-aware laws) — https://ai.gopubby.com/from-32-bits-to-1-58-the-illustrated-guide-to-llm-quantization-fd25b6c782fd
↩Training in lower precision effectively reduces a model’s effective parameter count, creating a predictable loss in intelligence that standard scaling laws ignore.
-
ResearchGate discussion of the Shannon Scaling Law — https://www.researchgate.net/publication/405221224_LLMs_as_Noisy_Channels_A_Shannon_Perspective_on_Model_Capacity_and_Scaling_Laws
↩Critics argue that having nine free parameters leads to an over-parameterized system where multiple different parameter combinations can fit the same training data equally well; some parameters are jointly poorly constrained, with variances exceeding 50%.
-
Google Research — ‘Explaining Neural Scaling Laws’ (Bahri et al.) — https://research.google/pubs/explaining-neural-scaling-laws/
↩Scaling exponents are linked to the intrinsic dimension of the data manifold; performance improves as the model resolves the geometry of this low-dimensional manifold more accurately.
-
Toutiao — Chinese-language coverage — https://www.toutiao.com/article/7592780485013537321/
↩Marks a transition from brute-force compute allocation to a physics-based understanding of model limits, potentially ending the era of naive parameter scaling.