Sources

SkillOpt: Executive Strategy for Self-Evolving Agent Skills huggingface.co

SkillOpt is a systematic text-space optimizer that trains agent skills as external agent state, with stable updates and zero inference overhead at deployment, achieving superior performance across multiple benchmarks and execution environments.

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills huggingface.co

Language agents benefit from reusable skills that encode domain-specific procedures, but their effectiveness varies significantly across different extraction and consumption scenarios, requiring careful evaluation and meta-skill guidance to optimize performance.

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws huggingface.co

The Shannon Scaling Law models LLM training as information transmission over a noisy channel, explaining non-monotonic performance phenomena through signal-to-noise ratio interactions and demonstrating superior predictive accuracy over traditional scaling laws.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR huggingface.co

Pion replaces the spectral-whitening approach that Muon uses in LLM pretraining, applying a high-pass Newton-Schulz (NS) iteration to stabilize training in low-rank and low-SNR regimes while maintaining computational efficiency and supporting per-head updates.

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation huggingface.co

A new black-box test, Zero-CoT Probe, truncates chain-of-thought to expose memorization in large language models. By comparing answers on original benchmarks against isomorphically perturbed copies, the method flags evasive contamination that standard evaluations miss, and ships with code on GitHub.

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models huggingface.co

Separating visual perception, visual reasoning, and textual reasoning into staged SFT and RL phases outperforms unified post-training for vision-language models. The curriculum approach lifts scores on visual math, RealWorldQA, and WeMath, suggesting capability decoupling matters more than joint optimization.

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning huggingface.co

Equilibrium Reasoners treat inference as a latent dynamical system pulled toward task-conditioned attractors, letting models iterate at test time until convergence. The approach delivers large accuracy gains on Sudoku-Extreme and other reasoning tasks where stochastic trajectories settle into valid solutions.

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents huggingface.co

HINT-SD targets self-distillation at the specific actions inside a trajectory that caused failure, rather than imitating whole rollouts. This feedback-conditioned selection makes long-horizon LLM agent training more sample-efficient by focusing gradient updates on the decisions that actually matter.

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback huggingface.co

Generative CAD systems learn from engineering validation by routing outputs through finite element analysis and geometric checks like Box-IoU on STEP files. The physics-grounded feedback loop aligns supervision with real design constraints rather than relying on pixel or token similarity alone.

StepAudio 2.5 Technical Report huggingface.co

StepAudio 2.5 collapses speech recognition, synthesis, and real-time spoken dialogue into a single audio-language model that rivals specialized systems. Task-tailored RLHF with verifiable multi-token decoding and generative reward modeling optimizes a shared representation across all three modes.

ETCHR: Editing To Clarify and Harness Reasoning huggingface.co

ETCHR adds a reasoning-aware image editor that handles visual manipulation separately from the language model’s chain of thought. Two-stage training with Reasoning Imitation then VLM-derived rewards lifts Pass@1 across visual reasoning benchmarks, beating Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs huggingface.co

Mix-MoE, a mixed Mixture-of-Experts framework, addresses parameter interference in multilingual machine translation by separating language modeling and translation expertise across specialized expert groups with Fourier-transform-enhanced routing.

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research huggingface.co

SciAtlas presents a large-scale, multi-disciplinary knowledge graph that enables structured topological reasoning for academic research by integrating millions of papers and entities to support automated scientific discovery.

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm huggingface.co

Vision-Language Models often fail to faithfully synthesize multimodal data due to reliance on language priors over visual representation, necessitating new evaluation frameworks that prioritize semantic sufficiency over traditional multimodal gain metrics.

LatentUMM: Dual Latent Alignment for Unified Multimodal Models huggingface.co

LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding processes.

Good Token Hunting: A Hitchhiker’s Guide to Token Selection for Visual Geometry Transformers huggingface.co

Visual geometry transformers are accelerated through a two-stage token selection framework that reduces computational costs while maintaining performance.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models huggingface.co

Lens is a compact 3.8B-parameter text-to-image model achieving superior performance with reduced training compute through dense caption datasets, multi-resolution batching, efficient architecture, and optimization techniques.

Rethinking Cross-Layer Information Routing in Diffusion Transformers huggingface.co

Diffusion Transformers suffer from inefficient cross-layer information flow that traditional residual connections cannot address, prompting the introduction of a learnable, timestep-adaptive routing mechanism that improves training efficiency and model quality.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion huggingface.co

PiD introduces a pixel diffusion decoder that reformulates latent decoding as conditional pixel diffusion, enabling fast and high-quality image synthesis at high resolutions with reduced computational requirements.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution huggingface.co

Discrete autoregressive text-to-image models suffer from latent covariate shift during policy optimization, which RankE addresses through end-to-end co-evolution of policy and decoder components.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding huggingface.co

SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction huggingface.co

GenRecon integrates generative 3D priors with multi-view image conditioning to produce high-fidelity, editable mesh reconstructions of indoor environments.

Geo-Align: Video Generation Alignment via Metric Geometry Reward huggingface.co

Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction.

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation huggingface.co

VaaWIT is an end-to-end framework that enhances Large Vision-Language Models for multilingual Web image translation by incorporating fine-grained visual perception through dual-stream attention and visual-aware adapters.

PhotoFlow: Agentic 3D Virtual Photography Missions huggingface.co

PhotoFlow, a Director-Reviewer-Reflector agent, enables language-conditioned virtual photography by combining 3D spatial understanding with aesthetic judgment in arbitrary Blender scenes.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis huggingface.co

VGenST-Bench presents a video benchmark using generative models for active synthesis of controlled spatio-temporal reasoning scenarios with human quality control.

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models huggingface.co

SCOPE enables precise action response in FPS games by conditioning transformer blocks in video diffusion models to separate in-scope from out-of-scope visual effects without segmentation labels.

Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction huggingface.co

Discrete autoregressive MRI reconstruction using privileged information distillation achieves superior performance under extreme undersampling by leveraging visual autoregressive modeling techniques.

References

SkillLens project page (Microsoft, companion paper) microsoft.github.io

Negative transfer occurs in approximately 25% of cases, where adding a distilled skill actually lowers the success rate of the target model… while SpreadsheetBench shows failure rates as low as 13%, ALFWorld suffers from negative transfer in up to 47% of deployments.

Comet blog on GEPA vs SkillOpt comet.com

GEPA remains highly competitive in low-resource settings, matching traditional reinforcement learning performance with up to 35x fewer rollouts… while GEPA’s reflection quality reportedly collapses when using sub-frontier models, SkillOpt maintained stability and achieved a +10.4 point increase even in self-optimization modes.

Serenities AI — Agent Skills Guide 2026 serenitiesai.com

Over 16 major platforms, including Microsoft’s GitHub Copilot, OpenAI’s Codex, and Google’s Gemini CLI, have adopted [Anthropic’s SKILL.md] standard… MCP serves as the ‘USB-C for AI’ while Agent Skills act as the ‘playbooks’ teaching models how to use that data.

Medium — ‘It’s not about writing a better sentence’ (practitioner critique) medium.com

Prompts optimized for a ‘teacher’ model like GPT-4o frequently fail to deliver proportional gains when deployed on smaller or different model families… failures in multi-agent systems are rarely caused by poor prompts at individual nodes, but rather by the degradation of information during ‘handoffs’ between agents.

Abi Varma (Medium) — practitioner reproduction notes abivarma.medium.com

The actual ‘training’ often results in only 1–4 accepted edits, suggesting the optimizer is highly conservative… healthcare tasks see gains of over 50 points, [while] software engineering tasks often see marginal improvements of only ~4.5 points.

arXiv 2605.12492 — ‘Pion: A Spectrum-Preserving Optimizer’ arxiv.org

Pion updates weight matrices via coupled left and right orthogonal transformations, keeping the singular-value spectrum fixed throughout training.

Hugging Face blog — ‘Training RL with Muon’ (bird-of-paradise) huggingface.co

Muon lacks a variance-based brake to handle the high-variance, low-SNR gradients typical of RL, and policies frequently degrade or ‘hollow out’ shortly after training begins.

arXiv 2507.11005 — AdaMuon arxiv.org

AdaMuon integrates a per-parameter second-moment estimator into the Muon framework, providing the coordinate-wise adaptivity of Adam while maintaining orthogonal updates.

36Kr — LIBERO-PRO robustness study eu.36kr.com

Models trained to >90% success on standard LIBERO can collapse to near 0.0% on LIBERO-PRO under lighting and camera perturbations.

arXiv 2605.17109 — Spurious Rewards in Qwen RLVR arxiv.org

Qwen3 models show substantial gains on MATH-500 even when trained with random or deliberately incorrect rewards, a phenomenon not observed in Llama or OLMo.

GitHub OPTML-Group/Pion (PerHeadPion implementation) github.com

PerHeadPion reshapes Q/K/V/O tensors along the head dimension and runs the high-pass NS iteration independently per head, handling GQA by scaling Q, K, V heads consistently.

ArxivIQ Substack — independent paper breakdown arxiviq.substack.com

Larger models act like sensitive antennas — if not properly tuned, they simply amplify the static of their training data… the law effectively shatters the fundamental math the industry relied on.

ZeroEntropy — ‘Scaling Laws’ explainer zeroentropy.dev

A version of OLMo-1B trained on 3T tokens performed roughly 2–3% worse on standard benchmarks after instruction tuning than a checkpoint trained on only 2.3T tokens — catastrophic overtraining is attributed to a progressive sensitivity in parameters.

gopubby — illustrated guide to LLM quantization (covers Kumar et al. precision-aware laws) ai.gopubby.com

Training in lower precision effectively reduces a model’s effective parameter count, creating a predictable loss in intelligence that standard scaling laws ignore.

ResearchGate discussion of the Shannon Scaling Law researchgate.net

Critics argue that having nine free parameters leads to an over-parameterized system where multiple different parameter combinations can fit the same training data equally well; some parameters are jointly poorly constrained, with variances exceeding 50%.

Toutiao — Chinese-language coverage toutiao.com

Marks a transition from brute-force compute allocation to a physics-based understanding of model limits, potentially ending the era of naive parameter scaling.

Google Research — ‘Explaining Neural Scaling Laws’ (Bahri et al.) research.google

Scaling exponents are linked to the intrinsic dimension of the data manifold; performance improves as the model resolves the geometry of this low-dimensional manifold more accurately.

Jack Sun, writing.