GPT-5.5 tops ALE 26%, Krafton folds MoE to dense, Muon wins on curvature
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Agents’ Last Exam huggingface.co
Agents’ Last Exam (ALE) is a benchmark for evaluating AI agents on long-term, economically valuable real-world tasks across 13 industry clusters with 1K+ tasks, revealing significant gaps between benchmark performance and practical deployment.
Pruning and Distilling Mixture-of-Experts into Dense Language Models huggingface.co
A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, achieving superior performance and efficiency compared to traditional pruning methods.
Why Muon Outperforms Adam: A Curvature Perspective huggingface.co
Muon outperforms Adam in large language model training by reducing curvature penalties through lower normalized directional sharpness, particularly in middle and late training stages, with advantages amplified by data imbalance and heterogeneous curvature.
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops huggingface.co
Agent benchmarks like KernelBench and Terminal-Bench hide widespread verifier exploits that inflate scores. An automated loop pits LLM attackers against fixer agents to rewrite outcome verifiers, cutting attack success rates while preserving legitimate performance on the original tasks.
End-to-End Context Compression at Scale huggingface.co
Encoder-decoder compressors, scaled through architecture search and pretraining, fold long inputs into latent embeddings that beat KV cache on memory and accuracy. The resulting Latent Context Language Models support adaptive expansion, targeting long-horizon agent workloads where cache size dominates cost.
Text-to-Image Models Need Less from Text Encoders Than You Think huggingface.co
Diffusion transformers lean on word merging and positional order rather than rich contextual embeddings, behaving closer to a bag of position-tagged words. Stripping contextual information from text encoders barely dents visual quality or text fidelity, questioning the value of heavier language backbones.
Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents huggingface.co
In vision-language models with supervised latent tokens, cosine similarity to visual targets correlates negatively with accuracy, and linear probes plus corruption tests show answers are decoded downstream rather than stored in the latents themselves. Auxiliary losses reshape parameters, not representations.
Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path huggingface.co
Membership inference attacks on Rectified Flows exploit a bell-shaped reconstruction gap that accumulates during training, exposing whether specific samples were seen. The signal violates Gaussian assumptions baked into prior defenses, giving attackers a sharper probe than diffusion-era baselines offered.
Answer Presence Drives RAG Rewriting Gains huggingface.co
Controlled edits to rewritten contexts show that inserting the gold answer span lifts reader F1 sharply, while removing it collapses performance — meaning rewriter benefits come from answer presence, not better phrasing. Sentinel and MASK probes prove fragile under the same interventions.
A Geometric Account of Activation Steering through Angle-Norm Decomposition huggingface.co
Decomposing steering vectors into angular and radial parts across several language models shows concepts live in direction, not magnitude. Norm contributes nothing semantic but remains essential for stable, effective interventions, reframing linear steering as a spherical operation with a scalar gain.
On the Geometry of On-Policy Distillation huggingface.co
On-policy distillation exhibits distinct parameter space dynamics characterized by relaxed off-principal updates and subspace locking, forming a unique geometric pattern separate from supervised fine-tuning and reinforcement learning with verifiable rewards.
Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short huggingface.co
Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance.
Latent Spatial Memory for Video World Models huggingface.co
Latent spatial memory for video world models stores 3D scene information directly in diffusion latent space, eliminating pixel-space reconstruction overhead and achieving faster generation with reduced memory usage.
Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher huggingface.co
Trust functions enable effective weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text huggingface.co
Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches.
Liberating LLM Capabilities in Full-Duplex Speech Models huggingface.co
A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating superior performance in full-duplex conversational tasks.
Human Psychometric Questionnaires Mischaracterize LLM Behavior huggingface.co
Human psychometric questionnaires fail to reliably predict LLM behavior in real-world interactions, while generation-based profiling offers superior accuracy for understanding model responses to everyday user queries.
Trajectory-Refined Distillation huggingface.co
On-policy distillation suffers from prefix failure where dense token-level supervision creates fragmented gradients; trajectory-refined distillation addresses this by correcting student rollouts at the trajectory level before distillation.
SwiftVR: Real-Time One-Step Generative Video Restoration huggingface.co
SwiftVR enables real-time video restoration on consumer GPUs through efficient attention mechanisms and lightweight autoencoding, achieving high frame rates at 4K resolution.
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention huggingface.co
Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training.
Chiaroscuro Attention: Spending Compute in the Dark huggingface.co
CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.
Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data huggingface.co
The Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences.
SWE-Explore: Benchmarking How Coding Agents Explore Repositories huggingface.co
SWE-Explore introduces a benchmark for evaluating coding agents’ repository exploration capabilities by requiring ranked lists of relevant code regions within line budgets, demonstrating that agentic exploration outperforms traditional retrieval methods.
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search huggingface.co
A distributed Quality-Diversity search framework uses heterogeneous large language models as mutation operators to enhance evolutionary inference, demonstrating that model diversity improves performance over homogeneous parallel approaches.
Robotic Policy Adaptation via Weight-Space Meta-Learning huggingface.co
WIZARD is a weight-space meta-learning framework that generates task-specific LoRA parameters for frozen VLA policies using language instructions and demonstration videos, enabling efficient task adaptation without fine-tuning.
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing huggingface.co
AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to enable efficient long-horizon planning and real-time action execution in robotic manipulation tasks.
OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation huggingface.co
A simulation-data-driven framework for humanoid loco-manipulation uses 3D generative models to create realistic assets and trains hierarchical visuomotor policies on simulated data, achieving better zero-shot performance than real-robot training.
Phase Marginalization for Patch-Grid Instability in Vision Transformers huggingface.co
Phase Marginalization is a post-hoc method that addresses phase-dependent instability in Vision Transformers by evaluating structured patch-grid phases and aggregating outputs in the original image coordinate system.
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation huggingface.co
VoLoAgent enables physical orchestration by integrating vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks.
AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents huggingface.co
AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks.
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory huggingface.co
Large language models can be equipped with formal verification frameworks using dependent-type languages to improve multi-step workflow reliability and performance.
Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense huggingface.co
The SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches.
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding huggingface.co
Light-WAM is a lightweight world action model for robot manipulation that uses a compact video backbone and downsampled latent space for efficient future-video supervision, combined with a StateFusionActionExpert for direct action prediction.
WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models huggingface.co
WorldCraft extends interactive video world models to enable object-level trajectory control while maintaining camera navigation capabilities through specialized control pipelines.
Echo-Memory: A Controlled Study of Memory in Action World Models huggingface.co
A controlled study of memory mechanisms in action-conditioned world models reveals that memory structure and capacity significantly impact open-domain return performance beyond simple replay fidelity measures.
Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders huggingface.co
Research demonstrates that hallucinations in Whisper ASR can be detected and reduced using internal representations from audio encoder activations and Sparse AutoEncoder latents, achieving significant hallucination rate reduction with minimal speech transcription degradation.
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting huggingface.co
AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key interpretive signals.
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill huggingface.co
Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications.
Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses huggingface.co
Bayesian-Agent presents a framework that treats reusable skills and SOPs as hypotheses for model success, using Bayesian inference to guide agent behavior and improve task performance through posterior-guided harness optimization.
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents huggingface.co
LatentSkill enables efficient deployment of textual skills in agent systems by converting them into LoRA adapters stored in weight space, reducing context overhead while maintaining modularity and composability.
Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning huggingface.co
The Skill-3D framework enables agents to learn scene-aware skills through self-evolving memory and skill libraries, improving tool utilization in 3D spatial reasoning tasks.
DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning huggingface.co
A multi-agent framework for deep research tasks addresses planning, evidence acquisition, and report synthesis through decoupled components and dynamic optimization mechanisms.
Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory huggingface.co
SkeMex is a self-evolving framework that enhances medical agents through structured skill memory, improving long-term clinical reasoning by distinguishing useful experiences and governing memory retention based on contextual utility.
SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating huggingface.co
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping to reduce computational costs while maintaining accuracy.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks huggingface.co
SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions.
OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics huggingface.co
OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization.
CoVEBench: Can Video Editing Models Handle Complex Instructions? huggingface.co
A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content.
OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning huggingface.co
OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.
EmpiriGraph-Psy: A Dataset and LLM Pipeline for Extracting Empirical Relation Graphs from Psychology Abstracts huggingface.co
Variable-centered empirical graph extraction maps psychology abstracts to typed graphs with normalized variables and empirical relations, achieving improved performance through staged pipeline approaches.
SDR: Set-Distance Rewards for Radiology Report Generation huggingface.co
Set-based rewards using embedding distances improve chest X-ray report generation by enabling effective post-training and test-time selection without requiring causal reasoning structures.
References
VentureBeat venturebeat.com
GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark… frontier configurations average a mere 2.6% pass rate on the Last-Exam tier, with some configurations scoring 0%.
Epoch AI — ‘What do economic-value benchmarks tell us?’ epoch.ai
Because GDPval tasks lack a ‘messy environment’ to navigate, they may not fully test an agent’s long-horizon planning compared to the sandbox-heavy ALE; GDPval’s pairwise human judging is also more susceptible to saturation as models improve at mimicking professional writing styles.
Berkeley RDI blog — ‘Harness Matters’ agents-last-exam.org
ALE-Claw, derived from OpenClaw, intentionally strips product-facing features… it achieves performance comparable to commercial harnesses while using 44% fewer tokens and costing roughly 40% less.
Digg — r/BetterOffline coverage of ALE task quality digg.com
Critics mock the ‘Mini Encabulator’ KiCad task — the environment ships KiCad 9 while the schematic requires KiCad 10, and an agent can earn full points simply for generating a file with four mounting holes even if the electrical routing is unusable.
Snorkel AI leaderboard mirror snorkel.ai
Fable 5 displayed the highest volume of ‘cheating’ yet recorded, largely due to memorizing upstream fixes from its training data; homemade harnesses like ALE-Claw often outperformed specialized corporate ones for half the cost.
EvoAI Labs — ‘The Hard Horizon’ evoailabs.medium.com
‘Success Hallucination’ — agents prematurely declaring ‘Done. All checks pass’ without verifying output — compounds GUI-bypass failures; 75% of ALE failures trace to ‘Understanding and Approach’ rather than execution, with agents defaulting to ad-hoc Python scripts instead of the specialized professional software the task requires.
U. Arizona / CD-MoE publication page experts.arizona.edu
Condense-MoE… reduced memory usage by 27.5% and increased inference speeds by 26%, recovering 98% of original performance after just 5 hours of lightweight fine-tuning on a single A100 GPU on DeepSeek-V2-Lite.
AAAI proceedings (D²-MoE / NAEE comparison) ojs.aaai.org
On Mixtral-8x7B at 20% compression, D²-MoE maintains ~95% of original MMLU performance (0.60) whereas NAEE and MoE-I² drop to 0.58 and 0.57; NAEE’s brute-force expert search struggles to scale to fine-grained architectures like DeepSeek or OLMoE.
Medium — Mixture-of-Depths overview medium.com
Mixture-of-Depths routes per-token whether to engage a full transformer block or skip via residual, reducing FLOPs while maintaining capacity — an orthogonal axis to expert-count pruning.
DataCamp — Qwen3 benchmark writeup datacamp.com
Qwen3-30B-A3B reaches ~64 tokens/sec on dual RTX 3090s versus ~9 tokens/sec for Qwen3-32B dense, while scoring 80.4 on AIME and 91.0 on ArenaHard — slightly above the QwQ-32B dense baseline.
Interconnects (Nathan Lambert) on GPT-OSS interconnects.ai
GPT-OSS-20B (21B total, 3.6B active) demonstrates how OpenAI validated the open MoE design point; its routing structure can be transferred but tool-calling precision tends to degrade when collapsed into a dense student.
SGLang SpecForge issue #339 github.com
Users report low GPU utilization and throughput degradation with EAGLE3 speculative decoding on Qwen3-30B-A3B; Qwen’s 151k vocab and high expert activation rates create serving bottlenecks in SGLang/vLLM.
wispaper.ai — summary of Dragutinović & Ranganath (NYU, 2026) wispaper.ai
Muon fundamentally disrupts this hierarchy. By orthogonalizing the gradient update—effectively setting all singular values to one—Muon forces the model to learn all features, both signal and noise, at the same rate… it is significantly more prone to fitting spurious correlations and memorizing noise early in training.
Fireworks.ai blog on MuonClip fireworks.ai
Moonshot solved this by integrating ‘QK-Clip,’ a technique that rescales Query and Key weight matrices to prevent exploding attention scores… resulted in a perfectly smooth loss curve with zero spikes across the entire training run.
Vasudeva et al., arXiv 2603.00742 (associative memory theory) arxiv.org
Muon treats the frequency spectrum of data more uniformly, providing an exponential speedup in learning low-frequency ‘tail facts’ that are traditionally difficult for associative memories to retain.
Medium — Jen Wei, ‘Going Beyond AdamW: A Practical Guide’ medium.com
Muon is fundamentally a ‘2D optimizer’… cannot be applied to 1D tensors such as biases, layer normalization gains, or embedding layers. This necessitates a hybrid training setup—typically ‘Muon + AdamW’.
Hugging Face blog (KingNish) — ‘Optimizer Part 1’ huggingface.co
In shallow networks (depth < 2), Muon often fails to outperform Adam, suggesting its advantages only emerge in the complex ‘river-valley’ landscapes of deep architectures.
PyTorch blog — Using Muon with DeepSpeed pytorch.org
NVIDIA reporting that the computational cost of the orthogonalization step is a negligible 0.5% to 0.7% of the total forward-backward pass.
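Several of the Muon entries above refer to an orthogonalization step that, in the wispaper.ai summary's words, amounts to "setting all singular values to one." The sketch below is a minimal illustration of that idea only, not Muon's actual implementation: it swaps in the classical cubic Newton-Schulz iteration (Muon implementations typically use a tuned higher-order variant), runs in plain Python on a hypothetical 2x2 full-rank gradient, and uses helper names (`matmul`, `transpose`, `orthogonalize`) that are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(grad, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix.

    Uses the classical cubic Newton-Schulz iteration, which leaves the
    singular vectors untouched and drives every singular value toward 1.
    """
    # Divide by the Frobenius norm so all singular values start in (0, 1].
    fro = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / fro for v in row] for row in grad]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)]
             for rp, rq in zip(x, xxt_x)]
    return x


# Hypothetical gradient with unequal singular values (about 6.7 and 2.2).
grad = [[3.0, 0.0], [4.0, 5.0]]
update = orthogonalize(grad)

# If the update is orthogonal, update @ update^T is close to the identity.
gram = matmul(update, transpose(update))
print(gram)
```

After the iteration the large and small singular directions of the gradient carry equal weight in the update, which is the property the curvature and "learns all features at the same rate" claims above are about. The fixed step count of 30 is an assumption that suits this small, well-conditioned example; a nearly rank-deficient matrix would need more steps.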