SU-01 wins IMO gold, WildClawBench: 18pt harness gap, Lighthouse strips sparsity
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling huggingface.co
A systematic approach transforms post-trained reasoning models into rigorous olympiad-level solvers through reverse-perplexity curriculum, two-stage reinforcement learning, and test-time scaling, achieving gold-medal performance on mathematical and physics competitions.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation huggingface.co
WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools instead of synthetic sandboxes.
Long Context Pre-Training with Lighthouse Attention huggingface.co
Lighthouse Attention enables efficient training of causal transformers at long sequences by using hierarchical selection-based attention that reduces computational complexity while maintaining model performance.
Orchard: An Open-Source Agentic Modeling Framework huggingface.co
Orchard ships as a framework for building autonomous agents with task-specific recipes spanning SWE-bench coding, WebVoyager GUI navigation, and personal assistance. It bundles sandbox lifecycle management, credit-assignment SFT, and a Balanced Adaptive Rollout scheme to handle multi-turn tool use at scale.
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE huggingface.co
Mixture-of-Experts models typically pick experts via Top-K, but BEAM learns per-token binary masks through a straight-through estimator and auxiliary regularizer. A custom CUDA kernel integrated with vLLM delivers token-adaptive sparsity, cutting compute while preserving downstream accuracy.
Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis huggingface.co
Rather than generating training data, the method has the model build verifiable Python environments and solve them, exploiting the asymmetry between solving and verifying. Staged validation, semantic self-review, and difficulty calibration keep rewards informative as Qwen3-4B-Thinking improves under zero-data RLVR.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer huggingface.co
The hybrid linear diffusion transformer pairs Gated DeltaNet with softmax attention and dual camera branches for 6-DoF trajectory control. A two-stage pipeline plus NVFP4-quantized distilled variant pushes minute-scale video synthesis to industrial fidelity while slashing compute against dense baselines.
FutureSim: Replaying World Events to Evaluate Adaptive Agents huggingface.co
The framework grounds agent evaluation in chronological replays of actual world events, testing search, memory, and uncertainty reasoning under test-time adaptation. Current frontier forecasters show large gaps against the timeline-grounded ground truth, exposing weak long-horizon prediction.
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models huggingface.co
Autoregressive video diffusion bloats memory with redundant key-value caches across frames. Forcing-KV classifies attention heads as static or dynamic, applying structured pruning to the former and segment-wise similarity pruning to the latter, restoring scalability for streaming Self Forcing generation.
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation huggingface.co
Causal Forcing++ and RAVEN both attack the history-supervision gap in few-step autoregressive video models. Causal Forcing++ introduces causal consistency distillation for frame-wise generation, while RAVEN pairs causal extrapolation with CM-GRPO, applying reinforcement learning to consistency-model sampling for interactive latency.
RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO huggingface.co
RAVEN enables real-time video generation through causal autoregressive extrapolation with improved training alignment, while CM-GRPO enhances performance via reinforcement learning applied to consistency model sampling.
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both huggingface.co
ATLAS presents a visual reasoning framework that combines agentic operations and latent representations using functional tokens, enabling efficient training and improved performance on complex benchmarks.
Dynamic Latent Routing huggingface.co
Temporal composition of sub-policies in MDPs with time-varying rewards enables optimal policy recovery through generalized Dijkstra search, which inspires a dynamic latent routing method for language model fine-tuning that outperforms traditional supervised approaches.
Quantitative Video World Model Evaluation for Geometric-Consistency huggingface.co
A quantitative framework called PDI-Bench is introduced for evaluating geometric coherence in generated videos through monocular reconstruction and projective-geometry residuals, revealing geometry-specific failure modes in video generators.
Self-Distilled Agentic Reinforcement Learning huggingface.co
SDAR enhances reinforcement learning for multi-turn agent training by integrating self-distillation through a sigmoid gate that selectively strengthens positive token-level guidance while mitigating negative teacher rejections.
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale huggingface.co
FrontierSmith automates the creation of open-ended coding problems from closed-ended tasks, improving LLM coding performance on benchmarks through diverse problem variants and enhanced agent interactions.
Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning huggingface.co
The Darwin Family framework enables training-free evolutionary merging of large language models through gradient-free weight-space recombination, achieving superior reasoning performance without additional training.
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves huggingface.co
CurveBench presents a benchmark for hierarchical topological reasoning using visual inputs, demonstrating significant challenges in exact topology-aware visual reasoning even with advanced models.
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video huggingface.co
A novel approach called Warp-as-History enables camera-controlled video generation by transforming camera-induced warps into pseudo-history representations, achieving zero-shot capability without training or test-time optimization.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models huggingface.co
A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches.
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory huggingface.co
The MemEye framework evaluates multimodal agent memory by measuring visual evidence granularity and retrieval usage complexity across 8 life-scenario tasks.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation huggingface.co
Research demonstrates that current omni-modal benchmarks may inflate performance through visual shortcuts, and shows that post-training techniques can improve model performance on a cleaned benchmark with reduced visual leakage.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems huggingface.co
Multi-agent systems face challenges in sustained coordination and error propagation, requiring integrated approaches that enable continuous diagnosis, reorganization, and behavioral refinement across structured collaboration stages.
PanoWorld: Towards Spatial Supersensing in 360° Panorama World huggingface.co
PanoWorld with spherical spatial cross-attention enables panoramic reasoning by leveraging equirectangular projection structure and geometry-aware supervision.
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation huggingface.co
PhyMotion introduces a physics-grounded reward system for human motion generation that evaluates kinematic plausibility, contact consistency, and dynamic feasibility to improve video quality.
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction huggingface.co
VGGT-Edit enables text-conditioned 3D scene editing through depth-synchronized text injection and direct geometric displacement prediction, achieving superior quality and efficiency over 2D-lifting approaches.
RewardHarness: Self-Evolving Agentic Post-Training huggingface.co
RewardHarness is a self-evolving framework that improves image edit evaluation by iteratively developing tools and skills from limited human demonstrations, achieving superior performance compared to existing models.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation huggingface.co
IntentVLA is a history-conditioned visual-language action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations.
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance huggingface.co
FEST is a few-shot demonstration-guided reinforcement learning algorithm that achieves strong performance with minimal supervised fine-tuning data by combining supervised signals, on-policy learning, and weighted training to prevent overfitting.
Nexus: An Agentic Framework for Time Series Forecasting huggingface.co
Nexus is a multi-agent forecasting framework that decomposes time series prediction into specialized stages, enabling effective integration of numerical patterns and contextual information for improved forecasting accuracy and explainability.
Topology-Preserving Neural Operator Learning via Hodge Decomposition huggingface.co
Physical field equations on geometric meshes are analyzed through Hodge theory to develop a hybrid Eulerian-Lagrangian architecture that improves accuracy and efficiency by separating topological and geometric components.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning huggingface.co
A closed-loop visual reasoning framework integrates visual-language planning with diffusion generation to improve complex image synthesis while addressing latency and optimization challenges.
Aligning Latent Geometry for Spherical Flow Matching in Image Generation huggingface.co
Geodesic flow matching improves image generation by projecting latents onto fixed-radius spheres and using spherical linear interpolation instead of linear paths, preserving semantic content through angular components.
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image huggingface.co
Sat3DGen addresses the challenge of generating street-level 3D scenes from satellite images by employing a geometry-first approach that improves both geometric accuracy and photorealism through novel constraints and training strategies.
RouteProfile: Elucidating the Design Space of LLM Profiles for Routing huggingface.co
LLM profiling design significantly impacts routing performance, with structured profiles and query-level signals demonstrating superior reliability and generalization compared to flat profiles and domain-level signals.
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models huggingface.co
DiffusionOPD enables efficient multi-task training for diffusion models through online policy distillation, outperforming existing reinforcement learning approaches in both training efficiency and final performance.
ViMU: Benchmarking Video Metaphorical Understanding huggingface.co
Video understanding models lack the capability to interpret implicit meanings and social contexts beyond literal visual comprehension, necessitating new benchmarking approaches.
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild huggingface.co
WildTableBench is introduced as the first question-answering benchmark for real-world table images, revealing significant challenges in structural perception and numerical reasoning for existing multimodal models.
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents huggingface.co
EvolveMem enables adaptive memory systems for LLM agents through self-evolving retrieval mechanisms that autonomously optimize configuration parameters via diagnostic modules and iterative research cycles.
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? huggingface.co
Large language models struggle to update personalized memories when new evidence emerges, requiring contextual inference and commonsense reasoning to detect implicit conflicts, as demonstrated by a comprehensive benchmark and evaluation of state-aware memory systems.
PREPING: Building Agent Memory without Tasks huggingface.co
Preping is a framework for pre-task memory construction that uses proposer-guided synthetic practice to improve agent performance in new environments with reduced deployment costs.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding huggingface.co
A multi-agent pathfinding solver enhanced with a learnable communication module improves coordination and performance while maintaining scalability.
BOOKMARKS: Efficient Active Storyline Memory for Role-playing huggingface.co
BOOKMARKS is a search-based memory framework that improves role-playing agents by actively managing task-relevant information through structured bookmarks that capture detailed character behaviors and story elements.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning huggingface.co
Adaptive Teacher Exposure for Self-Distillation (ATESD) improves large language model reasoning by dynamically adjusting teacher exposure during training through a learnable policy controller.
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution huggingface.co
PRISM is a diffusion-based text super-resolution framework that improves accuracy under severe degradation by using flow-matching prior rectification and uncertainty-aware residual encoding.
LLM-based Detection of Manipulative Political Narratives huggingface.co
A computational framework combining prompt-based filtering and unsupervised clustering identifies manipulative political narrative clusters from social media posts without requiring predefined categories.
PreScam: A Benchmark for Predicting Scam Progression from Early Conversations huggingface.co
PreScam benchmark enables modeling of scam progression through multi-turn conversations by structuring real-world reports according to a scam kill chain and annotating psychological actions and victim responses.
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning huggingface.co
Realiz3D addresses the domain gap between synthetic renders and real images in 3D-consistent image generation by decoupling visual domain from control signals through residual adapters and layer-specific denoising strategies.
Ideology Prediction of German Political Texts huggingface.co
A transformer-based model projects political orientation on a continuous spectrum using multiple corpora, achieving high accuracy in detecting political bias across different text sources.
Does Synthetic Layered Design Data Benefit Layered Design Decomposition? huggingface.co
Synthetic layered image data improves graphic design decomposition by enabling scalable training and better layer distribution control compared to traditional methods.
References
Hugging Face — PRIME-RL/P1-30B-A3B model card huggingface.co
P1-30B-A3B… built upon Qwen3-30B-A3B-Thinking, refined through multi-stage RL for Olympiad-level physics; secured 8 gold, 4 silver, and 1 bronze across 13 physics contests on the HiPhO benchmark.
DeepSeek — DeepSeekMath-V2 release blog deepseek.ai
Meta-Verifier evaluates the verifier’s own reasoning… a common failure mode in reward models [is] ‘hallucinating’ issues to justify a score; without it, verifiers can be rewarded for predicting failure for the wrong reasons.
Hacker News discussion on DeepSeekMath-V2 news.ycombinator.com
Skepticism… suggests that the 2024 Putnam problems may have been present in the model’s reinforcement learning data, potentially inflating its perceived reasoning capabilities.
MathArena (ETH Zurich / INSAIT) leaderboard matharena.ai
MathArena specifically addresses ‘data contamination’ by evaluating models on freshly released competition problems from the IMO, USAMO, and Putnam exams.
MLX-Community — Simplified-Reasoning SU-01 collection huggingface.co
The MLX community successfully converted and released quantized versions ranging from 4-bit to 8-bit for Apple Silicon environments.
ArxivDaily thread on SU-01 (practitioner discussion) arxivdaily.com
Independent researchers have proposed alternative benchmarks, such as EEFSUVA, curated from less-circulated Eastern European competitions, to better evaluate genuine ‘nonstandard’ problem-solving.
byteiota — coverage of UC Berkeley BenchJack audit byteiota.com
Terminal-Bench and SWE-bench were found to have 100% exploitation rates because agents could manipulate the testing environment (e.g., dropping a conftest.py file to force pytest to pass).
Berkeley RDI blog on trustworthy benchmarks rdi.berkeley.edu
GAIA and OSWorld were also vulnerable; agents could download reference answers directly from public URLs embedded in the task metadata.
explainx.ai — Terminal-Bench 2.0 analysis explainx.ai
Infrastructure-only gains of 13.7 percentage points were reported on Terminal-Bench 2.0 by simply improving self-verification hooks and prompt restructuring, dwarfing the 2–4 point gains typically seen from model weight updates.
r/Bard discussion on Gemini 3.1 Pro benchmark fairness reddit.com
LiveBench… implemented a ‘High Unseen Question Bias’ toggle, effectively accusing the model of ‘benchmaxxing’—optimizing specifically for known test patterns rather than general reasoning.
tovren.com — WildClawBench writeup tovren.com
CNCERT issued warnings regarding OpenClaw’s ‘inherently weak default security configurations,’ noting that privileged system access could allow threat actors to exfiltrate data via prompt injection or deploy malware through malicious skill repositories like ‘ClawHub’.
r/LLMDevs — ‘LLM-as-a-judge is not enough’ reddit.com
while LLM judges are useful for ‘smoke-testing’ in development, they are often too ‘noisy’ for CI/CD gates… many now advocate for verifiable metrics—such as snapshotting an agent’s tool-call trajectory.
Medium — Adithya Giridharan, ‘Lighthouse Attention and the Case for Removable Sparsity’ medium.com
Lighthouse operates as a training-time wrapper… the model experiences a transient loss spike (1.12–1.57 nats) before stabilizing, recovering within 1,000–1,500 steps after the switch to dense SDPA.
MarkTechPost coverage of Lighthouse Attention marktechpost.com
delivers 1.4×–1.7× pretraining speedup at long context… 17.3× faster forward+backward at 512K on B200
VentureBeat — DeepMind’s Michelangelo benchmark on long-context LLMs venturebeat.com
models that appear to pass needle-in-a-haystack tests collapse when asked to synthesize multiple facts across the same context length
ResearchGate — MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention researchgate.net
dynamic sparse attention patterns for long-context prefilling; cited alongside Lighthouse as part of the same wave of selection-based sparse attention research
smol.ai newsletter (AI News issue 26-05-12) news.smol.ai
Nous’s Lighthouse Attention… gradient-free selection logic sits outside the kernel so it inherits stock FlashAttention bit-for-bit
Nous Research project page — Lighthouse Attention nousresearch.com
validated up to 1M tokens across 32 Blackwell GPUs using standard context parallelism, without sparse-aware collectives