Anthropic agent survey, IBM SRE benchmark at 47%, HRM-Text trained for $1,500
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Coding agents in the social sciences anthropic.com
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM huggingface.co
HRM-Text: Efficient Pretraining Beyond Scaling huggingface.co
A Hierarchical Recurrent Model architecture with specialized training on instruction-response pairs achieves competitive language modeling performance with significantly reduced computational requirements compared to traditional Transformer-based approaches.
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists huggingface.co
GPT-5.2, Gemini 3.0 Pro and Claude Opus 4.5 outperformed human reviewers at identifying valid criticisms across Nature-family submissions, in a study with 45 expert scientists. The models still trailed humans on subfield depth and managing long-context manuscript material.
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories huggingface.co
Parameter updates during reinforcement learning with verifiable rewards trace a near rank-1 path, so a simple linear regression on early checkpoints extrapolates the final model. The RELEX method matches full RLVR performance while cutting compute and filtering out stochastic optimization noise.
The Unlearnability Phenomenon in RLVR for Language Models huggingface.co
Hard prompts in RLVR resist learning even when correct rollouts exist. Cross-example gradient analysis shows that their representations conflict with the rest of the batch. Standard optimizers and data augmentation fail to close the gap, pointing to a representation-level bottleneck.
Toto 2.0: Time Series Forecasting Enters the Scaling Era huggingface.co
The time-series foundation model sets a new state of the art on BOOM, GIFT-Eval and TIME by scaling parameters under a u-muP hyperparameter transfer pipeline. Forecasting accuracy improves predictably with size, bringing the scaling-law playbook from language models to numerical sequence data.
Stable Audio 3 huggingface.co
The latent diffusion model handles variable-length generation, editing and inpainting over a semantic-acoustic autoencoder, with adversarial post-training collapsing inference to a handful of steps. Stability is releasing open weights alongside the paper for artistic experimentation.
DynMuon: A Dynamic Spectral Shaping View of Muon huggingface.co
DynMuon adjusts the spectral shaping of Muon’s polar-factor update across training stages instead of using a fixed transform, reaching lower validation loss in fewer steps. The dynamic schedule adapts to changing stochastic gradient statistics as optimization progresses.
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment huggingface.co
DPO and RLHF optimize the same objective only when the reference policy meets a specific condition; otherwise they diverge and DPO exhibits failure modes. The authors propose Constrained Preference Optimization, a soft margin ranking variant with provable alignment guarantees.
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents huggingface.co
The paper measures reward hacking in long-horizon coding agents by comparing performance on visible validation tests against held-out tests, separating genuine solutions from test-gaming strategies.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space huggingface.co
Flow matching in representation spaces with improved statistical properties enables efficient diffusion model training with reduced parameters and fast sampling.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation huggingface.co
Research investigates subword tokenization’s impact on LLM training efficiency and performance through controlled byte-level pretraining experiments, revealing key factors in training throughput and linguistic priors.
Learning from Language Feedback via Variational Policy Distillation huggingface.co
Variational Policy Distillation enables reinforcement learning from language feedback by co-evolving teacher and student policies through variational expectation-maximization, overcoming limitations of passive distillation in complex reasoning tasks.
Mem-π: Adaptive Memory through Learning When and What to Generate huggingface.co
Mem-π is a framework for adaptive memory in LLM agents that generates context-specific guidance using a separate language or vision-language model trained with decision-content decoupled reinforcement learning.
MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems huggingface.co
Current memory-augmented agents demonstrate poor performance in long-horizon, interference-heavy environments requiring accurate recall and aggregated reasoning across evolving information.
Generative Recursive Reasoning huggingface.co
Generative Recursive reAsoning Models (GRAM) introduce probabilistic multi-trajectory computation for neural reasoning systems, enabling multiple hypotheses and parallel inference through stochastic latent trajectories.
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering huggingface.co
SaaSBench introduces a comprehensive benchmark for evaluating AI agents in enterprise SaaS development, addressing the limitations of existing benchmarks by incorporating multi-component system integration challenges.
Capturing LLM Capabilities via Evidence-Calibrated Query Clustering huggingface.co
Query clustering algorithm ECC improves LLM capability evaluation by aligning semantic embeddings with latent capability demands through posterior model comparisons and Bradley-Terry modeling.
UniT: Unified Geometry Learning with Group Autoregressive Transformer huggingface.co
UniT presents a unified feed-forward model for geometry perception using a Group Autoregressive Transformer that integrates multiple paradigms while maintaining metric-scale accuracy through scale-adaptive loss and queue-style KV caching.
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models huggingface.co
PlanningBench introduces a framework for generating scalable and verifiable planning data that enables better evaluation and training of large language models’ planning capabilities through structured taxonomies and constraint-driven synthesis.
MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization huggingface.co
LLM agents organize behavior through skills: structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simult…
OScaR: The Occam’s Razor for Extreme KV Cache Quantization in LLMs and Beyond huggingface.co
OScaR is a novel KV cache compression framework that addresses token norm imbalance through canalized rotation and omni-token scaling, achieving significant improvements in memory efficiency and decoding speed for extended context language models.
OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization huggingface.co
OCTOPUS achieves efficient key-value cache compression through structured random rotations and optimized quantization of coordinate triplets, enabling high-quality reconstruction with reduced memory bandwidth usage.
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs huggingface.co
Mix-Quant is a phase-aware quantization framework that accelerates long-context, multi-turn LLM inference by applying high-throughput NVFP4 quantization to the prefilling phase while maintaining BF16 precision for decoding.
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload huggingface.co
Diffusion large language models face deployment challenges on resource-constrained devices, but a new inference system called TIDE addresses this by leveraging temporal stability of expert activations and optimizing expert placement to reduce I/O overhead and computation.
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs huggingface.co
SELFCI is a self-distillation framework that separates information suppression from task resolution to achieve better privacy-utility balance in large language models without external supervision.
Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines huggingface.co
Industrial asset operations workflows face latency challenges due to complex coordination needs, addressed through novel caching and workflow optimization techniques that improve execution speed while maintaining correctness in parameter-rich environments.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook huggingface.co
Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies.
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection huggingface.co
Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection.
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening huggingface.co
A Chinese logical reasoning benchmark for large language models is introduced, featuring expert-verified natural-language items with formal annotations and adversarial hardening to better evaluate rule-governed reasoning capabilities.
Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency huggingface.co
Learn-by-Wire Guard (LBW-Guard) enhances language model training stability and efficiency by providing bounded autonomous control over optimizer execution without altering the underlying training objective.
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining huggingface.co
A large-scale GUI dataset was created by automatically extracting interaction trajectories from internet videos, enabling improved performance in GUI agents through pre-training on this diverse collection.
CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing huggingface.co
Current GUI agents show limited effectiveness in professional media post-production tasks despite advances in spatial grounding and multimodal alignment.
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools huggingface.co
IndusAgent is a tool-augmented agentic framework for open-vocabulary industrial anomaly detection that improves performance through structured visual reasoning and dynamic tool utilization.
Stitched Value Model for Diffusion Alignment huggingface.co
StitchVM efficiently transfers pretrained pixel-space reward models to noisy latent spaces for diffusion model alignment through a lightweight model stitching framework.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning huggingface.co
Uni-Edit introduces an intelligent image editing task that simultaneously enhances unified multimodal models’ understanding, generation, and editing capabilities through a single training stage and dataset, utilizing an automated data synthesis pipeline for complex editing instructions.
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models huggingface.co
A causal evaluation framework is developed to verify visual evidence grounding in chest X-ray vision-language models, leading to the proposal of MedFocus, a concept-based attribution method that improves clinical trustworthiness through anatomical region localization and causal effect measurement.
PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis huggingface.co
PanoWorld generates consistent VR tours by combining 3D geometric guidance with dynamic visual memory, enabling high-quality multi-room panoramas with spatial coherence.
OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation huggingface.co
OcclusionFormer addresses inter-object occlusion challenges in layout-to-image generation by modeling explicit Z-order priority through diffusion transformers and volume rendering techniques.
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos huggingface.co
MIGA addresses long video generation challenges by reducing training-inference gaps and enhancing temporal consistency through dual consistency mechanisms.
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation huggingface.co
Mega-ASR framework improves robustness in real-world speech recognition through compound-data construction and progressive acoustic-to-semantic optimization techniques.
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance huggingface.co
Interactive Video Virtual Try-On addresses active human-garment interaction by introducing a multi-level injection mechanism and action-aware positional embeddings within a video diffusion Transformer framework.
DrawMotion: Generating 3D Human Motions by Freehand Drawing huggingface.co
DrawMotion is an efficient diffusion-based framework that generates human motions using both text and hand-drawn sketches, reducing user effort by 46.7% while maintaining motion fidelity.
Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation huggingface.co
Deep ensembles trained with fixed data and varying seeds outperform cross-validation ensembles in calibration and failure detection for medical image segmentation, while cross-validation ensembles better approximate inter-rater variability.
References
Jha et al., ICML 2025 ITBench paper (PMLR) raw.githubusercontent.com
state-of-the-art models resolved only ~11.4% of SRE scenarios
IBM Research — ITBench & MAST blog (Hugging Face) huggingface.co
Multi-Agent System Failure Taxonomy (MAST) … distinguishes between ‘fatal’ flaws like incorrect verification and ‘non-fatal’ behaviors such as benign step repetition
OpenReview reviewer comments on ITBench (ICML 2025) openreview.net
FinOps category originally contained too few tasks to support broad difficulty claims … ‘push-button’ Kubernetes workflows requires significant infrastructure resources
Melethil, ‘AI Agent Benchmarks Are Broken’ (Medium) medium.com
approximately 44% of ITBench’s mitigation problems could be ‘solved’ by a generic pod-restart loop … clears the alert but fails to address the underlying defect
Artificial Analysis ITBench-AA leaderboard artificialanalysis.ai
Gemma 4 31B achieves 37% at $0.14 per task … Claude Opus 4.7 the most expensive at $5.38 per task … Gemini 3.1 Pro averaged 83 turns for 30% vs GPT-5.5’s 31 turns
kav.co.id coverage of ITBench-AA news.kav.co.id
positioned as a potential industry standard for ‘mission-critical IT operations’ … IBM plans to extend the framework to cover FinOps and CISO tasks
Henry Farrell, Programmable Mutter programmablemutter.com
Claude Code exactly replicated all 12 primary coefficients to three decimal places… but when extending the data, it failed to calculate 2022 turnout correctly and missed 1 out of 30 California counties in its treatment timing.
Niskanen Center (Grossmann interview with Andy Hall) niskanencenter.org
Hall has reorganized his lab around AI agents, arguing these tools let researchers function like ‘firm managers’ directing a team of a hundred — though critics warn of an industrialization of social science where quantity replaces rigorous, intuition-led inquiry.
Tom Pepinsky, Substack tompepinsky.substack.com
Agentic AI tends to ‘guess what the user wants to hear,’ interpreting results in provocative but incorrect ways — creating a risk of p-hacking where the agent unknowingly cherry-picks results or modifies code to produce a finding that ‘vibes’ with the researcher’s prompt.
REPRO-Bench (Hu et al., ACL 2025 Findings) aclanthology.org
The best-performing baseline (CORE-Agent) achieved only 21.4% accuracy on assessing reproducibility of social science papers — barely better than random guessing; their specialized REPRO-Agent improved this to 36.6%.
Forbes — The Wiretap forbes.com
Newer versions such as Claude Opus 4.7 introduced security defects in 52% of tested tasks, a sharp decline from previous iterations, raising concerns about autonomous agents that might inadvertently exfiltrate sensitive research data.
LogRocket Blog blog.logrocket.com
Senior engineers spend an average of 4.3 minutes reviewing AI-generated pull requests compared to 1.2 minutes for human-written code, and AI-authored code surfaces 1.7x more issues — a ‘peer review tax’ that may explain why agent users start more projects but don’t finish more papers.
ARC Prize Foundation leaderboard / analysis of original HRM arcprize.org
External verification on the semi-private hold-out recorded 32% on ARC-AGI-1 (vs. 41% self-reported) and only 2% on ARC-AGI-2; an ablation found a similarly sized standard Transformer reached nearly identical performance with the same training pipeline — the ‘outer loop’ refinement and ~300x data augmentation were the true drivers.
KuCoin news flash on Kyle Kastner reproduction kucoin.com
Kyle Kastner reported a successful full-scale reproduction of HRM-Text XL (1B) on 16 H200 GPUs in ~38 hours, matching the paper’s MATH and DROP numbers.
r/LocalLLaMA discussion thread reddit.com
Commenters argue scores like MATH 84.7% and DROP 82.3% are statistically improbable for a 40B-token model unless the curriculum was ‘curriculated’ to mirror the test sets; the model is also reportedly fragile to inference harness — wrong token_type_ids cause ‘silent destruction’ of performance.
Tay et al., UL2 / PrefixLM literature (arXiv 2204.05832) arxiv.org
Ablation reported alongside HRM-Text: vanilla 1B Transformer on the same 40B instruction set scores MMLU 40.55%; adding response-only loss lifts it to 47.72%; adding the PrefixLM mask reaches 53.15%; only the final HRM recurrent architecture pushes it to 60.73% — meaning the objective/mask changes account for more of the gain than the recurrence itself.
CMU 10-423 lecture notes on PrefixLM theory cs.cmu.edu
PrefixLMs can converge to the optimal solution of the underlying task distribution as sample size increases, whereas causal LMs behave like online gradient descent that may never reach a stationary point — a theoretical basis for HRM-Text’s sample-efficiency claim.
Medium / ‘Data Science in Your Pocket’ review medium.com
Comparing a model trained exclusively on curated instruction-response pairs (OpenMathInstruct2, NuminaMath, FLAN) against general-purpose base models like Llama 3.2 and Qwen 2.5 is a ‘specialist vs. generalist’ mismatch; HRM-Text is a ‘hollow shell’ on broad world knowledge despite its reasoning scores.
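The ablation figures quoted in the UL2 / PrefixLM entry above can be checked with a few lines of arithmetic. The sketch below uses only the four MMLU scores from that entry. The variable names are illustrative, and the step-by-step attribution assumes the changes were added in the order the entry lists them.

```python
# MMLU scores (%) quoted in the ablation entry, in the order each change is added.
steps = [
    ("vanilla 1B Transformer", 40.55),
    ("+ response-only loss", 47.72),
    ("+ PrefixLM mask", 53.15),
    ("+ HRM recurrent architecture", 60.73),
]

# Gain in percentage points contributed by each successive change.
gains = {
    name: round(score - prev_score, 2)
    for (_, prev_score), (name, score) in zip(steps, steps[1:])
}

total_gain = round(steps[-1][1] - steps[0][1], 2)
objective_and_mask = round(
    gains["+ response-only loss"] + gains["+ PrefixLM mask"], 2
)
recurrence = gains["+ HRM recurrent architecture"]

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
print(f"objective + mask: +{objective_and_mask:.2f} of {total_gain:.2f} points")
print(f"recurrence:       +{recurrence:.2f} of {total_gain:.2f} points")
```

On these numbers the loss and mask changes together add 12.60 points against 7.58 for the recurrent architecture, which is the comparison the entry draws. The recurrence is still the largest single step, and an attribution like this depends on the order in which the changes are stacked.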