SIA fuses scaffold+LoRA, RCBench caps agents at 21.5/50, ToolMaze lags 3.66×
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
SIA: Self Improving AI with Harness & Weight Updates huggingface.co
A self-improving AI framework simultaneously updates both model weights and task-specific agent architecture through a language-model feedback agent across legal classification, GPU optimization, and biological data denoising tasks.
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research huggingface.co
ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.
When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents huggingface.co
ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck.
OpenSkill: Open-World Self-Evolution for LLM Agents huggingface.co
OpenSkill lets agents grow their own skills and verification signals from open-world resources, skipping target-task labels entirely. The framework runs a bootstrapped learning loop over self-built virtual tasks, then transfers the learned skills to downstream benchmarks with high automated pass rates.
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills huggingface.co
Socratic-SWE turns historical solving traces into structured repair patterns, then uses execution-based validation and a solver-gradient alignment reward to curate a curriculum. The closed-loop approach beats prior self-evolving baselines across SWE-bench Verified, Lite, Pro and Terminal-Bench 2.0.
HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems huggingface.co
HarnessForge argues that updating model weights alone fails when tasks demand different execution paradigms. The system pairs fault-guided harness tailoring with harness-conditioned policy alignment, co-evolving the harness-policy pair so agents adapt at the system level rather than per component.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators huggingface.co
Astra augments vision-language models with action-conditioned visual imagination, calling a Bagel-based world simulator to render novel views during reasoning. An RL-trained policy decides when to imagine, lifting scores on MMSI-Bench and MindCube through view-consistency tuning and tool-use exploration.
Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback huggingface.co
Critic-R wires a critic model between reasoning agents and retrieval systems, producing introspective reasoning traces that drive both query refinement and retriever fine-tuning. The dual optimization yields automatic supervision for agentic search without hand-labeled relevance data.
Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation huggingface.co
The self-OPD recipe retrains causal-attention LLMs as bidirectional diffusion language models by distilling on the student’s own confidence-based decoding trajectories. Removing the train-inference mismatch cuts the token budget needed versus standard knowledge distillation while preserving downstream quality.
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors huggingface.co
Stream3D-VLM treats 3D scene understanding as autoregressive next-token prediction over streaming frames, fusing Visual-Spatial Feature Integration with Geometry-Adaptive Voxel Compression to keep token counts tractable. Training uses spatio-temporal 3D QA pairs designed for online, incremental geometry priors.
Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms huggingface.co
A novel attack-agnostic robustness metric based on Fisher Information Matrix spectral norm is proposed, providing theoretical bounds and scalable evaluation methods for deep neural network robustness assessment.
The Distillation Game: Adaptive Attacks & Efficient Defenses huggingface.co
Distillation attacks create a trade-off for model providers, where useful outputs also enable imitation, addressed through a minimax game framework with adaptive evaluation and defensive strategies.
LLM Explainability with Counterfactual Chains and Causal Graphs huggingface.co
Causal graphs are used to model large language model inference processes, enabling transparent visualization of how models perceive and organize high-level concepts for predictions through a four-phase method involving concept discovery, mapping, and MCMC-inspired counterfactual augmentation.
PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams huggingface.co
PaperFlow is a framework for scientific paper recommendation that processes user profiles, daily paper streams, and interest drift through three stages: profiling, recommending, and adapting, using a longitudinal benchmark with 24 users, 50 daily streams, and 1,200 episodes.
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning huggingface.co
Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric and non-parametric approaches.
Empirical Study on the Characteristics and Evolution of AI-usage in GitHub Repositories: Evidence from Code Comments huggingface.co
AI tools are increasingly integrated into software development workflows, with developers primarily using LLMs for code implementation and enhancement while maintaining ongoing oversight through refactoring and bug fixes, showing a shift from direct code generation to conceptual support over time.
Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings huggingface.co
Text embeddings from large language models are enhanced by EmbedFilter, a linear transformation that reduces the influence of high-frequency tokens and improves semantic representations while enabling dimensionality reduction.
Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them huggingface.co
PhaseLock is a training-free framework that improves physical consistency in image-to-video diffusion models by preserving motion priors from early-step inference throughout the denoising process.
Reinforcement Learning from Rich Feedback with Distributional DAgger huggingface.co
Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods.
LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models huggingface.co
LayerRoute is a lightweight adapter that selectively skips transformer blocks during inference based on input type, achieving compute savings while maintaining or improving model quality through gated routing and LoRA adaptation.
UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs huggingface.co
UnpredictaBench evaluates large language models’ capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity.
Towards Retrieving Interaction Spaces for Agentic Search huggingface.co
RISE framework constructs bounded interaction spaces for agentic search by combining BM25 retrieval with preprocessed document indexing to enable efficient corpus exploration while maintaining high accuracy at scale.
Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity huggingface.co
RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models huggingface.co
Imaginative Perception Tokens (IPT) enhance vision-language models’ spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.
Robots Need More than VLA and World Models huggingface.co
Robot intelligence advancement requires integrating unstructured behavioral data through specialized interfaces for labeling, embodiment mapping, world modeling, and reward inference rather than relying solely on policy scaling.
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents huggingface.co
SubtleMemory benchmark evaluates AI agents’ ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs huggingface.co
Vision-language models struggle to genuinely understand spatial numerical concepts, relying instead on shallow visual cues rather than developing robust coordinate-aware representations.
Watch, Remember, Reason: Human-View Video Understanding with MLLMs huggingface.co
Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning.
dots.tts Technical Report huggingface.co
A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques.
A Cookbook of 3D Vision: Data, Learning Paradigms, and Application huggingface.co
3D vision research is organized through a taxonomy connecting geometric representations, datasets, learning frameworks, and applications across reconstruction, generation, and video modeling tasks.
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark huggingface.co
WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models’ visual understanding capabilities.
Streaming Video Generation with Streaming Force Control huggingface.co
StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture.
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation huggingface.co
Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system.
GENEB: Why Genomic Models Are Hard to Compare huggingface.co
GENEB presents a comprehensive benchmark for evaluating genomic foundation models across diverse tasks and architectures under a unified protocol.
MMAE: A Massive Multitask Audio Editing Benchmark huggingface.co
MMAE presents a comprehensive benchmark for instruction-based audio editing across multiple modalities and complexity levels, revealing significant gaps in current model capabilities.
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges huggingface.co
Multi-objective LLM judge customization using textual gradients faces challenges from gradient dilution and instruction interference that limit optimization effectiveness.
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation huggingface.co
Post-hoc compression of reasoning traces reduces computational costs and inference lengths while maintaining high accuracy, offering an accuracy-efficiency trade-off in knowledge distillation.
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization huggingface.co
AnchorWorld advances egocentric simulation through enhanced interaction integrity and flexible world customization using 3D human motion and anchor view definitions.
Direct 3D-Aware Object Insertion via Decomposed Visual Proxies huggingface.co
A diffusion-based object insertion framework with pose control through decomposed guidance components.
TBD-VLA: Temporal Block Diffusion Vision Language Action Model huggingface.co
TBD-VLA is a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference.
SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations huggingface.co
SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution.
LIMMT: Less is More for Motion Tracking huggingface.co
Training with high-quality motion data improves tracking policy optimization trajectories, with minimal data subsets outperforming full datasets in physics-based humanoid motion tracking.
UniSHARP: Universal Sharp Monocular View Synthesis huggingface.co
UniSHARP extends SHARP for universal monocular rendering across different camera systems by aligning images in an omnidirectional latent space through joint feature and Gaussian space alignment.
Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models huggingface.co
BloomBench presents a cognitively grounded bilingual multimodal benchmark for Vision-Language Models, revealing significant cognitive asymmetries and cross-lingual performance gaps in current models.
ECI_{sem}: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives huggingface.co
ECI_sem, a semantic residual variant of Effective Contrastive Information, ranks negative sources for dense retrieval using frozen embeddings without requiring training, achieving strong performance on MS MARCO and BEIR benchmarks.
Parametric Social Identity Injection and Diversification in Public Opinion Simulation huggingface.co
Large language models suffer from reduced social diversity in public opinion simulation due to identity indistinction in hidden representations, which is addressed through a parametric injection framework that enhances demographic representation fidelity and diversity.
Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development huggingface.co
Confidence-based loss weighting via entropy-derived log-barrier enables improved audio generation through adaptive gradient scaling in supervised diffusion training.
How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling huggingface.co
Small adaptation interfaces extend a frozen Music Transformer model to multiple genres, showing consistent improvement in harmonic prediction but limited genre identity representation.
References
Weights & Biases report on OpenAI PaperBench wandb.ai
even the most advanced models—such as Claude 3.5 Sonnet and OpenAI’s o1—achieve replication scores of only 21% to 24%, while human ML PhDs reach approximately 41%
MedResearchBench (arXiv 2606.07591v1 discussion) arxiv.org
RCBench may be limited by its focus on well-defined computational experiments… real-world research—particularly in clinical and medical fields—requires navigating messier variables like missing data and confounding factors
Cerebras blog: ‘How to stop your autoresearch loop from cheating’ cerebras.ai
agents with loose guardrails often abandon their original research goals to pursue unintended ‘side quests’… an agent tasked with memory optimization might instead spend hours investigating model weight limits
GitHub Discussion #406, karpathy/autoresearch github.com
most improvements found by agents—such as batch size or learning rate adjustments—could be achieved more efficiently through classical Bayesian optimization
PMC study on GPT-5.1 as scientific judge pmc.ncbi.nlm.nih.gov
GPT-5.1 often fails to recognize retracted articles, frequently evaluating discredited research as high-quality… producing different verdicts for the same prompt in 27% of test cases
Augment Code ‘Auggie tops SWE-bench Pro’ blog augmentcode.com
agentic scaffolding—the instructions and tools surrounding the model—often impacts performance more than the model weights alone
MarkTechPost marktechpost.com
When the Feedback-Agent determines that scaffold edits have hit an accuracy plateau, it initiates weight updates using Low-Rank Adaptation (LoRA)… The implementation defaults to a LoRA rank of 32… uses gpt-oss-120b for the Target Agent, while the orchestrating Meta and Feedback agents run on Claude Sonnet.
Techstrong.ai (Hexo Labs founder interview) techstrong.ai
“If just three or four companies own this kind of technology, I’m not sure it’s a good idea,” says founder Kunal Bhatia, framing the MIT-licensed release as ‘sovereignty’ against frontier-lab concentration.
BriefGlance briefglance.com
The ‘350x’ claim refers to the accelerated rate of improvement compared to human-in-the-loop development cycles… at the time of Hexo Labs’ announcement, the official MLE-bench leaderboard was taken offline for maintenance to address fairness and comparability issues.
Hacker News thread on Hexo Labs SIA news.ycombinator.com
compute isn’t free… hard to imagine which organizations would fund the massive electricity bills required for recursive training outside of big tech; other commenters dismissed the release as part of an ‘IPO roadshow.’
Sakana AI — Darwin Gödel Machine sakana.ai
DGM iteratively modifies its own Python codebase, evaluates variants on SWE-bench, and archives successful mutations — improving SWE-bench performance from 20.0% to 50.0% without touching model weights.
OpenReview — AgentBreeder paper openreview.net
Evolutionary search can find ‘blue’ scaffolds that improve performance by nearly 80%, but ‘red’ adversarial scaffolds also emerge that are highly capable yet structurally vulnerable — evidence that co-evolving scaffold and weights against a fixed verifier produces fragile equilibria.
Cleanlab blog on τ-bench cleanlab.ai
Even state-of-the-art models like GPT-4o may succeed on less than 50% of tasks, with consistency dropping sharply in complex domains… the pass^k metric reveals a ‘reliability gap’.
Sierra blog: τ²-bench sierra.ai
τ²-bench introduces ‘dual-control’ environments where both the agent and a (simulated) user can modify the world state, testing the agent’s ability to guide a human collaborator through tasks the agent cannot resolve alone.
OpenReview: ToolFailBench openreview.net
ToolFailBench uses a four-part taxonomy: Tool-Skip, Result-Ignore, Output-Fabrication, and Unnecessary-Tool-Use… Llama-3.1 models exhibit an ‘Always-Call’ pattern even when tools are unnecessary, whereas Qwen2.5-72B shows significantly more discipline.
LangChain blog: Fault Tolerance in LangGraph langchain.com
RetryPolicy and TimeoutPolicy can be attached to individual nodes… an agent can crash, be restarted from a state checkpoint, and resume exactly at the failed tool-call node rather than restarting the entire workflow.
Palo Alto Unit 42: AI Agent Supply Chain Risks unit42.paloaltonetworks.com
Behavioral Integrity Verification (BIV) compares an agent’s claimed behavior against its executable code and natural-language instructions to catch ‘malicious or sloppy’ deviations in third-party skills.
Towards AI: Building Retries in Agents pub.towardsai.net
Circuit breakers stop the agent after a set number of failed iterations (typically 3–5) to prevent ‘token snowballing,’ where the cost of a single task scales exponentially due to a growing error history.
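For readers who want to see the circuit-breaker idea from the last entry in concrete form, here is a minimal sketch. It is not taken from the Towards AI article: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `make_flaky_step` are hypothetical, and the default budget of 3 is simply an assumed value at the low end of the quoted 3–5 range. The breaker passes the accumulated error history to each retry and stops once the failure budget is spent, so the history cannot grow without bound.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class CircuitOpenError(RuntimeError):
    """Raised when the failure budget is exhausted (hypothetical name)."""


@dataclass
class CircuitBreaker:
    # Assumed default, chosen from the 3-5 range quoted in the entry above.
    max_failures: int = 3
    errors: List[str] = field(default_factory=list)

    def run(self, step: Callable[[List[str]], str]) -> str:
        """Call step(error_history) until it succeeds or the budget is spent."""
        while len(self.errors) < self.max_failures:
            try:
                # Pass a copy so the step cannot mutate the breaker's history.
                return step(list(self.errors))
            except Exception as exc:
                self.errors.append(f"{type(exc).__name__}: {exc}")
        raise CircuitOpenError(
            f"stopped after {len(self.errors)} failed attempts"
        )


def make_flaky_step(failures_before_success: int) -> Callable[[List[str]], str]:
    """Build a stand-in for an agent step that fails a fixed number of times."""
    calls = {"n": 0}

    def step(history: List[str]) -> str:
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise ValueError(f"tool failed on attempt {calls['n']}")
        return f"ok after {len(history)} errors"

    return step


if __name__ == "__main__":
    # Two failures fit inside a budget of three, so the third attempt succeeds.
    print(CircuitBreaker(max_failures=3).run(make_flaky_step(2)))

    # Five failures exceed the budget, so the breaker opens instead of looping.
    try:
        CircuitBreaker(max_failures=3).run(make_flaky_step(5))
    except CircuitOpenError as err:
        print(err)
```

A real agent loop would likely also cap tokens or wall-clock time; this sketch only counts failed iterations, which is the mechanism the entry describes.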