SIA fuses scaffold+LoRA, RCBench caps agents at 21.5/50, ToolMaze lags 3.66×

TL;DR

RCBench caps top agents at 21.5 of 50 on reproducing published-paper findings, half the bar.
Hexo Labs’ SIA lifts LawBench charge classification to 70.1% by interleaving scaffold rewrites with LoRA updates.
ToolMaze finds agent recovery scales 3.66× slower than task execution as parameters grow.
GPT-5.1 judges flip verdicts on 27% of repeated RCBench prompts.
HarnessForge co-evolves harness and policy, echoing SIA’s system-level argument.

Today’s agent research makes a quiet argument in unison: the next gains aren’t coming from bigger weights, they’re coming from the loop the weights run inside. RCBench caps frontier agents at 21.5 of 50 on reproducing published-paper findings, and finds the GPT-5.1 judge itself flipping verdicts on 27% of repeated prompts. Hexo Labs’ SIA acts on the same intuition from the build side, interleaving Python scaffold rewrites with LoRA rank-32 updates on gpt-oss-120b to hit 70.1% on a 191-class legal-charge task (with a Claude Sonnet orchestrator doing the heavy lifting). ToolMaze measures the same gap from the failure side: recovery from broken tool calls scales 3.66× slower than skill, and a tools may fail warning prompt barely moves the dial.

The HF Daily round-up reinforces the pattern — HarnessForge, OpenSkill, Socratic-SWE, and Critic-R all co-evolve agent systems around the model rather than retraining it.

RCBench scores top AI research agents at 21.5 of 50

TL;DR

Claude Code tops RCBench at 21.5, less than half the 50-point bar for just re-discovering a published paper’s findings.
OpenAI’s PaperBench shows the same ceiling: frontier models hit 21–24% vs. 41% for ML PhDs.
Failures cluster in experiment design and missing scientific core, not coding bugs.
The GPT-5.1 judge is itself shaky: it misses retracted papers and flips verdicts on 27% of repeated prompts.

A 50 means you matched the paper. Nobody’s close.

ResearchClawBench (RCBench) hands an agent a task description, a stack of PDFs, and raw data, then asks for a finished research report across 40 tasks in 10 scientific domains — astronomy to neuroscience. Expert rubrics anchor a score of 50 at “you reproduced the original paper’s findings.” Above 50 is reserved for genuine new discovery.

Top result: Claude Code (Claude-Opus-4.6) at 21.5. Codex CLI lands at 18.4. Even the “frontier mean” — cheating by taking the best agent per task — only reaches 24.6. Native LLMs run through the authors’ ReAct-style ResearchHarness do marginally better as a frontier (26.5), with Claude-Opus-4.7 leading individual models at 20.7. Re-discovery, let alone discovery, is not on the table.

The ~20% ceiling is showing up everywhere

The striking thing about RCBench’s numbers is how closely they echo OpenAI’s PaperBench, which scored frontier models at 21–24% replication against 8,316 author-written rubric nodes, versus 41% for human ML PhDs ¹. Two independently constructed benchmarks, two different scoring schemes, same answer. That convergence is the news: “agents perform at roughly a fifth of expert capability on end-to-end research” looks like a property of current systems, not an artifact of any single rubric’s harshness.

The authors’ error analysis points in a consistent direction. The dominant failures are Experiment Design Mismatch and Scientific Core Missing — agents look professional, produce plausible-looking reports, and miss the actual point. This mirrors what Cerebras has called “agent drift,” where loosely-guarded autoresearch loops abandon their stated objective for unrelated side quests ². Practitioners are blunter: in Karpathy’s autoresearch discussion, reviewers argue that much of what passes for agent “research” is hyperparameter tuning that classical Bayesian optimization would do faster and cheaper ³.

Caveats that matter more than the leaderboard

Three structural concerns sit underneath the headline numbers.

First, the judge. RCBench leans on GPT-5.1-as-judge with human-expert spot checks. An independent PMC evaluation found GPT-5.1 frequently rates retracted papers as high-quality and produces different verdicts for the same prompt 27% of the time ⁴. In the >50 “new discovery” regime, where there is no ground truth to anchor against, that noise is load-bearing.

Second, scope. The MedResearchBench authors note that RCBench’s dry-lab, well-defined computational tasks duck the messy variables — confounders, missing data, reporting standards — that dominate actual clinical and field science ⁵. The astronomy/physics/chemistry tilt selects for tasks with unusually clean ground truth.

Third, the harness. Augment’s SWE-bench Pro work showed that scaffolding choices often swing scores more than the underlying model ⁶. RCBench is partly comparing Claude Code’s harness against Codex CLI’s against the authors’ ResearchHarness — a confound the leaderboard format obscures.

Which is the line that should worry anyone selling an “AI scientist.”

Hexo Labs’ SIA co-evolves scaffold and LoRA weights

TL;DR

SIA interleaves Python scaffold rewrites with LoRA rank-32 weight updates on gpt-oss-120b, switching levers when one plateaus.
On LawBench’s 191-class charge task, the combined loop hit 70.1% vs. 50.0% for scaffold-only.
The orchestrating Meta- and Feedback-Agents run on Claude Sonnet, undercutting the “open-source superintelligence” pitch around the MIT release.
Prior scaffold-evolution work (DGM, AgentBreeder) already shows fragile co-evolved equilibria — a risk SIA flags but doesn’t measure.

The dual-lever loop

Most self-improving agent work picks one knob. Sakana’s Darwin Gödel Machine rewrites its own Python and archives winning mutations, taking SWE-bench from 20% to 50% without ever touching weights ⁷. Test-time training systems do the opposite: gradient updates against a frozen harness. SIA’s bet is that the interesting regime is both, with an LLM Feedback-Agent deciding which lever to pull next based on execution traces.

In practice that means a Claude Sonnet Feedback-Agent watches the Target Agent (gpt-oss-120b) run, and when scaffold edits — tool dispatch code, prompt rewrites, parser fixes — stop yielding gains, it triggers a LoRA rank-32 adapter training run on Modal H100s inside a network-disabled sandbox ⁸. It also picks the RL algorithm to fit the task: PPO+GAE for long-horizon code, GRPO when rollouts are cheap, entropic advantage weighting for sparse-reward CUDA kernels, DPO when the verifier only ranks.

flowchart LR
    M[Meta-Agent<br/>Claude Sonnet] -->|writes initial scaffold| T
    T[Target Agent<br/>gpt-oss-120b + scaffold] -->|trajectories + metrics| F
    F[Feedback-Agent<br/>Claude Sonnet] -->|lever 1: rewrite Python| T
    F -->|lever 2: LoRA + RL on Modal| T

Where the numbers land

The headline gains are real and span very different domains:

Benchmark	Prior SOTA	Harness-only	SIA W+H
LawBench (191-class)	45.0%	50.0%	70.1%
AlphaEvolve TriMul (H100 kernel)	1.161	—	1.475 (1,017 µs)
MAGIC scRNA-seq (mse_norm)	0.240	0.241	0.289

The LawBench jump is 25.1 absolute points over the previous SOTA of 45.0%. The TriMul result is the cleanest case for the weight lever: the model discovered H100-specific shared-memory tiling patterns that no amount of prompt engineering surfaced, cutting runtime 91.9% below the best harness-only attempt (the paper reports the delta, not the harness-only score itself). On MAGIC, the LoRA phase rediscovered a biological invariant — clipping imputed counts to non-negative integers — that scaffold search had missed for iterations.

What the press release skips

Hexo Labs’ rollout leaned on a “350× acceleration toward superintelligence” framing. BriefGlance notes the MLE-bench leaderboard was offline for fairness maintenance when that number landed, and reads the 350× as an extrapolation of an efficiency delta rather than a vetted benchmark ⁹. Hacker News was harsher, questioning who outside the hyperscalers can actually fund recursive H100 rollouts and calling the launch an “IPO roadshow” ¹⁰. Founder Kunal Bhatia’s defense is sovereignty — that the real risk is three or four labs owning self-improving tech, hence the MIT license ¹¹ — but the Claude Sonnet dependency in the orchestrator means you can’t actually run the loop without a frontier API key.

That AgentBreeder finding is the unmeasured shoe waiting to drop. SIA’s own limitations section names a “co-evolutionary Goodhart’s Law” — harness and weights settling into a Nash equilibrium that exploits the verifier rather than the task — but the paper reports no held-out or cross-harness eval to rule it out. Until someone swaps the SIA-tuned weights into a fresh scaffold and watches the LawBench number, “70.1%” is a number about this specific co-evolved pair, not about gpt-oss-120b learning Chinese criminal law.

ToolMaze finds agent recovery scales 3.66× slower than skill

TL;DR

Recovery scales 3.66× slower than task execution as LLMs grow in parameter count.
Implicit “valid-but-wrong” tool outputs drop recovery rates 37% vs. explicit HTTP errors.
Under implicit-permanent faults, agents burn >70% of their token budget on futile retries.
A “tools may fail” warning prompt buys only +1.5 to +20.8 points — nowhere near closing the gap.

The happy-path fallacy

Most tool-use benchmarks assume tools work. ToolMaze (arXiv:2606.05806) breaks them on purpose, then measures whether agents notice. The framework structures tasks as Directed Acyclic Graphs of tool calls across four topological complexity levels (linear, 1-to-N, many-to-many, integrated multi-branch) and crosses them with a 2×2 perturbation taxonomy: errors are either explicit (HTTP 404, exception) or implicit (a negative inventory count returned as valid JSON), and either transient (retry works) or permanent (find another path). The result is 2,000 task instances built from a 270-tool corpus across six domains.

	Transient	Permanent
Explicit	Retry the call	Reroute through DAG
Implicit	Detect, then retry	Detect, then reroute

Detection is the hard part. The bottom row is where agents collapse.

Recovery is not proficiency

The headline number: Task Success Rate in clean conditions improves by 17.85 percentage points per order of magnitude of parameters. Perturbation Recovery Rate improves by only 4.88. That 3.66× gap is the paper’s core claim — scaling buys you a better executor, not a better debugger. Gemini-3.1-Pro-Preview topped the composite leaderboard, but Claude-Sonnet-4.6 still fell from 77% clean success to dramatically lower numbers under permanent faults.

This lines up with parallel work. Cleanlab’s $pass^k$ analysis of τ-bench finds GPT-4o-class models succeed on fewer than half of multi-trial runs even without injected failures — reliability and proficiency are decoupled across benchmarks ¹³. Sierra’s τ²-bench attacks the same gap from another angle, testing whether agents can coach a simulated human through recovery in dual-control environments ¹⁴.

The trust gap is the real story

The 37.15% PRR collapse between explicit and implicit failures is the finding worth internalizing. Agents handle structured errors reasonably — a 404 is a clear signal to backtrack. They fail when a tool returns plausibly-shaped garbage. ToolFailBench, working from a different taxonomy (Tool-Skip, Result-Ignore, Output-Fabrication, Unnecessary-Tool-Use), names this directly and adds a useful nuance ToolMaze’s aggregates obscure: model-family signatures. Llama-3.1 variants exhibit an “Always-Call” pattern; Qwen2.5-72B is markedly more disciplined ¹⁵. Palo Alto’s Unit 42 reframes the same pathology as a supply-chain security issue and proposes Behavioral Integrity Verification — checking an agent’s claimed actions against actual tool execution — as a mitigation ¹⁶.

Under implicit-permanent faults, the Recovery Cost metric exceeds 70%, meaning the agent spends most of its budget on redundant calls before failing. Prompting it that tools might break helps marginally (+1.5 to +20.8 points) but doesn’t restore baseline performance.

The deterministic-shell caveat

ToolMaze evaluates frontier models as bare reasoners. Production deployments don’t. LangGraph attaches RetryPolicy and TimeoutPolicy at the node level and resumes from state checkpoints at the exact failed call rather than restarting the workflow ¹⁷. Practitioners cap iterations with circuit breakers (typically 3–5) explicitly to prevent the “token snowballing” ToolMaze quantifies as >70% Recovery Cost ¹⁸. The fair question the paper doesn’t engage: how much of the 3.66× gap closes when agents are evaluated inside their orchestration shells rather than as raw policies? Until someone runs that experiment, ToolMaze is best read as a lower bound on agent robustness — a useful one, but a floor, not a forecast.

Round-ups

HarnessForge co-evolves agent harness and policy for mixed tasks

HarnessForge argues that updating model weights alone fails when tasks demand different execution paradigms. The system pairs fault-guided harness tailoring with harness-conditioned policy alignment, co-evolving the harness-policy pair so agents adapt at the system level rather than per component.

OpenSkill bootstraps LLM agent skills with zero task supervision

OpenSkill lets agents grow their own skills and verification signals from open-world resources, skipping target-task labels entirely. The framework runs a bootstrapped learning loop over self-built virtual tasks, then transfers the learned skills to downstream benchmarks with high automated pass rates.

Socratic-SWE mines solving traces to self-improve coding agents

Socratic-SWE turns historical solving traces into structured repair patterns, then uses execution-based validation and a solver-gradient alignment reward to curate a curriculum. The closed-loop approach beats prior self-evolving baselines across SWE-bench Verified, Lite, Pro and Terminal-Bench 2.0.

Critic-R closes the loop between search agents and retrievers

Critic-R wires a critic model between reasoning agents and retrieval systems, producing introspective reasoning traces that drive both query refinement and retriever fine-tuning. The dual optimization yields automatic supervision for agentic search without hand-labeled relevance data.

Astra pairs VLMs with world simulator for spatial reasoning

Astra augments vision-language models with action-conditioned visual imagination, calling a Bagel-based world simulator to render novel views during reasoning. An RL-trained policy decides when to imagine, lifting scores on MMSI-Bench and MindCube through view-consistency tuning and tool-use exploration.

On-policy distillation converts autoregressive LLMs into diffusion models

The self-OPD recipe retrains causal-attention LLMs as bidirectional diffusion language models by distilling on the student’s own confidence-based decoding trajectories. Removing the train-inference mismatch cuts the token budget needed versus standard knowledge distillation while preserving downstream quality.

Stream3D-VLM brings real-time 3D understanding to streaming video

Stream3D-VLM treats 3D scene understanding as autoregressive next-token prediction over streaming frames, fusing Visual-Spatial Feature Integration with Geometry-Adaptive Voxel Compression to keep token counts tractable. Training uses spatio-temporal 3D QA pairs designed for online, incremental geometry priors.

Weights & Biases report on OpenAI PaperBench — https://wandb.ai/byyoung3/ml-news/reports/OpenAI-s-new-research-agent-benchmark-Paperbench---VmlldzoxMjEyODQ5NA

even the most advanced models—such as Claude 3.5 Sonnet and OpenAI’s o1—achieve replication scores of only 21% to 24%, while human ML PhDs reach approximately 41%

↩
Cerebras blog: ‘How to stop your autoresearch loop from cheating’ — https://www.cerebras.ai/blog/how-to-stop-your-autoresearch-loop-from-cheating

agents with loose guardrails often abandon their original research goals to pursue unintended ‘side quests’… an agent tasked with memory optimization might instead spend hours investigating model weight limits

↩
GitHub Discussion #406, karpathy/autoresearch — https://github.com/karpathy/autoresearch/discussions/406

most improvements found by agents—such as batch size or learning rate adjustments—could be achieved more efficiently through classical Bayesian optimization

↩
PMC study on GPT-5.1 as scientific judge — https://pmc.ncbi.nlm.nih.gov/articles/PMC12923865/

GPT-5.1 often fails to recognize retracted articles, frequently evaluating discredited research as high-quality… producing different verdicts for the same prompt in 27% of test cases

↩
MedResearchBench (arXiv 2606.07591v1 discussion) — https://arxiv.org/html/2606.07591v1

RCBench may be limited by its focus on well-defined computational experiments… real-world research—particularly in clinical and medical fields—requires navigating messier variables like missing data and confounding factors

↩
Augment Code ‘Auggie tops SWE-bench Pro’ blog — https://www.augmentcode.com/blog/auggie-tops-swe-bench-pro

agentic scaffolding—the instructions and tools surrounding the model—often impacts performance more than the model weights alone

↩
Sakana AI — Darwin Gödel Machine — https://sakana.ai/dgm/

DGM iteratively modifies its own Python codebase, evaluates variants on SWE-bench, and archives successful mutations — improving SWE-bench performance from 20.0% to 50.0% without touching model weights.

↩
MarkTechPost — https://www.marktechpost.com/2026/05/29/hexo-labs-open-sources-sia-a-self-improving-agent-that-updates-both-the-harness-and-the-model-weights/

When the Feedback-Agent determines that scaffold edits have hit an accuracy plateau, it initiates weight updates using Low-Rank Adaptation (LoRA)… The implementation defaults to a LoRA rank of 32… uses gpt-oss-120b for the Target Agent, while the orchestrating Meta and Feedback agents run on Claude Sonnet.

↩
BriefGlance — https://briefglance.com/articles/hexo-labs-open-sources-ai-claiming-a-350x-leap-to-superintelligence

The ‘350x’ claim refers to the accelerated rate of improvement compared to human-in-the-loop development cycles… at the time of Hexo Labs’ announcement, the official MLE-bench leaderboard was taken offline for maintenance to address fairness and comparability issues.

↩
Hacker News thread on Hexo Labs SIA — https://news.ycombinator.com/item?id=48511908

compute isn’t free… hard to imagine which organizations would fund the massive electricity bills required for recursive training outside of big tech; other commenters dismissed the release as part of an ‘IPO roadshow.’

↩
Techstrong.ai (Hexo Labs founder interview) — https://techstrong.ai/articles/superintelligence-must-be-open-source-hexo-labs/

If just three or four companies own this kind of technology, I’m not sure it’s a good idea — founder Kunal Bhatia framing the MIT-licensed release as ‘sovereignty’ against frontier-lab concentration.

↩
OpenReview — AgentBreeder paper — https://openreview.net/pdf?id=ikrQWGgxYg

Evolutionary search can find ‘blue’ scaffolds that improve performance by nearly 80%, but ‘red’ adversarial scaffolds also emerge that are highly capable yet structurally vulnerable — evidence that co-evolving scaffold and weights against a fixed verifier produces fragile equilibria.

↩
Cleanlab blog on τ-bench — https://cleanlab.ai/blog/tau-bench/

Even state-of-the-art models like GPT-4o may succeed on less than 50% of tasks, with consistency dropping sharply in complex domains… the pass^k metric reveals a ‘reliability gap’.

↩
Sierra blog: τ²-bench — https://sierra.ai/blog/benchmarking-agents-in-collaborative-real-world-scenarios

τ²-bench introduces ‘dual-control’ environments where both the agent and a (simulated) user can modify the world state, testing the agent’s ability to guide a human collaborator through tasks the agent cannot resolve alone.

↩
OpenReview: ToolFailBench — https://openreview.net/forum?id=JhaxRN8QDV

ToolFailBench uses a four-part taxonomy: Tool-Skip, Result-Ignore, Output-Fabrication, and Unnecessary-Tool-Use… Llama-3.1 models exhibit an ‘Always-Call’ pattern even when tools are unnecessary, whereas Qwen2.5-72B shows significantly more discipline.

↩
Palo Alto Unit 42: AI Agent Supply Chain Risks — https://unit42.paloaltonetworks.com/ai-agent-supply-chain-risks/

Behavioral Integrity Verification (BIV) compares an agent’s claimed behavior against its executable code and natural-language instructions to catch ‘malicious or sloppy’ deviations in third-party skills.

↩
LangChain blog: Fault Tolerance in LangGraph — https://www.langchain.com/blog/fault-tolerance-in-langgraph

RetryPolicy and TimeoutPolicy can be attached to individual nodes… an agent can crash, be restarted from a state checkpoint, and resume exactly at the failed tool-call node rather than restarting the entire workflow.

↩
Towards AI: Building Retries in Agents — https://pub.towardsai.net/building-retries-in-agents-how-to-build-ai-agents-that-survive-failures-32eedd2623f0

Circuit breakers stop the agent after a set number of failed iterations (typically 3–5) to prevent ‘token snowballing,’ where the cost of a single task scales exponentially due to a growing error history.

↩

SIA fuses scaffold+LoRA, RCBench caps agents at 21.5/50, ToolMaze lags 3.66×

TL;DR

RCBench scores top AI research agents at 21.5 of 50

TL;DR

A 50 means you matched the paper. Nobody’s close.

The ~20% ceiling is showing up everywhere

Caveats that matter more than the leaderboard

Hexo Labs’ SIA co-evolves scaffold and LoRA weights

TL;DR

The dual-lever loop

Where the numbers land

What the press release skips

ToolMaze finds agent recovery scales 3.66× slower than skill

TL;DR

The happy-path fallacy

Recovery is not proficiency

The trust gap is the real story

The deterministic-shell caveat

Round-ups

HarnessForge co-evolves agent harness and policy for mixed tasks

OpenSkill bootstraps LLM agent skills with zero task supervision

Socratic-SWE mines solving traces to self-improve coding agents

Critic-R closes the loop between search agents and retrievers

Astra pairs VLMs with world simulator for spatial reasoning

On-policy distillation converts autoregressive LLMs into diffusion models

Stream3D-VLM brings real-time 3D understanding to streaming video

Footnotes

Jack Sun, writing.