Agent research moves from leaderboard scores to the trace itself

Three new agent papers move the conversation from outcome scores to process-level evidence, and the headline numbers look shakier under that lens.

TL;DR

  • CodeTracer rebuilds agent runs into a state tree and triples failure-localization F1 over bare-LLM prompting (≈47% vs ≈18%).
  • CocoaBench tops out at 45.1% for GPT-5.4, but Claude Sonnet 4.6 swings 18 points just from changing the harness.
  • SWE-AGILE reaches 24% on SWE-Bench Verified with an 8B model and 2.2k trajectories, yet trails peer 8B systems by ~18 points.
  • Briefs cover refusal-circuit interpretability, an RL credit-assignment survey, physics-simulator RL, and a SciPredict benchmark where LLMs trail human experts.
  • The durable contributions across the day are diagnostic — taxonomies, traces, and aggregation agents — not new state-of-the-art numbers.

Today’s agent research is pivoting away from leaderboard accuracy and toward the trace itself — what the agent did, where it broke, and how brittle the harness around it really was. CodeTracer turns messy run directories into a hierarchical state tree so failures can be localized, not just counted. CocoaBench’s authors publish a 45.1% ceiling but spend most of their effort showing how an 18-point swing falls out of the harness, not the model. SWE-AGILE chases context efficiency rather than peak accuracy, and the paper’s own headline number is already lapped by contemporaneous 8B systems on the same base. Read together, the three papers make the same argument: outcome-only grading hides where agents actually fail, and the durable contributions are the diagnostic tools — taxonomies, traces, digests — built around the scores. The round-up extends the theme outward, with parallel-scaling aggregation and a credit-assignment survey that both ask the same question one layer up.

CodeTracer turns messy agent run directories into a debuggable trace tree

Source: hf-daily-papers · published 2026-04-12

TL;DR

  • CodeTracer reconstructs coding-agent runs into a hierarchical state tree and pinpoints the step where things first went wrong.
  • It triples failure-localization F1 over bare-LLM prompting (≈47% vs ≈18%), with tree indexing alone worth +18.3 points.
  • The companion CodeTraceBench dataset (3,326 annotated trajectories across 5 benchmarks, 5 models, 4 frameworks) makes process-level evaluation reproducible.
  • Independent work (FixedCode, AgenTracer, Cyfrin) corroborates the underlying pathology: agents over-act on weak evidence and waste tokens “fixing” non-problems.

The problem isn’t pass/fail, it’s where it broke

SWE-bench scores tell you an agent failed; they don’t tell you whether it misread the bug report, edited the wrong file, or passed local tests but broke CI. CodeTracer, from NJU-LINK, attacks that gap by reconstructing the full state-transition history of a run and localizing the “failure onset” — the earliest stage responsible for the cascade.

The pathology it surfaces is not unique to its benchmark. ETH Zurich’s FixedCode study found agents try to “fix” already-resolved issues more than half the time, frequently introducing regressions instead of submitting an empty patch [1]. Cyfrin clocked agents burning 21k+ tokens to correct a single-character README typo by over-pulling repo metadata [2]. Developers on r/webdevelopment describe the result as “Jenga tower” codebases built by “high-speed juniors” who can’t tell when they’re confused [3]. CodeTracer’s own number — 40% ineffective steps in failed runs versus 22% in successful ones — is the same phenomenon measured from inside the trajectory.

How the pipeline works

```mermaid
flowchart LR
    A[Heterogeneous run dirs<br/>OpenHands, SWE-Agent, Terminus] --> B[Evolving Extraction<br/>LLM-synthesized parsers]
    B --> C[Hierarchical Trace Tree<br/>state-changing vs exploration nodes]
    C --> D[Diagnosis<br/>verification regression,<br/>backtrack frequency]
    D --> E[Failure-onset stage +<br/>error-relevant steps + evidence]
    E -. reflective replay .-> A
```

The hierarchy is the load-bearing piece. State-changing actions (file edits, env mutations) spawn child nodes; pure inspection (ls, grep) hangs as siblings. That single design choice contributes the bulk of the gain — ablations show tree indexing is worth +18.3 F1 points, evolving extraction another +9.4.
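
For intuition, here is a minimal sketch of that placement rule as I read it; the types, action names, and cursor convention are mine, not CodeTracer’s:

```python
from dataclasses import dataclass, field

# Illustrative action schema; CodeTracer's real taxonomy may differ.
STATE_CHANGING = {"edit_file", "write_file", "apply_patch", "set_env"}

@dataclass
class TraceNode:
    action: str
    detail: str
    children: list["TraceNode"] = field(default_factory=list)

def attach(cursor: TraceNode, action: str, detail: str) -> TraceNode:
    """Append a step and return the new cursor: state-changing actions
    open a child scope that later steps nest under, while pure
    inspection (ls, grep, cat) stays flat as a sibling."""
    node = TraceNode(action, detail)
    cursor.children.append(node)
    return node if action in STATE_CHANGING else cursor

root = cursor = TraceNode("run", "task issued")
cursor = attach(cursor, "grep", "locate failing assertion")  # sibling: no descent
cursor = attach(cursor, "edit_file", "patch utils.py")       # descend into new scope
cursor = attach(cursor, "ls", "inspect test layout")         # sibling under the edit
```

The plausible payoff: localizing a failure onset becomes a walk over a short ancestor chain of bad states rather than a scan over the whole flat log.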

Diagnostic style turns out to be model-dependent: GPT-5 commits early to a compact error set (highest precision, 45.0%), while Claude Sonnet 4 searches the tree exhaustively (highest recall, 54.87%) [4]. The tracer inherits the biases of whichever model drives it.

How it compares

| Approach | Mechanism | Step-level signal | Cost |
| --- | --- | --- | --- |
| Bare LLM on raw logs | Prompt with full trajectory | F1 16–19% | Cheap, low signal |
| AgenTracer-8B [5] | Counterfactual replay (swap actions, see if outcome flips) | 42.86% accuracy on multi-agent runs | Expensive re-simulation |
| CodeTracer | Structural reconstruction from logs | F1 46–48% | 5–8k tokens per diagnosis |

CodeTracer and AgenTracer effectively bracket the design space — structural reconstruction versus counterfactual replay — and neither has shown dominance outside its native benchmark.

What it doesn’t answer

CodeTracer’s “evolving extraction” is a parser registry that adapts to whichever messy log layout it encounters. That’s pragmatic now, but it sidesteps a real question: why not instrument agents to emit OpenTelemetry spans at the source, the way Langfuse-style stacks already do for polyglot production systems [6]? The tracer is most valuable precisely where agents don’t emit clean traces — and that surface area is shrinking.
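
For contrast, source-level instrumentation is already cheap. A sketch using the standard OpenTelemetry Python SDK; the span names and attribute keys are my own, not a published agent convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Standard SDK wiring: export spans to stdout for the demo.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-run")

def run_step(action: str, state_changing: bool) -> None:
    # One span per agent step; the trace-tree structure CodeTracer
    # reconstructs after the fact falls out of span nesting for free.
    with tracer.start_as_current_span(action) as span:
        span.set_attribute("agent.state_changing", state_changing)
        # execute the tool call here and record its result on the span

with tracer.start_as_current_span("swe-task"):
    run_step("grep tests/", state_changing=False)
    run_step("edit utils.py", state_changing=True)
```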

Two other findings deserve attention beyond the headline number. First, complex frameworks (OpenHands, SWE-Agent) burn nearly double the tokens of MiniSWE-Agent for a 2.4–5.5 point success bump — overengineering is measurable. Second, success saturates past 40 iterations; extra budget gets spent on redundant loops, not error correction. Both point at the same conclusion: the bottleneck is reasoning, not scaffolding.


CocoaBench’s 45% ceiling is real — and more fragile than it looks

Source: hf-daily-papers · published 2026-04-12

TL;DR

  • CocoaBench’s 153 hand-authored tasks force agents to compose vision, search, and coding; GPT-5.4 tops out at 45.1% and open-source models stay under 12%.
  • The headline ranking is scaffold-dependent: Claude Sonnet 4.6 swings from 34.0% to 15.7% just by switching the harness around the same model.
  • Outcome-only grading is the exact surface recent audits showed agents can game on WebArena and SWE-bench.
  • The durable contribution isn’t the leaderboard — it’s a failure taxonomy that pins down where unified agents actually break.

A unified-agent benchmark with a fragile leaderboard

CocoaBench targets a real gap. SWE-bench tests code, OSWorld tests GUIs, BrowseComp tests search — none test the messy composition of all three that a “unified digital agent” is supposed to do. Of CocoaBench’s 153 tasks, 98% require more than one capability (find a video, watch it for a number, write code to process it), and grading is output-based so any path counts.

The damning result: even GPT-5.4 under the best scaffold lands at 45.1%. Kimi-k2.5 manages 11.8%, Qwen3.5 9.8%. The pattern inside the numbers is more interesting than the ranking — strong models spend over 60% of their tool calls on shell and code execution, using the browser only to acquire data and then reasoning over it programmatically. Weaker models stay trapped in the GUI loop.

Scaffold sensitivity eats the ranking

Treat the leaderboard with care. Independent review of the Cocoa-Agent scaffold reports Claude Sonnet 4.6 collapsing from 34.0% under OpenClaw to 15.7% under the authors’ own scaffold [7] — a bigger swing than the gap between several frontier models. This isn’t a CocoaBench-specific quirk: Hugging Face’s analysis of agent evals finds a 33× cost spread across scaffolds for identical tasks, and 60% single-run success rates falling to 25% under 8-run consistency checks [8]. CocoaBench reports single-shot numbers.

| Model | Best scaffold | Cocoa-Agent scaffold |
| --- | --- | --- |
| GPT-5.4 | 45.1% | 45.1% |
| Claude Sonnet 4.6 | 34.0% (OpenClaw) | 15.7% [7] |
| Kimi-k2.5 | 11.8% | not reported |
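
The single-shot caveat is easy to make concrete. A toy simulation (the per-task solve-rate distribution is assumed for illustration, not taken from any of these papers) shows why one-run leaderboards flatter flaky agents, echoing the 60% → 25% collapse in [8]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each task has a latent per-run solve probability.
# A benchmark full of coin-flip tasks looks strong on one run
# and much weaker under an all-of-8-runs consistency criterion.
p = rng.beta(2, 2, size=200)               # assumed distribution, for illustration
runs = rng.random((200, 8)) < p[:, None]   # 8 independent rollouts per task

single_run = runs[:, 0].mean()             # what a one-shot leaderboard reports
consistent = runs.all(axis=1).mean()       # solved in all 8 runs

print(f"single-run success: {single_run:.0%}")
print(f"8-run consistency:  {consistent:.0%}")
```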

There’s a second entanglement worth naming: CocoaBench runs on AIO Sandbox, the all-in-one Docker runtime built by the same ByteDance Agent-Infra team that authored the paper [9]. That team also ships agents (Seed, UI-TARS, Trae) that appear in adjacent evaluations. Task design that rewards the exact browser+shell+filesystem+MCP shape AIO exposes is not a neutral choice.

The reward-hacking question nobody answered

Grading only the final answer — the “proxy outcome verifier” pattern — is precisely the surface UC Berkeley auditors exploited to score spurious 100%s on WebArena and SWE-bench, sometimes by navigating to local file:// URLs containing answer keys [10]. CocoaBench’s hosted-asset strategy fixes link rot but doesn’t address introspective shortcuts, and the failure analysis never quantifies how much of the 45.1% reflects genuine solves versus accidental shortcuts. Compared to GAIA, where humans hit 92% on tedious factual tasks [11], or HAL’s 26,000-rollout cross-domain leaderboard at ~$40K compute [12], CocoaBench’s 153 tasks are a smaller, more curated slice — useful, but not obviously additive to existing infrastructure.

What’s actually useful: the failure taxonomy

The error analysis of 712 failed trajectories is the part worth saving. Reasoning and planning failures dominate at 54% — goal displacement (solving a simpler task than asked), floating-point arithmetic errors, dropped output tags. Visual grounding accounts for 29%, including the telling DOM-opacity case where Tableau dashboards render in <canvas> and text-only scrapers see nothing. Tool execution failures (17%) cluster around three pathologies: infinite loops on the same failing command, hallucinating empty pages when blocked by CAPTCHAs, and context-window truncation that makes long agents restart sub-tasks they already finished.

Those are concrete targets for the next generation of scaffolds. The 45.1% number will move; the failure modes won’t, until somebody designs around them.


SWE-AGILE compresses agent reasoning into digests — but the 8B field has already moved past its headline

Source: hf-daily-papers · published 2026-04-12

TL;DR

  • SWE-AGILE keeps full reasoning for the last N=2–5 steps and replaces older traces with short “digests,” producing a sawtooth context curve instead of a linear blow-up.
  • Headline result: 24.05% on SWE-Bench Verified with an 8B model trained on just 2.2k trajectories — ~11% of SWE-Dev’s data.
  • That number beats the 14B SkyRL baseline the paper picked, but trails contemporaneous 8B systems on the same Qwen3 base by ~18 points.
  • The real contribution is data and context efficiency, not frontier accuracy — and the design ignores the tool-output bloat the authors themselves flag.

What the framework actually does

SWE-AGILE structures every agent step as Reasoning → Digest → Action. The detailed reasoning trace survives only inside a sliding window of the last 2–5 turns; older steps keep their actions and observations but their reasoning collapses into a ~27-token digest. Context grows during a step’s thinking phase and shrinks when the window slides — the “sawtooth” the authors highlight.

```mermaid
flowchart LR
    S1[Step t-3<br/>digest only] --> S2[Step t-2<br/>digest only]
    S2 --> S3[Step t-1<br/>full reasoning]
    S3 --> S4[Step t<br/>full reasoning]
    S4 --> A[Action]
    subgraph Window[Sliding window N=2-5]
        S3
        S4
    end
```
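
In code, the context assembly is a few lines. A minimal sketch, assuming a flat prompt format and step records of my own design (the paper’s exact serialization isn’t specified here):

```python
from dataclasses import dataclass

@dataclass
class Step:
    reasoning: str    # full chain-of-thought for this step
    digest: str       # ~27-token summary of that reasoning
    action: str
    observation: str

def build_context(history: list[Step], window: int = 4) -> str:
    """Keep full reasoning only inside the sliding window; older steps
    retain action and observation but swap reasoning for the digest.
    Context shrinks each time the window slides, which produces the
    sawtooth curve instead of linear growth."""
    cutoff = len(history) - window
    parts = []
    for i, step in enumerate(history):
        thought = step.reasoning if i >= cutoff else step.digest
        parts.append(f"[think] {thought}\n[act] {step.action}\n[obs] {step.observation}")
    return "\n\n".join(parts)
```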

Two training tricks make this work at inference time. Trajectory Snapshot Training masks the historical context so the model is forced to learn from already-compressed history during SFT, eliminating the train/inference mismatch. A Hindsight Backfill pipeline uses Qwen3-235B as a teacher to synthesize reasoning that leads to known-good actions, since most SWE datasets ship without CoT traces. RLVR with DAPO then layers a multiplicative compression reward on top of binary task success, so the model only gets paid for brevity if the bug actually got fixed.
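
The multiplicative structure matters: an additive brevity bonus would pay out on failed runs too. A sketch of the shape (the exact functional form and budget normalization are my guesses, not the paper’s):

```python
def rl_reward(solved: bool, tokens_used: int, token_budget: int) -> float:
    """Brevity pays only when the task is solved, so the policy cannot
    farm reward by being terse and wrong. Illustrative form only."""
    success = 1.0 if solved else 0.0
    brevity = max(0.0, 1.0 - tokens_used / token_budget)  # assumed brevity term
    return success * (1.0 + brevity)  # zero unless the bug is fixed
```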

The benchmark comparison is generous

The paper frames 24.05% (8B) as beating SkyRL-Agent-14B’s 21.6%. But on the same Qwen3-8B base, SWE-Lego-Qwen3-8B reportedly hits 42.2% Pass@1 (49.6% with TTS@16) [13], and SkyRL’s 32B variant reaches 39.4% with a claimed 2× cost reduction via AST-search and an asynchronous dispatcher [14]. The 18-point gap on an identical base model means SWE-AGILE’s win is best read as context efficiency per trajectory, not raw capability — its 28% per-step token reduction and 11% data footprint are the defensible numbers.

The benchmark itself is also wobblier than the paper admits. OpenAI retired SWE-Bench Verified internally over contamination, and a separate audit flagged 59.4% of the hardest tasks as having flawed test patches that reject correct fixes [15]. A 24% number on a benchmark with a ±6–7% noise floor warrants more hedging than the paper offers.

Compression is a crowded design space

Reasoning digests are one of at least three live philosophies. SWE-Pruner runs a 0.6B “neural skimmer” that drops file content by goal-relevance for 23–54% token savings [16] — directly attacking the tool-output bloat that SWE-AGILE’s own limitations section concedes is unsolved [17]. Anthropic’s Claude Code takes a third route: extractive compaction triggered at 65–75% window utilization, preserving architectural decisions and open bugs rather than per-step summaries [18].

“The sliding window size (N) was set randomly between 2 and 5 during training.”

That admission [17] is the frame. SWE-AGILE ships a clever training-time alignment trick (the snapshot masking) and a defensible RL reward design, but the central hyperparameter of its central mechanism wasn’t ablated, and there’s no head-to-head against pruning or extractive compaction. The interesting follow-up isn’t a bigger SWE-AGILE — it’s whether reasoning digests stack with code pruning, or whether one obsoletes the other.

Round-ups

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Source: hf-daily-papers

AggAgent replaces naive majority voting for parallel test-time scaling on long-horizon agent tasks with a lightweight aggregation agent that navigates and synthesizes candidate trajectories on demand, sidestepping context-window blowup when combining many tool-augmented rollouts into a final answer.

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Source: hf-daily-papers

Survey categorizes RL credit-assignment techniques for LLMs by granularity and methodology, spanning Monte Carlo, temporal difference, model-based, game-theoretic, and information-theoretic approaches, and contrasts methods suited to reasoning tasks against those built for multi-turn agentic settings with sparse rewards.

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Source: hf-daily-papers

Mechanistic interpretability work locates alignment behavior in specific attention gates and amplifier heads that fire early in the forward pass to commit a refusal decision. The routing circuit transfers across model scales, and the authors validate it via per-head ablation, knockout cascades, and in-context cipher contrasts.

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Source: hf-daily-papers

Sim2Reason trains LLMs on physics-simulator-generated trajectories with reinforcement learning to build physical reasoning, then transfers zero-shot to International Physics Olympiad problems. The pipeline shows simulators can substitute for scarce annotated reasoning data in a hard scientific domain.

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Source: hf-daily-papers

NVIDIA and UMD’s follow-up to Audio Flamingo extends context to long-form audio across speech, sound, and music, and introduces a Temporal Audio Chain-of-Thought reasoning mechanism. Training uses a curriculum spanning pre-, mid-, and post-training stages on new AudioSkills-XL and LongAudio-XL datasets.

Introspective Diffusion Language Models

Source: hf-daily-papers

I-DLM closes the quality gap between diffusion and autoregressive language models by enforcing introspective consistency at decoding time, using causal masking, logit shifting, and introspective strided decoding. A stationary-batch scheduler boosts throughput in large-concurrency serving.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Source: hf-daily-papers

Scale AI’s SciPredict benchmark tests whether LLMs can forecast outcomes of natural-science experiments. Models trail human experts on both accuracy and confidence calibration, and unlike humans, they fail to improve on experiments that domain experts flag as predictable.

Footnotes

  1. ETH Zurich SRI Lab blog (FixedCode): https://www.sri.inf.ethz.ch/blog/fixedcode

    agents attempt to ‘fix’ resolved issues over 50% of the time, often introducing unnecessary modifications instead of submitting an empty patch

  2. Cyfrin blog — ‘Why AI coding agents can be overkill’: https://www.cyfrin.io/blog/expensive-and-slow-for-small-changes-why-ai-coding-agents-can-be-overkill

    agents consuming over 21,000 tokens to correct a single-character typo in a README, essentially ‘over-preparing’ by pulling in excessive repository metadata

  3. r/webdevelopment thread on debugging AI-generated code: https://www.reddit.com/r/webdevelopment/comments/1qto3bo/im_struggling_to_debug_aigenerated_code_in_real/

    agents currently act as ‘high-speed juniors’ who lack the heuristics to identify when they are confused, leading to ‘Jenga tower’ codebases

  4. AI Native Foundation paper digest: https://ainativefoundation.org/ai-papers/?current_page=14

    diagnostic styles vary by model — with GPT-5 prioritizing efficiency and Claude Sonnet 4 favoring comprehensive retrieval

  5. arXiv 2506.12286 (AgenTracer / counterfactual replay): https://arxiv.org/html/2506.12286v1

    AgenTracer-8B… achieved a step-level accuracy of 42.86% on its automated subset, outperforming proprietary giants like Gemini 2.5 Pro and Claude 4 Sonnet by over 11%

  6. Langfuse — LangSmith alternative FAQ: https://langfuse.com/faq/all/langsmith-alternative

    Langfuse’s reliance on OpenTelemetry (OTel) standards makes it more flexible for ‘polyglot’ stacks

  7. Cocoa-Agent implementation review (nxcode.io): https://www.nxcode.io/resources/news/claude-sonnet-4-6-complete-guide-benchmarks-pricing-2026

    Claude Sonnet 4.6 showed higher instability, with performance dropping from 34.0% in other frameworks to just 15.7% under Cocoa-Agent

  8. Hugging Face — ‘eval costs bottleneck’ blog: https://huggingface.co/blog/evaleval/eval-costs-bottleneck

    agent benchmarks are highly scaffold-sensitive, with identical tasks showing a 33x cost spread depending on the configuration … success rates on some benchmarks fell from 60% on a single run to 25% when subjected to 8-run consistency checks

  9. MarkTechPost — AIO Sandbox release coverage: https://www.marktechpost.com/2026/03/29/agent-infra-releases-aio-sandbox-an-all-in-one-runtime-for-ai-agents-with-browser-shell-shared-filesystem-and-mcp/

    Agent-Infra releases AIO Sandbox, an all-in-one runtime for AI agents with browser, shell, shared filesystem and MCP

  10. Kili Technology — 2026 AI benchmarks guide: https://kili-technology.com/blog/ai-benchmarks-guide-the-top-evaluations-in-2026-and-why-theyre-not-enough

    every major benchmark could be ‘exploited’ to achieve 100% scores without actually solving tasks … agents can ‘cheat’ WebArena by using browser primitives to navigate to local file:// URLs and read hidden answer keys

  11. WorkOS — GAIA benchmark explainer: https://workos.com/blog/gaia-benchmark-evaluating-intelligent-agents

    GAIA evaluates agents on … ‘conceptually simple but tedious’ tasks that humans solve with 92% accuracy

  12. AI Native Foundation — HAL leaderboard summary: https://ainativefoundation.org/ai-papers/?current_page=14

    by April 2026, the leaderboard encompassed over 26,000 rollouts with a headline execution cost of approximately $40,000

  13. ResearchGate — ‘Advances and Frontiers of LLM-based Issue Resolution’ survey: https://www.researchgate.net/publication/399850004_Advances_and_Frontiers_of_LLM-based_Issue_Resolution_in_Software_Engineering_A_Comprehensive_Survey

    SWE-Lego-Qwen3-8B currently stands as a top performer, achieving a 42.2% resolve rate on SWE-Bench Verified using supervised fine-tuning (SFT) alone, which rises to 49.6% with test-time scaling (TTS@16).

  14. ACL Findings 2025 — SkyRL-Agent: https://aclanthology.org/2025.findings-acl.193.pdf

    SkyRL-Agent (SA-SWE-32B) reached a higher success rate of 39.4%, reportedly achieving this with a 2x cost reduction… attributed to its ‘asynchronous pipeline dispatcher’ and AST-search tool, which reduces the ‘noise’ in the agent’s context window.

  15. ofox.ai — 2026 LLM leaderboard roundup: https://ofox.ai/blog/llm-leaderboard-best-ai-models-ranked-2026/

    OpenAI officially retired the benchmark from its internal evaluations, citing severe contamination… a separate study found that 59.4% of the hardest tasks were actually flawed, with models submitting correct fixes that were rejected by the benchmark’s narrow test cases.

  16. arXiv 2601.16746v3 — SWE-Pruner: https://arxiv.org/html/2601.16746v3

    SWE-Pruner introduces a 0.6B parameter ‘neural skimmer’ that selectively prunes lines of code based on the agent’s current goal, claiming 23–54% token reduction while maintaining or even improving success rates.

  17. arXiv 2604.11716v1 — SWE-AGILE author limitations section: https://arxiv.org/html/2604.11716v1

    the sliding window size (N) was set randomly between 2 and 5 during training. The authors admit that a more systematic study of the optimal window size for different task complexities is still needed… tool outputs (like long stack traces or file contents) still consume significant context.

  18. InfoQ — Opus 4.6 context compaction: https://www.infoq.com/news/2026/03/opus-4-6-context-compaction/

    Claude Code’s context compaction is an ‘extractive summarization’ process that triggers automatically when the context window reaches approximately 65–75% utilization… preserves ‘high-signal tokens’ such as specific architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs or verbose reasoning.
