Code harness lifts GPT-5.5 26pt, Llama skips tool calls, TDDev’s 65.8% rebutted

TL;DR

UIUC, Meta, Stanford lift GPT-5.5 from 61.5% to 87.2% by swapping the native loop for Cursor’s harness.
Llama and Qwen skip a needed tool call 26-54% on arithmetic even when probes show the necessity signal.
Probe&Prefill decoder intervention cuts unnecessary API calls 48-56% without retraining the model.
WebGen-Agent authors rescore TDDev’s 65.8% headline at 44.0% on the full 101-task benchmark.
Paraphrase attacks cut code-reasoning 42%, undercutting the executability-equals-verifiability pitch.

Today’s three research features all swing on something other than the model. UIUC, Meta, and Stanford formalize Code as Agent Harness and show the same GPT-5.5 jumps 26 points just by moving from a native loop into Cursor. Cheng et al. find Llama and Qwen reliably encode I need a tool in their hidden states, but the necessity direction goes nearly orthogonal at the final readout token — so a decoder-side prefill beats retraining. And the WebGen-Agent authors rescore TDDev’s 65.8% headline at 44.0%, with the gap living in benchmark protocol rather than agent skill.

The pattern matters because vendor benchmarks increasingly report harnessed scores, and reproducibility now turns on which scaffold ran which prompt against which eval. The three drops show why the harness, the decoder, and the eval protocol each deserve their own diagnostics — and why a clean weight-level number, on its own, is no longer the whole story.

Code as Agent Harness reframes runtime as agent bottleneck

TL;DR

UIUC, Meta, and Stanford formalize “Code as Agent Harness” — code as the executable, inspectable, stateful substrate.
Same GPT-5.5 jumps from 61.5% to 87.2% moving from a native loop into Cursor’s harness.
A Plan-Execute-Verify loop with an “Evolution Agent” rewrites tool specs, retries, and orchestration from telemetry.
Paraphrase attacks cut code-reasoning 42%, undercutting the “executability ⇒ verifiability” pitch.
“Harness debt”: self-rewriting infrastructure becomes its own opaque failure surface, harder to debug than the model bugs it replaces.

The harness is the new variable

The survey’s claim is blunt: natural language is too sloppy for long-horizon agents, so the field is migrating to code as the operational medium — because code is executable (verifiable outcomes), inspectable (logs, variable states, control-flow graphs), and stateful (the program itself is the running record of task progress). The interesting move is what that implies: the gating variable for autonomy is now the harness — sandboxes, APIs, memory, retry policies — not the next model checkpoint.

Practitioner data backs the diagnosis hard. MindStudio’s benchmark sweep shows the same GPT-5.5 jumps from 61.5% to 87.2% just by being lifted out of a native ReAct loop and dropped into Cursor’s harness ¹ — a delta bigger than most generational model upgrades. AutoAgent goes further: letting a meta-agent rewrite its own tool descriptions and orchestration logic for 24 hours produced #1 finishes on SpreadsheetBench (96.5%) and TerminalBench (55.1%) ². These are exactly the “Evolution Agent” numbers the survey gestures at but does not itself measure.

What the taxonomy actually says

The paper organizes the literature into three layers, with the Plan-Execute-Verify loop as the connective tissue:

flowchart TB
    subgraph L1["Layer 1 — Harness Interface"]
        R[Code for Reasoning<br/>PoT, Chain-of-Code, Lean4]
        A[Code for Acting<br/>scripts, GUI, APIs]
        E[Code for Environment<br/>world models, room programs]
    end
    subgraph L2["Layer 2 — PEV Loop"]
        P[Plan: contract formation] --> X[Execute: sandboxed run]
        X --> V[Verify: tests, telemetry]
        V -.evolve.-> H[Harness rewrite]
        H --> P
    end
    subgraph L3["Layer 3 — Multi-Agent Scaling"]
        B[Shared blackboard state]
        T[Adaptive DAG topologies]
    end
    L1 --> L2 --> L3

Memory gets reframed as “context engineering” — structured state summaries and repository evidence rather than naked vector retrieval. Planning becomes “contract formation”: identify the files and invariants before touching anything. The headline metrics are domain-specific but consistent: QualityFlow hits 98%+ precision predicting MBPP test outcomes via “imagined execution”; LingmaAgent autonomously resolves 16.9% of in-house issues end-to-end; El Agente Q clears 87% of a chemistry suite by wrapping simulators as callable functions.

Where the framing strains

The survey is honest about open problems, but outside voices push harder. The Chain-of-Code Collapse study shows that surface-level prompt perturbations preserving logic can drop accuracy by up to 42% ³ — a direct challenge to the “executability ⇒ verifiability” pitch, since the executed program itself is unstable under paraphrase. Multi-agent failure attribution remains rough: SDBL pushes step-level accuracy from 8% to 32% ⁴, real progress, but nowhere near production-usable against the survey’s own 14–53% baseline.

The sharpest dissent comes from Pappas, who coins harness debt:

That maps onto the operational reality Modal documents — SWE-bench-scale harnesses now run on gVisor or Firecracker microVMs and clear 500 tasks in ~7 minutes, but patches routinely pass the sandbox and fail the real repo ⁶. The survey calls this the “oracle adequacy crisis”; in production it is where most of the bleeding happens.

Takeaway

The community consensus the survey both reflects and accelerates is that harness > model. The honest reading of the surrounding literature is that self-evolving harnesses are not dissolving the autonomy bottleneck so much as relocating it — from model reasoning into infrastructure that is increasingly opaque, adversarially fragile, and unattributable when it breaks.

Llama and Qwen know they need a tool, then don’t call it

TL;DR

Cheng et al. (UMD) show LLMs’ hidden states reliably encode “I need a tool” on arithmetic and TruthfulQA.
The same models then skip the call 26.5–54% of the time on arithmetic, 30.8–41.8% on TruthfulQA.
Necessity and action directions go nearly orthogonal at the final readout token that drives generation.
Probe&Prefill, a sibling intervention, cuts unnecessary API calls by 48–56% — the fix is decoding, not training.
Tool-selection accuracy drops 100%→42% when descriptions are shuffled, so probes may be reading lexical cues.

The gap

A new paper from the University of Maryland argues that the standard story about LLM tool use — small models don’t know when they need help — is wrong. Across Qwen3-4B/8B and Llama-3.1-8B/3.2-3B, the models’ hidden states quite reliably encode whether a problem exceeds their unaided capability. They just don’t act on it. On a 4,000-instance arithmetic benchmark, the mismatch between what the model “knows” it needs and what it actually invokes runs from 26.5% to 54.0%. On TruthfulQA, 30.8% to 41.8%.

Crucially, the authors redefine ground truth. Rather than letting a human label “this needs a calculator,” they run each model 10× at T=0.7 without tools: if it ever fails, the task is tool-necessary for that model. This model-adaptive grounding exposes the gap as a property of execution, not perception.

Where the signal dies

Linear probes on the residual stream recover necessity with MCC > 0.4 and action intent even more cleanly. The interesting result is geometric. The necessity direction $w_c$ and the action direction $w_a$ show some alignment in middle layers, then rotate to cosine similarity ≈ 0 at the last layer of the last query token — the position that actually drives generation. The model’s belief and the model’s behavior are encoded in subspaces that don’t talk to each other at readout.

This is not a low-confidence artifact. The orthogonality persists when the necessity probe outputs probabilities near 0 or 1. The authors call it a “confidence paradox”: the cleaner the internal signal, the more striking the failure to translate it into a tool call.

Convergent evidence

The pattern lines up with parallel findings elsewhere in mech-interp. The When2Tool group independently shows tool necessity is decodable from the last input token with AUROC > 0.90, and their Probe&Prefill intervention — bypass the decoder, prefill the tool-call prefix when the probe fires — cuts unnecessary calls by 48–56% with <2% accuracy hit ⁷. Apollo Research’s deception probes recover honest-vs-deceptive reasoning at >90% accuracy from hidden states the readout suppresses ⁸. A recent thermodynamic analysis of sycophancy describes the same architecture: deep representational truth, late-layer preference shift that overrides it ⁹. Cheng et al.’s orthogonality result is the cleanest geometric account so far.

The caveat the paper soft-pedals

Mechanistic work on tool selection (as opposed to invocation) is less flattering. Shuffling tool descriptions while preserving names drops selection accuracy from 100% to 42% ¹⁰, implying the “decision” is often lexical lookup rather than semantic reasoning. If the cognition probe is reading similar surface features, the framing of a sophisticated-but-paralyzed model gets weaker. Practitioner reports also note sub-7B models simply fail the “detour problem” — they can’t break out of free-text completion to emit structured JSON, regardless of prompt ¹¹. Some fraction of the Llama-3.2-3B gap may be missing competence, not blocked translation.

Why it matters for benchmarks

The portable idea is the methodological one. BFCL’s AST validation and ToolBench’s LLM-as-judge both treat necessity as a static prompt label ¹². Cheng et al. show that label is incoherent across model scales — the same query is mandatory for a 3B and trivial for a 70B. The next generation of agent benchmarks needs capability-relative ground truth, or it will keep scoring large models for “correctly skipping” tools they never needed and punishing small ones for failures that are actually two distinct bugs.

TDDev’s 65.8% web-gen score faces a 44% counter-claim

TL;DR

WebGen-Agent authors dispute TDDev’s headline, scoring it at just 44.0% on the full 101-task WebGen-Bench.
TDDev’s own paper reports 34–48pp gains, peaking at 65.8% with Claude Sonnet 4.6.
The framework wraps coding agents in a Playwright-driven test-and-repair loop over the browser accessibility tree.
Worst configuration (Qwen + Incremental) burned 2,327K tokens per percentage point of accuracy gained.

What TDDev actually builds

TDDev is a three-stage TDD harness for full-stack web generation: persona-driven acceptance tests are generated from requirements, a lightweight agent executes them via Playwright against the browser’s accessibility tree, and failures are translated into structured repair reports the coding agent can act on. The substrate choices are sensible — accessibility snapshots run 200–400 tokens vs. thousands for raw DOM, and role/name locators survive UI refactors that would shatter CSS selectors ¹³.

The paper’s most interesting finding isn’t the topline accuracy; it’s that model generation style dictates protocol fit. Claude Sonnet rewrites files holistically and thrives in the agentic loop (65.8% vs. 31.3% baseline) but gains nothing from the Incremental protocol because it overwrites earlier features. Qwen patches conservatively, hits 71.4% under Incremental, and collapses into buggy spaghetti under Agentic. There is no universal TDD wrapper — the loop has to match the model’s editing temperament.

The headline number is contested

The 65.8% figure is the load-bearing claim, and it’s being challenged in public. The WebGen-Agent submission reports that on the full 101-instruction WebGen-Bench test set, TDDev scores only 44.0%, while WebGen-Agent with Qwen3-Coder reaches 58.2% ¹⁴. The same critique alleges TDDev’s published headline was obtained on a 10-instruction subsample with golden design images supplied as hints ¹⁴. If that holds, the 34–48pp lift narrative needs an asterisk.

Context makes the gap matter more. On the broader benchmark, DeepSeek-R1 + Bolt.diy lands at 27.8%, and the fine-tuned WebGen-LM-32B already hits 38.2% — beating proprietary models without any agent loop ¹⁵. FullStack-Agent (Feb 2026) uses TDDev as a baseline and beats it by adding repository back-translation, noting TDDev “produces relatively simple codebases that struggled with complex database implementations” ¹⁶. TDD infrastructure helps; it is no longer the discriminator.

The cost and false-negative tax

TDDev’s own ablation shows the testing agent false-fails 25% of correct applications because UI strings or labels don’t exactly match the test’s expected text ¹⁷. The authors frame this as a conservative bias — better than false passes — but every false negative triggers another repair round. Independent cost analysis finds agentic loops burn ~1000× the tokens of single-shot chat, with 85% of spend on input tokens because each turn resends the full codebase, logs, and history ¹⁸. That matches the paper’s own 2,327K-tokens-per-point figure for Qwen+Incremental.

Takeaway

TDDev is plausible engineering with a useful empirical finding about model-protocol fit. But the “zero manual intervention, 65.8% accuracy” framing carries two attached invoices: a contested benchmark result that may shrink to 44% under independent scoring ¹⁴, and a token bill the paper acknowledges but the “hands-off developer” pitch elides. Read it as evidence that the TDD wrapper is now table stakes — not as a leaderboard result.

Round-ups

AgentKernelArena benchmarks GPU kernel agents on unseen configs

AMD’s open-source AgentKernelArena scores coding agents on full GPU kernel optimization workflows, not just one-shot outputs. It tests HIP-to-HIP, Triton-to-Triton, and PyTorch-to-HIP translation while checking compilation, correctness, and performance on configurations the agent never saw during development.

OProver unifies agentic theorem proving in Lean 4

OProver trains theorem-proving agents through continued pretraining plus iterative SFT and RL on compiler-verified proofs. The framework folds Lean 4 verifier feedback directly into the loop, letting proof synthesis improve from its own successful attempts rather than static demonstration data.

Probe trajectories track reasoning dynamics inside chain-of-thought

Watching hidden-state probes evolve across a reasoning trace beats static snapshots for predicting model behavior. Signal-processing features over probe trajectories—replacing simple average- or max-pooling—raise AUROC on safety monitoring of large reasoning models by exploiting temporal structure in the internal monologue.

Contrastive neuron search isolates a refusal gate in MLPs

Contrasting harmful and benign prompts surfaces a small set of MLP neurons that act as a targetable refusal gate in instruction-tuned models. Modulating them shifts jailbreak refusal rates without degrading output quality, exposing how alignment fine-tuning concentrates discrimination structure.

Agent Bazaar stress-tests LLMs as economic agents

Agent Bazaar simulates LLM agents trading in marketplaces and measures systemic risks like algorithmic instability and Sybil deception. Training with REINFORCE++ on an adaptive curriculum raises the framework’s Economic Alignment Score, curbing destabilizing behavior without hand-coded rules.

Symmetry-aware optimizers beat Adam on embeddings and MoE routers

Matching optimizer updates to each weight block’s equivariance—permutation symmetry in embeddings and LM heads, shared-shift in SwiGLU and MoE routers—improves pretraining stability over coordinate-wise Adam. The hybrid row-norm and spectral updates extend Muon and Scion across Qwen3-0.6B, Gemma 3 1B, OLMoE, and gpt-oss.

Auto-research roadmap maps where AI helps and fails in science

AI handles structured research tasks like literature synthesis and analysis reliably but stumbles on novel ideation and scientific judgment, the roadmap finds. The accompanying benchmark suite and tool inventory argue for human-governed collaboration rather than end-to-end automation across long-horizon research agents.

MindStudio — ‘Agent Harnesses Beat Model Upgrades’ — https://www.mindstudio.ai/blog/agent-harnesses-beat-model-upgrades-5-benchmarks

the same GPT-5.5 model achieved an 87.2% success rate in a specialized coding harness (Cursor) compared to only 61.5% in a native environment

↩
DecodeTheFuture — AutoAgent / Meta-Harness writeup — https://decodethefuture.org/en/autoagent-self-improving-ai-agents-meta-harness/

AutoAgent … achieved #1 rankings on SpreadsheetBench (96.5%) and TerminalBench (55.1%) after only 24 hours of autonomous optimization

↩
Preprints.org — Chain-of-Code Collapse study — https://www.preprints.org/manuscript/202503.1808

adversarial prompt perturbations—surface-level changes that preserve core logic—can cause accuracy to plummet by up to 42%

↩
arXiv — Scope Delineation Before Localization (SDBL) — https://arxiv.org/html/2603.25001v1

boosting step-level accuracy from 8% to over 32% in expert-curated logs

↩
Medium (E. Pappas) — ‘The Optimizer in the Loop’ — https://medium.com/@epappas/the-optimizer-in-the-loop-when-your-agent-starts-editing-its-own-harness-7e4a5fd6f2a8

self-evolving systems create a ‘harness debt,’ where the infrastructure intended to catch errors becomes its own source of complex, hard-to-debug failures

↩
Modal — ‘Best Sandboxes for SWE-bench Coding Agents’ — https://modal.com/resources/best-sandboxes-swe-bench-coding-agents

production-grade systems like Modal and E2B employ gVisor or Firecracker microVMs for hardware-level isolation … parallelizing 500-task benchmarks now takes roughly seven minutes

↩
When2Tool benchmark (Weng Lab) — https://lilywenglab.github.io/when2tool/

A linear probe on the hidden state of the last input token predicts tool necessity with AUROC > 0.90 across model families; Probe&Prefill reduces unnecessary API calls by 48-56% while staying within 1.7% of baseline accuracy.

↩
LessWrong / Apollo Research — Detecting strategic deception with linear probes — https://www.lesswrong.com/posts/9pGbTz6c78PGwJein/detecting-strategic-deception-using-linear-probes

Linear probes can distinguish honest from deceptive reasoning patterns with over 90% accuracy, acting as neural circuit breakers that catch misalignment before action.

↩
ResearchGate — Thermodynamic Analysis of Sycophancy in LLMs — https://www.researchgate.net/publication/398807300_Internal_Reasoning_vs_External_Control_A_Thermodynamic_Analysis_of_Sycophancy_in_Large_Language_Models

Sycophancy emerges in two stages: a deeper representational divergence where the model distinguishes truth, followed by a late-layer preference shift that overrides this internal knowledge during the final decoding phase.

↩
Agent4Science — Mechanistic interpretability of tool routing — https://agent4science.org/sciencesub/mechanistic-interpretability

Accuracy in tool selection drops from 100% to 42% when tool descriptions are shuffled but names remain the same, suggesting the ‘decision’ is often a sophisticated lookup rather than deep semantic reasoning.

↩
Medium — ‘LLMs Can’t Calculate’ (practitioner blog) — https://medium.com/@manucet439/llms-cant-calculate-why-you-should-use-tools-for-math-53c205bd5e0b

Models with fewer than 7B parameters often fail the ‘detour problem’ — resisting the urge to complete a sentence and instead format a structured tool call — independent of any prompt instruction.

↩
EmergentMind — BFCL v4 overview — https://www.emergentmind.com/topics/berkeley-function-calling-leaderboard-v4-bfclv4

BFCL focuses on deterministic AST-based validation of function-call syntax, while ToolBench relies on LLM-as-a-judge ToolEval; both treat necessity as a static property of the prompt rather than a function of model capability.

↩
Playwright MCP documentation — https://playwright.dev/mcp/introduction

Standard accessibility snapshots typically consume 200–400 tokens, compared to thousands for raw DOM or pixel-based vision models… locators based on roles and names are more resilient to UI refactors than brittle CSS selectors.

↩
WebGen-Agent OpenReview submission (Shi et al.) — https://openreview.net/forum?id=q2VpjD7k1V&referrer=%5Bthe%20profile%20of%20Hongsheng%20Li%5D(%2Fprofile%3Fid%3D~Hongsheng_Li3)

In a direct head-to-head comparison on the full 101-instruction WebGen-Bench test set, WebGen-Agent (using Qwen3-Coder) achieved 58.2% accuracy, significantly outperforming TDDev’s 44.0%… [TDDev’s headline] results were inflated by testing on a limited subset of only 10 instructions and using ‘golden’ design images as hints.

↩ ↩² ↩³
WebGen-Bench review (themoonlight.io) — https://www.themoonlight.io/en/review/webgen-bench-evaluating-llms-on-generating-interactive-and-functional-websites-from-scratch

DeepSeek-R1 paired with the Bolt.diy framework emerged as the functional leader, achieving an accuracy of 27.8%… WebGen-LM-32B, a model fine-tuned on the benchmark’s companion dataset, achieved a functional accuracy of 38.2%, surpassing all proprietary models.

↩
FullStack-Agent (ResearchGate, 2026) — https://www.researchgate.net/publication/400415776_FullStack-Agent_Enhancing_Agentic_Full-Stack_Web_Coding_via_Development-Oriented_Testing_and_Repository_Back-Translation

While TDDev proved that TDD infrastructure alone could improve generation quality by 34–48 percentage points… it was noted for producing relatively simple codebases that struggled with complex database implementations.

↩
themoonlight.io review of TDDev paper — https://www.themoonlight.io/en/review/from-runnable-to-shippable-multi-agent-test-driven-development-for-generating-full-stack-web-applications-from-requirements

The testing agent achieved 100% accuracy in detecting actual bugs (no false positives) but incorrectly flagged 25% of functional applications as broken due to Playwright selector timeouts (false negatives).

↩
Vantage ‘Agentic Coding Costs’ blog — https://www.vantage.sh/blog/agentic-coding-costs

Agentic coding consumes roughly 1,000x more tokens than standard code chat… input tokens, rather than output, drive 85% of total costs because agents must resend the entire codebase context, test logs, and conversation history at every turn.

↩

Code harness lifts GPT-5.5 26pt, Llama skips tool calls, TDDev's 65.8% rebutted

Code harness lifts GPT-5.5 26pt, Llama skips tool calls, TDDev’s 65.8% rebutted

TL;DR

Code as Agent Harness reframes runtime as agent bottleneck

TL;DR

The harness is the new variable

What the taxonomy actually says

Where the framing strains

Takeaway

Llama and Qwen know they need a tool, then don’t call it

TL;DR

The gap

Where the signal dies

Convergent evidence

The caveat the paper soft-pedals

Why it matters for benchmarks

TDDev’s 65.8% web-gen score faces a 44% counter-claim

TL;DR

What TDDev actually builds

The headline number is contested

The cost and false-negative tax

Takeaway

Round-ups

AgentKernelArena benchmarks GPU kernel agents on unseen configs

OProver unifies agentic theorem proving in Lean 4

Probe trajectories track reasoning dynamics inside chain-of-thought

Contrastive neuron search isolates a refusal gate in MLPs

Agent Bazaar stress-tests LLMs as economic agents

Symmetry-aware optimizers beat Adam on embeddings and MoE routers

Auto-research roadmap maps where AI helps and fails in science

Jack Sun, writing.

Code harness lifts GPT-5.5 26pt, Llama skips tool calls, TDDev’s 65.8% rebutted

TL;DR

Code as Agent Harness reframes runtime as agent bottleneck

TL;DR

The harness is the new variable

What the taxonomy actually says

Where the framing strains

Takeaway

Llama and Qwen know they need a tool, then don’t call it

TL;DR

The gap

Where the signal dies

Convergent evidence

The caveat the paper soft-pedals

Why it matters for benchmarks

TDDev’s 65.8% web-gen score faces a 44% counter-claim

TL;DR

What TDDev actually builds

The headline number is contested

The cost and false-negative tax

Takeaway

Round-ups

AgentKernelArena benchmarks GPU kernel agents on unseen configs

OProver unifies agentic theorem proving in Lean 4

Probe trajectories track reasoning dynamics inside chain-of-thought

Contrastive neuron search isolates a refusal gate in MLPs

Agent Bazaar stress-tests LLMs as economic agents

Symmetry-aware optimizers beat Adam on embeddings and MoE routers

Auto-research roadmap maps where AI helps and fails in science

Footnotes

Jack Sun, writing.