Wei (Jack) Sun

When the scaffold outweighs the model: a day of harness-defined results

Across today's model launches and agent benchmarks, the harness, evaluation rubric, and licensing frame are doing more work than the weights themselves.


TL;DR

  • Qwen3.5-Omni Plus and Flash are API-only; only the Light variant gets open weights, and the agentic gap to Gemini 3.1 Pro on OmniGAIA runs about 12 points.
  • GTA-2’s top model finishes 14.4% of long-horizon tasks, but swapping ReAct for OpenClaw moved Claude-Sonnet-4.5 from 0% to 50% on a subset.
  • On the Amazing Agent Race, navigation drives 52% of hard-trial failures while tool-use errors stay under 17% — the wall is finding the page.
  • Failing AAR agents issue 56% more searches than successful ones, suggesting web agents still cannot backtrack effectively.
  • Round-ups push agent evaluation toward real workflows (PRL-Bench, Mind DeepResearch) and trace post-training diversity collapse to training data, not format.

Today’s research reads like a referendum on what we’ve been measuring. Alibaba’s Qwen3.5-Omni posts category-leading audio numbers, but the comparison frame, the licensing posture, and the agentic gap matter more than the WER digits. Two agent benchmarks — GTA-2 and the Amazing Agent Race — independently land on the same uncomfortable point: swap the harness, and the model’s score moves more than swapping the model does. GTA-2 saw a 50-point jump on a subset just by replacing ReAct with OpenClaw; AAR finds that tool-use error is rare and navigation is where agents actually break.

The unifying read: in 2026, leaderboard position is increasingly an artifact of scaffolding choices — prompt format, harness, evaluation rubric, search loop — and the field is starting to admit it on the page. The round-ups echo the theme, with PRL-Bench and Mind DeepResearch pushing agent evaluation toward real research workflows, and a diversity-collapse paper localizing post-training damage to data composition rather than format.

Qwen3.5-Omni ships category-leading audio — and quietly closes the weights

Source: hf-daily-papers · published 2026-04-16

TL;DR

  • Qwen3.5-Omni Plus and Flash are API-only via Alibaba DashScope; only the smaller Light variant gets open weights.
  • Audio wins are real (Fleurs WER 6.55%, MMAU 82.2), but the “matches Gemini 3.1 Pro” framing hides a 12-point gap on the agentic OmniGAIA benchmark.
  • Three-second voice cloning ships without watermarking; SGLang users on Ascend 910B report infinite-loop hallucinations in production.

The headline number, and the asterisk

Qwen3.5-Omni claims SOTA on 215 subtasks across text, vision, audio, and video, with a 256k context window that swallows 10 hours of audio or 400 seconds of 720p video. The two flagship tiers, Plus and Flash, run on a Hybrid MoE Thinker-Talker stack with a new Adaptive Rate Interleave Alignment (ARIA) module that fixes the long-standing speech-vs-text tokenization rate mismatch — the root cause of skipped words and broken prosody in earlier streaming TTS.
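For scale, a back-of-envelope sketch of what those capacity claims imply about media token rates, assuming (the report doesn't say) that the quoted durations are meant to roughly fill the window:

```python
# Back-of-envelope check of the 256k-context claim. Assumption (not stated in the
# report): the quoted media durations roughly fill the whole window.
CONTEXT_TOKENS = 256_000

audio_seconds = 10 * 3600   # "10 hours of audio"
video_seconds = 400         # "400 seconds of 720p video"

audio_tokens_per_sec = CONTEXT_TOKENS / audio_seconds   # ~7.1 tokens/s of audio
video_tokens_per_sec = CONTEXT_TOKENS / video_seconds   # ~640 tokens/s of video

print(f"implied audio rate: {audio_tokens_per_sec:.1f} tok/s")
print(f"implied video rate: {video_tokens_per_sec:.0f} tok/s")
```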

The audio results back the hype. On Fleurs ASR, Plus posts a 6.55% WER against Gemini 3.1 Pro’s 7.32%. SEED-TTS test-en drops to 1.26 WER, beating F5-TTS (1.83) and the previous Qwen3-Omni (1.39). First-packet latency on Flash hits 235ms for audio input — genuinely real-time if your serving stack cooperates.
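For readers outside ASR, WER is just word-level edit distance normalized by reference length; a minimal sketch of the metric behind those Fleurs and SEED-TTS numbers (production scoring normalizes casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A 6.55% corpus WER is roughly one word error every ~15 reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ≈ 0.167
```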

Where the report gets slippery is everywhere else.

Benchmark choreography

Independent reviewers flagged that the comparison baseline rotates chart-by-chart between Gemini 3.1 Pro, GPT-5.2, and Claude Opus — characterized as a “misleading” tactic to surface wins and bury losses [1]. The cleanest example is agentic tool use: the report proudly cites Plus’s 57.2% on OmniGAIA, but the public OmniGAIA leaderboard shows Gemini 3.1 Pro at 68.9% [2]. That’s not “matching” — that’s an 11.7-point deficit on the benchmark that matters most for real-world agent deployment.

| Benchmark                  | Qwen3.5-Omni Plus | Gemini 3.1 Pro            |
|----------------------------|-------------------|---------------------------|
| MMAU (audio understanding) | 82.2              | 81.1                      |
| Fleurs ASR (WER ↓)         | 6.55%             | 7.32%                     |
| OmniGAIA (tool use)        | 57.2%             | 68.9% [2]                 |
| MMMU (vision reasoning)    | 80.1              | — (not directly compared) |

Read the report as: Qwen leads on audio, trails on agency, and gets to pick the chart.

The open-weight retreat

For a team whose Apache-2.0 releases built the “open-source champion” reputation, the licensing posture here is a sharp turn. Plus and Flash are proprietary, served only through Alibaba’s DashScope API; only the smaller Light variant lands on Hugging Face with downloadable weights [3]. r/LocalLLaMA’s read is blunt:

the ‘open-source champion’ appears to be ‘closing the door’ on its most advanced multimodal tech [4]

The practical consequence: nobody outside Alibaba can independently reproduce the 215-SOTA claim at the Plus tier. You’re trusting a hosted endpoint and a marketing PDF.

Deployment is bumpier than the latency chart suggests

A live SGLang issue documents severe hallucination and infinite generation loops when serving Qwen3.5-Omni on Ascend 910B hardware [5]. The problem looks Ascend-specific, but it’s a reminder that the advertised 235ms first-packet number assumes a blessed stack (vLLM + FlashAttention on NVIDIA) and a cooperative MoE router.

The voice stack is ahead of its safety story

The Talker can clone a speaker from roughly three seconds of reference audio. There is no built-in watermark, no consent handshake, no detector — flagged by independent analysis as “a high security risk without standardized industry guardrails” [6]. Pair that with the report’s own showcase capability — Audio-Visual Vibe Coding, where the model emits executable code directly from a screen-plus-voice prompt with no intervening text — and the attack surface looks like this:

flowchart LR
    A[3s voice sample] --> B[Talker: zero-shot clone]
    C[Screen capture + spoken instruction] --> D[Thinker: Vibe Coding]
    B --> E((Social-engineering<br/>call/voicemail))
    D --> F((Auto-generated<br/>executable code))
    E -.-> G[Victim runs code]
    F -.-> G

Neither path has a disclosed mitigation in the technical report.

Net read

The ARIA + Thinker-Talker work is genuinely interesting research, and the audio numbers are probably category-leading. But strip the choreography and Qwen3.5-Omni is a closed flagship that trails Gemini on agency, ships a weaponizable voice cloner, and asks you to take its 215-SOTA claim on faith.


GTA-2 puts the agent capability cliff at 14% — but the benchmark itself is part of the story

Source: hf-daily-papers · published 2026-04-16

TL;DR

  • GTA-2’s top scorer, Gemini-2.5-Pro, completes just 14.39% of long-horizon workflow tasks; GPT-5 lands at 11.36%.
  • Swapping the default ReAct harness for OpenClaw took Claude-Sonnet-4.5 from 0% to 50% on a 30-task subset — the scaffold dominates the model.
  • Over 40% of harness failures are formatting/artifact compliance, a category independent work suggests is partly an evaluation artifact, not pure capability.
  • LLM-as-judge scoring (GPT-5.2) and no pass^k stress test leave the headline numbers more fragile than they look.

The cliff is real, and it extends a known one

GTA-2 is the long-horizon sequel to the NeurIPS 2024 GTA benchmark, where GPT-4/4o already topped out below 50% on atomic tool calls and most open models couldn’t clear 25% [7]. The new contribution is GTA-Workflow: 132 open-ended tasks sourced from Manus, CrewAI, Reddit and Stack Exchange, decomposed into 1,156 verifiable checkpoints across 37 tools and deliverables ranging from PPTX to video. Performance collapses there.

| Model             | Workflow Root SR |
|-------------------|------------------|
| Gemini-2.5-Pro    | 14.39%           |
| GPT-5             | 11.36%           |
| Llama-4-Scout     | 10.61%           |
| Claude-Sonnet-4.5 | 9.09%            |

A 14% ceiling on tasks the authors describe as ordinary knowledge-work productivity is the most defensible thing the paper says. It is consistent with a 2026 Berkeley reliability study finding that frontier agents lose roughly 60% of their headline scores once you require success across eight consecutive runs [8] — and GTA-2 doesn’t run that test either.
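The consistency check is cheap to bolt on. A minimal sketch of a pass^k estimator in the τ-bench style (the helper name is mine), which asks how often k consecutive runs of the same task all succeed:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent runs of a task ALL succeed,
    given c successes observed in n runs (the combinatorial pass^k estimator)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# A task that succeeds 70% of the time looks fine at k=1 but collapses at k=8:
runs, successes = 10, 7
for k in (1, 4, 8):
    print(k, round(pass_hat_k(runs, successes, k), 3))
# k=1 -> 0.7, k=4 -> ~0.167, k=8 -> 0.0 (needs at least 8 successes in 10 runs)
```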

The harness is doing most of the work

The most striking single number isn’t a model score. Replacing Lagent (a vanilla ReAct loop) with the OpenClaw “agent OS” lifted Claude-Sonnet-4.5 from 0% to 50% on a 30-task slice. OpenClaw and Manus take very different routes to long-horizon coherence — OpenClaw uses a Markdown filesystem plus a “heartbeat” that survives session restarts, Manus runs a Planner/Executor/Verifier multi-agent loop [9] — so collapsing them into one “advanced harness” bar hides the mechanism that actually matters. The authors flag this causality problem themselves; it deserves more than a caveat.
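OpenClaw's internals aren't spelled out beyond the Markdown-memory and heartbeat description; a minimal sketch of that general pattern, with file layout and function names that are hypothetical rather than OpenClaw's actual format:

```python
from pathlib import Path
from datetime import datetime, timezone

STATE_FILE = Path("agent_state.md")  # hypothetical path; the real layout is OpenClaw's own

def heartbeat(step: int, plan: list[str], done: list[str], notes: str) -> None:
    """Write the agent's working state to disk after every step, so a crashed or
    restarted session can re-read it instead of restarting the task from scratch."""
    lines = [
        f"# Task state (step {step}, {datetime.now(timezone.utc).isoformat()})",
        "## Plan",
        *[f"- [{'x' if p in done else ' '}] {p}" for p in plan],
        "## Notes",
        notes,
    ]
    STATE_FILE.write_text("\n".join(lines))

def resume() -> str:
    """On restart, the previous state becomes part of the new session's context."""
    return STATE_FILE.read_text() if STATE_FILE.exists() else ""
```

The point of the pattern is that long-horizon state lives outside the context window, which is exactly what a vanilla ReAct loop lacks.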

The takeaway for builders is blunt: in late-2026 agent work, your scaffolding choices likely move the success rate more than swapping in next quarter’s frontier model.

The “formatting bottleneck” is partly the benchmark’s fault

GTA-2 reports that >40% of harness failures are artifact compliance — wrong file type, malformed JSON, schema mismatches — and frames this as an execution weakness. Two independent threads complicate that read. Strict output schemas force models to commit to an answer key before a reasoning key, measurably degrading reasoning quality [10]. And a widely-shared reproduction found that swapping JSON tool-calls for plain-English ones boosted accuracy by up to 18 points by removing exactly that burden [11].

Plain English outperforms JSON for LLM tool calling — by up to 18 percentage points.

So the 40% formatting failure rate is probably measuring two distinct things smushed together: agents that genuinely can’t produce a valid PDF, and agents whose reasoning was strangled by the schema before they got the chance.
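To make the distinction concrete, here is an illustrative sketch of the three output regimes in play (schemas and tool names are invented, not from either paper); many constrained-decoding stacks emit keys in schema order, which is what makes the answer-first variant costly:

```python
# 1) Strict schema, answer-first: the model must commit to `file_type` before it
#    has "thought", the ordering effect flagged in [10].
schema_answer_first = {
    "type": "object",
    "properties": {
        "file_type": {"enum": ["pdf", "pptx", "csv"]},
        "reasoning": {"type": "string"},
    },
    "required": ["file_type", "reasoning"],
}

# 2) Same schema with reasoning emitted first, so thinking precedes the commitment.
schema_reasoning_first = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "file_type": {"enum": ["pdf", "pptx", "csv"]},
    },
    "required": ["reasoning", "file_type"],
}

# 3) The plain-English variant from [11]: no schema at all, just a convention the
#    harness parses afterwards.
plain_english_call = (
    "Use the export tool. Say which file type you picked and why, "
    "e.g. 'Export as pptx because the task asks for slides.'"
)
```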

What to trust, and what’s missing

The recursive checkpoint design is a real improvement over trajectory-matching, but it leans on GPT-5.2 as judge. LLM judges carry documented position, verbosity and self-preference biases — GPT-family judges have been shown to favor GPT-family outputs [12] — which is awkward when the judge is awarding partial credit across a weighted tree of sub-goals. There’s no exploit-resistance check of the kind Berkeley flagged as table-stakes for 2026 benchmarks [8], and no pass^k consistency reporting.
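To see why that matters, here is a toy sketch of weighted-tree partial credit in the spirit of GTA-2's checkpoint design (weights, structure, and the scoring rule are illustrative, not the paper's exact rubric); every leaf verdict is a judge call, so leaf-level bias propagates straight into the root score:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """One node in a task's checkpoint tree. `passed` is what the judge decides."""
    name: str
    weight: float = 1.0
    passed: bool = False
    children: list["Checkpoint"] = field(default_factory=list)

def score(node: Checkpoint) -> float:
    """Recursive partial credit: a leaf is 0/1, an internal node is the
    weighted average of its children's scores."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

task = Checkpoint("make quarterly deck", children=[
    Checkpoint("gather figures", weight=1, passed=True),
    Checkpoint("build PPTX", weight=2, children=[
        Checkpoint("valid .pptx artifact", passed=False),   # formatting failure
        Checkpoint("contains revenue chart", passed=True),
    ]),
])
print(score(task))  # ~0.667: partial credit despite the broken artifact
```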

GTA-2 is the most honest picture yet of where general tool agents stand on real workflows. Read the 14% as “frontier models cannot reliably ship a multi-file deliverable without a good harness.” Don’t read it as a clean model leaderboard — the harness, the schema, and the judge are all on the scoreboard too.


The Amazing Agent Race exposes navigation as the real agent bottleneck

Source: hf-daily-papers · published 2026-04-16

TL;DR

  • AAR’s DAG-shaped Wikipedia tasks cap the best agents at 37.2% Finish-line Accuracy, with navigation errors in 52% of hard trials.
  • Tool-use error rates stay under 17% — the wall is finding the right page, not calling the right function.
  • Failing agents issue 56% more searches than successful ones, consistent with prior findings that web agents can’t backtrack.
  • The flashy GPT-OSS-120B collapse (3.1%) and Claude’s 6× token-efficiency win are likely scaffold artifacts, not pure model facts.

The compositionality gap

The Amazing Agent Race (AAR) is a 1,400-leg benchmark that breaks the linear-chain mold of GAIA and ToolBench by structuring tasks as directed acyclic graphs. Each “leg” forces an agent to pull a fact from Wikipedia, fan that fact into parallel tool calls (geocoding, elevation, historical weather, stock prices), then merge the results through modular arithmetic to produce a single-digit answer. The authors’ audit of six existing benchmarks found them 55–100% linear; AAR-DAG is 100% non-linear by construction.

flowchart LR
    A[Wikipedia seed page] --> B[Extract entity:<br/>e.g. city name]
    B --> C[Roadblock 1:<br/>elevation API]
    B --> D[Roadblock 2:<br/>population lookup]
    B --> E[Detour:<br/>digit-sum transform]
    C --> F[Modular arithmetic<br/>finish line]
    D --> F
    E --> F
    F --> G[Single-digit answer 0-9]

That diamond shape is where agents fall apart. Moving the same underlying tasks from linear to DAG form drops navigation accuracy by 13–18 points while tool-use scores stay flat — the dependency structure, not the API surface, is what confuses the planner.
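A toy leg in the AAR style makes the shape concrete (tool stubs, the seed entity, and the arithmetic template are invented for illustration; only the structure mirrors the benchmark):

```python
# Two roadblocks and a detour must all resolve before the modular-arithmetic
# finish line; dropping any branch leaves the final digit unrecoverable.

def elevation_m(city: str) -> int:        # stub for a geocoding/elevation API
    return {"Quito": 2850}[city]

def population(city: str) -> int:         # stub for a population lookup
    return {"Quito": 2_011_388}[city]

def digit_sum(n: int) -> int:
    return sum(int(d) for d in str(n))

city = "Quito"                            # the entity extracted from the seed page
answer = (elevation_m(city) + digit_sum(population(city))) % 10
print(answer)  # (2850 + 23) % 10 = 3
```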

The headline number is 37.2% Finish-line Accuracy at the top of the leaderboard, independently reproduced in The Moonlight’s review [13]. The decomposed metrics are the more interesting story: on extreme-difficulty legs, navigation errors show up in 52% of trials while roadblock (tool-use) errors stay below 17% [13]. Incorrect trials also issue 56% more searches and fetch 18% more pages than successful ones — agents spiral into over-exploration rather than backing out of dead ends.

This matches what the web-agent literature has been saying for a year. INFOGENT and the “Failure is Feedback” line of work argue that mainstream scaffolds are forward-only: once an agent walks off a useful page, it has no state-space memory to walk back [14]. AAR is the cleanest quantitative demonstration of that failure mode to date.

The GPT-OSS-120B number deserves an asterisk

The paper reports GPT-OSS-120B at 3.1% FA and attributes it to the reasoning model burning its token budget on internal thinking before issuing a tool call [13]. That story is incomplete. On TAU-bench Retail and BFCL-v3, the same model lands at 67–68% — behind GLM-4.5 and Qwen3 Thinking, but nowhere near a 3% floor [15]. Fireworks has documented that GPT-OSS’s Harmony format breaks in many agent harnesses: reasoning tokens aren’t echoed back across turns, and an October regression specifically broke parallel tool calling [16]. AAR’s collapse is plausibly a harness-format mismatch as much as a capability gap.

The Claude-Code-vs-Codex-CLI efficiency comparison (near-identical accuracy, 6× fewer tokens) is similarly scaffold-bound — these are different agent loops, not just different models.

The leakage problem hasn’t gone away

AAR’s own audit catches 14–21% of DAG trials solved without visiting the required pages — agents inferring tool arguments straight from the natural-language clue. That’s the same family of shortcut Berkeley RDI used to “break” WebArena and GAIA via DOM injection and prompt extraction [17]. The modular-arithmetic finish line helps (20.5% of trials are arithmetic near-misses [13]), but riddle-decoding remains the soft underbelly of any text-clue benchmark.
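A sketch of what that kind of audit can look like, with the shortcut check and the near-miss check folded into one classifier (heuristics and names are mine, not AAR's audit code):

```python
def classify_trial(answer: int, target: int, visited: set[str], required: set[str]) -> str:
    """Toy version of the two audits described above: did the agent shortcut the
    navigation, and if it failed, was it an arithmetic near-miss at the finish line?"""
    if answer == target and not required <= visited:
        return "shortcut: correct answer without visiting the required pages"
    if answer != target and abs(answer - target) % 10 in (1, 9):
        return "arithmetic near-miss: right pages, off-by-one at the finish line"
    return "correct" if answer == target else "navigation/other failure"

print(classify_trial(3, 3, {"Quito"}, {"Quito", "List_of_capitals"}))
print(classify_trial(4, 3, {"Quito", "List_of_capitals"}, {"Quito", "List_of_capitals"}))
```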

One piece of good news for reproducibility: AAR runs on Laude Institute’s Harbor harness, the same CI-style sandbox now backing Terminal-Bench 2.0 and SWE-Bench [18]. Third parties can rerun it without building bespoke infrastructure — which is exactly what the GPT-OSS result needs.

Round-ups

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs’ Capabilities in Frontier Physics Research

Source: hf-daily-papers

PRL-Bench evaluates LLMs on end-to-end theoretical and computational physics research workflows drawn from frontier problems, finding current systems fall well short of autonomous scientific exploration. The benchmark targets agentic science capability rather than isolated problem-solving, exposing gaps in domain knowledge and multi-step research execution.

Mind DeepResearch Technical Report

Source: hf-daily-papers

MindDR proposes a three-agent deep research framework trained through a four-stage pipeline combining SFT cold-start, Search-RL, Report-RL, and preference alignment. The system is evaluated on real-world Chinese queries using a multi-dimensional rubric and reports strong results across multiple research benchmarks.

Where does output diversity collapse in post-training?

Source: hf-daily-papers

Empirical study finds that diversity collapse in post-trained LLMs traces primarily to training data composition rather than generation format, with SFT, DPO, and chain-of-thought distillation each affecting diversity differently across tasks. The authors decompose loss into quality-control and residual components to localize where collapse happens.

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

Source: hf-daily-papers

STOP introduces learnable token-level path pruning for parallel reasoning in large reasoning models, cutting redundant prefixes early to save compute. The authors compare learnable against non-learnable pruning strategies across budgets and report both efficiency and accuracy gains, with code and a project page released.

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Source: hf-daily-papers

Diffusion models accumulate an SNR-timestep mismatch between training and inference denoising. The proposed differential correction method handles frequency components separately during the reverse process, improving generation quality across multiple diffusion backbones with negligible added compute, with code released as DCW.

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

Source: hf-daily-papers

ArtifactNet detects AI-generated music by training a compact UNet plus CNN on codec-specific residuals in magnitude spectrograms after harmonic-percussive separation. The authors release ArtifactBench and report better cross-codec robustness than prior detectors through codec-aware training that targets forensic compression artifacts.

Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips

Source: hf-daily-papers

1P-DNL shows that flipping a single sign bit in network parameters can catastrophically degrade models without any data or optimization, demonstrated on ResNet-50 ImageNet classification, Mask R-CNN and YOLOv8-seg detection and segmentation, and the Qwen3-30B-A3B-Thinking language model, with targeted bit protection as mitigation.

Footnotes

  1. Digital Applied benchmark review · https://www.digitalapplied.com/blog/qwen-3-5-omni-vs-gemini-3-1-vs-gpt-5-4-comparison

    the technical report frequently shifted its comparison models—swapping between Gemini 3.1 Pro, GPT-5.2, and Claude Opus—which some termed a ‘misleading’ tactic to highlight specific wins while obscuring broader deficiencies

  2. OmniGAIA leaderboard (qwen.ai) · https://qwen.ai/blog?id=qwen3.5-omni

    Gemini-3.1 Pro currently leads the OmniGAIA tool-use evaluation with a score of 68.9%, followed by Qwen3.5-Omni-Plus at 57.2%

  3. MarkTechPost · https://www.marktechpost.com/2026/03/30/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction/

    the flagship 3.5-Omni Plus and Flash variants were launched as proprietary models available only via Alibaba’s DashScope API… Only the Light variant has been officially released with open weights

  4. r/LocalLLaMA discussion · https://www.reddit.com/r/singularity/comments/1ra7agg/sound_on_gemini_31_pro_surpassed_every/

    the ‘open-source champion’ appears to be ‘closing the door’ on its most advanced multimodal tech

  5. SGLang GitHub issue #19822 · https://github.com/sgl-project/sglang/issues/19822

    severe hallucination and repetitive output loops when serving the model on Ascend 910B hardware via SGLang… the model frequently falls into infinite generation cycles

  6. LLMBase comparison · https://llmbase.ai/compare/qwen3-5-omni-plus,gemini-3-1-pro-preview/

    Qwen3.5-Omni’s native voice cloning—requiring only three seconds of audio to mimic a speaker—poses a high security risk without standardized industry guardrails

  7. Wang et al., ‘GTA: A Benchmark for General Tool Agents’ (NeurIPS 2024) · https://proceedings.neurips.cc/paper_files/paper/2024/file/8a75ee6d4b2eb0b777f549a32a5a5c28-Paper-Datasets_and_Benchmarks_Track.pdf

    GPT-4 and GPT-4o achieved success rates of less than 50%, while the majority of mainstream LLMs failed to complete even 25% of tasks; smaller models often failed to follow the rigid ‘Thought-Action-Action Input’ formatting required for successful tool invocation.

  8. rapidclaw.dev — ‘AI Agent Benchmarks 2026’ · https://rapidclaw.dev/blog/ai-agent-benchmarks-2026

    A 2026 Berkeley study found that several leading agentic benchmarks could be exploited to achieve near-perfect scores without actually solving the underlying tasks… many ‘frontier’ models see a 60% performance drop when required to succeed across eight consecutive runs.

  9. manus.im — ‘OpenClaw vs Manus Desktop’ · https://manus.im/blog/openclaw-vs-manus-desktop

    Manus utilizes a multi-agent architecture where a ‘Planner’ delegates sub-tasks to specialized ‘Executors’ and ‘Verifiers’… while OpenClaw addresses long-horizon state through a ‘Heartbeat’ mechanism and a local Markdown-based memory system that survives session restarts.

  10. MarkTechPost — ‘Balancing Act: Impact of Format Restrictions on Reasoning in LLMs’ · https://www.marktechpost.com/2024/08/09/balancing-act-the-impact-of-format-restrictions-on-reasoning-in-large-language-models/

    Strict schemas often force models to provide a final answer key before a reasoning key, effectively stripping the model of its ability to ‘think’ before committing to a result.

  11. r/MachineLearning — ‘Plain English Outperforms JSON for LLM Tool [Calling]’ · https://www.reddit.com/r/MachineLearning/comments/1o8szk0/r_plain_english_outperforms_json_for_llm_tool/

    Replacing JSON with natural language tool-calling can boost accuracy by up to 18 percentage points by reducing this formatting burden and context bloat.

  12. OpenReview discussion of LLM-as-judge biases · https://openreview.net/forum?id=LZnKNApvhG

    Position bias… verbosity bias… self-preference bias, where models like GPT-4 may favor outputs from their own model family, potentially inflating benchmark scores in a non-objective manner.

  13. The Moonlight (independent paper review) · https://www.themoonlight.io/en/review/the-amazing-agent-race-strong-tool-users-weak-navigators

    navigation errors were present in 52% of trials, while tool-use errors remained below 17%… reasoning-optimized GPT-OSS-120B model performed poorly (3.1% FA) because it spent its token budget on internal thinking rather than executing tool calls

  14. INFOGENT / ‘Failure is Feedback’ (CLiC-it 2025) · https://aclanthology.org/2025.clicit-1.63.pdf

    Most autonomous web agents suffer from an inability to backtrack… Traditional agents are often designed for forward-only navigation; once they navigate away from a useful page to a dead end, they lack the state-space memory to return

  15. Clarifai benchmark roundup (GPT-OSS comparison) · https://www.clarifai.com/blog/openai-gpt-oss-benchmarks-how-it-compares-to-glm-4.5-qwen3-deepseek-and-kimi-k2

    On TAU-bench Retail… the model scored 67.8%, trailing behind GLM-4.5 (79.7%) and Kimi K2 (70.6%)… BFCL-v3 function-calling benchmark… 67–68%, while competitors like Qwen3 Thinking reached nearly 72%

  16. Fireworks.ai GPT-OSS deployment notes · https://fireworks.ai/blog/openai-gpt-oss

    many standard LLM libraries and frontends fail to pass these reasoning tokens back to the model in multi-turn conversations, leading to a total breakdown in agentic behavior… a documented regression specifically broke parallel tool calling

  17. OpenReview — Berkeley RDI benchmark audit · https://openreview.net/forum?id=5t7DtLwTVC&referrer=%5Bthe%20profile%20of%20Elizabeth%20M.%20Daly%5D(%2Fprofile%3Fid%3D~Elizabeth_M._Daly1)

    Berkeley RDI researchers proved this by ‘breaking’ top benchmarks like WebArena and GAIA using simple exploits—such as DOM injection or prompt injections—that allowed agents to achieve near-perfect scores without solving a single task

  18. Quesma — CompileBench on Harbor · https://quesma.com/blog/compilebench-in-harbor/

    Harbor functions as a CI/CD pipeline for agents: it initializes a sandbox, drops the agent into it, records the interaction trajectory, and employs a verifier to produce a numerical reward… some third-party task images required patching to resolve inherent reliability issues

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.
