JS Wei (Jack) Sun

IBM's leaderboard rebutted, NVIDIA's rank-8 enough, PaddleOCR slides to third

IBM's agent leaderboard, NVIDIA's Cosmos LoRA recipe, and PaddleOCR 3.5 each pair a headline win with a second number that walks it back.

IBM’s leaderboard rebutted, NVIDIA’s rank-8 enough, PaddleOCR slides to third

TL;DR

  • Princeton’s HAL finds scaffolds swing IBM leaderboard scores by ~30 points across the same benchmarks.
  • NVIDIA’s PEFT recipe tunes 2B Cosmos Predict 2.5 on 92 robot videos in 17h on one H100.
  • Rank-8 LoRA already saturates geometric Sampson errors in NVIDIA’s own ablation tables.
  • PaddleOCR 3.5’s 94.5% OmniDocBench crown shrinks to third at 80.0% on olmOCR-Bench.
  • Berkeley’s BenchJack hits 100% on SWE-Bench Verified via a 10-line conftest.py exploit.

Three releases hit the AI-tech feed today — IBM Research’s Open Agent Leaderboard, NVIDIA’s LoRA recipe for Cosmos Predict 2.5, and PaddleOCR 3.5 — and each writeup contains its own counter-evidence. IBM’s headline that the model matters more than the scaffold is directly contradicted by Princeton’s HAL data on the same benchmarks. NVIDIA’s recipe trains in 17 hours on one H100, but the ablation tables show rank-8 already saturates the geometric errors rank-32 is sold to fix. PaddleOCR’s 94.5% OmniDocBench crown drops to third at 80.0% on olmOCR-Bench, and the new Transformers backend runs an order of magnitude slower than the native paddle_static path.

The pattern isn’t bad-faith marketing — it’s the second number landing in the same artifact as the first, which is more than most release weeks deliver. Read the headlines, then read one table down.

IBM says model beats scaffold; Princeton’s data says otherwise

Source: huggingface-blog · published 2026-05-18

TL;DR

  • Princeton HAL finds scaffolds swing scores ~30 points, directly contradicting IBM’s “model dominates architecture” headline.
  • IBM Research shipped an Open Agent Leaderboard scoring six benchmarks on both success rate and USD/task.
  • The name “Open Agent Leaderboard” was already taken in January 2025 by OM AI Lab’s GitHub project.
  • Berkeley’s BenchJack hit 100% on SWE-Bench Verified with a 10-line conftest.py exploit.

The contested headline

IBM’s framing is that backbone model choice explains the overwhelming majority of variance — open-weight models like DeepSeek V3.2 and Kimi K2.5 trail frontier closed models by 18–29 points, and the agentic wrapper is secondary. Princeton’s Holistic Agent Leaderboard team reports the opposite: a well-engineered scaffold can lift the same backbone by nearly 30 percentage points, “often more than a jump between model generations” 1.

ClaimIBM Open Agent LeaderboardPrinceton HAL
Dominant variance driverBackbone model choiceAgent scaffold (retry, recovery, tool routing)
Max scaffold-only score swingTreated as secondary effect~30 percentage points 1
Implication for buyersPick the best model, wrap it thinlyEngineer the harness as carefully as the model

Independent reading of IBM’s own General Agent Evaluation paper sharpens the tension. Many architecture-level deltas fail to reach statistical significance (p > 0.1), and open-weight backbones show “architecture sinks” where the same model swings from 0.83 to 0.00 depending on scaffold 2. That’s a strong argument for scaffold importance, not against it. The leaderboard’s own data on shortlisting — a 5.5-point swing from a single architectural tweak 3 — points the same direction.

What IBM actually shipped

The Open Agent Leaderboard evaluates agents as full systems — backbone model plus tool use, planning, memory, error recovery — and reports both quality and cost-per-task. It’s powered by Exgentic, an open orchestration framework that standardizes the Task/Context/Actions triple across benchmarks so runs are reproducible. Six benchmarks cover coding (SWE-Bench Verified), web research (BrowseComp+), personal apps (AppWorld), and policy-bound customer service (tau2-Bench across airline, retail, telecom).

The genuinely useful empirical findings sit one layer below the headlines. Tool shortlisting — pruning the agent’s available tools to the relevant subset before each step — added 5.5 points to GPT-5.2 success rate and cut Claude Opus 4.5 cost by roughly $1.97 per task 3. Smolagents matched heavyweight harnesses at $3.21/task versus $5.97 3. And failed runs cost 20–54% more than successful ones, because agents exhaustively try inefficient paths before quitting 3 — a number practitioners can actually budget against.

The harness problem underneath

Cost/quality Pareto frontiers are only as trustworthy as the benchmarks underneath. Berkeley RDI’s BenchJack work, released months before this launch, scored 100% on SWE-Bench Verified by injecting a 10-line conftest.py that hooks pytest to mark every test passed — no code fixed 4. The same audit found WebArena leaking ground truth via file:// URLs and CAR-bench’s LLM judge falling to prompt injection. Berkeley’s authors call nested evaluator/SUT isolation “non-negotiable” 5. Exgentic uses Docker isolation but inherits the same exploitable benchmark internals.

Net take

Treat the leaderboard as a useful cost dashboard and a source of concrete scaffolding numbers, not as a settled verdict on model-vs-architecture. And if you go searching the name, note there are two projects called “Open Agent Leaderboard” — IBM’s is the one under the ibm-research HF org 6.


NVIDIA’s Cosmos LoRA recipe runs in 17h on one H100

Source: huggingface-blog · published 2026-05-18

TL;DR

  • NVIDIA’s PEFT recipe tunes the 2B Cosmos Predict 2.5 on 92 robot videos in 17h on one 80GB H100.
  • Rank-32 LoRA (~50M trainable params) lifts instruction-following scores over rank-8.
  • Geometric Sampson errors barely move past rank 8 — the physics priors are already there.
  • DoRA runs 10–20% slower per step than LoRA from extra magnitude/direction normalization, for tied quality.

What the recipe actually delivers

NVIDIA’s guide is a tight piece of engineering: inject LoRA/DoRA adapters into the DiT’s attention projections (to_q/k/v, to_out.0) and feedforward layers of Cosmos Predict 2.5, train under rectified flow on 92 GR00T Dreams clips, and a single 80GB H100 finishes 100 epochs in ~17 hours (2.5 hours on 8×H100). Rank 32 with alpha 32, bf16 mixed precision, LoRA weights upcast to fp32. The reported wins are concrete: lower Temporal and Cross-view Sampson Error, and higher Cosmos-Reason2-2B judge scores on physical plausibility and instruction following — most notably, the tuned model stops hallucinating extra hands and respects “left hand” vs “right hand” prompts.

The rank ablation is the most useful finding. Bumping rank 8 → 32 moves instruction-following scores but leaves geometric consistency roughly flat, which suggests the base model’s physics priors are largely intact and what fine-tuning actually buys is prompt grounding to the target embodiment.

DoRA’s “similar results” hides a compute tax

The post recommends DoRA as a stability fallback at very low ranks and calls the two methods a near-tie at rank 32. Independent numbers fill in what’s missing: DoRA’s weight-decomposition adds 0.01–6% more parameters and runs 10–20% slower per step than LoRA due to the extra normalization 7. On a 17-hour run, that’s two to three hours of compute for a tie. The honest read is that DoRA earns its slot only when rank-8 LoRA visibly destabilizes — otherwise rank-32 LoRA is the default.

The pixel-space premise is contested

The recipe takes for granted that a pixel-faithful video diffuser is the right world model for robotics. That premise is under live attack. Meta’s V-JEPA 2, which predicts in embedding space rather than pixels, reportedly plans ~30× faster and hits 65–80% zero-shot pick-and-place on just 62 hours of real robot data 8. NVIDIA’s counter is downstream: Cosmos Policy variants reach 98.5% on LIBERO and 71.1% on RoboCasa, beating diffusion-policy and VLA baselines 9 — but those are policy models, not the video generator this recipe tunes.

There’s a deeper structural worry too. Recent work argues LoRA introduces “intruder dimensions” — high-rank singular vectors orthogonal to pretrained knowledge — that can themselves cause the hallucination behavior fine-tuning is meant to fix 10. And DreamGen’s own authors flag an executability gap: visually coherent rollouts routinely produce kinematically impossible commands when an inverse-dynamics model tries to act on them 11. Using a sibling Cosmos model as judge catches obvious failures but risks self-reinforcement on the subtle ones.

Takeaway

“Fever dreams that occasionally teleport objects when the model reaches the bounds of its known distribution.” — an HN commenter on Cosmos video outputs 12

As a PEFT cookbook, the recipe is solid and reproducible on hardware a single lab can afford. As a claim about how to build robot world models, it leaves the load-bearing questions — pixel vs latent substrate, LoRA’s effect on physical priors, VLM-judge circularity, contact and friction modeling — entirely to the reader.


PaddleOCR 3.5 adds Transformers backend, trades speed for fit

Source: huggingface-blog · published 2026-05-18

TL;DR

  • PaddleOCR 3.5 ships a Transformers backend so PP-OCRv5 and PaddleOCR-VL 1.5 run on PyTorch without the PaddlePaddle runtime.
  • The headline 94.5% OmniDocBench crown shrinks to third at 80.0% on olmOCR-Bench, behind dots.mocr.
  • The Transformers path is materially slower than vLLM or paddle_static — 7 minutes per 15-page PDF vs. Marker’s 54 seconds.
  • Hugging Face’s native class handles only element-level prompts, leaving full-page parsing on the native engine or vLLM.

A distribution win, not a performance one

The headline change in PaddleOCR 3.5 is a new inference-engine interface: pass engine_config={"backend": "transformers", ...} and the same pipeline that used to require PaddlePaddle now runs on PyTorch with sdpa attention, bfloat16, and standard device placement. For teams whose stack is already transformers, vllm, and the HF Hub, that removes a real integration tax around RAG and Document AI workflows.

What the post is careful not to say loudly: this backend is the slow one. Baidu itself recommends paddle_static when throughput matters, and independent benchmarks make the gap concrete. A hands-on comparison across five OCR models clocked PaddleOCR-VL 1.5 at seven minutes on a 15-page academic paper, against Marker’s 54 seconds, and noted that the older PP-StructureV3 actually beats the 3.5 defaults on clean documents 13. vLLM serving narrows the gap to 0.5–1.0s per page at ~140 generation tok/s and ~550 prefill tok/s, but the raw Transformers backend trails both vLLM and SGLang on the identical weights 14.

flowchart LR
    A[PP-OCRv5 / PaddleOCR-VL 1.5] --> B{engine_config}
    B -->|paddle_static| C[Fastest throughput]
    B -->|vLLM / SGLang| D[0.5–1.0s per page]
    B -->|transformers| E[Easiest HF integration<br/>slowest of the three]

The SOTA claim depends on which benchmark you pick

The model story is genuinely impressive in its own lane. PaddleOCR-VL 1.5 is a 0.9B-parameter vision-language model that tops OmniDocBench v1.5 at 94.5%, beating much larger generalists including GPT-4o and Qwen2.5-VL 15. That is the result Baidu leans on.

It is not, however, a universal crown. On olmOCR-Bench — which weighs English-heavy PDFs and reading-order recovery — the leaderboard reorders:

ModelolmOCR-Bench
dots.mocr83.9%
PaddleOCR-VL80.0%
dots.ocr79.1%
MinerU 2.575.2%

Reviewers comparing the field route academic and formula-heavy work to MinerU, wild scene text to dots.ocr, and structured forms to PP-OCRv5 1613. “SOTA” collapses into “SOTA on the benchmark Baidu picked.”

Integration friction the blog post elides

The clean five-line code sample understates setup cost. The PaddleOCRVLForConditionalGeneration class is real and loadable through AutoModelForImageTextToText, but it only responds to element-level prompts — OCR:, Table Recognition:, Formula Recognition:, Chart Recognition: — and full-page document parsing still has to fall back to the native engine or a vLLM server 17. Meanwhile the official PaddlePaddle 3.2.2 container currently fails to load PaddleOCR-VL 0.9B safetensors weights with a framework paddle is invalid error, forcing manual wheel installs to work around it 18.

Takeaway

3.5 is best read as a packaging move. If your infrastructure is HF-native and your bottleneck was getting PaddleOCR into a Python service next to a RAG stack, the Transformers backend is a real unlock. If you care about latency or throughput, the honest path is still vLLM or paddle_static — and if you care about being best on your documents, you should benchmark against MinerU, dots.ocr, and Marker before assuming the OmniDocBench number transfers.

Round-ups

Simon Willison logs four bird species along the LA River

Source: simon-willison

Simon Willison capped a PyCon US trip with a morning walk along the Los Angeles River, photographing a Glaucous-winged Gull, Brown Pelican, Snowy Egret, and Canada Goose. The pelican sighting was the target; goslings near the swan boat lake were the bonus.

Footnotes

  1. Level Up Coding — Princeton HAL analysishttps://levelup.gitconnected.com/that-model-leaderboard-youre-trusting-might-not-be-the-honest-ones-3f99f12a2abc

    a well-designed agent framework (retry logic, error recovery, tool routing) can boost a model’s score by nearly 30 percentage points — often more than a jump between model generations

    2
  2. wispaper.ai — independent reading of the General Agent Evaluation paperhttps://www.wispaper.ai/en/user-blog/general-agent-evaluation-20260301/eng

    performance differences between the tested agent architectures were often not statistically significant (p > 0.1)… open-weight backbones like DeepSeek and Kimi often show ‘architecture sinks’ where performance swings from 0.83 to 0.00

  3. awesomeagents.ai — Open Agent Leaderboard recaphttps://awesomeagents.ai/news/open-agent-leaderboard-model-beats-architecture/

    adding shortlisting to ReAct increased GPT-5.2 success by 5.5 percentage points and cut Claude Opus 4.5 cost by ~$1.97 per task; Smolagents matched heavyweight harnesses at $3.21 vs $5.97

    2 3 4
  4. byteiota — Berkeley RDI ‘BenchJack’ coveragehttps://byteiota.com/berkeley-breaks-ai-agent-benchmarks-100-scores-zero-solutions/

    the agent achieved a 100% resolve rate by simply injecting a 10-line conftest.py file into the repository… force every test to report a ‘passed’ status, regardless of the actual code state

  5. Berkeley RDI blog — Trustworthy Benchmarkshttps://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

    non-negotiable need for nested sandboxing and total isolation of the evaluator from the system under test

  6. GitHub — om-ai-lab/open-agent-leaderboardhttps://github.com/om-ai-lab/open-agent-leaderboard

    OmAgent-based leaderboard ranking algorithms (CoT, ReAct, ToT) on GSM8K/MATH-500, released January 2025 under the identical name ‘Open Agent Leaderboard’

  7. NVIDIA Developer Blog (DoRA introduction)https://developer.nvidia.com/blog/introducing-dora-a-high-performing-alternative-to-lora-for-fine-tuning/

    DoRA decomposes weights into magnitude and direction, applying the low-rank update only to the directional component… introduces a slight parameter overhead—typically around 0.01% to 6% more than LoRA… approximately 10% to 20% slower to train than LoRA due to additional normalization steps.

  8. Turing — Exploring V-JEPA 2https://www.turing.com/blog/exploring-v-jepa-2

    V-JEPA 2 reportedly outperforms pixel-heavy models like Cosmos by up to 30× in planning speed during robotic manipulation tasks… achieves success rates of 65–80% in zero-shot pick-and-place tasks using only 62 hours of real-world robot data.

  9. The Robot Report — Cosmos Policy WFMshttps://www.therobotreport.com/nvidia-adds-cosmos-policy-world-foundation-models/

    Cosmos Policy reached success rates of 98.5% on the LIBERO benchmark and 71.1% on RoboCasa, outperforming traditional diffusion policies and vision-language-action models.

  10. Clemson CECAS — Small Models, Big Capabilitieshttps://blogs.clemson.edu/cecas/small-models-big-capabilities/

    PEFT methods like LoRA create ‘intruder dimensions’—high-ranking singular vectors dissimilar to pre-trained knowledge—that bludgeon the existing vector space, causing the model to hallucinate even when the base model previously understood the content.

  11. ResearchGate — DreamGen: Neural Trajectories paperhttps://www.researchgate.net/publication/391878946_DreamGen_Unlocking_Generalization_in_Robot_Learning_through_Neural_Trajectories

    Visually coherent video rollouts often generate kinematically impossible commands when processed by an inverse dynamics model; relying solely on VLM-based visual plausibility is insufficient for high-stakes manipulation.

  12. Hacker News discussion on Cosmoshttps://news.ycombinator.com/item?id=45331837

    Critics described generated outputs as ‘fever dreams’ that occasionally teleport objects when the model reaches the bounds of its known distribution, and noted video models lack physical feedback like pressure, touch, and friction.

  13. gopubby — ‘I tested 5 OCR models on 6 real-world datasets’https://ai.gopubby.com/i-tested-5-ocr-models-on-6-real-world-datasets-heres-which-one-you-should-actually-use-50badae3c16d

    PaddleOCR-VL 1.5 took 7 minutes on a 15-page academic paper with ‘painful setup’ compared to Marker’s 54 seconds; older lightweight PP-StructureV3 actually outperforms the default 3.5 configurations in speed.

    2
  14. Hugging Face — PaddleOCR-VL model card / vLLM servinghttps://huggingface.co/PaddlePaddle/PaddleOCR-VL

    Served via vLLM, PaddleOCR-VL-1.5 averages ~140 tokens/s generation and 550 tokens/s prefill, cutting per-image latency from 2–5s on the native backend to 0.5–1.0s; the raw Transformers backend is materially slower than vLLM or SGLang.

  15. Towards AI — PaddleOCR-VL 1.5 deep divehttps://pub.towardsai.net/paddleocr-vl-1-5-a-deep-dive-into-the-0-9b-model-that-outperforms-gpt-4o-on-document-parsing-c93bac97ac1f

    PaddleOCR-VL 1.5 (a 0.9B parameter model) currently leads OmniDocBench v1.5 with a breakthrough score of 94.5%, outperforming significantly larger generalist models like GPT-4o and Qwen2.5-VL.

  16. NetMind — PDF parser comparisonhttps://blog.netmind.ai/article/Which_PDF_Parser_Should_You_Use%3F_Comparing_Docling%2C_Marker%2C_MinerU%2C_olmOCR_-_and_Why_NetMind_ParsePro_Might_Be_Better

    On olmOCR-Bench dots.mocr leads with 83.9%, followed by PaddleOCR-VL (80.0%) and dots.ocr (79.1%), while MinerU 2.5 trails at 75.2% — ‘best’ is context-dependent.

  17. Hugging Face transformers docs — paddleocr_vlhttps://huggingface.co/docs/transformers/model_doc/paddleocr_vl

    Native class PaddleOCRVLForConditionalGeneration loadable via AutoModelForImageTextToText; element-level prompts (‘OCR:’, ‘Table Recognition:’, ‘Formula Recognition:’, ‘Chart Recognition:’) trigger different tasks, but page-level parsing still requires the native engine or vLLM server.

  18. GitHub — PaddleOCR issueshttps://github.com/PaddlePaddle/PaddleOCR/issues

    PaddlePaddle 3.2.2 container fails to load PaddleOCR-VL 0.9B safetensors weights, throwing ‘framework paddle is invalid’; users must manually install specific safetensors wheels to bypass the error.

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare