IBM's leaderboard rebutted, NVIDIA's rank-8 enough, PaddleOCR slides to second
IBM's agent leaderboard, NVIDIA's Cosmos LoRA recipe, and PaddleOCR 3.5 each pair a headline win with a second number that walks it back.
TL;DR
- Princeton’s HAL finds a scaffold alone can swing a model’s score by ~30 points, undercutting IBM’s leaderboard headline.
- NVIDIA’s PEFT recipe tunes 2B Cosmos Predict 2.5 on 92 robot videos in 17h on one H100.
- Rank-8 LoRA already saturates geometric Sampson errors in NVIDIA’s own ablation tables.
- PaddleOCR 3.5’s 94.5% OmniDocBench crown shrinks to second place at 80.0% on olmOCR-Bench.
- Berkeley’s BenchJack hits 100% on SWE-Bench Verified via a 10-line conftest.py exploit.
Three releases hit the AI-tech feed today (IBM Research’s Open Agent Leaderboard, NVIDIA’s LoRA recipe for Cosmos Predict 2.5, and PaddleOCR 3.5), and each writeup contains its own counter-evidence. IBM’s headline that the model matters more than the scaffold is directly contradicted by Princeton’s HAL data on the same benchmarks. NVIDIA’s recipe trains in 17 hours on one H100, but the ablation tables show rank-8 already saturates the geometric errors rank-32 is sold to fix. PaddleOCR’s 94.5% OmniDocBench crown drops to second place at 80.0% on olmOCR-Bench, and the new Transformers backend is materially slower than vLLM serving or the native paddle_static path.
The pattern isn’t bad-faith marketing — it’s the second number landing in the same artifact as the first, which is more than most release weeks deliver. Read the headlines, then read one table down.
IBM says model beats scaffold; Princeton’s data says otherwise
Source: huggingface-blog · published 2026-05-18
TL;DR
- Princeton HAL finds scaffolds swing scores ~30 points, directly contradicting IBM’s “model dominates architecture” headline.
- IBM Research shipped an Open Agent Leaderboard scoring six benchmarks on both success rate and USD/task.
- The name “Open Agent Leaderboard” was already taken in January 2025 by OM AI Lab’s GitHub project.
- Berkeley’s BenchJack hit 100% on SWE-Bench Verified with a 10-line conftest.py exploit.
The contested headline
IBM’s framing is that backbone model choice explains the overwhelming majority of variance — open-weight models like DeepSeek V3.2 and Kimi K2.5 trail frontier closed models by 18–29 points, and the agentic wrapper is secondary. Princeton’s Holistic Agent Leaderboard team reports the opposite: a well-engineered scaffold can lift the same backbone by nearly 30 percentage points, “often more than a jump between model generations” 1.
| Claim | IBM Open Agent Leaderboard | Princeton HAL |
|---|---|---|
| Dominant variance driver | Backbone model choice | Agent scaffold (retry, recovery, tool routing) |
| Max scaffold-only score swing | Treated as secondary effect | ~30 percentage points 1 |
| Implication for buyers | Pick the best model, wrap it thinly | Engineer the harness as carefully as the model |
An independent reading of IBM’s own General Agent Evaluation paper sharpens the tension. Many architecture-level deltas fail to reach statistical significance (p > 0.1), yet open-weight backbones show “architecture sinks” where the same model swings from 0.83 to 0.00 depending on scaffold 2. Those sinks are a strong argument for scaffold importance, not against it. The leaderboard’s own data on shortlisting (a 5.5-point swing from a single architectural tweak 3) points the same direction.
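One way to make "scaffold swing" concrete is to take the gap between a backbone's best and worst score across scaffolds. The sketch below is a minimal illustration, not IBM's or Princeton's methodology; the function name and the backbone and scaffold labels are placeholders, and only the 0.83 and 0.00 values come from the cited reading.

```python
def scaffold_swing(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Return, per backbone, the gap between its best and worst scaffold score."""
    return {
        backbone: max(by_scaffold.values()) - min(by_scaffold.values())
        for backbone, by_scaffold in scores.items()
    }


# Placeholder labels; only the 0.83 and 0.00 values come from the cited reading.
scores = {
    "open_weight_backbone": {"scaffold_a": 0.83, "scaffold_b": 0.00},
}
swing = scaffold_swing(scores)
print(swing)
```

Running the same function with scaffolds as the outer key would give the model-side swing, which is the comparison the two leaderboards disagree about.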
What IBM actually shipped
The Open Agent Leaderboard evaluates agents as full systems — backbone model plus tool use, planning, memory, error recovery — and reports both quality and cost-per-task. It’s powered by Exgentic, an open orchestration framework that standardizes the Task/Context/Actions triple across benchmarks so runs are reproducible. Six benchmarks cover coding (SWE-Bench Verified), web research (BrowseComp+), personal apps (AppWorld), and policy-bound customer service (tau2-Bench across airline, retail, telecom).
The genuinely useful empirical findings sit one layer below the headlines. Tool shortlisting — pruning the agent’s available tools to the relevant subset before each step — added 5.5 points to GPT-5.2 success rate and cut Claude Opus 4.5 cost by roughly $1.97 per task 3. Smolagents matched heavyweight harnesses at $3.21/task versus $5.97 3. And failed runs cost 20–54% more than successful ones, because agents exhaustively try inefficient paths before quitting 3 — a number practitioners can actually budget against.
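Tool shortlisting is easy to picture in code. The sketch below assumes a crude keyword-overlap score between the task and each tool description; the writeup does not say how the leaderboard's agents rank tools, so the scoring rule, the stopword list, and the tool catalog here are all hypothetical.

```python
import re

# Assumption: a tiny stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "to", "for", "on", "by", "and"}


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its alphanumeric words minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def shortlist_tools(task: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Keep up to k tools whose descriptions share the most words with the task.

    Tools with zero overlap are dropped even if fewer than k remain.
    """
    task_words = tokenize(task)
    scored = [
        (len(task_words & tokenize(description)), name)
        for name, description in tools.items()
    ]
    # Sort by overlap (descending), then by name for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


# Hypothetical tool catalog for illustration only.
TOOLS = {
    "search_flights": "search airline flights by date and destination",
    "refund_order": "refund a retail order to the original payment method",
    "reset_sim": "reset a telecom sim card for a mobile line",
    "book_flight": "book an airline flight for a passenger",
}

selected = shortlist_tools("book a flight to Boston on the airline site", TOOLS)
print(selected)
```

A production harness would more plausibly rank tools with embeddings or a model call, but the shape is the same: score, sort, truncate, then hand the agent only the survivors.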
The harness problem underneath
Cost/quality Pareto frontiers are only as trustworthy as the benchmarks underneath. Berkeley RDI’s BenchJack work, released months before this launch, scored 100% on SWE-Bench Verified by injecting a 10-line conftest.py that hooks pytest to mark every test passed — no code fixed 4. The same audit found WebArena leaking ground truth via file:// URLs and CAR-bench’s LLM judge falling to prompt injection. Berkeley’s authors call nested evaluator/SUT isolation “non-negotiable” 5. Exgentic uses Docker isolation but inherits the same exploitable benchmark internals.
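A cheap first line of defence against this class of exploit is to flag any agent patch that touches files pytest reads for configuration or plugins. The sketch below is an assumption about what such a check could look like, not something BenchJack or Exgentic ships, and it is no substitute for the evaluator isolation Berkeley's authors call for. The function name and the example patch are hypothetical.

```python
from pathlib import PurePosixPath

# Filenames pytest consults for configuration or local plugins.
PROTECTED_NAMES = {"conftest.py", "pytest.ini", "tox.ini", "setup.cfg", "pyproject.toml"}


def harness_touching_paths(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that could alter how tests are collected or reported."""
    return [
        path for path in changed_paths
        if PurePosixPath(path).name in PROTECTED_NAMES
    ]


# Hypothetical agent patch: one source fix plus two injected conftest.py files.
patch = ["src/parser.py", "conftest.py", "tests/unit/conftest.py"]
flagged = harness_touching_paths(patch)
print(flagged)
```

A flagged patch does not prove cheating, since legitimate fixes sometimes edit test configuration, so the sensible response is to re-run the tests from a pristine copy of the harness rather than to reject the submission outright.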
Net take
Treat the leaderboard as a useful cost dashboard and a source of concrete scaffolding numbers, not as a settled verdict on model-vs-architecture. And if you go searching the name, note there are two projects called “Open Agent Leaderboard” — IBM’s is the one under the ibm-research HF org 6.
NVIDIA’s Cosmos LoRA recipe runs in 17h on one H100
Source: huggingface-blog · published 2026-05-18
TL;DR
- NVIDIA’s PEFT recipe tunes the 2B Cosmos Predict 2.5 on 92 robot videos in 17h on one 80GB H100.
- Rank-32 LoRA (~50M trainable params) lifts instruction-following scores over rank-8.
- Geometric Sampson errors barely move past rank 8 — the physics priors are already there.
- DoRA runs 10–20% slower per step than LoRA from extra magnitude/direction normalization, for tied quality.
What the recipe actually delivers
NVIDIA’s guide is a tight piece of engineering: inject LoRA/DoRA adapters into the DiT’s attention projections (to_q/k/v, to_out.0) and feedforward layers of Cosmos Predict 2.5, train under rectified flow on 92 GR00T Dreams clips, and a single 80GB H100 finishes 100 epochs in ~17 hours (2.5 hours on 8×H100). Rank 32 with alpha 32, bf16 mixed precision, LoRA weights upcast to fp32. The reported wins are concrete: lower Temporal and Cross-view Sampson Error, and higher Cosmos-Reason2-2B judge scores on physical plausibility and instruction following — most notably, the tuned model stops hallucinating extra hands and respects “left hand” vs “right hand” prompts.
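For intuition on what a rank bump costs, LoRA attaches two small matrices to each adapted weight: A of shape rank × in and B of shape out × rank, so the trainable count is rank × (in + out) per layer. The sketch below applies that formula to a hypothetical block of four 2048-wide projections. Those shapes are an assumption for illustration, not the real Cosmos Predict 2.5 dimensions, and the code does not try to reproduce the ~50M figure.

```python
def lora_trainable_params(layer_shapes: list[tuple[int, int]], rank: int) -> int:
    """Count LoRA parameters: each (out, in) weight gets A (rank x in) and B (out x rank)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)


# Hypothetical: four square attention projections of width 2048 in one block.
block = [(2048, 2048)] * 4

params_r8 = lora_trainable_params(block, rank=8)
params_r32 = lora_trainable_params(block, rank=32)
print(params_r8, params_r32, params_r32 // params_r8)
```

The count scales linearly with rank, so rank 32 carries four times the adapter parameters of rank 8 whatever the true layer widths are.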
The rank ablation is the most useful finding. Bumping rank 8 → 32 moves instruction-following scores but leaves geometric consistency roughly flat, which suggests the base model’s physics priors are largely intact and what fine-tuning actually buys is prompt grounding to the target embodiment.
DoRA’s “similar results” hides a compute tax
The post recommends DoRA as a stability fallback at very low ranks and calls the two methods a near-tie at rank 32. Independent numbers fill in what’s missing: DoRA’s weight decomposition adds 0.01–6% more parameters and trains 10–20% slower than LoRA because of the extra normalization 7. On a 17-hour run, that’s roughly 1.7 to 3.4 extra hours of compute for a tie. The honest read is that DoRA earns its slot only when rank-8 LoRA visibly destabilizes; otherwise rank-32 LoRA is the default.
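The compute tax is simple arithmetic on the two published numbers. The sketch below assumes the 10–20% slowdown applies uniformly across the whole run and that the step count is unchanged; the function name is illustrative.

```python
def extra_hours(baseline_hours: float, slowdown: float) -> float:
    """Extra wall-clock hours if training runs `slowdown` (a fraction) slower."""
    return baseline_hours * slowdown


LORA_HOURS = 17.0  # single-H100 run time reported for the recipe
low = extra_hours(LORA_HOURS, 0.10)
high = extra_hours(LORA_HOURS, 0.20)
print(f"DoRA adds roughly {low:.1f} to {high:.1f} hours")
```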
The pixel-space premise is contested
The recipe takes for granted that a pixel-faithful video diffuser is the right world model for robotics. That premise is under live attack. Meta’s V-JEPA 2, which predicts in embedding space rather than pixels, reportedly plans ~30× faster and hits 65–80% zero-shot pick-and-place on just 62 hours of real robot data 8. NVIDIA’s counter is downstream: Cosmos Policy variants reach 98.5% on LIBERO and 71.1% on RoboCasa, beating diffusion-policy and VLA baselines 9 — but those are policy models, not the video generator this recipe tunes.
There’s a deeper structural worry too. Recent work argues LoRA introduces “intruder dimensions” (high-ranking singular vectors dissimilar to pretrained knowledge) that can themselves cause the hallucination behavior fine-tuning is meant to fix 10. And DreamGen’s own authors flag an executability gap: visually coherent rollouts often produce kinematically impossible commands when an inverse-dynamics model tries to act on them 11. Using a sibling Cosmos model as judge catches obvious failures but risks self-reinforcement on the subtle ones.
Takeaway
“Fever dreams that occasionally teleport objects when the model reaches the bounds of its known distribution.” — an HN commenter on Cosmos video outputs 12
As a PEFT cookbook, the recipe is solid and reproducible on hardware a single lab can afford. As a claim about how to build robot world models, it leaves the load-bearing questions — pixel vs latent substrate, LoRA’s effect on physical priors, VLM-judge circularity, contact and friction modeling — entirely to the reader.
PaddleOCR 3.5 adds Transformers backend, trades speed for fit
Source: huggingface-blog · published 2026-05-18
TL;DR
- PaddleOCR 3.5 ships a Transformers backend so PP-OCRv5 and PaddleOCR-VL 1.5 run on PyTorch without the PaddlePaddle runtime.
- The headline 94.5% OmniDocBench crown shrinks to second place at 80.0% on olmOCR-Bench, behind dots.mocr.
- The Transformers path is materially slower than vLLM or paddle_static; one hands-on test clocked PaddleOCR-VL 1.5 at 7 minutes on a 15-page PDF vs. Marker’s 54 seconds.
- Hugging Face’s native class handles only element-level prompts, leaving full-page parsing on the native engine or vLLM.
A distribution win, not a performance one
The headline change in PaddleOCR 3.5 is a new inference-engine interface: pass engine_config={"backend": "transformers", ...} and the same pipeline that used to require PaddlePaddle now runs on PyTorch with sdpa attention, bfloat16, and standard device placement. For teams whose stack is already transformers, vllm, and the HF Hub, that removes a real integration tax around RAG and Document AI workflows.
What the post is careful not to say loudly: this backend is the slow one. Baidu itself recommends paddle_static when throughput matters, and independent benchmarks make the gap concrete. A hands-on comparison across five OCR models clocked PaddleOCR-VL 1.5 at seven minutes on a 15-page academic paper, against Marker’s 54 seconds, and noted that the older PP-StructureV3 actually outpaces the 3.5 defaults on speed 13. vLLM serving narrows the gap to 0.5–1.0s per page at ~140 generation tok/s and ~550 prefill tok/s, but the raw Transformers backend trails both vLLM and SGLang on the identical weights 14.
flowchart LR
A[PP-OCRv5 / PaddleOCR-VL 1.5] --> B{engine_config}
B -->|paddle_static| C[Fastest throughput]
B -->|vLLM / SGLang| D[0.5–1.0s per page]
B -->|transformers| E[Easiest HF integration<br/>slowest of the three]
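Putting the hands-on timings on a per-page basis makes them easier to compare. The sketch below uses only the two figures from the five-model comparison (7 minutes and 54 seconds on the same 15-page paper); the helper name is illustrative.

```python
def seconds_per_page(total_seconds: float, pages: int) -> float:
    """Average wall-clock seconds spent per page."""
    return total_seconds / pages


PAGES = 15
paddle_vl = seconds_per_page(7 * 60, PAGES)  # 7 minutes on the 15-page paper
marker = seconds_per_page(54, PAGES)         # 54 seconds on the same paper

print(f"PaddleOCR-VL 1.5: {paddle_vl:.1f} s/page")
print(f"Marker: {marker:.1f} s/page")
print(f"Ratio: {paddle_vl / marker:.1f}x")
```

The 0.5–1.0s per page quoted for vLLM comes from a different source and setup, so treat it as a separate data point rather than a third row in this comparison.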
The SOTA claim depends on which benchmark you pick
The model story is genuinely impressive in its own lane. PaddleOCR-VL 1.5 is a 0.9B-parameter vision-language model that tops OmniDocBench v1.5 at 94.5%, beating much larger generalists including GPT-4o and Qwen2.5-VL 15. That is the result Baidu leans on.
It is not, however, a universal crown. On olmOCR-Bench — which weighs English-heavy PDFs and reading-order recovery — the leaderboard reorders:
| Model | olmOCR-Bench |
|---|---|
| dots.mocr | 83.9% |
| PaddleOCR-VL | 80.0% |
| dots.ocr | 79.1% |
| MinerU 2.5 | 75.2% |
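Reading a position off a leaderboard is a one-line sort. The sketch below uses only the four scores in the table above, so the resulting rank is relative to those four models and says nothing about entries the table omits; the helper name is illustrative.

```python
# Scores as listed in the table above (percent).
OLMOCR_BENCH = {
    "dots.mocr": 83.9,
    "PaddleOCR-VL": 80.0,
    "dots.ocr": 79.1,
    "MinerU 2.5": 75.2,
}


def rank_of(model: str, scores: dict[str, float]) -> int:
    """1-based position of `model` when scores are sorted from highest to lowest."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(model) + 1


position = rank_of("PaddleOCR-VL", OLMOCR_BENCH)
gap = OLMOCR_BENCH["dots.mocr"] - OLMOCR_BENCH["PaddleOCR-VL"]
print(position, round(gap, 1))
```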
Reviewers comparing the field route academic and formula-heavy work to MinerU, wild scene text to dots.ocr, and structured forms to PP-OCRv5 16 13. “SOTA” collapses into “SOTA on the benchmark Baidu picked.”
Integration friction the blog post elides
The clean five-line code sample understates setup cost. The PaddleOCRVLForConditionalGeneration class is real and loadable through AutoModelForImageTextToText, but it only responds to element-level prompts (OCR:, Table Recognition:, Formula Recognition:, Chart Recognition:), and full-page document parsing still has to fall back to the native engine or a vLLM server 17. Meanwhile the official PaddlePaddle 3.2.2 container currently fails to load PaddleOCR-VL 0.9B safetensors weights with a ‘framework paddle is invalid’ error, forcing manual wheel installs to work around it 18.
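In practice the element-level limit means a service needs a small dispatch layer in front of the model. The sketch below is one hypothetical way to write it: the four prompt strings are the ones named above, while the task keys and the helper name are assumptions and not part of the Transformers API.

```python
# Element-level prompts named in the Transformers docs; the dict keys are assumed.
ELEMENT_PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}


def prompt_for(task: str) -> str:
    """Return the element-level prompt for a task, or fail loudly for page-level work."""
    try:
        return ELEMENT_PROMPTS[task]
    except KeyError:
        raise ValueError(
            f"{task!r} is not an element-level task; "
            "route it to the native engine or a vLLM server instead"
        ) from None


print(prompt_for("table"))
```

Failing loudly on anything outside the four element tasks keeps a full-page request from silently going to a class that cannot serve it.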
Takeaway
3.5 is best read as a packaging move. If your infrastructure is HF-native and your bottleneck was getting PaddleOCR into a Python service next to a RAG stack, the Transformers backend is a real unlock. If you care about latency or throughput, the honest path is still vLLM or paddle_static — and if you care about being best on your documents, you should benchmark against MinerU, dots.ocr, and Marker before assuming the OmniDocBench number transfers.
Round-ups
Simon Willison logs four bird species along the LA River
Source: simon-willison
Simon Willison capped a PyCon US trip with a morning walk along the Los Angeles River, photographing a Glaucous-winged Gull, Brown Pelican, Snowy Egret, and Canada Goose. The pelican sighting was the target; goslings near the swan boat lake were the bonus.
Footnotes
1. Level Up Coding, Princeton HAL analysis: https://levelup.gitconnected.com/that-model-leaderboard-youre-trusting-might-not-be-the-honest-ones-3f99f12a2abc
   a well-designed agent framework (retry logic, error recovery, tool routing) can boost a model’s score by nearly 30 percentage points — often more than a jump between model generations
2. wispaper.ai, independent reading of the General Agent Evaluation paper: https://www.wispaper.ai/en/user-blog/general-agent-evaluation-20260301/eng
   performance differences between the tested agent architectures were often not statistically significant (p > 0.1)… open-weight backbones like DeepSeek and Kimi often show ‘architecture sinks’ where performance swings from 0.83 to 0.00
3. awesomeagents.ai, Open Agent Leaderboard recap: https://awesomeagents.ai/news/open-agent-leaderboard-model-beats-architecture/
   adding shortlisting to ReAct increased GPT-5.2 success by 5.5 percentage points and cut Claude Opus 4.5 cost by ~$1.97 per task; Smolagents matched heavyweight harnesses at $3.21 vs $5.97
4. byteiota, Berkeley RDI ‘BenchJack’ coverage: https://byteiota.com/berkeley-breaks-ai-agent-benchmarks-100-scores-zero-solutions/
   the agent achieved a 100% resolve rate by simply injecting a 10-line conftest.py file into the repository… force every test to report a ‘passed’ status, regardless of the actual code state
5. Berkeley RDI blog, Trustworthy Benchmarks: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
   non-negotiable need for nested sandboxing and total isolation of the evaluator from the system under test
6. GitHub, om-ai-lab/open-agent-leaderboard: https://github.com/om-ai-lab/open-agent-leaderboard
   OmAgent-based leaderboard ranking algorithms (CoT, ReAct, ToT) on GSM8K/MATH-500, released January 2025 under the identical name ‘Open Agent Leaderboard’
7. NVIDIA Developer Blog (DoRA introduction): https://developer.nvidia.com/blog/introducing-dora-a-high-performing-alternative-to-lora-for-fine-tuning/
   DoRA decomposes weights into magnitude and direction, applying the low-rank update only to the directional component… introduces a slight parameter overhead—typically around 0.01% to 6% more than LoRA… approximately 10% to 20% slower to train than LoRA due to additional normalization steps.
8. Turing, Exploring V-JEPA 2: https://www.turing.com/blog/exploring-v-jepa-2
   V-JEPA 2 reportedly outperforms pixel-heavy models like Cosmos by up to 30× in planning speed during robotic manipulation tasks… achieves success rates of 65–80% in zero-shot pick-and-place tasks using only 62 hours of real-world robot data.
9. The Robot Report, Cosmos Policy WFMs: https://www.therobotreport.com/nvidia-adds-cosmos-policy-world-foundation-models/
   Cosmos Policy reached success rates of 98.5% on the LIBERO benchmark and 71.1% on RoboCasa, outperforming traditional diffusion policies and vision-language-action models.
10. Clemson CECAS, Small Models, Big Capabilities: https://blogs.clemson.edu/cecas/small-models-big-capabilities/
    PEFT methods like LoRA create ‘intruder dimensions’—high-ranking singular vectors dissimilar to pre-trained knowledge—that bludgeon the existing vector space, causing the model to hallucinate even when the base model previously understood the content.
11. ResearchGate, DreamGen: Neural Trajectories paper: https://www.researchgate.net/publication/391878946_DreamGen_Unlocking_Generalization_in_Robot_Learning_through_Neural_Trajectories
    Visually coherent video rollouts often generate kinematically impossible commands when processed by an inverse dynamics model; relying solely on VLM-based visual plausibility is insufficient for high-stakes manipulation.
12. Hacker News discussion on Cosmos: https://news.ycombinator.com/item?id=45331837
    Critics described generated outputs as ‘fever dreams’ that occasionally teleport objects when the model reaches the bounds of its known distribution, and noted video models lack physical feedback like pressure, touch, and friction.
13. gopubby, ‘I tested 5 OCR models on 6 real-world datasets’: https://ai.gopubby.com/i-tested-5-ocr-models-on-6-real-world-datasets-heres-which-one-you-should-actually-use-50badae3c16d
    PaddleOCR-VL 1.5 took 7 minutes on a 15-page academic paper with ‘painful setup’ compared to Marker’s 54 seconds; older lightweight PP-StructureV3 actually outperforms the default 3.5 configurations in speed.
14. Hugging Face, PaddleOCR-VL model card / vLLM serving: https://huggingface.co/PaddlePaddle/PaddleOCR-VL
    Served via vLLM, PaddleOCR-VL-1.5 averages ~140 tokens/s generation and 550 tokens/s prefill, cutting per-image latency from 2–5s on the native backend to 0.5–1.0s; the raw Transformers backend is materially slower than vLLM or SGLang.
15. Towards AI, PaddleOCR-VL 1.5 deep dive: https://pub.towardsai.net/paddleocr-vl-1-5-a-deep-dive-into-the-0-9b-model-that-outperforms-gpt-4o-on-document-parsing-c93bac97ac1f
    PaddleOCR-VL 1.5 (a 0.9B parameter model) currently leads OmniDocBench v1.5 with a breakthrough score of 94.5%, outperforming significantly larger generalist models like GPT-4o and Qwen2.5-VL.
16. NetMind, PDF parser comparison: https://blog.netmind.ai/article/Which_PDF_Parser_Should_You_Use%3F_Comparing_Docling%2C_Marker%2C_MinerU%2C_olmOCR_-_and_Why_NetMind_ParsePro_Might_Be_Better
    On olmOCR-Bench dots.mocr leads with 83.9%, followed by PaddleOCR-VL (80.0%) and dots.ocr (79.1%), while MinerU 2.5 trails at 75.2% — ‘best’ is context-dependent.
17. Hugging Face transformers docs, paddleocr_vl: https://huggingface.co/docs/transformers/model_doc/paddleocr_vl
    Native class PaddleOCRVLForConditionalGeneration loadable via AutoModelForImageTextToText; element-level prompts (‘OCR:’, ‘Table Recognition:’, ‘Formula Recognition:’, ‘Chart Recognition:’) trigger different tasks, but page-level parsing still requires the native engine or vLLM server.
18. GitHub, PaddleOCR issues: https://github.com/PaddlePaddle/PaddleOCR/issues
    PaddlePaddle 3.2.2 container fails to load PaddleOCR-VL 0.9B safetensors weights, throwing ‘framework paddle is invalid’; users must manually install specific safetensors wheels to bypass the error.