Wei (Jack) Sun

IBM's leaderboard rebutted, NVIDIA's rank-8 enough, PaddleOCR slides to third

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

Sources

The Open Agent Leaderboard huggingface.co

Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation huggingface.co

PaddleOCR 3.5: Running OCR and Document Parsing Tasks with a Transformers Backend huggingface.co

Glaucous-winged Gull, Brown Pelican, Snowy Egret, Canada Goose simonwillison.net

Simon Willison capped a PyCon US trip with a morning walk along the Los Angeles River, photographing a Glaucous-winged Gull, Brown Pelican, Snowy Egret, and Canada Goose. The pelican sighting was the target; goslings near the swan boat lake were the bonus.

References

byteiota — Berkeley RDI ‘BenchJack’ coverage byteiota.com

the agent achieved a 100% resolve rate by simply injecting a 10-line conftest.py file into the repository… force every test to report a ‘passed’ status, regardless of the actual code state

Berkeley RDI blog — Trustworthy Benchmarks rdi.berkeley.edu

non-negotiable need for nested sandboxing and total isolation of the evaluator from the system under test

GitHub — om-ai-lab/open-agent-leaderboard github.com

OmAgent-based leaderboard ranking algorithms (CoT, ReAct, ToT) on GSM8K/MATH-500, released January 2025 under the identical name ‘Open Agent Leaderboard’

Level Up Coding — Princeton HAL analysis levelup.gitconnected.com

a well-designed agent framework (retry logic, error recovery, tool routing) can boost a model’s score by nearly 30 percentage points — often more than a jump between model generations

wispaper.ai — independent reading of the General Agent Evaluation paper wispaper.ai

performance differences between the tested agent architectures were often not statistically significant (p > 0.1)… open-weight backbones like DeepSeek and Kimi often show ‘architecture sinks’ where performance swings from 0.83 to 0.00

awesomeagents.ai — Open Agent Leaderboard recap awesomeagents.ai

adding shortlisting to ReAct increased GPT-5.2 success by 5.5 percentage points and cut Claude Opus 4.5 cost by ~$1.97 per task; Smolagents matched heavyweight harnesses at $3.21 vs $5.97

NVIDIA Developer Blog (DoRA introduction) developer.nvidia.com

DoRA decomposes weights into magnitude and direction, applying the low-rank update only to the directional component… introduces a slight parameter overhead—typically around 0.01% to 6% more than LoRA… approximately 10% to 20% slower to train than LoRA due to additional normalization steps.
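
As a reading aid for the quote above, the decomposition can be written out as below. This is a sketch of the standard DoRA formulation as commonly stated, not a transcription of the NVIDIA post; the symbols W_0, B, A, m, d, k and r are notation assumed here and do not appear in the excerpt.

```latex
% Assumed notation: W_0 is the frozen pretrained weight (d x k),
% B (d x r) and A (r x k) are the trainable low-rank factors,
% m (1 x k) is the trainable magnitude vector,
% and \|\cdot\|_c is the column-wise vector norm.
\[
  W' \;=\; m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c},
  \qquad m \text{ initialized to } \lVert W_0 \rVert_c .
\]
% The low-rank update BA only changes the direction term;
% the magnitude vector m adds k parameters per adapted matrix on top of LoRA's r(d + k).
```

The extra normalization in the denominator is the step the quote blames for the slower training, and the magnitude vector is the source of the small parameter overhead.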

Turing — Exploring V-JEPA 2 turing.com

V-JEPA 2 reportedly outperforms pixel-heavy models like Cosmos by up to 30× in planning speed during robotic manipulation tasks… achieves success rates of 65–80% in zero-shot pick-and-place tasks using only 62 hours of real-world robot data.

The Robot Report — Cosmos Policy WFMs therobotreport.com

Cosmos Policy reached success rates of 98.5% on the LIBERO benchmark and 71.1% on RoboCasa, outperforming traditional diffusion policies and vision-language-action models.

Clemson CECAS — Small Models, Big Capabilities blogs.clemson.edu

PEFT methods like LoRA create ‘intruder dimensions’—high-ranking singular vectors dissimilar to pre-trained knowledge—that bludgeon the existing vector space, causing the model to hallucinate even when the base model previously understood the content.

ResearchGate — DreamGen: Neural Trajectories paper researchgate.net

Visually coherent video rollouts often generate kinematically impossible commands when processed by an inverse dynamics model; relying solely on VLM-based visual plausibility is insufficient for high-stakes manipulation.

Hacker News discussion on Cosmos news.ycombinator.com

Critics described generated outputs as ‘fever dreams’ that occasionally teleport objects when the model reaches the bounds of its known distribution, and noted video models lack physical feedback like pressure, touch, and friction.

Towards AI — PaddleOCR-VL 1.5 deep dive pub.towardsai.net

PaddleOCR-VL 1.5 (a 0.9B parameter model) currently leads OmniDocBench v1.5 with a breakthrough score of 94.5%, outperforming significantly larger generalist models like GPT-4o and Qwen2.5-VL.

NetMind — PDF parser comparison blog.netmind.ai

On olmOCR-Bench dots.mocr leads with 83.9%, followed by PaddleOCR-VL (80.0%) and dots.ocr (79.1%), while MinerU 2.5 trails at 75.2% — ‘best’ is context-dependent.

gopubby — ‘I tested 5 OCR models on 6 real-world datasets’ ai.gopubby.com

PaddleOCR-VL 1.5 took 7 minutes on a 15-page academic paper with ‘painful setup’ compared to Marker’s 54 seconds; older lightweight PP-StructureV3 actually outperforms the default 3.5 configurations in speed.

GitHub — PaddleOCR issues github.com

PaddlePaddle 3.2.2 container fails to load PaddleOCR-VL 0.9B safetensors weights, throwing ‘framework paddle is invalid’; users must manually install specific safetensors wheels to bypass the error.

Hugging Face transformers docs — paddleocr_vl huggingface.co

Native class PaddleOCRVLForConditionalGeneration loadable via AutoModelForImageTextToText; element-level prompts (‘OCR:’, ‘Table Recognition:’, ‘Formula Recognition:’, ‘Chart Recognition:’) trigger different tasks, but page-level parsing still requires the native engine or vLLM server.
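
The element-level prompts listed above can be kept in a small lookup table. The sketch below is illustrative only: the four prompt strings come from the entry, while `TASK_PROMPTS`, `build_prompt`, and the short task keys are hypothetical names, and no model is loaded.

```python
# Prompt prefixes quoted in the entry above. The dictionary keys and the
# helper function are hypothetical names chosen for this illustration.
TASK_PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}


def build_prompt(task: str) -> str:
    """Return the element-level prompt prefix for a task key."""
    key = task.strip().lower()
    if key not in TASK_PROMPTS:
        known = ", ".join(sorted(TASK_PROMPTS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}")
    return TASK_PROMPTS[key]


if __name__ == "__main__":
    for name in TASK_PROMPTS:
        print(name, "->", build_prompt(name))
```

Rejecting unknown keys up front avoids silently sending an unsupported prompt. Page-level parsing is outside this sketch, since the entry says it still needs the native engine or a vLLM server.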

Hugging Face — PaddleOCR-VL model card / vLLM serving huggingface.co

Served via vLLM, PaddleOCR-VL-1.5 averages ~140 tokens/s generation and 550 tokens/s prefill, cutting per-image latency from 2–5s on the native backend to 0.5–1.0s; the raw Transformers backend is materially slower than vLLM or SGLang.
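
The latency ranges quoted above imply a range of speedups rather than a single figure. A minimal sketch of that arithmetic follows, assuming the two ranges are independent so that any "before" value can pair with any "after" value; `speedup_bounds` is a hypothetical helper name.

```python
def speedup_bounds(before, after):
    """Return (lowest, highest) speedup implied by two latency ranges.

    `before` and `after` are (min_seconds, max_seconds) tuples. The lowest
    speedup pairs the fastest "before" with the slowest "after"; the highest
    pairs the slowest "before" with the fastest "after".
    """
    before_lo, before_hi = before
    after_lo, after_hi = after
    return before_lo / after_hi, before_hi / after_lo


if __name__ == "__main__":
    # 2-5 s per image on the native backend, 0.5-1.0 s via vLLM (from the entry).
    low, high = speedup_bounds((2.0, 5.0), (0.5, 1.0))
    print(f"{low:.1f}x to {high:.1f}x")  # prints "2.0x to 10.0x"
```

Read this way, the quoted figures support anything from a 2x to a 10x reduction in per-image latency, depending on which ends of the ranges apply to a given workload.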

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare