Agent harnesses and local leaderboards mature — and quietly hide their tradeoffs
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Notion’s cofounder and head of AI pull back the curtain on finally shipping the Knowledge Work AI agents the world has been waiting for.
(AINews) Top Local Models List - April 2026 latent.space
a quiet day lets us check in on the local models scene
References
Metomic — ‘The NotionAI Security Gap’ metomic.io
Attackers can hide malicious prompts in PDF files to trick agents into exfiltrating ARR or client data to external servers… the ‘lethal trifecta’ — access to private data, exposure to untrusted inputs, and the ability to communicate externally — remains an existential risk.
Martia (Medium) — Progressive Disclosure in AI Agents medium.com
Instead of front-loading 100+ tools at initialization, the system reveals tool details only when needed… Layer 1 (Discovery) sees only lightweight metadata; Layer 2 (Activation) injects full schemas into context.
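The two-layer pattern described above can be sketched as a lazy tool registry. This is a minimal illustration of the idea, not Martia's actual implementation; all names and schemas here are hypothetical:

```python
# Sketch of progressive disclosure: the agent first sees only lightweight
# metadata (Layer 1); a tool's full schema is injected into context only
# when the agent activates that tool (Layer 2).
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    summary: str                      # Layer 1: a few tokens, always visible
    schema: dict = field(repr=False)  # Layer 2: expensive, revealed on demand

class ToolRegistry:
    def __init__(self, tools: list[Tool]):
        self._tools = {t.name: t for t in tools}
        self._active: set[str] = set()

    def discover(self) -> list[dict]:
        """Layer 1 (Discovery): lightweight metadata only."""
        return [{"name": t.name, "summary": t.summary}
                for t in self._tools.values()]

    def activate(self, name: str) -> dict:
        """Layer 2 (Activation): inject the full schema for one chosen tool."""
        self._active.add(name)
        return self._tools[name].schema

# Hypothetical tools, standing in for a 100+-tool catalog.
registry = ToolRegistry([
    Tool("search_issues", "Search GitHub issues",
         {"type": "object", "properties": {"query": {"type": "string"}}}),
    Tool("create_pr", "Open a pull request",
         {"type": "object", "properties": {"title": {"type": "string"},
                                           "body": {"type": "string"}}}),
])

print(registry.discover())             # small: names plus one-line summaries
print(registry.activate("create_pr"))  # full JSON schema, only when needed
```

The point is the asymmetry: discovery cost stays flat as the catalog grows, while activation cost is paid only for the handful of tools a given task actually touches.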
BiggoFinance — MCP vs CLI debate coverage finance.biggo.com
Connecting a standard GitHub MCP server can dump 55,000 tokens into the context window, a 275x increase over using a simple gh CLI command… CLI agents have shown 33% better token efficiency in complex multi-step debugging workflows.
eesel.ai — Notion AI Security & Privacy Practices eesel.ai
For Enterprise plan users, Notion employs zero-retention APIs… Free, Plus, and Business plans may have prompts and generated content retained by subprocessors for up to 30 days. Notion lacks end-to-end encryption, retaining the technical ability to access user content.
arXiv — Headroom Eval methodology paper arxiv.org
Tasks with a 50% success probability offer the highest discriminative power… centering evaluations around a 30–70% mid-range difficulty filter reduces the number of required evaluation tasks by over 40% while maintaining ranking fidelity. Separately, UC Berkeley researchers demonstrated that major agent benchmarks could be hacked.
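The intuition behind that 50% sweet spot is easy to check numerically: a task a model passes with probability p is a Bernoulli trial, and its variance p(1 - p), a simple proxy for how much one run discriminates between models, peaks at p = 0.5. A toy illustration (not the paper's methodology):

```python
# A task solved with probability p is a Bernoulli trial; its variance
# p * (1 - p) is a rough proxy for how much a single run tells you
# about model ability. It is maximized at p = 0.5 and near zero for
# tasks that are almost always solved or almost never solved.
def discriminative_power(p: float) -> float:
    return p * (1 - p)

probs = [0.05, 0.30, 0.50, 0.70, 0.95]
for p in probs:
    print(f"p={p:.2f}  power={discriminative_power(p):.4f}")

# The 30-70% band retains most of the peak's power, which is why a
# mid-range difficulty filter can cut task count without losing
# ranking fidelity.
best = max(probs, key=discriminative_power)
print("most informative:", best)  # 0.5
```

Trivially-easy and near-impossible tasks contribute almost nothing to a ranking, so dropping them shrinks the suite at little cost.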
Medium (codetodeploy) — Notion AI in 2026 medium.com
Over 21,000 agents created within the first weeks of public beta; Remote reported saving 20 hours per week on IT triage with 95% accuracy. Reviews note significant latency degradation when agents operate within databases that have accumulated years of data.
r/LocalLLaMA — ‘Open Sourced LLM Ranking 2026’ thread reddit.com
Qwen3-Coder-Next is considered the undisputed king of coding models in early 2026… MiniMax M2.7 has gained a reputation as the ‘accessible Sonnet at home’ for agentic and tool-heavy workflows.
The New Stack — ‘China Leads Open AI Models’ thenewstack.io
Chinese-developed models accounted for roughly 41% of global Hugging Face downloads in the year ending February 2026, surpassing the U.S. share of 36.5%.
a16z — ‘Asserting American Leadership in Open Source AI’ a16z.com
The White House and the House Committee on the CCP launched investigations into ‘model distillation,’ alleging Chinese labs are training lower-cost versions without equivalent safety guardrails.
Ismat Samadov — ‘Why I Stopped Trusting LLM Benchmarks’ ismatsamadov.com
MMLU-CF shows that model performance can drop by as much as 14–16 points when test questions are slightly rephrased to bypass training data leakage… small wording changes shuffle model rankings significantly.
DecodesFuture — ‘Best GPU for Local LLMs 2026’ decodesfuture.com
The 5090’s 32GB limit remains a bottleneck for 70B+ models… a Mac with 128GB unified memory can run a 70B model entirely in memory at 15–20 tok/s, the only consumer-accessible desktop capable of running 70B+ frontier models without a performance cliff.