Wei (Jack) Sun

Agent harnesses and local leaderboards mature — and quietly hide their tradeoffs

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

Sources

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion latent.space

Notion’s cofounder and head of AI peel back the curtains to talk about finally shipping the Knowledge Work AI agents the world has been waiting for.

(AINews) Top Local Models List - April 2026 latent.space

a quiet day lets us check in on the local models scene

References

Metomic — ‘The NotionAI Security Gap’ metomic.io

Attackers can hide malicious prompts in PDF files to trick agents into exfiltrating ARR or client data to external servers… the ‘lethal trifecta’ — access to private data, exposure to untrusted inputs, and the ability to communicate externally — remains an existential risk.

Martia (Medium) — Progressive Disclosure in AI Agents medium.com

Instead of front-loading 100+ tools at initialization, the system reveals tool details only when needed… Layer 1 (Discovery) sees only lightweight metadata; Layer 2 (Activation) injects full schemas into context.
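
The two layers described in this finding can be sketched as a small registry. This is only an illustration of the idea under stated assumptions: the `Tool` and `ToolRegistry` names, the field layout, and the payload shape are hypothetical and are not taken from the article or from any Notion or MCP API.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    summary: str   # Layer 1 (Discovery): lightweight metadata
    schema: dict   # Layer 2 (Activation): full parameter schema


class ToolRegistry:
    """Hypothetical registry that withholds full schemas until a tool is activated."""

    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}
        self._active = set()

    def discovery_view(self):
        # Layer 1: only names and one-line summaries are exposed.
        return [{"name": t.name, "summary": t.summary} for t in self._tools.values()]

    def activate(self, name):
        # Layer 2: mark the tool active and return its full schema.
        tool = self._tools[name]
        self._active.add(name)
        return tool.schema

    def context_payload(self):
        # Metadata for every tool, full schemas only for activated ones.
        return {
            "tools": self.discovery_view(),
            "schemas": {n: self._tools[n].schema for n in sorted(self._active)},
        }
```

With a hundred registered tools, `context_payload()` would carry a hundred short summaries but only as many full schemas as the agent has activated so far.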

BiggoFinance — MCP vs CLI debate coverage finance.biggo.com

Connecting a standard GitHub MCP server can dump 55,000 tokens into the context window, a 275x increase over using a simple gh CLI command… CLI agents have shown 33% better token efficiency in complex multi-step debugging workflows.
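
Taking the quoted figures at face value, the implied CLI baseline follows from simple division. The context-window size below is an assumption added purely for illustration; it does not come from the source.

```python
# Figures quoted in the finding above.
MCP_SCHEMA_TOKENS = 55_000
REPORTED_RATIO = 275

# Implied token cost of the equivalent gh CLI call.
implied_cli_tokens = MCP_SCHEMA_TOKENS / REPORTED_RATIO

# Assumption (not from the source): a 200,000-token context window.
ASSUMED_CONTEXT_WINDOW = 200_000
share_of_window = MCP_SCHEMA_TOKENS / ASSUMED_CONTEXT_WINDOW

print(f"Implied CLI cost: {implied_cli_tokens:.0f} tokens")
print(f"MCP share of assumed window: {share_of_window:.1%}")
```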

eesel.ai — Notion AI Security & Privacy Practices eesel.ai

For Enterprise plan users, Notion employs zero-retention APIs… Free, Plus, and Business plans may have prompts and generated content retained by subprocessors for up to 30 days. Notion lacks end-to-end encryption, retaining the technical ability to access user content.

arXiv — Headroom Eval methodology paper arxiv.org

Tasks with a 50% success probability offer the highest discriminative power… centering evaluations around a 30–70% mid-range difficulty filter reduces the number of required evaluation tasks by over 40% while maintaining ranking fidelity. UC Berkeley researchers demonstrated that major agent benchmarks could be hacked.
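
One way to see why mid-difficulty tasks discriminate best: a pass/fail outcome with success probability p has variance p(1 - p), which peaks at p = 0.5. The sketch below shows that and a 30–70% filter. It is a minimal illustration, not the paper's method; the function names and the sample pass rates are hypothetical.

```python
def outcome_variance(p):
    """Variance of a pass/fail (Bernoulli) outcome with success probability p."""
    return p * (1 - p)


def mid_range_filter(pass_rates, low=0.30, high=0.70):
    """Keep only tasks whose observed pass rate falls inside [low, high]."""
    return {task: p for task, p in pass_rates.items() if low <= p <= high}


# Hypothetical pass rates, for illustration only.
sample_pass_rates = {
    "task_a": 0.05,   # nearly always fails: little signal
    "task_b": 0.45,
    "task_c": 0.62,
    "task_d": 0.97,   # nearly always passes: little signal
}

kept = mid_range_filter(sample_pass_rates)
print(sorted(kept))            # tasks retained by the filter
print(outcome_variance(0.5))   # maximum possible variance
```

Tasks near 0% or 100% contribute almost no variance, so dropping them removes evaluation cost while losing little ranking information.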

Medium (codetodeploy) — Notion AI in 2026 medium.com

Over 21,000 agents created within the first weeks of public beta; Remote reported saving 20 hours per week on IT triage with 95% accuracy. Reviews note significant latency degradation when agents operate within databases that have accumulated years of data.

r/LocalLLaMA — ‘Open Sourced LLM Ranking 2026’ thread reddit.com

Qwen3-Coder-Next is considered the undisputed king of coding models in early 2026… MiniMax M2.7 has gained a reputation as the ‘accessible Sonnet at home’ for agentic and tool-heavy workflows.

The New Stack — ‘China Leads Open AI Models’ thenewstack.io

Chinese-developed models accounted for roughly 41% of global Hugging Face downloads in the year ending February 2026, surpassing the U.S. share of 36.5%.

a16z — ‘Asserting American Leadership in Open Source AI’ a16z.com

The White House and the House Committee on the CCP launched investigations into ‘model distillation,’ alleging Chinese labs are training lower-cost versions without equivalent safety guardrails.

Ismat Samadov — ‘Why I Stopped Trusting LLM Benchmarks’ ismatsamadov.com

MMLU-CF shows that model performance can drop by as much as 14–16 points when test questions are slightly rephrased to bypass training data leakage… small wording changes shuffle model rankings significantly.

DecodesFuture — ‘Best GPU for Local LLMs 2026’ decodesfuture.com

The 5090’s 32GB limit remains a bottleneck for 70B+ models… a Mac with 128GB unified memory can run a 70B model entirely in memory at 15–20 tok/s, the only consumer-accessible desktop capable of running 70B+ frontier models without a performance cliff.
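
The 32GB versus 128GB comparison can be checked with a rough weight-only estimate: parameters times bits per weight, divided by eight. This is a back-of-envelope sketch that ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The function names are hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in decimal GB."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, budget_gb):
    """True if the weights alone fit in the budget (no KV cache or overhead)."""
    return weight_memory_gb(params_billions, bits_per_weight) <= budget_gb


for bits in (16, 8, 4):
    need = weight_memory_gb(70, bits)
    print(f"70B at {bits}-bit: {need:.0f} GB | "
          f"fits 32 GB: {fits(70, bits, 32)} | fits 128 GB: {fits(70, bits, 128)}")
```

Even at 4 bits per weight, a 70B model needs about 35 GB for weights alone, which is already over a 32GB card, while a 128GB pool has room at 8-bit and 4-bit.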

© 2026 Wei (Jack) Sun · jacksunwei.me