Wei (Jack) Sun

Agent harnesses and local leaderboards mature — and quietly hide their tradeoffs

Notion's agent rebuild and the new Chinese-led local-models leaderboard both show real engineering progress wrapped in framings that elide the catches.

TL;DR

  • Notion rebuilt its agent stack 4–5 times; Agent Harness 2.0 filters 100+ tools per task via progressive disclosure.
  • The same interview soft-pedals a lethal-trifecta prompt-injection risk and 30-day prompt retention on non-Enterprise tiers.
  • Latent Space’s April local-models picks — Qwen, GLM, DeepSeek, MiniMax — match r/LocalLLaMA consensus and are all Chinese-origin.
  • Chinese models now drive ~41% of Hugging Face downloads vs 36.5% U.S., turning the leaderboard into a policy artifact.
  • Underlying benchmarks lose 14–16 points on rephrased questions, and frontier local picks still need a 128GB Mac Studio.

Today’s tech coverage is about what the practitioner-facing story leaves out. Notion’s engineering interview is a genuinely substantive look at building an agent harness — five rebuilds, tool-filtering via progressive disclosure, evals tuned to Item Response Theory headroom. It’s also an interview that walks past a textbook lethal-trifecta prompt-injection setup and a 30-day prompt retention default on non-Enterprise plans.

The local-models picks tell a parallel story. Latent Space’s April leaderboard converges with the r/LocalLLaMA consensus on Qwen, GLM, DeepSeek, and MiniMax — a list that is now overwhelmingly Chinese, mirroring a Hugging Face download share that has flipped past U.S. models and is starting to attract policy attention. The benchmarks holding the rankings up, meanwhile, drop 14–16 points the moment questions are rephrased, and “running it locally” still means a 128GB Mac Studio.

Both pieces are useful precisely because they take the surface claim seriously enough to show where it cracks.

Notion’s agent stack is real engineering — and a soft-pedaled threat model

Source: latent-space · published 2026-04-15

TL;DR

  • Notion rebuilt its agent system 4–5 times; “Agent Harness 2.0” filters 100+ tools per task via progressive disclosure.
  • Simon Last’s pro-CLI stance has hard backing: a GitHub MCP server can dump ~55k tokens of schema vs near-zero for gh.
  • The interview soft-pedals a “lethal trifecta” prompt-injection risk and 30-day prompt retention on non-Enterprise tiers.
  • Headroom evals targeting ~30% pass rates match Item Response Theory — but agent benchmarks are known to be gameable.

The architecture is real. The threat model is missing.

Simon Last and Sarah Sachs walk through four years of agent rebuilds — JavaScript-API coding agents that couldn’t write code, an XML block representation models couldn’t navigate, finally settling on Notion-flavored Markdown plus a SQLite abstraction the models actually grok. That’s a credible engineering arc. What the interview elides is what happens when you point such an agent at a workspace full of untrusted PDFs.

Metomic’s writeup of the “lethal trifecta” — private data access, untrusted input ingestion, and outbound communication — describes proof-of-concept exploits where a malicious PDF dropped into a workspace coaxes a Notion agent into exfiltrating client data through its own web-search tool 1. eesel adds a governance asterisk Sachs doesn’t mention: zero-retention APIs are Enterprise-only, while Plus and Business prompts may sit on subprocessor infrastructure for up to 30 days, and Notion retains the technical ability to read user content 2.

```mermaid
flowchart LR
    A[Private workspace data] --> B{Custom Agent}
    C[Untrusted PDF / web content] --> B
    B --> D[Web search + 100+ tools]
    D -. exfiltration .-> E((External endpoint))
```

For a product pitched as the “system of record” for knowledge work, the absence of E2EE is a meaningful caveat on the enterprise narrative.
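The trifecta framing reduces to a simple conjunction, which makes it easy to encode as a deployment-time gate: an agent is exfiltration-capable only when all three conditions hold at once. A minimal sketch (the capability flags and names are illustrative, not Notion's actual API):

```python
from dataclasses import dataclass

@dataclass
class AgentCapabilities:
    """Capability flags for an agent deployment (illustrative model)."""
    reads_private_data: bool       # e.g. workspace pages, databases
    ingests_untrusted_input: bool  # e.g. uploaded PDFs, fetched web pages
    communicates_externally: bool  # e.g. web search, outbound webhooks

def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three capabilities coexist;
    removing any one of them breaks the exfiltration chain."""
    return (caps.reads_private_data
            and caps.ingests_untrusted_input
            and caps.communicates_externally)

# A workspace agent with web search enabled trips the gate:
risky = AgentCapabilities(True, True, True)
# The same agent with outbound tools disabled does not:
sandboxed = AgentCapabilities(True, True, False)
```

The point of writing it down is the defensive corollary: you don't have to solve prompt injection to break the chain, you only have to drop one leg of the conjunction for a given task.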

Progressive disclosure, and why Last is right about CLIs

The “100+ tools, filtered per task” pattern Last describes has a name in the broader literature: progressive disclosure, formalized as a two-layer scheme where Discovery sees only lightweight metadata and Activation injects full schemas on demand 3. It’s converging across Anthropic’s Agent Skills and the MCP ecosystem.
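The two-layer pattern is easy to sketch: the model's context gets only name-plus-description stubs at discovery time, and a full schema is injected only when a tool is actually selected. A minimal sketch (the registry shape and schemas are hypothetical, not Notion's or MCP's actual wire format):

```python
# Hypothetical tool registry illustrating two-layer progressive disclosure.
TOOLS = {
    "search_pages": {
        "description": "Full-text search across workspace pages.",
        "schema": {  # full schema: only injected on activation
            "type": "object",
            "properties": {"query": {"type": "string"},
                           "limit": {"type": "integer", "default": 10}},
            "required": ["query"],
        },
    },
    "update_database_row": {
        "description": "Edit a row in a workspace database.",
        "schema": {
            "type": "object",
            "properties": {"database_id": {"type": "string"},
                           "row_id": {"type": "string"},
                           "fields": {"type": "object"}},
            "required": ["database_id", "row_id", "fields"],
        },
    },
}

def discovery_layer() -> list[dict]:
    """Layer 1 (Discovery): lightweight metadata only, no schemas in context."""
    return [{"name": n, "description": t["description"]}
            for n, t in TOOLS.items()]

def activate(name: str) -> dict:
    """Layer 2 (Activation): inject the full schema for one chosen tool."""
    return {"name": name, **TOOLS[name]}

stubs = discovery_layer()        # what the model sees up front
full = activate("search_pages")  # what it sees after choosing a tool
```

With 100+ tools, the context cost of the discovery layer grows with the number of descriptions, while the schema cost stays proportional to tools actually used per task.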

Last’s CLI bullishness lands harder when you check the receipts. Connecting a stock GitHub MCP server can inject roughly 55,000 tokens of schema before the agent does anything — a reported 275× overhead versus a gh CLI invocation, with CLI agents posting 33% better token efficiency on multi-step debugging 4. “Self-debugging,” in other words, is also self-economizing. That reframes Notion’s credit-based pricing: it’s not just a UX abstraction over volatile token costs, it’s a forcing function to keep agentic workflows from turning every database autofill into a margin-eating disaster.
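The 275× figure implies the gh baseline is around 200 tokens per invocation, and the back-of-envelope is worth making explicit because the schema dump is per-conversation overhead, paid before the agent does anything. A quick check under the article's reported numbers (the price per token is a hypothetical, not a quoted rate):

```python
mcp_schema_tokens = 55_000  # reported GitHub MCP server schema dump
overhead_ratio = 275        # reported MCP-vs-CLI overhead factor

# Implied cost of the equivalent gh CLI invocation: ~200 tokens.
cli_tokens = mcp_schema_tokens / overhead_ratio

# At a hypothetical $3 per million input tokens, the schema dump alone
# burns ~$0.165 of context per conversation before any work happens.
cost_per_conversation = mcp_schema_tokens / 1_000_000 * 3.0
```

Multiply that fixed overhead across every agent session and the credit-based pricing starts to look less like abstraction and more like self-defense.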

Evals built for models that don’t exist yet

Notion’s “design tests so frontier models pass ~30%” rule isn’t folklore. An arXiv methodology paper grounds it in Item Response Theory: items in the 30–70% pass band carry the highest discriminative signal and cut required eval volume by ~40% while preserving ranking fidelity 5. The same paper flags a failure mode the interview skips — UC Berkeley work showed major agent benchmarks can be hacked by gaming metrics without solving the underlying tasks 5. That’s the unspoken brief for Notion’s Model Behavior Engineers: keep the evals checkable, not just LLM-judged.
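The IRT claim is checkable with the standard two-parameter logistic (2PL) model: item information is a²·p·(1−p), which peaks exactly where the pass probability is 50% and falls off fast outside the 30–70% band. A minimal sketch (plain 2PL, not the paper's exact methodology):

```python
import math

def pass_probability(theta: float, a: float, b: float) -> float:
    """2PL item response: probability a model of ability theta passes
    an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item: a^2 * p * (1 - p).
    Maximized when p = 0.5, i.e. difficulty matches ability."""
    p = pass_probability(theta, a, b)
    return a * a * p * (1.0 - p)

# Information at p = 0.5 vs p = 0.95 (an item that is too easy), a = 1.0:
info_mid = item_information(0.0, 1.0, 0.0)             # p = 0.50
info_easy = item_information(0.0, 1.0, -math.log(19))  # p = 0.95
```

An item nearly every frontier model passes carries a fraction of the ranking signal of one in the mid band, which is the quantitative case for designing evals the current frontier fails most of the time.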

Adoption is real. The ceiling is too.

The launch traction is genuine — 21,000+ agents created in the first weeks of public beta, Remote claiming 20 hours/week saved on IT triage at 95% accuracy 6. But the same reviewers flag a ceiling Last and Sachs don’t acknowledge: agents degrade noticeably in workspaces with messy schemas or databases holding years of accumulated rows 6. The buyers most likely to want a “system of record” agent are exactly the ones with the gnarliest historical data. Notion has shipped the framework. Whether it survives contact with a 5-year-old enterprise workspace — and a determined attacker with a PDF — is the next chapter.


The local-models leaderboard is now a Chinese leaderboard

Source: latent-space · published 2026-04-14

TL;DR

  • Latent Space’s April 2026 local-models picks line up with r/LocalLLaMA consensus: Qwen, GLM, DeepSeek, MiniMax on top.
  • Chinese-origin models now drive ~41% of Hugging Face downloads vs 36.5% U.S. — and have become a policy target.
  • The benchmarks underwriting any “top” list drop 14–16 points when questions are merely rephrased.
  • Running the frontier picks locally still requires a 128GB Mac Studio or a multi-GPU rig.

The picks are right, and they’re almost all Chinese

Latent Space’s quiet-day check-in on the local scene maps cleanly onto what r/LocalLLaMA’s 2026 ranking thread crowd-sourced the same week: Qwen 3.5/3.6 as the general workhorse, Qwen3-Coder-Next as the “undisputed king” of local coding, GLM-5.1 and DeepSeek V3.2 at the frontier tier, and MiniMax M2.7 as the “accessible Sonnet at home” for agentic, tool-heavy workflows 7. Community deployment signal and the newsletter’s shortlist are converging — which is the strongest validation a vibes-driven list can get.

The community adds two corrections worth keeping. For creative writing, small Mistral Nemo fine-tunes like Rocinante X 12B still beat Gemma 27B on natural, “un-GPT-like” prose, and Gemma 4 is reportedly weak at hard debugging compared to the Qwen coder line 8. So the list is directionally right but coding- and chat-skewed.

What’s striking is the geography. All five headline recommendations come from Chinese labs. Hugging Face download data shows Chinese-origin models hit roughly 41% of global downloads in the year ending February 2026, surpassing the U.S. share of 36.5% 9. “Top local models” is now, by default, a survey of Chinese open-weights releases.

The provenance fight the newsletter skips

That dominance has become a Washington problem. The White House and the House Select Committee on the CCP have opened investigations into “model distillation,” alleging Chinese labs are shipping cheaper variants without equivalent safety guardrails, and a16z and other US-leaning voices are pushing for an explicit American open-source counter-program 10. A reader downloading GLM-5.1 today is wading into an active licensing and provenance debate the recommendation doesn’t flag.

Benchmarks broken, hardware bifurcated

The deeper problem is that the scaffolding under any “top” list is rotting. MMLU-CF, the contamination-controlled rebuild, shows model scores dropping 14–16 points when questions are slightly rephrased, and small lexical perturbations reshuffle local-model rankings outright 11. The honest move — which Latent Space arguably makes by leaning on community deployment over leaderboard position — is to admit the rankings are practitioner consensus, not measurement.
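The contamination test behind that 14–16 point number is easy to replicate at small scale: score the same model on original and rephrased variants of each question and look at the delta. A hedged sketch with a stubbed model call (`ask_model` is a placeholder standing in for a real local-model API):

```python
def ask_model(question: str) -> str:
    """Placeholder for a real model call; returns a canned answer so the
    harness runs standalone. Swap in your local model's API to use it."""
    return "Paris" if "capital of France" in question else "unknown"

# Each item: the original benchmark phrasing, a paraphrase, and the answer.
ITEMS = [
    {"orig": "What is the capital of France?",
     "para": "Which city serves as France's capital?",
     "answer": "Paris"},
]

def score(key: str) -> float:
    """Fraction of items answered correctly under one phrasing."""
    correct = sum(ask_model(it[key]).strip() == it["answer"] for it in ITEMS)
    return correct / len(ITEMS)

# A large positive gap suggests memorized phrasings, not capability.
contamination_gap = score("orig") - score("para")
```

The stub is deliberately pathological: it only "knows" the answer when it sees the exact benchmark wording, which is precisely the failure mode a contamination-controlled rebuild like MMLU-CF is designed to expose.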

Hardware is the other quiet caveat. The RTX 5090 caps at 32GB, which forces 70B+ models into aggressive quantization or a RAM-offload cliff at 1–8 tok/s. Only Apple Silicon with 128GB+ unified memory keeps a 70B model fully resident at a usable 15–20 tok/s 12. In practice:

| Setup | Ceiling | Frontier-class (70B+) viable? |
| --- | --- | --- |
| RTX 5090 (32GB) | ~32B dense, quantized | No — offload cliff |
| Multi-GPU rig | 70B+ | Yes, at $$$$ and watts |
| Mac Studio 128GB+ | 70B at 15–20 tok/s | Yes, only consumer desk option |

So “just run GLM-5.1 locally” is really “buy a $5K+ Mac Studio or build a multi-GPU box.”
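The hardware cliff follows directly from weight-memory arithmetic: at roughly 4.5 bits per weight (a common 4-bit quantization rate), a 70B dense model needs close to 40GB for weights alone, before KV cache and activations. A rough sketch under those rule-of-thumb assumptions, ignoring per-framework overhead:

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a dense model:
    params (in billions) * bits per weight / 8 bits per byte."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

q4_70b = weight_gb(70, 4.5)   # ~39 GB: already over a 32GB card
q4_32b = weight_gb(32, 4.5)   # ~18 GB: fits a 32GB card with KV headroom
fp16_70b = weight_gb(70, 16)  # 140 GB: beyond even 128GB unified memory

fits_5090 = q4_70b <= 32   # False: hence the RAM-offload cliff
fits_mac = q4_70b <= 128   # True: fully resident on a 128GB Mac Studio
```

The same arithmetic explains the table above: 32GB caps you at ~32B dense models quantized, while 128GB of unified memory keeps a quantized 70B model fully resident.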

Takeaway

The list is useful — it matches what practitioners actually deploy. But the interesting story underneath it is that the open-weights frontier is now Chinese, the benchmarks ratifying that frontier are unreliable, and the hardware to run it sits on a small minority of desks. Treat the rankings as a deployment snapshot, not a measurement.

Footnotes

  1. Metomic — ‘The NotionAI Security Gap’ · https://www.metomic.io/resource-centre/the-notionai-security-gap-how-to-prevent-data-exposure-before-processing-begins

    Attackers can hide malicious prompts in PDF files to trick agents into exfiltrating ARR or client data to external servers… the ‘lethal trifecta’ — access to private data, exposure to untrusted inputs, and the ability to communicate externally — remains an existential risk.

  2. eesel.ai — Notion AI Security & Privacy Practices · https://www.eesel.ai/blog/notion-ai-security-privacy-practices

    For Enterprise plan users, Notion employs zero-retention APIs… Free, Plus, and Business plans may have prompts and generated content retained by subprocessors for up to 30 days. Notion lacks end-to-end encryption, retaining the technical ability to access user content.

  3. Martia (Medium) — Progressive Disclosure in AI Agents · https://medium.com/@martia_es/progressive-disclosure-the-technique-that-helps-control-context-and-tokens-in-ai-agents-8d6108b09289

    Instead of front-loading 100+ tools at initialization, the system reveals tool details only when needed… Layer 1 (Discovery) sees only lightweight metadata; Layer 2 (Activation) injects full schemas into context.

  4. BiggoFinance — MCP vs CLI debate coverage · https://finance.biggo.com/news/3782bf347ef2a2ee

    Connecting a standard GitHub MCP server can dump 55,000 tokens into the context window, a 275x increase over using a simple gh CLI command… CLI agents have shown 33% better token efficiency in complex multi-step debugging workflows.

  5. arXiv — Headroom Eval methodology paper · https://arxiv.org/html/2603.23749v1

    Tasks with a 50% success probability offer the highest discriminative power… centering evaluations around a 30–70% mid-range difficulty filter reduces the number of required evaluation tasks by over 40% while maintaining ranking fidelity. UC Berkeley researchers demonstrated that major agent benchmarks could be hacked.

  6. Medium (codetodeploy) — Notion AI in 2026 · https://medium.com/codetodeploy/how-notion-ai-is-changing-the-way-we-work-in-2026-729cc6ad79d3

    Over 21,000 agents created within the first weeks of public beta; Remote reported saving 20 hours per week on IT triage with 95% accuracy. Reviews note significant latency degradation when agents operate within databases that have accumulated years of data.

  7. r/LocalLLaMA — ‘Open Sourced LLM Ranking 2026’ thread · https://www.reddit.com/r/LocalLLaMA/comments/1rqpmea/open_sourced_llm_ranking_2026/

    Qwen3-Coder-Next is considered the undisputed king of coding models in early 2026… MiniMax M2.7 has gained a reputation as the ‘accessible Sonnet at home’ for agentic and tool-heavy workflows.

  8. r/LocalLLaMA — discussion on creative-writing model picks · https://www.reddit.com/r/LocalLLaMA/comments/1rqpmea/open_sourced_llm_ranking_2026/

    Rocinante X 12B (a Mistral Nemo fine-tune) is frequently cited for producing more natural, ‘un-GPT-like’ prose than much larger models like Gemma 27B… developers report lackluster logic in Gemma 4 for complex debugging compared to Qwen coder variants.

  9. The New Stack — ‘China Leads Open AI Models’ · https://thenewstack.io/china-leads-open-ai-models/

    Chinese-developed models accounted for roughly 41% of global Hugging Face downloads in the year ending February 2026, surpassing the U.S. share of 36.5%.

  10. a16z — ‘Asserting American Leadership in Open Source AI’ · https://a16z.com/asserting-american-leadership-in-open-source-ai/

    The White House and the House Committee on the CCP launched investigations into ‘model distillation,’ alleging Chinese labs are training lower-cost versions without equivalent safety guardrails.

  11. Ismat Samadov — ‘Why I Stopped Trusting LLM Benchmarks’ · https://www.ismatsamadov.com/blog/why-i-stopped-trusting-llm-benchmarks

    MMLU-CF shows that model performance can drop by as much as 14–16 points when test questions are slightly rephrased to bypass training data leakage… small wording changes shuffle model rankings significantly.

  12. DecodesFuture — ‘Best GPU for Local LLMs 2026’ · https://www.decodesfuture.com/articles/best-gpu-for-local-llms-2026-guide

    The 5090’s 32GB limit remains a bottleneck for 70B+ models… a Mac with 128GB unified memory can run a 70B model entirely in memory at 15–20 tok/s, the only consumer-accessible desktop capable of running 70B+ frontier models without a performance cliff.

Jack Sun

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare