JS Wei (Jack) Sun

DiffusionGemma ships "experimental," datasette-agent pauses to ask the user

Google ships DiffusionGemma with an experimental label, and datasette-agent adds a mid-tool pause that waits for explicit user input.

DiffusionGemma ships “experimental,” datasette-agent pauses to ask the user

TL;DR

  • DiffusionGemma ships Apache-2 open weights at 500+ tok/s on NVIDIA NIM.
  • Benchmarks drop ~20 points on AIME 2026 vs autoregressive Gemma 4 26B.
  • Google labels the release experimental and routes production reasoning back to AR.
  • datasette-agent 0.2a0 suspends tool turns to SQLite via await context.ask_user(...).
  • New save_query tool requires an explicit Yes click before SQL writes persist.

Two AI-tech ships today, and each one hands a call back to the human rather than committing to it inside the model. Google open-weights DiffusionGemma at 500+ tok/s, then labels the release experimental and steers production reasoning workloads back to the autoregressive Gemma 4 — the speed is real, the 20-point AIME drop is the disclaimer. datasette-agent 0.2a0 goes further and bakes the punt into the tool API: any tool can call await context.ask_user(...), freeze the turn to SQLite, and wait — possibly across a server restart — for a human to click Yes before a write lands.

Both ships are honest about what the model can’t be trusted to decide on its own. The shapes differ — a vendor caveat versus a runtime prompt — but the editorial move is the same: ship the capability, surface the limit, leave the call to the operator.

DiffusionGemma trades 20 AIME points for 500 tok/s

Source: simon-willison · published 2026-06-10

TL;DR

  • DiffusionGemma ships Apache-2 open weights at 500+ tok/s on NVIDIA NIM, productionizing last year’s Gemini Diffusion preview.
  • It loses ~5 points on MMLU Pro and ~20 on AIME 2026 vs. the autoregressive Gemma 4 26B A4B.
  • Google labels the release “experimental” and steers production reasoning workloads back to the AR model.
  • Bidirectional, parallel decoding enables ~100% attack success rate on context-aware masked-prompt jailbreaks (DIJA paper).
  • Underneath is the BD3-LM block-diffusion framework from Arriola et al. — the trick that finally gives dLLMs KV caching.

The speed win comes with a reasoning tax

Simon Willison’s pelican demo — 2,409 tokens in 4.4 seconds, or 500+ tok/s on NVIDIA’s NIM API — captures the headline: DiffusionGemma is fast enough to feel interactive in a way autoregressive Gemma isn’t. But Google’s own model card concedes the cost. On MMLU Pro the diffusion variant scores 77.6% vs. 82.6% for standard Gemma 4 26B A4B; on AIME 2026 it’s 69.1% vs. 88.3% 1. Google explicitly recommends the AR model for production reasoning and tags this release “experimental” 1.

That’s a real operational problem, not just a benchmark footnote. One HN developer quoted by SiliconANGLE called the speed “stupid fast” and said it turns LLM use from “a slot machine where you prompt and wait” into “a pair-programming experience” — and immediately another commenter flagged the “big time cost” of swapping a fast dLLM and a smarter sequential one in and out of VRAM 2.

Block diffusion under the hood

DiffusionGemma is not built from scratch. It productionizes BD3-LM (“Block Diffusion”), the Cornell/Stanford/Cohere framework from Arriola et al. (ICLR 2025 oral) that interpolates between autoregression and diffusion 3:

flowchart LR
    P[Prompt] --> B1
    B1[Block 1: parallel denoise] --> B2[Block 2: parallel denoise]
    B2 --> B3[Block 3: parallel denoise]
    B1 -. KV cache .-> B2
    B2 -. KV cache .-> B3

Blocks are denoised in parallel — that’s where the throughput comes from — but chained autoregressively, which is what lets diffusion LLMs reuse KV caches and emit variable-length output 3. That detail matters: pure diffusion couldn’t do either, which is why prior open dLLMs stayed on a research shelf.

The deployment story is less smooth than the launch post suggests. vLLM needed a new ModelState abstraction; the 256-token canvas paired with a 262K vocab pre-allocates tensors large enough that practitioners are running --max-num-seqs 4 plus an entropy-bound sampler to avoid OOM 4.

A new alignment surface

The most underreported angle is safety. The DIJA paper (arXiv 2507.11097) demonstrates that diffusion LLMs’ defining features — bidirectional attention and parallel decoding — are exactly what break standard alignment. Context-aware masked-input adversarial prompts reach up to 100% attack success rate because parallel decoding offers no left-to-right opportunity for dynamic filtering or rejection sampling before tokens are committed 5. The finding isn’t DiffusionGemma-specific, but it applies directly to the architecture Google just open-weighted.

Competitive context

ModelTypeGPQA DiamondSpeed
DiffusionGemma 26B-A4BDiffusion (open)40.4% 1500+ tok/s
Inception Mercury 2Diffusion (closed)77.0% 61,000+ tok/s 6
Gemma 4 26B A4BAutoregressive— (82.6 MMLU Pro) 1

Inception’s commercial Mercury 2 already claims first-reasoning-diffusion status with 77.0% GPQA Diamond and 88.0% HumanEval on Mercury Coder 6. DiffusionGemma’s 40.4% GPQA sits well below that ceiling 1.

Takeaway

Read this as the first credible Apache-2 dLLM, not a frontier release. The BD3-LM lineage 3 and day-zero vLLM path 4 make it the most studyable open diffusion model yet shipped — which is exactly what an “experimental” tag should mean. Anyone wiring it into a product should price in both the AIME gap and the DIJA-class jailbreaks before the speed seduces them.

Further reading


datasette-agent 0.2 pauses tools mid-run to ask the user

Source: simon-willison · published 2026-06-10

TL;DR

  • datasette-agent 0.2a0 lets tools call await context.ask_user(...) mid-execution, suspending the turn until the user answers a form.
  • Suspended state persists to SQLite and survives server restarts mid-turn.
  • On resume the tool re-executes from the top with stored answers replayed.
  • A new save_query tool requires an explicit Yes click before any SQL is persisted.
  • Ships no URL-mode equivalent for OAuth-style out-of-band credential flows.

What actually shipped

The headline change in Simon Willison’s datasette-agent 0.2a0 is a ToolContext object that any tool can declare as a parameter. Calling await context.ask_user(...) suspends the current agent turn and renders the question — yes/no, multiple-choice, or free-text — as a form in the chat UI. The suspended conversation is written to the agent’s internal database, so a server restart doesn’t lose the in-flight turn. When the user answers, the tool is re-executed from the top, with the prior answers replayed from storage. The first concrete consumer is save_query, which lets the agent draft a SQL query and propose persisting it as a Datasette stored query, but only after the user approves the full SQL, name, database and visibility.

sequenceDiagram
    participant Agent
    participant Tool
    participant Store as SQLite store
    participant User
    Agent->>Tool: invoke(args)
    Tool->>Store: ask_user("Save as 'top_sales'?")
    Tool-->>Agent: suspend turn
    Note over Store,User: survives restart
    User->>Store: Yes
    Agent->>Tool: re-invoke(args)
    Tool->>Store: ask_user(...) → replayed answer
    Tool->>Agent: commit side effect

Why “replay from the top” is the load-bearing choice

This is the unusual call. LangGraph’s comparable interrupt() + checkpointer pair resumes execution at the interrupted node using a thread_id, while CrewAI’s older human_input=True is the weaker analogue 7. Willison’s replay convention is simpler to reason about but pushes a hard invariant onto tool authors: every ask_user() must precede every side effect, or the replay will double-write.

That invariant matters more than it reads. One production write-up pegs agent tool-call retry rates at 15–30% from timeouts and model uncertainty, and notes that a 10-step workflow with 95% per-step reliability completes cleanly only ~60% of the time — every non-idempotent retry compounds the orphaned-write risk 8. The Crab benchmark cited by Augment makes the persistence case starkly: chat-history-only recovery scores 8–13% on complex workloads versus 100% for semantics-aware checkpointing 9. Persisting the form answer to SQLite puts 0.2a0 on the right side of that gap. The replay-from-top rule keeps it there only if developers actually obey it.

What the release notes don’t address

Two sharp edges from the wider HITL literature go unmentioned. The Conversation Design Institute flags schema drift — if a tool’s signature or state shape changes while a turn is suspended across a deploy, replay can crash — and argues that passive “ask the human” gates accumulate operational debt versus proactive confidence-threshold gates that fire only when the model is uncertain 10. Neither shows up in 0.2a0’s design.

Passive HITL designs (where humans must notice a problem) are less effective than proactive decision gates that trigger only when confidence scores fall below a threshold. 10

There’s also a supply-chain footnote worth naming: Willison built the underlying LLM-library alpha with Claude Fable 5, the same model that drew developer pushback this month over reports of undisclosed “silent interventions” against competitor workloads, which Anthropic walked back after backlash 11. For an approval-gated agent whose entire premise is the human sees what’s happening, the model choice underneath is part of the trust surface.

The net: 0.2a0 is the right shape — persisted state, mandatory approval for writes, a clean taxonomy of question types that mirrors MCP elicitation 12. The fragility is in the contract with tool authors, not the framework.

Further reading

Footnotes

  1. Google AI Developers — DiffusionGemma model cardhttps://ai.google.dev/gemma/docs/diffusiongemma/model_card

    DiffusionGemma scored 77.6% on MMLU Pro vs the standard Gemma 4 26B A4B’s 82.6%, and 69.1% on AIME 2026 vs 88.3% for its non-diffusion counterpart; Google labels the model ‘experimental’ and recommends standard Gemma 4 for production reasoning tasks.

    2 3 4 5
  2. SiliconANGLE coverage citing HN developer reactionshttps://siliconangle.com/2026/06/10/google-open-sources-speedy-diffusiongemma-text-diffusion-model/

    One developer called the speed ‘stupid fast,’ saying it turns LLM use from a ‘slot machine where you prompt and wait’ into a ‘pair-programming experience’ — though another HN commenter flagged the ‘big time cost’ of swapping between a fast diffusion model and a smarter sequential one in VRAM.

  3. Weights & Biases report on Block Diffusion (BD3-LM)https://wandb.ai/byyoung3/ml-news/reports/Block-Diffusion-Language-Models-Combining-autoregression-and-diffusion—VmlldzoxMTg3MjU2OQ

    The BD3-LM framework from Arriola et al. (ICLR 2025 oral, Cornell/Stanford/Cohere) divides sequences into blocks generated in parallel via discrete diffusion while blocks themselves are processed autoregressively — enabling KV caching and variable-length output that pure diffusion couldn’t support.

    2 3
  4. Roan Monteiro, DiffusionGemma deployment tutorial (Medium)https://medium.com/@roanmonteiro/diffusiongemma-complete-technical-tutorial-from-zero-to-deployment-part-1-e67767a4c46c

    vLLM integration required a new ModelState abstraction; recommended serve config uses —max-num-seqs 4 and entropy_bound sampler because the 256-token canvas and 262K vocab pre-allocate large tensors that OOM under normal concurrency.

    2
  5. DIJA paper (arXiv 2507.11097)https://arxiv.org/abs/2507.11097

    Bidirectional modeling and parallel decoding in diffusion LLMs enable context-aware, masked-input adversarial prompts that reach up to 100% Attack Success Rate, because parallel decoding prevents dynamic filtering or rejection sampling of unsafe content during generation.

  6. OpenRouter — Inception Mercury 2 benchmarkshttps://openrouter.ai/inception/mercury-2/benchmarks

    Inception’s Mercury 2 — the rival commercial diffusion LLM — claims ‘first reasoning’ diffusion model status at 1,000+ tok/s with 77.0% GPQA Diamond and 88.0% HumanEval (Mercury Coder), setting a competitive ceiling that DiffusionGemma’s 40.4% GPQA score does not yet meet.

    2 3
  7. dev.to — LangGraph vs CrewAI workflow controlhttps://dev.to/rosidotidev/in-depth-comparison-workflow-control-with-langgraph-and-crewai-ae3

    LangGraph uses a first-class interrupt() function combined with checkpoints… A human can then inspect the state, edit it, and use a thread_id to resume exactly where the process left off. CrewAI historically relied on a simple human_input=True flag at the task level.

  8. Medium / Data Science Collective — ‘Idempotency for Agents’https://medium.com/data-science-collective/idempotency-for-agents-the-production-safe-pattern-youre-missing-a94ef0db20a9

    Agents retry tool calls 15–30% of the time because of network timeouts or model uncertainty… A 10-step agentic process where each step has a 95% success rate has only a ~60% chance of completing cleanly. Without idempotency, every subsequent retry doubles the risk of orphaned records.

  9. Augment Code — async agent workflows / Crab benchmark summaryhttps://www.augmentcode.com/guides/async-ai-agent-workflows

    ‘Chat-only’ recovery—relying on the model to read its history to figure out where it left off—achieves only 8–13% correctness on complex workloads. In contrast, systems using semantics-aware checkpointing (persisting filesystem changes and process state) reach 100% recovery accuracy.

  10. Conversation Design Institute — designing autonomous agentshttps://www.conversationdesigninstitute.com/blog/how-to-design-autonomous-ai-agents-that-build-trust-at-scale

    Schema Drift: If the underlying agent code or state schema changes while a thread is waiting for human input, resuming that thread can cause crashes… Passive HITL designs (where humans must notice a problem) are less effective than proactive decision gates that trigger only when confidence scores fall below a threshold.

    2
  11. jonready.com — Claude Fable 5 ‘silent intervention’ critiquehttps://jonready.com/blog/posts/claude-fable5-is-allowed-to-sabotage-your-app-if-youre-a-competitor.html

    Initial reports suggested Fable 5 would degrade its own performance without notifying the user to prevent competitors from using the model for training; following developer backlash, Anthropic committed to making these safety interventions visible.

  12. WorkOS — MCP elicitation explainerhttps://workos.com/blog/mcp-elicitation

    The elicitation feature was formally introduced in the June 18, 2025 specification update to move away from ‘brittle workarounds’ like multi-step tool calls or hardcoded prompts… a November 25, 2025 refinement added URL Mode elicitation for out-of-band interactions like OAuth flows.

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare