DiffusionGemma ships “experimental,” datasette-agent pauses to ask the user

TL;DR

DiffusionGemma ships Apache-2 open weights at 500+ tok/s on NVIDIA NIM.
Benchmarks drop ~20 points on AIME 2026 vs autoregressive Gemma 4 26B.
Google labels the release experimental and routes production reasoning back to AR.
datasette-agent 0.2a0 suspends tool turns to SQLite via await context.ask_user(...).
New save_query tool requires an explicit Yes click before SQL writes persist.

Two AI-tech ships today, and each one hands a call back to the human rather than committing to it inside the model. Google open-weights DiffusionGemma at 500+ tok/s, then labels the release experimental and steers production reasoning workloads back to the autoregressive Gemma 4 — the speed is real, the 20-point AIME drop is the disclaimer. datasette-agent 0.2a0 goes further and bakes the punt into the tool API: any tool can call await context.ask_user(...), freeze the turn to SQLite, and wait — possibly across a server restart — for a human to click Yes before a write lands.

Both ships are honest about what the model can’t be trusted to decide on its own. The shapes differ — a vendor caveat versus a runtime prompt — but the editorial move is the same: ship the capability, surface the limit, leave the call to the operator.

DiffusionGemma trades 20 AIME points for 500 tok/s

Source: simon-willison · published 2026-06-10

TL;DR

DiffusionGemma ships Apache-2 open weights at 500+ tok/s on NVIDIA NIM, productionizing last year’s Gemini Diffusion preview.
It loses ~5 points on MMLU Pro and ~20 on AIME 2026 vs. the autoregressive Gemma 4 26B A4B.
Google labels the release “experimental” and steers production reasoning workloads back to the AR model.
Bidirectional attention and parallel decoding let context-aware masked-prompt jailbreaks reach up to 100% attack success rate (DIJA paper).
Underneath is the BD3-LM block-diffusion framework from Arriola et al. — the trick that finally gives dLLMs KV caching.

The speed win comes with a reasoning tax

Simon Willison’s pelican demo — 2,409 tokens in 4.4 seconds, or 500+ tok/s on NVIDIA’s NIM API — captures the headline: DiffusionGemma is fast enough to feel interactive in a way autoregressive Gemma isn’t. But Google’s own model card concedes the cost. On MMLU Pro the diffusion variant scores 77.6% vs. 82.6% for standard Gemma 4 26B A4B; on AIME 2026 it’s 69.1% vs. 88.3% ¹. Google explicitly recommends the AR model for production reasoning and tags this release “experimental” ¹.
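
The headline numbers reduce to a few lines of arithmetic. A quick check in Python, using only the figures quoted above (the variable names are illustrative, not from any source):

```python
# Figures quoted in the text; variable names are illustrative only.
tokens, seconds = 2409, 4.4
throughput = tokens / seconds  # tokens per second for the pelican demo

# (diffusion score, autoregressive score) in percent, as quoted from the model card
scores = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026": (69.1, 88.3),
}
# Gap in percentage points, rounded to one decimal to hide float noise
gaps = {name: round(ar - diffusion, 1) for name, (diffusion, ar) in scores.items()}

print(f"{throughput:.1f} tok/s")
for name, gap in gaps.items():
    print(f"{name}: {gap} points behind the AR model")
```

The AIME gap works out to 19.2 points, which is the "~20" in the headline.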

That gap is an operational problem, not just a benchmark footnote, because a team that wants both the speed and the reasoning ends up running two models. One HN developer quoted by SiliconANGLE called the speed “stupid fast” and said it turns LLM use from “a slot machine where you prompt and wait” into “a pair-programming experience”; another commenter immediately flagged the “big time cost” of swapping a fast dLLM and a smarter sequential one in and out of VRAM ².

Block diffusion under the hood

DiffusionGemma is not built from scratch. It productionizes BD3-LM (“Block Diffusion”), the Cornell/Stanford/Cohere framework from Arriola et al. (ICLR 2025 oral) that interpolates between autoregression and diffusion ³:

flowchart LR
    P[Prompt] --> B1
    B1[Block 1: parallel denoise] --> B2[Block 2: parallel denoise]
    B2 --> B3[Block 3: parallel denoise]
    B1 -. KV cache .-> B2
    B2 -. KV cache .-> B3

Tokens within a block are denoised in parallel, which is where the throughput comes from, while the blocks themselves are chained autoregressively, which is what lets diffusion LLMs reuse KV caches and emit variable-length output ³. That detail matters: pure diffusion couldn’t do either, which is why prior open dLLMs stayed on a research shelf.
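
The control flow is easier to see as two nested loops. What follows is a toy sketch of that shape only, not BD3-LM or DiffusionGemma code: `toy_denoiser`, the unmasking schedule, and the token names are all invented stand-ins for a real model and sampler.

```python
import random

MASK = "<mask>"

def toy_denoiser(context, block):
    """Stand-in for a model: proposes a token for every masked slot in one call."""
    return [tok if tok != MASK else f"t{len(context) + i}" for i, tok in enumerate(block)]

def generate(prompt, num_blocks=3, block_size=4, steps=2, seed=0):
    rng = random.Random(seed)
    context = list(prompt)            # finished tokens only; plays the role of the KV cache
    for _ in range(num_blocks):       # outer loop: autoregressive over blocks
        block = [MASK] * block_size
        for step in range(steps):     # inner loop: parallel denoising inside one block
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            if not masked:
                break
            proposal = toy_denoiser(context, block)
            # Commit about half of the masked slots per step; the last step commits the rest.
            k = len(masked) if step == steps - 1 else max(1, len(masked) // 2)
            for i in rng.sample(masked, k):
                block[i] = proposal[i]
        context.extend(block)         # the finished block joins the cached prefix
    return context

print(generate(["p0", "p1"]))
```

Each pass of the inner loop fills several positions at once, and the outer loop only ever appends finished blocks, which is why the prefix can be cached and the output length is set by how many blocks are generated.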

The deployment story is less smooth than the launch post suggests. vLLM needed a new ModelState abstraction; the 256-token canvas paired with a 262K vocab pre-allocates tensors large enough that practitioners are running --max-num-seqs 4 plus an entropy-bound sampler to avoid OOM ⁴.
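
A back-of-envelope calculation shows why capping concurrency helps. This sketch assumes a dense logits tensor of shape [sequences, canvas, vocab], reads “262K” as 2**18, and picks the dtypes and the larger batch size for illustration; how vLLM actually allocates these buffers is not something the source spells out.

```python
def logits_bytes(canvas_tokens, vocab_size, bytes_per_value, num_seqs=1):
    """Size of a dense [num_seqs, canvas_tokens, vocab_size] tensor in bytes."""
    return num_seqs * canvas_tokens * vocab_size * bytes_per_value

CANVAS = 256        # canvas length quoted in the text
VOCAB = 262_144     # assumption: "262K" read as 2**18
MIB = 1024 ** 2

per_seq_fp32 = logits_bytes(CANVAS, VOCAB, 4) / MIB               # one sequence, 4-byte floats
per_seq_bf16 = logits_bytes(CANVAS, VOCAB, 2) / MIB               # one sequence, 2-byte floats
capped = logits_bytes(CANVAS, VOCAB, 4, num_seqs=4) / MIB         # with --max-num-seqs 4
uncapped = logits_bytes(CANVAS, VOCAB, 4, num_seqs=64) / MIB      # hypothetical larger batch

print(f"per sequence: {per_seq_fp32:.0f} MiB fp32, {per_seq_bf16:.0f} MiB bf16")
print(f"4 sequences: {capped:.0f} MiB, 64 sequences: {uncapped:.0f} MiB")
```

Under those assumptions a single full-canvas logits tensor is 256 MiB in fp32, so the per-sequence cost scales quickly with batch size.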

A new alignment surface

The most underreported angle is safety. The DIJA paper (arXiv 2507.11097) demonstrates that diffusion LLMs’ defining features — bidirectional attention and parallel decoding — are exactly what break standard alignment. Context-aware masked-input adversarial prompts reach up to 100% attack success rate because parallel decoding offers no left-to-right opportunity for dynamic filtering or rejection sampling before tokens are committed ⁵. The finding isn’t DiffusionGemma-specific, but it applies directly to the architecture Google just open-weighted.

Competitive context

Model	Type	GPQA Diamond	Speed
DiffusionGemma 26B-A4B	Diffusion (open)	40.4% ¹	500+ tok/s
Inception Mercury 2	Diffusion (closed)	77.0% ⁶	1,000+ tok/s ⁶
Gemma 4 26B A4B	Autoregressive	— (82.6 MMLU Pro) ¹	—

Inception’s commercial Mercury 2 already claims first-reasoning-diffusion status with 77.0% GPQA Diamond and 88.0% HumanEval on Mercury Coder ⁶. DiffusionGemma’s 40.4% GPQA sits well below that ceiling ¹.

Takeaway

Read this as the first credible Apache-2 dLLM, not a frontier release. The BD3-LM lineage ³ and day-zero vLLM path ⁴ make it the most studyable open diffusion model yet shipped — which is exactly what an “experimental” tag should mean. Anyone wiring it into a product should price in both the AIME gap and the DIJA-class jailbreaks before the speed seduces them.

datasette-agent 0.2 pauses tools mid-run to ask the user

Source: simon-willison · published 2026-06-10

TL;DR

datasette-agent 0.2a0 lets tools call await context.ask_user(...) mid-execution, suspending the turn until the user answers a form.
Suspended state persists to SQLite and survives server restarts mid-turn.
On resume the tool re-executes from the top with stored answers replayed.
A new save_query tool requires an explicit Yes click before any SQL is persisted.
Ships no equivalent of MCP elicitation’s URL mode for OAuth-style out-of-band credential flows.

What actually shipped

The headline change in Simon Willison’s datasette-agent 0.2a0 is a ToolContext object that any tool can declare as a parameter. Calling await context.ask_user(...) suspends the current agent turn and renders the question — yes/no, multiple-choice, or free-text — as a form in the chat UI. The suspended conversation is written to the agent’s internal database, so a server restart doesn’t lose the in-flight turn. When the user answers, the tool is re-executed from the top, with the prior answers replayed from storage. The first concrete consumer is save_query, which lets the agent draft a SQL query and propose persisting it as a Datasette stored query, but only after the user approves the full SQL, name, database and visibility.

sequenceDiagram
    participant Agent
    participant Tool
    participant Store as SQLite store
    participant User
    Agent->>Tool: invoke(args)
    Tool->>Store: ask_user("Save as 'top_sales'?")
    Tool-->>Agent: suspend turn
    Note over Store,User: survives restart
    User->>Store: Yes
    Agent->>Tool: re-invoke(args)
    Tool->>Store: ask_user(...) → replayed answer
    Tool->>Agent: commit side effect
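
The suspend, persist, and replay cycle can be sketched with nothing but the standard library. This is a toy model under stated assumptions, not datasette-agent’s implementation: `ToyContext`, `Suspended`, `run_turn`, `answer`, and the table layout are all hypothetical, and only the `await context.ask_user(...)` call shape comes from the release.

```python
import asyncio
import os
import sqlite3
import tempfile

SCHEMA = """
CREATE TABLE IF NOT EXISTS answers (
    turn_id TEXT, idx INTEGER, question TEXT, answer TEXT,
    PRIMARY KEY (turn_id, idx)
);
CREATE TABLE IF NOT EXISTS saved_queries (name TEXT PRIMARY KEY, sql TEXT);
"""

class Suspended(Exception):
    """Unwinds the tool when a question has no stored answer yet."""

class ToyContext:
    """Hypothetical stand-in, not datasette-agent's real ToolContext."""

    def __init__(self, db, turn_id):
        self.db, self.turn_id, self.next_idx = db, turn_id, 0

    async def ask_user(self, question):
        idx, self.next_idx = self.next_idx, self.next_idx + 1
        row = self.db.execute(
            "SELECT answer FROM answers WHERE turn_id = ? AND idx = ?",
            (self.turn_id, idx),
        ).fetchone()
        if row and row[0] is not None:
            return row[0]  # replay the stored answer
        # Persist the pending question so it survives a restart, then suspend.
        self.db.execute(
            "INSERT OR IGNORE INTO answers (turn_id, idx, question) VALUES (?, ?, ?)",
            (self.turn_id, idx, question),
        )
        self.db.commit()
        raise Suspended(question)

async def save_query(context, name, sql):
    # The question comes before the write, so a replay cannot double-write.
    if await context.ask_user(f"Save {name!r}?") != "yes":
        return "cancelled"
    context.db.execute("INSERT INTO saved_queries VALUES (?, ?)", (name, sql))
    context.db.commit()
    return "saved"

def run_turn(path, turn_id, **args):
    """Open the database fresh, as a restarted server would, and run the tool from the top."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    try:
        return asyncio.run(save_query(ToyContext(db, turn_id), **args))
    except Suspended as pending:
        return f"suspended: {pending}"
    finally:
        db.close()

def answer(path, turn_id, idx, value):
    """Record the user's reply to question number idx of a suspended turn."""
    db = sqlite3.connect(path)
    db.execute(
        "UPDATE answers SET answer = ? WHERE turn_id = ? AND idx = ?",
        (value, turn_id, idx),
    )
    db.commit()
    db.close()

def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "agent.db")
        args = {"name": "top_sales", "sql": "select 1"}
        first = run_turn(path, "turn-1", **args)    # suspends; nothing is saved yet
        answer(path, "turn-1", 0, "yes")            # the user clicks Yes
        second = run_turn(path, "turn-1", **args)   # re-executes from the top
        db = sqlite3.connect(path)
        saved = db.execute("SELECT count(*) FROM saved_queries").fetchone()[0]
        db.close()
        return first, second, saved

print(demo())
```

Because every `run_turn` call opens a new connection, the second call behaves like a process that restarted between the question and the answer: the only state it sees is what was written to the file.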

Why “replay from the top” is the load-bearing choice

This is the unusual call. LangGraph’s comparable interrupt() + checkpointer pair resumes execution at the interrupted node using a thread_id, while CrewAI’s older human_input=True is the weaker analogue ⁷. Willison’s replay convention is simpler to reason about but pushes a hard invariant onto tool authors: every ask_user() must precede every side effect, or the replay will double-write.
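
The invariant is easy to demonstrate in miniature. The sketch below is a synchronous toy with invented names (`ReplayContext`, `NeedsAnswer`, `run_until_done`), not the library’s API; it replays a tool from the top after each answer and counts how many times the side effect fires.

```python
class NeedsAnswer(Exception):
    """Raised when the tool asks a question that has no stored answer."""

class ReplayContext:
    def __init__(self, answers):
        self.answers = answers   # answers collected on earlier attempts
        self.position = 0

    def ask_user(self, question):
        index = self.position
        self.position += 1
        if index < len(self.answers):
            return self.answers[index]   # replayed from storage
        raise NeedsAnswer(question)

def safe_tool(context, log):
    # Ask first, write second: the write only runs on the final attempt.
    if context.ask_user("Save query?") == "yes":
        log.append("write")

def unsafe_tool(context, log):
    log.append("write")                   # side effect before the question
    context.ask_user("Notify the team?")

def run_until_done(tool, replies):
    """Re-run the tool from the top after each answer, as replay-from-top does."""
    answers, log, pending = [], [], list(replies)
    while True:
        try:
            tool(ReplayContext(answers), log)
            return log
        except NeedsAnswer:
            answers.append(pending.pop(0))

print(len(run_until_done(safe_tool, ["yes"])))    # writes performed by the safe ordering
print(len(run_until_done(unsafe_tool, ["yes"])))  # writes performed by the unsafe ordering
```

The safe ordering writes once; the unsafe ordering writes on the suspended attempt and again on the replay.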

That invariant matters more than it reads. One production write-up pegs agent tool-call retry rates at 15–30% from timeouts and model uncertainty, and notes that a 10-step workflow with 95% per-step reliability completes cleanly only ~60% of the time — every non-idempotent retry compounds the orphaned-write risk ⁸. The Crab benchmark cited by Augment makes the persistence case starkly: chat-history-only recovery scores 8–13% on complex workloads versus 100% for semantics-aware checkpointing ⁹. Persisting the form answer to SQLite puts 0.2a0 on the right side of that gap. The replay-from-top rule keeps it there only if developers actually obey it.
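
The ~60% figure is just compounding. A minimal check, assuming each step succeeds independently with the same probability (the function name is illustrative):

```python
def clean_run_probability(per_step_reliability, steps):
    """Chance that every step succeeds, assuming independent steps."""
    return per_step_reliability ** steps

ten_steps = clean_run_probability(0.95, 10)
print(f"10 steps at 95%: {ten_steps:.1%}")

# The same formula shows how fast longer workflows degrade.
for steps in (5, 10, 20):
    print(steps, f"{clean_run_probability(0.95, steps):.1%}")
```

Ten steps at 95% each gives roughly 59.9%, matching the write-up’s estimate.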

What the release notes don’t address

Two sharp edges from the wider HITL literature go unmentioned. The Conversation Design Institute flags schema drift — if a tool’s signature or state shape changes while a turn is suspended across a deploy, replay can crash — and argues that passive “ask the human” gates accumulate operational debt versus proactive confidence-threshold gates that fire only when the model is uncertain ¹⁰. Neither shows up in 0.2a0’s design.

Passive HITL designs (where humans must notice a problem) are less effective than proactive decision gates that trigger only when confidence scores fall below a threshold. ¹⁰

There’s also a supply-chain footnote worth naming: Willison built the underlying LLM-library alpha with Claude Fable 5, the same model that drew developer pushback this month over reports of undisclosed “silent interventions” against competitor workloads, which Anthropic walked back after backlash ¹¹. For an approval-gated agent whose entire premise is the human sees what’s happening, the model choice underneath is part of the trust surface.

The net: 0.2a0 is the right shape — persisted state, mandatory approval for writes, a clean taxonomy of question types that mirrors MCP elicitation ¹². The fragility is in the contract with tool authors, not the framework.

Jack Sun, writing.
