DiffusionGemma ships "experimental," datasette-agent pauses to ask the user
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
DiffusionGemma simonwillison.net
DiffusionGemma Last May Google briefly released an experimental Gemini Diffusion model. I tried the preview at the time and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it. That research has returned in the best possible way: as a new open weight (Apache 2 licensed) Gemma model, google/diffusiongemma-26B-A4B-it. NVIDIA are currently hosting the model for free on their NIM cloud API. I used that API to generate this pelican,…
datasette-agent 0.2a0 simonwillison.net
Release: datasette-agent 0.2a0 Highlights from the release notes: Tools can now ask the user questions mid-execution. Tools that declare a context parameter receive a ToolContext object, and await context.ask_user(…) can ask a yes/no, multiple-choice (options=[…]) or free-text (free_text=True) question. While a question is unanswered the agent turn suspends: the question renders as a form in the chat UI and persists to the internal database, so suspended conversations survive a server r…
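The ask-the-user flow described in that excerpt might look roughly like the sketch below. It does not import datasette-agent. StubToolContext is a hypothetical stand-in that returns scripted answers instead of suspending the turn, and delete_rows is an invented tool. The positional question argument and the return types (a boolean for yes/no, the chosen string for options) are assumptions; only the ask_user name and the options= and free_text= keywords come from the release notes.

```python
import asyncio


class StubToolContext:
    """Hypothetical stand-in for the ToolContext named in the release notes."""

    def __init__(self, scripted_answers):
        self._answers = list(scripted_answers)
        self.asked = []

    async def ask_user(self, question, options=None, free_text=False):
        # A real agent would suspend the turn and render a form here.
        # This stub records the question and returns the next canned answer.
        self.asked.append((question, options, free_text))
        return self._answers.pop(0)


async def delete_rows(table, context):
    """Hypothetical tool: confirm first, then ask which rows to delete."""
    confirmed = await context.ask_user(f"Delete rows from {table}?")
    if not confirmed:
        return "cancelled"
    scope = await context.ask_user(
        "Which rows?", options=["older than 30 days", "all"]
    )
    return f"deleted {scope} rows from {table}"


context = StubToolContext([True, "all"])
result = asyncio.run(delete_rows("logs", context))
print(result)
```

Because the tool simply awaits each answer, the same function body would work whether the reply arrives in milliseconds or after a restart, provided the host persists the pending question as the release notes describe.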
References
Google AI Developers — DiffusionGemma model card ai.google.dev
DiffusionGemma scored 77.6% on MMLU Pro vs the standard Gemma 4 26B A4B’s 82.6%, and 69.1% on AIME 2026 vs 88.3% for its non-diffusion counterpart; Google labels the model ‘experimental’ and recommends standard Gemma 4 for production reasoning tasks.
DIJA paper (arXiv 2507.11097) arxiv.org
Bidirectional modeling and parallel decoding in diffusion LLMs enable context-aware, masked-input adversarial prompts that reach up to 100% Attack Success Rate, because parallel decoding prevents dynamic filtering or rejection sampling of unsafe content during generation.
Weights & Biases report on Block Diffusion (BD3-LM) wandb.ai
The BD3-LM framework from Arriola et al. (ICLR 2025 oral, Cornell/Stanford/Cohere) divides sequences into blocks generated in parallel via discrete diffusion while blocks themselves are processed autoregressively — enabling KV caching and variable-length output that pure diffusion couldn’t support.
Roan Monteiro, DiffusionGemma deployment tutorial (Medium) medium.com
vLLM integration required a new ModelState abstraction; recommended serve config uses --max-num-seqs 4 and entropy_bound sampler because the 256-token canvas and 262K vocab pre-allocate large tensors that OOM under normal concurrency.
SiliconANGLE coverage citing HN developer reactions siliconangle.com
One developer called the speed ‘stupid fast,’ saying it turns LLM use from a ‘slot machine where you prompt and wait’ into a ‘pair-programming experience’ — though another HN commenter flagged the ‘big time cost’ of swapping between a fast diffusion model and a smarter sequential one in VRAM.
OpenRouter — Inception Mercury 2 benchmarks openrouter.ai
Inception’s Mercury 2 — the rival commercial diffusion LLM — claims ‘first reasoning’ diffusion model status at 1,000+ tok/s with 77.0% GPQA Diamond and 88.0% HumanEval (Mercury Coder), setting a competitive ceiling that DiffusionGemma’s 40.4% GPQA score does not yet meet.
WorkOS — MCP elicitation explainer workos.com
The elicitation feature was formally introduced in the June 18, 2025 specification update to move away from ‘brittle workarounds’ like multi-step tool calls or hardcoded prompts… a November 25, 2025 refinement added URL Mode elicitation for out-of-band interactions like OAuth flows.
Medium / Data Science Collective — ‘Idempotency for Agents’ medium.com
Agents retry tool calls 15–30% of the time because of network timeouts or model uncertainty… A 10-step agentic process where each step has a 95% success rate has only a ~60% chance of completing cleanly. Without idempotency, every subsequent retry doubles the risk of orphaned records.
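The ~60% figure in that excerpt follows from multiplying ten independent 95% success rates. The sketch below checks the arithmetic and then shows one common way to make a retried tool call safe. IdempotentStore and its idempotency_key parameter are hypothetical names for illustration, not taken from the cited article.

```python
def clean_run_probability(step_success, steps):
    """Chance that every step succeeds, assuming independent failures."""
    return step_success ** steps


p = clean_run_probability(0.95, 10)
print(f"{p:.3f}")  # 0.599


class IdempotentStore:
    """Hypothetical record store that deduplicates retried writes by key."""

    def __init__(self):
        self.records = []
        self._results = {}

    def create_record(self, idempotency_key, payload):
        # A retry with a known key returns the earlier result
        # instead of writing a second (orphaned) record.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.records.append(payload)
        record_id = len(self.records)
        self._results[idempotency_key] = record_id
        return record_id


store = IdempotentStore()
first = store.create_record("order-42", {"item": "widget"})
retry = store.create_record("order-42", {"item": "widget"})
print(first, retry, len(store.records))
```

The key has to be chosen by the caller before the first attempt, so that a retry after a timeout carries the same key as the original call.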
Augment Code — async agent workflows / Crab benchmark summary augmentcode.com
‘Chat-only’ recovery—relying on the model to read its history to figure out where it left off—achieves only 8–13% correctness on complex workloads. In contrast, systems using semantics-aware checkpointing (persisting filesystem changes and process state) reach 100% recovery accuracy.
dev.to — LangGraph vs CrewAI workflow control dev.to
LangGraph uses a first-class interrupt() function combined with checkpoints… A human can then inspect the state, edit it, and use a thread_id to resume exactly where the process left off. CrewAI historically relied on a simple human_input=True flag at the task level.
Conversation Design Institute — designing autonomous agents conversationdesigninstitute.com
Schema Drift: If the underlying agent code or state schema changes while a thread is waiting for human input, resuming that thread can cause crashes… Passive HITL designs (where humans must notice a problem) are less effective than proactive decision gates that trigger only when confidence scores fall below a threshold.
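Both ideas in that excerpt can be sketched in a few lines: a gate that asks a human only when confidence is low, and a version check that refuses to resume state saved under an older schema. All names here (SCHEMA_VERSION, SuspendedThread, needs_human, resume) and the 0.8 threshold are assumptions for illustration, not taken from the cited source or any particular framework.

```python
from dataclasses import dataclass

SCHEMA_VERSION = 2  # hypothetical version of the current state schema


@dataclass
class SuspendedThread:
    thread_id: str
    schema_version: int
    state: dict


def needs_human(confidence, threshold=0.8):
    """Proactive gate: escalate only when confidence is below the threshold."""
    return confidence < threshold


def resume(thread):
    """Return saved state, refusing threads saved under another schema."""
    if thread.schema_version != SCHEMA_VERSION:
        raise ValueError(
            f"thread {thread.thread_id} was saved with schema "
            f"v{thread.schema_version}, current is v{SCHEMA_VERSION}"
        )
    return thread.state


fresh = SuspendedThread("t1", 2, {"step": 3})
stale = SuspendedThread("t0", 1, {"step": 1})
print(needs_human(0.55), needs_human(0.93))
print(resume(fresh))
```

Failing loudly on a version mismatch is one design choice; a migration function that upgrades old state before resuming would be another.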
jonready.com — Claude Fable 5 ‘silent intervention’ critique jonready.com
Initial reports suggested Fable 5 would degrade its own performance without notifying the user to prevent competitors from using the model for training; following developer backlash, Anthropic committed to making these safety interventions visible.