Sources

DiffusionGemma

Last May Google briefly released an experimental Gemini Diffusion model. I tried the preview at the time and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it. That research has returned in the best possible way: as a new open weight (Apache 2 licensed) Gemma model, google/diffusiongemma-26B-A4B-it. NVIDIA are currently hosting the model for free on their NIM cloud API. I used that API to generate this pelican,…

datasette-agent 0.2a0 simonwillison.net

Release: datasette-agent 0.2a0. Highlights from the release notes: Tools can now ask the user questions mid-execution. Tools that declare a context parameter receive a ToolContext object, and await context.ask_user(…) can ask a yes/no, multiple-choice (options=[…]) or free-text (free_text=True) question. While a question is unanswered the agent turn suspends: the question renders as a form in the chat UI and persists to the internal database, so suspended conversations survive a server r…

References

Google AI Developers — DiffusionGemma model card ai.google.dev

DiffusionGemma scored 77.6% on MMLU Pro vs the standard Gemma 4 26B A4B’s 82.6%, and 69.1% on AIME 2026 vs 88.3% for its non-diffusion counterpart; Google labels the model ‘experimental’ and recommends standard Gemma 4 for production reasoning tasks.

DIJA paper (arXiv 2507.11097) arxiv.org

Bidirectional modeling and parallel decoding in diffusion LLMs enable context-aware, masked-input adversarial prompts that reach up to 100% Attack Success Rate, because parallel decoding prevents dynamic filtering or rejection sampling of unsafe content during generation.

Weights & Biases report on Block Diffusion (BD3-LM) wandb.ai

The BD3-LM framework from Arriola et al. (ICLR 2025 oral, Cornell/Stanford/Cohere) divides sequences into blocks: tokens within each block are generated in parallel via discrete diffusion, while the blocks themselves are produced autoregressively, which enables KV caching and variable-length output that pure diffusion couldn’t support.

Roan Monteiro, DiffusionGemma deployment tutorial (Medium) medium.com

vLLM integration required a new ModelState abstraction; recommended serve config uses --max-num-seqs 4 and entropy_bound sampler because the 256-token canvas and 262K vocab pre-allocate large tensors that OOM under normal concurrency.

SiliconANGLE coverage citing HN developer reactions siliconangle.com

One developer called the speed ‘stupid fast,’ saying it turns LLM use from a ‘slot machine where you prompt and wait’ into a ‘pair-programming experience’ — though another HN commenter flagged the ‘big time cost’ of swapping between a fast diffusion model and a smarter sequential one in VRAM.

OpenRouter — Inception Mercury 2 benchmarks openrouter.ai

Inception’s Mercury 2 — the rival commercial diffusion LLM — claims ‘first reasoning’ diffusion model status at 1,000+ tok/s with 77.0% GPQA Diamond and 88.0% HumanEval (Mercury Coder), setting a competitive ceiling that DiffusionGemma’s 40.4% GPQA score does not yet meet.

WorkOS — MCP elicitation explainer workos.com

The elicitation feature was formally introduced in the June 18, 2025 specification update to move away from ‘brittle workarounds’ like multi-step tool calls or hardcoded prompts… a November 25, 2025 refinement added URL Mode elicitation for out-of-band interactions like OAuth flows.

Medium / Data Science Collective — ‘Idempotency for Agents’ medium.com

Agents retry tool calls 15–30% of the time because of network timeouts or model uncertainty… A 10-step agentic process where each step has a 95% success rate has only a ~60% chance of completing cleanly. Without idempotency, every subsequent retry doubles the risk of orphaned records.
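
The ~60% figure quoted above follows from multiplying the per-step success rates together. A minimal sketch of that arithmetic, assuming the steps fail independently (the function name is illustrative and does not come from the cited article):

```python
def clean_run_probability(step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return step_success ** steps


# Ten steps at a 95% per-step success rate, as in the excerpt above.
p = clean_run_probability(0.95, 10)
print(f"{p:.3f}")  # prints 0.599
```

If failures are correlated (for example, one flaky dependency shared by several steps), the real figure can differ, so this is best read as a rough estimate.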

Augment Code — async agent workflows / Crab benchmark summary augmentcode.com

‘Chat-only’ recovery—relying on the model to read its history to figure out where it left off—achieves only 8–13% correctness on complex workloads. In contrast, systems using semantics-aware checkpointing (persisting filesystem changes and process state) reach 100% recovery accuracy.

dev.to — LangGraph vs CrewAI workflow control dev.to

LangGraph uses a first-class interrupt() function combined with checkpoints… A human can then inspect the state, edit it, and use a thread_id to resume exactly where the process left off. CrewAI historically relied on a simple human_input=True flag at the task level.

Conversation Design Institute — designing autonomous agents conversationdesigninstitute.com

Schema Drift: If the underlying agent code or state schema changes while a thread is waiting for human input, resuming that thread can cause crashes… Passive HITL designs (where humans must notice a problem) are less effective than proactive decision gates that trigger only when confidence scores fall below a threshold.

jonready.com — Claude Fable 5 ‘silent intervention’ critique jonready.com

Initial reports suggested Fable 5 would degrade its own performance without notifying the user to prevent competitors from using the model for training; following developer backlash, Anthropic committed to making these safety interventions visible.

Jack Sun, writing.