Willison ships str_replace over Hashline, Nemotron skips the verify harness

TL;DR

Willison’s datasette-agent-edit ships Claude’s str_replace trio despite his own Hashline lifting Grok from 6.7% to 68.3%.
Unified diffs beat sequential str_replace by 3.5× on cost and 6.5× on speed in independent benchmarks.
Nemotron 30B one-shots Snake and Breakout in a Build Small hackathon entry.
The same pipeline fails on Tetris and Three.js, collapsing into blank screens without a verify harness.
Qwen 3.5 4B reportedly one-shots ray-caster games, suggesting training mix beats parameter count.

Today’s two AI-tech ships share an awkward shape: each picks an edit-loop or build-loop pattern the field has already shown to be the weaker option, while the better-documented alternative sits one library away. datasette-agent-edit 0.1a0 wires up Claude’s view / str_replace / insert trio even though Willison’s own Hashline format took Grok Code Fast from 6.7% to 68.3%, and even though independent benchmarks put unified diffs 3.5× cheaper and 6.5× faster than sequential str_replace. A Build Small hackathon entry one-shots Snake on Nemotron 30B but skips the verification harness that recent agentic-coding writeups treat as load-bearing — and Tetris promptly collapses into a blank screen.

Both stories are worth reading as inventories of what the next iteration would have to add: a diff-format swap on one side, a verify loop on the other. The interesting question isn’t whether the shipped versions work — it’s why builders keep leaving the proven upgrade on the shelf.

datasette-agent-edit ships Claude’s edit trio, skips Hashline

Source: simon-willison · published 2026-06-07

TL;DR

datasette-agent-edit 0.1a0 ships a shared base plugin with Claude’s view / str_replace / insert trio.
The str_replace pattern is battle-tested but brittle — a single tab-vs-space mismatch or stray auto-format between reads breaks the edit loop.
Independent benchmarks put unified diffs 3.5× cheaper and 6.5× faster than sequential str_replace edits on a 1,000-line file.
Willison’s own Hashline format took Grok Code Fast from 6.7% → 68.3% on coding tasks with no retraining.

What the plugin actually does

Simon Willison’s new datasette-agent-edit 0.1a0 is plumbing, not a product: a base plugin that other Datasette Agent plugins will import to get a consistent set of editing tools. The trio — view (with line numbers), str_replace (with a uniqueness constraint on old_str), and insert (after a given line) — is copied wholesale from Anthropic’s published text-editor tool. The motivation is straightforward: Willison is planning several plugins that mutate text (collaborative Markdown editing, large SQL query rewrites, SVG tweaks), and he doesn’t want each one reinventing the edit primitives.

That choice is defensible. Claude-family models are explicitly trained against this exact schema, so a Claude-backed agent gets the cheapest path to working edits. The uniqueness rule on old_str is a useful safety rail — if the target string appears twice, the tool refuses rather than silently corrupting the file.

The `str_replace` tax

The pattern’s failure mode is loud and well-documented. Practitioners report that str_replace “requires perfect reproduction of indentation… a single tab-vs-space difference causes a ‘String to replace not found’ error, especially when auto-formatters modify the file between Read and Edit calls” ¹. Models compensate by padding old_str with extra context lines, which makes the whitespace problem worse, not better.

Independent benchmarks point in the same direction. Aider’s own data shows switching from search/replace blocks to unified diffs cut GPT-4 Turbo’s “lazy coding” (placeholder comments replacing real code) by 3× ². A 5-strategy comparison on a 1,000-line file found diff- and script-based patching 3.5× cheaper and 6.5× faster than sequential str_replace-style line edits; whole-file rewrites, the other obvious alternative, lost content in the middle ³. Morph LLM’s framing is the cleanest: diffs let the agent send only the changed lines plus context, bypassing the dominant cost driver of regenerating surrounding code ⁴.

None of this kills the pattern for Datasette’s use case — Markdown fragments and SQL snippets are short, well inside the sweet spot where str_replace works fine — but it does cap the plugin’s ambition.

The Hashline elephant

The pointed context the release notes don’t mention is that Willison himself has shipped something better in adjacent work. His Hashline format tags every line with a short content hash, so models target edits by hash rather than reproducing whitespace at all. Reported gains: Grok Code Fast jumped from 6.7% to 68.3% success on coding tasks with no model retraining, and Gemini gained a stable 8% ⁵. That datasette-agent-edit ships the older Claude primitives anyway reads as a pragmatic bet on Claude-family tool-calling fluency — not a verdict that str_replace is the endgame.

The missing surface: review

One theme prominent in the broader agentic-editing conversation but absent from this release: edits are also a review surface. OpenClaw’s argument is that a chat summary of an edit is insufficient for safety; agents should emit a read-only diff artifact a human can inspect with familiar version-control tools before commit ⁶. A base plugin that standardizes editing across Markdown, SQL, and SVG is the natural place to standardize that audit trail too. The 0.1a0 tag suggests there’s room.

Nemotron 30B one-shots Snake but chokes on Tetris

Source: huggingface-blog · published 2026-06-07

TL;DR

Nemotron 30B one-shot Snake and Breakout in a Build Small hackathon entry hosted on Hugging Face Spaces.
The same pipeline failed on Tetris and full Three.js games, collapsing into blank screens.
Long prompts, skill cards, and a Codex-distilled RAG file skipped the verification harness agentic-coding writeups now treat as mandatory.
Qwen 3.5 4B reportedly one-shots ray-caster games, suggesting training mix beats parameter count here.

What actually broke

“Amazing Digital Dentures” started as an AI digital pet that would spin user to-do items into Three.js mini-games and ended as a Hugging Face Space that emits one-shot HTML snippets. The model in the loop was NVIDIA’s Nemotron-3 Nano 30B A3B — a hybrid Mamba-Transformer MoE that activates only ~3.2–3.6B parameters per token and scores 68.3% on LiveCodeBench and 38.8% on SWE-bench ⁷. On the benchmark sheet, that’s a credible coding model.

In practice, the developer’s long system prompts explaining game mechanics and Three.js syntax “blew past” Nemotron’s effective context window, producing hallucinated APIs even after the RAG pivot ⁸. Outputs degraded into blank screens. Snake and Breakout survived the one-shot path; Tetris did not.

The harness-shaped hole

This is not a Nemotron-only problem. An independent technical review of the same model found it could build a physics simulator from scratch but needed “constant babysitting” as it “occasionally lost focus or executed bizarre, non-functional commands” ⁹. That’s the same drift the hackathon hit when scaling from a 2D arcade game to a 3D engine.

The broader agentic-coding literature is converging on a diagnosis: single-shot generation of production code is “an illusion” without a persistent harness of automated tests and iterative oversight ¹⁰. The Dentures pipeline — long prompt → skill cards → Codex-distilled RAG → one-shot Three.js — skipped exactly that loop. Notably, the most functional Nemotron use in the same hackathon cohort, the “Her” project, made the opposite bet: it ran Nemotron-Mini-4B only for prose and “softer suggestions” and handed the actual code work to a deterministic engine ¹¹. SLM as reasoning layer over a verified executor is starting to look like the working recipe.

flowchart LR
    A[User task] --> B[Long Three.js prompt]
    B --> C[Skill cards]
    C --> D[Codex-distilled RAG]
    D --> E[Nemotron 30B one-shot]
    E --> F((Blank screen))
    E -.missing.-> G[Test harness / iterate]
    G -.would feed.-> E

A 4B counterexample

The “30B can’t do Three.js” reading isn’t unanimous. r/LocalLLaMA users report Qwen 3.5 4B “vibe coding” working ray-caster games in a single shot ¹². If a 4B Qwen can ship a ray caster while a 30B Nemotron stalls on Tetris, the bottleneck is plausibly Nemotron’s training mix — heavier on enterprise and agentic tool-use, lighter on game-loop JavaScript idioms — not raw capacity.

Takeaway

Read as a postmortem of one model, this is a story about a hackathon hitting a 30B ceiling. Read against the surrounding evidence, it’s a case study in two well-documented anti-patterns: picking a model whose benchmark strengths don’t match the target domain ⁷¹², and attempting single-shot generation without a verification harness ¹⁰¹¹. The HTML-toymaker pivot is the rational endpoint of both. The interesting question for the next Build Small cohort isn’t “which 30B?” — it’s whether anyone wires up the test loop the agentic-coding writeups keep saying is the actual unlock.

dev.to — Alex Chen, ‘The Harness Problem is Real’ — https://dev.to/alexchen31337/the-harness-problem-is-real-and-the-edit-tool-is-where-it-starts-nff

str_replace requires perfect reproduction of indentation… a single tab-vs-space difference causes a ‘String to replace not found’ error, especially when auto-formatters modify the file between Read and Edit calls.

↩
aider.chat — Unified Diffs documentation — https://aider.chat/docs/unified-diffs.html

Switching from search/replace blocks to a unified diff format reduced GPT-4 Turbo’s ‘lazy coding’ by 3x in Aider’s benchmark.

↩
dev.to — ‘I Benchmarked 5 File Editing Strategies for AI Coding Agents’ — https://dev.to/ceaksan/i-benchmarked-5-file-editing-strategies-for-ai-coding-agents-heres-what-actually-works-1855

For 10 changes to a 1,000-line file, script- and diff-based patching were 3.5x cheaper and 6.5x faster than sequential str_replace-style line edits; whole-file rewrites suffered from ‘lost-in-the-middle’ omissions.

↩
Morph LLM — ‘Diff Format Explained’ — https://www.morphllm.com/edit-formats/diff-format-explained

Unified diffs let the agent send only the necessary changes plus context lines, bypassing the need to process or regenerate surrounding code — the dominant cost driver for str_replace-style tools on large files.

↩
Ben’s Bites — coverage of Simon Willison’s ‘Hashline’ format — https://www.bensbites.com/p/something-big-is-happening

Grok Code Fast jumped from 6.7% to 68.3% success on coding tasks simply by switching to the Hashline tool interface — no model retraining — while Gemini gained a stable 8%.

↩
OpenClaw Playbook — ‘Diff Artifacts for Agent Review’ — https://www.openclawplaybook.ai/blog/openclaw-diff-artifacts-agent-review/

A chat summary of an edit is insufficient for security; agents should emit a read-only diff artifact a human can inspect before commit, using familiar version-control tools to catch unintended side effects.

↩
MindStudio model card for Nemotron-3 Nano 30B — https://www.mindstudio.ai/models/nemotron-3-nano-30b

68.3% score on LiveCodeBench and a 38.8% achievement on the rigorous SWE-bench… activates roughly 3.2 to 3.6 billion per token

↩ ↩²
PrintingPressAI write-up of ‘Amazing Digital Dentures’ — https://www.printingpressai.com/article/generative-ai/amazing-digital-dentures-a-failed-project

long system prompts explaining game mechanics and Three.js syntax often ‘blew past’ the effective context window of the Nemotron 30B model, leading to degraded performance and hallucinated APIs

↩
Medium technical review of Nemotron-3 Nano 30B A3B — https://medium.com/@leucopsis/a-technical-review-of-nvidias-nemotron-3-nano-30b-a3b-e91673f22df4

while the model could build a physics simulator from scratch, it required constant ‘babysitting’ as it occasionally lost focus or executed bizarre, non-functional commands

↩
Anastasios Vanis, ‘The Sweet Spot of Agentic Coding’ (Medium) — https://medium.com/@anastasiosvanis/the-sweet-spot-of-agentic-coding-why-100-reliable-code-requires-a-harness-c6ccf58e16d5

single-shot generation remains an ‘illusion’ for production-ready apps; true reliability requires a persistent ‘harness’ that includes automated testing and iterative human oversight

↩ ↩²
GenAI Secret Sauce Daily Digest, 2026-06-07 — https://genaisecretsauce.com/genai-secret-sauce-daily-digest-2026-06-07/

‘Her’ utilized the Nemotron-Mini-4B-Instruct model on ZeroGPU exclusively for handling prose and ‘softer suggestions,’ while leaving the heavy lifting of code forensics to a deterministic engine

↩ ↩²
r/LocalLLaMA thread on Qwen 3.5 4B — https://www.reddit.com/r/LocalLLaMA/comments/1rkb8en/qwen_35_4b_is_so_good_that_it_can_vibe_code_a/

Qwen 3.5 4B is cited as being able to create ‘Ray caster games in one shot’

↩ ↩²

Willison ships str_replace over Hashline, Nemotron skips the verify harness

TL;DR

datasette-agent-edit ships Claude’s edit trio, skips Hashline

TL;DR

What the plugin actually does

The str_replace tax

The Hashline elephant

The missing surface: review

Nemotron 30B one-shots Snake but chokes on Tetris

TL;DR

What actually broke

The harness-shaped hole

A 4B counterexample

Takeaway

Footnotes

Jack Sun, writing.

The `str_replace` tax