Datasette Agent self-heals SQL, Daytona boots sandboxes in 60ms

TL;DR

Datasette Agent ships with three plugins: charts, ChatGPT image-gen, and a Fly Sprites code sandbox.
A self-healing SQL loop picks tables, generates queries, then reads SQLite errors and rewrites.
Daytona boots agent sandboxes in ~60ms and scales to 50,000 concurrent in 75 seconds.
Daytona reports 74% MoM growth, with its largest customer running ~850K sandboxes per day.
RL and eval workloads jumped from 0% to ~50% of Daytona usage in months.

Two independent agent-infrastructure releases landed today, at different layers of the stack. Datasette Agent arrives with a plugin SDK and a two-step loop: it picks relevant tables and generates SQL, then reads SQLite errors and rewrites the query autonomously. That matters because multi-turn SQL is still hard for frontier models: Claude-3.7-Sonnet scores 17.78% on BIRD-INTERACT and GPT-5 just 8.67%. One of the launch plugins, a Fly Sprites code sandbox, reopens what Simon Willison himself named the lethal trifecta: private data, untrusted input, and external reach in the same agent.

Daytona, meanwhile, is a runtime layer underneath agents like these. It runs shared-kernel containers instead of Firecracker microVMs to push sandbox boot down to ~60ms; its largest customer runs ~850K sandboxes a day, and RL/eval traffic went from zero to roughly half the workload within months. These are two different bets on what agent infrastructure should optimize for: query reliability on one side, raw sandbox throughput on the other.

Datasette Agent ships self-healing SQL and a plugin SDK

Source: simon-willison · published 2026-05-21

TL;DR

Datasette Agent shipped with three plugins: charts, ChatGPT image-gen, and a Fly Sprites code sandbox.
A two-step loop picks relevant tables, generates SQL, then reads SQLite errors and rewrites the query autonomously.
Claude-3.7-Sonnet hits 17.78% and GPT-5 just 8.67% on BIRD-INTERACT multi-turn SQL — context for why self-heal matters.
The Sprites plugin reopens the “lethal trifecta” — private data, untrusted input, external reach — that Willison himself coined.

The pattern, not just the product

Read as a single drop — primary release plus three plugin repos shipped the same day — this isn’t “Simon ships an agent.” It’s a reference implementation of a specific pattern: two-step table selection plus self-healing SQL, sitting on top of the LLM 0.32a0 refactor that swapped the legacy prompt/response primitive for a message-sequence abstraction explicitly so agents can replay tool turns from SQLite ¹.
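To make the "replay tool turns from SQLite" idea concrete, here is a minimal sketch of a message sequence persisted in SQLite and read back in order. It is not the LLM library's API or storage format: the `messages` table, its columns, and the `append_message` and `replay` helpers are all hypothetical names chosen for illustration.

```python
import json
import sqlite3

# Hypothetical schema: one row per message, ordered by seq within a conversation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    conversation_id TEXT NOT NULL,
    seq INTEGER NOT NULL,
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    PRIMARY KEY (conversation_id, seq)
)
"""


def append_message(conn, conversation_id, role, content):
    """Store the next message in a conversation and return its sequence number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(seq), -1) + 1 FROM messages WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()
    seq = row[0]
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?, ?)",
        (conversation_id, seq, role, json.dumps(content)),
    )
    return seq


def replay(conn, conversation_id):
    """Return the conversation as an ordered list of role/content messages."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE conversation_id = ? ORDER BY seq",
        (conversation_id,),
    )
    return [{"role": role, "content": json.loads(content)} for role, content in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
append_message(conn, "c1", "user", {"text": "How many posts per year?"})
append_message(conn, "c1", "assistant", {"tool_call": {"name": "run_sql", "sql": "SELECT 1"}})
append_message(conn, "c1", "tool", {"rows": [[1]]})
history = replay(conn, "c1")
```

Because tool calls and tool results are stored as ordinary messages, a later turn can rebuild the full context by replaying rows rather than re-running the tools.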

The self-heal is the load-bearing trick. Independent analysis traces the loop: the agent first selects a relevant subset of tables to keep the context window sane, then generates SQL, and on failure parses the SQLite error and rewrites the query without user intervention ². That’s what turns a flaky text-to-SQL one-shot into something you’d point at a real database. It also runs up the token bill — one traced prompt burned 14,000+ output tokens ² — which is why the demo defaults to Gemini 3.1 Flash-Lite rather than a frontier model.
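The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Datasette Agent's actual code: `select_tables`, `fake_model`, and `answer` are hypothetical names, the table picker is a toy keyword match, and the model call is stubbed so the example runs offline.

```python
import sqlite3


def select_tables(conn, question):
    """Step 1: keep only tables named in the question (toy heuristic)."""
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return [n for n in names if n.lower() in question.lower()] or names


def fake_model(question, tables, previous_sql=None, error=None):
    """Stand-in for an LLM call: the first answer has a typo, the retry fixes it."""
    if error is None:
        return "SELECT titel FROM posts ORDER BY id"
    return previous_sql.replace("titel", "title")


def answer(conn, question, generate=fake_model, max_attempts=3):
    """Step 2: generate SQL, run it, and feed any SQLite error back for a rewrite."""
    tables = select_tables(conn, question)
    sql, error = None, None
    attempts = []  # (sql, error) pairs, useful for a "View SQL" style audit trail
    for _ in range(max_attempts):
        sql = generate(question, tables, previous_sql=sql, error=error)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)
            attempts.append((sql, error))
            continue
        attempts.append((sql, None))
        return rows, attempts
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO posts (title) VALUES (?)", [("Hello",), ("World",)])
rows, attempts = answer(conn, "List every title in posts")
```

Each retry is another full model call, so capping `max_attempts` is the simplest lever on the token bill.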

Why the loop matters: the benchmark gap

The launch post is breezy about Flash-Lite “having no trouble writing SQLite queries.” The 2026 benchmark picture is harsher. On BIRD-INTERACT, which tests multi-turn SQL with error recovery, Claude-3.7-Sonnet manages 17.78% and GPT-5 a dismal 8.67% on full agentic tasks ³. The gap between demo-quality SQL and production-grade SQL is exactly what the two-step plus self-heal architecture is built to paper over — and why the “View SQL” inspection affordance is the feature, not a debug nicety.

The plugin surface — and the trifecta it reopens

flowchart LR
    U[User prompt] --> A{Datasette Agent}
    DB[(SQLite DBs)] --> A
    A --> P1[charts: Observable Plot]
    A --> P2[imagegen: ChatGPT Images 2.0]
    A --> P3[sprites: Fly sandbox]
    P3 -. exec .-> S[(100GB persistent FS)]
    P2 -. egress .-> X((OpenAI))

The plugin ecosystem is where the release gets interesting, and where the security story gets messy. Datasette Agent already reads private data and ingests untrusted strings (prompts, scraped blog content). Adding the OpenAI image-gen and Sprites plugins gives it external reach, putting it squarely inside Willison’s own “lethal trifecta” framing ⁴. The standard mitigation, Meta’s “Agents Rule of Two,” would require dropping one leg; Datasette’s answer is read-only DB connections and follower databases. Noma Security argues that’s not enough: they’ve documented “two-out-of-three” failures where an agent with only untrusted input and state-change rights was tricked into wiping a filesystem, no exfiltration needed ⁵.
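Both ideas in this paragraph are easy to show in miniature. The sketch below is an assumption-laden illustration, not Datasette's implementation: `TRIFECTA` and `violates_rule_of_two` are hypothetical names encoding a simplified reading of the rule, and the read-only connection uses SQLite's standard `mode=ro` URI parameter through Python's `sqlite3` module.

```python
import os
import pathlib
import sqlite3
import tempfile

# Simplified labels for the three legs named in the text (hypothetical identifiers).
TRIFECTA = {"private_data", "untrusted_input", "external_reach"}


def violates_rule_of_two(capabilities):
    """True when one agent session holds all three legs at once."""
    return TRIFECTA <= set(capabilities)


# Build a throwaway database file, then reopen it read-only.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE notes (body TEXT)")
writer.execute("INSERT INTO notes VALUES ('secret')")
writer.commit()
writer.close()

read_only_uri = pathlib.Path(path).as_uri() + "?mode=ro"
reader = sqlite3.connect(read_only_uri, uri=True)
rows = reader.execute("SELECT body FROM notes").fetchall()
try:
    reader.execute("DELETE FROM notes")
    write_blocked = False
except sqlite3.OperationalError:
    # SQLite refuses writes on a connection opened with mode=ro.
    write_blocked = True
reader.close()
```

Read-only mode blocks writes to the database, but the session can still read every row, so it does not by itself remove the private-data leg.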

The Sprites bet is the most aggressive move. Unlike E2B-style ephemeral sandboxes, Fly Sprites give each agent session a 100GB persistent root filesystem with Firecracker isolation and ~300ms checkpoint/restore ⁶. That’s a deliberate vote for stateful, resumable agent workflows over run-and-burn — useful for long analytical sessions, risky because the blast radius now persists across turns.

What to actually watch

Three things worth tracking: whether the LLM 0.32a0 “agent abstractions” Willison hints at get extracted into a reusable library other projects can adopt; whether anyone benchmarks the self-heal loop against BIRD-INTERACT to quantify the lift; and whether the Sprites integration ships any defense beyond “it’s sandboxed” once the trifecta is fully wired up.

Daytona trades microVM isolation for 60ms agent sandboxes

Source: latent-space · published 2026-05-21

TL;DR

Daytona boots an agent sandbox in ~60ms and scales to 50,000 concurrent in 75 seconds on bare-metal containers.
The company reports 74% MoM growth, with its largest customer running ~850K sandboxes per day.
RL and eval workloads went from 0% to ~50% of usage in months, with spiky 0-to-100K-CPU bursts.
Daytona runs shared-kernel containers rather than Firecracker microVMs — a speed-for-isolation trade.

The bet: agents need computers, not exec boxes

Ivan Burazin’s pitch to swyx is that “code execution sandbox” undersells what agents actually need. An autonomous agent can’t be paused by closing a laptop, can’t lose disk state between tool calls, and increasingly needs to drive legacy GUI apps that have no API. Daytona’s answer is a stateful sandbox that snapshots disk + memory, resizes CPU/RAM on the fly to dodge OOMs, and — soon — runs Windows and macOS, not just Linux.

The growth numbers back the thesis. Daytona closed a $24M Series A in early 2026 led by FirstMark with Datadog and Figma Ventures participating; LangChain, Turing, Writer, and SambaNova are named customers, with SambaNova reporting ~200 engineering hours per week saved on infra maintenance ⁷. The 74% MoM and 850K-sandboxes-per-day figures are self-reported but consistent with the funding signal.

The workload mix is the more interesting datapoint. Reinforcement learning and eval pipelines went from nothing to roughly half of Daytona’s usage in a few months, and they don’t behave like human traffic — they spike from idle to 100,000 CPUs in seconds. Modal’s engineering team has flagged that at that fan-out, storage throughput, not CPU, becomes the actual bottleneck; CPUs sit idle waiting on I/O ⁸. Daytona’s bare-metal-with-local-IOPS story is aimed squarely at this, though no public benchmark verifies the claim.
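A back-of-the-envelope comparison of the reported figures shows how different a ramp looks from steady traffic. The inputs are the numbers quoted in this piece; treating them as comparable start rates is an assumption, since the daily figure is one customer's volume and the ramp is a platform-level claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_sandboxes = 850_000   # largest customer, reported sandboxes per day
ramp_sandboxes = 50_000     # concurrent sandboxes reached during the ramp
ramp_seconds = 75           # reported time to reach that concurrency

# Sandbox starts per second if the daily volume were spread evenly.
average_rate = daily_sandboxes / SECONDS_PER_DAY
# Sandbox starts per second implied by the ramp claim.
ramp_rate = ramp_sandboxes / ramp_seconds

print(f"average: {average_rate:.1f}/s, ramp: {ramp_rate:.1f}/s")
```

That works out to roughly 9.8 starts per second on average against about 667 per second during the ramp, which is why capacity planned for the average would not absorb a burst.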

Speed vs. isolation: the architecture trade

Burazin says “bare metal” a lot. What he means is bare-metal hosts running containers with a custom scheduler — not microVMs. Northflank’s head-to-head sorts the category into two camps ⁹:

Platform	Isolation model
E2B	Firecracker microVM (dedicated kernel per session)
Daytona	Docker-style containers (shared host kernel)
Cloudflare Sandbox	Containers (shared host kernel)

The 60ms boot and IOPS story is plausible on bare-metal containers in a way it isn’t on microVMs. The cost is the long-standing “containers were never a security boundary” critique — fine for trusted first-party agent code, more fraught for arbitrary user-submitted execution. Unicorner’s profile also notes persistent reliability complaints (workspace creation failures, API timeouts) and pushback on Daytona’s open-source framing, with the control plane being proprietary ¹⁰.

CLI over MCP, and the macOS wall

Burazin’s aside that CLI beats MCP as the agent primitive isn’t just a vendor talking point. One practitioner benchmark reports CLI agents as 10× to 32× more token-efficient than MCP on identical GitHub tasks, because MCP clients front-load entire tool schemas into context ¹¹. It’s a contested take (MCP defenders point to enterprise auth and multi-step orchestration) but a defensible one.

The macOS ambition is where the interview understates the problem. Apple’s EULA caps each host at two concurrent VMs, requires Apple hardware, imposes a 24-hour cooldown before a license can be reassigned, and pins memory snapshots to the specific physical machine ¹². That isn’t a “navigate the licensing” problem — it’s a structural ceiling on horizontal scaling under compliant terms. Any vendor promising elastic macOS-for-agents is either ignoring the EULA or rebuilding the economics from scratch.
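The scaling ceiling is simple arithmetic. A minimal sketch, assuming only the two-VMs-per-host cap described above; `hosts_needed` is a hypothetical helper, and the 50,000 figure is borrowed from Daytona's Linux concurrency claim purely for scale.

```python
import math

MAX_VMS_PER_HOST = 2  # per-host cap on concurrent macOS VMs, as described above


def hosts_needed(concurrent_vms, per_host=MAX_VMS_PER_HOST):
    """Physical Apple hosts required to run the given number of VMs at once."""
    return math.ceil(concurrent_vms / per_host)


hosts_for_linux_scale = hosts_needed(50_000)
hosts_for_three = hosts_needed(3)
```

Under that cap, 50,000 concurrent macOS sandboxes would need 25,000 physical Apple machines, and an odd count such as 3 VMs still needs 2 hosts.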

The category is real, the customer pull is real, and Daytona is winning on speed. Whether shared-kernel isolation holds up, and whether Apple’s licensing terms leave any room, at the next 10× of growth is the open question the interview doesn’t answer.

Footnotes

¹ Simon Willison, LLM 0.32a0 refactor notes: https://simonwillison.net/2026/Apr/29/llm/

The core change in version 0.32 is a major refactor that replaces the legacy prompt-response model with a sequence-of-messages abstraction… heavily influenced by the development of Datasette Agent.

² n1n.ai, analysis of Datasette Agent: https://explore.n1n.ai/blog/exploring-datasette-agent-llm-data-analysis-2026-05-22

It first selects relevant tables to avoid ‘schema stuffing’… then generates and executes SQL. A key feature is its ‘self-healing’ capability: if a generated query fails, the agent analyzes the error message and attempts to rewrite the SQL autonomously.

³ QueryPanel, NL-to-SQL in production 2026: https://querypanel.io/blog/nl-sql-production-2026

In the more rigorous BIRD-INTERACT benchmark—which tests multi-turn SQL conversations and error recovery—all models struggle, with Claude-3.7-Sonnet hitting 17.78% and GPT-5 scoring just 8.67% on full agentic tasks.

⁴ HiddenLayer, The Lethal Trifecta: https://www.hiddenlayer.com/research/the-lethal-trifecta-and-how-to-defend-against-it

If an agent possesses access to private data, exposure to untrusted tokens, and an exfiltration vector, an attacker can use a hidden prompt injection in a document to force the agent to find sensitive data and leak it to an external server.

⁵ Noma Security, critique of Rule of Two: https://noma.security/blog/mcp-servers-agentic-risk-and-the-framework-that-protects-it/

They point to ‘two-out-of-three’ failures where an agent with only untrusted input and state-change capabilities was tricked into wiping a local filesystem without needing to exfiltrate data.

⁶ Northflank, Fly Sprites alternatives review: https://northflank.com/blog/top-fly-io-sprites-alternatives-for-secure-ai-code-execution-and-sandboxed-environments

Sprites provide a 100GB persistent root filesystem backed by object storage… a standout technical feature is the checkpoint/restore capability, which captures the entire system state in roughly 300ms.

⁷ PRNewswire, Daytona $24M Series A: https://www.prnewswire.com/news-releases/daytona-raises-24m-series-a-to-give-every-agent-a-computer-302680740.html

Daytona raises $24M Series A led by FirstMark Capital… strategic investments from Datadog and Figma Ventures… customers include LangChain, Turing, Writer and SambaNova. SambaNova reports the partnership saved roughly 200 hours per week in infrastructure maintenance and six months of total engineering time.

⁸ Modal blog, Applied Compute & RL: https://modal.com/blog/applied-compute-reinforcement-learning

When scaling a workload from 10 to 1,000 nodes, storage throughput often fails to keep pace, causing CPUs to remain idle despite ‘spiky’ demand. Modern RL workflows are bringing the CPU-to-GPU ratio closer to 1:1 as simulation environments become more compute-intensive.

⁹ Northflank blog, Daytona vs E2B: https://northflank.com/blog/daytona-vs-e2b-ai-code-execution-sandboxes

E2B is widely considered the gold standard for running untrusted code because it utilizes Firecracker microVMs, providing hardware-level isolation with a dedicated kernel per session… Daytona and Cloudflare Sandboxes primarily use Docker-based containers that share the host’s kernel.

¹⁰ Unicorner Newsletter, Daytona profile: https://read.unicorner.news/p/daytona

Some users have labeled it ‘fake open source,’ pointing out that while the CLI is accessible, the control plane is proprietary or requires a cloud account for full functionality. Independent reviews have documented persistent workspace creation failures and API timeouts.

¹¹ Medium, ‘CLI-Based Agents vs MCP: The 2026 Showdown’: https://lalatenduswain.medium.com/cli-based-agents-vs-mcp-the-2026-showdown-that-every-ai-engineer-needs-to-understand-7dfbc9e3e1f9

CLI-based agents proved to be 10x to 32x more token-efficient than those using MCP for identical GitHub tasks… MCP clients typically load an entire server’s tool schemas (often tens of thousands of tokens) into the context window before execution.

¹² Koyeb, Top sandbox platforms 2026: https://www.koyeb.com/blog/top-sandbox-code-execution-platforms-for-ai-code-execution-2026

Apple’s EULA mandates that macOS run only on Apple-branded hardware and limits each host to a maximum of two concurrent virtual machines… Apple imposes a 24-hour ‘cooldown’ period before a license can be reassigned, and virtualized memory snapshots are pinned to the specific physical machine, preventing migration between host servers.

Jack Sun, writing.
