Wei (Jack) Sun

OpenAI rebuilds Codex sandbox on Windows, Willison ships Codex-built blog

OpenAI rebuilds Codex's Windows sandbox after a PTY leak; Willison ships a Codex-built Datasette blog and names the maintenance bill.

TL;DR

  • OpenAI rebuilt Codex’s Windows sandbox around dedicated local users so Firewall can enforce outbound blocks.
  • A unified_exec PTY path already bypassed the old boundary: whoami returned the host user and curl exfiltrated data despite network_access=false.
  • Simon Willison shipped a Datasette project blog end-to-end in Codex Desktop, then published the full build transcript.
  • Codex Desktop tops Terminal-Bench 2.0 at 77.3% versus Claude Code’s 65.4%.
  • Willison’s own caveat: doubling AI output while raising maintenance cost effectively quadrupled long-term burden.

Today is a Codex day, and both shipments tell the same story from opposite ends. OpenAI rebuilt the Codex Windows sandbox around dedicated local users, the only design that lets Windows Firewall actually enforce outbound network blocks, after rejecting AppContainer, Windows Sandbox VM, and Mandatory Integrity Control (MIC) as too narrow, too heavy, or too risky. The redesign matters because the previous boundary already leaked: a unified_exec PTY path let whoami see the host user and let curl exfiltrate data even with network_access=false set.

On the user-facing side, Simon Willison shipped a new Datasette project blog built entirely in Codex Desktop and treated the build transcript itself as a public artifact. Codex Desktop leads Terminal-Bench 2.0 at 77.3% to Claude Code’s 65.4%, but Willison’s own postscript names the cost: doubling AI output while raising maintenance burden effectively quadruples the long-term bill. Two shipments, two extensions of Codex’s reach, two costs documented in the same release.

OpenAI’s Codex sandbox runs as a separate Windows user

Source: openai-blog · published 2026-05-13

TL;DR

  • OpenAI rebuilt the Codex Windows sandbox around dedicated local users so Windows Firewall can actually enforce outbound network blocks.
  • AppContainer, Windows Sandbox VM, and MIC were all rejected as too narrow, too heavy, or too risky.
  • Anthropic ducked the same problem — Claude Code on Windows requires WSL2 rather than a native Win32 sandbox.
  • The boundary already leaked once: a unified_exec PTY path let whoami see the host user and curl exfiltrate despite network_access=false.

Why Windows needed a bespoke design

Codex’s sandbox story on macOS and Linux is boring — Seatbelt and bubblewrap exist, you call them. Windows has no equivalent that combines strong isolation with the freedom a coding agent needs to run Git, Python, and arbitrary shell tools. OpenAI’s first cut leaned on synthetic SIDs and write-restricted tokens to fence off the working directory while denying writes to .git and .codex. Network “isolation” was advisory: poison HTTPS_PROXY, stub out SSH, hope tools respect it. They didn’t.
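
Why an environment-variable proxy is only advisory can be seen in a small, self-contained sketch. It is not Codex code: the loopback listener, the port-9 proxy address, and the function names are all assumptions made for illustration. A proxy-aware library reports the poisoned setting, while a raw socket never consults it and connects anyway.

```python
import os
import socket
import threading
import urllib.request

POISONED_PROXY = "http://127.0.0.1:9"  # hypothetical dead-end proxy address


def start_loopback_listener() -> int:
    """Start a one-shot echo server on loopback and return its port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve_once() -> None:
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))  # echo the bytes back
        server.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port


def proxy_is_advisory() -> tuple:
    # "Poison" the proxy variables the way an env-based sandbox might.
    os.environ["HTTPS_PROXY"] = POISONED_PROXY
    os.environ["https_proxy"] = POISONED_PROXY

    # A well-behaved, proxy-aware client sees the poisoned setting...
    seen_by_polite_client = urllib.request.getproxies().get("https")

    # ...but a raw socket ignores environment variables entirely.
    port = start_loopback_listener()
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        sock.sendall(b"payload")
        echoed = sock.recv(1024)
    return seen_by_polite_client, echoed
```

Calling `proxy_is_advisory()` returns both the proxy value a cooperative client would honor and the bytes that still made the round trip. Any tool that opens its own sockets behaves like the second half.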

The rewrite accepts an admin-elevation step at install time so it can do the one thing Windows Firewall actually requires: target a distinct security principal. Restricted tokens spawned from the user’s own session can’t be firewalled. So Codex now provisions two local accounts, CodexSandboxOffline and CodexSandboxOnline, and runs sandboxed commands as them. Independent technical write-ups corroborate this pivot, noting OpenAI explicitly rejected AppContainer’s capability model as “incompatible with open-ended developer workflows” before settling on the hybrid token approach 1.
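
The reasoning about security principals can be modeled in a few lines. This is a toy model, not Windows Firewall's API: the `Process` type, the `alice` account, and the assumption that CodexSandboxOffline is the blocked account while CodexSandboxOnline is allowed are illustrative guesses based on the account names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    principal: str            # account the process runs as
    restricted: bool = False  # token restrictions; invisible to the rule below


# Assumption: only the "Offline" account has its outbound traffic blocked.
BLOCKED_PRINCIPALS = frozenset({"CodexSandboxOffline"})


def outbound_allowed(proc: Process) -> bool:
    """A rule keyed on the principal cannot see token restrictions."""
    return proc.principal not in BLOCKED_PRINCIPALS


# Old design: a restricted token still belongs to the host user, so a
# principal-keyed rule cannot block it without blocking the user too.
old_design = Process("curl", principal="alice", restricted=True)

# New design: the command runs as a dedicated account the rule can target.
new_offline = Process("curl", principal="CodexSandboxOffline")
new_online = Process("git", principal="CodexSandboxOnline")
```

In this model `outbound_allowed(old_design)` is true regardless of the `restricted` flag, which is the gap the dedicated accounts close.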

The four-binary architecture

flowchart LR
    A[codex.exe<br/>primary user] --> B[codex-windows-sandbox-setup.exe<br/>elevated: SIDs, DPAPI, firewall rules]
    A --> C[codex-command-runner.exe<br/>runs as CodexSandboxOffline/Online]
    C --> D[Child process<br/>write_restricted token]
    E[Windows Firewall] -. blocks .-> C
    D -. ACL-denied .-> F[.git, .codex]

Setup runs once with admin rights to create the sandbox users, encrypt their credentials with DPAPI, and install firewall rules keyed to those principals. At runtime, codex-command-runner.exe crosses the user boundary, builds a final write_restricted token (with Everyone, the logon session, and the synthetic sandbox-write SID), and spawns the child. Read ACLs for C:\Windows, C:\Program Files, and the user profile are granted asynchronously so the sandbox user can actually load DLLs.
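
The write policy that the token and ACLs are meant to produce can be sketched as a pure path check. This models the intended outcome only, not how Windows evaluates ACLs. The function name, the choice to protect only the top-level .git and .codex directories, and the rejection of `..` segments are assumptions.

```python
from pathlib import PureWindowsPath

# Directories inside the workspace that stay read-only (from the article).
PROTECTED_DIRS = frozenset({".git", ".codex"})


def is_write_allowed(workspace: str, target: str) -> bool:
    """Return True if a sandboxed write to `target` should succeed."""
    root = PureWindowsPath(workspace)
    path = PureWindowsPath(target)
    try:
        relative = path.relative_to(root)
    except ValueError:
        return False  # outside the workspace: not writable
    parts = relative.parts
    if ".." in parts:
        return False  # refuse unresolved traversal rather than guess
    if parts and parts[0].lower() in PROTECTED_DIRS:
        return False  # ACL-denied directory
    return True
```

For example, `is_write_allowed(r"C:\repo", r"C:\repo\src\main.py")` is true, while the same call with `r"C:\repo\.git\config"` is false.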

Where the boundary already leaks

The blog post reads as a clean architecture; the Codex repo tells a messier story. GitHub issue #19315 documents a unified_exec PTY path that bypassed the entire sandbox — whoami returned the host’s primary user and curl exfiltrated cleanly 2. Separately, researchers chained malicious GitHub branch names through Codex’s container-setup sanitizer and pulled OAuth tokens out of unencrypted local session files in the desktop build 3. The new ACL-deny on .codex is a direct hardening against that class — what Pierce’s agent-sandbox survey calls Configuration-Based Sandbox Escapes, where writable agent config becomes a persistence path 4 — but it doesn’t help if a sibling code path skips the principal switch entirely.
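
One way to catch a sibling path that skips the principal switch is to run whoami through every execution path and check the reported account. The helper below is a hypothetical sketch of that check. The issue does not describe such a test, and the `HOST\account` output shape is an assumption about how whoami reports local accounts on Windows.

```python
# Account names come from the article; matching is case-insensitive.
SANDBOX_ACCOUNTS = frozenset({"codexsandboxoffline", "codexsandboxonline"})


def runs_as_sandbox_user(whoami_output: str) -> bool:
    """Return True if whoami output names one of the sandbox accounts."""
    # Keep only the account part of "HOST\account" and drop the newline.
    account = whoami_output.strip().rsplit("\\", 1)[-1]
    return account.lower() in SANDBOX_ACCOUNTS
```

A check like this would fail for output naming the host's primary user, which is the symptom the issue reports.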

There’s also a structural ceiling. Anything sharing the NT kernel can be escaped through a kernel bug. The only real step up is a separate kernel: either a user-space kernel like gVisor on Linux, at a 10–50% performance penalty most consider too heavy for local CLI use 5, or a full VM. The VM route is the trade Anthropic made by punting to WSL2 6: accept a VM boundary, lose host integration.

Takeaway

This is the most detailed public engineering on native Win32 agent isolation to date, and it is meaningfully better than the proxy-poisoning prototype it replaced. Read it as defense-in-depth, not a hard boundary — the bypasses that have already shipped in Codex itself make clear that “fail-closed” means “fail-closed against non-kernel adversaries who also can’t reach an unsandboxed code path.”


Willison ships Datasette blog, publishes Codex build session

Source: simon-willison · published 2026-05-13

TL;DR

  • Simon Willison shipped a new Datasette project blog built end-to-end in OpenAI Codex Desktop.
  • Willison published the full build session as a public Gist, treating the transcript as a shippable artifact.
  • Codex Desktop leads Terminal-Bench 2.0 at 77.3% vs Claude Code’s 65.4%, but stumbles on architectural reasoning at scale.
  • Doubling AI output while raising maintenance cost effectively “quadrupled” long-term burden, per Willison’s own caveat.

The actual news isn’t the blog

Datasette finally has its own blog at datasette.io/blog, decoupled from Willison’s personal site. That’s housekeeping. The detail worth paying attention to is buried in the announcement: Codex Desktop now exports a full Markdown transcript of a coding session, and Willison published the entire build log as a GitHub Gist alongside the launch.

The Modelwire writeup names the shift directly — Willison is “normalizing AI-generated logs as legitimate technical artifacts, moving beyond viewing them as mere temporary scaffolding” 7. That’s a small primitive with large implications. If the transcript is the artifact, code review extends to reviewing the prompts and the model’s reasoning chain. PR descriptions start linking to sessions. “How was this built?” gets a literal answer instead of a reconstructed one.

Codex Desktop, in context

Willison’s choice of tool isn’t incidental. Independent benchmarks put Codex Desktop ahead of Claude Code on agentic coding tasks:

| Tool | Terminal-Bench 2.0 | Token efficiency |
| --- | --- | --- |
| Codex Desktop | 77.3% | baseline |
| Claude Code | 65.4% | 2–4× more tokens 8 |

That efficiency gap is what makes Codex viable for the kind of multi-agent, long-running sessions Willison favors. But the marketing should be read against hands-on dissent. Stack Overflow’s testing found Codex “excelled at identifying unused variables but failed to recognize deeper architectural anti-patterns, such as tight coupling between domains and databases” 9. r/dotnet practitioners report it “occasionally enters ‘loops’ when trying to fix bugs, necessitating a handoff to rivals like Claude” 10.

A static project blog with RSS feeds and template styling is exactly the shape of problem Codex handles cleanly. Conclusions about its architectural judgment shouldn’t generalize from this build.

Vibe coding vs. vibe engineering

The source article reaches for “vibe coding” approvingly. Willison’s own public position is sharper than that. On the Fediverse he wrote that if a developer uses AI to double their output but the resulting code is harder to maintain, they have “essentially ‘quadrupled’ their long-term burden” 11. He distinguishes vibe coding (throwaway scripts) from vibe engineering (production code with accountability). A project blog that will accumulate release notes for years sits firmly on the engineering side, which is presumably why the transcript got published in the first place. An auditable build log is the accountability mechanism.
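
The “quadrupled” figure follows from simple multiplication if each unit of code also costs twice as much to maintain. That per-unit factor is an assumption, since the quoted post only says the code is “harder to maintain”. A minimal sketch of the arithmetic:

```python
def relative_burden(output_multiplier: float, upkeep_multiplier: float) -> float:
    """Long-term burden relative to baseline: volume times per-unit upkeep."""
    return output_multiplier * upkeep_multiplier


# Twice the code, each unit twice as costly to maintain (assumed factor).
doubled_and_harder = relative_burden(2.0, 2.0)

# Twice the code with unchanged upkeep only doubles the burden.
doubled_same_quality = relative_burden(2.0, 1.0)
```

Under these assumptions the first case gives 4.0 and the second 2.0, so the size of the penalty depends entirely on how much harder the generated code is to maintain.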

The Datasette 1.0 backdrop

The blog launch reads like clearing the runway, not a victory lap. The actual 1.0 alpha state is unglamorous: releases 1.0a28 and 1.0a29 have been dedicated to chasing “gnarly” segfaults and race conditions in datasette.close() and internal connection management 12. Security hardening (Sec-Fetch-Site CSRF replacement) landed in 1.0a27. The pipeline of announcements Willison alludes to is real, but the underlying project is mid-stabilization, not mid-celebration. Worth setting expectations accordingly when the next post drops.
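
For readers unfamiliar with the Sec-Fetch-Site approach, the general idea can be sketched as follows. This is a generic illustration, not Datasette's implementation. The function name, the lowercase header keys, and the decision to allow requests that omit the header are assumptions, and a production check would typically also compare Origin against Host.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})
TRUSTED_SITES = frozenset({"same-origin", "none"})


def allow_request(method: str, headers: dict) -> bool:
    """Reject state-changing requests that a browser marks as cross-site."""
    if method.upper() in SAFE_METHODS:
        return True  # safe methods should not change state
    site = headers.get("sec-fetch-site")  # assumes lowercase header keys
    if site is None:
        return True  # assumption: header absent means a non-browser client
    return site.lower() in TRUSTED_SITES
```

With this policy a cross-site POST is refused while a same-origin POST passes, and no per-form token is required.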

Footnotes

  1. Luis Cardoso, ‘Sandboxes for AI’ technical blog: https://www.luiscardoso.dev/blog/sandboxes-for-ai

    OpenAI’s Codex team originally rejected AppContainer because its narrow, capability-based model was incompatible with open-ended developer workflows… They instead pivoted to a hybrid approach using write-restricted tokens

  2. GitHub issue openai/codex#19315, unified_exec PTY bypass: https://github.com/openai/codex/issues/19315

    the unified_exec PTY (Pseudo-Terminal) path… bypassed these restrictions… a simple whoami command would return the host’s primary user account, and networked commands like curl could successfully exfiltrate data

  3. Cyber Security News, Codex command-injection disclosure: https://cybersecuritynews.com/openai-codex-command-injection-vulnerability/

    malicious GitHub branch names could bypass input sanitization during container setup, allowing attackers to exfiltrate OAuth tokens… desktop versions of Codex allegedly stored session tokens in unencrypted local files

  4. pierce.dev, ‘A deep dive on agent sandboxes’: https://pierce.dev/notes/a-deep-dive-on-agent-sandboxes

    Configuration-Based Sandbox Escapes (CBSE), where the agent’s own startup logic or configuration files are writable from within the sandbox, allowing for persistent host-side re-execution

  5. dev.to, ‘OS-level sandboxing: kernel isolation for AI agents’: https://dev.to/uenyioha/os-level-sandboxing-kernel-isolation-for-ai-agents-3fdg

    while tools like bubblewrap or Landlock provide strong isolation, they share the host kernel; a kernel-level vulnerability could still allow an agent to escape… gVisor… with a 10–50% performance overhead that is generally deemed too heavy for local CLI use

  6. Claude Code official security docs (Anthropic): https://code.claude.com/docs/en/security

    Claude Code’s native sandbox currently supports macOS (via Seatbelt) and Linux (via bubblewrap)… for Windows users, it requires WSL2 to enforce kernel-level filesystem and network restrictions

  7. themodelwire.com, coverage of the Datasette blog launch: https://www.themodelwire.com/article/welcome-to-the-datasette-blog-01KRHXMC7P16H2SSBASNVPJV0M

    Willison published the session that built the blog as a GitHub Gist, treating the AI’s step-by-step process as essential documentation… normalizing AI-generated logs as legitimate technical artifacts, moving beyond viewing them as mere temporary scaffolding.

  8. buildfastwithai.com, Claude Code vs Codex 2026: https://www.buildfastwithai.com/blogs/claude-code-vs-codex-2026

    Codex Desktop… holds a clear lead on Terminal-Bench 2.0, scoring 77.3% compared to Claude’s 65.4%… Codex is 2 to 4 times more token-efficient than Claude Code for equivalent tasks.

  9. Stack Overflow Blog, ‘A new worst coder has entered the chat’: https://stackoverflow.blog/2026/01/02/a-new-worst-coder-has-entered-the-chat-vibe-coding-without-code-knowledge/

    In practical tests involving legacy codebases, developers found that Codex excelled at identifying unused variables but failed to recognize deeper architectural anti-patterns, such as tight coupling between domains and databases.

  10. Reddit r/dotnet, Codex in VS Code thread: https://www.reddit.com/r/dotnet/comments/1qcuhoj/is_it_just_me_i_find_openai_codex_in_vscode/

    Some users expressed frustration that the AI occasionally enters ‘loops’ when trying to fix bugs, necessitating a handoff to rivals like Claude for more complex reasoning.

  11. Simon Willison (Fediverse post): https://fedi.simonwillison.net/@simon/115333498401581242

    If a developer uses AI to double their output but the resulting code is harder to maintain, they have essentially ‘quadrupled’ their long-term burden.

  12. Datasette docs (1.0a release notes): https://docs.datasette.io/en/latest/

    Recent alpha releases (1.0a28 and 1.0a29) were dedicated to resolving ‘gnarly’ segfaults and race conditions involving the new datasette.close() method and internal connection management.
