Wei (Jack) Sun

Around the weights: gating, silicon, and scaffolding take over

GPT-5.5's gated launch, Claude Code's harness postmortem, and DeepSeek V4 on Huawei silicon all locate the action outside the model weights.

TL;DR

  • OpenAI launched GPT-5.5 to ChatGPT and Codex but withheld the API; Simon Willison reverse-engineered the Codex endpoint within 48 hours.
  • GPT-5.5 API pricing, when it lands, doubles GPT-5.4 to $5/$30 per million tokens, with Pro at $30/$180.
  • Anthropic’s postmortem pins two months of Claude Code regressions on three harness bugs — stripped thinking tokens, evicted system prompts, tool-call race conditions.
  • DeepSeek V4-Pro (1.6T/49B active, MIT, 1M context) prices at $1.74/$3.48 per million tokens, under a third of Gemini 3.1 Pro on output.
  • V4-Flash hallucinates up to 96% on unknown facts, and the stack runs on Huawei silicon rather than Nvidia.

Today’s three features all point to the same place — and it isn’t the weights. OpenAI shipped GPT-5.5 without an API, turning distribution itself into the gate (which practitioners promptly picked, reverse-engineering the Codex endpoint inside 48 hours). Anthropic’s postmortem on two months of degraded Claude Code output blames three bugs in the harness, not the model — thinking tokens stripped, system prompts evicted, tool calls timing out on success. And DeepSeek V4 lands at roughly a third of Gemini 3.1 Pro’s output cost, but the headline isn’t the price; it’s that a frontier-adjacent open-weights model is now training and serving on a Huawei stack.

Read together, they describe a frontier where the model is no longer the product. Gating decisions, inference silicon, and the scaffolding layer around the weights are now where capability is gained, lost, or reshaped before a user ever sees a token.

GPT-5.5 ships without an API, and practitioners route around it in 48 hours

Source: simon-willison · published 2026-04-23

TL;DR

  • OpenAI shipped GPT-5.5 to ChatGPT and Codex but held back the API, citing safety work with partners.
  • Pricing, when the API lands, doubles GPT-5.4: $5/$30 per 1M tokens, with Pro at $30/$180.
  • Within 48 hours, Simon Willison reverse-engineered the Codex subscription endpoint into an LLM plugin to benchmark it anyway.
  • Early verdicts (Willison, Mollick): genuinely capable, expensive, and still on the “jagged frontier” — strong on construction, unpredictable elsewhere.

A launch with a missing piece

GPT-5.5 dropped this week into ChatGPT and the Codex CLI, but not the API. OpenAI’s stated reason — “API deployments require different safeguards… we are working closely with partners and customers” — is the kind of staged-rollout justification that used to be uncontroversial. In 2026 it’s a problem, because the people who actually evaluate frontier models do their work against an API, not a chat box with hidden system prompts in the way.

The pricing preview makes the stakes clear. GPT-5.5 will land at $5 per 1M input / $30 per 1M output — exactly double GPT-5.4. GPT-5.5 Pro goes to $30 / $180. GPT-5.4 stays in the catalog at half the price, which puts the lineup in a familiar shape: 5.4 is now Sonnet to 5.5’s Opus. That’s a real product decision, not a rounding error, and it pushes serious workloads back toward the cheaper tier unless 5.5 earns its premium on the specific task.
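
To make the premium concrete, here is a rough per-call cost sketch at the previewed prices; the GPT-5.4 figures are inferred from the “half the price” framing rather than quoted directly, so treat them as assumptions.

```python
# Back-of-envelope cost comparison at the previewed prices (USD per 1M tokens).
# GPT-5.4 prices are assumed to be exactly half of GPT-5.5's, per the framing above.
PRICES = {
    "gpt-5.5":     {"input": 5.00,  "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00},
    "gpt-5.4":     {"input": 2.50,  "output": 15.00},  # assumption: half of 5.5
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the listed per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical agentic coding turn: ~10k tokens of context in, ~2k tokens out.
for name in PRICES:
    print(f"{name}: ${call_cost(name, 10_000, 2_000):.3f}")
# gpt-5.5: $0.110, gpt-5.5-pro: $0.660, gpt-5.4: $0.055
```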

The Codex backdoor becomes the evaluation channel

The interesting story this week isn’t the model — it’s how fast the community routed around the missing API. OpenAI recently hired OpenClaw’s Peter Steinberger and, in the process, blessed the /backend-api/codex/responses endpoint that Codex CLI, Pi, and OpenCode all use to consume a $20/month ChatGPT subscription as if it were an API. Romain Huet’s confirmation (“wherever they like… JetBrains, Xcode, OpenCode, Pi, and now Claude Code”) effectively turned the consumer subscription into a developer credential.

Willison took the implication literally: he had Claude Code reverse-engineer the open-source openai/codex repo, lifted the auth flow, and shipped llm-openai-via-codex — an LLM plugin that pipes prompts through your Codex login to any model the subscription unlocks, GPT-5.5 included. Same week, he cut LLM 0.31 and a GPT-5.5 prompting guide. The whole “GPT-5.5 cluster” is really one story told four ways: the launch, the workaround, the tool release that makes the workaround usable, and the evaluation it enables.

The pelican benchmark hints at what reasoning budget now buys you. Default settings spent 39 reasoning tokens and produced a mangled SVG. Cranking reasoning_effort to xhigh spent 9,322 tokens over nearly four minutes, switched to a CSS-and-gradients approach, and rendered something recognizably bird-shaped. That’s a ~240× spread on a single prompt, and it’s now a knob the caller has to budget for explicitly.
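
For anyone who wants to poke at this themselves, here is a hedged sketch of driving the model through the plugin from LLM’s Python API. The model alias is an assumption (the plugin may register a different ID), and it presupposes an authenticated Codex/ChatGPT login.

```python
# Hypothetical usage of llm-openai-via-codex via LLM's Python API.
# "gpt-5.5" is an assumed alias; check `llm models` for what the plugin
# actually registers, and authenticate your Codex/ChatGPT login first.
import llm

model = llm.get_model("gpt-5.5")
response = model.prompt(
    "Generate an SVG of a pelican riding a bicycle",
    reasoning_effort="xhigh",  # the budget knob discussed above; expect far more tokens and latency
)
print(response.text())
```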

The verdict, with caveats

Ethan Mollick’s same-week review lands on “jagged frontier” — GPT-5.5 is excellent at things and surprisingly weak at others, in ways that don’t generalize from the marketing. That’s consistent with Willison’s pelican result: the model can do the work, but you pay (in dollars and latency) for the reasoning budget that gets it there.

Two open questions the launch leaves on the table. First: does the doubled price hold once independent benchmarks land, or does GPT-5.4 quietly absorb most of the volume? Second: how long does the Codex-subscription backdoor stay “officially supported” once people start running production workloads through a $20/month plan? Independent benchmark verification isn’t yet available — that’s the next shoe.


Claude Code’s two bad months were a harness problem, not a model problem

Source: simon-willison · published 2026-04-24

TL;DR

  • Anthropic’s postmortem blames three bugs in the Claude Code harness — not the underlying models — for two months of degraded output.
  • The worst bug, shipped March 26, repeatedly stripped “thinking” tokens from any session idle for over an hour, making Claude seem forgetful.
  • A context-window pruner also evicted system prompts and tool definitions; a tool-execution race condition triggered timeout loops on successful commands.
  • For agent builders: the scaffolding around the model is now a first-class source of regressions, and standard model evals won’t catch it.

The harness was eating its own context

For most of February and March, Claude Code users complained the tool had gotten dumber: forgetful mid-task, repetitive across turns, prone to re-running shell commands it had just succeeded at. Anthropic’s April 23 postmortem confirms the complaints were real and pins them on three independent regressions in the Claude Code harness — the runtime that assembles prompts, manages session state, and executes tools — rather than on Sonnet or Opus themselves.

That distinction matters. Internal model evals never flagged anything because the models passed them in isolation. The bugs lived in the layer between the user and the weights.

The three bugs

Bug | Shipped | What it did | User-visible symptom
--- | --- | --- | ---
Idle-session thinking deletion | Mar 26 | Optimization meant to prune old chain-of-thought once on resume; instead ran every turn for the rest of the session | Claude “forgot” recent reasoning, repeated suggestions
Context-window mismanagement | (unspecified) | Long sessions summarized away system prompts and tool definitions to make room for code | Hallucinated tool calls, ignored formatting rules
Tool execution race condition | (unspecified) | Harness reported timeout to the model even when the command had succeeded | Infinite retry loops, wasted tokens, “stuck” agent

The idle-session bug is the one worth dwelling on. It was a latency optimization: clear stale thinking tokens once when a user comes back after an hour, so the next response doesn’t have to re-process them. A logic error made the cleanup fire on every subsequent turn, continuously decapitating the model’s recent reasoning. Simon Willison points out that he personally has 11 Claude Code sessions older than an hour open right now and spends more time prompting in stale sessions than fresh ones — a usage pattern the bug specifically punished.
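
A hypothetical reconstruction of the failure mode (not Anthropic’s actual code) makes the one-shot-versus-every-turn distinction concrete:

```python
# Hypothetical reconstruction of the idle-session bug; not Anthropic's actual code.
IDLE_THRESHOLD_S = 3600  # one hour


def drop_thinking(session):
    """Remove stale chain-of-thought messages from the assembled context."""
    session.messages = [m for m in session.messages if m["role"] != "thinking"]


def on_turn_buggy(session, idle_seconds):
    # Intended as a one-shot cleanup when the user resumes after an hour,
    # but the flag is never cleared, so every later turn in the session
    # also deletes the model's freshest reasoning.
    if idle_seconds > IDLE_THRESHOLD_S:
        session.stale = True
    if session.stale:
        drop_thinking(session)


def on_turn_fixed(session, idle_seconds):
    # Prune only on the turn that actually follows the idle gap; once the
    # user is active again, idle_seconds drops below the threshold and the
    # cleanup cannot re-fire.
    if idle_seconds > IDLE_THRESHOLD_S:
        drop_thinking(session)
```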

Why this is hard to catch

Harness bugs sit in a blind spot. Model benchmarks run against fresh contexts and clean tool wrappers, so a regression that only manifests after an hour of idle time, or only in long multi-file refactors, or only when a tool’s exit code races with a timeout, will pass every offline eval.

“The kinds of bugs that affect harnesses are deeply complicated, even if you put aside the inherent non-deterministic nature of the models themselves.” — Simon Willison

Anthropic claims the patches have closed a 14% gap in multi-file refactoring success rate and cut repetitive reasoning loops by ~22%. Those are internal numbers; independent benchmark verification is not yet available.

The takeaway for anyone shipping agents

If you’re building on top of an LLM, the postmortem is a useful forcing function. Three things the Claude Code team got bitten by are generic to every agentic system:

  1. Session-lifetime state mutations (cache eviction, summarization, thinking-token pruning) need invariants and tests, not just one-shot correctness checks.
  2. Context assembly is load-bearing. Anything that silently rewrites the prompt — including “smart” summarizers — can decapitate the agent’s instructions without raising an error.
  3. Tool-execution wrappers need to be deterministic about success/failure signaling. A false timeout looks identical to model confusion in logs (a minimal sketch follows this list).
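
A minimal sketch of what the third point can look like in practice, under generic assumptions rather than anything Claude Code actually does:

```python
# Minimal sketch of a deterministic tool wrapper; generic, not Claude Code's internals.
import subprocess


def run_tool(cmd: list[str], timeout_s: float = 120.0) -> dict:
    """Report an outcome derived only from the process itself, never from a racing timer."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Only reported when the command genuinely failed to finish in time.
        return {"status": "timeout", "exit_code": None, "stdout": "", "stderr": ""}
    return {
        "status": "success" if proc.returncode == 0 else "error",
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Logging the structured outcome next to the raw exit code also makes a false timeout distinguishable from genuine model confusion after the fact.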

The unsexy infrastructure around the model is now where the regressions live. Anthropic’s willingness to publish a detailed postmortem rather than quietly ship fixes is the more interesting signal here — it suggests they expect the rest of the agent-tooling industry to start treating harness reliability as a discipline of its own.


DeepSeek V4 undercuts the frontier — on a Huawei stack

Source: simon-willison · published 2026-04-24

TL;DR

  • V4-Pro (1.6T/49B active) and V4-Flash (284B/13B) ship under MIT, both with 1M-token context.
  • Pro at $1.74/$3.48 per million tokens — under a third of Gemini 3.1 Pro on output.
  • Independent benchmarks confirm “frontier minus 3–6 months”; Kimi K2.6 still leads open weights at 54 vs Pro’s 52.
  • Catches: V4-Flash hallucinates up to 96% on unknown facts, and full weights still need ~170GB even quantized.

Frontier-adjacent at a structural discount

The pricing is the part everyone will quote: V4-Flash at $0.14/$0.28 per million tokens beats GPT-5.4 Nano outright, and V4-Pro at $1.74/$3.48 comes in at under a third of Gemini 3.1 Pro’s output price. DeepSeek’s own paper attributes this to KV-cache and FLOPs reductions of roughly 10× versus V3.2 at million-token contexts.

SemiAnalysis put numbers on it: V4-Pro needs ~9.6 GiB of KV cache per sequence at 1M tokens, against an estimated 83.9 GiB for a V3.2-style architecture — small enough to serve million-token windows on a single 8×B200 node without memory thrashing [1]. That is the engineering achievement, and it’s why the pricing isn’t a loss-leader.

Where it actually ranks

Artificial Analysis’s evaluation tracks Simon’s framing but sharpens it. V4-Pro lands at 52 on the Intelligence Index — a 10-point jump over V3.2 — and tops the GDPval-AA index for real-world professional tasks (legal, finance) at 1554, ahead of GLM-5.1 and MiniMax-M2.7 [2]. DeepSeek’s “3–6 months behind frontier” self-assessment holds: V4-Pro-Max beats GPT-5.2 and Gemini-3.0-Pro on reasoning suites but trails GPT-5.4 and Gemini 3.1 Pro [2].

It is also no longer the open-weights leader on raw intelligence. Kimi K2.6 sits at 54 and is natively multimodal; V4 launched text-only [3].

Open-weights model | AA Intelligence | Modality
--- | --- | ---
Kimi K2.6 | 54 | Text + image + video
DeepSeek V4-Pro | 52 | Text only
GLM-5.1 | ~50 | Text + image

The Huawei angle Simon skipped

The most under-discussed piece: V4 was optimized for Huawei Ascend 910C/950PR and “Supernode” interconnect, reportedly achieving 85% NPU utilization and a ~40% hardware-cost reduction versus comparable NVIDIA deployments [4]. The cheap API isn’t just algorithmic efficiency — it’s a cheaper, sanctions-resilient silicon stack underneath. That’s a strategic signal as much as a technical one: the Chinese open-weights tier is now demonstrably decoupled from H100 supply.

Two catches worth naming

First, factuality. Artificial Analysis found V4-Flash hallucinates up to 96% of the time when asked about facts it doesn’t know — the model strongly prefers confabulation over abstention [5]. V4-Pro-Max leads open models on SimpleQA at 57.9%, but the reasoning variants carry a measurable accuracy tax. Don’t deploy Flash anywhere an “I don’t know” matters.

Second, the local-inference dream. Simon hopes to run Flash on a 128GB M5 MacBook; practitioners on r/DeepSeek point out the active-parameter figure is a serving metric, not a memory one. All 284B Flash weights must be resident, and Unsloth’s mixed-precision GGUFs land near 170GB — ruling out anything below an M5 Max with 192GB+, with reported throughput of 5–8 tok/s on aggressive quants [6]. Open weights, yes. Laptop weights, no.
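
The memory arithmetic is easy to run yourself; a back-of-envelope sketch, assuming roughly 4.8 bits per weight so the numbers line up with the reported ~170GB GGUFs:

```python
# MoE memory math: every expert's weights must be resident, regardless of how
# few parameters are active per token. 4.8 bits/weight is an assumption chosen
# to match the ~170GB mixed-precision GGUF figure cited above.
def weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate on-disk / in-memory size of the full weight set, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"{weights_gb(284):.0f} GB")  # ~170 GB: what V4-Flash actually requires
print(f"{weights_gb(13):.0f} GB")   # ~8 GB: what the 13B "active" figure might suggest
```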

The takeaway: V4 is the clearest evidence yet that the frontier price floor is a Western-stack artifact, not a physics constant.

Round-ups

It’s a big one

Source: simon-willison

Simon Willison’s weekly newsletter rounds up coverage of GPT-5.5, ChatGPT Images 2.0 and Qwen3.6-27B, packaging 5 blog posts, 8 links, 3 quotes and a new chapter of his Agentic Engineering Patterns guide alongside benchmark images of pelicans, a possum and raccoons.

Here’s how our TPUs power increasingly demanding AI workloads.

Source: google-ai-blog

Google published an explainer video on how its Tensor Processing Units handle increasingly demanding AI workloads, walking through the TPU architecture and its role in Google Cloud’s AI infrastructure stack.

Serving the For You feed

Source: simon-willison

Bluesky’s For You custom feed, used by roughly 72,000 people, runs on a single Go process with SQLite on a 16-core, 96GB gaming PC in spacecowboy’s living room, fronted by a $7/month OVH VPS over Tailscale. Total operating cost: $30/month, including $20 of electricity.

russellromney/honker

Source: simon-willison

Honker is a Rust SQLite extension that brings Postgres-style NOTIFY/LISTEN, durable Kafka-like streams and the transactional outbox pattern to SQLite. It adds 20+ custom SQL functions and lets workers poll the .db-wal file every 1ms for near-real-time delivery without running full queries.

Extract PDF text in your browser with LiteParse for the web

Source: simon-willison

Simon Willison ported LlamaIndex’s LiteParse PDF extractor — which uses spatial heuristics and Tesseract OCR rather than AI — to run entirely in the browser via PDF.js. He vibe-coded the port with Claude Code in a 59-minute session, deploying to GitHub Pages without reading a line of the generated TypeScript.

WHY ARE YOU LIKE THIS

Source: simon-willison

ChatGPT Images 2.0, prompted to depict a horse riding an astronaut riding a pelican on a bicycle, spontaneously added a roadside sign reading ‘WHY ARE YOU LIKE THIS’ — a layered variant of Simon Willison’s pelican-on-a-bicycle benchmark stacking multiple absurd subjects.

Millisecond Converter

Source: simon-willison

Willison shipped a small browser tool that converts millisecond values into seconds and minutes, built to scratch the itch of reading prompt durations reported by his LLM command-line utility.

Footnotes

  1. SemiAnalysis newsletter: https://newsletter.semianalysis.com/p/the-coding-assistant-breakdown-more

    V4-Pro requires approximately 9.6 GiB of KV cache per sequence [at 1M tokens], nearly nine times less than the estimated 83.9 GiB required by a V3.2-style architecture… makes DeepSeek V4 one of the few models capable of serving million-token windows on a single node (e.g., 8x B200 GPUs) without catastrophic memory overhead

  2. Artificial Analysis: https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash

    DeepSeek V4 Pro… Intelligence Index score of 52, a significant jump from V3.2’s 42… GDPval-AA index ranks V4 Pro as the current leader among open models for real-world professional tasks (e.g., legal and finance work), scoring 1554—ahead of rivals like GLM-5.1 and MiniMax-M2.7

  3. Digital Applied (Kimi K2 comparison): https://www.digitalapplied.com/blog/kimi-k2-thinking

    Kimi K2.6 currently leads the open-weights category with a score of 54 on the Artificial Analysis Intelligence Index, followed closely by DeepSeek V4 Pro at 52 and GLM-5.1 at approximately 50… Kimi K2.6 is natively multimodal, handling text, images, and video within a single architecture, while DeepSeek V4 remains text-only at launch

  4. BiggoFinance / 36kr (Huawei Ascend pivot): https://finance.biggo.com/news/202604280450_DeepSeek_V4_Price_Cut_Huawei_Chip_AI

    V4 was optimized to run on Huawei Ascend processors and ‘Supernode’ technology… the final V4 release achieved a reported 85% hardware utilization rate on Huawei NPUs, which is estimated to provide a 40% reduction in hardware costs compared to comparable NVIDIA A100 deployments

  5. Artificial Analysis (hallucination findings): https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash

    the models exhibit high ‘hallucination rates’ (up to 96% for Flash) when faced with unknown information, indicating a tendency to prioritize response generation over factual uncertainty

  6. r/DeepSeek technical deep-dive thread: https://www.reddit.com/r/DeepSeek/comments/1sv0n20/deepseek_v4_technical_deep_dive_16t_params_1m/

    the ‘active parameter’ count (e.g., 13B for V4-Flash) is misleading for local deployment; since it is an MoE model, the full 284B or 1.6T parameter weights must still be loaded into memory or high-speed storage, making local hosting inaccessible for most consumers
