Wei (Jack) Sun

The model isn't the variable anymore — access, harness, and silicon are

GPT-5.5, DeepSeek V4, and Claude Code's regression all show that access policy, harness code, and hardware stacks now decide which model wins.

TL;DR

  • OpenAI ships GPT-5.5 to ChatGPT and Codex but withholds the production API; Simon Willison routes around the gate within 24 hours.
  • GPT-5.5 pricing doubles to $5/$30 per million tokens and reportedly loses to Claude Opus 4.7 in seven Tom’s Guide categories.
  • DeepSeek V4-Pro undercuts the frontier on price, but a 94% hallucination rate on unknown queries undermines the headline claim.
  • DeepSeek V4 was co-designed on Huawei’s CANN stack, tying its pricing story directly to the export-control fight.
  • Anthropic confirms three harness bugs — not weights — caused two months of Claude Code degradation, its second silent-regression postmortem in eight months.

Three frontier stories land today, and none of them are really about the weights. OpenAI ships GPT-5.5 to ChatGPT and Codex but withholds the production API on capability-risk grounds — and Simon Willison reverse-engineers Codex auth to route around the gate inside 24 hours. DeepSeek’s V4 launch reads as a pricing earthquake until you notice the model was co-designed on Huawei’s CANN stack, which makes the cost story a hardware-policy story. Anthropic’s Claude Code postmortem confirms two months of degradation came from three harness bugs, not the model — the second silent-regression admission in eight months. Read together, the day says the deciding variable in frontier AI competition has migrated off the model card. Access policy decides who can use a model. Harness code decides whether it works as advertised. The hardware stack decides what it costs. The weights are increasingly the least interesting part of any of these announcements.

GPT-5.5 ships through a side door — and Simon Willison walks right in

Source: simon-willison · published 2026-04-23

TL;DR

  • OpenAI launched GPT-5.5 in ChatGPT and Codex but held back the production API, citing HIGH cyber/bio capability ratings.
  • Within 24 hours Simon Willison shipped llm-openai-via-codex, an LLM plugin that reverse-engineers Codex auth to route around the missing API.
  • Pricing doubles to $5/$30 per 1M tokens ($30/$180 for Pro); the “40% more efficient” defense is contested.
  • Independent reviewers say the jagged frontier is alive: GPT-5.5 reportedly loses to Claude Opus 4.7 in seven Tom’s Guide categories.

A launch defined by what wasn’t shipped

The headline isn’t GPT-5.5’s capability bump — it’s that OpenAI shipped the model into ChatGPT and Codex while explicitly delaying the public API, citing the need for “different safeguards” at scale. That language matters: OpenAI’s own Preparedness Framework rated GPT-5.5 as HIGH on both cyber and biological capability, triggering a dedicated Bio Bug Bounty and a staggered rollout [1]. The API isn’t late by accident; it’s gated.

So Simon Willison’s three-part response — an llm-openai-via-codex plugin, an llm 0.31 release wired to consume it, and a same-week prompting-guide writeup — is more pointed than it looks. He had Claude Code reverse-engineer the open-source Codex CLI’s token storage and built a path that lets any LLM user pipe prompts to GPT-5.5 through their personal ChatGPT subscription. The plugin works because OpenAI’s Romain Huet publicly blessed third-party use of the /backend-api/codex/responses endpoint. It also routes neatly around the deployment gate OpenAI built for the model’s HIGH-rated capabilities — a tension the post sidesteps.
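The mechanism is simple to sketch. What follows is a hypothetical reconstruction of the token-reuse pattern, not the plugin’s actual code: the auth-file location, JSON field names, and request schema are all assumptions; only the /backend-api/codex/responses endpoint path comes from the post.

```python
import json
from pathlib import Path

# Hypothetical sketch of the token-reuse pattern behind the plugin.
# The auth-file location and JSON fields are assumptions, not the real
# Codex CLI layout; only the endpoint path is taken from the post.
CODEX_AUTH = Path.home() / ".codex" / "auth.json"
ENDPOINT = "https://chatgpt.com/backend-api/codex/responses"  # host assumed

def build_request(prompt: str, auth_path: Path = CODEX_AUTH) -> dict:
    """Assemble a request that reuses the Codex CLI's stored token."""
    token = json.loads(auth_path.read_text())["access_token"]  # assumed field
    return {
        "url": ENDPOINT,
        "headers": {"Authorization": f"Bearer {token}"},
        "json": {"model": "gpt-5.5", "input": prompt},  # assumed schema
    }
```

The point of the sketch is how little gatekeeping stands between a subscription token on disk and the gated model.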

Why the backdoor exists at all

That blessing only makes sense in the shadow of the OpenClaw fight. Anthropic cut off the OpenClaw harness from Claude subscriptions on defensible technical grounds: human users hit ~95% prompt-cache rates while agent harnesses bypass them, with single power users burning compute equivalent to hundreds of normal accounts [2]. The execution was uglier — OpenClaw creator Peter Steinberger’s own Claude account got auto-suspended for “suspicious activity” before being restored, with Anthropic offering $200 credits to affected developers [3]. OpenAI hired Steinberger and immediately positioned Codex endpoints as the open alternative. Willison calls it an “easy karma win”; it’s also a competitive land-grab timed to Anthropic’s Claude Code push.
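The cache argument is easy to quantify. A rough sketch, taking the 95% hit rate from Anthropic’s explanation and assuming cached input tokens bill at one-tenth of full price (a typical industry discount, and an assumption here):

```python
# Rough cost model for the prompt-cache argument. The 95% human cache-hit
# rate is from Anthropic's explanation; the 10x discount on cached input
# tokens is a typical industry figure and an assumption here.
CACHED_PRICE_FACTOR = 0.1   # cached input tokens at 10% of full price

def blended_input_cost(cache_hit_rate: float) -> float:
    """Average cost per input token, relative to an uncached token."""
    return cache_hit_rate * CACHED_PRICE_FACTOR + (1 - cache_hit_rate) * 1.0

human = blended_input_cost(0.95)    # chat user: ~0.145x full price
harness = blended_input_cost(0.0)   # cache-bypassing agent: full price
print(round(harness / human, 1))    # ~6.9x more expensive per input token
```

Under those assumptions a cache-bypassing harness costs roughly 7x more per input token than a normal chat user, before even counting loop volume — which is why “hundreds of standard users” per power user is plausible.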

flowchart LR
    U[Developer] --> S[ChatGPT subscription]
    S --> C[Codex CLI / app server<br/>open source]
    C --> E["/backend-api/codex/responses"]
    P[llm-openai-via-codex] -. reuses tokens .-> E
    E --> M[GPT-5.5]
    A[Public API<br/>delayed: HIGH bio/cyber] -.x.-> M

The pricing and capability questions Willison soft-pedals

Once the API does land, GPT-5.5 will cost twice as much as GPT-5.4. OpenAI’s defense is that the new model is “roughly 40% more token-efficient,” but llm-stats notes that figure is self-reported and that “for many high-volume, low-complexity tasks GPT-5.4 remains the more cost-effective default” [4].

| Model | Input ($/1M) | Output ($/1M) |
| --- | --- | --- |
| GPT-5.4 | $2.50 | $15 |
| GPT-5.5 | $5.00 | $30 |
| GPT-5.5 Pro | $30.00 | $180 |

Capability dissent is sharper than the pelican-bench vibe suggests. Ethan Mollick’s review — which Willison cites approvingly — found GPT-5.5 Pro finished a procedural 3D harbor-town sim in 20 minutes vs 33 for GPT-5.4, but also flagged “persistent flatness” in fiction and “uncanny” metaphors [5]. Independent testing reported by Tom’s Guide had GPT-5.5 losing to Claude Opus 4.7 in seven categories, even as OpenAI’s Terminal-Bench 2.0 score climbed from 75.1% to 82.7% [6].

The cluster’s real story is three stories braided together: a contested 2× price hike, a jagged-frontier model that doesn’t dominate Opus 4.7, and a sanctioned backdoor that exists precisely because Anthropic’s economically rational ban created room for one.


DeepSeek V4: the pricing shock is real, the “frontier” claim isn’t

Source: simon-willison · published 2026-04-24

TL;DR

  • DeepSeek V4-Pro (1.6T params, 49B active) and V4-Flash (284B/13B) ship under MIT with 1M-token context and aggressive pricing.
  • V4-Flash at $0.14/$0.28 per million tokens undercuts GPT-5.4 Nano; V4-Pro at $1.74/$3.48 is the cheapest frontier-class model.
  • Independent tests show a 94% hallucination rate on unknown queries and 1M-token recall collapsing to 0.59.
  • V4 was co-designed on Huawei’s CANN stack, making the cost story inseparable from the export-control fight.

A genuine pricing reset

DeepSeek’s V4 preview drops the floor on what frontier-adjacent inference costs. V4-Flash is priced below GPT-5.4 Nano on input and output, and V4-Pro is roughly one-sixth the cost of Claude Opus 4.7 or GPT-5.5 [7]. The mechanism is architectural, not a subsidy: DeepSeek’s paper reports V4-Pro using 27% of V3.2’s per-token FLOPs and 10% of its KV cache at 1M context, with V4-Flash pushing that to 10% and 7% respectively. Compressed Sparse Attention is doing real work here.

| Model | Input ($/M) | Output ($/M) |
| --- | --- | --- |
| DeepSeek V4 Flash | $0.14 | $0.28 |
| GPT-5.4 Nano | $0.20 | $1.25 |
| DeepSeek V4 Pro | $1.74 | $3.48 |
| Gemini 3.1 Pro | $2.00 | $12.00 |
| GPT-5.4 | $2.50 | $15.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |

V4-Pro is also now the largest open-weights model in circulation at 1.6T total parameters, eclipsing Kimi K2.6 and GLM-5.1.
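The parameter counts make the sparsity half of the cost story concrete: MoE routing means only a small fraction of weights touch any given token, so per-token compute scales with the active count while hosting memory scales with the total.

```python
# Active-parameter fractions from the figures quoted above (total, active).
models = {"V4-Pro": (1.6e12, 49e9), "V4-Flash": (284e9, 13e9)}

for name, (total, active) in models.items():
    # V4-Pro ~3.1% active, V4-Flash ~4.6% active per token
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

A 1.6T-parameter model that only computes with ~49B parameters per token is how “largest open-weights model in circulation” and “cheapest frontier-class pricing” coexist in one release.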

Where the frontier claim wobbles

DeepSeek’s own paper concedes V4-Pro trails GPT-5.4 and Gemini-3.1-Pro by “approximately 3 to 6 months.” Independent evaluation suggests that’s generous on some axes. Artificial Analysis measured an 11-point AA-Omniscience improvement over V3 — but also a 94% hallucination rate on queries where the model lacks the answer [8].

V4 will answer almost anything you ask it. That is not the same as knowing the answer.

The 1M-token context tells a similar story. ByteWaveNetwork’s needle-in-haystack tests show 8-needle accuracy holding at 0.82 through 256K tokens, then dropping to 0.59 at the 1M ceiling, with failures scattered randomly through the window rather than clustering at the edges [9]. Cheap million-token context is real; reliable million-token context is not. On long-horizon agentic work, Kimi K2.6 also outperforms V4 — sustaining 4,000+ tool calls over 13-hour sessions where V4 destabilizes — though V4-Pro still leads on raw competitive coding (Codeforces 3206, 80.6% on SWE-bench Verified) [10].
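The test shape is simple enough to sanity-check yourself. A minimal sketch of a multi-needle recall check; filler text, needle placement, and scoring here are illustrative, not ByteWaveNetwork’s actual harness:

```python
import random

# Minimal sketch of a multi-needle recall test. Filler, placement, and
# scoring are illustrative; this is not the cited benchmark's setup.
def make_haystack(n_words: int, needles: list[str], seed: int = 0) -> str:
    """Bury each needle at a distinct random position in filler text."""
    rng = random.Random(seed)
    words = ["filler"] * n_words
    positions = rng.sample(range(n_words), len(needles))  # no collisions
    for pos, (i, needle) in zip(positions, enumerate(needles)):
        words[pos] = f"fact-{i}:{needle}"
    return " ".join(words)

def recall(model_answer: str, needles: list[str]) -> float:
    """Fraction of buried needles the model's answer reproduced."""
    return sum(n in model_answer for n in needles) / len(needles)
```

Run at increasing haystack sizes, the recall curve is exactly the 0.82-to-0.59 shape the cited test reports — and scattered (rather than edge-clustered) misses are what make a verification layer hard to build on top.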

The Huawei subtext

The story Willison’s post understates is hardware. V4 was natively built against Huawei’s CANN stack rather than ported from CUDA, reportedly hitting >85% utilization on Ascend 950PR silicon [11]. That’s the most concrete signal yet that U.S. export controls are catalyzing a parallel Chinese training stack rather than throttling one.

Politically, the release lands during active escalation. The Council on Foreign Relations notes Anthropic told U.S. lawmakers it identified roughly 24,000 fraudulent accounts generating 16M+ Claude exchanges allegedly used to distill reasoning traces into DeepSeek training data [12]. The “Deterring American AI Model Theft Act” and a potential Entity List designation are now in play.

What to take away

Treat V4 as three claims, not one. The pricing reset is genuine and will pressure margins at every Western lab serving the same workloads. The “frontier” framing is marketing — V4 is competitive, not equivalent, and the hallucination and long-context regressions matter for any production deployment. And the cost number is inseparable from a Huawei-scale industrial bet that the next round of export controls will have to reckon with.


Anthropic’s Claude Code postmortem is really a harness-governance problem

Source: simon-willison · published 2026-04-24

TL;DR

  • Anthropic confirmed three harness bugs — not model changes — degraded Claude Code for two months, validating weeks of user complaints.
  • Independent measurements were brutal: Veracode saw vulnerabilities in 52% of tasks, TrustedSec recorded a 47% quality drop over five weeks.
  • Internal evals flagged a 3% coding-intelligence drop from one change before it shipped — and it shipped anyway.
  • This is Anthropic’s second silent-regression postmortem in eight months. The harness, not the weights, is now the dominant reliability variable.

The bugs Anthropic owned up to

The headline finding from the April 23 postmortem is that Claude’s weights were innocent. Three independent issues in the Claude Code harness — the orchestration layer that pipes context, tool calls, and “thinking” tokens into the model — produced the symptoms users had been reporting since February.

The most damaging was a March 26 optimization meant to drop stale “thinking” tokens once when a user resumed an idle session. A logic bug ran the deletion every turn for the rest of the session, giving Claude the appearance of progressive amnesia. Anyone who, like Simon Willison, keeps a dozen long-lived Claude Code sessions open across days got hit hardest, because the bug specifically targeted sessions idle for over an hour.
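The bug class is worth seeing in miniature. Below is a hypothetical reconstruction — not Anthropic’s actual harness code — of how a one-shot cleanup becomes per-turn amnesia when a flag is never reset:

```python
# Hypothetical reconstruction of the bug class described above; this is
# not Anthropic's actual code. Stale "thinking" entries should be pruned
# once when a session resumes from idle, but the flag is never cleared,
# so pruning re-runs on every turn for the rest of the session.
IDLE_THRESHOLD_S = 3600  # "idle for over an hour"

class Session:
    def __init__(self):
        self.history: list[dict] = []   # mixed visible + "thinking" entries
        self.resumed_from_idle = False

    def on_turn(self, idle_seconds: float) -> None:
        if idle_seconds > IDLE_THRESHOLD_S:
            self.resumed_from_idle = True
        if self.resumed_from_idle:
            # BUG: this branch fires on every later turn; the fix is to
            # clear the flag (self.resumed_from_idle = False) after the
            # one intended pruning pass.
            self.history = [e for e in self.history if e["kind"] != "thinking"]
```

After one long idle gap, every fresh “thinking” entry is deleted on the very next turn — exactly the progressive-amnesia symptom long-lived sessions reported.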

flowchart LR
    U[User prompt] --> H{Claude Code harness}
    T[Tool outputs] --> H
    M[Thinking tokens<br/>history] -. "bug: re-deleted<br/>every turn" .-> H
    H -->|"April 16:<br/>25-word cap"| L[LLM weights<br/>unchanged]
    L --> H
    H --> R[Response]

Why “the model didn’t change” stopped being reassuring

Anthropic admitted Claude Code “did get worse” but explicitly denied nerfing the model, and reset usage limits for all subscribers as compensation [13]. The denial lands awkwardly against what outside parties measured. Veracode found Claude Opus 4.7 emitted vulnerabilities in 52% of tested coding tasks; TrustedSec recorded a 47% quality drop over five weeks and paused using the tool for defensive work [14].

The detail Anthropic buried, surfaced by VentureBeat: an April 16 system-prompt change capping intermediate text at 25 words between tool calls had already shown a 3% drop in coding intelligence in internal evals — and was deployed anyway [15]. That reframes the incident. The signal wasn’t missing; the judgment to ship over it was the failure.

“The harness, not the weights, is now the dominant variable in perceived intelligence.”

And it isn’t a one-off. The Decoder points out this is Anthropic’s second postmortem in eight months for inference-layer bugs causing user-visible quality regressions; an August 2025 incident hit Sonnet 4 and Haiku 3.5 the same way [16]. Two postmortems with the same shape look less like compounding bad luck and more like a recurring release-discipline failure.

The harness is the product now

The deeper point — which the postmortem itself does not commit to — is that pinning a model version no longer pins behavior. Sausheong Chang’s TerminalBench 2.0 work shows the same Opus model scoring materially lower in the default Claude Code harness than in optimized third-party environments; the orchestration layer is leaving capability on the floor [17]. Sattyam Jain’s “policy-freeze” proposal pushes the implication: production callers need a TLS-pinning-style guarantee covering not just weights but reasoning effort, system prompts, and cache behavior — every server-side knob that a vendor can silently turn [18].
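The proposal is easy to sketch: hash every server-side knob into a single fingerprint and fail closed when it drifts. The field names below are illustrative, not Jain’s actual schema:

```python
import hashlib
import json

# Sketch of a "policy-freeze" check: pin a hash over every server-side
# knob, not just the model version. Field names are illustrative.
def policy_fingerprint(policy: dict) -> str:
    """Order-independent fingerprint of a policy dict."""
    canonical = json.dumps(policy, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def assert_policy_unchanged(current: dict, pinned: str) -> None:
    """Fail closed if any pinned knob silently changed server-side."""
    if policy_fingerprint(current) != pinned:
        raise RuntimeError("policy drift: server knobs no longer match pin")

pinned = policy_fingerprint({
    "model": "claude-opus-4.7",
    "system_prompt_sha256": "<hash of the exact system prompt>",
    "reasoning_effort": "high",
    "cache_behavior": "default",
})
```

The catch, of course, is that only the vendor can attest to the current values — which is why this is an infrastructure ask, not something a client can bolt on alone.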

For anyone building on hosted models, the takeaway is uncomfortable. “We didn’t change the model” is technically true and operationally meaningless. Until vendors offer frozen-policy SKUs, the only defense is owning your own harness — or running the evals Anthropic’s own gating process apparently waved through.

Round-ups

Serving the For You feed

Source: simon-willison

Bluesky’s spacecowboy runs the 72,000-user For You feed from a single Go process on a 16-core, 96GB-RAM gaming PC in his living room, fronted by a $7/month OVH VPS over Tailscale. Total cost is $30/month, with 419GB of SQLite holding 90 days of like data.

russellromney/honker

Source: simon-willison

Honker is a Rust SQLite extension that brings Postgres-style NOTIFY/LISTEN, durable Kafka-like streams, and the transactional outbox pattern to SQLite. It adds 20+ custom SQL functions, requires WAL mode, and lets workers poll the .db-wal file every 1ms for near-real-time delivery.

Extract PDF text in your browser with LiteParse for the web

Source: simon-willison

Simon Willison ported LlamaIndex’s LiteParse PDF text extractor to run entirely in the browser via PDF.js and Tesseract.js, with no data leaving the machine. He vibe-coded it with Claude Code and Opus 4.7 in a 59-minute build session, deploying via GitHub Pages.

ChatGPT’s Nano Banana

Source: bens-bites

Ben’s Bites benchmarks ChatGPT’s new Nano Banana image model against popular design tools, putting the OpenAI release head-to-head with established creative software on real design tasks.

It’s a big one

Source: simon-willison

Simon Willison’s weekly newsletter rounds up coverage of GPT-5.5, ChatGPT Images 2.0, and Qwen3.6-27B, alongside 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.

Here’s how our TPUs power increasingly demanding AI workloads.

Source: google-ai-blog

Google publishes an explainer video on its Tensor Processing Units, walking through how the custom silicon handles training and inference for increasingly demanding AI workloads on Google Cloud.

Millisecond Converter

Source: simon-willison

Simon Willison ships a small browser tool that converts milliseconds to seconds and minutes, built to scratch his own itch reading prompt-duration outputs from his LLM command-line utility.

Footnotes

  1. The New Stack on GPT-5.5 security · https://thenewstack.io/openai-chatgpt-gpt-5-5-security/

    OpenAI’s Preparedness Framework rated GPT-5.5’s cyber and biological capabilities as ‘HIGH,’ triggering a targeted Bio Bug Bounty program — the stated reason API release was delayed pending ‘different safeguards.’

  2. Mission Cloud, ‘Why Anthropic was right to ban OpenClaw’ · https://www.missioncloud.com/blog/why-anthropic-was-right-to-ban-openclaw-a-cautionary-tale-for-ai-builders

    Boris Cherny explained consumer subscriptions were never designed for agentic reasoning loops; while human users hit ~95% prompt cache rates, harnesses bypass these optimizations, with single power users consuming compute equivalent to hundreds of standard users.

  3. Business Insider, ‘Anthropic cuts off OpenClaw support’ · https://www.businessinsider.com/anthropic-cuts-off-openclaw-support-claude-subscriptions-2026-4

    OpenClaw creator Peter Steinberger reported his personal Claude account was suspended for ‘suspicious activity’ — reversed within hours but ‘severely damaged developer trust’; Anthropic offered a $200 credit to affected users.

  4. llm-stats.com, GPT-5.5 vs GPT-5.4 analysis · https://llm-stats.com/blog/research/gpt-5-5-vs-gpt-5-4

    OpenAI defends the doubled pricing by claiming GPT-5.5 is roughly 40% more token-efficient, but for many high-volume, low-complexity tasks GPT-5.4 remains the more cost-effective default.

  5. Ethan Mollick, ‘Sign of the future: GPT-5.5’ (One Useful Thing) · https://www.oneusefulthing.org/p/sign-of-the-future-gpt-55

    GPT-5.5 Pro took 20 minutes to complete a procedurally generated 3D harbor town simulation that took GPT-5.4 33 minutes — but the jagged frontier remains, with persistent flatness in long-form fiction and ‘uncanny’ metaphors.

  6. MindStudio GPT-5.5 review · https://www.mindstudio.ai/blog/gpt-5-5-review-agentic-model

    Independent testing by Tom’s Guide saw GPT-5.5 lose to Anthropic’s Claude Opus 4.7 in seven separate categories, despite OpenAI’s reported 82.7% on Terminal-Bench 2.0 vs GPT-5.4’s 75.1%.

  7. VentureBeat · https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5

    DeepSeek V4 arrives with near-state-of-the-art intelligence at 1/6th the cost of Opus 4.7 / GPT-5.5.

  8. Artificial Analysis · https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash

    While the model shows an 11-point improvement over V3 in its AA-Omniscience score, it maintains a strikingly high hallucination rate of 94% on queries where it lacks the answer.

  9. Medium / ByteWaveNetwork long-context test · https://medium.com/@ByteWaveNetwork/i-tested-deepseek-v4s-long-context-claims-so-you-don-t-have-to-7bc0848bae18

    While DeepSeek reports a stable 0.82 accuracy on 8-needle tests up to 256K tokens, performance drops to 0.59 at the 1M-token limit… V4 displays random misses throughout the context window, making it harder for developers to build reliable verification layers.

  10. Atlas Cloud comparison (Kimi K2.6 / GLM-5.1 / Qwen 3.6 / DeepSeek V4) · https://www.atlascloud.ai/blog/guides/kimi-k2-6-vs-glm-5-1-vs-qwen-3-6-plus-vs-minimax-m2-7-coding-2026

    Kimi K2.6 is praised for sustaining 4,000+ tool calls over 13-hour sessions… DeepSeek V4 Pro remains the leader in raw competitive coding, posting a Codeforces rating of 3206 and an 80.6% on SWE-bench Verified.

  11. Progressive Robot · https://www.progressiverobot.com/2026/04/25/deepseek-huawei-chips-model/

    DeepSeek V4 was natively built using Huawei’s CANN rather than being ported from CUDA… allowed the model to achieve hardware utilization rates exceeding 85% on Huawei silicon.

  12. Council on Foreign Relations · https://www.cfr.org/articles/deepseek-v4-signals-a-new-phase-in-the-u-s-china-ai-rivalry

    Anthropic specifically reported identifying roughly 24,000 fraudulent accounts used to generate over 16 million exchanges with its Claude models… allegedly targeted complex reasoning pathways and chain-of-thought data to subsidize DeepSeek’s training.

  13. Business Insider · https://www.businessinsider.com/anthropic-admits-claude-code-issues-user-complaints-denies-nerfing-degrading-2026-4

    Anthropic says Claude Code did get worse but shoots down speculation it ‘nerfed’ the model — the company reset usage limits for all subscribers as compensation while denying any intentional degradation.

  14. Forbes (The Wiretap) · https://www.forbes.com/sites/the-wiretap/2026/04/22/anthropics-claude-is-pumping-out-vulnerable-code-cyber-experts-warn/

    Veracode found Claude Opus 4.7 included vulnerabilities in 52% of tested tasks, and TrustedSec reported a 47% drop in code quality over a five-week span, leading them to pause use of the tool for defensive testing.

  15. VentureBeat · https://venturebeat.com/technology/mystery-solved-anthropic-reveals-changes-to-claudes-harnesses-and-operating-instructions-likely-caused-degradation

    An April 16 system prompt change capping intermediate text at 25 words showed a 3% drop in coding intelligence in internal evaluations — but was deployed anyway.

  16. The Decoder · https://the-decoder.com/anthropic-confirms-technical-bugs-after-weeks-of-complaints-about-declining-claude-code-quality/

    This is the second time in eight months Anthropic has issued a postmortem for Claude quality regressions; an August 2025 incident similarly affected Sonnet 4 and Haiku 3.5 through inference-layer bugs.

  17. Sausheong Chang, ‘Own Your Harness’ · https://sausheong.com/own-your-harness-2f5299a855a7

    TerminalBench 2.0 showed the same Claude Opus model scored significantly lower in the default Claude Code harness than in optimized third-party environments — poor harness design leaves capability on the floor.

  18. Sattyam Jain, ‘Policy-Freeze’ (Medium) · https://medium.com/@sattyamjain96/policy-freeze-is-the-missing-primitive-in-agent-infrastructure-and-anthropic-just-made-the-case-331be64fa00e

    The industry needs a ‘policy-freeze’ primitive analogous to TLS certificate pinning — pinning a model version is insufficient if the provider can still alter reasoning effort or system prompts that govern that model’s output.

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what the industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare