The model isn't the variable anymore — access, harness, and silicon are
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
A pelican for GPT-5.5 via the semi-official Codex backdoor API simonwillison.net
GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I’ve had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it’s hard to put into words what’s good about it - I ask it to build things and it builds exactly what I ask for! There’s one notable omission from today’s release - the API: API deployments require different safeguards and we are working closely with partners and customers o…
GPT-5.5 prompting guide simonwillison.net
GPT-5.5 prompting guide Now that GPT-5.5 is available in the API, OpenAI have released a wealth of useful tips on how best to prompt the new model. Here’s a neat trick they recommend for applications that might spend considerable time thinking before returning a user-visible response: Before any tool calls for a multi-step task, send a short user-visible update that acknowledges the request and states the first step. Keep it to one or two sentences. I’ve already noticed their Codex app doing t…
llm 0.31 simonwillison.net
Release: llm 0.31 New GPT-5.5 OpenAI model: llm -m gpt-5.5. #1418 New option to set the text verbosity level for GPT-5+ OpenAI models: -o verbosity low. Values are low, medium, high. New option for setting the image detail level used for image attachments to OpenAI models: -o image_detail low - values are low, high and auto, and GPT-5.4 and 5.5 also accept original. Models listed in extra-openai-models.yaml are now also registered as asynchronous. #1395 Tags: gpt, openai, llm
llm-openai-via-codex 0.1a0 simonwillison.net
Release: llm-openai-via-codex 0.1a0 Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5. Tags: openai, llm, codex-cli
DeepSeek V4 - almost on the frontier, a fraction of the price simonwillison.net
Chinese AI lab DeepSeek’s last model release was V3.2 (and V3.2 Speciale) last December. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both are Mixture of Experts models with 1 million token context windows. Pro is 1.6T total parameters, 49B active; Flash is 284B total, 13B active. They’re using the standard MIT license. I think this makes DeepSeek-V4-Pro the new largest open weights model. It’s larger than Kimi…
An update on recent Claude Code quality reports simonwillison.net
An update on recent Claude Code quality reports It turns out the high volume of complaints that Claude Code was providing worse quality results over the past two months was grounded in real problems. The models themselves were not to blame, but three separate issues in the Claude Code harness caused complex but material problems which directly affected users. Anthropic’s postmortem describes these in detail. This one in particular stood out to me: On March 26, we shipped a change to clear Claud…
Serving the For You feed simonwillison.net
Bluesky’s spacecowboy runs the 72,000-user For You feed from a single Go process on a 16-core, 96GB-RAM gaming PC in his living room, fronted by a $7/month OVH VPS over Tailscale. Total cost is $30/month, with 419GB of SQLite holding 90 days of like data.
russellromney/honker simonwillison.net
Honker is a Rust SQLite extension that brings Postgres-style NOTIFY/LISTEN, durable Kafka-like streams, and the transactional outbox pattern to SQLite. It adds 20+ custom SQL functions, requires WAL mode, and lets workers poll the .db-wal file every 1ms for near-real-time delivery.
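The transactional outbox pattern Honker ships can be sketched in plain Python with sqlite3, no extension required. Everything below (table names, functions, the print stand-in for a broker publish) is illustrative, not Honker's actual API; the point is that the business write and the event write commit in one transaction, and a worker later drains unsent events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # Honker requires WAL mode; an in-memory DB ignores this
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def place_order(item):
    # Business row and outbox event commit atomically: the event can never
    # be lost (or exist without its order) because both share one transaction.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.created", f'{{"order_id": {cur.lastrowid}}}'),
        )

def drain_outbox():
    # A worker polls for unsent events (Honker watches the .db-wal file every
    # 1ms instead of polling a table), delivers them, then marks them sent.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for event_id, topic, payload in rows:
        print(f"deliver {topic}: {payload}")  # stand-in for a real broker publish
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (event_id,))
    conn.commit()
    return rows

place_order("widget")
drain_outbox()
```

Honker's WAL-polling approach trades a little CPU for near-real-time delivery without any IPC; the table-polling sketch above gets the same delivery guarantee, just with higher latency.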
Extract PDF text in your browser with LiteParse for the web simonwillison.net
Simon Willison ported LlamaIndex’s LiteParse PDF text extractor to run entirely in the browser via PDF.js and Tesseract.js, with no data leaving the machine. He vibe-coded it with Claude Code and Opus 4.7 in a 59-minute build session, deploying via GitHub Pages.
Here’s how our TPUs power increasingly demanding AI workloads. blog.google
Google publishes an explainer video on its Tensor Processing Units, walking through how the custom silicon handles training and inference for increasingly demanding AI workloads on Google Cloud.
ChatGPT’s Nano Banana bensbites.com
Ben’s Bites benchmarks ChatGPT’s new Nano Banana image model against popular design tools, putting the OpenAI release head-to-head with established creative software on real design tasks.
Millisecond Converter simonwillison.net
Simon Willison ships a small browser tool that converts milliseconds to seconds and minutes, built to scratch his own itch reading prompt-duration outputs from his LLM command-line utility.
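The conversion the tool performs is plain integer arithmetic; a minimal sketch (function name mine, not the tool's):

```python
def ms_to_parts(ms: int) -> str:
    """Break a millisecond count into minutes, seconds, and leftover ms."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes}m {seconds}s {millis}ms"

print(ms_to_parts(754_000))  # → 12m 34s 0ms
```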
It’s a big one simonwillison.net
Simon Willison’s weekly newsletter rounds up coverage of GPT-5.5, ChatGPT Images 2.0, and Qwen3.6-27B, alongside 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.
References
Ethan Mollick, ‘Sign of the future: GPT-5.5’ (One Useful Thing) oneusefulthing.org
GPT-5.5 Pro took 20 minutes to complete a procedurally generated 3D harbor town simulation that took GPT-5.4 33 minutes — but the jagged frontier remains, with persistent flatness in long-form fiction and ‘uncanny’ metaphors.
Business Insider, ‘Anthropic cuts off OpenClaw support’ businessinsider.com
OpenClaw creator Peter Steinberger reported his personal Claude account was suspended for ‘suspicious activity’ — reversed within hours but ‘severely damaged developer trust’; Anthropic offered a $200 credit to affected users.
llm-stats.com, GPT-5.5 vs GPT-5.4 analysis llm-stats.com
OpenAI defends the doubled pricing by claiming GPT-5.5 is roughly 40% more token-efficient, but for many high-volume, low-complexity tasks GPT-5.4 remains the more cost-effective default.
MindStudio GPT-5.5 review mindstudio.ai
Independent testing by Tom’s Guide saw GPT-5.5 lose to Anthropic’s Claude Opus 4.7 in seven separate categories, despite OpenAI’s reported 82.7% on Terminal-Bench 2.0 vs GPT-5.4’s 75.1%.
The New Stack on GPT-5.5 security thenewstack.io
OpenAI’s Preparedness Framework rated GPT-5.5’s cyber and biological capabilities as ‘HIGH,’ triggering a targeted Bio Bug Bounty program — the stated reason API release was delayed pending ‘different safeguards.’
Mission Cloud, ‘Why Anthropic was right to ban OpenClaw’ missioncloud.com
Boris Cherny explained consumer subscriptions were never designed for agentic reasoning loops; while human users hit ~95% prompt cache rates, harnesses bypass these optimizations, with single power users consuming compute equivalent to hundreds of standard users.
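The cache economics behind that claim can be made concrete with a back-of-envelope sketch. The 95% cache rate is from the article; the 10x cached-token discount is an illustrative assumption (a common provider pricing ratio), not Anthropic's actual number:

```python
CACHED_COST_RATIO = 0.10  # assumed: cached input tokens cost 10% of uncached

def effective_cost(tokens: float, cache_hit_rate: float) -> float:
    # Blend cached and uncached token costs for a given hit rate.
    cached = tokens * cache_hit_rate
    uncached = tokens * (1 - cache_hit_rate)
    return cached * CACHED_COST_RATIO + uncached

human = effective_cost(1.0, 0.95)   # ~95% cache hits, per the article
harness = effective_cost(1.0, 0.0)  # agent loop that bypasses the cache
print(harness / human)  # ≈ 6.9x cost at identical token volume
```

And that is before volume: an agentic loop also sends orders of magnitude more tokens than a human chatting, which is how one power user reaches the compute of hundreds of standard users.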
Artificial Analysis artificialanalysis.ai
While the model shows an 11-point improvement over V3 in its AA-Omniscience score, it maintains a strikingly high hallucination rate of 94% on queries where it lacks the answer.
VentureBeat venturebeat.com
DeepSeek V4 arrives with near-state-of-the-art intelligence at 1/6th the cost of Opus 4.7 / GPT-5.5
Council on Foreign Relations cfr.org
Anthropic specifically reported identifying roughly 24,000 fraudulent accounts used to generate over 16 million exchanges with its Claude models… allegedly targeted complex reasoning pathways and chain-of-thought data to subsidize DeepSeek’s training.
Progressive Robot progressiverobot.com
DeepSeek V4 was natively built using Huawei’s CANN rather than being ported from CUDA… allowed the model to achieve hardware utilization rates exceeding 85% on Huawei silicon.
Medium / ByteWaveNetwork long-context test medium.com
While DeepSeek reports a stable 0.82 accuracy on 8-needle tests up to 256K tokens, performance drops to 0.59 at the 1M-token limit… V4 displays random misses throughout the context window, making it harder for developers to build reliable verification layers.
Atlas Cloud comparison (Kimi K2.6 / GLM-5.1 / Qwen 3.6 / DeepSeek V4) atlascloud.ai
Kimi K2.6 is praised for sustaining 4,000+ tool calls over 13-hour sessions… DeepSeek V4 Pro remains the leader in raw competitive coding, posting a Codeforces rating of 3206 and an 80.6% on SWE-bench Verified.
Business Insider businessinsider.com
Anthropic says Claude Code did get worse but shoots down speculation it ‘nerfed’ the model — the company reset usage limits for all subscribers as compensation while denying any intentional degradation.
Forbes (The Wiretap) forbes.com
Veracode found Claude Opus 4.7 included vulnerabilities in 52% of tested tasks, and TrustedSec reported a 47% drop in code quality over a five-week span, leading them to pause use of the tool for defensive testing.
VentureBeat venturebeat.com
An April 16 system prompt change capping intermediate text at 25 words showed a 3% drop in coding intelligence in internal evaluations — but was deployed anyway.
The Decoder the-decoder.com
This is the second time in eight months Anthropic has issued a postmortem for Claude quality regressions; an August 2025 incident similarly affected Sonnet 4 and Haiku 3.5 through inference-layer bugs.
Medium — Sattyam Jain (‘Policy-Freeze’) medium.com
The industry needs a ‘policy-freeze’ primitive analogous to TLS certificate pinning — pinning a model version is insufficient if the provider can still alter reasoning effort or system prompts that govern that model’s output.
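A toy illustration of the idea, assuming the client could fetch every field that governs output (model id, system prompt, reasoning effort) — all names here are hypothetical, since no such API exists today:

```python
import hashlib
import json

def policy_fingerprint(policy: dict) -> str:
    # Hash everything that governs output, not just the model version string;
    # canonical JSON makes the fingerprint stable across key ordering.
    canonical = json.dumps(policy, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# "Pin" the full policy at integration time, like pinning a TLS certificate.
pinned = policy_fingerprint({
    "model": "claude-opus-4.7",
    "system_prompt": "You are a coding assistant.",
    "reasoning_effort": "high",
})

def check_policy(current: dict) -> bool:
    # Refuse to run if the provider silently changed any governed field.
    return policy_fingerprint(current) == pinned
```

Pinning only `"model"` would pass even after a silent system-prompt or reasoning-effort change — which is exactly the gap the article describes.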
Sausheong Chang — ‘Own Your Harness’ sausheong.com
TerminalBench 2.0 showed the same Claude Opus model scored significantly lower in the default Claude Code harness than in optimized third-party environments — poor harness design leaves capability on the floor.