The model isn't the variable anymore — access, harness, and silicon are
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
A pelican for GPT-5.5 via the semi-official Codex backdoor API simonwillison.net
GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I’ve had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it’s hard to put into words what’s good about it - I ask it to build things and it builds exactly what I ask for! There’s one notable omission from today’s release - the API: API deployments require different safeguards and we are working closely with partners and customers o…
GPT-5.5 prompting guide simonwillison.net
GPT-5.5 prompting guide Now that GPT-5.5 is available in the API, OpenAI have released a wealth of useful tips on how best to prompt the new model. Here’s a neat trick they recommend for applications that might spend considerable time thinking before returning a user-visible response: Before any tool calls for a multi-step task, send a short user-visible update that acknowledges the request and states the first step. Keep it to one or two sentences. I’ve already noticed their Codex app doing t…
llm 0.31 simonwillison.net
Release: llm 0.31 New GPT-5.5 OpenAI model: llm -m gpt-5.5. #1418 New option to set the text verbosity level for GPT-5+ OpenAI models: -o verbosity low. Values are low, medium, high. New option for setting the image detail level used for image attachments to OpenAI models: -o image_detail low - values are low, high and auto, and GPT-5.4 and 5.5 also accept original. Models listed in extra-openai-models.yaml are now also registered as asynchronous. #1395 Tags: gpt, openai, llm
llm-openai-via-codex 0.1a0 simonwillison.net
Release: llm-openai-via-codex 0.1a0 Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5. Tags: openai, llm, codex-cli
DeepSeek V4 - almost on the frontier, a fraction of the price simonwillison.net
Chinese AI lab DeepSeek’s last model release was V3.2 (and V3.2 Speciale) last December. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both are Mixture of Experts models with 1 million token context windows. Pro is 1.6T total parameters, 49B active; Flash is 284B total, 13B active. They’re using the standard MIT license. I think this makes DeepSeek-V4-Pro the new largest open weights model. It’s larger than Kimi…
An update on recent Claude Code quality reports simonwillison.net
An update on recent Claude Code quality reports It turns out the high volume of complaints that Claude Code was providing worse quality results over the past two months was grounded in real problems. The models themselves were not to blame, but three separate issues in the Claude Code harness caused complex but material problems which directly affected users. Anthropic’s postmortem describes these in detail. This one in particular stood out to me: On March 26, we shipped a change to clear Claud…
Serving the For You feed simonwillison.net
Bluesky’s spacecowboy runs the 72,000-user For You feed from a single Go process on a 16-core, 96GB-RAM gaming PC in his living room, fronted by a $7/month OVH VPS over Tailscale. Total cost is $30/month, with 419GB of SQLite holding 90 days of like data.
russellromney/honker simonwillison.net
Honker is a Rust SQLite extension that brings Postgres-style NOTIFY/LISTEN, durable Kafka-like streams, and the transactional outbox pattern to SQLite. It adds 20+ custom SQL functions, requires WAL mode, and lets workers poll the .db-wal file every 1ms for near-real-time delivery.
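The transactional outbox pattern Honker ships can be sketched in plain Python with sqlite3, no extension required. Everything below (table names, functions, the print stand-in for a broker publish) is illustrative, not Honker's actual API; the point is that the business write and the event write commit in one transaction, and a worker later drains unsent events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # Honker requires WAL mode; an in-memory DB ignores this
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def place_order(item):
    # Business row and outbox event commit atomically: the event can never
    # be lost (or exist without its order) because both share one transaction.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.created", f'{{"order_id": {cur.lastrowid}}}'),
        )

def drain_outbox():
    # A worker polls for unsent events (Honker watches the .db-wal file every
    # 1ms instead of polling a table), delivers them, then marks them sent.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for event_id, topic, payload in rows:
        print(f"deliver {topic}: {payload}")  # stand-in for a real broker publish
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (event_id,))
    conn.commit()
    return rows

place_order("widget")
drain_outbox()
```

Honker's WAL-polling approach trades a little CPU for near-real-time delivery without any IPC; the table-polling sketch above gets the same delivery guarantee, just with higher latency.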
Extract PDF text in your browser with LiteParse for the web simonwillison.net
Simon Willison ported LlamaIndex’s LiteParse PDF text extractor to run entirely in the browser via PDF.js and Tesseract.js, with no data leaving the machine. He vibe-coded it with Claude Code and Opus 4.7 in a 59-minute build session, deploying via GitHub Pages.
Here’s how our TPUs power increasingly demanding AI workloads. blog.google
Google publishes an explainer video on its Tensor Processing Units, walking through how the custom silicon handles training and inference for increasingly demanding AI workloads on Google Cloud.
ChatGPT’s Nano Banana bensbites.com
Ben’s Bites benchmarks ChatGPT’s new Nano Banana image model against popular design tools, putting the OpenAI release head-to-head with established creative software on real design tasks.
Millisecond Converter simonwillison.net
Simon Willison ships a small browser tool that converts milliseconds to seconds and minutes, built to scratch his own itch reading prompt-duration outputs from his LLM command-line utility.
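The conversion the tool performs is plain integer arithmetic; a minimal sketch (function name mine, not the tool's):

```python
def ms_to_parts(ms: int) -> str:
    """Break a millisecond count into minutes, seconds, and leftover ms."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes}m {seconds}s {millis}ms"

print(ms_to_parts(754_000))  # → 12m 34s 0ms
```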
It’s a big one simonwillison.net
Simon Willison’s weekly newsletter rounds up coverage of GPT-5.5, ChatGPT Images 2.0, and Qwen3.6-27B, alongside 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.
References
Ethan Mollick, ‘Sign of the future: GPT-5.5’ (One Useful Thing) oneusefulthing.org
GPT-5.5 Pro took 20 minutes to complete a procedurally generated 3D harbor town simulation that took GPT-5.4 33 minutes — but the jagged frontier remains, with persistent flatness in long-form fiction and ‘uncanny’ metaphors.
Business Insider, ‘Anthropic cuts off OpenClaw support’ businessinsider.com
OpenClaw creator Peter Steinberger reported his personal Claude account was suspended for ‘suspicious activity’ — reversed within hours but ‘severely damaged developer trust’; Anthropic offered a $200 credit to affected users.
llm-stats.com, GPT-5.5 vs GPT-5.4 analysis llm-stats.com
OpenAI defends the doubled pricing by claiming GPT-5.5 is roughly 40% more token-efficient, but for many high-volume, low-complexity tasks GPT-5.4 remains the more cost-effective default.
MindStudio GPT-5.5 review mindstudio.ai
Independent testing by Tom’s Guide saw GPT-5.5 lose to Anthropic’s Claude Opus 4.7 in seven separate categories, despite OpenAI’s reported 82.7% on Terminal-Bench 2.0 vs GPT-5.4’s 75.1%.
The New Stack on GPT-5.5 security thenewstack.io
OpenAI’s Preparedness Framework rated GPT-5.5’s cyber and biological capabilities as ‘HIGH,’ triggering a targeted Bio Bug Bounty program — the stated reason API release was delayed pending ‘different safeguards.’
Mission Cloud, ‘Why Anthropic was right to ban OpenClaw’ missioncloud.com
Boris Cherny explained consumer subscriptions were never designed for agentic reasoning loops; while human users hit ~95% prompt cache rates, harnesses bypass these optimizations, with single power users consuming compute equivalent to hundreds of standard users.
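The cache economics behind that claim can be made concrete with a back-of-envelope sketch. The 95% cache rate is from the article; the 10x cached-token discount is an illustrative assumption (a common provider pricing ratio), not Anthropic's actual number:

```python
CACHED_COST_RATIO = 0.10  # assumed: cached input tokens cost 10% of uncached

def effective_cost(tokens: float, cache_hit_rate: float) -> float:
    # Blend cached and uncached token costs for a given hit rate.
    cached = tokens * cache_hit_rate
    uncached = tokens * (1 - cache_hit_rate)
    return cached * CACHED_COST_RATIO + uncached

human = effective_cost(1.0, 0.95)   # ~95% cache hits, per the article
harness = effective_cost(1.0, 0.0)  # agent loop that bypasses the cache
print(harness / human)  # ≈ 6.9x cost at identical token volume
```

And that is before volume: an agentic loop also sends orders of magnitude more tokens than a human chatting, which is how one power user reaches the compute of hundreds of standard users.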
Artificial Analysis artificialanalysis.ai
While the model shows an 11-point improvement over V3 in its AA-Omniscience score, it maintains a strikingly high hallucination rate of 94% on queries where it lacks the answer.
VentureBeat venturebeat.com
DeepSeek V4 arrives with near-state-of-the-art intelligence at 1/6th the cost of Opus 4.7 / GPT-5.5
Council on Foreign Relations cfr.org
Anthropic specifically reported identifying roughly 24,000 fraudulent accounts used to generate over 16 million exchanges with its Claude models… allegedly targeted complex reasoning pathways and chain-of-thought data to subsidize DeepSeek’s training.
Progressive Robot progressiverobot.com
DeepSeek V4 was natively built using Huawei’s CANN rather than being ported from CUDA… allowed the model to achieve hardware utilization rates exceeding 85% on Huawei silicon.
Medium / ByteWaveNetwork long-context test medium.com
While DeepSeek reports a stable 0.82 accuracy on 8-needle tests up to 256K tokens, performance drops to 0.59 at the 1M-token limit… V4 displays random misses throughout the context window, making it harder for developers to build reliable verification layers.
Atlas Cloud comparison (Kimi K2.6 / GLM-5.1 / Qwen 3.6 / DeepSeek V4) atlascloud.ai
Kimi K2.6 is praised for sustaining 4,000+ tool calls over 13-hour sessions… DeepSeek V4 Pro remains the leader in raw competitive coding, posting a Codeforces rating of 3206 and an 80.6% on SWE-bench Verified.
Business Insider businessinsider.com
Anthropic says Claude Code did get worse but shoots down speculation it ‘nerfed’ the model — the company reset usage limits for all subscribers as compensation while denying any intentional degradation.
Forbes (The Wiretap) forbes.com
Veracode found Claude Opus 4.7 included vulnerabilities in 52% of tested tasks, and TrustedSec reported a 47% drop in code quality over a five-week span, leading them to pause use of the tool for defensive testing.
VentureBeat venturebeat.com
An April 16 system prompt change capping intermediate text at 25 words showed a 3% drop in coding intelligence in internal evaluations — but was deployed anyway.
The Decoder the-decoder.com
This is the second time in eight months Anthropic has issued a postmortem for Claude quality regressions; an August 2025 incident similarly affected Sonnet 4 and Haiku 3.5 through inference-layer bugs.
Medium — Sattyam Jain (‘Policy-Freeze’) medium.com
The industry needs a ‘policy-freeze’ primitive analogous to TLS certificate pinning — pinning a model version is insufficient if the provider can still alter reasoning effort or system prompts that govern that model’s output.
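A toy illustration of the idea, assuming the client could fetch every field that governs output (model id, system prompt, reasoning effort) — all names here are hypothetical, since no such API exists today:

```python
import hashlib
import json

def policy_fingerprint(policy: dict) -> str:
    # Hash everything that governs output, not just the model version string;
    # canonical JSON makes the fingerprint stable across key ordering.
    canonical = json.dumps(policy, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# "Pin" the full policy at integration time, like pinning a TLS certificate.
pinned = policy_fingerprint({
    "model": "claude-opus-4.7",
    "system_prompt": "You are a coding assistant.",
    "reasoning_effort": "high",
})

def check_policy(current: dict) -> bool:
    # Refuse to run if the provider silently changed any governed field.
    return policy_fingerprint(current) == pinned
```

Pinning only `"model"` would pass even after a silent system-prompt or reasoning-effort change — which is exactly the gap the article describes.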
Sausheong Chang — ‘Own Your Harness’ sausheong.com
TerminalBench 2.0 showed the same Claude Opus model scored significantly lower in the default Claude Code harness than in optimized third-party environments — poor harness design leaves capability on the floor.