OpenAI's release cadence is outrunning its deployment story
OpenAI shipped three polished launches today, and independent testers found the seams in each before the marketing settled.
TL;DR
- Symphony promises 500% more PRs by running Codex off Linear tickets, but reviewers and a known prompt injection eat the gains.
- GPT-5.5 ships with a rebuild-your-prompts memo, a Codex migration skill, and OpenAI’s first High Preparedness rating for bio/cyber.
- Independent GPT-5.5 evals split: agentic wins, skill-injected regressions vs GPT-5.4, and rough hallucination numbers.
- The Privacy Filter Gradio demo looks clean; recall drops to 38% on EHR notes and 10% on web text without OpenAI’s scaffolding.
- ChatGPT Images 2.0 renders a horse-astronaut-pelican-bicycle stack and signs it WHY ARE YOU LIKE THIS.
Today is an OpenAI day across the tech track, and the pattern repeats three times: a glossy launch, a confident framing, and an independent tester quietly demonstrating that the production story is harder than the demo.
Symphony reframes Linear as an agent control plane and claims a 5× PR throughput jump — until you notice the reviewer queue, the token bill, and the prompt injection that already drains CI secrets from this exact pattern. GPT-5.5 lands with a memo telling developers to throw out their prompts, a Codex skill to automate the migration, and an evals split that’s hard to reconcile with the marketing. The Privacy Filter ships a slick Gradio reference app whose underlying model OpenAI itself calls a redaction aid — recall collapses the moment you take it out of OpenAI’s own contextual scaffolding.
The throughline isn’t that any single release is broken. It’s that OpenAI’s shipping cadence has decoupled from the work of making each thing safe to deploy, and that work is now happening in public, by other people.
GPT-5.5 lands with a “throw out your prompts” memo — and a split verdict
Source: simon-willison · published 2026-04-25
TL;DR
- OpenAI is telling developers to treat GPT-5.5 as a new model family and rebuild prompts from scratch, not port them.
- A Codex skill ($openai-docs migrate) automates the model-string swap and light prompt rewrites.
- Independent evals split hard: agentic gains are real, but skill-injected tasks regress vs GPT-5.4 and hallucination numbers are ugly.
- It’s OpenAI’s first “High” Preparedness classification for bio/cyber, with a $25K bug bounty gated behind Codex Desktop.
OpenAI’s pitch: rip out the prompt stack
The release-week guidance from OpenAI is unusually blunt: GPT-5.5 is “a new model family to tune for, not a drop-in replacement,” and developers should “begin migration with a fresh baseline” rather than carrying over instructions tuned for 5.2 or 5.4. The Decoder’s read is that legacy prompts which over-specify every step now create “noise” that constrains the model’s search space 1. To smooth the transition, OpenAI shipped a Codex skill — $openai-docs migrate this project to gpt-5.5 — that scans for legacy model strings and applies light rewrites to remove the now-counterproductive “think step by step” scaffolding.
The new levers are reasoning_effort, verbosity control, and tighter tool descriptions (typed JSON schemas, not prose). There’s also a recommended UX pattern: emit a one- or two-sentence preamble before any tool call, so long agentic runs don’t feel hung. Ethan Mollick frames the whole release as a “category change,” citing a demo where GPT-5.5 ingested a decade of raw research data and produced a cited paper end-to-end without step-by-step prompting 2.
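For concreteness, here is a minimal sketch of those levers against the current OpenAI Responses API shape. The gpt-5.5 model string, the specific effort and verbosity values, and the get_ticket tool are illustrative assumptions; only the parameter names follow OpenAI's existing GPT-5 guidance.

```python
# Sketch only: assumes GPT-5.5 exposes the same reasoning/verbosity controls and
# typed function tools as current GPT-5 models. Model string and tool are made up.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "get_ticket",                          # hypothetical tool for illustration
    "description": "Fetch a Linear ticket by ID.", # terse and typed, not a prose essay
    "parameters": {
        "type": "object",
        "properties": {"ticket_id": {"type": "string"}},
        "required": ["ticket_id"],
        "additionalProperties": False,
    },
}]

response = client.responses.create(
    model="gpt-5.5",                      # model string from the article, not a real ID today
    reasoning={"effort": "medium"},       # replaces hand-rolled 'think step by step' scaffolding
    text={"verbosity": "low"},            # keep tool-call preambles to a sentence or two
    tools=tools,
    input="Check ticket LIN-4821 and summarize its status before proposing a fix.",
)
print(response.output_text)
```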
The counter-evidence is louder than usual
The cleanest dissent comes from Tessl’s agentic evaluation. On raw baselines GPT-5.5 narrowly beats 5.4 (77.5 vs 75.9) — but inject domain-specific skills as structured markdown context and the ranking flips: 5.4 climbs to 92.7, 5.5 stalls at 87.4, and on node-best-practices the older model wins by 19.8 points (97.4 vs 77.6) 3. The implication directly contradicts OpenAI’s “fresh baseline” advice 1: teams who’ve invested in skill libraries may find 5.5 over-trusts its priors and ignores their opinionated guidance.
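To make "skill-injected" concrete, here is one plausible reading of that setup, not Tessl's actual harness: an opinionated SKILL.md prepended as a developer message so the model is supposed to follow it rather than its priors. The file path, prompts, and model string are hypothetical.

```python
# Rough illustration of skill injection (our reading, not Tessl's harness):
# team conventions are loaded from markdown and placed ahead of the user request.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
skill = Path("skills/node-best-practices/SKILL.md").read_text()  # hypothetical path

resp = client.responses.create(
    model="gpt-5.5",  # hypothetical model string from the article
    input=[
        {"role": "developer", "content": f"Follow these team conventions:\n\n{skill}"},
        {"role": "user", "content": "Add input validation to the /orders route."},
    ],
)
print(resp.output_text)
```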
Independent leaderboards aren’t kind either. One developer survey of LiveBench’s xHigh-effort tier put GPT-5.5 at #11 — behind 5.4, Claude 4.6, and Gemini 3.1 Pro — with an 86% hallucination rate on Artificial Analysis Omniscience 4. On SWE-bench Pro, Claude Opus 4.7 still leads on real GitHub issues, and the new pricing has practitioners “burning through $100 in an hour” 5.
| Eval | GPT-5.5 | Best competitor |
|---|---|---|
| Tessl agentic + skills 3 | 87.4 | GPT-5.4 — 92.7 |
| SWE-bench Pro 5 | 58.6% | Claude Opus 4.7 — 64.3% |
| LiveBench xHigh rank 4 | #11 | GPT-5.4, Claude 4.6, Gemini 3.1 Pro |
| API price (in/out per M) 5 | $5 / $30 | GPT-5.4 — half that |
Safety posture and the takeaway
GPT-5.5 is OpenAI’s first model classified “High” for both cybersecurity and biology under the Preparedness Framework, which is why the API rollout was staggered and the Bio Bug Bounty — up to $25,000 for a universal jailbreak — is restricted to Codex Desktop under NDA 6.
The broader story is that OpenAI is asking developers to do real migration work for a model whose independent benchmarks don’t yet replicate the leaderboard pitch.
If your stack is plain-prompt and agentic, the upgrade probably pays off. If you’ve spent six months building a skills library on 5.4, the honest answer this week is: stay put and wait for 5.5.1.
Further reading
- llm 0.31 — simon-willison
- Sign of the future: GPT-5.5 — one-useful-thing
Symphony turns issue trackers into agent control planes — and reviewers into the bottleneck
Source: openai-blog · published 2026-04-27
TL;DR
- OpenAI’s Symphony spec runs Codex agents as a background daemon driven by Linear tickets, claiming a 500% PR throughput jump.
- Analysts counter that a 5× PR surge just relocates the bottleneck to human reviewers, at an estimated ~$5,000/user/month in tokens.
- A documented “Comment and Control” prompt injection already exfiltrates CI secrets from this exact orchestration pattern.
- The most production-ready implementation isn’t OpenAI’s — it’s the Sortie fork, with SQLite, Go, and GitHub/Jira adapters.
From sessions to a daemon
Symphony reframes coding agents as a background process, not a chat. An orchestrator polls Linear, assigns each open ticket a Codex app-server subprocess in an isolated workspace, and walks it through a state machine (Unclaimed → Claimed → Running → RetryQueued → Released) until a PR is ready for review. Engineers write a WORKFLOW.md — YAML front matter plus a Markdown prompt body — and let the daemon handle rebases, conflict resolution, and flaky CI retries. Default concurrency is 10 agents per user, well past the 3–5 sessions humans can juggle before context-switching tanks productivity.
That’s the pitch. The reference implementation is in Elixir (for OTP concurrency), but the spec is language-agnostic and has already been ported to TypeScript, Go, Rust, Java, and Python. 15,000 GitHub stars in under a month suggest the orchestration pattern has resonated even where the artifact hasn’t.
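Since the spec's selling point is re-implementability, here is a minimal Python sketch of the ticket lifecycle as described above. The state names mirror the spec; the retry budget, backoff, and the spawn_agent/open_pr hooks are placeholders, because Symphony leaves those policies to implementers.

```python
# Minimal sketch of Symphony's ticket state machine; states follow the spec,
# retry policy and the agent/PR hooks are assumptions.
import time
from enum import Enum, auto

class TicketState(Enum):
    UNCLAIMED = auto()
    CLAIMED = auto()
    RUNNING = auto()
    RETRY_QUEUED = auto()
    RELEASED = auto()

MAX_RETRIES = 3  # assumption: the spec does not fix a retry budget

def drive(ticket, spawn_agent, open_pr):
    """Walk one ticket from Unclaimed to Released, retrying flaky runs."""
    state, retries, agent = TicketState.UNCLAIMED, 0, None
    while state is not TicketState.RELEASED:
        if state is TicketState.UNCLAIMED:
            state = TicketState.CLAIMED                # orchestrator claims the ticket
        elif state is TicketState.CLAIMED:
            agent = spawn_agent(ticket)                # Codex subprocess in an isolated workspace
            state = TicketState.RUNNING
        elif state is TicketState.RUNNING:
            succeeded = agent.wait()                   # run until PR-ready or CI failure
            state = TicketState.RELEASED if succeeded else TicketState.RETRY_QUEUED
        elif state is TicketState.RETRY_QUEUED:
            retries += 1
            if retries > MAX_RETRIES:
                return None                            # give up; hand the ticket back to a human
            time.sleep(2 ** retries)                   # back off, then re-claim
            state = TicketState.CLAIMED
    open_pr(ticket)                                    # PR is ready for human review
```

Sortie's SQLite backend is essentially this loop made durable: persisted state survives a crash, where a purely in-memory orchestrator would drop its claims.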
The 500% number, examined
OpenAI reports some internal teams landed 5× more PRs in three weeks. Forrester’s read, surfaced by InfoWorld, is that this is the wrong metric: “code generation scales effortlessly, the burden of validation and review does not,” and a fivefold PR surge “may actually decrease total team velocity by overwhelming human reviewers” 7. Community usage adds an economic counterweight — running the default 10 concurrent agents has been estimated at roughly $5,000 per user per month in token spend, with users routinely blowing through weekly Codex limits 8. Neither figure is independently audited, but together they reframe Symphony as an expensive throughput pump whose downstream cost lands on the humans doing review.
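To see where a number like that could come from, here is a rough back-of-envelope at the published $5/$30 per-million pricing. Every other input (turns per hour, tokens per turn, working days) is an assumption chosen to show the shape of the calculation, not a reconstruction of the LetsDataScience estimate.

```python
# Back-of-envelope token spend for 10 concurrent agents; all workload inputs are
# illustrative assumptions, priced at the reported $5 / $30 per million tokens.
PRICE_IN, PRICE_OUT = 5 / 1e6, 30 / 1e6   # dollars per token

agents = 10                               # Symphony's default concurrency per user
turns_per_hour = 4                        # 'continuation' turns per agent (assumption)
hours, days = 8, 21                       # working hours/day and days/month (assumption)
tokens_in, tokens_out = 100_000, 8_000    # context re-sent vs. generated per turn (assumption)

turns = agents * turns_per_hour * hours * days
monthly = turns * (tokens_in * PRICE_IN + tokens_out * PRICE_OUT)
print(f"{turns} turns -> about ${monthly:,.0f}/month")   # roughly $5,000 with these inputs
```

With these assumptions the input side dominates: each continuation turn re-sends a large workspace context, so concurrency multiplies context cost faster than it multiplies output.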
The spec itself is taking flak too. Hacker News commenters called SPEC.md “inscrutable agent slop” and pointed out that despite advertising a state machine, it “fails to actually describe the state transitions” 9 — awkward for a document whose entire premise is enabling clean re-implementations.
Security: the trifecta is the architecture
Help Net Security documented an April 2026 “Comment and Control” prompt-injection class that hijacked Codex, Claude Code, and Copilot agents via malicious PR titles, exfiltrating ANTHROPIC_API_KEY and GITHUB_TOKEN as public PR comments 10. Symphony’s design is that attack surface:
```mermaid
flowchart LR
    A[Linear ticket / PR title<br/>untrusted text] --> B{Codex agent<br/>in workspace}
    C[Repo + CI secrets<br/>ANTHROPIC_API_KEY, GITHUB_TOKEN] --> B
    B --> D[linear_graphql tool]
    B -. comment exfiltration .-> E((Public PR thread))
```
The linear_graphql indirection keeps the tracker token out of the agent’s env, but the spec explicitly punts sandboxing policy to implementers. The secret-exfiltration class isn’t solved — just relocated.
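One mitigation the spec leaves on the table is an egress check on agent-authored comments. The sketch below is not part of Symphony; it is a hypothetical filter that refuses to post anything that looks like a credential, which blunts the comment-exfiltration path without touching the injection itself.

```python
# Not part of the Symphony spec: a hypothetical egress filter that scans
# agent-authored PR comments for secret-shaped strings before they go public.
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),        # OpenAI/Anthropic-style API keys
    re.compile(r"gh[pousr]_[A-Za-z0-9]{30,}"),   # GitHub token prefixes (ghp_, gho_, ...)
    re.compile(r"AKIA[0-9A-Z]{16}"),             # AWS access key IDs
]

def safe_to_post(comment: str) -> bool:
    """Return False if the comment appears to carry a credential."""
    return not any(p.search(comment) for p in SECRET_PATTERNS)

def post_pr_comment(client, pr_id: str, comment: str) -> None:
    if not safe_to_post(comment):
        raise RuntimeError("blocked: possible secret exfiltration in PR comment")
    client.post_comment(pr_id, comment)   # hypothetical forge client
```

Pattern matching is a backstop, not a fix: the underlying injection still needs input hardening and sandboxing policy, which is exactly what the spec punts to implementers.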
The forks are doing the interesting work
The community is already routing around Symphony’s hard dependencies. Sortie re-implements the spec as a single Go binary backed by SQLite, with working adapters for GitHub Issues and Jira Cloud and pluggable agents including Claude Code and Copilot 11 — directly attacking the Linear/Codex/Elixir lock-in. That a fork is the production-ready path within weeks of launch tells you where the real validation is happening.
The labor backdrop sharpens the stakes: junior developer postings have reportedly fallen 60% since 2022, with Microsoft leadership warning of a “hollowed out” engineering pipeline 12. The tickets Symphony eats first are precisely the ones juniors used to learn on.
OpenAI’s Privacy Filter ships a slick Gradio demo and a quietly fragile model
Source: huggingface-blog · published 2026-04-27
TL;DR
- Hugging Face’s tutorial wires three web apps to openai/privacy-filter via gradio.Server — clean reference, optimistic framing.
- Independent testing shows recall collapses to 38% on EHR notes and 10% on web-crawl text once OpenAI’s contextual scaffolding disappears.
- The custom-frontend story locks you into @gradio/client because ZeroGPU’s rate-limiter needs an X-IP-Token header that fetch() won’t send.
- OpenAI itself calls it a “redaction aid, not anonymization” — a caveat the tutorial omits.
The demo apps are fine. The model under them is precision-biased.
The walkthrough builds three things on top of gradio.Server: a Document Privacy Explorer that highlights PII spans in PDFs, an Image Anonymizer that pairs Tesseract OCR with redaction bars, and SmartRedact Paste, a token-gated pastebin that swaps PII for <CATEGORY> placeholders on the public URL. All three lean on OpenAI’s headline number: state-of-the-art on PII-Masking-300k, 1.5B params with 50M active, 128k context in a single forward pass.
That benchmark is not what you’ll get in production. Tonic.ai’s head-to-head evaluation found precision holds up (0.77–0.85), but default recall craters on realistic inputs 13:
| Dataset | OPF default recall |
|---|---|
| PII-Masking-300k (claim) | ~96% F1 |
| EHR clinical notes | 38% |
| Web-crawl PII | 10% |
Stephen Turner’s probing explains the mechanism: the model leans hard on linguistic context. Strip the phrase “my phone number is” and account-number recall drops from ~80% to 21% 14. Long, varied PDFs, the Document Privacy Explorer’s core use case, are exactly where those contextual cues go missing.
False positives quietly corrupt the SmartRedact use case
Practitioner reports on r/LocalLLaMA flag the inverse problem: common nouns like “matter” and “end,” and acronyms like “MCP,” get classified as private organizations 15. For the highlight-and-review Document Explorer this is annoying. For SmartRedact Paste — where the public link silently substitutes <CATEGORY> placeholders — every false positive destroys semantic content before the user sees it. The tutorial doesn’t surface OpenAI’s own caveat that the model is “not an anonymization tool” or a “compliance certification” 16; it’s a redaction aid that needs human review or domain fine-tuning before anything regulated touches it.
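A recall-calibration pass can start as simply as keeping the raw spans and scores around for review instead of substituting blindly. The sketch below assumes openai/privacy-filter loads as a standard Hugging Face token-classification model (consistent with the BIOES head the architecture coverage describes); the threshold and example text are hypothetical, and this is not the tutorial's code.

```python
# Sketch, not the tutorial's code: load the model as a plain token-classification
# pipeline and return spans + scores so a human can review what got replaced.
from transformers import pipeline

detector = pipeline(
    "token-classification",
    model="openai/privacy-filter",       # model ID from the article
    aggregation_strategy="simple",       # merge BIOES word pieces into spans
)

def redact(text: str, min_score: float = 0.5):
    """Return (redacted_text, spans); spans let low-confidence hits be reviewed."""
    spans = sorted(detector(text), key=lambda s: s["start"], reverse=True)
    redacted = text
    for s in spans:
        if s["score"] < min_score:
            continue                      # below threshold: flag for review, don't replace
        redacted = redacted[: s["start"]] + f"<{s['entity_group']}>" + redacted[s["end"]:]
    return redacted, spans

masked, spans = redact("Call Jane Doe at 555-0142 about the Acme account.")
print(masked)
for s in spans:
    print(s["entity_group"], round(float(s["score"]), 2), s["word"])
```

The spans list is the important part: SmartRedact-style publishing should only happen after someone has seen exactly what was replaced and what slipped through.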
The gradio.Server story has a lock-in footnote
The post pitches @server.api and @server.get/post as a clean way to bring your own HTML/JS frontend while keeping ZeroGPU queueing. True, with an asterisk: the official gradio.Server announcement notes that plain fetch() calls fail ZeroGPU’s rate-limiter because only the @gradio/client JS SDK forwards the X-IP-Token auth header from the HF iframe to the server 17. “Custom frontend” effectively means “custom frontend that imports @gradio/client” — otherwise you lose the free GPU tier.
On the model side, the tutorial elides architecture detail worth knowing if you plan to fine-tune: 128 experts with top-4 routing, Grouped-Query Attention, only 8 pre-norm transformer blocks, and a training pipeline that converted a gpt-oss-family autoregressive decoder into a bidirectional classifier with a 33-logit BIOES token-classification head 18. That’s a small, sparse, repurposed encoder — which helps explain both the impressive 50M active-parameter latency story and the brittleness to missing context.
Takeaway
Use the tutorial as a gradio.Server reference; treat the redaction layer as advisory. If you ship SmartRedact Paste behind a public URL without a recall-calibration pass and a human in the loop, you are publishing PII, not hiding it.
Round-ups
WHY ARE YOU LIKE THIS
Source: simon-willison
ChatGPT Images 2.0 renders a stacked-absurdity prompt — horse riding astronaut riding pelican riding bicycle — and unprompted adds a road sign reading ‘WHY ARE YOU LIKE THIS,’ a riff on Simon Willison’s pelican-on-a-bicycle benchmark shared by @scottjla on Twitter.
Footnotes
1. The Decoder — https://the-decoder.com/openai-says-old-prompts-are-holding-gpt-5-5-back-and-developers-need-a-fresh-baseline/ — OpenAI says old prompts are holding GPT-5.5 back and developers need a fresh baseline; legacy prompts that over-specify every step now create “noise” that restricts the model’s search space.
2. Towards AI (Mollick analysis recap) — https://pub.towardsai.net/openai-was-losing-the-enterprise-market-for-six-months-last-thursday-they-hit-back-5bbad0a55c02 — GPT-5.5 represents a “category change”: it can ingest a decade of raw research data and independently formulate hypotheses, run statistical tests, and draft a paper with accurate citations, but creative fiction still lacks “taste.”
3. Tessl blog — https://tessl.io/blog/gpt-55-is-openais-best-model-its-also-the-worst-at-using-the-knowledge-you-give-it/ — In raw baseline testing GPT-5.5 outperforms GPT-5.4 (77.5 vs 75.9), but when domain-specific skills are injected, GPT-5.4 jumps to 92.7 while GPT-5.5 lags at 87.4; on node-best-practices the gap is 19.8 points (77.6 vs 97.4).
4. dev.to / Kowshik Jallipalli — https://dev.to/kowshik_jallipalli_a7e0a5/gpt-55-just-dropped-heres-what-the-benchmarks-are-hiding-3ich — Independent testing showed an 86% hallucination rate on Artificial Analysis Omniscience; on LiveBench’s xHigh-effort tier it ranked 11th, behind GPT-5.4, Claude 4.6, and Gemini 3.1 Pro.
5. Digital Applied frontier comparison — https://www.digitalapplied.com/blog/gpt-5-5-vs-claude-opus-4-7-frontier-comparison — Claude Opus 4.7 still leads SWE-bench Pro at 64.3% versus GPT-5.5’s 58.6%; developers report “burning through $100 in an hour” at the new $5/$30 per-million-token pricing, double GPT-5.4.
6. OpenAI Bio Bug Bounty announcement — https://openai.com/index/gpt-5-5-bio-bug-bounty/ — GPT-5.5 is OpenAI’s first model classified “High” for cybersecurity and biology under the Preparedness Framework; the Bio Bug Bounty offers up to $25,000 for a universal jailbreak that answers all five bio-safety challenge questions, restricted to Codex Desktop under NDA.
7. InfoWorld (Sanchit Vir Gogia / Forrester analysis) — https://www.infoworld.com/article/4164173/openais-symphony-spec-pushes-coding-agents-from-prompts-to-orchestration.html — “While code generation scales effortlessly, the burden of validation and review does not… a fivefold increase in PRs may actually decrease total team velocity by overwhelming human reviewers.”
8. LetsDataScience / community feedback — https://letsdatascience.com/news/openai-releases-symphony-project-management-spec-for-coding-21c19085 — The cost of running 10 concurrent agents can be prohibitive, with estimates reaching $5,000 per user monthly due to high token consumption across multiple “continuation” turns.
9. Hacker News discussion — https://news.ycombinator.com/item?id=47798887 — “Inscrutable agent slop… mentions a state machine for driving agent behavior, [but] fails to actually describe the state transitions.”
10. Help Net Security — https://www.helpnetsecurity.com/2026/04/28/openai-symphony-codex-orchestration-linear/ — The “Comment and Control” class of prompt injection demonstrated how untrusted inputs, such as a malicious PR title, could hijack agents and trick them into exfiltrating production secrets (e.g., ANTHROPIC_API_KEY, GITHUB_TOKEN) from the CI/CD environment.
11. DevOps.com (Sortie fork coverage) — https://devops.com/openai-debuts-symphony-to-orchestrate-coding-agents-at-scale/ — Sortie provides functional, production-ready adapters for GitHub Issues and Jira Cloud and replaces Symphony’s in-memory Elixir/OTP state with a SQLite backend and a single Go binary.
12. CodeConductor.ai analysis — https://codeconductor.ai/blog/future-of-junior-developers-ai/ — Junior developer postings dropped by 60% between 2022 and 2024, a trend that has accelerated in 2026; Microsoft leadership has warned that this technology may “hollow out” the engineering pipeline.
13. Tonic.ai benchmark report — https://www.tonic.ai/blog/benchmarking-openai-privacy-filter-pii-detection — OPF’s precision remains high (0.77–0.85), but its default recall on EHR notes is a mere 38%; on stubborn data like web-crawl PII, the default recall drops to 10%.
14. Stephen Turner blog — https://blog.stephenturner.us/p/privacy-filter-openai — Removing contextual hints, such as the phrase “my phone number is,” can cause recall to plummet; recall for account numbers dropped from approximately 80% to just 21% when the identifying prefix was removed.
15. r/LocalLLaMA discussion — https://www.reddit.com/r/LocalLLaMA/comments/1sw000s/new_model_for_detecting_and_masking_pii_from/ — Developers reported the model incorrectly flagging common nouns such as “matter” and “end,” and technical acronyms like “MCP,” as private organizations.
16. Help Net Security — https://www.helpnetsecurity.com/2026/04/23/openai-privacy-filter-personally-identifiable-information/ — OpenAI explicitly warns that the Privacy Filter is “not an anonymization tool” or a “compliance certification”: a redaction aid rather than a safety guarantee.
17. Hugging Face — Introducing gradio.Server — https://huggingface.co/blog/introducing-gradio-server — Standard fetch() calls will fail ZeroGPU’s rate-limiting because the @gradio/client JS library is specifically designed to forward the necessary X-IP-Token auth headers from the HF iframe to the server.
18. MarkTechPost architecture deep-dive — https://www.marktechpost.com/2026/04/28/openai-releases-privacy-filter-a-1-5b-parameter-open-source-pii-redaction-model-with-50m-active-parameters/ — 128 total experts with top-4 routing per token, 8 pre-norm transformer blocks with Grouped-Query Attention; the model began as an autoregressive decoder before being converted into a bidirectional classifier with a 33-logit token-classification head over a BIOES scheme.