Wei (Jack) Sun

Scaffolding and subsidies, not weights, are carrying AI's headline numbers

Scaffolding, subsidized credits, and customer fine-tuning — not frontier weights — are quietly carrying the headline numbers behind today's biggest AI announcements.


TL;DR

  • Mozilla’s official advisory credits Claude Mythos with 3 CVEs, not the 271-vulnerability figure both companies publicized this week.
  • An AISLE replication reproduced Mythos’s marquee Firefox bugs using 3.6B–5.1B open-weight models, pointing at scaffolding rather than the frontier model.
  • OpenAI’s Privacy Filter scores 96% F1 in-house but 0.18–0.65 on real corpora; fine-tuning on ~2,000 in-domain docs recovers 0.95.
  • GitHub joins Anthropic and Windsurf in repricing coding agents this quarter; Copilot was reportedly losing about $20 per user per month.
  • Codex hits 4M weekly active users with Accenture, PwC and Infosys as enterprise partners, while Anthropic walks back a $100/month Claude Code test.

Three of today’s biggest AI stories share a structural feature that the press releases bury: the marquee number depends on something other than the model. Mozilla and Anthropic’s 271-vulnerability headline rests on scaffolding that open-weight 4B-class models can replicate, and on $100M of Anthropic credits subsidizing the per-finding cost. OpenAI’s Privacy Filter advertises 96% F1, but only after you fine-tune it on your own data — out of the box it’s a base model wearing a product’s clothes. And GitHub’s Copilot repricing — joining Anthropic and Windsurf in the same quarter — admits what the unit economics have said for a year: flat-rate frontier coding was a subsidy, not a business.

The connective tissue is that the moats and the margins are migrating. Capability claims increasingly live in tooling and credit lines; product claims increasingly require customer fine-tuning; pricing claims increasingly require a tier you didn’t buy. Today is a good day to read the second paragraph.

The 271-vulnerability headline that Mozilla’s own advisory doesn’t back up

Source: simon-willison · published 2026-04-22

TL;DR

  • Mozilla and Anthropic claim Claude Mythos Preview found 271 vulnerabilities in Firefox 150; the official advisory credits Mythos with just three CVEs.
  • An AISLE replication recovered the same marquee bugs using 3.6B–5.1B open-weight models, suggesting the scaffolding is the moat, not the frontier model.
  • Per-finding cost is reportedly ~$50, propped up by Anthropic’s $100M Project Glasswing credits — a program already breached via a contractor leak.
  • Mozilla concedes Mythos found no new bug classes, just memory-safety flaws faster.

A headline number that doesn’t reconcile

Firefox CTO Bobby Holley’s blog post — amplified by Simon Willison’s pull-quote and Ars Technica’s writeup — frames Firefox 150 as the moment “defenders finally have a chance to win, decisively,” courtesy of an early build of Anthropic’s Claude Mythos Preview catching 271 vulnerabilities in a single audit pass.

The arithmetic falls apart on inspection. Mozilla’s own canonical advisory, MFSA 2026-30, credits Claude Mythos with exactly three CVEs: 2026-6746, 2026-6757, and 2026-6758 1 2. The other 268 entries appear to be duplicates, low-severity hardening tickets, or defects swept into omnibus memory-safety bulletins. Davi Ottenheimer’s teardown goes harder, alleging the flagship Firefox demo ran against a stripped harness with sandboxing disabled — and that re-enabling production defenses dropped exploit success from roughly 70% to 4.4% 1.

Call it a 90× gap between the marketing number and the CVE-grade number.

The model probably isn’t the moat

The most damaging dissent comes from AISLE’s “Jagged Frontier” report, which replicated Mythos’s headline finds — including the 27-year-old OpenBSD bug Anthropic showcased and a 17-year-old FreeBSD RCE — using open-weight models as small as 3.6B and 5.1B parameters, at roughly $0.11 per million tokens 3. Their conclusion: the real differentiator is the discovery scaffolding (parallel scanners, context isolation, agentic loops), not frontier reasoning.
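AISLE's point is that the harness pattern, not the model, does the heavy lifting. A minimal sketch of that pattern — parallel scanners over isolated per-file contexts, feeding a triage step — might look like the following, where `scan_chunk` is a hypothetical stand-in for the model call (a real harness would invoke an LLM there):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the model call: flags suspicious lines.
# The scaffolding pattern (context isolation + parallelism + triage)
# is what this sketch illustrates, not real vulnerability discovery.
def scan_chunk(chunk: str) -> list[str]:
    findings = []
    for line in chunk.splitlines():
        if "memcpy" in line or "strcpy" in line:
            findings.append(line.strip())
    return findings

def audit(files: dict[str, str], workers: int = 4) -> dict[str, list[str]]:
    """Scan each file in an isolated context, in parallel, then triage."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        raw = dict(zip(files, pool.map(scan_chunk, files.values())))
    # Triage loop: keep only files with at least one deduplicated finding.
    return {path: sorted(set(f)) for path, f in raw.items() if f}

repo = {
    "parser.c": "memcpy(dst, src, n);\nreturn 0;",
    "util.c": "int x = 1;",
}
print(audit(repo))  # only parser.c surfaces a finding
```

Swap the keyword check for any model — frontier or 5B open-weight — and the orchestration around it stays identical, which is the substance of the "scaffolding is the moat" claim.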

Mozilla’s security team, talking to TechRadar, quietly confirmed the ceiling: Mythos is “every bit as capable” as elite human researchers, but it has not surfaced any new class of “AI-exclusive” bug — it finds the same memory-safety flaws, just faster and in parallel 4.

The flagship Firefox demonstration may have been conducted against a ‘stripped out’ test harness with modern sandboxing and defenses removed. 1

The economics work — until the gating doesn’t

ArmorCode’s cost analysis is the most credible data point in the bundle: roughly $10,000 in tokens to grind out a serious FFmpeg finding, with marginal cost per valid bug landing around $50, underwritten by $100M in Anthropic credits to Project Glasswing partners 5. At those prices, burning compute genuinely is cheaper than retaining a senior vulnerability researcher.
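The reported figures reconcile with simple arithmetic. A back-of-envelope sketch (campaign cost and AISLE's token price are from the sources above; the run count and tokens-per-run are illustrative assumptions):

```python
# Back-of-envelope reconciliation of the reported cost figures.
campaign_cost = 10_000          # dollars of tokens for the FFmpeg hunt (ArmorCode)
valid_findings = 200            # assumed count implied by ~$50 per valid bug
cost_per_bug = campaign_cost / valid_findings
print(cost_per_bug)             # 50.0

# At AISLE's open-weight price point the same scan is far cheaper:
price_per_mtok = 0.11           # dollars per million tokens (AISLE)
tokens_per_run = 2_000_000      # assumed tokens for one scan pass
print(price_per_mtok * tokens_per_run / 1e6)  # ≈ $0.22 per run
```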

But the “defenders win” thesis assumes Mythos stays gated to defenders. It hasn’t. The Glasswing program was breached via a Mercor contractor whose URL-naming conventions let unauthorized users on Discord query the model directly 2. Cisco’s parallel research shows open-weight models — the same class AISLE used to replicate Mythos — are markedly more effective at producing working exploits when prompted across multiple turns 6.

flowchart LR
    A[Frontier model<br/>Claude Mythos] --> S[Agentic scaffolding<br/>parallel scanners]
    O[Open-weight 5B model] --> S
    S --> F[Memory-safety bugs<br/>at scale]
    F --> D[Defenders patch]
    F -. leaked access .-> X[Attackers weaponize]

What’s actually new

Strip the “271” branding and what remains is still meaningful: three real CVEs, a credible cost-per-bug curve, and proof that agentic harnesses around capable LLMs can outpace human auditors on memory-safety drudgework. That’s a genuine shift in defensive economics. It is not “defenders win, decisively” — it’s “the scanner got cheaper, and the same scanner runs for the other side too.”



OpenAI’s Privacy Filter is a fine-tunable base, not a turnkey redactor

Source: openai-blog · published 2026-04-22

TL;DR

  • OpenAI shipped a 1.5B-param (50M active) MoE token classifier under Apache 2.0 for local PII redaction, claiming 96% F1.
  • Independent benchmarks on real-world corpora collapse that to 0.18–0.65 F1, with recall as low as 10% on web crawls.
  • Fine-tuning on ~2,000 in-domain docs recovers 0.95 F1 — so treat OPF as a base model, not a drop-in HIPAA shield.
  • Strategically, the open-weight release exists to neutralize data-sovereignty objections blocking GPT-5.5 in regulated verticals.

The 96% number doesn’t survive contact with real data

OpenAI’s headline figure — 96% F1 on PII-Masking-300k, 97.4% after annotation cleanup — comes from a single synthetic English benchmark. Tonic.ai re-ran the model on four messier corpora (EHR notes, call-center transcripts, loan contracts, web crawls) and watched F1 collapse to between 0.18 and 0.65, with recall bottoming out at 10% on web crawls and 38% on medical notes 7. Precision held at 0.77–0.85, which tells you what’s happening: OpenAI picked a conservative operating point to avoid over-redaction, and that choice silently leaves the majority of sensitive tokens exposed in noisy text 8.
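The collapse follows directly from the F1 formula: precision can stay respectable while recall craters, and the harmonic mean punishes the weaker of the two. Plugging in the reported endpoints (the precision/recall pairings below are illustrative — Tonic reports ranges, not matched pairs):

```python
# F1 is the harmonic mean of precision and recall, so a conservative
# operating point with high precision but low recall scores poorly.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.85, 0.10), 2))  # 0.18 — the web-crawl floor
print(round(f1(0.77, 0.38), 2))  # 0.51 — mid-range, e.g. medical notes
```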

| Benchmark | F1 | Recall | Notes |
| --- | --- | --- | --- |
| PII-Masking-300k (OpenAI) | 0.96 | 0.98 | Synthetic English [primary] |
| Japanese (model card) | 0.88 | 0.87 | Non-Latin degradation 9 |
| Arabic MSA (model card) | 0.88 | — | 9 |
| Tonic real-world (OOTB) | 0.18–0.65 | 0.10–~0.6 | EHR, web, contracts 7 |
| Tonic, fine-tuned on 2k docs | 0.95 | — | Matches Tonic Textual 7 |

The redeeming result: with roughly 2,000 labeled in-domain documents, fine-tuned OPF matches Tonic’s production redactor at 0.95 F1 7. That reframes the release. OPF is a strong base classifier for privacy work, not the turnkey replacement OpenAI’s blog post implies.

What the announcement glossed over

The “1.5B total / 50M active” line hides the actual architecture. Independent writeups confirm a sparse MoE with 128 experts and top-4 routing, eight pre-norm encoder blocks at residual width 640, grouped-query attention (14 query / 2 KV heads) to make the 128K context affordable, and a constrained Viterbi decoder enforcing valid BIOES transitions 10. That last piece matters: it’s why span boundaries stay coherent without an autoregressive decode pass.
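The constraint's job is simply to forbid tag sequences that cannot describe valid spans. A minimal sketch of BIOES transition validity for a single entity type (illustrative only — OPF's actual decoder runs Viterbi over per-token scores under these constraints rather than validating after the fact):

```python
# BIOES: Begin, Inside, End, Single, Outside. An entity opened with B
# must be continued (I) or closed (E); S is a one-token entity.
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START = {"O", "B", "S"}   # a sequence may not begin mid-entity
FINAL = {"O", "E", "S"}   # or end with an unclosed entity

def valid_bioes(tags: list[str]) -> bool:
    if not tags or tags[0] not in START or tags[-1] not in FINAL:
        return False
    return all(b in ALLOWED[a] for a, b in zip(tags, tags[1:]))

print(valid_bioes(["O", "B", "I", "E", "O"]))  # True
print(valid_bioes(["O", "B", "O"]))            # False: B left unclosed
```

Enforcing these transitions inside the decode is what keeps span boundaries coherent without an autoregressive pass.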

The model card also buries multilingual numbers OpenAI didn’t headline — Japanese F1 88.1%, Arabic 87.8% — about nine points below English, with explicit warnings about non-Latin script degradation 9. If your pipeline touches Asian-language customer data, plan to fine-tune.

The dual-use objection — and why this shipped now

Hacker News went after the framing directly:

A 4% failure rate is unacceptable for security-critical redaction… and a tool optimized to find and mask PII can just as easily be repurposed to efficiently extract sensitive data from large datasets. 11

The point lands. Unlike deterministic regex, a stochastic classifier gives no signal about which tokens slipped through — exactly the failure mode a compliance control must not have. Commenters also flagged “openwashing”: weights ship under Apache 2.0, but training data and pipeline code do not 11.

The strategic read explains the timing. OPF is a defensive move to capture the enterprise deployment layer and neutralize the data-sovereignty objection that has slowed GPT-5.5 adoption in regulated industries 12. Ship an open-weight redactor that runs locally, and the legal team’s “but the PII leaves our VPC” line stops working.

Takeaway

OpenAI built a competent base model and oversold its out-of-the-box performance. If you’re deploying OPF, budget for in-domain fine-tuning, push the operating point toward recall, keep human review on legal/medical/financial spans, and don’t market it internally as anonymization. The interesting consequence isn’t the model — it’s that frontier labs now treat enterprise privacy infrastructure as a wedge worth giving away.


The flat-rate coding agent is dead: GitHub joins Anthropic and Windsurf in repricing

Source: simon-willison · published 2026-04-22

TL;DR

  • GitHub’s Copilot Individual repricing isn’t a one-off — Anthropic and Windsurf made the same move in the same quarter.
  • “Same $10” hides the real change: Claude Opus requests now reportedly carry a 27× multiplier, gated to a $39 Pro+ tier.
  • GitHub is offering full prorated refunds through May 20, 2026 — they’re pricing in churn, not denying it.
  • The economics: frontier labs spend ~$1.35 in compute per $1 of revenue, and Copilot was reportedly losing ~$20/user/month.

Three vendors, one capitulation

Simon Willison framed GitHub’s April 22 announcement as a Copilot story. It’s bigger than that. The same week, Anthropic briefly pulled Claude Code from the $20 Pro tier and pushed agentic users onto $100+ Max plans, then walked it back under pressure — but admitted in the process that Pro pricing predated agentic workloads and couldn’t absorb them 13. A month earlier, Windsurf abandoned its credit pool for hard quotas, with one user reporting that an identical workflow’s cost jumped “from a few dollars to nearly $80 in a single day” 14. Three vendors, one quarter, same conclusion: long-running agents broke the unit economics of the seat license.

| Vendor | Old model | New model | User reaction |
| --- | --- | --- | --- |
| GitHub Copilot | Per-request, ~unlimited | Token-based weekly/session caps; Opus gated to $39 Pro+ | “You will get less but pay the same price” 15 |
| Anthropic Claude Code | Included in $20 Pro | Pushed to $100+ Max for heavy use | Reversed after backlash 13 |
| Windsurf | Credit pool | Daily/weekly hard quotas | Same workflow: $5 → $80/day 14 |

The number missing from the announcement

GitHub’s blog post leans on the phrase “agentic workflows have fundamentally changed Copilot’s compute demands.” What it doesn’t quantify is the multiplier. Community threads document the answer: a Claude Opus request now reportedly consumes up to 27× a standard premium request 16. That’s the mechanism by which a notionally unchanged $10 plan becomes materially smaller, and why even the new $39 Pro+ tier burns through its 1,500-request allowance an order of magnitude faster on the model people are paying extra to use 16.
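The "order of magnitude" claim is just division — the reported multiplier shrinks the Pro+ allowance from 1,500 nominal requests to a few dozen effective ones:

```python
# Pro+ allowance vs the reported Opus multiplier (community figures).
allowance = 1_500       # premium requests per month on the $39 Pro+ tier
opus_multiplier = 27    # reported consumption per Claude Opus request
print(allowance // opus_multiplier)  # 55 effective Opus requests/month
```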

Visual Studio Magazine’s developer survey caught the resulting mood:

You will get less but pay the same price… each extended session carries a cost. 15

That last clause is the real shift. Per-token accounting means experimentation has a meter running on it — a different psychological posture than the autocomplete era.

Why GitHub will accept the churn

Buried in the changelog is an unusual concession: any Individual subscriber, monthly or annual, can cancel and receive a prorated refund through May 20, 2026 17. Vendors don’t open refund windows unless they expect to use them. The macro context explains the willingness to eat the loss: CIO.com reports Uber exhausted its entire 2026 AI budget by April, and frontier labs are spending roughly $1.35 in compute per $1 of revenue 18. Internal figures circulating around the announcement suggested Copilot was losing about $20/user/month under flat pricing, with power users costing up to $80 18.

What’s actually at stake

The fault line is now explicit: vendors say the math is unsustainable 13 18; heavy users experience opaque multipliers as a rugpull 14 15 16. Neither side is wrong. What’s ending is the assumption — inherited from SaaS — that a developer tool’s marginal cost is near zero. Coding agents have marginal cost, it scales with session length, and the first vendor to communicate that honestly (with visible token meters, not hidden multipliers) will win the trust the others are spending down. The May 20 refund window is GitHub’s tell that it knows the bill is coming.
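What a visible token meter could mean in practice: surface the cost as it accrues, with any multiplier applied at record time rather than discovered on the bill. A hypothetical sketch (the class, rates, and multiplier semantics are all invented for illustration):

```python
# Hypothetical session meter: the multiplier is part of the displayed
# rate, not a surprise at billing time. All prices are made up.
class SessionMeter:
    def __init__(self, usd_per_mtok: float, multiplier: float = 1.0):
        self.rate = usd_per_mtok * multiplier / 1e6  # USD per token
        self.tokens = 0

    def record(self, tokens: int) -> float:
        """Add a model call's tokens; return the running cost in USD."""
        self.tokens += tokens
        return round(self.tokens * self.rate, 4)

meter = SessionMeter(usd_per_mtok=15.0, multiplier=27)  # Opus-style tier
print(meter.record(10_000))  # ≈ 4.05 USD after one 10k-token call
```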

Round-ups

Is Claude Code going to cost $100/month? Probably not - it’s all very confusing

Source: simon-willison

Anthropic silently updated its pricing page to make Claude Code exclusive to the $100/month Max plan, then reverted within hours after Reddit, HN and Twitter erupted. Growth lead Amol Avasare called it a test on 2% of new prosumer signups; OpenAI’s Codex team pounced, reaffirming a free tier.

Scaling Codex to enterprises worldwide

Source: openai-blog

OpenAI launched Codex Labs and named Accenture, PwC and Infosys as deployment partners for enterprise Codex rollouts across the software development lifecycle. The company also disclosed Codex has reached 4 million weekly active users, framing the product as a serious challenger to Claude Code in the enterprise.

[AINews] Moonshot Kimi K2.6: the world’s leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?)

Source: latent-space

Moonshot refreshed its open-weights Kimi K2.6, which Latent Space’s AINews pegs as roughly matching Anthropic’s Claude Opus 4.6 on benchmarks. The update lands ahead of an anticipated DeepSeek v4 release and reclaims the top open-model slot.

Reading today’s open-closed performance gap

Source: interconnects

Nathan Lambert breaks down what actually drives the headline benchmark gap between open and closed frontier models, arguing the single eval number obscures training-compute, data and post-training factors, and sketches how the gap is likely to evolve as open labs catch up.

OpenAI launches ChatGPT Images 2.0 / GPT-Image-2

Source: openai-blog

OpenAI rolled out ChatGPT Images 2.0, powered by GPT-Image-2, pitching gains in text rendering, multilingual prompts and visual reasoning. The release pairs a consumer-facing ChatGPT update with the underlying API model, targeting the gap with Google’s Imagen and Midjourney on typography-heavy generations.


That’s my designer - Claude

Source: bens-bites

Ben’s Bites flags a Claude design-focused update that ships alongside a new model, Opus 4.7. The snippet is thin on specifics, but the headline framing positions Claude as a designer-replacement tool rather than just a chat or coding assistant.

Announcing the Anthropic Economic Index Survey

Source: anthropic-research

Anthropic opens its Economic Index Survey, soliciting first-party data from workers and businesses on how they use AI. The effort extends the Economic Index beyond Claude.ai conversation logs, aiming to capture adoption and task patterns the telemetry alone misses.

Footnotes

  1. flyingpenguin.com — ‘Mythos Mystery in Mozilla Numbers’ — https://www.flyingpenguin.com/mythos-mystery-in-mozilla-numbers-how-22-vulns-became-271-or-maybe-3-in-april/

     271 vulnerabilities… the canonical security advisory (MFSA 2026-30) credited Anthropic with only three specific CVEs… a ‘90x difference’… the flagship Firefox demonstration may have been conducted against a ‘stripped out’ test harness with modern sandboxing and defenses removed… exploit success rate reportedly plummeted from over 70% to just 4.4%

  2. The Hacker News — community discussion writeup — https://thehackernews.com/2026/04/anthropics-claude-mythos-finds.html

     only three issues (CVE-2026-6746, CVE-2026-6757, and CVE-2026-6758) were officially credited to Claude in the high-severity advisory, suggesting many of the 271 were likely lower-severity flaws, ‘defense-in-depth’ issues, or non-exploitable code paths… unauthorized users on Discord managed to access the model by guessing naming conventions and exploiting a third-party contractor’s breach

  3. AISLE — ‘AI Cybersecurity After Mythos: The Jagged Frontier’ — https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier

     a 5.1B-parameter open model was able to recover the logic for a 27-year-old OpenBSD bug that Anthropic had used as a marquee example… the true ‘moat’… is the discovery scaffolding, rather than the raw intelligence of the model itself

  4. TechRadar — Mozilla quotes — https://www.techradar.com/pro/mozilla-says-anthropics-mythos-is-every-bit-as-capable-as-the-worlds-best-security-researchers-after-firefox-experiment-and-says-the-zero-days-are-numbered

     Mythos is ‘every bit as capable’ as elite human researchers, [but] it has not yet found a new class of ‘AI-exclusive’ bugs; it simply finds existing flaws at an unprecedented scale

  5. ArmorCode blog — token economics analysis — https://www.armorcode.com/blog/anthropics-claude-mythos-and-what-it-means-for-security

     identifying the FFmpeg vulnerabilities required several hundred runs at an estimated cost of $10,000… the cost per valid finding has dropped to roughly $50, making it cheaper to burn compute than to hire human researchers… Anthropic committed $100 million in usage credits to its Project Glasswing partners

  6. HackRead — Cisco open-weight research — https://hackread.com/cisco-open-weight-ai-models-long-chat-exploit/

     open-weight models… can also be modified to bypass safety guardrails or used in multi-turn attacks that are significantly more effective at generating malicious exploits than single-turn prompts

  7. Tonic.ai benchmark blog — https://www.tonic.ai/blog/benchmarking-openai-privacy-filter-pii-detection

     Out-of-the-box F1 scores on four real-world datasets ranged from 0.18 to 0.65… recall dropped to 10% on web crawls and 38% on EHR notes, but with 2,000 labeled documents the fine-tuned OPF matched Tonic Textual at 0.95 F1.

  8. Security Boulevard (Tonic syndication) — https://securityboulevard.com/2026/04/benchmarking-openais-privacy-filter-what-it-gets-right-and-where-pii-detection-still-needs-real-data/

     OPF maintained respectable precision of 0.77–0.85 but its conservative default operating point — designed to avoid over-redaction — leaves substantial PII exposed in noisy unstructured text.

  9. OpenAI Privacy Filter Model Card — https://cdn.openai.com/pdf/c66281ed-b638-456a-8ce1-97e9f5264a90/OpenAI-Privacy-Filter-Model-Card.pdf

     Japanese F1 88.1% (recall 86.6%, precision 89.7%) and Modern Standard Arabic F1 87.8% — performance significantly degrades on non-Latin characters and notations that differ from the training distribution.

  10. MarkTechPost technical writeup — https://www.marktechpost.com/2026/04/28/openai-releases-privacy-filter-a-1-5b-parameter-open-source-pii-redaction-model-with-50m-active-parameters/

     Sparse MoE with 128 experts and top-4 routing per token, eight pre-norm transformer encoder blocks at width 640, grouped-query attention (14 query / 2 KV heads), and a constrained Viterbi decoder enforcing valid BIOES transitions.

  11. Hacker News discussion — https://news.ycombinator.com/item?id=47870901

     A 4% failure rate is unacceptable for security-critical redaction… and a tool optimized to find and mask PII can just as easily be repurposed to efficiently extract sensitive data from large datasets.

  12. Startup Fortune analysis — https://startupfortune.com/openais-open-weight-privacy-filter-kills-the-last-excuse-against-enterprise-ai/

     A tactical move to capture the enterprise deployment layer and neutralize the data sovereignty objections that have slowed adoption of GPT-5.5 and other cloud-based frontier models in regulated industries.

  13. Ed Zitron, Where’s Your Ed At — ‘Anthropic Removes Pro CC’ — https://www.wheresyoured.at/news-anthropic-removes-pro-cc/

     Anthropic later admitted that its ‘Max’ plans were specifically designed to handle these ‘heavy’ usage patterns, which did not exist when the Pro tier was first conceived

  14. r/windsurf — ‘Windsurf is simply destroying its reputation’ — https://www.reddit.com/r/windsurf/comments/1s1es3f/windsurf_is_simply_destroying_its_reputation_with/

     costs for the same level of output exploded from a few dollars to nearly $80 in a single day due to the new overage calculations

  15. Visual Studio Magazine — ‘Devs Sound Off on Usage-Based Copilot Pricing’ — https://visualstudiomagazine.com/articles/2026/04/27/devs-sound-off-on-usage-based-copilot-pricing-change-you-will-get-less-but-pay-the-same-price.aspx

     You will get less but pay the same price… each extended session carries a cost

  16. GitHub gh-aw discussion #15139 — https://github.com/github/gh-aw/discussions/15139

     The money-burning party is coming to a close… using Claude Opus now carries a 27x cost multiplier compared to previous tiers

  17. GitHub Changelog — Changes to Copilot plans for individuals (Apr 20, 2026) — https://github.blog/changelog/2026-04-20-changes-to-github-copilot-plans-for-individuals/

     Until May 20, 2026, both monthly and annual users can cancel their subscriptions and receive a full refund for the time remaining on their current term

  18. CIO.com — ‘The inference bill nobody budgeted for’ — https://www.cio.com/article/4163877/the-inference-bill-nobody-budgeted-for.html

     Uber reportedly exhausted its entire annual AI budget by April due to deep integration of agentic workflows… AI labs are spending an estimated $1.35 for every $1 in revenue to subsidize compute
Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare