Wei (Jack) Sun

Vendors ship the wins; deployers inherit the work

A model release, a faster API, and an AI-assisted security audit all post genuine wins, but each one rests on conditions that deployers must absorb themselves.

TL;DR

  • Qwen3.6-27B beats its 397B MoE predecessor on SWE-bench Verified at 77.2%, but burned 140M tokens versus a 23M average to do it.
  • OpenAI’s WebSocket Responses API cuts agent-loop latency by up to 40%, but only with the weaker Spark model on Cerebras and dev-managed session routing.
  • Mozilla’s CTO credited Claude Mythos with 271 Firefox fixes; the formal advisory ties just three CVEs directly to Claude.
  • UK AISI confirms a real capability jump for AI bug-finding, but a 3.6B open model reproduced Anthropic’s headline FreeBSD bug at ~1/200th the cost.
  • Hugging Face, OpenAI Academy, and TII round out the day with a Chrome-extension Transformers.js tutorial, a Workspace Agents course, and an Arabic LLM leaderboard.

Three launches today, three genuine technical wins — and three different parties left holding the integration bill. Alibaba’s Qwen3.6-27B really does beat its 397B MoE predecessor on SWE-bench Verified, but the workload that produced the number burned six times the tokens of the field average, and early GGUFs shipped with a quantization-breaking case-sensitivity bug. OpenAI’s new WebSocket transport for the Responses API really does cut agent-loop latency by up to 40%, but the headline depends on a weaker model running on Cerebras silicon, and developers inherit session-aware load balancing as a side effect of the persistent connection. And Anthropic’s Claude-assisted Firefox audit really did find vulnerabilities — three of them, directly credited in the formal advisory, not the 271 in the launch-post screenshot.

The pattern isn’t that vendors are overstating; it’s that each “win” rests its weight on a configuration the launch post compresses away. Read the features for what those configurations actually demand.

Qwen3.6-27B clears the SWE-bench bar — but read the asterisks

Source: simon-willison · published 2026-04-22

TL;DR

  • Qwen3.6-27B beats its 397B MoE predecessor on SWE-bench Verified (77.2% vs 76.2%), in a 55.6GB dense package.
  • Independent trackers confirm the headline, but the model burned 140M tokens on Artificial Analysis’s suite vs a 23M average.
  • A case-sensitivity bug in early GGUFs broke Gated DeltaNet quantization; vLLM tool-calling still regresses unless you disable “thinking.”
  • Practical frontier for a single 4090 — but the polished launch post elides real deployment friction.

The headline holds up

Qwen’s claim that a 27B dense model outperforms its own 397B-total / 17B-active MoE flagship across coding benchmarks is, surprisingly, mostly true. The Decoder’s writeup confirms 77.2% on SWE-bench Verified against the older Qwen3.5-397B-A17B’s 76.2% [1], and Artificial Analysis places the model’s Agentic Index alongside Claude 3.5 Sonnet despite the order-of-magnitude smaller footprint [2]. Simon Willison’s local run on a 16.8GB Q4_K_M GGUF — a pelican-on-bicycle SVG with correct frame geometry, chain, and spokes at ~25 tok/s — is the qualitative version of the same story.

For deployment economics, the dense-vs-MoE distinction is what makes this interesting:

| Model | Total params | Active | Disk (full) | Runs on |
| --- | --- | --- | --- | --- |
| Qwen3.6-27B | 27B dense | 27B | 55.6GB | Single 4090 (Q4) |
| Qwen3.5-397B-A17B | 397B MoE | 17B | 807GB | Multi-GPU rig |
| GLM-5.1 | 754B MoE | 40B | ~236GB at 2-bit [3] | Mac Ultra / H100s |

GLM-5.1 isn’t really in the same conversation for solo developers [3]. Qwen3.6-27B is.
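Those disk numbers fall straight out of parameter count times bits per weight. A back-of-envelope sketch (the ~4.85 bits/weight figure for Q4_K_M is a community rule of thumb, not a spec):

```python
def model_file_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough checkpoint size: parameters x bits/weight, ignoring metadata."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 27B dense at BF16 (16 bits/weight): ~54 GB, in line with the 55.6GB download.
print(f"{model_file_gb(27, 16):.1f} GB")    # 54.0 GB

# Q4_K_M averages roughly 4.85 bits/weight across tensors: ~16.4 GB,
# close to the 16.8GB GGUF Willison ran on a single 4090.
print(f"{model_file_gb(27, 4.85):.1f} GB")  # 16.4 GB
```

The same arithmetic is why the 397B MoE needs a multi-GPU rig no matter how few parameters are active per token: all 807GB of weights still have to live somewhere.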

The asterisks the launch post skips

Two findings cut against the marketing. First, Artificial Analysis measured Qwen3.6-27B emitting 140M tokens to finish their evaluation suite versus a 23M average, dropping effective throughput to roughly 64.6 tok/s [2]. The “thinking preservation” mechanism that powers the agentic scores is also what makes the model expensive to actually run end-to-end. Benchmark wins, wall-clock losses.
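To put the verbosity in deployment terms: per-token billing scales linearly with output, so the same harness run costs roughly six times more no matter what the quoted $/Mtok is (the price below is a placeholder, not Qwen’s rate):

```python
model_tokens, field_avg_tokens = 140e6, 23e6
print(f"{model_tokens / field_avg_tokens:.1f}x token overhead")  # 6.1x

price_per_mtok = 0.30  # placeholder $/M output tokens
print(f"suite cost ${model_tokens / 1e6 * price_per_mtok:.0f} "
      f"vs ${field_avg_tokens / 1e6 * price_per_mtok:.0f} at the field average")
# suite cost $42 vs $7 at the field average
```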

Second, the contamination question. Critics on r/LocalLLaMA point to the developers’ own admission that they “corrected some problematic tasks” in SWE-bench Pro before running internal evaluations [4] — exactly the kind of tweak that makes rapid 27B-scale benchmark gains look suspicious to anyone who’s been watching this cycle.

Quantization is fragile, tools regress

The 16.8GB GGUF that makes the local-LLM story compelling shipped with a real bug. r/LocalLLaMA traced looping and runaway thinking to a case-sensitivity error in the conversion script — a_log vs A_log — that downcast Gated DeltaNet parameters meant to stay in FP32 [5]. Unsloth and llama.cpp pushed a fix and “dynamic” GGUFs, but users on lower-bit quants still hit stuck-thinking states.
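The bug class is as mundane as it sounds. A minimal illustration of the failure mode, not the actual conversion-script code:

```python
# Tensor names the converter should preserve in FP32 (Gated DeltaNet's
# decay parameters among them). The bug: the entry is lowercased.
KEEP_IN_FP32 = {"a_log"}          # checkpoint actually stores "A_log"

def choose_dtype(tensor_name: str) -> str:
    # Case-sensitive membership test: "A_log" never matches "a_log",
    # so the tensor silently falls through to the 4-bit default path.
    return "fp32" if tensor_name in KEEP_IN_FP32 else "q4"

print(choose_dtype("A_log"))   # "q4" -- downcast; looping, runaway thinking
# The fix is the boring one: correct the name (or compare case-insensitively).
print("fp32" if "A_log".lower() in KEEP_IN_FP32 else "q4")   # "fp32"
```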

Worse, vLLM users running the model as an agent hit tool-call failures, and the community workaround is to disable Thinking Preservation entirely [6]:

some developers advise disabling ‘Thinking Preservation’ to resolve tool-call misses, though this negates the model’s primary architectural advantage in long-context sessions

In other words: the feature Qwen markets as the architectural win is the same feature you turn off to make agents work in the most popular inference server.
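For reference, a sketch of what the workaround looks like against vLLM’s OpenAI-compatible server, assuming the chat_template_kwargs passthrough and a Qwen3-style enable_thinking flag carry over to this model; treat both names as assumptions, not documented API for Qwen3.6:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; URL and model name are illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3.6-27B",
    messages=[{"role": "user", "content": "Which files changed in the last commit?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command and return stdout.",
            "parameters": {
                "type": "object",
                "properties": {"cmd": {"type": "string"}},
                "required": ["cmd"],
            },
        },
    }],
    # Community workaround: turn thinking off so tool calls parse reliably,
    # at the cost of the long-context Thinking Preservation advantage.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.tool_calls)
```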

What’s actually new here

Strip the caveats and the signal is still real: a dense 27B is now genuinely competitive with last generation’s 400B-class MoE on coding, and it fits on hardware a single developer owns. That’s the shift worth tracking. The launch post’s silence on verbose thinking, quant bugs, and SWE-bench Pro task edits is the part that should make you wait for the second wave of independent harness results before betting an agent pipeline on it.


OpenAI’s WebSocket Responses API: a real 40% win, on a narrow workload

Source: openai-blog · published 2026-04-22

TL;DR

  • OpenAI’s new WebSocket transport for the Responses API cuts agent-loop latency by up to 40% via conversation state cached on a persistent connection.
  • The headline gains depend on GPT-5.3-Codex-Spark hitting 1,000+ TPS on Cerebras hardware — but Spark scores 58.4% on Terminal-Bench 2.0 vs. 77.3% for the flagship.
  • Stateful connections inherit Realtime API billing pathologies and push session-aware load balancing back onto developers.
  • Anthropic is not following: public Messages API stays on SSE, with persistence hidden inside managed harnesses.

What actually changed

OpenAI rebuilt the Responses API transport so a single WebSocket connection holds an in-memory cache of prior response objects, tool definitions, sampling artifacts, and rendered token text. New turns skip re-tokenization, safety classifiers run only on the delta, and billing/post-processing overlap with the next inference. Developers keep the familiar response.create shape and chain calls with previous_response_id. The reported wins: Vercel AI SDK –40% latency, Cline multi-file workflows 39% faster, Cursor +30%, and ~45% better TTFT overall.

```mermaid
sequenceDiagram
    participant C as Client
    participant API as Responses API
    participant M as Model
    Note over C,API: Old: stateless HTTP per turn
    C->>API: POST full history + tools
    API->>API: re-tokenize, re-validate
    API->>M: infer
    M-->>API: tokens
    API-->>C: response (then teardown)
    Note over C,API: New: WebSocket, connection-scoped cache
    C->>API: open WS, send delta + previous_response_id
    API->>M: infer (cached state)
    M-->>API: tokens
    API-->>C: response (billing runs in parallel)
```
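In client terms, the loop stops re-sending history and starts sending deltas. A minimal sketch with the websockets library; response.create and previous_response_id come from the announcement, while the endpoint URL, event names, and payload shape are assumptions (auth omitted):

```python
import asyncio
import json
import websockets  # pip install websockets

async def agent_loop():
    # Endpoint is illustrative -- OpenAI's real WS URL may differ.
    async with websockets.connect("wss://api.openai.com/v1/responses") as ws:
        prev_id = None
        for turn in ["Run the test suite.", "Fix the failing test."]:
            # Only the new turn crosses the wire; the server-side cache
            # keyed by previous_response_id supplies history and tools.
            await ws.send(json.dumps({
                "type": "response.create",
                "input": turn,
                "previous_response_id": prev_id,
            }))
            while True:  # drain streamed events until the turn completes
                event = json.loads(await ws.recv())
                if event.get("type") == "response.completed":
                    prev_id = event["response"]["id"]
                    break

asyncio.run(agent_loop())
```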

The model behind the numbers is weaker

The 1,000 TPS figure comes from Codex-Spark, a distilled/pruned variant on Cerebras silicon. Independent review finds Spark scores roughly 58.4% on Terminal-Bench 2.0 against the flagship’s 77.3%, and notes the model “over-calls tools and generates excessive tokens, sometimes taking a longer path to reach a solution than a slower, more surgical model” [7]. Wall-clock gains on real engineering tasks are therefore narrower than the TPS headline; you’re paying in solution quality for some of that throughput.

Persistent connections, persistent costs

The transport inherits operational baggage already visible in the Realtime API. Developers in OpenAI’s own forums report $5+ bills for 75-second sessions and note that tokens generated during edge-to-internal disconnects are billed even when the client never receives them [8]. Infrastructure-side, WebSockets are harder to scale than SSE: load balancers must be session-aware, and reconnect storms create thundering-herd risk that stateless HTTP avoided [9]. Sticky sessions and connection caps push real DevOps work onto API consumers.
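One concrete piece of that inherited DevOps work: every client needs reconnect discipline, or a regional blip turns into a synchronized stampede against the load balancer. A standard full-jitter backoff sketch, independent of any OpenAI-specific API:

```python
import asyncio
import random

async def reconnect_with_backoff(connect, base=0.5, cap=30.0, max_attempts=8):
    """Full-jitter exponential backoff: each client sleeps a random slice of
    an exponentially growing window, de-synchronizing mass reconnects."""
    for attempt in range(max_attempts):
        try:
            return await connect()  # e.g. a websockets.connect(...) coroutine
        except OSError:
            window = min(cap, base * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, window))
    raise ConnectionError("exhausted reconnect attempts")
```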

The latency win is also workload-specific. A Vercel AI Gateway benchmark found native SDKs remain 15–20% faster than routed/persistent paths for ~10-token prompts; the gap only disappears at ~120K-token context sizes [10]. Most API traffic isn’t tool-heavy coding agents — many callers inherit the complexity for negligible benefit.

The industry isn’t converging here

Anthropic has pointedly not followed. Its public Messages API stays on SSE, with persistence surfaced only through managed agent harnesses, betting on MCP as the standardization layer [11]. Practitioners read OpenAI’s design more sharply: connection-scoped server-side state means conversation history lives on OpenAI’s servers, making it “a competitive moat designed to make speed addictive while securing higher margins through stateful lock-in” [12].

Takeaway

The 40% is real, but it’s a point claim: long, tool-heavy loops on a fast-but-weaker model, paid for with billing opacity, sticky-session ops, and a lock-in shape competitors are deliberately avoiding. Treat WebSocket Responses as a useful tool for Codex-style agents — not the new default shape of LLM APIs.


Firefox’s “271 AI-found vulnerabilities” is mostly three CVEs in a trench coat

Source: simon-willison · published 2026-04-22

TL;DR

  • Bobby Holley credited Claude Mythos with 271 Firefox 150 fixes; the formal advisory lists 41 CVEs and credits just three directly to Claude.
  • UK AISI confirms a real capability jump (73% on expert CTFs), but a 3.6B open model reproduced Anthropic’s headline FreeBSD bug at ~1/200th the token cost.
  • curl killed its bug bounty in January over AI slop — the “defenders win” story only holds inside Anthropic’s curated Glasswing pipeline.
  • Two fast-follow point releases patched regressions caused by the audit-driven churn.

The headline number doesn’t survive the advisory

Firefox CTO Bobby Holley’s blog post — quoted approvingly by Simon Willison — claims Firefox 150 ships fixes for 271 vulnerabilities found by an early build of Claude Mythos, and concludes that “defenders finally have a chance to win, decisively.”

The number melts on contact with MFSA2026-30. FlyingPenguin’s reconciliation finds only 41 CVE entries in the formal advisory, and just three credited to Claude directly: CVE-2026-6746 (High) plus two Mediums [13]. The rest are hardening fixes, defense-in-depth tweaks, and memory-safety roll-ups that never met the CVE threshold. The Register pushes further — Holley himself conceded that Mythos found “no category of vulnerability an elite human researcher couldn’t eventually spot,” and several Anthropic-published exploits required “substantial human guidance” rather than autonomous discovery [14]. The decisive-win framing needs the 271 number to be load-bearing. It isn’t.

The capability is real. The moat is not.

This is where the story gets more interesting than either side admits. The UK AI Security Institute’s independent evaluation is the strongest data point in Mythos’s favour: 73% on expert-level CTF challenges no 2025 model could solve, and the first model ever to complete 32-step end-to-end corporate network attacks autonomously [15]. Something genuinely shifted.

What’s contested is whether the shift is Mythos-shaped. AISLE replicated several of Anthropic’s marquee finds — including the 17-year FreeBSD RCE — using open-weights models as small as 3.6 billion parameters:

Every model tested — including those as small as 3.6 billion parameters — successfully identified the FreeBSD exploit. Claude Mythos is estimated to cost approximately $25 per million input tokens, whereas the 3.6B parameter model used by AISLE operates at just $0.11 per million tokens — a cost difference of over 200x. [16]

AISLE’s “jagged frontier” reading: once you’ve isolated the function, the intelligence floor for spotting these bugs is low. The moat is scaffolding, not weights — which undercuts the implication that Mozilla needed Anthropic specifically.
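The quoted ratio checks out:

```python
mythos_per_mtok, small_model_per_mtok = 25.00, 0.11  # $/M input tokens, as quoted
print(f"{mythos_per_mtok / small_model_per_mtok:.0f}x")  # 227x -- "over 200x"
```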

Defenders win — if they’re inside the perimeter

Holley’s optimism reads very differently next to curl. Daniel Stenberg terminated curl’s bug bounty program in January 2026 because AI-generated slop had driven the genuine-vulnerability rate from 1-in-6 to 1-in-30 [17]. Same technology, opposite outcome.

```mermaid
flowchart LR
    M[Claude Mythos raw output] --> G[Project Glasswing<br/>curation + repros]
    G --> Mz[Mozilla<br/>271 actionable findings]
    M -. uncurated .-> S[Public bug bounties]
    S --> C[curl: 1-in-30 signal<br/>program shut down]
```

Mozilla can absorb a firehose because Anthropic delivered curated, reproducible reports through Project Glasswing. An unfunded maintainer staring down the same model output without the curation layer gets denial-of-service instead.

And the stability tax is real even for Mozilla: Firefox 150.0.1 fixed a Bitdefender conflict and broken pinch-zoom rendering, and 150.0.2 was fast-tracked after corporate SSO logins started returning blank pages [18]. “271 fixes in one release” elides that the audit churn shipped its own regressions.

The honest read: Mythos is a real capability step, the headline number is inflated by CVE roll-ups, and the defenders-win narrative is true exactly for the organisations inside the gatekept pipeline.

Round-ups

Workspace agents

Source: openai-blog

OpenAI Academy adds a Workspace Agents track teaching non-developers to build ChatGPT agents that automate repeatable workflows, connect tools, and coordinate team operations. The course targets business users scaling agent deployments across departments rather than engineers writing custom integrations.

How to Use Transformers.js in a Chrome Extension

Source: huggingface-blog

Hugging Face walks through embedding Transformers.js inside a Chrome extension, running models client-side in the browser without server calls. The tutorial covers manifest setup, background workers, and message passing so extensions can ship local inference rather than piping page content to a remote API.

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Source: huggingface-blog

TII launches QIMMA, an Arabic LLM leaderboard built around curated, quality-filtered evaluation sets rather than scraped benchmarks, aiming to surface genuine Arabic reasoning and generation capability. Hosted on Hugging Face, it positions itself as a stricter alternative to existing multilingual rankings for Arabic models.

Footnotes

  1. The Decoder · https://the-decoder.com/qwen3-6-27b-beats-much-larger-predecessor-on-most-coding-benchmarks/

    Qwen3.6-27B achieved a score of 77.2% on SWE-bench Verified, surpassing the 76.2% mark set by the much larger Qwen3.5-397B MoE

  2. Smol AI News (Artificial Analysis recap) · https://news.smol.ai/issues/26-04-22-not-much/

    the model often generates excessive tokens (140M in AA’s test suite vs. a 23M average), which results in slower effective throughput of roughly 64.6 tokens per second

  3. Qubrid — GLM-5.1 vs Qwen 3.6 comparison · https://www.qubrid.com/blog/glm-5-1-vs-qwen-3-6-plus-the-next-generation-of-enterprise-ai-on-qubrid

    GLM-5.1 utilizes a colossal MoE architecture with 754 billion total parameters… Even with extreme 2-bit quantization via Unsloth, the model requires roughly 236GB of VRAM/RAM

  4. Tosea.ai guide / community discussion · https://tosea.ai/blog/qwen-3-6-27b-complete-guide

    Critics on platforms like Reddit argue that such rapid gains in 27B-scale models suggest 100% benchmark contamination… the developers’ own admission that they ‘corrected some problematic tasks’ in public datasets like SWE-bench Pro before running their internal evaluations

  5. r/LocalLLaMA — Q6_K_XL GGUF looping report · https://www.reddit.com/r/LocalLLaMA/comments/1sz9f6f/qwen3627budq6_k_xlgguf_sometimes_gets_stuck_in_a/

    a case-sensitivity bug in the quantization script (checking for a_log instead of A_log) caused these critical [Gated DeltaNet] parameters to be quantized to 4-bit instead of being preserved in FP32

  6. r/LocalLLaMA — ‘3.6 27B tool calling issues vLLM’ · https://www.reddit.com/r/LocalLLaMA/comments/1syh4sd/36_27b_tool_calling_issues_vllm/

    some developers advise disabling ‘Thinking Preservation’ to resolve tool-call misses, though this negates the model’s primary architectural advantage in long-context sessions

  7. Adam Holter blog — ‘GPT-5.3-Codex-Spark: 1000 TPS but is it actually faster?’ · https://adam.holter.com/gpt-5-3-codex-spark-1000-tokens-per-second-but-is-it-actually-faster/

    On Terminal-Bench 2.0, Spark scored approximately 58.4%, a sharp decline from the flagship’s 77.3%… the model has a tendency to over-call tools and generate excessive tokens, sometimes taking a longer path to reach a solution than a slower, more surgical model.

  8. OpenAI Developer Community — Realtime API pricing thread · https://community.openai.com/t/realtime-api-pricing-is-wrong-will-overcharge/971012

    Developers reported being billed over $5.00 for a 75-second session… OpenAI bills for generated tokens even if the connection times out between their edge and internal servers, developers often pay for data they never actually receive.

  9. cloudops.consulting — Real-time messaging protocols deep dive · https://cloudops.consulting/articles/real-time-messaging-protocols-grpc-websocket-sse-deep-dive.html

    WebSockets are harder to scale horizontally; because they are stateful, load balancers must be session-aware, and sudden mass-reconnections can cause thundering herd issues that simpler, stateless SSE connections avoid.

  10. dev.to — Benchmarking Vercel AI Gateway against native Anthropic SDK · https://dev.to/cliftonz/benchmarking-vercel-ai-gateway-against-the-native-anthropic-sdk-21g5

    For small prompts (roughly 10 tokens), direct calls to native provider SDKs remain about 15-20% faster than routed solutions like the Vercel AI Gateway… for large-context workloads of 120,000 tokens, the latency difference effectively disappears.

  11. jetbi.com — Streaming Architecture 2026: Beyond WebSockets · https://jetbi.com/blog/streaming-architecture-2026-beyond-websockets

    Anthropic largely retains Server-Sent Events for its public Messages API, prioritizing a clean, predictable developer experience… Anthropic’s ‘persistent’ equivalent is primarily surfaced through managed harnesses that abstract the connection layer.

  12. Medium — ‘Death of the REST API for AI Agents: Inside OpenAI’s WebSocket Strategy’ · https://medium.com/@kdineshkvkl/the-death-of-the-rest-api-for-ai-agents-inside-openais-1000-iq-websocket-strategy-afe58628b811

    Some observers suggest the transition to this model is a competitive moat designed to make speed addictive while securing higher margins through stateful lock-in… conversation state is managed on OpenAI’s servers rather than locally.

  13. FlyingPenguin (security blog) · https://www.flyingpenguin.com/mythos-mystery-in-mozilla-numbers-how-22-vulns-became-271-or-maybe-3-in-april/

    While the blog post by Firefox CTO Bobby Holley cited 271 findings, the formal security advisory only lists 41 CVE entries… Only three specific CVEs were directly credited to Claude: CVE-2026-6746 (High), CVE-2026-6757 (Medium), and CVE-2026-6758 (Medium). Most findings were classified as lower-severity hardening issues or defense-in-depth bugs that did not meet the threshold for a public CVE.

  14. The Register — ‘Anthropic Mythos hype: nothingburger?’ · https://www.theregister.com/2026/04/22/anthropic_mythos_hype_nothingburger/

    Holley himself admitted that Mythos did not find any category of vulnerability an elite human researcher couldn’t eventually spot… Anthropic’s own reports showed some exploits required ‘substantial human guidance’ rather than being fully autonomous.

  15. UK AI Security Institute evaluation · https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities

    Mythos Preview achieved a 73% success rate on expert-level Capture-the-Flag challenges — tasks no model could complete as recently as 2025… AISI independently confirmed Mythos’s ability to complete 32-step corporate network attacks, marking it as the first AI to cross the threshold of ‘end-to-end’ autonomous offensive operations.

  16. AISLE — ‘AI Cybersecurity After Mythos: The Jagged Frontier’ · https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier

    Every model tested — including those as small as 3.6 billion parameters — successfully identified the FreeBSD exploit. Claude Mythos is estimated to cost approximately $25 per million input tokens, whereas the 3.6B parameter model used by AISLE operates at just $0.11 per million tokens — a cost difference of over 200x.

  17. Simon Willison on curl / Daniel Stenberg · https://simonwillison.net/2025/Oct/2/curl/

    Curl officially terminated its bug bounty program in January 2026 to ‘remove the money’ as an incentive for low-effort submissions… the rate of genuine vulnerabilities in curl submissions plummeted from roughly one-in-six to as low as one-in-thirty.

  18. AndroidHeadlines — Firefox 150.0.1/.0.2 regressions · https://www.androidheadlines.com/2026/04/firefox-150-update-claude-mythos-security-patches.html

    Firefox 150.0.1 fixed a Bitdefender-conflict that broke Facebook loading, dropdown menus that expanded incorrectly, and disappearing borders during pinch-zoom on macOS/Windows. Version 150.0.2 was fast-tracked to fix a regression where corporate login prompts appeared as blank pages, blocking access to internal networks.
