Wei (Jack) Sun

Vendors ship the wins; deployers inherit the work

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model simonwillison.net

Big claims from Qwen about their latest open-weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previous-generation open-source flagship Qwen3.5-397B-A17B (397B total / 17B active MoE) across all major coding benchmarks. On Hugging Face, Qwen3.5-397B-A17B is 807GB; this new Qwen3.6-27B is 55.6GB. I tried it out with the 16.8GB Unsloth Qwen3.6-27B-GGUF:Q4_K_M quantized version and llama-server using…

Speeding up agentic workflows with WebSockets in the Responses API openai.com

A deep dive into the Codex agent loop, showing how WebSockets and connection-scoped caching reduced API overhead and improved model latency.
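The overhead reduction described here comes largely from amortizing connection setup: each fresh HTTPS request re-pays TCP and TLS handshakes, while a persistent WebSocket pays them once. A back-of-envelope sketch, with all timings invented for illustration (not OpenAI measurements):

```python
# Toy amortization model: per-request connection setup vs. a single
# persistent socket reused across an agent loop's many turns.

def total_latency_http(n_requests: int, setup_ms: float, work_ms: float) -> float:
    """Every request re-pays connection setup."""
    return n_requests * (setup_ms + work_ms)

def total_latency_ws(n_requests: int, setup_ms: float, work_ms: float) -> float:
    """Setup is paid once; subsequent turns reuse the socket."""
    return setup_ms + n_requests * work_ms

setup, work, turns = 120.0, 400.0, 50  # ms, ms, agent-loop turns (assumed)
http = total_latency_http(turns, setup, work)
ws = total_latency_ws(turns, setup, work)
print(f"HTTP: {http:.0f} ms, WS: {ws:.0f} ms, saved: {http - ws:.0f} ms")
```

The saving grows linearly with the number of turns, which is why chatty agent loops benefit far more than one-shot completions.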

Quoting Bobby Holley simonwillison.net

As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation. […] Our experience is a hopeful one for teams who shake off the vertigo and get to work. You may need to reprioritize everything else to bring relentless and single-minded focus to the task, but there is light at the end of the tunnel. We…

How to Use Transformers.js in a Chrome Extension huggingface.co

Hugging Face walks through embedding Transformers.js inside a Chrome extension, running models client-side in the browser without server calls. The tutorial covers manifest setup, background workers, and message passing so extensions can ship local inference rather than piping page content to a remote API.
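The shape of such an extension is a Manifest V3 service worker hosting the model, with content scripts passing messages to it. A minimal manifest sketch along those lines (file names and match patterns are illustrative, not copied from the tutorial):

```json
{
  "manifest_version": 3,
  "name": "Local Inference Demo",
  "version": "0.1.0",
  "background": {
    "service_worker": "background.js",
    "type": "module"
  },
  "content_scripts": [
    { "matches": ["<all_urls>"], "js": ["content.js"] }
  ]
}
```

The content script forwards page text to the background worker via `chrome.runtime.sendMessage`, and the worker runs the Transformers.js pipeline locally, so no page content leaves the browser.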

Workspace agents openai.com

OpenAI Academy adds a Workspace Agents track teaching non-developers to build ChatGPT agents that automate repeatable workflows, connect tools, and coordinate team operations. The course targets business users scaling agent deployments across departments rather than engineers writing custom integrations.

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard huggingface.co

TII launches QIMMA, an Arabic LLM leaderboard built around curated, quality-filtered evaluation sets rather than scraped benchmarks, aiming to surface genuine Arabic reasoning and generation capability. Hosted on Hugging Face, it positions itself as a stricter alternative to existing multilingual rankings for Arabic models.

References

The Decoder the-decoder.com

Qwen3.6-27B achieved a score of 77.2% on SWE-bench Verified, surpassing the 76.2% mark set by the much larger Qwen3.5-397B MoE

Smol AI News (Artificial Analysis recap) news.smol.ai

the model often generates excessive tokens (140M in AA’s test suite vs. a 23M average), which results in slower effective throughput of roughly 64.6 tokens per second
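One way to read those figures (our interpretation, not Artificial Analysis's stated methodology) is that "effective throughput" divides raw decode speed by the token-inflation factor relative to the suite average:

```python
# Back-of-envelope reading of the quoted Artificial Analysis figures.
generated_tokens = 140e6   # tokens the model emitted across AA's suite
average_tokens = 23e6      # suite average across models
inflation = generated_tokens / average_tokens

effective_tps = 64.6       # reported effective tokens/second
implied_raw_tps = effective_tps * inflation  # raw decode speed this implies

print(f"inflation: {inflation:.2f}x")            # ~6.09x more tokens
print(f"implied raw decode: {implied_raw_tps:.0f} tok/s")
```

Under that reading, a fast raw decoder still feels slow in practice because roughly six tokens are generated for every one a typical model would need.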

r/LocalLLaMA — Q6_K_XL GGUF looping report reddit.com

a case-sensitivity bug in the quantization script (checking for a_log instead of A_log) caused these critical [Gated DeltaNet] parameters to be quantized to 4-bit instead of being preserved in FP32
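The failure mode is easy to reproduce in miniature: a case-sensitive substring check decides which tensors stay in FP32, and the capitalized parameter name falls through. Function and tensor names below are illustrative, not taken from the actual quantization script:

```python
# Sketch of the reported bug: a case-sensitive filter decides which
# tensors are preserved in FP32 during GGUF quantization.

def keep_fp32_buggy(tensor_name: str) -> bool:
    # Checks for lowercase "a_log", so "blk.0.A_log" is missed
    # and gets quantized down to 4-bit.
    return "a_log" in tensor_name

def keep_fp32_fixed(tensor_name: str) -> bool:
    # Case-insensitive match catches the Gated DeltaNet parameters.
    return "a_log" in tensor_name.lower()

name = "blk.0.A_log"
print(keep_fp32_buggy(name))  # False: parameter wrongly quantized
print(keep_fp32_fixed(name))  # True: kept in FP32
```

A one-character difference in a tensor-name filter is all it takes to silently degrade a numerically sensitive layer.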

r/LocalLLaMA — ‘3.6 27B tool calling issues vLLM’ reddit.com

some developers advise disabling ‘Thinking Preservation’ to resolve tool-call misses, though this negates the model’s primary architectural advantage in long-context sessions

Qubrid — GLM-5.1 vs Qwen 3.6 comparison qubrid.com

GLM-5.1 utilizes a colossal MoE architecture with 754 billion total parameters… Even with extreme 2-bit quantization via Unsloth, the model requires roughly 236GB of VRAM/RAM

Tosea.ai guide / community discussion tosea.ai

Critics on platforms like Reddit argue that such rapid gains in 27B-scale models suggest 100% benchmark contamination… the developers’ own admission that they ‘corrected some problematic tasks’ in public datasets like SWE-bench Pro before running their internal evaluations

Adam Holter blog — ‘GPT-5.3-Codex-Spark: 1000 TPS but is it actually faster?’ adam.holter.com

On Terminal-Bench 2.0, Spark scored approximately 58.4%, a sharp decline from the flagship’s 77.3%… the model has a tendency to over-call tools and generate excessive tokens, sometimes taking a longer path to reach a solution than a slower, more surgical model.

OpenAI Developer Community — Realtime API pricing thread community.openai.com

Developers reported being billed over $5.00 for a 75-second session… OpenAI bills for generated tokens even if the connection times out between their edge and internal servers, developers often pay for data they never actually receive.

dev.to — Benchmarking Vercel AI Gateway against native Anthropic SDK dev.to

For small prompts (roughly 10 tokens), direct calls to native provider SDKs remain about 15-20% faster than routed solutions like the Vercel AI Gateway… for large-context workloads of 120,000 tokens, the latency difference effectively disappears.
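That pattern is what you would expect if the gateway adds a roughly fixed per-request overhead while prompt processing scales with token count. A toy model with invented numbers (not the dev.to benchmark's data) reproduces the shape:

```python
# Fixed gateway overhead as a fraction of total latency shrinks as
# token-dependent processing time grows.

def latency(tokens: int, per_token_ms: float = 0.5,
            fixed_ms: float = 200.0, gateway_ms: float = 0.0) -> float:
    return fixed_ms + gateway_ms + tokens * per_token_ms

for tokens in (10, 120_000):
    direct = latency(tokens)
    routed = latency(tokens, gateway_ms=40.0)
    overhead_pct = 100 * (routed - direct) / direct
    print(f"{tokens:>7} tokens: gateway adds {overhead_pct:.2f}%")
```

At 10 tokens the fixed 40ms is a double-digit percentage of the request; at 120,000 tokens it is lost in the noise.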

cloudops.consulting — Real-time messaging protocols deep dive cloudops.consulting

WebSockets are harder to scale horizontally; because they are stateful, load balancers must be session-aware, and sudden mass-reconnections can cause thundering herd issues that simpler, stateless SSE connections avoid.
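The standard mitigation for those mass reconnections (a general technique, not something the article prescribes) is exponential backoff with full jitter, so clients spread their reconnect attempts instead of stampeding the moment a node comes back:

```python
# Full-jitter exponential backoff for WebSocket reconnects:
# delay is drawn uniformly from [0, min(cap, base * 2**attempt)].
import random

def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    return random.uniform(0, min(cap, base * (2 ** attempt)))

delays = [reconnect_delay(a) for a in range(6)]
assert all(0.0 <= d <= 30.0 for d in delays)
```

The jitter matters more than the exponent: even a capped deterministic backoff re-synchronizes clients into waves, while the uniform draw flattens the reconnect load.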

jetbi.com — Streaming Architecture 2026: Beyond WebSockets jetbi.com

Anthropic largely retains Server-Sent Events for its public Messages API, prioritizing a clean, predictable developer experience… Anthropic’s ‘persistent’ equivalent is primarily surfaced through managed harnesses that abstract the connection layer.

Medium — ‘Death of the REST API for AI Agents: Inside OpenAI’s WebSocket Strategy’ medium.com

Some observers suggest the transition to this model is a competitive moat designed to make speed addictive while securing higher margins through stateful lock-in… conversation state is managed on OpenAI’s servers rather than locally.

FlyingPenguin (security blog) flyingpenguin.com

While the blog post by Firefox CTO Bobby Holley cited 271 findings, the formal security advisory only lists 41 CVE entries… Only three specific CVEs were directly credited to Claude: CVE-2026-6746 (High), CVE-2026-6757 (Medium), and CVE-2026-6758 (Medium). Most findings were classified as lower-severity hardening issues or defense-in-depth bugs that did not meet the threshold for a public CVE.

AISLE — ‘AI Cybersecurity After Mythos: The Jagged Frontier’ aisle.com

Every model tested — including those as small as 3.6 billion parameters — successfully identified the FreeBSD exploit. Claude Mythos is estimated to cost approximately $25 per million input tokens, whereas the 3.6B parameter model used by AISLE operates at just $0.11 per million tokens — a cost difference of over 200x.
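The "over 200x" claim is a straightforward ratio of the two quoted prices, worth sanity-checking:

```python
# Ratio of the quoted per-million-token prices.
mythos_per_mtok = 25.00  # USD per million input tokens (quoted estimate)
small_per_mtok = 0.11    # USD per million tokens (quoted)

ratio = mythos_per_mtok / small_per_mtok
print(f"{ratio:.0f}x")   # ~227x, consistent with "over 200x"
```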

The Register — ‘Anthropic Mythos hype: nothingburger?’ theregister.com

Holley himself admitted that Mythos did not find any category of vulnerability an elite human researcher couldn’t eventually spot… Anthropic’s own reports showed some exploits required ‘substantial human guidance’ rather than being fully autonomous.

UK AI Security Institute evaluation aisi.gov.uk

Mythos Preview achieved a 73% success rate on expert-level Capture-the-Flag challenges — tasks no model could complete as recently as 2025… AISI independently confirmed Mythos’s ability to complete 32-step corporate network attacks, marking it as the first AI to cross the threshold of ‘end-to-end’ autonomous offensive operations.

Simon Willison on curl / Daniel Stenberg simonwillison.net

Curl officially terminated its bug bounty program in January 2026 to ‘remove the money’ as an incentive for low-effort submissions… the rate of genuine vulnerabilities in curl submissions plummeted from roughly one-in-six to as low as one-in-thirty.

AndroidHeadlines — Firefox 150.0.1/.0.2 regressions androidheadlines.com

Firefox 150.0.1 fixed a Bitdefender conflict that broke Facebook loading, dropdown menus that expanded incorrectly, and disappearing borders during pinch-zoom on macOS/Windows. Version 150.0.2 was fast-tracked to fix a regression where corporate login prompts appeared as blank pages, blocking access to internal networks.
