Wei (Jack) Sun

GLM-5.1 ties frontier on SWE-bench, curl swamped by AI reports, BadHost narrows

GLM-5.1 ties frontier on SWE-bench, curl's inbox floods with AI-generated bug reports, and the BadHost MCP CVE narrows on inspection.

TL;DR

  • GLM-5.1 scores 58.4 on SWE-bench Pro, edging GPT-5.4 (57.7) and Opus 4.6 (57.3).
  • curl’s security inbox crosses 1+ AI bug report/day, 2× the 2025 rate, 15% valid.
  • BadHost CVE bypasses Starlette host auth, but Cloudflare, Nginx, and AWS ALBs block the exploit upstream.
  • Uber burned its 2026 AI budget in 4 months, says Claude Code output lags spend.
  • OpenRouter raises $113M Series B at $1.3B as Fireworks and Baseten cross $10B.

Three AI leads today, and they don’t braid into one story. GLM-5.1 posts a 58.4 on SWE-bench Pro — edging GPT-5.4 and Opus 4.6 on real GitHub issues — even as Nathan Lambert pegs open models 12 months behind the frontier. curl’s security inbox now fields more than one AI-generated report per day, double the 2025 rate, with the valid rate stuck near 15%. And the BadHost CVE in Starlette, initially framed as exposing millions of MCP servers, lands with a much smaller blast radius once Cloudflare, Nginx, and AWS ALBs sit in front of it.

The round-ups orbit the enterprise side of the same week. Uber says it burned its 2026 AI budget in four months without matching engineering output — a rare named ROI complaint from a frontier-spend buyer. OpenRouter closes a $113M Series B at $1.3B, and Anthropic stands up a Seoul office under new country lead KiYoung Choi.

Lambert pegs open models 12 months behind; SWE-bench disagrees

Source: interconnects · published 2026-05-26

TL;DR

  • GLM-5.1 scores 58.4 on SWE-bench Pro, edging GPT-5.4 (57.7) and Opus 4.6 (57.3) on real GitHub issues
  • Claude Code is at ~$2.5B ARR after enterprises like Uber burned a full year’s AI budget in 4 months
  • Mythos surfaced 10,000+ high- or critical-severity flaws, including a 27-year-old OpenBSD remote-crash bug and a 16-year-old FFmpeg flaw that survived millions of fuzzing attempts
  • Anthropic’s Chris Olah appeared at the Vatican rollout, conceding safety labs face pressures that “conflict with doing the right thing”

Nathan Lambert’s May 2026 state-of-the-field essay gets the macro story right — compute is concentrating, agents are funding the next scale-up, and the social backlash is hardening into law and land-use fights. But the single most quotable line in the piece — that open-weight models trail closed ones by “12 months or more” in agentic capability — is the part that doesn’t survive contact with independent benchmarks.

The gap is in the harness, not the weights

On SWE-bench Pro in April 2026, Zai’s open-weight GLM-5.1 scored 58.4 on real GitHub issue resolution, narrowly beating GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3 1. That is not a 12-month gap; that is parity, with the open model on top. The reconciliation Lambert’s essay never quite makes is that the moat isn’t the base model anymore — it’s the harness, the distribution, and the long tail of reliability inside products like Claude Code and Codex. Read his “12 months” as 12 months of agent-product engineering, and the claim holds. Read it as raw capability, and GLM-5.1 is a counterexample sitting in plain sight.

Claude Code’s revenue is real, and its billing model is a time bomb

Lambert calls Claude Code a “monetization engine” without numbers. The number is roughly $2.5B ARR by February 2026, and the demand-side detail is sharper than his essay lets on: Uber reportedly exhausted its entire 2026 AI budget in four months after 84% of its engineering organization adopted Claude Code’s agentic workflows 2. That single data point does two things at once. It confirms his thesis that agents are the funding flywheel for the next scaling round. It also flags a fragility he skips: enterprises budgeted for token consumption as if it were SaaS seats and got something closer to AWS bills, the kind of surprise that ends procurement relationships.

Mythos has receipts; the Pentagon fight has a judge

The Mythos reference reads as vague in Lambert’s piece, and the “supply chain risk” designation gets a single line. Both deserve more weight. Under Project Glasswing, Mythos identified more than 10,000 high- or critical-severity software flaws, including a 27-year-old remote-crash bug in OpenBSD and a 16-year-old FFmpeg flaw that had survived millions of automated fuzzing attempts 3. That is the concrete grounding for “peak achievement in cybersecurity.”

The Anthropic-DoD posture is also more adversarial than “contradictory policy” suggests. Defense Secretary Pete Hegseth publicly demanded “unrestricted access” for “any lawful purpose” and accused Anthropic of holding “veto power” over military operations 4. A California federal judge then granted Anthropic a preliminary injunction, finding the government likely violated the company’s due process and First Amendment rights 5. The NSA and Pentagon are using Mythos while losing in court to its vendor. That is the actual shape of the story.

The Vatican has a frontier-lab co-signer

Lambert treats the Pope’s 40,000-word document as cultural color. The detail that elevates it: Anthropic co-founder Chris Olah appeared at the Magnifica Humanitas presentation and acknowledged that even the most safety-focused labs operate under commercial pressures that “conflict with doing the right thing” 6. A frontier-lab insider endorsing the Vatican’s framing from the stage is a stronger signal of Lambert’s “rising tide of consequence” than the document’s word count.


curl now fields 1+ AI security report per day, 2× since 2025

Source: simon-willison · published 2026-05-26

TL;DR

  • curl’s security inbox now averages more than one report per day — 4-5× the 2024 rate and double 2025.
  • Bounty removal didn’t shrink the firehose: volume doubled again at a 15-16% valid rate 7, 8.
  • Anthropic’s Mythos audit produced 5 “confirmed” findings — 1 real bug after triage 9.
  • Same pattern is hitting the Linux kernel: ~10 reports/day, mostly duplicates from the same tools 10.

The pressure is structural, not personal

Daniel Stenberg’s latest post reads as a personal lament — his wife has, for the first time, raised concerns about his work hours — but it’s the third installment in a year-long arc that’s worth reading structurally. In January 2026 he killed curl’s HackerOne bounty after the confirmed-vulnerability rate collapsed from a historical 15% to under 5% under a flood of fabricated AI submissions, having paid out $100K+ across 87 valid bugs 7. The intuition was that removing cash would shrink the queue.

It didn’t. By April, with bounties gone, report volume had doubled and the valid-rate rebounded to 15-16% — what Stenberg called “high-quality chaos,” noting that “more convincing crap is worse than obvious crap” 8. “The Pressure” is the coda: the obvious slop is filtered, but every remaining report looks credible enough to demand hours of human verification. That’s the load that’s breaking the team.
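The triage load described above can be put in rough numbers. The sketch below is back-of-envelope arithmetic, not Stenberg's own accounting: the report rate and valid rate come from the figures quoted here, while `hours_per_report` is a purely hypothetical placeholder for how long one credible-looking report takes to verify.

```python
def triage_load(reports_per_day, valid_rate, hours_per_report, days=365):
    """Estimate yearly triage effort for a security inbox.

    hours_per_report is an assumed value for illustration only;
    the source does not state a per-report verification time.
    """
    total_reports = reports_per_day * days
    valid_reports = total_reports * valid_rate
    total_hours = total_reports * hours_per_report
    return {
        "reports": total_reports,
        "valid": valid_reports,
        "invalid": total_reports - valid_reports,
        "hours": total_hours,
        # Every confirmed bug also "pays" for the invalid reports around it.
        "hours_per_valid": total_hours / valid_reports,
    }


if __name__ == "__main__":
    # 1 report/day and a 15% valid rate are from the article;
    # 2 hours per report is a hypothetical input.
    estimate = triage_load(reports_per_day=1, valid_rate=0.15, hours_per_report=2)
    for key, value in estimate.items():
        print(f"{key}: {value:.1f}")
```

Under those inputs, every confirmed vulnerability costs roughly 13 hours of human verification, because the time spent on invalid reports is spread across the few valid ones. Raising the valid rate helps only if the per-report verification time does not rise with it.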

Frontier audits over-promise on hardened code

The optimistic story is that frontier models are about to surface latent bugs in critical infrastructure. The curl numbers don’t support it. Anthropic ran its Mythos system across curl’s 176,000 lines and produced five “confirmed” findings; Stenberg’s manual triage reduced that to one low-severity bug, with three findings turning out to be documented API behaviour and one a non-security defect 9.

Compare that to Joshua Rogers, who used ZeroPath to file roughly 170 valid bugs that Stenberg publicly called “actually truly awesome” 11. The distinguishing variable isn’t the model — it’s whether a human verifies before submitting. Unattended frontier audits against a 25-year-old codebase have a near-zero marginal yield and a high per-report triage cost. That asymmetry is the entire problem.

The cost of looking credible has dropped to near zero. The cost of verifying credibility hasn’t moved.

Bigger than curl

Read in isolation, Stenberg’s post sounds like one maintainer’s burnout. It isn’t. Linus Torvalds recently called the kernel’s private security list “almost entirely unmanageable,” and maintainer Willy Tarreau described a jump from 2-3 reports per week to nearly 10 per day — driven largely by multiple researchers pointing the same AI tools at the same code and filing the same duplicate findings 10. Every critical OSS project with a public security inbox is paying the same tax simultaneously.

The industry response so far: in March 2026, OpenAI, Anthropic, Google and Microsoft jointly pledged $12.5M to Alpha-Omega and OpenSSF, explicitly framed as helping maintainers handle the load their own tools created 12. That’s a one-time pool against an ongoing, per-project drain on a handful of named humans. For software that ships in roughly 30 billion installations, “reactive and undersized” is generous.

What’s actually at stake

The lever everyone assumed would work — bounty economics — has been pulled and produced the opposite of the expected result. The lever everyone hopes will work — frontier-model auditing — over-delivers on plausibility and under-delivers on severity against mature code. What’s left is Stenberg’s inbox, his conscience, and the absence of a sustainable triage model for the next maintainer to hit the wall.


BadHost flaw in Starlette bypasses auth on exposed MCP servers

Source: ars-technica-ai · published 2026-05-26

TL;DR

  • CVE-2026-48710 (“BadHost”) lets a single malformed Host header bypass Starlette’s host-allowlist auth, hitting a package with 325M weekly downloads.
  • Patched in commit 764dab0d, which adds RFC 9112/3986 grammar checks and falls back to scope["server"] when the header is junk.
  • Practical blast radius is far smaller than “millions of agents” — Cloudflare, Nginx, and AWS ALBs already reject the malformed headers the exploit requires.
  • A Knostic Shodan census found 1,862 internet-exposed MCP servers; 100% of a manually verified sample allowed unauthenticated access to internal tool listings even before BadHost.

The bug and the one-line fix

BadHost is exactly the kind of vulnerability that should never have shipped in routing-layer code. Starlette’s TrustedHostMiddleware parsed the inbound Host header without validating it against RFC grammar, so an attacker who sent a deliberately malformed value could slip past the allowlist and reach endpoints the app thought were locked down. The patch in commit 764dab0d validates against RFC 9112/3986 and, on malformed input, falls back to the safe scope["server"] value the ASGI server already resolved 13. Small fix, foundational layer.
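The validate-then-fall-back pattern described above can be sketched in a few lines. This is a simplified illustration, not the code in commit 764dab0d: the regular expression is a loose subset of the RFC 3986 host grammar, and `resolve_host` and `is_allowed` are hypothetical helper names. The only parts taken from ASGI itself are the shape of `scope["headers"]` (byte pairs with lowercased names) and `scope["server"]` (a host/port pair or `None`).

```python
import re

# Loose subset of "uri-host [ ':' port ]" (RFC 9112 Host, RFC 3986 host).
# Assumption: illustrative only, not the grammar check Starlette ships.
_HOST_RE = re.compile(
    r"(?:"
    r"\[[0-9A-Fa-f:.]+\]"                          # bracketed IPv6 literal
    r"|"
    r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?"   # hostname or IPv4 address
    r")"
    r"(?::[0-9]{1,5})?"                            # optional port
)


def _strip_port(host):
    """Drop a trailing ':port' from an already validated host value."""
    if host.startswith("["):
        return host[: host.index("]") + 1]
    return host.rsplit(":", 1)[0] if ":" in host else host


def resolve_host(scope):
    """Return the host to check against the allowlist.

    A well-formed Host header is used as-is (minus the port). Anything
    else is ignored in favour of the address in scope["server"].
    """
    raw = None
    for name, value in scope.get("headers", []):
        if name == b"host":
            raw = value.decode("latin-1")
            break
    if raw is not None and _HOST_RE.fullmatch(raw):
        return _strip_port(raw).lower()
    server = scope.get("server")
    return server[0] if server else ""


def is_allowed(scope, allowed_hosts):
    """Hypothetical allowlist check built on resolve_host."""
    return resolve_host(scope) in allowed_hosts
```

For example, a scope carrying `Host: api.example.com:8000` resolves to `api.example.com`, while a header such as `api.example.com/admin?x` fails the grammar check and the function returns the server address instead, so it cannot match an allowlist entry by accident. The design choice worth noting is that a malformed header is never partially parsed: it is either valid in full or discarded.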

OSTIF — which sponsored the X41 D-Sec audit of vLLM that surfaced the bug — used the disclosure to highlight maintainer burden. Lead maintainer Marcelo Trylesinski “rapidly integrated the fix” while juggling “a large pile of other reports” as an unpaid volunteer 14. Starlette underpins FastAPI, vLLM, LiteLLM, and most modern Python MCP servers; the bus factor here is uncomfortably low for code at that tier of the stack.

“Critical” vs. CVSS 6.5

The Ars headline calls BadHost critical. The official CVSS is 6.5–7.0 — Medium-to-High. X41 and OSTIF argue that rating “materially understates” the risk because Starlette is the routing core of the Python AI stack 15. The counter-argument, made loudest in CyberKendra’s writeup and on HN, is more grounded: any deployment fronted by Cloudflare, Nginx, or an AWS ALB is already protected, because those proxies reject the non-RFC-compliant Host headers the exploit requires before traffic ever reaches Starlette 16.

The directly-exploitable surface isn’t “millions of agents” — it’s the subset of MCP/vLLM/LiteLLM endpoints that publish themselves straight to the public internet.

That subset is measurable. A Knostic Shodan sweep cited in regional coverage found 1,862 internet-exposed MCP servers, and manual verification of a sample showed 100% already allowed unauthenticated access to internal tool listings before BadHost was even disclosed 17. The deeper story isn’t this one CVE — it’s that the MCP ecosystem is shipping production endpoints with no auth at all, then layering host-allowlist middleware on top as the only gate.

Disclosure timing earned pushback

The fix landed in Starlette 1.0.1 on May 21. The branded “BadHost” announcement and scanner went public on May 22 — a ~24-hour window dropped immediately before the US Memorial Day weekend 18. Transitive dependents who don’t import Starlette directly (every FastAPI shop, effectively) got the news at the same time as attackers, with on-call thinned out for the holiday. Combined with growing fatigue at logo-and-landing-page vulnerabilities, the cadence drew accusations that the marketing optimized for attention over operator lead time.

What to actually do

Upgrade Starlette to 1.0.1 — the fix is mechanical and low-risk. If you run MCP servers, the more important audit is the one BadHost incidentally surfaced: are any of your agent endpoints reachable from the public internet at all, and if so, what’s gating tool execution besides a host header?
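Because many projects pull in Starlette transitively, it can help to check the installed version rather than the requirements file. The following is a minimal sketch using only the standard library. It assumes 1.0.1 is the first fixed release, as stated above, and it compares only the numeric release segment, so pre-release tags such as `rc1` are ignored. `parse_release` and `is_patched` are hypothetical helper names.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

PATCHED = (1, 0, 1)  # first fixed release according to the article


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release, e.g. '1.0.1.post1' -> (1, 0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad short versions such as '1.1' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def is_patched(version: str) -> bool:
    """True if the numeric release is at or above the patched version."""
    return parse_release(version) >= PATCHED


def installed_starlette_version() -> Optional[str]:
    """Look up the installed Starlette version, or None if it is absent."""
    try:
        return metadata.version("starlette")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_starlette_version()
    if found is None:
        print("starlette is not installed in this environment")
    elif is_patched(found):
        print(f"starlette {found}: at or above 1.0.1")
    else:
        print(f"starlette {found}: older than 1.0.1, upgrade recommended")
```

Run it inside each deployed environment or container image, since a FastAPI application can carry an old Starlette without ever naming it in its own dependencies. A version check does not answer the second question above, whether the endpoint is publicly reachable at all.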

Round-ups

Uber says AI spend is getting harder to justify

Source: the-verge-ai

Uber burned through its 2026 AI budget in four months, and president Andrew Macdonald says rising Claude Code token consumption isn’t translating into matching engineering output. The candid pullback is a rare public ROI complaint from a major enterprise buyer.

OpenRouter hits $1.3B as AI infra decacorns mint fast

Source: techcrunch-ai, latent-space

OpenRouter’s $113M Series B led by CapitalG more than doubles its valuation in a year, riding 5x usage growth over six months. The round lands alongside Fireworks and Baseten crossing $10B, marking a new tier of model-routing and inference infrastructure winners.

Anthropic names KiYoung Choi to lead Korea ahead of Seoul office

Source: anthropic-news

Choi becomes Representative Director of Anthropic Korea as the company prepares to open a Seoul office, extending its APAC footprint after Tokyo. The hire signals deeper enterprise and government engagement in a market where Claude faces OpenAI and local LLM competition.

Human Archive pays Indian gig workers to film robot training data

Source: techcrunch-ai

Founded by UC Berkeley and Stanford researchers, the startup outfits gig workers with camera caps and sensor rigs to capture real-world manipulation footage. The bet: physical AI labs will pay premium rates for the embodied data that scraped web text can’t supply.

Hugging Face debuts $2,500 3D-printable bipedal robot

Source: ars-technica-ai

The open humanoid-legs project targets researchers and hobbyists priced out of $50K+ platforms, shipping printable parts and training code. It extends Hugging Face’s push into physical AI after its LeRobot library and earlier desktop arm releases.

Import AI 458 weighs near-term miracles against singularity risk

Source: import-ai

Jack Clark’s latest issue pairs a forecast of plausible 2026 AI breakthroughs with original fiction exploring a singularity scenario. The dual framing — concrete capability calls plus narrative — is Clark’s recurring vehicle for policy-adjacent foresight.

Ethan Mollick on resisting AI-generated sameness online

Source: one-useful-thing

Mollick observes social feeds filling with posts that read interchangeably as LLM-assisted writing spreads. His essay argues for deliberately preserving human voice and judgment as a competitive and cultural choice, not a nostalgic one.

Footnotes

  1. IBL.ai analysis of SWE-bench Pro https://ibl.ai/blog/open-source-ai-swe-bench-pro-2026

    Zai’s open-source GLM-5.1 achieved a score of 58.4, marginally outperforming OpenAI’s GPT-5.4 (57.7) and Anthropic’s Claude Opus 4.6 (57.3) in resolving real-world GitHub issues

  2. Context Studios, Claude Code ARR breakdown https://www.contextstudios.ai/blog/claude-code-25b-arr-what-it-means-for-builders

    Uber reportedly exhausted its entire 2026 AI budget in just four months after 84% of its engineering org adopted Claude Code’s agentic workflows

  3. CyberScoop on Anthropic Project Glasswing https://cyberscoop.com/anthropic-mythos-software-flaws-glasswing/

    Mythos identified over 10,000 high- or critical-severity software flaws… including a 27-year-old remote-crash vulnerability in OpenBSD and a 16-year-old flaw in FFmpeg that had previously withstood millions of automated fuzzing attempts

  4. DefenseScoop, Pentagon blacklist details https://defensescoop.com/2026/02/27/pentagon-threat-blacklist-anthropic-ai-experts-raise-concerns/

    Defense Secretary Pete Hegseth demanded ‘unrestricted access’ for ‘any lawful purpose,’ arguing that a private vendor should not possess ‘veto power’ over military operations

  5. Pearl Cohen, Anthropic v. DoD litigation https://www.pearlcohen.com/anthropic-sues-department-of-defense-over-supply-chain-risk-designation/

    A California federal judge granted a preliminary injunction, finding that the government likely violated Anthropic’s due process rights and First Amendment protections

  6. Forbes coverage of Magnifica Humanitas launch https://www.forbes.com/sites/aliciapark/2026/05/25/anthropic-billionaire-cofounder-joins-pope-leo-warns-ai-job-losses-will-spark-moral-imperative-of-historic-proportions/

    The Vatican’s presentation of the document featured an unprecedented appearance by Christopher Olah, co-founder of Anthropic, who acknowledged that even the most safety-focused labs operate under commercial pressures that ‘conflict with doing the right thing’

  7. Daniel Stenberg, ‘The end of the curl bug bounty’ (Jan 2026) https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/

    The rate of confirmed vulnerabilities dropped from a historical 15% to less than 5% as AI-assisted submissions spiked, prompting curl to end its HackerOne monetary program after paying out over $100,000 across 87 confirmed vulnerabilities.

  8. Daniel Stenberg, ‘High-quality chaos’ (Apr 2026) https://daniel.haxx.se/blog/2026/04/22/high-quality-chaos/

    After reopening reporting without bounties, report volume doubled but confirmed-vulnerability rate climbed back to 15–16% — better than pre-AI levels. ‘More convincing crap is worse than obvious crap.’

  9. Daniel Stenberg, ‘Mythos finds a curl vulnerability’ (May 2026) https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/

    Anthropic’s Mythos audit of curl’s 176,000 lines produced five ‘confirmed’ findings; manual triage reduced these to one low-severity bug, with three being documented API behaviour and one a non-security bug.

  10. LWN, Linux kernel security list discussion https://lwn.net/Articles/1074449/

    Linus Torvalds called the kernel’s private security mailing list ‘almost entirely unmanageable’ due to massive duplication from multiple researchers running the same AI tools; Willy Tarreau reported a jump from 2–3 reports/week to nearly 10/day.

  11. ZeroPath, ‘How ZeroPath won over curl with 170 valid bugs’ https://zeropath.com/blog/how-zeropath-won-over-curl-with-170-valid-bugs

    Researcher Joshua Rogers used ZeroPath to file nearly 170 valid bugs against curl, which Stenberg called ‘actually truly awesome’ — distinguishing human-verified AI-assisted research from ‘slop’.

  12. SecurityWeek, Tech giants invest $12.5M in OSS security https://www.securityweek.com/tech-giants-invest-12-5-million-in-open-source-security/

    OpenAI, Anthropic, Google and Microsoft committed $12.5 million in March 2026 to the Linux Foundation’s Alpha-Omega and OpenSSF, specifically to help maintainers handle the reporting volume their own AI tools created.

  13. MLQ.ai analysis https://mlq.ai/news/critical-badhost-flaw-in-starlette-exposes-millions-of-ai-agent-deployments-to-auth-bypass/

    the fix was implemented in commit 764dab0dcfb9… ensures that if a Host header contains malformed characters, the framework falls back to the safe scope[‘server’] value

  14. OSTIF disclosure post https://ostif.org/disclosing-the-badhost-vulnerability-in-starlette/

    an extraordinary step of stewardship… despite being an independent volunteer dealing with a large pile of other reports, the maintainer rapidly integrated the fix

  15. Risky Business Bulletin https://news.risky.biz/risky-bulletin-badhost-vulnerability-bypasses-authentication-on-ai-infrastructure/

    the official CVSS score of 6.5–7.0 ‘materially understates’ the risk given Starlette’s role as the routing core of modern Python web services

  16. CyberKendra technical writeup https://www.cyberkendra.com/2026/05/badhost-cve-2026-48710-one-rogue-header.html

    production deployments fronted by reverse proxies or CDNs such as Cloudflare, Nginx, or AWS ALBs inherently reject the malformed Host headers required for the exploit

  17. ProductNation / Knostic Shodan census https://productnation.co/my/31455/ai-agent-starlette-badhost-vulnerability-malaysia/

    a Shodan census identified 1,862 exposed MCP servers globally; manual verification of a sample showed 100% allowed unauthenticated access to internal tool listings

  18. NewsHeadlineAlert / disclosure-timing critique https://www.newsheadlinealert.com/news/millions-of-ai-agents-imperiled-by-critical-vulnerability-in-open-source-package-6a162d879876f

    the patch was finalized on May 21, 2026, and public disclosure followed just one day later on May 22, leaving DevOps teams with virtually no lead time before a holiday weekend

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare