Wei (Jack) Sun

Apple's Siri runs Gemini, OpenEnv risks a fork, AIFS savings drop to 21×

Apple hands Siri's top reasoning to Gemini, OpenEnv adopts committee governance, and ECMWF's 1,000× energy claim shrinks to 21× over the full lifecycle.

TL;DR

  • Apple licenses a custom Gemini derivative as Siri’s top reasoning tier on Google Cloud Blackwell.
  • Siri AI is indefinitely delayed in the EU after regulators rejected Apple’s DMA compromise.
  • OpenEnv moves to a 9-member committee with Meta, Nvidia, and Hugging Face backing.
  • Calendar Gym scores agents at 89% on explicit IDs and 41% under natural-language ambiguity.
  • ECMWF’s 1,000× energy claim drops to 21× once training is amortized across a year.

Today’s three AI-tech stories all turn on a layer the vendor’s headline doesn’t put up front. Apple’s new Siri AI is the lede, but the top reasoning tier is a custom Gemini derivative running on Google Cloud Blackwell B200s, with the EU rollout indefinitely delayed after regulators rejected Apple’s DMA workaround. OpenEnv moves to community governance under a 9-member committee that includes Hugging Face, Meta-PyTorch, Nvidia, and Prime Intellect, the same Prime Intellect already shipping a competing 2,500-task hub, so the fork risk is built into the launch. And ECMWF’s widely cited “1,000× less energy” claim for AI weather forecasting holds only per inference; amortize training across a year of runs and the realistic saving drops to roughly 21×, with pure ML still underestimating record-breaking events.

Apple outsources Siri’s top tier to Gemini on Google Cloud

Source: simon-willison · published 2026-06-08

TL;DR

  • Apple licensed a custom Gemini-derived model as Siri AI’s top reasoning tier, conceding the foundation-model race to Google.
  • Private Cloud Compute now runs on NVIDIA Blackwell B200s in Google Cloud for agentic and reasoning workloads.
  • Siri AI is delayed indefinitely in the EU after regulators rejected Apple’s “Trusted System Agent” compromise under the DMA.
  • Vision LLMs read the screen directly, letting Siri act on third-party apps without any App Intents integration work from developers.

The architecture Apple actually shipped

Apple’s 2026 WWDC AI story is, structurally, a concession. The top tier of the new Siri runs on a custom Gemini-derived model, and when that tier fires it doesn’t run on Apple silicon at all: it runs on NVIDIA Blackwell B200 GPUs inside Google Cloud, wrapped in Apple’s Private Cloud Compute attestation stack [1]. Apple’s own foundation models handle lighter on-device and small-PCC workloads; Google handles the hard stuff.

flowchart LR
    A[On-device<br/>Apple foundation model] -->|fallback| B[PCC on Apple Silicon]
    B -->|agentic / reasoning| C[PCC on Google Cloud<br/>NVIDIA B200 + Titan + Intel TDX]
    D[Vision LLM<br/>screen capture] --> A
    D --> B
    D --> C

The privacy story has real receipts. The Google Cloud tier combines NVIDIA Confidential Computing on B200s, Intel TDX CPUs, and Google Titan security chips, with Apple maintaining a cryptographically verifiable ledger of every piece of hardware in the fleet to defend against supply-chain substitution [1]. Binaries are published for inspection, as on Apple-silicon PCC, though independent researchers note they remain unsymbolicated and non-reproducible, which dulls the audit story somewhat.
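To make "cryptographically verifiable ledger" concrete, here is a minimal sketch of one generic way such a ledger can work: an append-only hash chain, where each entry's hash covers the previous entry's hash. Nothing cited here describes Apple's actual ledger format, so the record fields (`serial`, `kind`) and the function names are purely illustrative assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(ledger, record):
    """Append a hardware record, chaining it to the current tail."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "hash": entry_hash(prev, record)})


def verify(ledger):
    """Recompute every hash; any edited or swapped record breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


# Hypothetical fleet entries, for illustration only.
fleet = []
append(fleet, {"serial": "gpu-0001", "kind": "accelerator"})
append(fleet, {"serial": "cpu-0001", "kind": "host"})
```

The point of the chain is that substituting one piece of hardware after the fact changes that entry's hash and every hash after it, so a verifier holding only the latest hash can detect the swap.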

The other quiet architectural bet is the use of vision LLMs to extract information directly from the screen. That sidesteps the entire App Intents project: every existing iOS app becomes Siri-controllable without the developer shipping a single line of integration code. In 2024 this wasn’t credible; in 2026 it is.

The bills come due elsewhere

The strategic question is whether renting your assistant’s brain from your largest competitor is a clever buy decision or a structural surrender. Ming-Chi Kuo’s framing is blunt: Google now “sets the ceiling” for Apple’s AI experience, and if Apple can’t out-engineer Google on top of Google’s own model, the premium iPhone differentiation thins out [2]. The deal also draws antitrust attention that the search-default arrangement never quite did: commentators are already calling the Gemini-Siri pipeline “exclusive in effect” and a candidate for the next Microsoft-style tying case rather than another App Store skirmish [3].

The regulatory bill is already landing. EU regulators rejected Apple’s proposed “Trusted System Agent” and 18-month phased rollout, insisting that any system-level AI hook be opened to Gemini, OpenAI, and others; the result is an indefinite delay of Siri AI on iOS and iPadOS in the EU [4]. China is excluded for separate data-localization reasons. “Siri AI” is, for now, a US-first feature.

Believe it when you see it

John Gruber, no stranger to Apple’s recent AI vaporware cycle, flagged the conspicuous absence of the rumored “Siri AI Extensions” that would have let users swap Gemini for Claude or ChatGPT, whether cut late or simply the next thing to slip [5]. Early hands-on signals are muted: MacRumors reports the iOS 27 beta waitlist works but ships with no in-system notification, so testers install the beta, see no new Siri, and assume it’s broken [6].

The architecture is the most interesting part of the keynote. Whether the product clears the bar Apple set for itself in 2024 is a question the waitlist, not the slide deck, will answer.


OpenEnv lands Meta, Nvidia, HF backing as agent RL standard

Source: huggingface-blog · published 2026-06-08

TL;DR

  • OpenEnv moves to community governance under a 9-member committee including Hugging Face, Meta-PyTorch, Nvidia and Prime Intellect.
  • OpenEnv is a protocol layer, not a trainer — Gymnasium-style reset/step/state served over HTTP and Docker, with MCP first-class.
  • Launch benchmark Calendar Gym shows agents at 89% on explicit-ID tasks, 41% under natural-language ambiguity.
  • Competes for mindshare with committee-member Prime Intellect’s own 2,500-task Environments Hub, raising fork risk.

What actually shipped

The news isn’t the code — OpenEnv has been public since the October PyTorch Conference — it’s the governance handoff. Hugging Face now hosts the repo, and a nine-member committee (Meta-PyTorch, Reflection, Unsloth, Modal, Prime Intellect, Nvidia, Mercor, Fleet AI, HF) controls the spec. The pitch is straightforward: closed labs ship models co-trained with bespoke harnesses (“GPT-5.5 plus its tools”), and open-source loses ground because every trainer reinvents the environment glue. OpenEnv tries to be that glue.

Concretely, it standardises a thin contract between three roles:

flowchart LR
    T[Trainer<br/>TRL / Unsloth / SkyRL / prime-rl] -->|reset/step/state| P[OpenEnv protocol<br/>HTTP + WebSockets]
    P --> E[Environment container<br/>Docker + MCP tools]
    E -->|observations, rewards| P
    P --> T
    E -.->|same image.| Prod[Production deployment]

The “same image in training and prod” claim is the load-bearing one — it’s what justifies the Docker tax over an in-process Gymnasium loop.
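The contract is easier to picture with a toy. The following is a minimal in-process sketch of a Gymnasium-style reset/step/state interface; it is not OpenEnv's actual API, and `CounterEnv`, `StepResult`, and `rollout` are hypothetical names invented for illustration. In OpenEnv the same three calls would cross an HTTP or WebSocket boundary into a container rather than run in one process.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


@dataclass
class CounterEnv:
    """Toy environment: the episode ends once `count` reaches `target`."""
    target: int = 3
    count: int = field(default=0, init=False)
    steps: int = field(default=0, init=False)

    def reset(self) -> StepResult:
        """Start a fresh episode and return the first observation."""
        self.count = 0
        self.steps = 0
        return StepResult({"count": self.count}, reward=0.0, done=False)

    def step(self, action: dict) -> StepResult:
        """Apply one action; reward 1.0 only on the step that hits the target."""
        self.steps += 1
        if action.get("tool") == "increment":
            self.count += 1
        done = self.count >= self.target
        return StepResult({"count": self.count}, 1.0 if done else 0.0, done)

    def state(self) -> dict:
        """Expose episode bookkeeping separately from the agent's observation."""
        return {"count": self.count, "steps": self.steps, "target": self.target}


def rollout(env, policy, max_steps=10):
    """Trainer-side loop: it only ever touches reset, step, and state."""
    result = env.reset()
    total = 0.0
    for _ in range(max_steps):
        if result.done:
            break
        result = env.step(policy(result.observation))
        total += result.reward
    return total, env.state()


total, final_state = rollout(CounterEnv(), lambda obs: {"tool": "increment"})
```

Because `rollout` depends on nothing but the three methods, the environment behind them can be swapped for a remote one without touching trainer code, which is the property the protocol layer is selling.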

The Calendar Gym reality check

Turing’s Calendar Gym is the headline evaluation, exposing 25+ MCP tools for scheduling under realistic permissioning [7]. The numbers are honest in a way launch benchmarks usually aren’t: 89% success when entities are passed as explicit IDs, 41% once the prompt refers to them in ambiguous natural language [8]. Independent trace analysis attributes the majority of failures to malformed tool arguments rather than wrong tool choice, a reasoning-to-execution gap a harness standard can expose but can’t fix. Read alongside Unsloth’s reported 2–6× RL throughput and ~70% VRAM reduction, OpenEnv’s value proposition reads as “cheaper iteration on a problem we now have a thermometer for,” not “we solved agent RL.”
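The distinction between "wrong tool" and "malformed arguments" can be sketched as a small trace classifier. This is an illustration under stated assumptions, not the analysis behind the cited numbers: the tool names, the argument schemas in `TOOL_SCHEMAS`, and the sample traces are all hypothetical and are not taken from Calendar Gym.

```python
from collections import Counter

# Hypothetical schemas: required argument name -> expected Python type.
TOOL_SCHEMAS = {
    "create_event": {"calendar_id": str, "title": str, "start": str, "end": str},
    "delete_event": {"calendar_id": str, "event_id": str},
}


def classify_call(expected_tool, call):
    """Label one tool call as 'ok', 'wrong_tool', or 'malformed_args'."""
    if call.get("tool") != expected_tool:
        return "wrong_tool"
    args = call.get("args")
    if not isinstance(args, dict):
        return "malformed_args"
    for name, expected_type in TOOL_SCHEMAS[expected_tool].items():
        if name not in args or not isinstance(args[name], expected_type):
            return "malformed_args"
    return "ok"


def tally(traces):
    """Count failure modes over (expected_tool, actual_call) pairs."""
    return Counter(classify_call(expected, call) for expected, call in traces)


# Made-up traces: right tool with complete args, right tool missing `end`,
# wrong tool entirely, and right tool with an integer where a string ID belongs.
traces = [
    ("create_event", {"tool": "create_event", "args": {
        "calendar_id": "cal_1", "title": "Sync",
        "start": "2026-06-09T10:00", "end": "2026-06-09T10:30"}}),
    ("create_event", {"tool": "create_event", "args": {
        "calendar_id": "cal_1", "title": "Sync", "start": "tomorrow at 10"}}),
    ("delete_event", {"tool": "create_event", "args": {}}),
    ("delete_event", {"tool": "delete_event", "args": {
        "calendar_id": "cal_1", "event_id": 42}}),
]
counts = tally(traces)
```

A harness that logs calls in a uniform shape makes this kind of breakdown cheap to compute, which is the "thermometer" half of the argument; it does nothing by itself to make the agent emit better arguments.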

A crowded plumbing layer

The awkward subtext is that Prime Intellect, now on the governance committee, runs its own Environments Hub, marketed as “the GitHub for RL environments” with thousands of train-ready tasks, plus its own verifiers library and prime-rl trainer [9]. The official line is complementarity (verifiers ships an OpenEnvEnv wrapper), but the XKCD-927 risk (a standard meant to unify the field that simply becomes one more competing standard) is real: a unifying layer on top of SkyRL, TRL, verl and Terminal-Bench can quietly become one more adapter target. DeepFabric’s hands-on writeup is blunt about the cost: tasks that take minutes to scaffold in Gymnasium “may require multiple iterations and Docker configurations” in OpenEnv [10].

What’s still unresolved

The repo’s own README warns it’s experimental with API churn, and asks contributors to clear breaking changes with the committee [11], which is useful context for anyone treating the coalition announcement as a 1.0. The sharper dissent comes from the evals-vs-vibes camp: Claude Code reportedly shipped without a public eval harness at all, and critics argue elaborate RL hubs become echo chambers where popularity substitutes for utility [12]. Safety questions (reward hacking in sandboxed interpreters, prompt-injection surface in long-running MCP sessions) are explicitly punted to downstream reward libraries via RFC 007.

The two signals worth tracking: whether Calendar Gym’s 41% number moves under OpenEnv-trained models, and whether the Prime Intellect hub and HF repo stay one ecosystem or quietly fork.


ECMWF’s AIFS saves 21× energy over its lifecycle, not 1,000× per run

Source: ars-technica-ai · published 2026-06-08

TL;DR

  • ECMWF’s “1,000× less energy” figure measures one inference, not the model’s full lifecycle.
  • Amortizing training costs drops realistic full-year savings to roughly 21×.
  • Pure ML forecasters lowball record-breaking events, with error growing as records are exceeded.
  • Google’s NeuralGCM beats IFS on top-0.1% precipitation by keeping a physics core.

The 1,000× energy claim is per-inference

The Ars piece faithfully reports ECMWF’s headline: an AIFS run uses ~1,000× less energy than the physics-based IFS, and completes in 3 minutes instead of 30. That number is real, but it measures one forecast cycle in isolation. A lifecycle accounting that amortizes training puts the realistic full-year savings closer to 21× less energy than traditional NWP [13]. Training is not cheap: Huawei’s Pangu-Weather alone required 192 NVIDIA V100 GPUs running for 16 days [14], and that’s one model among many being iterated by Google, Nvidia, Microsoft, and the national centers. Twenty-one-fold is still a huge win for operational meteorology, but the gap between the marketing figure and the honest one is itself a factor of roughly 48, well over an order of magnitude.
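The amortization arithmetic can be sketched in a few lines. The energy values below are arbitrary units chosen so that the per-run ratio is 1,000×, and the four-runs-a-day cadence is an assumption; neither is a measured figure. The training cost is back-solved from the cited 21× figure rather than taken from any source, so treat it only as a way to see how large training must be relative to inference for the headline ratio to collapse that far.

```python
def lifecycle_saving(physics_per_run, ml_per_run, ml_training, runs):
    """Physics energy divided by ML energy over `runs` forecasts, training included."""
    return (physics_per_run * runs) / (ml_training + ml_per_run * runs)


def implied_training_cost(physics_per_run, ml_per_run, runs, target_saving):
    """Training energy that would yield `target_saving` over `runs` forecasts."""
    return physics_per_run * runs / target_saving - ml_per_run * runs


RUNS_PER_YEAR = 4 * 365   # assumption: four forecast cycles per day
PHYSICS_PER_RUN = 1000.0  # arbitrary units, set so the per-run ratio is 1,000x
ML_PER_RUN = 1.0

# Ignoring training reproduces the per-inference headline.
per_inference = lifecycle_saving(PHYSICS_PER_RUN, ML_PER_RUN, 0.0, RUNS_PER_YEAR)

# Back-solve the training cost that would give the cited full-year figure.
training = implied_training_cost(PHYSICS_PER_RUN, ML_PER_RUN, RUNS_PER_YEAR, 21.0)
full_year = lifecycle_saving(PHYSICS_PER_RUN, ML_PER_RUN, training, RUNS_PER_YEAR)

# Scale check from the Pangu-Weather figures in the text: 192 GPUs for 16 days.
PANGU_GPU_HOURS = 192 * 16 * 24  # 73,728 GPU-hours
```

Under these assumptions the implied training cost is dozens of times the energy of a full year of inference, which is why the per-run number says so little about the lifecycle one. More runs per year, or a model retrained less often, would push the lifecycle ratio back up toward the headline.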

The extremes gap is structural, not a quirk

Ars notes that ML models underestimate record-breaking events. The independent literature treats this as the central indictment of pure data-driven forecasting, not a footnote. The Karlsruhe team’s Science Advances analysis found AI models (GraphCast, Pangu-Weather, FuXi) consistently predicted less extreme values than what actually occurred, and the magnitude of the error grew larger as the record exceedance increased [15]. Physics-based models still beat AI for predicting extreme weather events [16]. The University of Chicago group sharpens the mechanism: models trained on ERA5 reanalysis regress toward statistically likely outcomes, defaulting away from the physically possible but rare “gray swan” extremes [17].

“AI models default to more moderate, statistically likely outcomes rather than the physically possible but rare extremes often termed ‘gray swans’.”

This is the opposite of what you want from an early-warning system. Smoothing isn’t a tunable bug — it’s what regression toward the training distribution does.
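A toy regression shows why smoothing falls out of the objective itself. This is a deliberately simplified illustration, not the method of any paper cited here: the data are synthetic, the "model" is a one-variable least-squares line, and the only claim it demonstrates is the general one that a squared-error fit predicts a conditional mean, whose spread can never exceed the spread of the observations.

```python
import random
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "weather": a predictable signal plus variability the inputs cannot see.
signal = [rng.gauss(0.0, 1.0) for _ in range(2000)]
truth = [s + rng.gauss(0.0, 1.0) for s in signal]

a, b = fit_line(signal, truth)
pred = [a + b * s for s in signal]

# The fitted predictions are less spread out than what actually happened.
spread_pred = pvariance(pred)
spread_truth = pvariance(truth)
```

The largest observed values sit in the tail that the fit has averaged away, so the shortfall is biggest exactly where an early-warning system needs accuracy most. Changing the loss or adding physical structure changes that behavior; tuning the same objective harder does not.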

Hybrids are already winning

Ars introduces Caltech’s CliMA and Tapio Schneider’s parameterization work as one promising option. The independent evidence suggests it’s the convergent trajectory. Google’s NeuralGCM, which couples a differentiable dynamical core to learned components, demonstrated superior skill on top-0.1% precipitation and remained stable across 15-day weather and multi-decadal climate simulations, exactly the regimes where pure ML breaks down [18]. Operational centers appear to be following the same pattern, reintroducing physical guardrails (mass conservation, output bounding to suppress artifacts like negative precipitation) that the first generation of pure-ML forecasters lacked.

flowchart LR
    A[ERA5 reanalysis] --> B[Differentiable physics core]
    B --> C[ML parameterizations<br/>clouds, snow, convection]
    C --> D[Bounded outputs<br/>mass/energy conservation]
    D --> E[Forecast or<br/>multi-decadal run]
    B -.stability.-> E
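The "bounded outputs" box can be illustrated with a minimal post-processing sketch: clamp unphysical negative precipitation to zero, then rescale so the field total is unchanged. This is one simple way to combine output bounding with a conservation constraint, offered as an assumption-laden example; it is not the scheme used by NeuralGCM or any operational center, and `bound_and_conserve` is a hypothetical name.

```python
def bound_and_conserve(precip, eps=1e-12):
    """Clamp negative precipitation to zero, then rescale to keep the field total."""
    total = sum(precip)
    clamped = [max(p, 0.0) for p in precip]
    clamped_total = sum(clamped)
    if total <= 0.0 or clamped_total <= eps:
        # Nothing physical to distribute: return a dry field.
        return [0.0] * len(precip)
    scale = total / clamped_total
    return [p * scale for p in clamped]


# Raw network output with one unphysical negative cell (arbitrary units).
raw = [2.0, -0.5, 1.5, 0.0]
fixed = bound_and_conserve(raw)
```

Clamping alone would add water to the system, because removing a negative cell raises the total; the rescale step takes that excess back out proportionally from the wet cells.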

What’s actually at stake

The Ars framing — “evolution, not revolution” — undersells the architectural verdict. The productive question isn’t whether ML beats physics; it’s which subroutines you replace and which you protect. Cloud behavior in a warming world has no training analogue, so it stays physical. Snow cover and small-scale mixing have present-day data, so they go to neural nets. Forecasters get the 3-minute inference; the dynamical core keeps the mass-conservation grown-ups in the room. That’s the consensus the operational centers are converging on, even if the marketing decks still lead with 1,000×.

Footnotes

  1. Neowin. https://www.neowin.net/news/apple-is-expanding-private-cloud-compute-beyond-its-own-data-centers/

    Apple utilizes NVIDIA Confidential Computing on Blackwell B200 GPUs, Intel CPUs with TDX, and Google’s Titan security chips… Apple maintains a cryptographically verifiable ledger of all Google Cloud hardware in the PCC fleet to mitigate supply chain attacks.

  2. Medium / Creative Compiler citing Ming-Chi Kuo. https://medium.com/@creativecompiler/apple-google-and-the-ai-dependency-they-cant-escape-2e88203bdc4b

    By using Gemini, Google effectively ‘sets the ceiling’ for Apple’s AI experience… if Apple cannot innovate on top of the model better than Google does on Android, it risks losing its premium differentiation.

  3. The Antitrust Attorney blog. https://www.theantitrustattorney.com/apples-gemini-siri-deal-is-the-next-microsoft-antitrust-case-not-the-next-app-store-fight/

    The Gemini integration creates a second exclusive pipeline that is ‘exclusive in effect,’ even if not formally branded as such — foreclosing rivals from the most lucrative distribution channel, the iPhone’s native assistant.

  4. Engadget. https://www.engadget.com/2189932/siri-ai-for-iphones-and-ipads-will-be-delayed-indefinitely-in-the-eu/

    Siri AI for iPhones and iPads will be delayed indefinitely in the EU… Apple’s proposed ‘Trusted System Agent’ and an 18-month phased rollout were rejected by regulators who insist any system-level AI access must be opened to third-party rivals like Gemini and OpenAI.

  5. Daring Fireball (John Gruber). https://daringfireball.net/

    Apple had ‘burned its reputation’ over previous years by announcing Siri features that frequently failed to materialize… [Gruber] specifically questioned the absence of ‘Siri AI Extensions’ — rumored tools that would allow users to swap Gemini for other assistants like Claude or ChatGPT.

  6. MacRumors. https://www.macrumors.com/2026/06/08/ios-27-siri-ai-waitlist/

    Users must manually opt-in to a ‘Siri AI’ waitlist after installation… a common point of criticism is the lack of a system-level notification informing users that a waitlist exists; many updated to the beta and were confused when the new Siri interface failed to appear.

  7. Hugging Face blog, OpenEnv x Turing Calendar Gym. https://huggingface.co/blog/openenv-turing

    Calendar Gym… exposes over 25 MCP tools for scheduling and coordination tasks, testing an agent’s ability to handle permissions and partial information

  8. Turing blog (Calendar Gym benchmark). https://www.turing.com/blog/evaluating-tool-using-agents-in-production-oriented-environments-with-openenv

    agents achieved an 89% success rate on tasks with explicit identifiers, [but this] plummeted to 41% when faced with natural language ambiguity

  9. Prime Intellect, Environments Hub launch. https://www.primeintellect.ai/blog/environments

    the ‘GitHub for RL environments,’ serving as a centralized marketplace for sharing and discovering train-ready tasks

  10. DeepFabric, Introduction to OpenEnv. https://www.deepfabric.dev/blog/introduction_to_openenv

    scaffolding a simple task can take minutes in an in-process framework like Gymnasium but may require multiple iterations and Docker configurations in HTTP-based frameworks like OpenEnv

  11. huggingface/OpenEnv GitHub repo. https://github.com/huggingface/OpenEnv

    OpenEnv is in an ‘experimental stage’ with frequent API changes… contributors [should] coordinate significant changes with the technical committee to maintain API compatibility

  12. pashpashpash Substack, ‘A response to everyone bashing evals’. https://pashpashpash.substack.com/p/a-response-to-everyone-bashing-evals

    Claude Code reportedly launched without traditional public evaluation frameworks, relying instead on domain-expert feedback and ‘vibes’… elaborate RL hubs can become an ‘echo chamber’ where popularity is prioritized over real-world utility

  13. Towards Data Science, ‘Rethinking Environmental Costs of Training AI’. https://towardsdatascience.com/rethinking-environmental-costs-of-training-ai-why-we-should-look-beyond-hardware/

    even when accounting for this heavy upfront training cost, AI models are estimated to consume at least 21 times less energy than traditional systems over a one-year operational cycle

  14. Huawei Cloud blog, Pangu-Weather training details. https://www.huaweicloud.com/intl/en-us/about/blogs/20230707.html

    Pangu-Weather’s training involved 192 NVIDIA V100 GPUs running for 16 days

  15. Karlsruhe Institute of Technology (Science Advances preprint repository). https://publikationen.bibliothek.kit.edu/1000192852

    AI models tended to predict less extreme values than what actually occurred… the magnitude of these errors grew larger as the record exceedance increased

  16. Physics World. https://physicsworld.com/a/physics-based-models-still-beat-ai-for-predicting-extreme-weather-events/

    Physics-based models still beat AI for predicting extreme weather events

  17. University of Chicago Climate Institute, ‘Forecasting the Unseen: AI Weather Models and Gray Swan Extreme Events’. https://climate.uchicago.edu/insights/forecasting-the-unseen-ai-weather-models-and-gray-swan-extreme-events/

    AI models default to more moderate, statistically likely outcomes rather than the physically possible but rare extremes often termed ‘gray swans’

  18. Science Media Centre, expert reaction to NeuralGCM. https://staging.sciencemediacentre.org/expert-reaction-to-machine-learning-model-for-accurate-weather-predictions-and-climate-simulations/

    NeuralGCM demonstrated superior skill in predicting precipitation, particularly for extreme rainfall events in the top 0.1%… its physics-based core prevents drift, allowing realistic 15-day forecasts and multi-decadal climate simulations

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare