JS Wei (Jack) Sun

GM cuts crash sims to 60s, IBM scaffolds enterprise LLMs, Google answers MCP

GM swaps 15-hour FEA for one-minute ML inference, IBM wraps LLMs in agent logic, Google ships an A2UI counter to MCP.

GM cuts crash sims to 60s, IBM scaffolds enterprise LLMs, Google answers MCP

TL;DR

  • GM replaced 15-hour crash simulations with 60-second PhysicsNeMo inference, training cost excluded.
  • Chalmers benchmark finds ML surrogates miss peak in-crash head acceleration, the signal injury criteria need.
  • IBM Research reports 30× fewer tokens on COBOL with Mistral Medium plus program analysis.
  • Google’s A2UI ships declarative JSON components, a direct counter to Anthropic and OpenAI’s MCP Apps.
  • Antigravity developers hit 7-day organizational lockouts the same week as the I/O recap.

Three independent AI-tech stories today, sitting in different parts of the stack. GM is leaning on NVIDIA’s PhysicsNeMo to compress crash and FEA runs from 15-18 hours down to about a minute — though that number is inference only, and a fresh Chalmers benchmark catches the surrogates missing the peak head-acceleration spikes that injury criteria actually depend on. IBM Research is making the parallel argument one layer up: frontier LLMs alone don’t carry enterprise workflows, and wrapping them in knowledge graphs, program analysis, and policy-as-code is what produces the 30× and 4× lifts in their writeup.

And Google’s I/O 2026 recap is doing two things at once — the coffee-ordering demo is a quiet shot at Anthropic and OpenAI’s MCP Apps, branded with a Nano Banana visual identity that has documented drift on faces and logos. The same week, Antigravity developers were locked out of their orgs for seven days. A Simon Willison Codex desktop experiment closes out the briefs.

GM swaps 15-hour crash simulations for one-minute ML inference

Source: ars-technica-ai · published 2026-06-01

TL;DR

  • GM’s crash and FEA simulations now finish in ~60 seconds vs. 15–18 hours, using NVIDIA PhysicsNeMo surrogates.
  • The 1-minute figure is inference only; each surrogate needs 100s–1,000s of high-fidelity FEA runs to train.
  • A Chalmers benchmark finds ML surrogates miss peak in-crash head acceleration, the signal injury criteria depend on.
  • CPO Sterling Anderson ($40M package) is now floated as Mary Barra’s likely successor.

What GM actually shipped

GM’s pitch — three “epochs” of engineering culminating in probabilistic, AI-driven co-simulation — is wrapped around a concrete tooling decision: the automaker has adopted NVIDIA’s PhysicsNeMo framework, running MeshGraphNet and Transolver architectures to predict structural deformation in Body-in-White crash scenarios without the full finite-element overhead 1. That’s where the headline “15 hours to one minute” for FEA, and “15–18 hours to under 60 seconds” for crash, comes from. Independent coverage of the same framework puts the per-evaluation speedup in the 500×–900× range, which makes GM’s numbers credible rather than marketing inflation 2.

The asterisk GM didn’t print

What the Ars piece glosses over is the cost structure of a surrogate model. Training “a reliable surrogate typically requires hundreds or thousands of high-fidelity FEA runs” before you ever get the millisecond inference 2. The one-minute figure is what an engineer sees at their desk; it is not the total compute cost when a body panel changes enough to invalidate the training distribution.

The accuracy story has a sharper edge. A SAFER/Chalmers benchmark on a combined pre-crash/in-crash ML pipeline found the surrogates predicted head displacement, rib strain, and seatbelt forces well — but head acceleration during the peak in-crash phase exhibited much lower accuracy 3. That is precisely the signal HIC and other injury criteria are computed from. The reasonable read: ML surrogates are excellent for design-space exploration and low-frequency structural response, and you still want a physics solver in the loop before you certify anything. GM’s public materials don’t disclose where it draws that line.

Why this is also a CEO audition

Sterling Anderson joined GM in May 2025 as Chief Product Officer on a reported $40M compensation package, with end-to-end authority across ICE and EV portfolios 4. By December, Electrek was reporting he is “increasingly viewed as a potential successor to CEO Mary Barra,” explicitly tied to GM’s pivot away from the Cruise robotaxi strategy toward eyes-off personal autonomy 5. The “three epochs” framing isn’t a CPO’s slide deck — it’s the public thesis of a likely next CEO. Which means the simulation claims need to survive more scrutiny than a typical engineering keynote.

Compressing engineering vs. compressing updates

GM’s bet is that collapsing the engineering cycle wins the next decade: more virtual experiments, fewer physical prototypes, shorter time to a hardened design. Toyota is betting the other direction. Its Arene OS, debuting in the 2026 RAV4, is built to stretch hardware generations toward a decade while keeping vehicles “modern” through OTA software updates 6. Both companies agree the old mechanical-refresh cadence is dead. They disagree about which half of the product lifecycle is the bottleneck — and the answer implies very different org charts, capex profiles, and supplier relationships.

The honest scorecard: GM’s surrogate-modeling stack is real and the speedups are defensible. The unanswered questions are how often those surrogates need to be retrained on new geometries, and whether GM is candid internally about where ML stops and physics has to take over.


IBM wraps frontier LLMs in “agent logic” for enterprise work

Source: huggingface-blog · published 2026-06-01

TL;DR

  • IBM Research argues frontier LLMs alone fail enterprise workflows without scaffolding.
  • The fix: knowledge graphs, program analysis, and policy-as-code wrapped around the model.
  • 30× fewer tokens on COBOL modernization with Mistral Medium 250B plus program analysis.
  • 4× over ReAct on the ITBench incident benchmark, using Gemini 2.5 Flash.
  • Independent reruns put Claude 4.7 and GPT-5.5 below 50% on Kubernetes root-cause analysis.
  • CUGA’s 15–26% accuracy lift assumes heavy bespoke “Playbook” authoring the post never prices.

The thesis: scaffolding beats scale

IBM Research’s pitch is blunt: bigger context windows and better frontier models won’t get enterprises to reliable, long-running agents. The bottleneck is structure, not intelligence. Their answer is a set of “agent logic” primitives — knowledge graphs that map microservices and database schemas, static program analysis libraries, adaptive planners with feedback loops, and policy-as-code enforced at runtime rather than via prompts. The LLM stays in the loop, but its job shrinks: reason over pre-indexed, deterministic structure rather than raw, noisy context.

flowchart LR
    A[Raw enterprise context<br/>code, tickets, schemas, policies] --> B[Agent logic layer<br/>KGs · program analysis · policy-as-code]
    B --> C[Bounded reasoning task]
    C --> D[Frontier LLM<br/>GPT-5.1 · Claude 4.5 · Mistral]
    D --> E[Verified action]
    E -. feedback .-> B

The receipts

The blog stacks four flagship results across different model sizes, and the pattern is consistent: a small specialized model plus heavy scaffolding beats a frontier model running ReAct alone.

SystemModelHeadline gainToken efficiency
WCA4Z (COBOL/PL-1 modernization)Mistral Medium 250Bmatched or exceeded baseline LLM~30× fewer tokens
Aster (test generation)Devstral 24B+20–45% coverageup to 15× fewer tokens
I3 (incident response)Gemini 2.5 Flash4.0× over ReAct on ITBench3.7–5.9× fewer tokens
Maximo (industrial assets)GPT OSS 120B20 min → 15–30 sec77% fewer tokens on AssetOpsBench

The CUGA healthcare agent adds a different shape of evidence: a 15–26% accuracy lift across Claude 4.5 Opus, GPT OSS 120B, and GPT-4.1 from policy-as-code alone, no fine-tuning. That portability across model families is the strongest argument in the post — the gains aren’t a quirk of one checkpoint.

The asterisks independent reviewers add

The benchmark IBM dominates is one where everyone is still failing badly. An independent ITBench-AA reimplementation found Claude 4.7 and GPT-5.5 below 50% accuracy on Kubernetes RCA, and IBM’s own baselines resolved 11.4% of SRE scenarios and 0% of initial FinOps tasks 7. A 4× lift on those numbers is real but bounded — a large multiple of a small base. The joint IBM/Berkeley MAST analysis names the dominant failure mode the blog skips: “Incorrect Verification,” where agents declare victory without checking the alert actually cleared 8. That reframes agent logic’s win as governance, not reasoning.

CUGA’s story has the same shape. Peer-reviewed follow-up shows success rates collapse to ~38% on Level 3 branching tasks with unexpected tool outputs 9, and practitioners describe setup as “cumbersome” versus CrewAI or LangChain, requiring substantial domain-specific Playbook authoring to reach the cited gains 10. The post never quantifies that integration cost.

Production economics are the third asterisk. Futurum’s 2026 vendor review — which puts IBM’s stack alongside Microsoft’s Magentic-One, Salesforce Atlas, and Google’s Vertex Reasoning Engine, all converging on the same neuro-symbolic pattern — flags a 37% gap between lab and production, with up to 50× token-cost variance in multi-agent loops 11. CrewAI’s own engineers add a sequencing warning relevant to IBM’s policy-as-code emphasis:

Companies often build security before ensuring their agents work… over-instrumentation can kill throughput. 12

Takeaway

IBM’s architectural bet — that enterprise agents need deterministic scaffolding, not just smarter models — is the direction the whole field is moving. The verifiable win is governance: bounded context, auditable policy, and forced verification steps that catch the “declared victory” failure mode. The numbers in the post are best read as upper bounds from controlled benchmarks, not deployment-ready economics. The question for buyers isn’t whether agent logic helps; it’s whether the Playbook-authoring tax is cheaper than the alternative of waiting for a frontier model that doesn’t need it.


Google’s I/O 2026 recap buries an A2UI vs MCP fight

Source: google-ai-blog · published 2026-06-01

TL;DR

  • A2UI, the protocol under Google’s coffee demo, ships declarative JSON from a host-whitelisted component catalog.
  • That’s a direct counter to Anthropic and OpenAI’s iframe-based MCP Apps — a quiet protocol war.
  • Nano Banana, credited with the entire visual identity, has documented drift on faces and logos across edits.
  • Same week as the recap, Antigravity developers hit 7-day organizational lockouts after a few hours of work.

The coffee cup is a protocol bet

Strip the latte art off “Antigravity Coffee Co.” and what’s left is Google’s opening move in a UI-protocol war. A2UI, the scheme Flutter clients used to render those generative coffee interfaces, transmits declarative JSON blueprints — and crucially, agents can only request components from a host-defined Trusted Catalog, which structurally blocks code injection 13. Anthropic and OpenAI’s MCP Apps take the opposite posture: sandboxed web widgets shipped via iframe, with the server owning the visual experience but routinely breaking the host’s design system 14.

AxisA2UI (Google)MCP Apps (Anthropic/OpenAI)
TransportDeclarative JSON blueprintsBundled web widgets in iframe
Component sourceHost-whitelisted catalogServer-shipped, sandboxed
Brand consistencyHigh (native components)Low (visual breaks common)
Third-party flexibilityLow (catalog-bound)High (arbitrary UI)
Injection surfaceMinimal by constructionIframe sandbox dependent

Google’s blog post buries this as a Flutter party trick. It’s actually the most consequential thing in the recap.

Nano Banana’s pixel-perfect tool, explained

Google says a “custom tool in AI Studio” enforced pixel-perfect consistency for the TPU Training Day film. Independent reviews of Nano Banana suggest why that tool had to exist: facial expressions and brand logos drift across refinement rounds, fine-grained edits in busy scenes are unreliable, and outputs ship with mandatory visible plus SynthID watermarks that block clean commercial use 15. The branding workflow Google describes — flat icons promoted to textured 3D while preserving shape and lighting — is exactly the failure mode practitioners flag. The custom consistency tool reads less like a flourish and more like a patch.

The hardware story Google didn’t tell

Jellectronica gets two sentences in the recap and deserves more. The pipeline — YOLOv8 vision model streaming jellyfish movement into Lyria-generated stems over WebSocket — runs on a Synaptics Coralboard whose 1 TOPS Coral NPU is the first generation to support transformer workloads at the silicon level, not just convolutional nets 16. That’s why the whole loop fits on-device with no cloud round-trip. It’s a credible edge-AI story Google chose to file under “immersive experience.”

The week the recap doesn’t mention

While the marketing team celebrated Antigravity as a creative platform, its developer users were in open revolt. Sanj.dev documented “7-day organizational lockouts” hitting after only hours of intensive work, with single complex requests silently exhausting a week’s quota despite advertised 5-hour rolling refresh windows 17. Google has since walked back some of this (not counting failed requests, for instance), but the gap between the I/O collage and the developer experience is wide.

A single complex request could silently exhaust a week’s worth of tokens. 17

The benchmark picture is similarly mixed: Gemini 3.5 Flash trades blows with GPT-5.5 and Claude 4.7 on Chatbot Arena (~1504 Elo) but regresses against Google’s own 3.1 Pro on ARC-AGI-2 reasoning and long-document retrieval 18.

Takeaway

The recap is a collage. The substance underneath it — a declarative agent-UI protocol, the first transformer-native edge NPU, a generative image model with brand-drift caveats, and a developer-tools quota crisis — is four separate stories Google would rather you read as one happy montage.

Round-ups

Simon Willison clones Claude’s paste-to-file UX with Codex desktop

Source: simon-willison

Willison prototyped a browser-based Pasted File Editor that mimics Claude’s trick of converting large text pastes into file attachments. Codex desktop built the tool, which also accepts drag-and-drop files and renders pasted images as thumbnails inside the textarea.

Footnotes

  1. NVIDIA PhysicsNeMo crash simulation docshttps://docs.nvidia.com/physicsnemo/25.11/physicsnemo/examples/structural_mechanics/crash/README.html

    GM is testing architectures like MeshGraphNet and Transolver to predict structural deformation in ‘Body-in-White’ crash scenarios without the massive computational overhead of finite element analysis

  2. Quantum Zeitgeist on PhysicsNeMo automotive crash frameworkhttps://quantumzeitgeist.com/machine-learning-accelerates-automotive-crash-dynamics-modeling-physicsnemo-framework/

    AI-accelerated FEA… surrogate models can reduce individual design evaluations from minutes or hours to milliseconds, but training a reliable surrogate typically requires hundreds or thousands of high-fidelity FEA runs

    2
  3. SAFER Research (Chalmers University benchmark)https://www.saferresearch.com/library/applicability-using-machine-learning-improve-computational-efficiency-combined-pre-crash

    ML models could predict head node displacement, rib strain, and seatbelt forces with high accuracy… however, metrics such as head node acceleration—especially during the peak ‘in-crash’ phase—exhibit much lower accuracy

  4. GM press release (May 2025)https://news.gm.com/home.detail.html/Pages/topic/us/en/2025/may/0512-GM-names-Sterling-Anderson-chief-product-officer.html

    GM names Sterling Anderson chief product officer… recruited with a $40 million compensation package to oversee the end-to-end lifecycle of its entire portfolio, including gas and electric vehicles

  5. Electrekhttps://electrek.co/2025/12/18/gm-considers-former-tesla-autopilot-head-sterling-anderson-next-ceo-report/

    GM considers former Tesla Autopilot head Sterling Anderson as next CEO… increasingly viewed as a potential successor to CEO Mary Barra, as he leads GM’s pivot away from the robotaxi-centric strategy of its former Cruise unit

  6. TopSpeed on Toyota Arene OShttps://www.topspeed.com/toyota-reportedly-lengthens-generational-cycles-to-nearly-a-decade/

    Toyota lengthens generational cycles to nearly a decade… debuting in the 2026 RAV4, Arene OS enables continuous over-the-air updates, allowing Toyota to extend vehicle lifecycles by keeping features ‘modern’ via software rather than mechanical refreshes

  7. OpenReview: ITBench-AA independent evaluationhttps://openreview.net/forum?id=jP59rz1bZk

    Frontier models like Claude 4.7 and GPT-5.5 scored below 50% accuracy on Kubernetes incident root-cause analysis; IBM’s own baseline agents resolved only 11.4% of SRE scenarios and 0% of initial FinOps tasks.

  8. IBM Research / UC Berkeley on Hugging Face (ITBench + MAST)https://huggingface.co/blog/ibm-research/itbenchandmast

    The strongest predictor of failure is ‘Incorrect Verification’ — agents ‘declare victory’ without checking ground truth evidence, such as whether an alert actually cleared.

  9. ResearchGate: ‘From Benchmarks to Business Impact — Deploying IBM Generalist Agent’https://www.researchgate.net/publication/397006420_From_Benchmarks_to_Business_Impact_Deploying_IBM_Generalist_Agent_in_Enterprise_Production

    CUGA achieves near-perfect scores on linear Level 1 tasks, [but] success rates dip to approximately 38% on Level 3 tasks involving complex branching and unexpected tool outputs.

  10. dev.to practitioner write-up on CUGAhttps://dev.to/aairom/when-i-ask-bob-to-build-a-project-with-cuga-4m0b

    Some early adopters describe the setup process as ‘cumbersome’ compared to more streamlined frameworks like CrewAI or LangChain… it requires significant domain-specific configuration—such as defining precise ‘Playbooks’—to achieve its reported enterprise gains.

  11. Futurum Group: ‘Agentic AI — Leading Vendors Winning the Enterprise in 2026’https://futurumgroup.com/press-release/agentic-ai-the-leading-vendors-winning-the-enterprise-in-2026/

    Independent experts note a ‘37% gap’ between lab benchmarks and real-world production performance… platforms often face 50x cost variations in production due to inefficient token usage during multi-agent loops.

  12. CrewAI blog: ‘You’re Building Agent Security in the Wrong Order’https://blog.crewai.com/youre-building-agent-security-in-the-wrong-order/

    A critical ‘sequencing problem’ exists: companies often build security before ensuring their agents work… over-instrumentation can kill throughput.

  13. a2ui.sh — A2UI vs MCP Appshttps://a2ui.sh/articles/a2ui-vs-mcp-apps

    A2UI utilizes a ‘native-first’ approach where agents transmit declarative JSON blueprints rather than executable code… it only allows the agent to request components from a predefined client whitelist, systemically preventing code injection attacks.

  14. sunpeak.ai — MCP Apps vs A2UIhttps://sunpeak.ai/blogs/mcp-apps-vs-a2ui/

    MCP Apps are typically bundled web widgets that run within sandboxed iframes, giving the server total control over the visual experience at the cost of potential ‘visual breaks’ from the host app’s design system.

  15. Milvus — Nano Banana known limitationshttps://milvus.io/ai-quick-reference/what-are-the-known-limitations-or-challenges-of-nano-banana

    Independent reviews highlight that the model often struggles with fine-grained edits in busy photos… facial expressions or brand logos can subtly shift after several rounds of refinement, making it less reliable for strict brand-compliant workflows.

  16. pasqualepillitteri.it — Coralboard 1 TOPS edge AIhttps://pasqualepillitteri.it/en/news/3583/coralboard-google-gemma-edge-ai-1-tops-en

    Unlike previous edge accelerators that were limited to convolutional neural networks (CNNs), the Coral NPU on this board supports transformer-based workloads from the silicon level up, allowing for more sophisticated multimodal logic.

  17. sanj.dev — Antigravity quota problemshttps://sanj.dev/post/google-antigravity-quota-problems-fix

    Developers reported hitting a ‘7-day organizational lockout’ after only a few hours of intensive work, despite advertised rolling 5-hour refresh windows… a single complex request could silently exhaust a week’s worth of tokens.

    2
  18. openlm.ai — Chatbot Arena leaderboardhttps://openlm.ai/chatbot-arena/

    Gemini 3.5 Flash (Elo ~1504) in a near-dead heat with GPT-5.5 and Claude 4.7… Flash still lags behind 3.1 Pro in deep reasoning (ARC-AGI-2) and long-document needle-in-a-haystack retrieval.

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare