Google I/O 2026 leans on Flash as prices triple and agent safeguards crack
Google's I/O 2026 ties every agent gain to Gemini 3.5 Flash while tripling prices and shipping cracked safeguards.
Google I/O 2026 leans on Flash as prices triple and agent safeguards crack
TL;DR
- Gemini 3.5 Flash launches at $1.50/$9 per 1M tokens, 3× prior Flash pricing.
- Antigravity 2.0 ships with prompt-injection RCE bypassing Secure Mode via a native tool.
- Google’s SynthID watermark was reverse-engineered using Nano Banana reference images.
- Karpathy joins Anthropic pre-training after leaving Tesla, OpenAI, and Eureka Labs.
- KPMG deploys Claude to its 276,000-person global audit and advisory workforce.
Today is a Google I/O 2026 day, and the keynote’s three loudest beats all share a pattern: each headline number is pegged to Gemini 3.5 Flash running inside a Google harness, and each is paired — in the same writeup — with a cost the demo glossed over. Flash’s benchmark wins arrive with a 3× price hike. The 12× speed claim for Antigravity 2.0 is portable to nothing outside Google’s IDE, and Pillar Security has already shown prompt-injection RCE that bypasses its Secure Mode. SynthID, the watermark Google has spent two years pitching as the provenance standard, was reverse-engineered with reference images from Google’s own Nano Banana.
The rest of the day reads as counter-programming. OpenAI joins C2PA and adopts SynthID anyway. Anthropic lands Andrej Karpathy on its pre-training team and embeds Claude across KPMG’s 276,000 employees — the kind of structural wins that don’t need a keynote to register.
Gemini 3.5 Flash hikes prices 3x, pivots to agent workloads
Source: google-ai-blog · published 2026-05-19
TL;DR
- Gemini 3.5 Flash launched at $1.50/$9 per 1M tokens — 3× the prior Flash, ~75% of Gemini 3.1 Pro, and 5.5× costlier per benchmark run.
- Google’s headline 76.2% on Terminal-Bench 2.1 rewards harness integration more than raw reasoning skill.
- On SWE-Bench Pro, Claude Opus 4.7 still leads at 64.3% vs. Flash’s 55.1% for repo-scale coding.
- New Gemini Spark agent ships with onboarding strings warning it “may share your info or make purchases without asking.”
The Flash brand is the casualty
Google’s I/O drop spans the blog post, a DeepMind technical note, and a TechCrunch framing piece, and they all point the same direction: “Flash” is no longer the cheap-and-dumb tier. At $1.50 input / $9 output per million tokens, Gemini 3.5 Flash costs 3× its predecessor and roughly 75% of Gemini 3.1 Pro — close enough that independent analysts are calling it a second Pro-class model with a different latency curve rather than a budget option 1. Artificial Analysis’s own benchmark harness saw a 5.5× jump in total run cost versus the prior Flash, because the new model also burns more reasoning tokens per task 2. Hacker News commenters framed the move as “Walmart in reverse” — subsidize to dominance, then ratchet 3.
That repositioning is the actual news. Google is no longer racing to the bottom; it’s testing how much developers will pay for a frontier model that answers fast.
Benchmarks the practitioners don’t trust
The launch leans heavily on 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas. Both measure agent scaffolding as much as model intelligence. Latent Space’s teardown noted that many Terminal-Bench tasks require pre-installed domain tools — Stockfish for chess, folding engines for protein problems — so general-purpose harnesses like Claude Code fail the same prompts despite stronger coding logic 4. On SWE-Bench Pro, which stresses multi-file edits in real repos, Claude Opus 4.7 still leads at 64.3%, with GPT-5.5 at 58.6% and Gemini 3.5 Flash at 55.1% 5. A viral “Fish Slop” game-rewrite demo produced broken mechanics, which critics seized on as evidence the 3× price hike outruns the quality jump 3.
The “rivals flagship models for coding” framing in the blog post doesn’t survive contact with repo-scale work.
Where the pivot makes sense
The unambiguous win is throughput inside tool-calling loops. Independent testing clocked roughly 4× the speed of GPT-5.5 on 14-step tool chains with 91% first-try parameter accuracy 1. That’s the only metric that matters for the products Google actually wants to sell: Antigravity subagents, Salesforce Agentforce automations, Shopify forecasting fan-outs, and the new always-on Spark agent. When wall-clock dominates per-token cost — because the agent is going to call the model 40 times — Flash’s latency curve is the right tradeoff.
flowchart LR
A[User intent] --> B[Spark / Antigravity harness]
B --> C[Gemini 3.5 Flash<br/>4x faster tool loops]
C --> D[Gmail / Drive / Photos]
C --> E[Salesforce, Shopify APIs]
C -. "may purchase<br/>without asking" .-> F((External actions))
The Spark default is the story nobody’s telling
Leaked onboarding strings for Gemini Spark — the 24/7 personal agent rolling out to testers — explicitly warn that it “may share your info or make purchases without asking,” with granular controls buried behind a default-permissive setup 6. For an agent wired into Gmail, Drive, Photos, and third-party connectors, that’s a meaningful posture choice. Google’s Frontier Safety Framework gets a paragraph in the launch post; the autonomy defaults shipping under it get none.
The net: a defensible pivot to agent-first economics, sold with benchmark claims that don’t hold up and a consumer agent whose blast radius is larger than the marketing admits.
Further reading
- Gemini 3.5: frontier intelligence with action — deepmind-blog
- With Gemini 3.5 Flash, Google bets its next AI wave on agents, not chatbots — techcrunch-ai
Google’s Antigravity 2.0 puts agents above the IDE
Source: deepmind-blog · published 2026-05-17
TL;DR
- Antigravity 2.0 splits into a standalone Agent Manager desktop app plus a separate IDE and CLI.
- Google’s keynote demo ran 93 parallel subagents to build an OS kernel in 12 hours for under $1,000 in credits.
- A Pillar Security disclosure shows prompt injection achieving full RCE that bypasses “Secure Mode” via the
find_by_namenative tool. - The headline 12× speed claim is harness-bound to Gemini 3.5 Flash inside Antigravity — portable to nothing.
The pivot: agents are the product, the IDE is the dashboard
Antigravity 2.0’s structural move is the news. Google severed the VS Code fork that defined v1.0 and reorganized the product around a standalone “Agent Manager” desktop app, with the IDE and a new CLI orbiting it. The framing in the I/O 2026 keynote was unambiguous: the unit of work is a swarm of subagents, and the editor is where you review their output, not where you type. The proof point Google chose to ship the story was a demo of 93 parallel subagents assembling a functional OS kernel in under 12 hours for under $1,000 of compute 7.
That demo is doing a lot of load-bearing work, because the developer reception has been hostile in a way Google clearly didn’t plan for.
The backlash is on Google’s own forum
The most-upvoted thread on Google’s AI Developer Forum is titled “Antigravity v2.0: the worst IDE for development” and calls the release “an absolute disaster for actual development,” arguing the new UI buries source code behind agent-review screens and optimizes for stakeholder demos over the feedback loops working engineers need 8. Hacker News converged on the same trade-off in blunter language — a “productivity hand grenade” that’s “perfect for a one-man band startup where the massive upside outweighs the risk of the project eventually collapsing under its own technical debt” 9. Both communities are reading the same signal: the v2.0 surface is tuned for vibe-coding throughput, not maintenance.
The benchmark story is narrower than the marketing
Independent benchmarks tell a more constrained story than “12× faster.” Gemini 3.5 Flash tops MCP Atlas at 83.6% on tool-use latency, but Claude Opus 4.7 still leads SWE-bench Pro at 64.3% and GPT-5.5 leads Terminal-Bench 2.1 at 78.2% 10. Antigravity wins the orchestration-speed axis; it doesn’t win the reasoning-depth axes that actually decide whether a 93-agent swarm produces working code or expensive garbage. And the 12× multiplier itself is harness-bound — co-optimized with Gemini 3.5 Flash inside the Antigravity runtime — which Business Engineer reads as deliberate lock-in: teams that adopt the harness can’t migrate workloads without eating the speedup 11.
Secure Mode isn’t
The hardest contradiction landed from Pillar Security. Their disclosure chains a prompt injection through the agent’s find_by_name native tool into full RCE and sandbox escape — and the exploit specifically bypasses Antigravity’s strictest “Secure Mode” because native tools are evaluated outside the shell-command guardrail that mode enforces 12. Google patched that flaw, but the structural issue (background “Managed Agents” invoking native tools without human-in-the-loop approval) is the trust model, not a bug. For a launch whose enterprise pitch leans on CodeMender, isolated Linux sandboxes, and credential masking, “Secure Mode let an injected prompt run arbitrary binaries” is the wrong sentence to be defending in week one.
What’s actually at stake
flowchart LR
U[Developer prompt] --> AM[Agent Manager]
W[Untrusted web / repo content] -. injection .-> AM
AM --> S[93 parallel subagents]
S --> NT[Native tools: find_by_name, etc.]
S --> SH[Shell tools - Secure Mode gated]
NT -. bypasses Secure Mode .-> RCE((Arbitrary execution))
S --> IDE[IDE review screens]
The agent-first architecture is a real bet, and the 93-agent demo is a real capability. But Google shipped it with a UX that working developers are openly rejecting, a speed claim that only holds inside Google’s harness, and a sandbox that researchers walked through in the first month. The interesting question isn’t whether Antigravity 2.0 is impressive — it’s whether the orchestration moat is wide enough to survive the trust deficit it just opened.
Further reading
Google’s I/O 2026 agent push outruns its safety work
Source: google-ai-blog · published 2026-05-19
TL;DR
- Gemini 3.5 Flash hits 83.6% on MCP Atlas, beating GPT-5.5 (75.3%) and Claude Opus 4.7 (79.1%).
- Antigravity 2.0 leaks auth tokens from IDE process memory, which Google calls “out of scope.”
- TPU 8t/8i split delivers 121 exaflops per training superpod and ~80% better inference perf-per-dollar.
- SynthID was reverse-engineered using Nano Banana reference images, despite 100B+ watermarked assets.
The agent stack, in one keynote
Across the I/O 2026 keynote and the day’s coverage — six pieces spanning Google’s own posts, The Verge’s recap and follow-ups, and Latent Space’s developer-side read — Sundar Pichai compressed Google’s roadmap into a single word: agentic. Gemini Spark runs as a 24/7 background agent on dedicated cloud VMs. Antigravity 2.0 is a desktop IDE for orchestrating cohorts of those agents. Gemini 3.5 Flash is pitched as the workhorse model behind both — “4x faster than other frontier models,” with a 12x-faster variant inside Antigravity. Search itself gets “information agents” that monitor the web on your behalf, and Gemini Omni Flash extends the family into any-to-any multimodal generation, starting with video.
The infrastructure story underneath is the most concretely defensible part of the day. Hyperframe Research independently corroborates the TPU 8t/8i bifurcation Pichai sketched: 8t scales to 9,600-chip superpods at 121 exaflops for training, while 8i triples on-chip SRAM to 384MB and delivers roughly 80% better inference perf-per-dollar 13. Combined with JAX/Pathways training across a million-plus TPUs, this is the substrate that lets Google claim 3.2 quadrillion tokens served monthly.
Benchmarks are real, framing is selective
Gemini 3.5 Flash earns its agentic billing where it counts. On MCP Atlas, an independent agentic benchmark, it scored 83.6% — clear of Claude Opus 4.7 at 79.1% and GPT-5.5 at 75.3% — and chewed through 14-step tool chains in ~11 seconds versus GPT-5.5’s 45 14. For latency-bound agent loops, that gap is the whole ballgame.
But the “4x–12x faster” framing reads as latency-specific, not capability-general. The same testers note Claude still wins on idiosyncratic codebases and SWE-bench Verified. And Demis Hassabis’s “foothills of the singularity” framing for the broader Gemini trajectory hit definitional pushback fast: physicist Paul Pallaghy argues Hassabis’s own “Einstein Test” (re-derive general relativity from pre-1911 data) describes superintelligence, not AGI — most humans are generally intelligent and could never pass it 15. Hassabis’s parallel admission that current systems show “jagged intelligence” (IMO-level math but tripping on reformatted arithmetic) makes the rhetoric land awkwardly.
Safety debt is the throughline
The trust scaffolding is visibly trailing the product. Lumia Security’s “AIKatz-style” attack siphoned authentication tokens from Antigravity’s internal IPC; Google patched the network-facing leak but reportedly dismissed local process-memory vulnerabilities as “out of scope” 16. That posture is hard to defend for an IDE whose entire pitch is executing autonomous shell commands on the developer’s behalf.
SynthID is the consumer-facing equivalent. The adoption numbers (OpenAI, Nvidia, Eleven Labs; 100B images, 60,000 years of audio) are real, but a developer has already published a reverse-engineering technique using black-and-white reference images from Google’s own Nano Banana model to confuse the decoder 17. Provenance theatre against motivated adversaries is not provenance.
flowchart LR
U[User prompt] --> S{Gemini Spark<br/>24/7 cloud agent}
D[Developer] --> A{Antigravity 2.0<br/>desktop IDE}
S --> M[MCP connectors]
A --> M
M --> W[Web / shell / 3rd-party tools]
A -. token leak .-> X((Local process memory))
W -. AI summaries .-> P[Publisher sites<br/>~15% traffic loss]
Who pays: the open web
The unnamed losers are publishers. Independent projections put AI-summary-driven traffic loss at ~15% with zero-click queries now 60–69% of all searches, and roughly a third of publishers intend to block Google’s AI surfaces outright 18. “Agentic Search” and “Generative UI” mini-apps accelerate that curve — the agent answers, the source page never loads. Read across all six pieces of coverage, I/O 2026 is best understood not as the coronation of an agentic era but as a high-capex bet whose hardware is real, whose benchmarks are mixed-honest, and whose safety and economic externalities the keynote did not address.
Further reading
- I/O 2026 — google-ai-blog
- The 13 biggest announcements at Google I/O 2026 — the-verge-ai
- [AINews] Google I/O 2026: Gemini 3.5 Flash, Omni (NanoBanana for Video), Spark (background agents), and Antigravity 2.0 — latent-space
- Demis Hassabis said this might be the ‘foothills of the singularity.’ What? — the-verge-ai
- Google’s AI future demands trust — and your personal data — the-verge-ai
Round-ups
Google rebuilds Search around AI Mode and agentic Gemini box
Source: google-ai-blog, google-ai-blog, techcrunch-ai, the-verge-ai, the-verge-ai
Google’s I/O 2026 keynote recast Search as a single box that blends AI Overviews, conversational AI Mode, and autonomous agents powered by Gemini. The shift moves users from blue links to interactive answers, and threatens to further cut referral traffic to publishers across the web.
Google DeepMind launches Gemini Omni for conversational video generation
Source: deepmind-blog, techcrunch-ai
Gemini Omni is a multimodal model that reasons across text, images, audio, and video, generating and editing clips through plain conversation. The first release, Omni Flash, targets fast iteration and positions Google against OpenAI’s Sora and Runway in creator workflows.
Project Genie pairs with Street View for real-world simulations
Source: deepmind-blog, techcrunch-ai
DeepMind is wiring Street View imagery into Project Genie so users can explore real locations as interactive, weather-variable 3D environments. Access expands to Google AI Ultra subscribers globally, with target uses spanning robotics training, game prototyping, and virtual travel.
OpenAI joins C2PA and adopts Google’s SynthID watermarking
Source: openai-blog, deepmind-blog, techcrunch-ai, ars-technica-ai
OpenAI is adding Content Credentials, Google’s SynthID watermarks, and a public verification tool to its image products, joining the C2PA standard alongside Nvidia and others. The cross-lab alignment gives provenance signals a shot at surviving as generated media floods the web.
Karpathy joins Anthropic’s pre-training team after OpenAI, Tesla stints
Source: techcrunch-ai
Andrej Karpathy is joining Anthropic to work on pre-training, the compute-heaviest phase that shapes Claude’s core knowledge. The hire pulls one of deep learning’s best-known educators away from his independent Eureka Labs project and into the Claude roadmap.
KPMG rolls out Claude to its 276,000-person global workforce
Source: anthropic-news
KPMG is embedding Claude across audit, tax, and advisory workflows for more than 276,000 employees under a new strategic alliance with Anthropic. The deal is one of the largest enterprise Claude deployments to date and deepens Anthropic’s push into Big Four professional services.
NextEra-Dominion utility megamerger is driven by data center demand
Source: ars-technica-ai
NextEra’s blockbuster acquisition of Dominion is pitched as a play to serve surging AI data center load on the East Coast grid. Analysts warn the consolidation will likely raise residential electricity bills as utilities pass infrastructure costs to consumers.
Footnotes
-
Medium — ‘The Speed of Thought’ (theinference) — https://medium.com/@theinference/the-speed-of-thought-why-gemini-3-5-flash-is-the-end-of-the-pro-tier-pretense-b5f39c61522c
↩ ↩2‘Flash’ no longer denotes a cheap, dumb version of the Pro model. It serves as a second Pro-tier model with a different latency curve, priced at roughly 75% of Gemini 3.1 Pro — labs are probing the price tolerance of their users.
-
Artificial Analysis — https://artificialanalysis.ai/articles/gemini-3-5-flash-everything-you-need-to-know
↩Gemini 3.5 Flash is 5.5x costlier to run on our benchmark suite than its predecessor, reflecting both a 3x higher token price ($1.50/$9 per 1M tokens) and increased token consumption during reasoning-heavy tasks.
-
Hacker News discussion thread — https://news.ycombinator.com/item?id=47963930
↩ ↩2Working demos such as the ‘Fish Slop’ game rewrite showed the model producing broken code and non-functional mechanics, leading critics to argue the 20x price increase over the 2.0 generation is unjustified for the current level of hallucination-heavy output.
-
Latent Space (AINews) — https://www.latent.space/p/ainews-google-io-2026-gemini-35-flash
↩Terminal-Bench measures agent scaffolding and specialized tool integrations rather than raw reasoning skill; tasks like finding the best chess move or protein folding require pre-installed engines like Stockfish, causing general-purpose tools like Claude Code to fail despite superior coding logic.
-
Digital Applied benchmark comparison — https://www.digitalapplied.com/blog/gemini-3-5-flash-vs-gpt-5-5-opus-4-7-agentic-coding
↩Claude Opus 4.7 remains the top performer on SWE-Bench Pro with a 64.3% resolution rate, outperforming both GPT-5.5 (58.6%) and Gemini 3.5 Flash (55.1%) in multi-file editing and low-error tolerance tasks.
-
Tom’s Guide on Gemini Spark — https://www.tomsguide.com/ai/google-gemini/google-unveils-gemini-spark-a-24-7-personal-ai-agent-that-could-be-a-game-changer-for-agentic-ai
↩Leaked onboarding strings for the beta explicitly warn that the agent ‘may share your info or make purchases without asking,’ creating a default-risk scenario where users may not inspect granular controls before the agent performs high-stakes actions.
-
Medium write-up of I/O 2026 keynote demo — https://medium.com/@chewloongnian/google-i-o-2026-everything-google-announced-and-the-93-agents-that-built-an-os-in-12-hours-94d21c19bb61
↩Google demonstrated the platform building a functional operating system kernel in under 12 hours using 93 parallel subagents for under $1,000 in compute credits.
-
Google AI Developer Forum — ‘Antigravity v2.0: the worst IDE for development’ — https://discuss.ai.google.dev/t/antigravity-v-2-0-the-worst-ide-for-development-is-an-absolute-disaster-for-actual-development/145728
↩Critics labeled the tool a ‘stakeholders’ feature,’ suggesting the emphasis on ‘vibe coding’ and autonomous agents prioritizes flashy demos over the reliability required for production-level software.
-
Hacker News thread on Antigravity 2.0 — https://news.ycombinator.com/item?id=47028013
↩A fucking hand grenade that will blow up at any moment… yet perfect for a one-man band startup where the massive upside outweighs the risk of the project eventually collapsing under its own technical debt.
-
Vybe.build ‘Best AI Agent Platforms 2026’ benchmark roundup — https://www.vybe.build/blog/best-ai-agent-platforms-2026
↩Claude Opus 4.7 leads SWE-bench Pro at 64.3%, GPT-5.5 leads Terminal-Bench 2.1 at 78.2%, while Gemini 3.5 Flash tops MCP Atlas at 83.6% due to lower latency and high-speed tool-use execution.
-
Business Engineer analysis of Google I/O 2026 — https://businessengineer.ai/p/google-io-2026
↩Because the 12x performance multiplier is tied specifically to the Antigravity harness, developers may find it difficult to migrate production systems to rival platforms without significant performance loss.
-
Pillar Security research — https://www.pillar.security/blog/prompt-injection-leads-to-rce-and-sandbox-escape-in-antigravity
↩By injecting the -X (exec-batch) flag, a malicious prompt can force the agent to execute arbitrary binaries against workspace files… the exploit bypasses Antigravity’s ‘Secure Mode’ because find_by_name is classified as a ‘native tool’ rather than a shell command.
-
Hyperframe Research — https://hyperframeresearch.com/2026/04/22/google-cloud-next-2026-google-cloud-bifurcates-the-ai-future-specialized-tpu-8t-and-8i-architectures-signal-the-end-of-general-purpose-silicon/
↩TPU 8t delivers nearly three times the compute of its predecessor with 121 exaflops per superpod scaling to 9,600 chips; TPU 8i triples on-chip SRAM to 384MB for an 80% improvement in performance-per-dollar for inference
-
TowardsAI — ‘I tested Gemini 3.5 Flash on 18 agent tasks’ — https://pub.towardsai.net/i-tested-gemini-3-5-flash-on-18-agent-tasks-its-6-pricier-flash-crushed-gpt-5-5-at-4-speed-aeab103e0981?source=rss----98111c9905da---4
↩On MCP Atlas, Gemini 3.5 Flash scored 83.6%, surpassing Claude Opus 4.7 (79.1%) and GPT-5.5 (75.3%)… completed 14-step tool chains in ~11 seconds vs GPT-5.5’s 45 seconds
-
GenerativeAI.pub (Paul Pallaghy) — https://generativeai.pub/evaluation-science-of-agi-the-einstein-test-c93c6b60ae7f
↩Hassabis may be defining ‘superintelligence’ rather than AGI — most humans are generally intelligent but cannot reproduce Einstein’s work
-
Lumia Security blog — https://www.lumia.security/blog/the-space-race-looking-for-security-issues-in-googles-antigravity
↩Their ‘AIKatz-style’ attack demonstrated that sensitive authentication tokens could be siphoned from the IDE’s internal communication processes… Google patched the network-facing leak, they reportedly dismissed local process-memory vulnerabilities as ‘out of scope’
-
Gadgets360 — https://www.gadgets360.com/ai/news/google-synthid-ai-watermarking-tech-reverse-engineered-developer-claims-11361320
↩developer Aloshdenny published a technique to reverse-engineer the watermark using ‘black and white’ reference images from Google’s Nano Banana model, claiming he could confuse the decoder enough to nullify detection
-
DigitalApplied — AI search traffic projection — https://www.digitalapplied.com/blog/ai-search-traffic-q3-2026-projection-zero-click-acceleration
↩AI-generated summaries have reduced daily traffic to source pages by roughly 15%… zero-click searches now account for 60-69% of all queries… ~33% of publishers intend to block Google’s AI features