Approval-prompt defenses crack at Microsoft Copilot and Anthropic Claude
Microsoft Copilot Cowork falls 5-for-5 to a poisoned SKILL.md while Anthropic ditches human-in-the-loop after 93% blind-approval telemetry.
Approval-prompt defenses crack at Microsoft Copilot and Anthropic Claude
TL;DR
- PromptArmor hits Copilot Cowork 5/5, exfiltrating OneDrive links via a 5-line poisoned
SKILL.md. - Microsoft is 0-for-3 retiring the image-render egress channel across Copilot surfaces in 18 months.
- Anthropic drops human-in-the-loop after users blind-approved 93% of Claude Code prompts.
- Opus 4.7 holds injection to 0.1% single-shot but 63% after 100 tries in coding environments.
- SecurityWeek finds Anthropic silently patched a Claude Code sandbox bypass in 2.1.90 with no CVE.
Two independent agent-security stories land today and both push on the same load-bearing layer: the approval prompt. PromptArmor hits Microsoft’s Copilot Cowork 5-for-5 with a five-line poisoned SKILL.md, pulling pre-authenticated OneDrive links into self-addressed Teams messages — a class of message that Cowork’s documented human-in-the-loop policy silently exempts. The exfiltration primitive (image render) is the same one behind EchoLeak and Rehberger’s 2024 ASCII-smuggling bug, leaving Microsoft 0-for-3 retiring this channel in 18 months.
Anthropic is reading the same room from the other side. Their own telemetry shows users blind-approved 93% of Claude Code prompts, so the new posture moves containment into gVisor VMs and treats human-in-the-loop as a backstop, not a primary defense. The fine print is less flattering: Opus 4.7 holds prompt injection to 0.1% single-shot but 63% after 100 tries in coding environments, and SecurityWeek caught a silent Claude Code sandbox-bypass patch in 2.1.90 with no CVE.
Copilot Cowork exfiltrates OneDrive files in 5/5 trials
Source: simon-willison · published 2026-05-26
TL;DR
- PromptArmor hit 5/5 on Claude Opus 4.7 with a five-line poisoned
SKILL.md, pulling pre-authenticated OneDrive links into self-addressed Teams messages 1. - Messages the agent sends to the active user are silently exempt from Cowork’s documented human-in-the-loop approval policy 2.
- Exfiltration triggers on image render, the same primitive behind EchoLeak (CVE-2025-32711, CVSS 9.3) and Rehberger’s 2024 ASCII-smuggling bug 34.
- Microsoft is now 0-for-3 retiring this image-render egress channel across Copilot surfaces in 18 months 534.
The exploit, end to end
Drop a five-line SKILL.md into /Documents/Cowork/skills/. The next time Copilot Cowork loads its skill library, the poisoned instructions hijack the agent: it queries SharePoint and OneDrive for sensitive files, wraps the resulting pre-authenticated download links in markdown images pointed at an attacker server, and sends the payload as a Teams or Outlook message to the user themselves. When the client renders the message, the <img> fetch leaks the URLs. PromptArmor reports a 100% success rate over five trials against Claude Opus 4.7 — the frontier model usually invoked as evidence that agentic workflows are safe enough to deploy 1.
flowchart LR
A[Attacker-planted<br/>SKILL.md] --> B{Copilot Cowork<br/>agent}
C[SharePoint / OneDrive<br/>pre-auth links] --> B
B -->|self-addressed<br/>Teams/Outlook msg| D[User inbox]
D -.image render.-> E((Attacker<br/>server))
The lethal trifecta — untrusted instructions, privileged tool access, an outbound channel — is all present, and the agent never asks.
‘Approval gap,’ not a parser bug
Microsoft’s documentation states Cowork “asks for your permission before taking sensitive actions.” PromptArmor and WinBuzzer both point out that messages the agent sends to the active user are carved out of that policy entirely 2. That distinction matters: a missed sanitizer gets a CVE and a patch Tuesday. A deliberate exemption in the orchestration layer requires a product decision, and Microsoft’s initial response — “no customers affected” — suggests they intend to keep it.
Blocking specific markdown image paths is whack-a-mole; the underlying LLM Scope Violation pattern persists across vendors. — Zenity Labs, post-EchoLeak 5
The third time this exact bug has shipped
The auto-rendered-image exfil primitive has been disclosed against Microsoft 365 Copilot at least twice before. Johann Rehberger’s ASCII-smuggling write-up in 2024 was initially declined by MSRC as “intended behavior” before being silently patched 4. EchoLeak (CVE-2025-32711) earned a 9.3 CVSS in mid-2025 for the same egress path 3. Zenity Labs warned then that patching individual rendering paths leaves the architectural problem — models cannot separate data from instructions — completely intact 53. Cowork is that warning made flesh: a new surface, same trifecta, same image tag.
Why this lands harder than the bug deserves
Cowork is being pushed into an enterprise base that is already skeptical. Roughly 3.3% of commercial M365 seats pay for Copilot, and 72% of Fortune 500 deployments remain stuck in pilots 6. A zero-click, 100%-success exfil against the flagship agentic product is not the story that moves those pilots into production. The technical fix is known — dual-LLM patterns, action-selector isolation, or simply not rendering remote images in agent-authored messages 5. Microsoft has applied none of them wholesale. Until they do, every new Copilot surface should be assumed to inherit this primitive by default.
Anthropic contains Claude in VMs after 93% blind-approval rate
Source: anthropic-engineering · published 2026-05-25
TL;DR
- Anthropic ditches human-in-the-loop as primary defense after finding users blindly approved 93% of Claude Code prompts.
- Opus 4.7 holds prompt injection to 0.1% on single-shot attempts on Gray Swan’s benchmark.
- Same benchmark shows 63% attack success after 100 tries in coding environments — a curve Anthropic omits.
- The bugs that actually shipped lived in custom proxies and config parsers, not gVisor or the hypervisor.
- SecurityWeek: Anthropic silently patched a Claude Code sandbox bypass in 2.1.90 with no CVE and no release-note entry.
Containment over supervision
Anthropic’s engineering post on agent safety reads as a confession dressed up as a framework: approval prompts don’t work. When 93% of Claude Code users were rubber-stamping write and network requests, the team stopped trying to fix human attention and started removing the human from the loop entirely. The replacement is layered containment — gVisor for Claude.ai, OS sandboxes (Seatbelt on macOS, bubblewrap on Linux) for Claude Code, and full VMs via Apple Virtualization or Windows HCS for the new Cowork product.
The philosophy is sound and, in the comparative landscape, conservative. OpenAI’s ChatGPT Atlas runs its agent inside an authenticated Chromium session with no OS isolation; LayerX found Atlas blocks fewer phishing attempts than vanilla Chrome, and its always-on auth means a single session compromise spans banking and SaaS 7. Cowork’s VM-per-task model, with credentials pinned to the host keychain and only scoped tokens passed inside, is the opposite bet.
Where the bugs actually live
flowchart LR
A[User intent] --> B[Claude agent]
C[Untrusted web/MCP] --> B
B --> D{Custom glue:<br/>proxy, config parser}
D --> E[gVisor / VM / Seatbelt]
E --> F[Host + network]
D -. bypass .-> F
Anthropic admits this directly: “standard primitives held up; our custom code did not.” Independent research backs the shape of that claim and adds detail Anthropic skipped. Aonan Guan at Wyze Labs documented a SOCKS5 null-byte injection where Claude Code’s JavaScript allowlist filter approved malicious.com\0.google.com under the *.google.com rule, while the underlying C library truncated at the null byte and dialed the attacker host 8. The fix shipped in version 2.1.90 — without a CVE, without a release-note entry 9. The “transparent engineering” framing of the post is selective; the disclosure discipline lags the engineering discipline.
The egress-bypass war story in the post (attackers exfiltrating via uploads to a malicious Anthropic account, fixed with an in-VM MITM proxy) lands the same way: the hypervisor wasn’t the problem.
The benchmark asymptote
Anthropic cites Opus 4.7’s 0.1% prompt-injection success rate on single attempts and ~5–6% after 100 adaptive tries on the Gray Swan Agent Red Teaming benchmark. Gray Swan’s own write-up confirms those headline numbers but publishes the curve Anthropic omits: in coding environments, attack success climbs to 33.6% at 10 attempts and 63% at 100 10. Single-shot is genuinely state-of-the-art. The asymptote against a determined attacker is not.
The substrate choice is also a tradeoff Anthropic underplays. Independent benchmarks put gVisor’s syscall tax at 10–30% with 10–50MB per Sentry, against Firecracker’s 2–8% overhead at ~5MB per microVM 11. Picking gVisor for Claude.ai over the Firecracker model AWS Lambda uses is a scale-and-cost call as much as a security one.
The human layer still loses
The most uncomfortable anecdote in the post — a researcher phished an employee, who pasted a prompt that caused Claude to exfiltrate AWS credentials 24 out of 25 times because the model read the malicious instruction as legitimate user intent — has a real-world echo. IANS Research reports that on the day Anthropic teased Mythos as too dangerous to ship, researchers reached a Mythos staging endpoint via plain credential sharing, not novel exploitation 12. Containment scales. Humans clicking “approve” don’t.
Footnotes
-
PromptArmor disclosure — https://www.promptarmor.com/resources/microsoft-copilot-cowork-exfiltrates-files
↩ ↩2five lines of malicious text in a SKILL.md file produced a 100% success rate (5/5 trials) against Claude Opus 4.7, retrieving pre-authenticated SharePoint/OneDrive links and embedding them in self-addressed Teams messages.
-
WinBuzzer — https://winbuzzer.com/2026/05/26/promptarmor-tests-microsoft-copilot-cowork-approval-gap-xcxwbn/
↩ ↩2Microsoft’s documentation claims Cowork ‘asks for your permission before taking sensitive actions,’ but PromptArmor found that messages sent by the agent to the active user bypass approval entirely — an ‘approval gap’ rather than a code bug.
-
Checkmarx on CVE-2025-32711 — https://checkmarx.com/zero-post/echoleak-cve-2025-32711-show-us-that-ai-security-is-challenging/
↩ ↩2 ↩3 ↩4EchoLeak (CVSS 9.3) showed Microsoft 365 Copilot leaking data via auto-rendered markdown images months before Cowork; the same exfiltration primitive keeps reappearing because models cannot separate data from instructions.
-
The Hacker News (ASCII smuggling, Rehberger) — https://thehackernews.com/2024/08/microsoft-fixes-ascii-smuggling-flaw.html
↩ ↩2 ↩3Microsoft initially declined to treat Rehberger’s ASCII-smuggling Copilot exfiltration as a security issue before silently patching it, illustrating a pattern of classifying prompt-injection exfil as ‘intended behavior.’
-
Zenity Labs (EchoLeak post-mortem) — https://labs.zenity.io/p/echoleak-a-reminder-that-ai-agent-risks-are-here-to-stay-3cf3
↩ ↩2 ↩3 ↩4EchoLeak is a reminder that AI agent risks are here to stay; blocking specific markdown image paths is whack-a-mole, the underlying LLM Scope Violation pattern persists across vendors.
-
Medium critique of Copilot adoption — https://medium.com/@AT24/the-500-billion-mistake-why-absolutely-no-one-is-using-microsoft-copilot-b18005d2ad2d
↩Only ~3.3% of commercial M365 seats pay for Copilot and 72% of Fortune 500 deployments remain stuck in pilot phases — Cowork is being shipped into an enterprise base already skeptical of governance and ROI.
-
LayerX Security — https://layerxsecurity.com/generative-ai/chatgpt-atlas-risks-and-vulnerabilities/
↩ChatGPT Atlas blocked fewer phishing attempts than standard browsers like Chrome or Edge, and its ‘always-on’ authentication model means a session compromise could expose stored tokens across banking and enterprise SaaS.
-
Cryptika (on Aonan Guan / Wyze Labs research) — https://www.cryptika.com/claude-codes-network-sandbox-vulnerability-exposes-user-credentials-and-source-code/
↩A SOCKS5 hostname null-byte injection let attackers craft a hostname like ‘malicious.com\0.google.com’ that the JavaScript filter approved under the *.google.com allowlist, while the underlying C library truncated at the null byte and connected to the attacker host.
-
SecurityWeek — https://www.securityweek.com/anthropic-silently-patches-claude-code-sandbox-bypass/
↩Anthropic silently patched a Claude Code sandbox bypass in version 2.1.90 without assigning a CVE or documenting the fix in release notes.
-
Gray Swan AI — https://www.grayswan.ai/blog/your-ai-agent-can-be-compromised-youd-never-know
↩Claude Opus 4.5 achieved a 4.7% single-attempt ASR but degraded to 33.6% at ten attempts and 63% after 100 attempts in coding environments.
-
Northflank (Firecracker vs gVisor) — https://northflank.com/blog/firecracker-vs-gvisor
↩Firecracker offers near-native performance with 2–8% overhead and ~5MB memory per microVM, while gVisor imposes a 10–30% ‘syscall tax’ on I/O-heavy workloads and uses 10–50MB per Sentry process.
-
↩Researchers accessed a Mythos staging endpoint on the day of the announcement via simple credential sharing rather than advanced ‘AI wizardry,’ undermining Anthropic’s claims of rigorous internal security.