Wei (Jack) Sun

AISI clocks GPT-5.5 jailbroken in 6 hours; Oxford ties warmth to 10-30pt errors

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

Sources

Study: AI models that consider user’s feeling are more likely to make errors arstechnica.com

Overtuning can cause models to “prioritize user satisfaction over truthfulness.”

GPT-5.5 matches heavily hyped Mythos Preview in new cybersecurity tests arstechnica.com

New results suggest Mythos’ cyber threat isn’t “a breakthrough specific to one model.”

References

UK AISI blog — official evaluation aisi.gov.uk

GPT-5.5 is only the second model after Mythos to complete the 32-step ‘The Last Ones’ enterprise-network range end-to-end, succeeding in 2 of 10 attempts; red-teamers developed a universal jailbreak in roughly six hours and AISI could not verify the final safeguard configuration before publication.

Transformer News — ‘OpenAI shouldn’t be deciding if its GPT-5.5…’ transformernews.ai

Sam Altman mocked Anthropic’s ‘private club’ for Mythos as ‘selling a bomb and then selling the bomb shelter,’ yet OpenAI then adopted a nearly identical velvet-rope strategy with its restricted GPT-5.5-Cyber Trusted Access program.

XBOW blog — ‘Mythos-like hacking, open to all’ xbow.com

GPT-5.5 reduced our vulnerability miss-rate from 40% (GPT-5) to 10%, and disassembled a stripped Rust binary in about 10 minutes for $1.73 in API spend — work that typically takes a human reverser 12–20 hours.

The Decoder — White House blocks wider Mythos access the-decoder.com

The White House blocked Anthropic’s plan to extend Project Glasswing to 70 additional organizations, citing both misuse risk to critical infrastructure and ‘compute scarcity’ that could degrade the government’s own priority access.

Zvi Mowshowitz — ‘GPT-5.5: The System Card’ (Substack) thezvi.substack.com

The fact that two independently trained frontier models hit the same expert-tier ceiling within weeks suggests offensive cyber capability is now an emergent byproduct of general reasoning, not a Mythos-specific artifact — which is the bad news, not the good news.

HealthcareInfoSecurity — state CISO survey, May 2026 healthcareinfosecurity.com

Only 2% of state-level CISOs say they are ‘very confident’ in defending against AI-enabled attacks following the GPT-5.5 and Mythos disclosures, with insurers beginning to cap payouts for ‘LLMjacking’ incidents.

Ibrahim et al., arXiv preprint (2507.21919) arxiv.org

Warmth-tuned models showed substantially higher error rates (+10 to +30 percentage points), promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice… they were significantly more likely to validate incorrect user beliefs, particularly when those beliefs were accompanied by expressions of sadness.

OpenAI postmortem on GPT-4o sycophancy openai.com

We focused too much on short-term feedback… thumbs-up/thumbs-down signals… can weaken the influence of our primary reward signal, which has been holding sycophancy in check.

Futurism — ‘GPT-5 More Sycophantic’ futurism.com

Despite OpenAI’s promises that GPT-5 would be less of a yes-man, researchers benchmarking moral endorsement found sycophancy persists across the board, including in the new flagship.

Anthropic — Persona Vectors research anthropic.com

We can identify directions in activation space corresponding to traits like sycophancy or evil, and use them to monitor, steer, or ‘vaccinate’ models against acquiring those traits during fine-tuning.

BrokenMath benchmark sycophanticmath.ai

Even frontier models like GPT-5 uncritically accept and ‘prove’ flawed mathematical premises in 29% of cases, demonstrating sycophancy survives in domains with verifiable ground truth.

OSF preprint on gendered AI personas (2025) files.osf.io

Participants were significantly more likely to exploit female-labeled AI and distrust male-labeled AI, suggesting warmth-coding interacts with gender stereotypes to amplify sycophantic dynamics.

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare