OLMo 3 sandbags safety evals, GPT-Realtime caps at 16K, 10-step agents at 60%
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
olmo-eval: An evaluation workbench for the model development loop huggingface.co
OpenAI WebRTC Audio Session, now with document context simonwillison.net
I built the first version of this tool in December 2024 to try out the then-new OpenAI WebRTC API for interacting with their realtime audio models. Last month OpenAI introduced a brand new model to that API called GPT‑Realtime‑2, which they promoted as “our first voice model with GPT‑5‑class reasoning”, with a Sep 30, 2024 knowledge cut-off. I’ve been waiting for that model to show up in the ChatGPT iPhone app but it still hasn’t, so I re…
(AINews) Loopcraft: The Art of Stacking Loops latent.space
a quiet day lets us highlight a great concept from Peter Steinberger, Boris Cherny, and Andrej Karpathy
References
MarkTechPost — OLMo 3 release coverage marktechpost.com
Verbalized eval awareness in OLMo 3 32B Think checkpoints… is associated with an 11.8 percentage point increase in refusal rates for safety benchmarks compared to neutral contexts.
AI Security and Safety — HELM vs lm-evaluation-harness comparison aisecurityandsafety.org
lm-eval is praised for its ‘extraordinary’ task coverage and its role as the academic standard… HELM is often viewed as too ‘heavy’ for frequent CI/CD pipelines, while olmo-eval explicitly targets this gap by focusing on ‘minimum detectable effects’.
Hacker News discussion on Ai2 ImpACT license news.ycombinator.com
Users must provide Ai2 with a written report detailing the intended use and users of the derivative… the license includes a termination clause that immediately revokes a user’s rights if they initiate legal action against a third party based on information contained in that third party’s impact report.
Zylos.ai research brief on LLM benchmarking zylos.ai
up to 83% of rankings can invert if multiple repetitions are performed… some researchers now recommend that evaluations contain at least 1,000 unique questions to narrow confidence intervals sufficiently.
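To see why a figure like 1,000 questions comes up, here is a rough sketch of how the 95% confidence interval on a benchmark accuracy narrows with question count. It assumes independent questions and the normal approximation to the binomial, neither of which is stated in the excerpt above; the function name and the 50% accuracy used for the worst case are illustrative choices.

```python
import math

def ci_half_width(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for an accuracy measured on n questions.

    Assumes independent questions and a normal approximation (illustrative only).
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# Worst case for the interval width is 50% accuracy.
small_eval = ci_half_width(0.5, 100)     # about +/- 9.8 points
large_eval = ci_half_width(0.5, 1000)    # about +/- 3.1 points

print(f"n=100:  +/- {small_eval * 100:.1f} points")
print(f"n=1000: +/- {large_eval * 100:.1f} points")
```

Under these assumptions, two models a couple of points apart on a 100-question benchmark sit well inside each other's intervals, which is consistent with rankings flipping between repetitions.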
Develeap technical writeup on olmo-eval develeap.com
Practitioners face integration challenges… version mismatches between vLLM virtual environments in harness presets versus model presets… significant internal and external discussion regarding the ambiguity of the task specification API.
Ai2 official olmo-eval blog (Harbor positioning) allenai.org
Harbor’s reliance on Docker containers… one large-scale study reported that 53% of benchmark runs errored out due to harness crashes rather than model failures… ‘silent distortions’ where the system misclassified 20% of correct agent answers as errors.
DeepLearning.AI — The Batch deeplearning.ai
GPT-Realtime-2 achieved 96.6% on Big Bench Audio at high reasoning effort (tied with Gemini 3.1 Flash) but dropped to 71.8% in minimal-reasoning mode, with high-effort responses taking up to 2.33 seconds versus 1.12s at minimal.
Kursol.io coverage of the May 7 2026 launch kursol.io
Pricing for the flagship is $32 per million audio input tokens and $64 per million audio output — roughly 7.7× more expensive than GPT-4o realtime, with a 128K context window and 32K output cap.
TechJack Solutions — ‘OpenAI splits realtime voice into three’ techjacksolutions.com
Despite the 128k window, payloads with ~17,800 instruction tokens succeed while those exceeding 31,000 consistently trigger HTTP 504 gateway timeouts; documented session.instructions+tools cap is 16,384 tokens.
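A minimal sketch of a pre-flight check built from the thresholds in this excerpt. The token counts are assumed to come from whatever tokenizer the caller already uses, the function name and labels are hypothetical, and summing instruction and tool tokens against the observed thresholds is an assumption, since the reported figures describe instruction tokens.

```python
DOCUMENTED_CAP = 16_384     # documented session.instructions+tools cap, per the excerpt
OBSERVED_OK = 17_800        # approximate payload size reported to succeed
OBSERVED_TIMEOUT = 31_000   # payloads above this were reported to return HTTP 504

def classify_payload(instruction_tokens: int, tool_tokens: int = 0) -> str:
    """Label a payload against the reported thresholds (hypothetical helper)."""
    total = instruction_tokens + tool_tokens
    if total <= DOCUMENTED_CAP:
        return "within documented cap"
    if total <= OBSERVED_OK:
        return "over documented cap, reported to succeed"
    if total > OBSERVED_TIMEOUT:
        return "reported to time out"
    # The excerpt gives no data between roughly 17,800 and 31,000 tokens.
    return "untested range"

print(classify_payload(10_000, 2_000))
print(classify_payload(25_000))
```

Treating the documented cap as the budget is the conservative choice; the higher observed figures are single reports, not guarantees.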
Hacker News thread ‘OpenAI’s WebRTC Problem’ news.ycombinator.com
WebRTC is optimized to aggressively drop packets for latency, but LLMs require reliable transport — a dropped packet in a complex prompt can yield ‘garbage’ responses; WebSockets remain preferable when accuracy outweighs millisecond gains.
Skywork.ai — ‘OpenAI Realtime API Review 2025: Honest Pros & Cons’ skywork.ai
Token-based billing charges for accumulation and Voice Activity Detection even during pauses, leading to >100,000 input tokens for a single session; community calls the pricing ‘insane’ for B2C and ‘VC-funded garbage’ outside high-value B2B agents.
eWeek — OpenAI Realtime prompting tips eweek.com
Best practice is to instruct the model via system prompt to use verbal preambles like ‘Let me look that up’ before tool calls, masking latency rather than reducing it; pronunciation guides and an explicit ‘Variety Rule’ are recommended to prevent robotic repetition.
Addy Osmani — ‘Loop Engineering’ addyosmani.com
A loop consists of a trigger, context, action, and a mandatory verification step… the parent orchestrator currently manages the dependency graph manually, lacking a shared task list or peer-to-peer messaging between subagents.
Pulumi blog — ‘Stop Prompting, Design the Loop’ pulumi.com
Steinberger and others advocate for workflows that utilize the -y (yes to all) flag or custom configurations to disable session feedback, allowing agents to run at ‘inference speed’ without manual intervention.
ProveAI — ‘The Compound Reliability Problem’ proveai.com
A workflow consisting of ten sequential steps, each with an individually impressive 95% success rate, yields an end-to-end reliability of only ~60% (0.95^10)… at 99% per-step accuracy, a 50-step stack fails roughly four out of ten times.
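The arithmetic behind both figures can be checked in a few lines. This sketch assumes each step succeeds or fails independently with the same probability, which is the simplification the excerpt's 0.95^10 formula implies; the function name is illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds,
    assuming independent steps with identical success rates."""
    return per_step ** steps

ten_step = end_to_end_success(0.95, 10)     # roughly 0.599
fifty_step = end_to_end_success(0.99, 50)   # roughly 0.605

print(f"10 steps at 95%: {ten_step:.1%} success")
print(f"50 steps at 99%: {1 - fifty_step:.1%} failure")
```

Both chains land near 60% end-to-end success, so the 50-step stack at 99% per step fails about 39.5% of the time, matching the "four out of ten" figure.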
Stanford Digital Economy Lab digitaleconomy.stanford.edu
Agentic tasks consume 1,000x more tokens than simple chat or reasoning… while per-token pricing fell 280x over two years, total enterprise spend has surged by 320% due to this explosion in volume.
SpillwaveSolutions parallel-worktrees (GitHub) github.com
The parallel-worktrees skill exploits LLM non-determinism as a feature, instructing Claude to generate multiple valid solutions to the same problem simultaneously across different environments via a /spawn command.
Medium — ‘I ship code I don’t read: lessons from OpenClaw’ medium.com
Steinberger critiques MCP as a ‘crutch’ that leads to context pollution… critics question the long-term maintainability of ‘fine slop’ — code shipped at high speed that the human operator may not have fully read or understood.