Wei (Jack) Sun

OpenAI's goblin fix, evals as bottleneck, Willison's `llm` goes typed

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

Where the goblins came from openai.com

How goblin outputs spread in AI models: timeline, root cause, and fixes for the personality-driven quirks in GPT-5's behavior.

AI evals are becoming the new compute bottleneck huggingface.co

LLM 0.32a0 is a major backwards-compatible refactor simonwillison.net

I just released LLM 0.32a0, an alpha release of my LLM Python library and CLI tool for accessing LLMs, with some consequential changes that I’ve been working towards for quite a while. Previous versions of LLM modeled the world in terms of prompts and responses. Send the model a text prompt, get back a text response.

```python
import llm

model = llm.get_model("gpt-5.5")
response = model.prompt("Capital of France?")
print(response.text())
```

This made sense when I started working on the libra…

llm 0.32a1 simonwillison.net

Release: llm 0.32a1 Fixed a bug in 0.32a0 where tool-calling conversations were not correctly reinflated from SQLite. #1426 Tags: llm

llm 0.32a0 simonwillison.net

Release: llm 0.32a0 See the annotated release notes. Tags: llm

Granite 4.1 LLMs: How They’re Built huggingface.co

IBM walks through the architecture and training recipe behind its Granite 4.1 LLM family on the Hugging Face blog, detailing how the open-weights models were built for enterprise deployment.

References

VentureBeat (Carl Franzen) venturebeat.com

A developer (@arb8020) discovered the ‘restraining order’ buried in the models.json file of the OpenAI Codex GitHub repo — a directive repeated four times commanding the model to ‘never talk about goblins, gremlins, raccoons, trolls, ogres, [or] pigeons’ unless absolutely relevant.

Slashdot tech.slashdot.org

OpenAI Codex System Prompt Includes Explicit Directive To ‘Never Talk About Goblins’ — the public surfacing of the patch preceded OpenAI’s own postmortem and is what forced the disclosure.

Surf AI Pulse asksurf.ai

RLHF is more fragile than anyone admitted… a single aesthetic choice was able to derail a multi-billion-parameter model; the goblin tic is the tip of a much harder-to-quantify iceberg.

Jurgen Gravestein (Substack) jurgengravestein.substack.com

Laurie Voss shared the postmortem with the blunt summary: ‘We have no idea what we’re doing.’

Anthropic research blog anthropic.com

Models that learned to cheat in coding environments spontaneously developed broader misaligned traits — alignment faking, sabotage, cooperation with malicious actors — in unrelated chat contexts.

Ian Leslie (Substack) ian-leslie.com

The sudden explosion of ‘delve’ in scientific abstracts has been correlated with LLM adoption and traced to RLHF rater populations — verbal tics function as a stochastic fingerprint of training pipelines.

GoPenAI analysis of PaperBench blog.gopenai.com

PaperBench Code-Dev — a lightweight variant that skips the execution phase to reduce costs to roughly $10 per paper — is criticized for being less robust, showing only a weak correlation (Pearson r=0.48) with full replication performance.

ResearchGate summary of HAL paper researchgate.net

In 21 out of 36 tested settings, increasing the reasoning token budget actually lowered accuracy — a ‘reasoning paradox’ that contradicts the assumption that more inference-time compute always yields better outcomes.

Sally Liu, ‘Deep Dive on OpenAI’s MLE-Bench’ (Medium) sallysliu.medium.com

MLE-Bench’s standard deviation is roughly 4.4, nearly 10× higher than SWE-Bench’s 0.49 — meaning a model claiming a 17% success rate could realistically fluctuate between 12% and 22% across runs.

tinyBenchmarks (arXiv 2402.14992-style IRT paper) arxiv.org

Evaluating LLMs on as few as 100 curated ‘anchor items’ can estimate full-benchmark performance with only ~2% error — a 140–160× cost reduction via item response theory.
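The estimation principle is worth sketching. Below is a toy one-parameter (Rasch) version of the idea, with simulated Gaussian item difficulties standing in for a real benchmark; the actual tinyBenchmarks method fits a richer IRT model to real response data, so treat this as an illustration of anchor-item extrapolation only, not the paper's pipeline.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def estimate_ability(responses, difficulties, steps=300, lr=0.5):
    """Fit a Rasch (1-parameter IRT) ability theta by gradient ascent
    on the log-likelihood of right/wrong anchor-item responses."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(r - sigmoid(theta - b) for r, b in zip(responses, difficulties))
        theta += lr * grad / len(responses)
    return theta

def predict_accuracy(theta, difficulties):
    """Expected accuracy over a set of items under the fitted model."""
    return sum(sigmoid(theta - b) for b in difficulties) / len(difficulties)

# Simulated benchmark: 2,000 items with Gaussian difficulties; the
# first 100 serve as the anchor set actually administered to the model.
random.seed(0)
true_theta = 0.8
full_bench = [random.gauss(0, 1) for _ in range(2000)]
anchors = full_bench[:100]
responses = [1 if random.random() < sigmoid(true_theta - b) else 0 for b in anchors]

theta_hat = estimate_ability(responses, anchors)
print(round(predict_accuracy(theta_hat, full_bench), 3),
      round(predict_accuracy(true_theta, full_bench), 3))
```

Scoring 100 anchor items instead of 2,000 is where the cost reduction comes from: the fitted ability carries the signal, and full-benchmark accuracy is read off the model rather than measured directly.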

Sayash Kapoor on CXOTalk (‘AI Snake Oil’) cxotalk.com

Indefinite accuracy improvements through resampling are only possible if the verifier is perfect; with imperfect verifiers, generating more samples increases the probability of selecting a false positive, degrading reliability rather than enhancing it.
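That compounding is simple to quantify. As a minimal sketch (the 5% false-positive rate below is an illustrative assumption, not a measured figure): on a task the model cannot solve at all, the chance that some wrong sample slips past the verifier approaches certainty as the sample budget grows.

```python
def p_false_positive(n: int, fp_rate: float = 0.05) -> float:
    """Probability that at least one of n incorrect samples is
    wrongly accepted by a verifier with the given false-positive rate."""
    return 1 - (1 - fp_rate) ** n

# Even a 5% false-positive rate compounds quickly under resampling:
for n in (1, 10, 100):
    print(n, round(p_false_positive(n), 3))
# → 1 0.05, 10 0.401, 100 0.994
```

With a perfect verifier (fp_rate = 0) this probability stays at zero and resampling can only help; any nonzero rate turns extra samples into extra chances to return a confidently wrong answer.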

Verga et al., ‘Panel of LLM Evaluators’ (Medium summary) medium.com

A Panel of LLMs (Haiku + Command-R + GPT-3.5) is roughly 7× cheaper than a single GPT-4 judge while achieving higher correlation with human judgment by neutralizing intra-model bias.

newreleases.io (simonw/llm 0.32a0 changelog mirror) newreleases.io

A bug in 0.32a0 broke the ability to correctly reinflate tool-using sessions from SQLite, addressed in 0.32a1 the following day.

youngju.dev — OpenAI Responses API & Agents SDK practical guide (Apr 2026) youngju.dev

The Responses API emits over 50 distinct SSE event types, using a union type for items so the API can stream separate deltas for tools, reasoning, and text.

OpenAI community forum — ‘Open Responses for the open-source community’ community.openai.com

An open-source initiative inspired by OpenAI’s Responses API that defines a shared schema for streaming events and agentic workflows across different LLM providers.

daily.dev aggregator discussion of LLM 0.32a0 app.daily.dev

Characterized the release as a ‘pivotal moment’ moving the project toward a more robust, modular Provider and Model hierarchy, while noting the only new CLI flag is `-R/--no-reasoning`.

solmaz.io link roundup solmaz.io

Users who previously used the llm-claude-3 plugin are now directed to migrate to the consolidated llm-anthropic package to ensure compatibility with 0.32 features such as extended-thinking signatures.

simonwillison.net /tags/llm — broader project context simonwillison.net

LLM 0.26 (May 2025) added native tool support; 0.32 builds on that by treating tool_call_name and tool_call_args as first-class typed stream events rather than text blobs to be parsed.

Jack Sun


Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what the industry ships, and occasionally writing it down.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare