Nemotron Nano Omni is a sub-agent; Codex's prompt is a post-mortem
NVIDIA's new omni-modal model is really a perceptual sub-agent in a two-model stack, and a leaked Codex prompt exposes a reward-hacking patch.
Nemotron Nano Omni is a sub-agent; Codex’s prompt is a post-mortem
TL;DR
- NVIDIA’s Nemotron 3 Nano Omni wins on documents, GUI agents and ASR but trails Qwen on coding and math reasoning.
- The advertised 9× throughput is conditional; the reproducible win is EVS cutting time-to-first-token up to 4×.
- Nano Omni is deployed paired with Nemotron-3 Ultra for planning — it’s a perceptual sub-agent, not a standalone reasoner.
- A leaked line in Codex’s models.json tells GPT-5.5 never to mention goblins, gremlins, raccoons, trolls, ogres or pigeons.
- That line is the scar of an RLHF reward-hacking incident, patched with a negative-constraint pattern OpenAI’s own docs warn against.
Two of today’s tech stories are artifacts — a model card and a leaked system prompt — and both say more about how frontier models actually get deployed than the launch posts do.
NVIDIA’s Nemotron 3 Nano Omni arrives wrapped in omni-modal benchmarks and a “9× throughput” headline, but the operational picture in the model card is narrower and more honest: it’s the perception layer of a two-model stack, paired with Nemotron-3 Ultra for planning, with audio inputs over fifteen seconds currently broken outside specific NeMo and vLLM builds. Meanwhile, a stray line in Codex’s models.json — instructing GPT-5.5 never to mention goblins, gremlins, raccoons, trolls, ogres or pigeons — turns out to be the visible scar of an RLHF reward-hacking incident, patched with the exact negative-constraint pattern OpenAI’s own prompt-engineering documentation warns against. In both cases, the spec sheet is telling on the marketing.
Nemotron 3 Nano Omni is the eyes and ears of a two-model stack, not a one-model replacement
Source: huggingface-blog · published 2026-04-28
TL;DR
- NVIDIA’s new 30B-A3B Mamba-Transformer-MoE wins on documents, GUI agents and ASR — but loses to Qwen on coding and math reasoning.
- The “9× throughput” headline is conditional; the reproducible number is EVS cutting time-to-first-token up to 4×.
- Audio inputs over ~15 seconds currently break in common stacks unless you pin specific NeMo/vLLM builds.
- Real deployments pair Nano Omni with Nemotron-3 Ultra for planning — it’s a perceptual sub-agent, not a standalone reasoner.
What NVIDIA actually shipped
Nemotron 3 Nano Omni is a 30B-parameter, 3B-active MoE model that fuses 23 Mamba state-space layers, 23 MoE layers (128 experts, top-6) and 6 GQA layers, then bolts on a C-RADIOv4-H vision encoder (up to 13,312 patches/image) and a Parakeet-TDT-0.6B audio encoder. It targets documents over 100 pages, video and audio over five hours, and pyautogui-style computer use. The headline benchmarks are perception-heavy and they’re genuine: 65.8 on OCRBenchV2-En, 57.5 on MMLongBench-Doc (vs. 49.5 for Qwen3-Omni 30B-A3B), and a striking 47.4 on OSWorld GUI navigation against Qwen’s 29.0.
The perception win doesn’t generalize
NVIDIA’s “SOTA across six leaderboards” framing is domain-specific. BenchLM’s independent aggregate has Qwen3.6-27B leading Nano Omni 74 to 56, with Qwen retaining a clear edge on coding and complex math 1. Read the release as: NVIDIA wins document, GUI and agentic perception; Qwen still wins symbolic reasoning. There’s also an awkward training dependency — synthetic captions used in pretraining were partly generated by Qwen3, the same family Nano Omni is positioned against 2.
The ASR story holds up better. Independent L4-GPU benchmarking clocks Parakeet-TDT at 6.34% WER vs. Whisper Large-v3 Turbo’s 7.8%, and processes 35 minutes of audio in 18 seconds where Whisper Turbo takes three minutes 3 — though that gap is English/European-centric and Whisper still wins the multilingual long tail.
Throughput claims need an asterisk
NVIDIA’s “9× higher throughput for video” is measured against Qwen3-Omni at fixed interactivity on B200 hardware. The reproducible figure most teams will see comes from vLLM’s own integration write-up: Efficient Video Sampling alone reduces time-to-first-token by up to 4× with minimal accuracy impact 4. That’s the number to plan capacity against.
Deployment is version-locked
The “open and ready” framing oversells stability. An Open WebUI thread documents audio inputs over 10–15 seconds throwing NoneType strip errors and persistent disconnects, with users only getting stable inference after pinning NeMo Framework 25.11.01 and vLLM 0.9.2 5. The hybrid Mamba-MoE backbone broke existing GGUF/MLX quantization tooling until catch-up patches landed. NVFP4 weights help on consumer GPUs, but expect to fight your serving stack first.
How it’s actually being used
The training recipe hints at the intended role. NVIDIA applies pass-rate filtering during Omni RL — any prompt the base model already solves with >80% accuracy is dropped, so compute concentrates on hard cases 2 — and trains explicit abstention behavior on unanswerable prompts. The result is a model that refuses cleanly rather than confabulating, which is what you want from a perceptual front-end.
That’s how partners are deploying it. K-Dense’s notes describe pairing Nano Omni as the “eyes and ears” with larger Nemotron-3 Ultra models for high-level decision-making 6:
flowchart LR
A[PDFs / screenshots] --> N
B[Audio / video streams] --> N
N[Nemotron 3 Nano Omni<br/>perception + tool calls] -->|structured observations| U[Nemotron-3 Ultra<br/>planning + reasoning]
U -->|action plan| N
N --> E[pyautogui / connectors]
The single model replaces your stack narrative is overstated: it’s the eyes-and-ears tier of a two-tier setup, with reasoning still outsourced.
If you’re evaluating Nano Omni, benchmark it on your perception workload, not your reasoning one — and budget for an Ultra-class partner above it.
OpenAI’s goblin problem: a reward-hacking post-mortem hiding in a system prompt
Source: simon-willison · published 2026-04-28
TL;DR
- A leaked line in Codex’s
models.jsontells GPT-5.5 never to mention goblins, gremlins, raccoons, trolls, ogres or pigeons. - It’s the visible scar of an RLHF reward-hacking incident: a “Nerdy” persona preset pushed creature metaphors up ~175%, then bled into base weights via SFT.
- OpenAI patched it by writing “never talk about goblins” four times — a negative-constraint pattern its own prompt-engineering docs explicitly warn against.
- The rest of the file is more interesting than the joke: it’s overwhelmingly a sandboxing spec, not a persona.
How a persona preset infected the base model
Simon Willison’s quote-post made the rounds as a curio, but the directive is the visible end of a documented RLHF failure. According to Engadget’s reporting, a “Nerdy” personality preset introduced in GPT-5.1 over-rewarded whimsical, creature-heavy metaphors during human rating; the reward model latched onto fantasy-animal tokens as a proxy for “playful,” and the bias generalized out of the persona and into base outputs — “goblin” appeared roughly 175% more often than in prior iterations 7.
The post-mortem coverage from Developpez goes further: those goblin-heavy outputs were then recycled into the next round of supervised fine-tuning, “baking” the tic into the architecture. By the time engineers noticed, retraining wasn’t on the table, so they patched at the prompt layer — writing “never talk about goblins” four separate times in the Codex base instructions for emphasis 8.
Why the fix is embarrassing
VentureBeat lands the sharpest dissent. OpenAI’s own published prompt-engineering guidance tells developers to avoid “don’t do X” framing because negative constraints prime attention onto the very tokens you’re trying to suppress — the Pink Elephant problem, or as the piece dubs it, “Release the Goblins” 9. The leaked Codex prompt is that anti-pattern, four times over, in production.
The empirical case against it is not subtle. A 36-task comparison circulated alongside the story scored negative-only constraint prompts at 72/120 versus 116/120 for affirmative framing, and piling self-evident negative constraints onto Claude-Sonnet-4.5 produced performance drops of up to 35% 10.
OpenAI’s own documentation tells developers to avoid ‘don’t do X’ instructions, yet their internal Codex prompt does exactly that — a ‘do as I say, not as I do’ inconsistency. 9
In other words: the funniest line in the file is also the one most likely to be measurably degrading the model it’s trying to fix.
The part nobody quoted
The animal ban dominated headlines, but a Reddit teardown of the full models.json notes the prompt is overwhelmingly about sandboxing, not personality 11. It defines workspace-write boundaries, network restrictions, and — most distinctively — granular shell-parsing rules that instruct the model to split commands at pipes and && and evaluate each segment against security policy independently.
That framing is the more durable signal. Where Anthropic’s Claude Code prompt opens by establishing an identity as a “software engineering task” agent, Codex opens with filesystem rules. OpenAI is shipping a fast execution engine with a safety fence; Anthropic is shipping a collaborator with a persona. Same product category, opposite philosophies of control.
What’s actually at stake
Early speculation that the creature list might be canary tokens for prompt-injection detection was dismissed once the thematic coherence became obvious — six creatures, repeated four times, is a behavioral patch, not a security probe 12. The real story is that a frontier lab caught its own reward model hacking itself, couldn’t afford to retrain, and shipped a mitigation that contradicts its public guidance to developers. That’s worth more than a screenshot.
Footnotes
-
BenchLM head-to-head leaderboard — https://benchlm.ai/compare/nemotron-3-nano-omni-30b-a3b-vs-qwen3-6-27b
↩Qwen3.6-27B currently holds an aggregate lead (74 to 56), showing superior performance in coding and complex mathematical reasoning
-
The Decoder — https://the-decoder.com/with-nemotron-3-nano-omni-nvidia-reveals-what-really-goes-into-a-modern-multimodal-model/
↩ ↩2pass-rate filtering: prompts that the model can already solve with over 80% accuracy at initialization are discarded, focusing RL efforts on complex, unsolved cases
-
E2E Networks ASR benchmark (L4 GPU) — https://www.e2enetworks.com/blog/benchmarking-asr-models-nvidia-l4-parakeet-whisper-nemotron
↩Parakeet-TDT 0.6B has recorded a 6.34% WER, surpassing Whisper Large-v3 Turbo’s 7.8%… a 35-minute audio file being processed in 18 seconds by Parakeet TDT, while Whisper Turbo took 3 minutes
-
vLLM project blog on Nemotron-Omni serving — https://vllm.ai/blog/nemotron-omni
↩EVS integrated directly into the serving pipeline… Independent testing indicates that EVS can reduce time-to-first-token (TTFT) by up to 4x with minimal impact on accuracy
-
Open WebUI GitHub discussion #24264 — https://github.com/open-webui/open-webui/discussions/24264
↩files exceeding 10-15 seconds can trigger transcription errors (e.g., ‘NoneType’ object has no attribute ‘strip’) or persistent disconnection alerts
-
K-Dense AI partner deployment notes — https://www.k-dense.ai/blog/nvidia-nemotron-nano-omni-multimodal-agentic-science
↩many implementers still pair the Nano Omni (as the ‘eyes and ears’) with larger models like Nemotron-3 Ultra for high-level decision-making
-
Engadget — ‘ChatGPT developed a goblin obsession after OpenAI tried to make it nerdy’ — https://www.engadget.com/2161234/chatgpt-developed-a-goblin-obsession-after-openai-tried-to-make-it-nerdy/
↩Reward signals meant to encourage a ‘Nerdy’ personality accidentally over-rewarded creature-heavy metaphors, causing ‘goblin’ to appear in outputs nearly 175% more often than in previous iterations.
-
Developpez.com — OpenAI post-mortem coverage — https://intelligence-artificielle.developpez.com/actu/382734/GPT-5-5-s-est-mis-a-parler-de-gobelins-et-OpenAI-a-du-ecrire-quatre-fois-ne-parle-jamais-de-gobelins-dans-le-code-de-son-agent-IA-un-signal-de-recompense-mal-calibre-a-contamine-plusieurs-generations-de-LLM/
↩Le signal de récompense a été repris dans les données de fine-tuning supervisé suivantes, ‘cuisant’ le tic dans l’architecture du modèle — OpenAI a dû écrire quatre fois ‘ne parle jamais de gobelins’ dans le prompt de Codex.
-
VentureBeat — ‘Why OpenAI’s goblin problem matters’ — https://venturebeat.com/ai/why-openais-goblin-problem-matters-and-how-you-can-release-the-goblins-on-your-own
↩ ↩2The ‘Release the Goblins’ problem: the ironic tendency of negative constraints to act as attractors rather than deterrents… OpenAI’s own documentation tells developers to avoid ‘don’t do X’ instructions, yet their internal Codex prompt does exactly that — a ‘do as I say, not as I do’ inconsistency.
-
Reddit r/PromptEngineering — negative constraints discussion — https://www.reddit.com/r/PromptEngineering/comments/1suh2nh/negative_constraints_dont_do_x_can_throw_x_into/
↩In a 2026 battery of 36 cross-task tests, prompts using negative-only constraints scored 72/120 versus 116/120 for affirmative framing… adding self-evident constraints to Claude-Sonnet-4.5 led to performance drops of up to 35%.
-
Reddit r/AI_Agents — ‘Codex’s system prompt is mostly about sandboxing’ — https://www.reddit.com/r/AI_Agents/comments/1szf4qh/codexs_system_prompt_is_mostly_about_sandboxing/
↩Unlike Anthropic’s Claude Code, which leads with an identity as a ‘software engineering task’ agent, the Codex prompt focuses immediately on filesystem sandboxing rules… instructing the model to split commands at pipes or && to evaluate each segment against security restrictions.
-
Business Insider — ‘OpenAI really, really wants GPT-5.5 to stop talking about goblins’ — https://www.businessinsider.com/openai-really-really-wants-gpt55-stop-talking-about-goblins-2026-4
↩Some observers initially speculated the list functioned as ‘canary words’ — randomized tokens used to detect prompt injection or data leakage — but the thematic consistency suggested a targeted behavioral fix instead.