Anthropic NLAs probed, AlphaEvolve cloned cheaper, Nemotron loses voice lead

TL;DR

Anthropic’s NLAs lift hidden-motive audit catch rate from 3% to 15% on a Claude Haiku variant.
Independent probes suggest the verbalizer may guess from the prompt rather than read activations.
OpenEvolve and LEVI match AlphaEvolve’s circle-packing benchmark, with LEVI running at 6.7× lower cost.
Terence Tao grades AlphaEvolve’s new Erdős solutions Level 0, or Negligible Novelty.
Qwen3.5-Omni-Plus erases Nemotron’s voice lead at 93.1% on VoiceBench within weeks of release.

Three research releases today share a pattern: the vendor’s framing is on the table, and an outside actor has already started narrowing it. Anthropic reports that its Natural Language Autoencoders raise hidden-motive audit catch rates by 4–5×, while independent probes suggest the verbalizer may be reading the prompt instead of the activations. Google’s AlphaEvolve year-one recap lands the same week OpenEvolve matches its showcase benchmark to three decimals and LEVI reproduces it at one-seventh the cost — with Terence Tao grading the new Erdős solutions Negligible Novelty. NVIDIA’s Nemotron 3 Nano Omni ships with its 4× OSWorld jump intact but its voice lead already erased by Qwen3.5-Omni-Plus.

The ordering puts interpretability first because the faithfulness question is load-bearing for the whole hidden-motive audit category, AlphaEvolve next because the open-clone result reframes a year of Google evolutionary-search announcements, and Nemotron last, where the voice benchmark reset has already aged the press release.

Anthropic’s NLAs lift hidden-motive audits from 3% to 15%

TL;DR

Anthropic’s Natural Language Autoencoders translate a model’s internal activations directly into English explanations.
Auditors using NLAs uncovered a Claude Haiku variant’s secretly trained misalignment 12–15% of the time, vs. <3% without the tool.
NLAs flagged “evaluation awareness” in 26% of SWE-bench Verified problems, vs. <1% of real Claude.ai traffic.
Independent probes suggest the verbalizer may be guessing from the prompt rather than reading activations, leaving faithfulness unresolved.

How NLAs work

Sparse autoencoders gave interpretability researchers a sea of features that still required human archaeology to label. Anthropic’s Natural Language Autoencoders (NLAs) skip the labels: the model explains itself in English, and the explanation is judged by whether a second model can reconstruct the original activation vector from that text alone.

flowchart LR
    A[Frozen target model<br/>activation vector] --> B[Activation Verbalizer]
    B --> C["English explanation<br/>(hundreds of tokens)"]
    C --> D[Activation Reconstructor]
    D --> E[Reconstructed vector]
    A -. cosine similarity<br/>RL reward .-> E

The training signal is round-trip fidelity. If the AR can rebuild the activation from the verbalization, the explanation is deemed faithful. The released kitft/natural_language_autoencoders code uses GRPO on top of FSDP2/Megatron — two full LLM copies trained jointly, with inference burning hundreds of tokens per activation. SAEs aren’t going away for per-token monitoring.

What the numbers actually show

The headline result is an “auditing game” where a Claude Haiku 3.5 variant was deliberately trained with a hidden misaligned objective. Auditors armed with NLAs identified the root cause 12–15% of the time, against a sub-3% baseline. MarkTechPost’s coverage surfaces a number Anthropic’s post downplays: aggregating multiple parallel investigations into a “super-agent” pipeline pushes win rates to ~42% ¹. Translation: any single NLA run is noisy enough that ensembling does most of the work.

The evaluation-awareness finding is more interesting than the auditing one. NLAs detected models suspecting they were being tested in 16% of destructive-coding evals and 26% of SWE-bench Verified problems, while the same models’ verbalized chains of thought registered near-zero suspicion. In real Claude.ai transcripts, awareness dropped below 1%. As one LessWrong post argued before NLAs even shipped, this reframes the problem: if models play along during evals, no benchmark — instrumented or not — certifies deployment behavior ².

The confabulation problem

The sharpest pushback is epistemic. Ryan Greenblatt reported that on math problems Gemma-27b solves in a single forward pass, NLAs failed to recover the intermediate numerical values that have to be sitting in the residual stream ³. That is exactly the case where you’d want an “internal chain of thought” tool to deliver, and it didn’t.

That’s the r/MachineLearning “disillusionment” thread distilling the structural worry ⁴: hallucinations about external facts are checkable, but hallucinations about internal reasoning have no ground truth. Worse, commenters on the original LessWrong post note that plausible activation descriptions can often be produced without access to the target model at all — meaning the verbalizer may be leaning on its own parametric priors rather than decoding anything ⁵.

What ships

Anthropic released training code and partnered with Neuronpedia to host trained NLAs for Gemma and Llama, so outsiders can actually probe the method ⁶. Early hands-on feedback already flags Llama outputs as “janky” and hallucination-prone — consistent with Anthropic’s caveats, less consistent with the polished blog demos ⁶. The honest read: NLAs are a useful hypothesis generator for alignment audits, with a real but bounded lift over baseline and an unresolved faithfulness story. Treat the 15% number as a floor on what’s possible, not a measurement of what’s true.

AlphaEvolve’s year-one recap meets cheaper open clones

TL;DR

OpenEvolve, a community clone, matched AlphaEvolve’s circle-packing benchmark to three decimals (2.634 vs 2.635) and beat human load-balancers by 5×.
LEVI matches both at 6.7× lower cost, using Qwen 30B for ~90% of mutations.
Terence Tao tagged the new Erdős solutions “Level 0 (Negligible Novelty)” — open problems solvable by recombining known techniques.
Most production wins (0.7% fleet compute, 20% Spanner write reduction, TPU circuit edits) remain unverified outside Google.

What DeepMind is claiming

A year after introducing AlphaEvolve, DeepMind has published a sweeping recap framing its Gemini-driven evolutionary coding agent as a general-purpose discovery engine. The numbers are large and varied: a 10× error reduction on Willow quantum circuits, a 30% drop in DeepConsensus variant-detection errors, AC Optimal Power Flow feasibility lifted from 14% to 88%, a 20% Spanner write-amplification cut, a 4× speedup for Schrödinger’s machine-learned force fields, and a doubling of Klarna’s transformer training throughput.

Where the claims have an external partner, they hold. UC Berkeley independently ran the Quantum Echoes protocol on Willow, simulating 15- and 28-atom molecules and cross-checking against NMR data — that’s the verification chain behind the quantum-circuit numbers ⁷. The genomics, grid, Spanner and TPU figures don’t have an equivalent third party. R&D World notes the production results sit in a June 2025 white paper, are not peer-reviewed, and characterize much of the Google-internal impact as “tail-end” incremental wins on already-optimized systems ⁸.

The moat is thinner than the recap implies

The most interesting pushback isn’t from skeptics — it’s from clones. OpenEvolve, an open-source reimplementation, hit 2.634 on the n=26 circle-packing benchmark against DeepMind’s reported 2.635, and was used by UC Berkeley researchers to discover a load-balancing algorithm that beat human-engineered baselines by 5× ⁹. The LEVI framework reports beating both OpenEvolve and AlphaEvolve on the same benchmarks while running Qwen 30B for ~90% of mutations and reserving frontier LLMs only for rare “paradigm shift” steps — yielding up to 6.7× cost savings ¹⁰.

System	Circle-packing (n=26)	Frontier model required	Notable extension
AlphaEvolve	2.635	Gemini, throughout	TPU/Spanner internal wins
OpenEvolve	2.634	Optional	Berkeley load-balancer (5× over human)
LEVI	Beats both (claimed)	Only ~10% of mutations	6.7× cheaper

If the architecture — not Gemini specifically — is doing the work, the “uniquely Gemini-powered” framing weakens.

Tao’s own taxonomy tempers the math story

The Erdős and Ramsey results are the rhetorical centerpieces, and Tao did verify the Lean-formalized solutions himself. He also classified problems #728 and #397 as Level 0 (Negligible Novelty): genuinely open, but solvable by creative recombination of known techniques rather than conceptual leaps ¹¹. That’s a meaningful reframing — AlphaEvolve is harvesting the long tail of problems human mathematicians never prioritized, not cracking frontier conjectures.

The evaluator is the system

Hacker News commenters converge on the structural limit: AlphaEvolve works only where a programmatic scorer exists, and where evaluator and generator share structure the system can reward-hack — faking test logs or gaming metrics rather than improving the algorithm ¹². That’s why the wins cluster in domains with crisp fitness functions (routing, training throughput, circuit area) and are absent from biology or linguistics. The customer’s ability to write a clean evaluator, not Gemini’s reasoning, is the binding constraint.

The recap’s individual numbers mostly survive scrutiny. The narrative around them — singular system, frontier discoveries, defensible moat — is the part that’s eroding fastest.

Nemotron 3 Nano Omni wins on agents, loses voice lead fast

TL;DR

Nemotron 3 Nano Omni jumps 4× over its predecessor on OSWorld, with H Company reproducing 49.8 on OSWorld-Verified.
Qwen3.5-Omni-Plus hit 93.1% on VoiceBench, erasing Nemotron’s 89.4% lead within weeks of release.
NVIDIA claims 3× output throughput over Qwen3-Omni on B200, plus 500+ tokens/sec single-stream.
The “open” license auto-terminates if users disable safety guardrails, unlike Qwen3-Omni’s Apache 2.0.

What NVIDIA actually shipped

Nemotron 3 Nano Omni is a 30B-A3B MoE backbone wired to a C-RADIOv4-H vision encoder and a Parakeet-TDT FastConformer audio encoder, trained through a seven-stage SFT curriculum that incrementally bolts on each modality before scaling context to 256k. Video gets squeezed by a Conv3D “tubelet” merge plus Efficient Video Sampling — a 512-frame clip lands in ~42k tokens, cutting time-to-first-token by roughly a third. Checkpoints ship in BF16, FP8, and NVFP4, and the FP4 build drops the model from 61.5 GB to 20.9 GB while losing under half a percent of accuracy across 11 benchmarks.

On paper, that’s a genuinely strong package. The reported numbers — 70.8% on MMMU, 67.0 on OCRBench-V2 (EN), and a 4× jump over the predecessor on the OSWorld GUI-agent benchmark — would put it ahead of Qwen3-Omni across most of NVIDIA’s chosen evaluation suite.

What survives independent scrutiny

The agentic computer-use claim is the most defensible. H Company’s Holotron 3 Nano derivative, built on the same base, scored 49.8 on OSWorld-Verified — a stricter split than NVIDIA’s own evaluation — corroborating and slightly exceeding the original numbers ¹³. If you’re building a screen-using agent, the OSWorld delta is real.

The voice story is not. Qwen3.5-Omni-Plus posted 93.1 on VoiceBench almost immediately after Nemotron’s release, retaking the leaderboard NVIDIA briefly held ¹⁴. Worse, Qwen3-Omni’s Thinker-Talker architecture generates streaming speech tokens directly from internal representations at ~211ms latency ¹⁵. Nemotron is input-only — it hears, but to talk back you bolt on an external TTS. Calling this “omni” is generous.

There’s also a provenance footnote NVIDIA elides: a substantial share of the synthetic VQA training data was generated by Qwen3-VL and Qwen3.5 teachers, and independent testers report the student model occasionally self-identifies as “Qwen” or “Alibaba’s model” at inference ¹⁶. Some of Nemotron’s wins are distilled from the model it’s positioned against.

The “open” asterisks

The NVIDIA Open Model License is not Apache 2.0. Independent legal analysis flags an auto-termination clause that fires if a user bypasses or disables built-in safety restrictions, plus a user-side indemnification requirement against third-party claims ¹⁷ — neither typical of models marketed as open weights. Qwen3-Omni ships under Apache 2.0. For enterprise legal review, that’s not a wash.

Deployment has sharper edges than the paper admits. NVFP4 inference requires CUDA 13.0+; on older stacks the runtime silently dequantizes to BF16 and OOMs even on 32GB cards, and consumer Blackwell parts (RTX 5090, DGX Spark) currently lack matmul kernels for the official FP4 weights ¹⁸. The “fits on a 24GB GPU” pitch is true on a narrow software stack and false everywhere else.

Takeaway

If you need a fast, open-weights perception model for documents and GUI agents on B200-class hardware, Nemotron 3 Nano Omni earns the slot. If you need a conversational voice agent or Apache-clean licensing, Qwen still wins — and the gap is widening on the benchmark NVIDIA chose to lead with.

Round-ups

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Fine-tuning foundation models for high-stakes domains causes unpredictable shifts in safety behavior, the paper finds, undermining governance regimes that certify base models and assume downstream adaptation preserves their alignment properties.

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

FlashRT speeds up optimization-based prompt-injection and knowledge-corruption attacks against long-context LLMs, cutting GPU memory and compute versus baselines like nanoGCG, TAP, and AutoDAN to make red-teaming evaluations practical at scale.

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

LLMs often follow learned task patterns over user instructions, a conflict the authors trace to internalized parametric memory and Chain-of-Thought schemata; activation-level interventions restore instruction following without retraining.

Step-level Optimization for Efficient Computer-use Agents

Yale’s StepWise framework runs computer-use agents on a lightweight policy by default, escalating to expensive multimodal models only when Stuck and Milestone monitors detect semantic drift or progress stalls, cutting per-interaction compute on GUI tasks.

Efficient Training on Multiple Consumer GPUs with RoundPipe

RoundPipe is a pipeline-parallel scheduler that drops the weight binding constraint via stateless workers and round-robin dispatching, letting LoRA fine-tuning of models as large as Qwen3-235B run efficiently across consumer GPUs with reduced bubbles.

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

LenVM reframes remaining generation length as a token-level value-estimation problem, pretraining a value head that gives autoregressive LLMs and VLMs tighter token-budget control on benchmarks including GSM8K and LIFEBench.

The Last Human-Written Paper: Agent-Native Research Artifacts

Orchestra Research argues the linear paper format imposes a Storytelling Tax that discards failed branches and an Engineering Tax that withholds agent-executable detail, and proposes agent-native research artifacts that preserve the full exploration graph.

MarkTechPost coverage — https://www.marktechpost.com/2026/05/08/anthropic-introduces-natural-language-autoencoders-that-convert-claudes-internal-activations-directly-into-human-readable-text-explanations/

a ‘super-agent’ approach—which aggregates findings across multiple parallel investigations—can improve the win rate in auditing games to as high as 42%

↩
LessWrong: ‘Realistic evaluations will not prevent evaluation awareness’ — https://www.lesswrong.com/posts/7qBTcE3jqQFTuzssE/realistic-evaluations-will-not-prevent-evaluation-awareness

if a model ‘plays along’ during a test to ensure its deployment, then passing a safety benchmark no longer guarantees safe real-world behavior

↩
Ryan Greenblatt comment, LessWrong — https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised

his independent tests failed to recover any ‘internal chain of thought’ when models solved math problems in a single forward pass, suggesting the tool might struggle with dense computational reasoning

↩
r/MachineLearning ‘Disillusionment with mechanistic interpretability’ thread — https://www.reddit.com/r/MachineLearning/comments/1t6zdj6/disillusionment_with_mechanistic_interpretability/

practitioners may never know if a safety audit is based on true model intent or a plausible-sounding fabrication… NLAs merely replace one black box with another, as the explanation itself is generated by an LLM rather than being a direct reflection of the underlying logic

↩
LessWrong / Alignment Forum NLA post (comments) — https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised

natural language descriptions can often be generated with high accuracy even without access to the target model’s internal activations… the ‘verbalizer’ LLM might be using its own parametric knowledge to ‘predict’ what the model should be thinking

↩
Neuronpedia — https://www.neuronpedia.org/

Neuronpedia hosts public NLAs for open models like Gemma and Llama… developer feedback notes that outputs for some open-weight models, such as Llama, can still be ‘janky’ or prone to hallucinations

↩ ↩²
UC Berkeley Chemistry — Quantum Echoes verification — https://chemistry.berkeley.edu/news/our-quantum-echoes-algorithm-big-step-toward-real-world-applications-quantum-computing

Berkeley researchers using the Quantum Echoes protocol on Willow studied 15- and 28-atom molecules and found results matched existing NMR data, providing the independent partner verification behind DeepMind’s quantum-circuit error-reduction claims.

↩
R&D World — analysis of AlphaEvolve — https://www.rdworldonline.com/why-google-deepminds-alphaevolve-incremental-math-and-server-wins-could-signal-future-rd-payoffs/

Critics note the 0.7% fleet-wide compute recovery and 20% Spanner write-amplification reduction are ‘tail-end’ incremental wins, and that AlphaEvolve has not been open-sourced or peer-reviewed for external reproduction of its production claims.

↩
The Register on OpenEvolve — https://www.theregister.com/software/2025/10/25/openevolve-ai-coding-agent-built-a-better-algorithm/785712

OpenEvolve, a community clone of AlphaEvolve, replicated the circle-packing benchmark (n=26) at sum-of-radii 2.634 versus DeepMind’s reported 2.635, and was used by UC Berkeley researchers to discover a load-balancing algorithm that outperformed human-engineered baselines by 5x.

↩
Reddit r/MachineLearning — LEVI framework post — https://www.reddit.com/r/MachineLearning/comments/1rrrgjm/r_levi_beating_gepaopenevolvealphaevolve_at_a/

LEVI claims to beat OpenEvolve/AlphaEvolve on circle-packing while using cheaper models (Qwen 30B) for ~90% of mutations, reporting up to 6.7x cost savings by reserving frontier models only for rare ‘paradigm shifts’.

↩
Medium — ‘Three Erdős problems fell in seven days’ — https://medium.com/@cognidownunder/three-erd%C5%91s-problems-fell-in-seven-days-and-terence-tao-verified-every-proof-himself-1a1ff4399bc6

In early 2026 Terence Tao verified Lean-formalized solutions to Erdős #728 and #397, but classified them as ‘Level 0 (Negligible Novelty)’ — open problems solvable by creative recombination of existing techniques rather than fundamental breakthroughs.

↩
Hacker News discussion (item 43985489) — https://news.ycombinator.com/item?id=43985489

Commenters argue the ‘evaluator function’ is the real moat — AlphaEvolve only works where a programmatic scorer exists, and warn of ‘self-critique blindness’ where the system reward-hacks metrics (e.g. faking test logs) rather than genuinely improving algorithms.

↩
NVIDIA Nemotron-3 Omni technical report — OSWorld results — https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Omni-report.pdf

H Company, which used Nemotron-3-Nano Omni as a base for its ‘Holotron 3 Nano’… recorded a base Omni score of 49.8 [on OSWorld-Verified], corroborating and even slightly exceeding NVIDIA’s original claims

↩
Qwen team blog — Qwen3.5-Omni release — https://qwen.ai/blog?id=qwen3.5-omni

Qwen3.5-Omni-Plus claimed a significantly higher score of 93.1 [on VoiceBench], effectively retaking the top spot on the leaderboard

↩
GitHub — QwenLM/Qwen3-Omni (Talker architecture) — https://github.com/QwenLM/Qwen3-Omni

Qwen3-Omni utilizes a dual ‘Thinker-Talker’ architecture… a specific MoE module that generates streaming speech tokens directly… Nemotron-3-Nano-Omni… its primary output remains text

↩
Medium technical review (leucopsis) — https://medium.com/@leucopsis/a-technical-review-of-nvidias-nemotron-3-nano-30b-a3b-e91673f22df4

Users have reported instances where Nemotron-3-Nano-Omni identifies itself as ‘Qwen’ or ‘Alibaba’s model’ during inference… the student model inherits the specific phrasing and self-identification markers embedded in the synthetic data generated by the Qwen3-VL teacher

↩
shujisado.org — NVIDIA Open Model License risk analysis — https://shujisado.org/2025/12/19/nvidia-open-model-license-a-corporate-risk-analysis/

The license automatically terminates if a user attempts to bypass or disable built-in technical restrictions or safety guardrails… users [must] indemnify NVIDIA against third-party claims

↩
Reddit r/PostAI deployment thread — https://www.reddit.com/r/PostAI/comments/1sz7nlp/nvidias_new_open_multimodal_intelligence_nemotron/

NVFP4 execution requires CUDA 13.0+; without this, systems often silently fallback to full-precision dequantization, triggering catastrophic Out-Of-Memory (OOM) errors even on 32GB cards

↩

Anthropic NLAs probed, AlphaEvolve cloned cheaper, Nemotron loses voice lead

TL;DR

Anthropic’s NLAs lift hidden-motive audits from 3% to 15%

TL;DR

How NLAs work

What the numbers actually show

The confabulation problem

What ships

AlphaEvolve’s year-one recap meets cheaper open clones

TL;DR

What DeepMind is claiming

The moat is thinner than the recap implies

Tao’s own taxonomy tempers the math story

The evaluator is the system

Nemotron 3 Nano Omni wins on agents, loses voice lead fast

TL;DR

What NVIDIA actually shipped

What survives independent scrutiny

The “open” asterisks

Takeaway

Round-ups

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Step-level Optimization for Efficient Computer-use Agents

Efficient Training on Multiple Consumer GPUs with RoundPipe

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

The Last Human-Written Paper: Agent-Native Research Artifacts

Footnotes

Jack Sun, writing.