DeepMind's wrong test, Church recodes E. coli, Princeton's MoE tradeoff
Three research results in clinical AI, synthetic biology, and MoE inference each shift in meaning once read alongside the comparison sitting next to them.
TL;DR
- DeepMind’s co-clinician scores 97/98 on commission errors, but BMJ data pins ~76.6% of severe clinical harm on omission; a Beth Israel pilot is the load-bearing result.
- Wang and Church recode E. coli’s ribosomes to drop isoleucine, shrinking the in-use code from 20 amino acids to 19 at 60–70% wild-type growth.
- The same generative stack used for the recoding can design toxin variants that evade DNA-order screening — a dual-use detail the announcement skips.
- Princeton’s TEMoE cuts MoE expert switching from 58% to 4.1% on gpt-oss-20b, but MATH accuracy drops from 71.5 to 64.0.
- Concurrent CommitMoE claims 1.3–9.4x inference speedups with no accuracy trade-off, raising the question of whether switch rate was the right target at all.
Today’s research lands in three unrelated domains — clinical decision support, recoded synthetic biology, and mixture-of-experts inference — but each story turns on the same move: the comparison that would reframe the headline is sitting one slide away from the announcement.
DeepMind’s co-clinician posts a 97/98 zero-critical-error score on a commission-error benchmark, while published BMJ data assigns ~76.6% of severe clinical harm to the omission errors the test doesn’t measure; the independent Beth Israel pilot is doing the actual lifting. Wang and Church’s E. coli recoded to 19 amino acids is a real biology win — and the same AI design stack quietly handles toxin variants that evade DNA-order screening. Princeton’s TEMoE pays 7.5 MATH points to cut MoE expert switching 14x, while concurrent CommitMoE claims similar speedups with no accuracy hit at all. Read each result, then read what’s next to it.
DeepMind’s AI co-clinician scores 97/98 — on the wrong test
Source: deepmind-blog · published 2026-04-30
TL;DR
- DeepMind’s “co-clinician” reuses the 2024 Talker-Reasoner architecture, now wrapped in clinical citation checks and multimodal Gemini/Astra I/O.
- The headline 97/98 zero-critical-error result measures commission errors; BMJ data says omission drives ~76.6% of severe clinical harm.
- An independent Beth Israel pilot is the stronger signal: 90% differential-diagnosis accuracy across 100 real pre-visit intakes, zero safety stops.
- Microsoft’s MAI-DxO + o3 chases sequential rare-disease diagnosis; DeepMind is chasing empathetic intake. They aren’t really benchmarked against each other.
The architecture isn’t new — the application is
Google DeepMind is pitching a “dual-agent” Planner/Talker split as the safety scaffolding for an AI that talks to patients under physician supervision. The topology is a direct lift of Christakopoulou et al.’s October 2024 Talker-Reasoner paper, which framed the design in Kahneman terms: a fast System-1 Talker handling fluid dialogue, a slower System-2 Reasoner doing planning asynchronously 1. That earlier work shipped inside a sleep-coaching agent. What’s new in April 2026 is the clinical groundedness layer — verified citations, an evidence-retrieval pipeline, and the NOHARM-derived evaluation harness — bolted on top of a battle-tested chassis.
```mermaid
flowchart LR
    P[Patient] <--> T[Talker agent<br/>live audio + video]
    T <-. monitored by .-> Pl[Planner agent<br/>safety + clinical bounds]
    T --> R[Retrieval +<br/>citation check]
    R --> KB[(Clinical evidence)]
    T --> Doc[Supervising physician]
    Doc --> P
```
Strong on commission, quiet on omission
In blind evaluations of 98 primary-care queries, the system logged zero critical errors in 97 cases and beat existing AI tools on the open-ended RxQA medication benchmark. That’s a real number — but it’s a number for errors of commission (saying something wrong). The BMJ’s editorial on conversational diagnostic AI argues the dominant failure mode is the opposite: errors of omission — failing to suggest a critical test or referral — account for roughly 76.6% of severely harmful errors, and USMLE-style scores correlate only weakly (r≈0.6) with downstream safety 2. DeepMind’s own telemedicine study quietly concedes the point: human PCPs still outperformed the agent at flagging red flags and ordering critical examinations.
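A toy sketch makes the metric gap concrete (the case data here is invented, not DeepMind's harness): a commission-only score can look near-perfect while every omission error goes uncounted.

```python
# Hypothetical labeled cases: each records whether the agent said
# something wrong (commission) and whether it failed to suggest a
# critical test or referral (omission).
cases = [
    {"said_something_wrong": False, "missed_critical_referral": True},
    {"said_something_wrong": False, "missed_critical_referral": True},
    {"said_something_wrong": True,  "missed_critical_referral": False},
    {"said_something_wrong": False, "missed_critical_referral": False},
]

# The benchmark's view: only commission errors are counted.
commission_errors = sum(c["said_something_wrong"] for c in cases)
# The dominant harm mode per the BMJ data: invisible to that benchmark.
omission_errors = sum(c["missed_critical_referral"] for c in cases)

print(f"commission errors: {commission_errors}/{len(cases)}")  # 1/4
print(f"omission errors:   {omission_errors}/{len(cases)}")    # 2/4
```

On this toy data the commission-only metric reports a 75% clean rate while half the cases contain the omission failure the BMJ editorial identifies as the dominant source of severe harm.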
The most credible external validation isn’t in this release. It’s the prior Beth Israel Deaconess feasibility study, where AMIE-style history-taking across 100 actual pre-visit patients hit 90% differential-diagnosis accuracy with no safety stops and rising patient trust after the conversation 3. That’s a far stronger signal than 10 patient-actors cycling through 20 synthetic scenarios.
A different bet than Microsoft
It’s tempting to read this as DeepMind’s answer to Microsoft’s MAI-DxO, which paired with OpenAI’s o3 reportedly cleared more than 80% of complex NEJM case puzzles versus ~20% for unaided physicians 4. They are not the same product. MAI-DxO optimizes sequential diagnosis and test-ordering cost — a House M.D. for rare cases, embedded in Microsoft’s DAX ambient-documentation stack. DeepMind’s co-clinician optimizes empathetic multimodal intake — guiding a patient’s inhaler technique over video, taking history before the appointment. Whoever “wins” the leaderboard isn’t winning the same game.
The actual battlefronts are regulatory and geographic
The January 2026 FDA guidance carved out enforcement discretion for “glass-box” single-recommendation clinical decision support — exactly the lane a co-clinician would slot into — but clinician commentators warn transparency is a thin defense against time-pressed automation bias 5. And in the Global South pilots DeepMind highlights (India, UAE, Singapore), Indian health-policy commentators have already flagged “pilotitis” and post-colonial dependency risks when Global North–trained models are parachuted into systems short roughly 10 million workers 6.
The architecture is competent, the benchmarks are real, and the safety metric the team chose under-weights the failure mode that actually kills people.
The interesting question isn’t whether the model works. It’s whether DeepMind picks a harder evaluation — and a harder regulatory posture — before the deployment partners do it for them.
Wang and Church use AI to drop E. coli to 19 amino acids
Source: ars-technica-ai · published 2026-04-30
TL;DR
- A Columbia/Harvard team rebuilt E. coli’s ribosomal proteins so the cell no longer needs isoleucine, shrinking the in-use code from 20 amino acids to 19.
- 18 ribosomal genes tolerated a simple isoleucine→valine swap; 13 only worked after AI-driven redesign with counterintuitive charged or rigid replacements.
- The recoded strain runs at 60–70% of wild-type growth and stayed stable over 400 generations with zero reversions to isoleucine.
- The same generative stack used here can also design toxin variants that evade DNA-order screening — a dual-use overhang the announcement skips.
What the experiment actually pulled off
The Wang (Columbia) / Church (Harvard) group rewrote every isoleucine-containing ribosomal protein in E. coli so none of them require the amino acid. Across the 31 genes they had to touch, 18 tolerated a trivial isoleucine-to-valine swap; the other 13 only worked after AI-driven redesign that often substituted charged or structurally rigid residues no human would have guessed 7. The payoff is a living cell whose translation machinery operates on a 19-letter alphabet — the first credible subtraction from the canonical 20 in a free-living organism.
The fitness numbers are better than the headline suggests. The engineered strain grows at roughly 60–70% of wild-type rate and held stable across 400 generations of continuous culture. Sequencing caught 20–30 secondary mutations accumulating along the way, but none restored isoleucine to the ribosomal proteins 7. That is the load-bearing claim: the cell found a new equilibrium rather than slowly clawing its way back to the old one.
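As a rough illustration of the first-pass step described above, the naive isoleucine-to-valine pass amounts to a positional substitution scan. The sequence and function names below are invented, and real recoding operates on codons in DNA, not on protein strings; this is only a sketch of the logic.

```python
def naive_recode(protein_seq: str):
    """First-pass recoding sketch: replace every isoleucine (I) with
    valine (V) and report which positions were touched.

    In the actual work, genes where this naive swap broke function
    went on to AI-driven redesign with non-obvious charged or
    structurally rigid replacements instead.
    """
    recoded = []
    positions = []
    for i, aa in enumerate(protein_seq):
        if aa == "I":
            recoded.append("V")
            positions.append(i)
        else:
            recoded.append(aa)
    return "".join(recoded), positions

seq = "MKAILVSTGI"  # toy ribosomal-protein fragment (illustrative only)
recoded, hits = naive_recode(seq)
print(recoded)  # MKAVLVSTGV
print(hits)     # [3, 9]
```

The 18 "easy" genes are the ones where this mechanical swap sufficed; the 13 hard ones are where position-by-position substitution fails and the generative redesign earns its keep.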
How this sits next to Syn57
This is one of two parallel bets on rewriting translation. Jason Chin’s MRC LMB group took the opposite route last year with Syn57: over 100,000 edits across a 4 Mb genome to strip six sense codons and one stop codon, with the host assembled bottom-up from synthetic 100-kb fragments 8. Syn57 reportedly grows about four times slower than wild-type 9, which makes the Wang/Church 60–70% figure look comparatively gentle.
| Program | Strategy | Fitness cost |
|---|---|---|
| Wang/Church 2026 | Subtract one amino acid (isoleucine) via ribosome redesign | ~30–40% slower |
| Chin Syn57 2025 | Subtract 7 codons via whole-genome rewrite | ~4× slower |
Both pitch the same downstream payoff — a genetic firewall that blocks viral replication and lets the cell incorporate non-canonical monomers — but they disagree on whether amino acids or codons are the cleaner thing to remove.
The dissent the press release skipped
Not everyone thinks subtraction is progress. Dipti Nayak’s archaeal work shows codon ambiguity can be adaptive; she argues genetic “leakiness” is “a feature, not a bug” that helps organisms survive extreme environments 10. Reviewers of the broader noncanonical-amino-acid program also flag a persistent “evolutionary deadlock” between ribosome catalysis and synthetase specificity that limits how far this kind of surgery scales toward genuinely exotic chemistries 11.
There is also a biosecurity tail the Ars writeup omits. The same class of generative tools rescuing the isoleucine-free ribosome — RFdiffusion and ProteinMPNN — has been red-teamed and shown to produce functional toxin variants with under 50% identity to anything in known databases, defeating homology-based DNA-order screening 12. A 19-amino-acid E. coli is a milestone in AI-assisted biology. It is also a reminder that the tools doing the assisting are getting better at things synthesis screens were built to catch.
TEMoE cuts MoE expert switching 14x, loses 10% accuracy
Source: hf-daily-papers · published 2026-04-21
TL;DR
- Princeton’s TEMoE reframes MoE expert routing as a hierarchical RL problem, using a “deliberation cost” to penalize switching.
- On gpt-oss-20b, switch rate drops from ~58% to 4.1% while MATH accuracy falls from 71.5% to 64.0%.
- The mechanism is borrowed wholesale from 2018 option-critic work; the novelty is applying it to MoE LLMs.
- Concurrent work (CommitMoE) claims 1.3–9.4x inference speedups without any accuracy trade-off, raising the question of whether reducing churn is the right target at all.
The problem: MoE routers can’t sit still
Modern MoE LLMs like gpt-oss-20b and Qwen change their active expert set at nearly every token — switch rates sit close to 100%. That’s fine when all experts live on the GPU, but ruinous the moment total parameters exceed VRAM and experts have to be paged in from CPU or disk. Stable routing is the prerequisite for offloading, KV-cache reuse, and the kind of memory tricks that make 20B+ MoEs run on consumer hardware.
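Concretely, switch rate here means the fraction of consecutive tokens whose active expert set changes. A minimal sketch (expert ids and sequences invented for illustration):

```python
def switch_rate(expert_sets):
    """Fraction of consecutive tokens whose active expert set differs.

    expert_sets: one set per token, holding the ids of the top-k
    experts the router activated for that token.
    """
    if len(expert_sets) < 2:
        return 0.0
    switches = sum(
        1 for prev, cur in zip(expert_sets, expert_sets[1:]) if prev != cur
    )
    return switches / (len(expert_sets) - 1)

# A router that re-selects experts nearly every token (the MoE default):
churny = [{0, 1}, {2, 3}, {0, 2}, {1, 3}]
# A router that commits to an expert set for stretches of tokens:
stable = [{0, 1}, {0, 1}, {0, 1}, {2, 3}]
print(switch_rate(churny))  # 1.0
print(switch_rate(stable))  # 0.333...
```

Every switch in an offloaded setting is a potential PCIe or disk transfer, which is why driving this number down is the whole game.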
Shen and Henderson’s TEMoE attacks this by treating expert selection as a semi-Markov decision process. Each MoE layer gets a lightweight controller with a termination head (β: should I switch?) and a selection head (which k experts next?). A DeepSets encoder makes the representation permutation-invariant over the current expert mask, and Gumbel-top-k sampling keeps the whole thing differentiable. Training is option-critic with a self-distillation reward — reverse KL against the unconstrained teacher — and LoRA rank-16 adapters on experts and attention.
The crucial knob is η, a scalar penalty added to the termination gradient. Switch only when expected value beats the current option’s value by more than η.
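In effect, η acts as a hysteresis threshold on the controller's value estimates. A minimal sketch, with invented names and numbers standing in for the critic's outputs (not the paper's implementation):

```python
def should_switch(q_current: float, q_best_alternative: float,
                  eta: float = 0.02) -> bool:
    """Deliberation-cost rule: stay committed to the current expert set
    unless the best alternative's estimated value exceeds it by more
    than the penalty eta."""
    return q_best_alternative - q_current > eta

# Marginal gain: the deliberation cost vetoes the switch.
print(should_switch(q_current=0.70, q_best_alternative=0.71))  # False
# Large gain, e.g. at a reasoning-phase boundary: switch is worth the churn.
print(should_switch(q_current=0.70, q_best_alternative=0.75))  # True
```

Raising η makes the controller stickier, which is exactly the switch-rate/accuracy dial visible in the results table below: more commitment, fewer transfers, and more moments where a genuinely better expert set gets vetoed.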
Results: a clean accuracy/stability trade-off
| Config (budget k̂=16) | Switch rate | MATH | MMLU |
|---|---|---|---|
| Base gpt-oss-20b | 58.6% | 71.5% | 79.5% |
| TEMoE η=0.02 | 4.1% | 64.0% | 72.5% |
| TEMoE η=0.03 | 1.3% | 58.5% | 67.5% |
| TEMoE η=0.04 | 1.2% | 55.0% | 63.0% |
A 14x reduction in switching for 7.5 points of MATH is a real result, and TEMoE crushes static pruning baselines at the same expert budget — frequency pruning hits 53.5%, Wanda and random selection collapse below 15%. The controller’s ability to swap expert sets at reasoning-phase boundaries is doing real work that no static mask can replicate.
The intellectual debt and the competition
The η penalty isn’t new. It’s lifted directly from Bacon et al.’s 2018 option-critic paper, where the same regularizer was introduced to stop RL agents from terminating options every step and reverting to primitive actions 13. TEMoE’s contribution is the analogy — per-token expert churn is option collapse — not the math.
More awkwardly, a concurrent 2026 paper, CommitMoE, attacks the same offloading bottleneck by skipping switch-rate reduction entirely. It exploits MoE robustness to mispredicted experts and claims 1.3x–9.4x faster inference than offloading baselines with no retraining and no accuracy hit 14. If those numbers hold, TEMoE’s 10% accuracy tax looks expensive. Pre-gated MoE 15 takes a third route — predict next-layer experts to overlap PCIe transfer with compute — but requires full retraining, where TEMoE’s LoRA-only path is a real win.
> Despite having hundreds of experts, a tiny subset (often just 3–5) is mechanistically critical; pruning these specific ‘Super Experts’ leads to immediate performance collapse 16.
That finding is the deeper threat. If most routing decisions cycle through near-fungible experts, the “churn” TEMoE suppresses may be partly cosmetic — and the accuracy loss may concentrate in exactly the moments a Super Expert was needed but the deliberation cost vetoed the switch. The authors’ own caveat that they can’t disentangle gains from temporal extension versus self-distillation gets sharper under this lens.
What’s actually at stake
TEMoE is a principled framing of a real bottleneck, and the controller architecture is the right shape for adapter-based retrofits of existing MoEs. But the paper ships no CUDA kernels, no offloading integration, and no end-to-end latency numbers — the systems payoff is deferred. With CommitMoE claiming the prize without the accuracy hit and Super Experts work suggesting switch-rate is the wrong metric, TEMoE has to prove that disciplined routing beats robust routing in actual deployment. The RL framing is elegant; the empirical case is not yet closed.
Round-ups
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
Source: hf-daily-papers
WorldMark standardizes evaluation of interactive video world models through a shared action-mapping layer, identical test scenarios, and a hierarchical suite covering visual quality, control alignment, and world consistency. The release includes an evaluation toolkit and a live leaderboard across multiple architectures.
Context Unrolling in Omni Models
Source: hf-daily-papers
Omni introduces ‘context unrolling,’ a technique that lets a unified multimodal model expand heterogeneous inputs into a shared knowledge manifold before reasoning, aiming to improve downstream performance and in-context generation across modalities trained jointly rather than stitched together.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
Source: hf-daily-papers
UC Santa Cruz’s VLAA-GUI tackles two failure modes plaguing GUI agents — premature stopping and repetitive action loops — by wiring a Completeness Verifier, Loop Breaker, and Search Agent around coding and grounding modules. Tests on OSWorld and WindowsAgentArena cover Opus 4.5/4.6 and Gemini 3.1 Pro backbones.
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
Source: hf-daily-papers
COS-PLAY pairs an LLM decision agent with a learnable skill bank that discovers, refines, and chains reusable skills across episodes, targeting long-horizon environments with delayed rewards and partial observability. Skills are retrieved at inference rather than baked into weights, enabling cross-episode transfer.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
Source: hf-daily-papers
WebGen-R1 trains small LLMs to produce multi-page websites end-to-end via reinforcement learning, combining a structured scaffolding paradigm with a cascaded multimodal reward that scores both functional correctness and visual aesthetics, bypassing the multi-agent pipelines typical of project-level web generation.
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Source: hf-daily-papers
PersonalAI benchmarks knowledge-graph external memory for personalized LLM agents, building on the AriGraph architecture with hyper-edges and comparing retrieval strategies including A* search, WaterCircles traversal, and beam search to capture temporal dependencies in user history.
Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts
Source: hf-daily-papers
A systematic study of test-time adaptation on EEG foundation models finds gradient-based TTA methods unstable under real-world distribution shifts across heterogeneous datasets, while optimization-free approaches deliver more consistent downstream performance — a caution for clinical EEG deployment pipelines.
Footnotes
1. Christakopoulou et al., arXiv:2410.08328 (Talker-Reasoner) — https://arxiv.org/pdf/2410.08328
   > Inspired by Kahneman’s ‘Thinking, Fast and Slow,’ the Talker handles fluid conversational responses (System 1) while the Reasoner operates asynchronously to perform deep reasoning and multi-step planning (System 2).
2. BMJ editorial on conversational AI in clinical settings — https://www.bmj.com/content/390/bmj.r1385
   > Errors of omission—failing to suggest a critical test or referral—account for roughly 76.6% of severely harmful errors, and a model’s score on USMLE-style exams is only moderately correlated (r≈0.6) with its clinical safety.
3. Google Research blog — Beth Israel Deaconess AMIE feasibility study — https://research.google/blog/exploring-the-feasibility-of-conversational-diagnostic-ai-in-a-real-world-clinical-study/
   > AMIE conducted history-taking for 100 patients before their appointments; the AI’s differential diagnosis was accurate in 90% of cases, and patients reported higher trust in AI following the interaction, with zero safety stops required.
4. Digital Health Wire — AMIE vs Microsoft MAI-DxO comparison — https://digitalhealthwire.com/google-amie-outperforms-in-real-world-debut/
   > Microsoft’s MAI-DxO paired with OpenAI’s o3 reportedly solved over 80% of complex NEJM cases versus 20% for unaided physicians, focusing on sequential diagnosis as resource-optimization, while Google’s co-clinician optimizes for empathetic multimodal dialogue.
5. KevinMD — FDA 2026 guidance on AI clinical decision support — https://kevinmd.com/2026/01/fda-loosens-ai-oversight-what-clinicians-need-to-know-about-the-2026-guidance.html
   > The January 2026 FDA guidance permits enforcement discretion for single-recommendation ‘co-clinician’ tools that operate as a ‘glass box,’ but critics warn this transparency may be insufficient against cognitive offloading by time-pressed physicians.
6. Times of India — AI in Healthcare governance, India RAISE dialogue — https://timesofindia.indiatimes.com/city/mumbai/ai-in-healthcare-governance-equity-and-responsible-innovation-in-india/articleshow/126450123.cms
   > Experts warned of ‘pilotitis’ — AI solutions trapped in experimental phases without integration into overstretched public health systems — and risks of ‘post-colonial dependency’ when core models are trained on Global North datasets.
7. nsaneforums (Ars repost with extended detail) — https://nsaneforums.com/news/general-news/researchers-try-to-cut-the-genetic-code-from-20-to-19-amino-acids-r34788/
   > simple ‘isoleucine-to-valine’ swaps worked for 18 genes, the remaining 13 required brute-force computational redesign… the engineered strains typically exhibit only 60–70% of the growth rate seen in wild-type E. coli… the strain remained stable over 400 generations of continuous growth; … none of the secondary mutations restored isoleucine to the ribosomal proteins
8. MRC LMB (Chin lab) — Syn57 announcement — https://mrclmb.ac.uk/news-events/articles/syn57-represents-a-new-chapter-in-the-genetic-code-of-life/
   > Syn57 represents a new chapter in the genetic code of life — over 100,000 edits across the 4 Mb genome to remove six sense codons and one stop codon, building the host bottom-up from synthetic 100-kb fragments.
9. Singularity Hub on Syn57 — https://singularityhub.com/2025/08/25/meet-syn57-the-most-stripped-down-living-synthetic-bacteria-yet/
   > Engineered strains like Syn57 reportedly grow four times slower than wild-type counterparts, raising questions about whether radically simplified codes can ever match natural fitness.
10. Smithsonian Magazine — recoded E. coli context — https://www.smithsonianmag.com/smart-news/scientists-rewrote-the-genetic-code-of-e-coli-and-its-drastically-from-anything-else-found-in-nature-180987098/
    > Recent discoveries of natural ‘ambiguity’ in Archaea — where the same codon can signal both a stop and a rare amino acid — suggest that life may actually benefit from a ‘leaky’ code; Dipti Nayak argues genetic ambiguity is ‘a feature, not a bug.’
11. bioengineer.org on orthogonal tRNA-synthetase pairs — https://bioengineer.org/optimizing-trna-synthetase-pairs-for-noncanonical-amino-acids/
    > A persistent ‘evolutionary deadlock’ has historically prevented the incorporation of non-L-α-amino acids: synthetases cannot be evolved for substrates the ribosome cannot polymerize, and vice versa.
12. Wang/Yang/Yassif red-team study (via 2026 preprint review) — https://mdpi-res.com/bookfiles/book/11854/New_Insights_into_Plant_Signaling_Mechanisms_in_Biotic_and_Abiotic_Stress.pdf?v=1771899133
    > ProteinMPNN and RFdiffusion can generate functional protein variants with less than 50% sequence identity to known toxins, rendering traditional homology-based DNA-order screening ineffective.
13. Bacon et al., ‘When Waiting is not an Option: Learning Options with a Deliberation Cost’ (AAAI 2018) — https://www.researchgate.net/publication/319736060_When_Waiting_is_not_an_Option_Learning_Options_with_a_Deliberation_Cost
    > Without a penalty for switching, agents often suffer from ‘option collapse,’ where they terminate options at every step to re-evaluate their choices, effectively reverting to slow, primitive actions.
14. CommitMoE (arXiv 2511.05814, 2026) — https://arxiv.org/pdf/2511.05814
    > CommitMoE uses a Commit Router that makes unconditional expert predictions… achieving 1.3× to 9.4× faster inference compared to state-of-the-art offloading frameworks, while maintaining model quality without any retraining.
15. Pre-gated MoE (NeurIPS 2022 / Hwang et al.) — https://papers.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf
    > Pre-gated MoE uses a modified gate function that predicts which experts are required for the subsequent layer, overlapping the latency of migrating experts from CPU to GPU with computation of the current layer — but requires complete retraining because it restructures the transformer block.
16. ‘Super Experts’ analysis of Qwen3 / DeepSeek-R1 (OpenReview) — https://openreview.net/pdf?id=o43eHjPEMO
    > Despite having hundreds of experts, a tiny subset (often just 3-5) is mechanistically critical; pruning these specific ‘Super Experts’ leads to immediate performance collapse, whereas pruning others has minimal impact.