Gemini 3.5 Flash GAs 3× pricier, Ettin 17M tops MiniLM, co-scientists retrieve
Three unrelated AI tech releases today: Gemini 3.5 Flash GAs at 3× price, Ettin's 17M reranker tops MiniLM, Nature co-scientists draw scrutiny.
Gemini 3.5 Flash GAs 3× pricier, Ettin 17M tops MiniLM, co-scientists retrieve
TL;DR
- Gemini 3.5 Flash GAs at $1.50/$9 per M tokens, 3× Flash Preview pricing.
- Reverse-engineering pegs Flash at ~250–300B total / 10–16B active MoE.
- Ettin’s 17M reranker beats the 4-year MiniLM default by 0.051 NDCG@10 on MTEB.
- AI co-scientists in Nature dismissed by critics as retrieval, not discovery.
- datasette-llm 0.1a8 patches a context-chain bug that truncated multi-turn responses.
Today’s three AI tech features don’t share a thread, so take them on their own terms. Google’s Gemini 3.5 Flash goes GA at 3× the Preview price, and community reverse-engineering suggests the hike tracks real capacity — a ~250–300B MoE with 10–16B active, finishing a 14-step MCP chain roughly 4× faster than Opus 4.7 while still trailing it on SWE-Bench Pro.
Tom Aarsen’s Ettin drop ships six Apache-2.0 ModernBERT rerankers from 17M to 1B, with the smallest model unseating a four-year-old MiniLM default and the largest matching its 1.54B teacher at 2.4× throughput. And two AI co-scientists — Robin and Google’s — landed in Nature with wet-lab-validated drug-retargeting hits, but critics note every novel compound was already in the training literature, recasting discovery as well-orchestrated retrieval.
Google ships Gemini 3.5 Flash at 3× price, near Pro tier
Source: simon-willison · published 2026-05-19
TL;DR
- Gemini 3.5 Flash ships GA at $1.50/$9 per M tokens — 3× Flash Preview, 6× Flash-Lite.
- Community reverse-engineering pegs it at ~250–300B total / 10–16B active MoE, reframing the hike as real capacity.
- On a 14-step MCP chain it finished in 11.3s vs 38.9s for Opus 4.7 — roughly 4× wall-clock speed.
- Opus 4.7 still beats it on SWE-Bench Pro 64.3% to 55.1% — the coding-quality gap persists.
The “Flash” brand is being redefined
Google’s I/O drop wasn’t really a model release — it was a repositioning. Gemini 3.5 Flash skipped the -preview step, went straight to GA, and shipped the same day into Search AI Mode, the consumer Gemini app, Antigravity, AI Studio, Android Studio, and the Enterprise Agent Platform. That’s not a cautious rollout; that’s Google replacing the default model under billions of users in one afternoon.
The catch is the price tag. At $1.50 input / $9 output per million tokens, 3.5 Flash costs 3× Gemini 3 Flash Preview and 6× Gemini 3.1 Flash-Lite, and now sits within striking distance of 3.1 Pro ($2 / $12). On Artificial Analysis’s benchmark cost-to-run, 3.5 Flash at high reasoning ($1,551) actually costs more than 3.1 Pro Preview ($892) — because the reasoning-token volume has gone up alongside the per-token rate.
The most useful explanation came from Hacker News commenters working backward from Google’s disclosed TPU 8i serving specs and a 280 tok/s target: roughly 250–300B total parameters with 10–16B active in an MoE 1. If that’s right, the “Flash” line has quietly stopped meaning cheap and started meaning dense and fast. The price hike is Google passing through a real capacity increase, in step with GPT-5.5 (2× GPT-5.4) and Opus 4.7 (~1.46× Opus 4.6 after tokenizer changes). Pichai’s pitch that enterprises can save $1B/year by migrating off Pro 2 only makes sense if you read 3.5 Flash as the new Pro-replacement, not the new commodity tier.
Where the speed wins, and where it doesn’t
Independent agentic testing supports the speed story in concrete terms. On a 14-step MCP chain, 3.5 Flash finished in 11.3 seconds against 38.9s for Claude Opus 4.7 and 46.1s for GPT-5.5 3 — roughly 4× wall-clock for 6% more cost than the old Flash. It also leads MCP Atlas at 83.6% versus Opus at 79.1% and GPT-5.5 at 75.3% 4.
But the “frontier replacement” framing breaks down on deeper coding work. On SWE-Bench Pro, Opus 4.7 still beats 3.5 Flash 64.3% to 55.1% 4, and Claude Code holds multi-file SWE-bench at 80.8% 5. The Antigravity 2.0 IDE that 3.5 Flash is tuned for is getting mixed practitioner reviews — impressive parallel-agent demos, but a “cluttered” manager UI and agents acting with “eerie” confidence that demands manual verification 5.
| Workload | Winner | Margin |
|---|---|---|
| Short MCP tool-loops (latency) | Gemini 3.5 Flash | 11.3s vs 38.9s Opus 3 |
| MCP Atlas agent tasks | Gemini 3.5 Flash | 83.6% vs 79.1% Opus 4 |
| SWE-Bench Pro (deep coding) | Claude Opus 4.7 | 64.3% vs 55.1% 4 |
| Multi-file SWE-bench | Claude Code | 80.8% 5 |
Interactions API: divergent lock-in
The quieter half of the drop is the new Interactions API, currently in beta — Google’s answer to OpenAI’s Responses pattern, with server-side history management. The two camps have diverged philosophically: Gemini keeps interleaved thoughts and tool calls inspectable, while Responses compaction replaces them with opaque encrypted items 6. Different transparency stories, identical outcome — once a live conversation lives on a vendor’s server, you’re not portable. The pricing debate hasn’t priced that in yet.
Net: 3.5 Flash is a genuinely larger, faster model rather than a rebrand, and the price is defensible on capacity grounds. Google is also spending down a lot of “cheap Flash” goodwill in one move, and betting that the latency wins on short agent loops will paper over the coding-quality gap to Opus.
Further reading
- Gemini 3.5 Flash might be fast enough for gen AI to make sense — ars-technica-ai
- llm-gemini 0.32 — simon-willison
- llm-gemini 0.32a0 — simon-willison
Ettin’s 17M reranker tops a 33M MiniLM by 0.051 NDCG
Source: huggingface-blog · published 2026-05-19
TL;DR
- Tom Aarsen shipped six ModernBERT CrossEncoder rerankers (17M–1B params) under Apache-2.0, distilled from
mxbai-rerank-large-v2. - The 1B student matches its 1.54B teacher on MTEB Retrieval (0.6114 vs 0.6115 NDCG@10) at 2.4× throughput.
- The 17M model beats
ms-marco-MiniLM-L12-v2(33M) by +0.051 NDCG@10 on MTEB — the long-overdue successor to a four-year default. - The headline “8.3× speedup” applies only on H100 with bf16 + FA2, not consumer Ampere cards 7.
A MiniLM successor, finally
For four years the default open reranker has been some flavor of ms-marco-MiniLM. Ettin is the first family that makes upgrading obvious. The smallest model — 17.6M parameters, 7 layers, 256 hidden — outperforms the 33M MiniLM-L12-v2 by +0.051 NDCG@10 on MTEB (eng, v2) while running faster. The 32M variant beats BGE-reranker-v2-m3 (568M) by +0.025 NDCG@10 despite being 17× smaller, and the 68M matches Qwen3-Reranker-0.6B at roughly one-ninth the parameter count 8. The 1B model effectively closes the gap to its mxbai-rerank-large-v2 teacher (0.6114 vs 0.6115) while being 35% smaller and 2.4× faster.
The recipe is unglamorous: pointwise MSE distillation on raw teacher logits, ~143M (query, document, score) triples, a quantile-anchor hard-negative scheme that mixes 32 head samples with 32 medium-difficulty negatives per query. What makes it work is the backbone. Ettin’s ModernBERT encoders come out of the Johns Hopkins/LightOn Seq vs Seq program, which trained paired encoder/decoder models on identical 2T-token recipes and released 200+ intermediate checkpoints — the most documented pretraining lineage of any open CrossEncoder, and the paper’s central finding (encoders beat much larger decoders on retrieval) is the implicit case for shipping a CrossEncoder rather than a Qwen-style generative scorer 9.
Where it doesn’t fit
Ettin is English-only with an 8K context. Qwen3-Reranker spans 0.6B–8B, supports 100+ languages with a 32K native context, and uses a CausalLM yes/no logit head 10. For cross-lingual retrieval, BGE-m3 and Qwen3 remain the defaults. Broader reranker surveys still treat Cohere Rerank 4 and Voyage Rerank 2.5 as the quality ceiling (a 15–30% precision boost over embedding-only retrieval), with Apache-2.0 weights being the main reason to pick an open model at all — Jina Reranker v2’s CC-BY-NC license rules it out of most commercial stacks 11. Read in that frame, Ettin is the strongest open English CrossEncoder family of the year, not a Qwen3 killer.
Read the speedup claim carefully
The “8.3× speedup” headline is a bounded observation: it compares bf16 + Flash Attention 2 on an H100 against fp32 + SDPA, and gains on consumer Ampere cards (e.g. RTX 3090) are materially smaller because they can’t exploit Hopper-specific kernels 7. ModernBERT’s unpadding and 8K-context advantages have the same dependency — without FA2 the attention reverts to quadratic scaling, and Optimum/ONNX export paths have been slow to land, which is a real obstacle for teams that need CPU inference or non-PyTorch deployment 12.
Takeaway
If your retrieval stack is English and GPU-served, the 150M Ettin is probably the new sweet spot: it outperforms every sub-600M reranker tested and runs 2.3× faster than padded-wrapper peers. If you need multilingual, long-context-beyond-8K, or CPU inference, the announcement doesn’t change your shortlist.
Two AI co-scientists hit Nature, but repurpose known drugs
Source: ars-technica-ai · published 2026-05-19
TL;DR
- Robin and Google’s Co-Scientist both landed in Nature in May 2026 with wet-lab-validated drug-retargeting hypotheses.
- Robin’s top hit — ripasudil for dry AMD — boosted RPE debris clearance 7.5× in vitro and upregulated ABCA1.
- Critics call the results “underwhelming”: every “novel” drug was already in the training literature, suggesting retrieval, not discovery.
- The analytical sub-agent Finch needed human correction on bioinformatics, undercutting the “autonomous” framing.
What the two systems actually shipped
Two agentic science platforms published simultaneously in Nature: FutureHouse’s Robin and Google’s AI Co-Scientist, each claiming end-to-end hypothesis generation for drug repurposing. Robin’s headline result is the most concrete: it proposed enhancing phagocytic activity in retinal pigment epithelium cells as a path to slowing dry age-related macular degeneration, then designed the in vitro experiments that showed ripasudil — a ROCK inhibitor — increased debris clearance 7.5× and upregulated the lipid efflux pump ABCA1 13. FutureHouse’s separate benchmark numbers back the broader capability claim: its Crow and Falcon agents hit ~90% on the LitQA literature-QA benchmark, versus ~67% for PhD-level human baselines 14.
That is real, peer-reviewed wet-lab validation, not a press-release demo. It is also the strongest case for these systems anyone has made to date.
The retrieval-vs-discovery problem
The sharpest pushback is that none of the “discoveries” are actually new. Ripasudil has been studied in ophthalmology for years; Co-Scientist’s AML candidates were similarly well-attested in the prior literature these models trained on. A review from k-dense.ai labels the Nature results “underwhelming” and argues Co-Scientist may be suffering from data leakage, “essentially acting as a sophisticated search engine rather than an independent discoverer” 15.
That reframing matters. “AI proposes a treatment for blindness” and “AI surfaces a known ROCK inhibitor faster than a postdoc would have” are very different claims, and the press coverage is mostly making the first one.
The analytical agent is the weak link
Ars highlights that one of these systems “goes on to analyze some of the data” — that’s Finch, Robin’s analytical agent. Independent commentary on the paper is blunter: Finch “struggled with complex bioinformatics and statistics, frequently requiring human prompts to correct errors,” and researchers overrode several of Robin’s experimental-design suggestions 16. The component being marketed as the differentiator from pure hypothesis-generators is also the component that most depends on undisclosed human sense-checking.
Biosecurity and integrity overhang
Two adjacent concerns are aimed squarely at this class of tool, not at hypothetical future ones. RAND-affiliated analysis warns that agentic biology systems “raise the floor” for non-experts seeking dangerous biological knowledge and “raise the ceiling” for expert misuse, particularly once agents start interfacing with lab robotics and bypassing traditional biosafety checkpoints 17. Separately, a response to Nature’s March 2026 editorial argues autonomous research agents will amplify existing integrity crises — paper-mill content and hallucinated analyses entering the literature at scale 18.
Net read
The wet-lab validation is genuine 13. The “autonomous discovery” framing is not: the novel hits aren’t novel 15, the analytical agent is fragile 16, and the biosecurity community is treating this milestone as a warning shot 1718. Robin and Co-Scientist are best understood as very fast literature-triangulation engines with a wet-lab loop attached — which is useful, and worth taking seriously, but a long way from the headline.
Round-ups
Allen AI’s OlmoEarth v1.1 trims compute for Earth observation models
Source: huggingface-blog
OlmoEarth v1.1 from Allen AI lands as a leaner family of geospatial foundation models, tuned for satellite and remote-sensing tasks. The update targets efficiency over its predecessor, aiming to make Earth observation pipelines cheaper to run without sacrificing the v1 accuracy baseline.
datasette-llm 0.1a8 fixes response-chain bug spanning accountant plugin
Source: simon-willison, simon-willison
Simon Willison’s datasette-llm 0.1a8 patches the llm_prompt_context() hook so it fully walks chains of prior responses, with a matching 0.1a4 release of the accountant plugin propagating the fix to usage tracking. The bug had truncated multi-turn context collection.
Footnotes
-
Hacker News thread 48196570 — https://news.ycombinator.com/item?id=48196570
↩Napkin-math on the TPU 8i serving specs and 280 tok/s target puts 3.5 Flash at ~250–300B total / 10–16B active in an MoE — significantly larger than prior Flash models, which explains why Google had to raise the per-token price even as the inference architecture got more efficient.
-
VentureBeat — https://venturebeat.com/technology/google-says-gemini-3-5-flash-can-slash-enterprise-ai-costs-by-more-than-1-billion-a-year
↩Google says Gemini 3.5 Flash can slash enterprise AI costs by more than $1 billion a year — Pichai framed the model as a ‘financial lifeline’ for organizations migrating workloads off Pro-tier endpoints.
-
Towards AI — ‘I tested Gemini 3.5 Flash on 18 agent tasks’ — https://pub.towardsai.net/i-tested-gemini-3-5-flash-on-18-agent-tasks-its-6-pricier-flash-crushed-gpt-5-5-at-4-speed-aeab103e0981?source=rss----98111c9905da---4
↩ ↩2On a 14-step MCP chain Gemini 3.5 Flash finished in 11.3 seconds vs 38.9s for Claude Opus 4.7 and 46.1s for GPT-5.5 — 6% pricier than old Flash but ~4x the wall-clock speed of the frontier.
-
Digital Applied benchmark roundup — https://www.digitalapplied.com/blog/gemini-3-5-flash-vs-gpt-5-5-opus-4-7-agentic-coding
↩ ↩2 ↩3 ↩4Gemini 3.5 Flash leads MCP Atlas at 83.6% (vs Opus 4.7 at 79.1% and GPT-5.5 at 75.3%), but still trails Opus on SWE-Bench Pro (55.1% vs 64.3%) — the ‘agentic’ framing hides a real coding-quality gap.
-
Vibe Coding Academy — coding-assistant comparison — https://www.vibecodingacademy.ai/blog/best-ai-coding-assistant-2026
↩ ↩2 ↩3Antigravity 2.0’s agent-manager UI is ‘cluttered’ with steep orchestration overhead, and early adopters report agents acting with ‘eerie’ confidence that requires significant manual verification — Claude Code still wins multi-file SWE-bench at 80.8%.
-
Sean Goedecke blog — Responses API analysis — https://www.seangoedecke.com/responses-api/
↩Gemini’s Interactions API keeps interleaved thoughts and tool calls inspectable, where OpenAI’s Responses compaction replaces them with opaque encrypted items — the two camps have diverged on transparency vs token compression, and both create hard vendor lock-in for live conversations.
-
John6666 activity posts on Hugging Face — https://huggingface.co/John6666/activity/posts
↩ ↩2The 8.3x speedup is a real result under specific H100 conditions (bf16 + FA2 vs. fp32 + SDPA) but should be read as a bounded observation, not a general platform claim; consumer GPUs like the RTX 3090 see materially smaller gains because they cannot exploit the same Hopper-specific kernels.
-
HubNextEra, ‘How Ettin Rerankers Boost Your Embedder Performance’ — https://news.hubnextera.com/how-ettin-rerankers-boost-your-embedder-performance.html
↩The 32M Ettin reranker outperforms the 568M BGE-reranker-v2-m3 by +0.025 NDCG@10 on MTEB despite being 17x smaller, and the 68M variant matches Qwen3-Reranker-0.6B at roughly one-ninth the parameter count.
-
Weller et al., ‘Seq vs Seq: An Open Suite of Paired Encoders and Decoders’ (arXiv 2507.11412) — https://arxiv.org/html/2507.11412v1
↩The Ettin suite pairs encoder-only and decoder-only models from 17M to 1B parameters trained on identical 2T-token recipes, with batch-level training order and 200+ intermediate checkpoints released — the first apples-to-apples comparison showing encoders consistently beat decoders of much larger size on classification and retrieval.
-
Qwen3-Embedding/Reranker technical blog (qwenlm.github.io) — https://qwenlm.github.io/blog/qwen3-embedding/
↩Qwen3-Reranker spans 0.6B–8B with a CausalLM ‘yes/no’ logit scoring head, 32K native context, and >100-language support — capabilities Ettin’s English-only, 8K-context CrossEncoder design does not match, at the cost of substantially higher per-pair latency.
-
ThinkingLoop, ‘10 Vector Rerankers Benchmarked on Cost vs Quality’ (Medium) — https://medium.com/@ThinkingLoop/10-vector-rerankers-benchmarked-on-cost-vs-quality-6bad293eaef9
↩Cohere Rerank 4 and Voyage Rerank 2.5 remain the quality ceiling, often providing a 15–30% precision boost over embedding-only retrieval; open-weight Apache-2.0 options like BGE-v2-m3 are preferred where licensing matters, while Jina Reranker v2 weights are CC-BY-NC-4.0 and unsuitable for many commercial deployments.
-
Portkey.ai summary of ModernBERT paper — https://portkey.ai/blog/smarter-better-faster-longer-a-modern-bidirectional-encoder-for-fast-memory-efficient-and-long-context-finetuning-and-inference-summary-2/
↩ModernBERT’s 8K context and unpadding gains are gated on Flash Attention 2, which requires GPU + fp16/bf16; on CPU or without FA2 the architecture reverts to standard quadratic scaling, and Optimum/ONNX export paths have been slow to land — a real obstacle for production deployment.
-
The Scientist — coverage of Robin’s dAMD finding — https://www.the-scientist.com/an-ai-powered-scientist-proposes-a-treatment-for-blindness-73079
↩ ↩2Robin independently hypothesized that enhancing the phagocytic activity of retinal pigment epithelium cells could mitigate disease progression… ripasudil increased debris clearance by 7.5 times and upregulated the lipid efflux pump ABCA1.
-
FutureHouse — LAB-Bench announcement — https://www.futurehouse.org/research-announcements/lab-bench-measuring-capabilities-of-language-models-for-biology-research
↩FutureHouse agents (Crow and Falcon) achieved ~90% accuracy on the LitQA benchmark, significantly outperforming PhD-level researchers who averaged ~67%.
-
k-dense.ai — ‘AI Co-Scientist, Not AI Scientist’ — https://www.k-dense.ai/blog/ai-co-scientist-not-ai-scientist
↩ ↩2Independent critics have labeled the results ‘underwhelming,’ noting that the ‘novel’ drugs identified by the AI were already well-established in existing literature… Co-Scientist may suffer from data leakage, essentially acting as a sophisticated search engine rather than an independent discoverer.
-
Bioengineer.org — commentary on Robin Nature paper — https://bioengineer.org/multi-agent-system-automates-scientific-discoveries/
↩ ↩2The analytical agent Finch struggled with complex bioinformatics and statistics, frequently requiring human prompts to correct errors… human researchers had to override several of the AI’s experimental design suggestions, suggesting that ‘autonomous’ discovery still relies heavily on human ‘sense-checking’.
-
EurekAlert — RAND/biosecurity framing of agentic biology — https://www.eurekalert.org/news-releases/1116894
↩ ↩2Such systems ‘raise the floor’ for non-experts to access complex biological knowledge and ‘raise the ceiling’ for experts to accelerate the design of dangerous pathogens… agentic systems can iterate on experimental protocols and interface with lab robotics, potentially bypassing traditional biosafety safeguards.
-
ETC Journal — response to Nature’s 25 March 2026 editorial on AI scientists — https://etcjournal.com/2026/03/25/a-response-to-natures-25-march-2026-editorial-on-ai-scientists/
↩ ↩2The shift from AI tools to autonomous AI agents may amplify existing crises in research integrity, such as the production of ‘paper mill’ content and hallucinated data analyses.