Gemini 3.5 Flash GAs 3× pricier, Ettin 17M tops MiniLM, co-scientists retrieve
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Gemini 3.5 Flash: more expensive, but Google plan to use it for everything simonwillison.net
Today at Google I/O, Google released Gemini 3.5 Flash. This one skipped the -preview modifier and went straight to general availability, and Google appear to be using it for a whole lot of their key products: 3.5 Flash is available today to billions of people globally: for everyone via the Gemini app and AI Mode in Google Search; for developers in our agent-first development platform Google Antigravity and Gemini API in Google AI Studio and Android Studio; for enterprises in Gemini Enterprise Ag…
Gemini 3.5 Flash might be fast enough for gen AI to make sense arstechnica.com
Google says its more efficient Gemini 3.5 Flash is the key to your agentic AI future.
llm-gemini 0.32 simonwillison.net
Release: llm-gemini 0.32. New model gemini-3.5-flash for Gemini 3.5 Flash. See also my notes on Gemini 3.5 Flash, and the pelican I drew using this upgrade to the plugin. Tags: gemini, llm
llm-gemini 0.32a0 simonwillison.net
Release: llm-gemini 0.32a0. Compatible with the llm>=0.32a0 alpha; adds the ability to stream reasoning tokens. Tags: gemini, llm
Introducing the Ettin Reranker Family huggingface.co
Two AI-based science assistants succeed with drug-retargeting tasks arstechnica.com
Both tools generate hypotheses; one goes on to analyze some of the data.
OlmoEarth v1.1: A more efficient family of models huggingface.co
OlmoEarth v1.1 from Allen AI lands as a leaner family of geospatial foundation models, tuned for satellite and remote-sensing tasks. The update targets efficiency over its predecessor, aiming to make Earth observation pipelines cheaper to run without sacrificing the v1 accuracy baseline.
datasette-llm 0.1a8 simonwillison.net
Simon Willison’s datasette-llm 0.1a8 patches the llm_prompt_context() hook so it fully walks chains of prior responses, with a matching 0.1a4 release of the accountant plugin propagating the fix to usage tracking. The bug had truncated multi-turn context collection.
datasette-llm-accountant 0.1a4 simonwillison.net
Release: datasette-llm-accountant 0.1a4. Fixed bug tracking chains of responses. Refs datasette-llm#7. Tags: llm, datasette
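The datasette-llm 0.1a8 and datasette-llm-accountant 0.1a4 entries above both concern walking a full chain of prior responses rather than stopping early. As a rough illustration of that idea only, the sketch below follows a linked chain back to its first turn. The `Response` class, its `previous` field, and `collect_chain` are hypothetical names invented for this example; they are not the plugins' actual data model or hook code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    """Hypothetical stand-in for one turn of a conversation."""
    prompt: str
    text: str
    previous: Optional["Response"] = None  # assumed link to the prior turn


def collect_chain(response: Response) -> List[Response]:
    """Return every response in the conversation, oldest first."""
    chain = []
    current: Optional[Response] = response
    # Follow the links all the way back, not just one step.
    while current is not None:
        chain.append(current)
        current = current.previous
    chain.reverse()
    return chain


first = Response("q1", "a1")
second = Response("q2", "a2", previous=first)
third = Response("q3", "a3", previous=second)

prompts = [r.prompt for r in collect_chain(third)]
print(prompts)
```

A version that only inspected `response.previous` once would return two turns here instead of three, which is the kind of truncated multi-turn context the release notes describe.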
References
Hacker News thread 48196570 news.ycombinator.com
Napkin-math on the TPU 8i serving specs and 280 tok/s target puts 3.5 Flash at ~250–300B total / 10–16B active in an MoE — significantly larger than prior Flash models, which explains why Google had to raise the per-token price even as the inference architecture got more efficient.
VentureBeat venturebeat.com
Google says Gemini 3.5 Flash can slash enterprise AI costs by more than $1 billion a year — Pichai framed the model as a ‘financial lifeline’ for organizations migrating workloads off Pro-tier endpoints.
Towards AI — ‘I tested Gemini 3.5 Flash on 18 agent tasks’ pub.towardsai.net
On a 14-step MCP chain Gemini 3.5 Flash finished in 11.3 seconds vs 38.9s for Claude Opus 4.7 and 46.1s for GPT-5.5 — 6% pricier than old Flash but ~4x the wall-clock speed of the frontier.
Digital Applied benchmark roundup digitalapplied.com
Gemini 3.5 Flash leads MCP Atlas at 83.6% (vs Opus 4.7 at 79.1% and GPT-5.5 at 75.3%), but still trails Opus on SWE-Bench Pro (55.1% vs 64.3%) — the ‘agentic’ framing hides a real coding-quality gap.
Sean Goedecke blog — Responses API analysis seangoedecke.com
Gemini’s Interactions API keeps interleaved thoughts and tool calls inspectable, where OpenAI’s Responses compaction replaces them with opaque encrypted items — the two camps have diverged on transparency vs token compression, and both create hard vendor lock-in for live conversations.
Vibe Coding Academy — coding-assistant comparison vibecodingacademy.ai
Antigravity 2.0’s agent-manager UI is ‘cluttered’ with steep orchestration overhead, and early adopters report agents acting with ‘eerie’ confidence that requires significant manual verification — Claude Code still wins multi-file SWE-bench at 80.8%.
Weller et al., ‘Seq vs Seq: An Open Suite of Paired Encoders and Decoders’ (arXiv 2507.11412) arxiv.org
The Ettin suite pairs encoder-only and decoder-only models from 17M to 1B parameters trained on identical 2T-token recipes, with batch-level training order and 200+ intermediate checkpoints released — the first apples-to-apples comparison showing encoders consistently beat decoders of much larger size on classification and retrieval.
ThinkingLoop, ‘10 Vector Rerankers Benchmarked on Cost vs Quality’ (Medium) medium.com
Cohere Rerank 4 and Voyage Rerank 2.5 remain the quality ceiling, often providing a 15–30% precision boost over embedding-only retrieval; open-weight Apache-2.0 options like BGE-v2-m3 are preferred where licensing matters, while Jina Reranker v2 weights are CC-BY-NC-4.0 and unsuitable for many commercial deployments.
Qwen3-Embedding/Reranker technical blog (qwenlm.github.io) qwenlm.github.io
Qwen3-Reranker spans 0.6B–8B with a CausalLM ‘yes/no’ logit scoring head, 32K native context, and >100-language support — capabilities Ettin’s English-only, 8K-context CrossEncoder design does not match, at the cost of substantially higher per-pair latency.
Portkey.ai summary of ModernBERT paper portkey.ai
ModernBERT’s 8K context and unpadding gains are gated on Flash Attention 2, which requires GPU + fp16/bf16; on CPU or without FA2 the architecture reverts to standard quadratic scaling, and Optimum/ONNX export paths have been slow to land — a real obstacle for production deployment.
John6666 activity posts on Hugging Face huggingface.co
The 8.3x speedup is a real result under specific H100 conditions (bf16 + FA2 vs. fp32 + SDPA) but should be read as a bounded observation, not a general platform claim; consumer GPUs like the RTX 3090 see materially smaller gains because they cannot exploit the same Hopper-specific kernels.
HubNextEra, ‘How Ettin Rerankers Boost Your Embedder Performance’ news.hubnextera.com
The 32M Ettin reranker outperforms the 568M BGE-reranker-v2-m3 by +0.025 NDCG@10 on MTEB despite being 17x smaller, and the 68M variant matches Qwen3-Reranker-0.6B at roughly one-ninth the parameter count.
The Scientist — coverage of Robin’s dAMD finding the-scientist.com
Robin independently hypothesized that enhancing the phagocytic activity of retinal pigment epithelium cells could mitigate disease progression… ripasudil increased debris clearance by 7.5 times and upregulated the lipid efflux pump ABCA1.
k-dense.ai — ‘AI Co-Scientist, Not AI Scientist’ k-dense.ai
Independent critics have labeled the results ‘underwhelming,’ noting that the ‘novel’ drugs identified by the AI were already well-established in existing literature… Co-Scientist may suffer from data leakage, essentially acting as a sophisticated search engine rather than an independent discoverer.
FutureHouse — LAB-Bench announcement futurehouse.org
FutureHouse agents (Crow and Falcon) achieved ~90% accuracy on the LitQA benchmark, significantly outperforming PhD-level researchers who averaged ~67%.
Bioengineer.org — commentary on Robin Nature paper bioengineer.org
The analytical agent Finch struggled with complex bioinformatics and statistics, frequently requiring human prompts to correct errors… human researchers had to override several of the AI’s experimental design suggestions, suggesting that ‘autonomous’ discovery still relies heavily on human ‘sense-checking’.
EurekAlert — RAND/biosecurity framing of agentic biology eurekalert.org
Such systems ‘raise the floor’ for non-experts to access complex biological knowledge and ‘raise the ceiling’ for experts to accelerate the design of dangerous pathogens… agentic systems can iterate on experimental protocols and interface with lab robotics, potentially bypassing traditional biosafety safeguards.
ETC Journal — response to Nature’s 25 March 2026 editorial on AI scientists etcjournal.com
The shift from AI tools to autonomous AI agents may amplify existing crises in research integrity, such as the production of ‘paper mill’ content and hallucinated data analyses.
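Several entries above quote rounded ratios ("17x smaller", "roughly one-ninth", "~4x the wall-clock speed"). The short check below recomputes them from the figures given in those same entries. One input is an assumption: Qwen3-Reranker-0.6B is treated as a nominal 600M parameters, which the sources do not state exactly.

```python
# Parameter counts in millions, as quoted in the HubNextEra entry.
PARAMS_M = {
    "ettin-32m": 32,
    "ettin-68m": 68,
    "bge-reranker-v2-m3": 568,
    "qwen3-reranker-0.6b": 600,  # assumption: nominal 0.6B read as 600M
}

# Wall-clock seconds on the 14-step MCP chain, as quoted in the Towards AI entry.
LATENCY_S = {
    "gemini-3.5-flash": 11.3,
    "claude-opus-4.7": 38.9,
    "gpt-5.5": 46.1,
}

# How many times larger BGE-reranker-v2-m3 is than the 32M Ettin reranker.
size_vs_bge = round(PARAMS_M["bge-reranker-v2-m3"] / PARAMS_M["ettin-32m"], 2)

# How many times larger Qwen3-Reranker-0.6B is than the 68M Ettin reranker.
size_vs_qwen = round(PARAMS_M["qwen3-reranker-0.6b"] / PARAMS_M["ettin-68m"], 2)

# How many times longer each frontier model took than Gemini 3.5 Flash.
flash = LATENCY_S["gemini-3.5-flash"]
slowdown = {
    name: round(seconds / flash, 2)
    for name, seconds in LATENCY_S.items()
    if name != "gemini-3.5-flash"
}

print(size_vs_bge)   # 17.75
print(size_vs_qwen)  # 8.82
print(slowdown)      # {'claude-opus-4.7': 3.44, 'gpt-5.5': 4.08}
```

On these inputs the quoted roundings hold up: 17.75 for "17x", about 8.8 for "one-ninth", and a speed advantage between roughly 3.4x and 4.1x for the "~4x" claim.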