Wei (Jack) Sun

Gemini reaches 17 US labs, Pokémon harness self-edits, RL resurfaces known facts

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

Sources

Gemini for Science: AI experiments and tools for a new era of discovery deepmind.google

A collection of science tools and experiments to expand the scale and precision of scientific exploration.

Fast-tracking genetic leads to reverse cellular aging deepmind.google

Biologists use Co-Scientist to find novel factors that successfully rejuvenate human cells.

Finding the molecular switches behind new infectious diseases deepmind.google

Clare Bryant uses Co-Scientist to identify genetic triggers in emerging infectious diseases.

Opening new paths in aging research deepmind.google

Calico Life Sciences uses Co-Scientist to connect scattered findings and generate new leads in aging research.

Accelerating discovery of liver disease mechanisms deepmind.google

Filippo Menolascina uses Co-Scientist to identify new liver disease treatments and explain why existing drugs only help certain patients.

Uniting biological toolkits for a new approach to ALS deepmind.google

Co-Scientist unites Boston Children’s Hospital and MIT’s labs to explore new RNA-based treatments for ALS.

Uncovering repurposed medicines to fight liver fibrosis deepmind.google

Stanford geneticist uses Co-Scientist to help find new treatments for chronic liver disease and liver fibrosis.

How WeatherNext helped the National Hurricane Center better predict Hurricane Melissa’s historic landfall in Jamaica deepmind.google

Learn how our WeatherNext AI model helped forecasters give communities unprecedented time to prepare ahead of the historic Hurricane Melissa.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents huggingface.co

A self-improving AI system for embodied agents autonomously refines its own prompts, skills, and memory through continuous learning without environment resets, achieving human-level performance in complex video games.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs huggingface.co

Reinforcement learning improves large language model recall of parametric knowledge by redistributing probability mass toward correct answers, with gains driven primarily by reinforcing rare but learnable examples.

Useful Memories Become Faulty When Continuously Updated by LLMs huggingface.co

Agentic systems that summarize past experience with an LLM end up worse than ones that keep raw episodic trajectories, the paper finds on ARC-AGI tasks. Faulty consolidation overwrites useful details, so continuous memory rewrites hurt rather than help long-running agents.

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models huggingface.co

Across model families, the spike in hidden-state magnitudes emerges at a single layer where RMSNorm and FFN parameters interact, collapsing representation diversity. The authors propose a targeted fix to that layer that lifts downstream task performance without retraining from scratch.

Efficient Pre-Training with Token Superposition huggingface.co

Token-Superposition Training bundles contiguous tokens into multi-hot bags during a superposition phase, then recovers with standard next-token loss. The trick cuts pre-training FLOPs and wall-clock time without touching architecture, tokenizer, or optimizer, making it a drop-in efficiency win.

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards huggingface.co

RubricEM decomposes long-horizon research into stages scored by rubrics, then evolves the meta-policy via reflection on judge feedback. Built on a stage-structured GRPO variant, it beats prior deep research agents on evidence gathering and synthesis benchmarks where verifiable rewards fall short.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation huggingface.co

Pion applies orthogonal equivalence transformations to weight updates so singular values stay intact, avoiding the spectral drift that destabilizes Adam and Muon. The Sphere AI Lab optimizer matches standard baselines on LLM pre-training while keeping spectral norm bounded throughout.

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes huggingface.co

A systematic study of on-policy and self-distillation for LLMs pinpoints three failure modes: distribution mismatch, optimization instability, and learning without privileged information. The authors prescribe fixes including TopK supervision and stop-gradient tweaks, clarifying when distillation beats RLVR or SFT.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training huggingface.co

Post-training works best when scarce labels drive sparse-reward RL for a teacher, then dense supervision compresses behavior into a student. The forward-KL warmup plus staged pipeline outperforms pure GRPO or on-policy distillation on MATH and AIME under fixed label budgets.

δ-mem: Efficient Online Memory for Large Language Models huggingface.co

A lightweight memory mechanism called δ-mem enhances large language models by augmenting a frozen attention backbone with a compact associative memory state that provides low-rank corrections to attention computations.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models huggingface.co

LoopUS is a post-training framework that transforms pretrained LLMs into looped architectures for improved reasoning performance through latent-refinement and adaptive early exiting mechanisms.

Learning, Fast and Slow: Towards LLMs That Adapt Continually huggingface.co

A fast-slow learning framework for large language models combines fixed parameters with optimized context to achieve better sample efficiency, reduced catastrophic forgetting, and improved adaptability in continual learning scenarios.

Solve the Loop: Attractor Models for Language and Reasoning huggingface.co

Attractor Models enable efficient iterative refinement through fixed-point solving with implicit differentiation, achieving superior language modeling and reasoning performance with reduced computational costs compared to traditional transformers.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues huggingface.co

A new benchmark called LongMemEval-V2 is introduced to evaluate memory systems’ ability to help agents acquire environment-specific experience in web environments, featuring a suite of memory methods including AgentRunbook-R and AgentRunbook-C that demonstrate varying performance in accuracy and latency.

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment huggingface.co

FATE is an on-policy framework that uses failure trajectories to improve agent safety and performance through self-evolution and Pareto-aware optimization.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents huggingface.co

ToolCUA is an end-to-end agent that learns optimal GUI-tool path selection through staged training, achieving superior performance in hybrid action space environments.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture huggingface.co

Unified vision-language models treat understanding and generation as integrated processes rather than separate tasks, demonstrating strong performance across multiple multimodal capabilities including image synthesis and action reasoning.

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs huggingface.co

Language models can be enhanced by transitioning from sequential message-based instruction-tuning to parallel stream processing, enabling simultaneous reading and generation across multiple concurrent data flows.

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward huggingface.co

AlphaGRPO enhances multimodal generation by applying Group Relative Policy Optimization to AR-Diffusion Unified Multimodal Models through self-reflective refinement and decompositional verifiable reward mechanisms.

Reward Hacking in Rubric-Based Reinforcement Learning huggingface.co

Research examines reward hacking in rubric-based reinforcement learning, identifying verifier failure and rubric-design limitations as key sources of divergence between training and evaluation metrics.

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor’s Internal States huggingface.co

POISE enables stable and efficient policy optimization for large reasoning models by estimating baselines using internal model signals, reducing computational overhead while maintaining performance comparable to existing methods.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction huggingface.co

Asynchronous reinforcement learning in large language models faces challenges with PPO-style corrections due to delayed updates and missing historical logits, which are addressed through exact and approximate correction methods including snapshot tracking and revised PPO-EWMA techniques.

MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents huggingface.co

MemPrivacy enables privacy-preserving personalized memory in edge-cloud environments by using type-aware placeholders to protect sensitive data while maintaining semantic integrity for effective memory operations.

Teaching Language Models to Think in Code huggingface.co

ThinC framework enables mathematical problem solving where code serves as the primary reasoning mechanism instead of a verification tool, demonstrating superior performance on math benchmarks.

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales huggingface.co

Multi-agent test-time evolution framework EVOCHAMBER enables emergent specialization through collaborative reflection and asymmetric knowledge transfer across coevolving agents.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values huggingface.co

Autonomous agents exhibit distinct value systems from underlying language models, requiring new benchmarking approaches to assess alignment across diverse execution environments.

MEME: Multi-entity & Evolving Memory Evaluation huggingface.co

MEME benchmark evaluates memory systems across multiple entities and evolving conditions, revealing persistent challenges in dependency reasoning despite advanced retrieval and prompting techniques.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents huggingface.co

A visual-native agent harness with image bank reference protocol enables reusable intermediate visual evidence and closed-loop data generation that improves multimodal deep search performance across multiple benchmarks.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives huggingface.co

CausalCine enables interactive, multi-shot video generation by addressing limitations of autoregressive models through causal modeling, dynamic memory routing, and real-time distillation techniques.

L2P: Unlocking Latent Potential for Pixel Generation huggingface.co

Latent-to-Pixel transfer paradigm efficiently leverages pre-trained latent diffusion models to create pixel-space models with minimal training overhead and high-resolution generation capabilities.

A Causal Language Modeling Detour Improves Encoder Continued Pretraining huggingface.co

Switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts through dense supervision effects in lower transformer layers.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark huggingface.co

Computer-use agents face reliability challenges with complex GUI interactions due to data scarcity, addressed through a multi-modal benchmark and synthetic data generation pipeline.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning huggingface.co

Unified multimodal models can improve performance by adaptively selecting coordination paths rather than using fixed patterns, enabling diverse reasoning strategies for different inputs.

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization huggingface.co

DRoRAE enhances visual representation by fusing multi-layer features from pretrained vision encoders through adaptive routing and incremental correction, improving reconstruction and generation quality.

AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation huggingface.co

LoRA optimizers are analyzed through a unified framework based on surrogate matrices and preconditioners, with AdaPreLoRA proposing a novel approach using Adafactor diagonal Kronecker preconditioning to improve factor-space updates while maintaining low memory usage.

Debiased Model-based Representations for Sample-efficient Continuous Control huggingface.co

DR.Q algorithm improves model-based representations for Q-learning by maximizing mutual information and using faded prioritized experience replay to reduce bias and overfitting in representation learning.

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks huggingface.co

PASA is a robust watermarking algorithm for large language models that operates at the semantic level using latent embedding spaces and shared randomness for secure text detection.

Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics huggingface.co

Enterprise discovery agents that read system configuration at runtime outperform traditional world models in configurable environments where dynamics change over time.

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration — Learning from Cheap, Optimizing Expensive huggingface.co

An agentic framework called AutoLLMResearch automates high-cost large language model experiment configurations by learning from multi-fidelity experimental environments and enabling efficient configuration identification through cross-fidelity extrapolation.

From Web to Pixels: Bringing Agentic Search into Visual Perception huggingface.co

Researchers introduce WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agent-based approach that connects hidden target identities to visual annotations through search and reasoning.

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning huggingface.co

SeePhys Pro benchmark reveals that current multimodal models struggle with representation-invariant reasoning when information shifts from text to visual formats, and demonstrates that blind training can improve performance through residual textual cues.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue huggingface.co

Multi-turn dialogue safety monitoring system detects harmful intent accumulation through turn-level analysis and evaluates performance on a new benchmark dataset.

MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments huggingface.co

MCP-Cosmos integrates generative World Models into the Model Context Protocol ecosystem to enhance agent planning and execution through predictive simulation in latent space.

World Action Models: The Next Frontier in Embodied AI huggingface.co

World Action Models unify predictive state modeling with action generation for embodied policy learning, forming a cohesive framework for understanding environment dynamics and action prediction.

World Model for Robot Learning: A Comprehensive Survey huggingface.co

World models as predictive representations of environmental dynamics have become essential for robot learning, supporting policy learning, planning, and simulation across various embodied applications.

Do not copy and paste! Rewriting strategies for code retrieval huggingface.co

Research investigates how different text rewriting strategies impact code retrieval performance, identifying that full natural language rewriting provides the greatest improvements while proposing entropy-based diagnostics to determine when such costly rewrites are beneficial.

LychSim: A Controllable and Interactive Simulation Framework for Vision Research huggingface.co

A simulation framework called LychSim is introduced, featuring a Python API, procedural data pipeline, and MCP integration to enable controllable and interactive environments for vision system development and evaluation.

VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors huggingface.co

VidSplat is a training-free generative reconstruction framework that uses video diffusion priors to synthesize novel views and recover complete 3D scenes from sparse inputs through adaptive denoising and iterative refinement.

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting huggingface.co

WildRelight dataset addresses the gap between synthetic and real-world single-image relighting by providing high-resolution outdoor scenes with aligned natural illumination, enabling physics-guided domain adaptation through diffusion posterior sampling and test-time adaptation.

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts huggingface.co

A local distribution-aware detection framework that amplifies micro-scale statistical irregularities to identify AI-generated images with improved accuracy.

References

getcoai.com getcoai.com

Hypothesis creation is the ‘most fun’ aspect that scientists are least likely to outsource… drug-repurposing candidates identified by the AI were already well-established in the literature, suggesting the tool may struggle to move beyond sophisticated summarization to true discovery.

PsyPost — Penadés Co-Scientist ‘rediscovery’ coverage psypost.org

Penadés famously recounted being out shopping when he received the results and telling his companion, ‘please leave me alone for an hour, I need to digest this thing’… He initially suspected the AI might have accessed his private files, prompting him to email Google to ask, ‘Do you have access to my computer?’

r/AI_Agents — ‘Tested 5 AI scientist platforms for biotech’ reddit.com

Faraday (AscentBio) is optimized for medicinal chemistry, while Biomni is preferred for broader academic research… FutureHouse’s Crow/Falcon agents reportedly achieve 90% accuracy on LitQA, significantly higher than human PhDs (~67%).

Google AI Developer Forum — ‘2026 Stability Crisis’ thread discuss.ai.google.dev

Early adopters reported an ‘infinite loading loop’ and aggressive ‘silent safety filters’ that disrupted complex workflows during the May 2026 rollout… ‘concept drift’ where the model rejected real-world 2026 data as ‘simulated test scenarios’ due to its pre-2025 training weights.

StartupFortune — biosecurity red-team reporting startupfortune.com

A group of 60 U.K. lawmakers accused Google of violating safety pledges by releasing experimental Gemini models without transparent, third-party safety testing… AI-generated versions of toxins can sometimes evade standard biosecurity screening software.

PPPL / DOE Genesis Mission announcement pppl.gov

The Genesis Mission… aims to double U.S. scientific productivity within a decade by transitioning from human-led research to AI-augmented ‘autonomous laboratories,’ connecting the Discovery supercomputer at Oak Ridge with Google DeepMind’s Co-Scientist across all 17 national labs.

Yue et al., ‘Does RL Expand the Capability Boundary of LLM Agents? A Pass@k Analysis’ (ResearchGate) researchgate.net

while RL-trained models significantly outperform base models at k=1, the performance gap narrows or disappears as k increases… suggests the correct reasoning paths were already present in the base model’s latent distribution; RL simply biased the model toward these paths.

arXiv 2509.25123 — boundary-narrowing critique of RLVR arxiv.org

RL improves sampling efficiency (Pass@1), it actually narrows the model’s total solution coverage at higher sampling budgets like Pass@128 or Pass@256… RL-trained models solve fewer problems overall than the base model when given enough attempts.

Nathan Lambert, Interconnects — ‘Reinforcement Learning with Random Rewards’ interconnects.ai

random rewards yielded a +21.4% boost in math accuracy, while rewarding explicitly incorrect labels resulted in a +24.1% gain, nearly matching the +29.1% gain from ground-truth rewards… findings are highly specific to the Qwen model family.

Moonlight review of ‘Spurious Rewards: Rethinking Training Signals in RLVR’ themoonlight.io

Qwen2.5-Math-1.5B reportedly improved its MATH-500 accuracy from 36.0% to 73.6% after exposure to just one query-answer pair… RL serves more as a directional nudge toward pre-existing logical ‘modes’.

arXiv 2506.06632 — Curriculum RL Easy-to-Hard for LLM reasoning arxiv.org

If a reasoning task is too difficult for a model to solve even once during on-policy rollouts, the reward remains zero, and the gradient signal vanishes entirely… the ‘hardest’ examples frequently provide the least learning signal in practice.

ACL 2025 Findings — ‘master keys’ for Qwen judges aclanthology.org

minimal, often meaningless tokens like ‘:’ or specific formatting patterns can trick Qwen2-based judges into providing false-positive rewards.

TechTimes — AI benchmarks under fire techtimes.com

Gemini’s harness included a custom mini-map, pathfinding tools, and direct RAM access to ‘read’ game text, whereas Claude often failed at simple tasks like cutting down a tree because it lacked the visual reasoning to recognize the obstacle.

TIME — AI ChatGPT Claude Gemini Pokemon time.com

Joel Zhang himself has stated that the experiment does not prove Gemini is ‘smarter’ than Claude, as the two operate under vastly different conditions… success in open-ended environments depends less on ‘brain size’ and more on the quality of the tools.

Moonlight review of Continual Harness themoonlight.io

Flash-Lite variants using the Continual Harness actually underperformed compared to simple baselines, suggesting that a minimum level of base reasoning is required to manage and utilize the complex tools the harness creates.

ResearchGate paper page (ablation summary) researchgate.net

Standard warm-up stages—such as Supervised Fine-Tuning on expert trajectories or offline GRPO—did not produce meaningful milestone advancement on their own; sustained progress only occurred when the model’s weights were updated mid-game using a teacher-student relabeling process.

Decrypt — Self-evolving AI agents unlearn safety decrypt.co

A coding agent’s refusal rate for harmful prompts dropped from 99% to 54% after it began iteratively refining its own logic to pursue goals more efficiently.

Next Signal Prediction Substack — ‘Harness is the Dataset’ nextsignalprediction.substack.com

The Harness is the Dataset. Competitive advantage is now the trajectories your harness captures.

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day: building frameworks, reading what industry ships, occasionally writing it down.


© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare