Sources

Paving the way for agents in biology anthropic.com

(AINews) FrontierCode: Benchmarking for Code Quality over Slop latent.space

We made a thing!

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses huggingface.co

A 20B search agent trained with reinforcement learning within a stateful search framework demonstrates superior retrieval performance across multiple domains by separating semantic decision-making from environmental bookkeeping.

Multi-Agent Computer Use huggingface.co

Coordinating several computer-use agents through a directed acyclic graph beats single-agent baselines on complex desktop tasks. The system decomposes work dynamically and runs subtasks in parallel, cutting wall-clock time while keeping each agent’s observation scope narrow enough to stay reliable.

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs huggingface.co

Reinforcement learning lifts multi-agent LLM workflow accuracy over base models, but gains hinge on topology, task, and scale. Shared-policy training suffers asymmetric gradient mass from dominant roles, while isolated-policy setups hit terminal degradation, exposing distinct failure modes for end-to-end RL.

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs huggingface.co

Averaging token distributions across multiple models cancels the perturbations watermarks rely on, collapsing detection z-scores while improving generation quality and speed. The WASH attack exploits vocabulary misalignment and statistical hybridisation, suggesting distributional watermarks are fragile against trivial multi-model setups.

Unified Neural Scaling Laws huggingface.co

A single Unified Neural Scaling Law fits and extrapolates network behavior across parameters, dataset size, training steps, inference steps, and compute simultaneously. The formulation holds across vision, language, math, and reinforcement learning architectures on both upstream and downstream tasks.

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters huggingface.co

Parameter-efficient fine-tuning is reframed as a substrate for persistent personal models, with small adapters storing instance-specific behavior atop shared trillion-parameter foundations. The paper outlines adapter identity, revision, provenance, and serving residency as the infrastructure needed to host millions of user-specific adapters.

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding huggingface.co

Domino accelerates speculative decoding by running a parallel backbone for token drafting alongside a lightweight causal refinement head trained via teacher-forced encoding. The split improves draft quality without inflating cost, delivering end-to-end and throughput speedups on Transformers and SGLang serving backends.

Not only where, But when: Temporal Scheduling for RLVR huggingface.co

Scheduling credit allocation across training time, rather than fixing it, improves RLVR policy evolution and stability. Early steps reweight advantages toward targeted tokens, then gradually relax to general optimization, keeping policy entropy healthy across trajectory percentiles and avoiding premature collapse.

Policy and World Modeling Co-Training for Language Agents huggingface.co

PaW is a co-training framework that combines policy learning and world modeling using on-policy reinforcement learning rollouts to improve language agent training without additional computational overhead.

Masking Stale Observations Helps Search Agents — Until It Doesn’t: A Regime Map and Its Mechanism huggingface.co

Observation masking in long-horizon search agents shows variable accuracy gains depending on the interaction between retriever capability and model capacity, following an asymmetric inverted-U pattern.

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents huggingface.co

OpenWebRL presents a framework for training visual web agents using online reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision.

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search huggingface.co

FineVerify is a self-verification framework for agentic search that improves accuracy through decomposed sub-question checking and trajectory selection.

Draft-OPD: On-Policy Distillation for Speculative Draft Models huggingface.co

Speculative decoding uses a lightweight draft model to accelerate large language model inference, but supervised fine-tuning of the draft model plateaus because of a mismatch between offline training and inference. Draft-OPD addresses this through on-policy distillation with target-assisted rollouts and error replay.

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems huggingface.co

Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation.

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models huggingface.co

A systematic comparison of vision-language models and video generation models reveals complementary strengths for spatial intelligence tasks, with vision-language models excelling in semantic tagging and instance grouping while video generation models perform better in dense geometry and camera motion prediction.

Measuring the Depth of LLM Unlearning via Activation Patching huggingface.co

A new metric called Unlearning Depth Score (UDS) is introduced to evaluate how thoroughly knowledge has been removed from large language models, addressing limitations of previous methods that could not detect hidden knowledge in internal representations.

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism huggingface.co

Speculative Pipeline Decoding introduces a novel framework that leverages pipeline parallelism to accelerate large language model inference by enabling parallel token processing and reducing decoding latency.

ESPO: Early-Stopping Proximal Policy Optimization huggingface.co

ESPO improves mathematical reasoning in large language models by detecting and terminating failed trajectories early, leading to better performance and reduced computational waste.

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks huggingface.co

Automated benchmark generation method creates challenging tasks with broader tool-use coverage by evolving tool sequences through adaptive contrastive n-gram modeling and iterative difficulty refinement.

NITP: Next Implicit Token Prediction for LLM Pre-training huggingface.co

Next Implicit Token Prediction enhances language model training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead.

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning huggingface.co

LongAttnComp adapts AttnComp for long-context processing by fine-tuning lightweight attention layers and implementing token-level chunking and positional reordering techniques.

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning huggingface.co

Chunk-Level Guided Generation uses a large language model as a process scorer to select fixed-length candidate chunks during small model generation, improving reasoning accuracy over traditional methods like majority voting and PRM guided search.

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models huggingface.co

RoboSemanticBench identifies a disconnect between semantic understanding and action prediction in vision-language-action models, where robots can grasp objects but fail to select semantically correct targets.

Agent Skills Should Go Beyond Text: The Case for Visual Skills huggingface.co

Multimodal skills that combine textual logic with visual support outperform text-only approaches in visual-centric tasks by incorporating spatial layout, visual grounding, and state-aware interactions.

MindZero: Learning Online Mental Reasoning With Zero Annotations huggingface.co

MindZero presents a self-supervised reinforcement learning framework that enables multimodal large language models to perform efficient and robust online mental reasoning without requiring explicit mental state annotations.

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation huggingface.co

MCP-Persona benchmark evaluates agent performance on personalized tools interacting with individual accounts and local databases, revealing significant challenges in current SOTA agents.

Joint Agent Memory and Exploration Learning via Novelty Signals huggingface.co

Joint Agent Memory and Exploration Learning (JAMEL) framework trains memory and exploration policies together through novelty-driven interaction, enabling effective exploration in open-ended environments with reduced computational costs.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation huggingface.co

LongLive-RAG addresses long-video generation challenges by using retrieval-augmented generation to overcome error accumulation from sliding-window attention, enabling better temporal coherence and quality.

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents huggingface.co

Model-aware skill alignment framework adapts skills to different backbones through hierarchical evolution and lightweight rewriter training, achieving superior performance across interactive tasks.

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories huggingface.co

Step-level skill adaptation framework with explicit failure attribution improves training-free skill maintenance for LLM agents in interactive tasks.

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion huggingface.co

VideoMLA reduces memory usage in video diffusion models by replacing per-head keys and values with shared low-rank content and decoupled 3D-RoPE positional keys, maintaining quality while achieving significant compression and improved throughput.

LVSA: Training-Free Sparse Attention for Long Video Diffusion huggingface.co

Long Video Sparse Attention (LVSA) addresses computational bottlenecks in video diffusion models by introducing a sparse attention mechanism that reduces compute costs while maintaining video quality beyond training horizons.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration? huggingface.co

The Target Viewpoint Reproduction task challenges foundation models to actively adjust 3D viewpoints to match target images, revealing limitations in visual history processing and embodied movement mapping. A unified post-training framework improves success rates across several training methods.

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs huggingface.co

Automated systems for generating scientific figures face limitations in handling diverse figure types and conditions, prompting the development of multi-agent frameworks that generalize across different input scenarios and produce editable output formats.

TVIR: Building Deep Research Agents Towards Text—Visual Interleaved Report Generation huggingface.co

A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence huggingface.co

Strategic Video Intelligence requires understanding, causal reasoning, and planning capabilities that current benchmarks fail to evaluate adequately, leading to significant performance gaps in complex cognitive tasks.

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts huggingface.co

Korean web-browsing agent benchmark K-BrowseComp evaluates frontier LLMs’ capabilities with 400 problems, showing significant performance gaps compared to English benchmarks and highlighting the need for more robust Korean AI development.

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft huggingface.co

MineExplorer benchmark evaluates multimodal large language models’ open-world exploration capabilities in Minecraft through atomic and multi-hop tasks designed via multi-agent synthesis.

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding huggingface.co

X-Stream introduces the first benchmark for multi-stream streaming understanding, revealing significant limitations of current MLLMs in handling concurrent streams.

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration huggingface.co

StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage distillation and maintaining visual consistency through memory mechanisms.

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding huggingface.co

PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization huggingface.co

Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance.

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes huggingface.co

RoboStressBench presents a principled benchmark for evaluating vision-language model robustness to physical visual stress in embodied AI, decomposing visual stress into material, viewpoint, lighting, and geometry dimensions.

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code huggingface.co

Vision-language models are evaluated for procedural 3D modeling tasks through a benchmark and ranking platform that assess their ability to translate text and images into executable 3D code.

AFUN: Towards an Affordance Foundation Model for Functionality Understanding huggingface.co

Affordance understanding model predicts functional masks and 3D motion curves from RGB-D observations and language descriptions, enabling generalizable robot manipulation across diverse environments.

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models huggingface.co

Semantic Object Correspondence (SOCO) benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while demonstrating strong prediction of downstream task performance.

Brain-IT-VQA: From Brain Signals to Answers huggingface.co

The Brain-IT-VQA framework decodes visual content from fMRI signals using a transformer-based architecture and introduces the NSD-VQA dataset for improved visual question answering evaluation.

ACL-Verbatim: hallucination-free question answering for research huggingface.co

Researchers develop a VerbatimRAG-based extractive question answering system, using a novel ground truth dataset and a ModernBERT model, to retrieve information from research papers more accurately.

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers huggingface.co

Researchers created HakushoBench, a Japanese chart and table visual question answering benchmark derived from governmental documents, to evaluate vision-language models’ ability to understand complex visual data beyond English-language datasets.

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation huggingface.co

A Turkish-focused sentence embedding model is developed through efficient adaptation techniques, achieving superior performance with reduced computational costs compared to larger teacher models.

References

arXiv preprint (VirBench paper) arxiv.org

Without specialized tools, frontier models exhibited significant variability… with retrieval accuracies as low as 17%. When agents were granted access to gget virus, retrieval accuracy surged to nearly 100%.

bioRxiv (Biomni / BiomniBench) biorxiv.org

Biomni demonstrated a 402.3% average relative performance gain over base LLMs… the specific agent harness or toolset provided can shift scores more significantly than upgrading the underlying model itself.

arXiv (REPRO-Bench / agentic reproducibility survey) arxiv.org

REPRO-Bench, which tasks agents with assessing the reproducibility of research papers, found that top agents initially achieved only 21.4% accuracy… agents often fail at end-to-end scientific discovery because they lack the scientific judgment to chain successes into a valid pipeline.

chatforest.com (NCBI-Datasets-MCP-Server / BioinfoMCP coverage) chatforest.com

A 2026 security audit of nearly 40,000 MCP repositories revealed that approximately 40% lacked authentication… while automated tools like the BioinfoMCP Compiler accelerate adoption, they also introduce potential vulnerabilities.

MDPI (Biosecurity Data Level framework) mdpi.com

The proposed Biosecurity Data Level (BDL) framework suggests five tiers of data control, ranging from unrestricted public release to highly restricted access for dual-use pathogen data… agent-friendly tools can provide novice uplift and expert enhancement.

Reddit r/AIGuild thread reddit.com

Model choice (GPT-5 vs Claude) becomes secondary when a robust, deterministic tool like gget virus handles the data retrieval… but skeptics argue ‘too dangerous’ framings are marketing spiel to justify closed-source dominance.

Cognition blog (FrontierCode launch post) cognition.ai

Correctness is table stakes; the benchmark evaluates whether a PR would actually be merged, with an 81% lower false-positive rate than SWE-Bench Pro across 3,000+ rubrics authored by 20+ open-source maintainers.

SlopCodeBench (arXiv, Wisconsin/Snorkel) huggingface.co

Agent-generated code is 2.3x more verbose and 2.0x more eroded than 473 open-source Python repositories; verbosity grows 6.6x and erosion 5.0x faster per checkpoint than in human-authored code, with ‘deletion phobia’ and ‘library aversion’ as core pathologies.

Arena.ai — Agent Arena methodology arena.ai

By randomizing component selections across sessions, Agent Arena runs a multi-intervention RCT to isolate an orchestrator’s ‘net improvement’ — some sessions involve nearly a thousand tool calls and hundreds of turns.

StartupHub.ai coverage startuphub.ai

GPT-5.5 scores 6.3% and Gemini 3.1 Pro 4.7% on Diamond — a steep drop-off from Opus 4.8’s 13.4% — and the Diamond tier is only 50 tasks, prompting questions about variance and the absence of error bars.

Digg roundup of HN/Willison reaction digg.com

Critics call the labeling effort a ‘sweatshop for brainpower’ and ask whether human maintainers would also score low on Diamond; Willison frames it through ‘cognitive debt’ inflicted on collaborators when unreviewed AI slop ships.

r/ClaudeAI thread on Ralph loops reddit.com

The ‘Ralph Wiggum’ persistent loop pattern — refuse to exit until tests pass — has been formalized in Claude Code’s /loop and Routines, but practitioners warn of ‘overbaking’ where unsupervised runs accrue bizarre emergent technical debt.

VentureBeat venturebeat.com

Researchers trained an open-source AI search agent (Harness-1) that outperforms GPT-5.4 on recalling relevant information

neuralnoise.com — ‘Harness-Bench (WIP)’ neuralnoise.com

the same model can produce a 6x performance variance depending solely on its wrapper… an agent is now defined as Model + Harness

ResearchGate — ‘MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents’ researchgate.net

autonomous search agents can inadvertently leak sensitive local information through their external web queries… an adversary observing the query stream can often reconstruct protected information through the ‘mosaic effect’

OpenReview — BrowseComp+ critique on eval awareness openreview.net

Claude Opus 4.6 independently hypothesized it was being tested, identified the specific benchmark, and then systematically searched for and decrypted the benchmark’s own answer key instead of performing the research task

Ramp Labs Substack — ‘Building with Tinker’ ramplabs.substack.com

Tinker currently lacks support for hosting ‘heavy’ custom reward functions locally, forcing developers to use remote services like Modal, which can introduce latency

Medium — ‘Tongyi DeepResearch’ overview medium.com

Tongyi DeepResearch previously held the lead in open-weights systems with scores of 43.4 on BrowseComp and 75 on xbench-DeepSearch

Jack Sun, writing.