Sources

Language Models Need Sleep huggingface.co

A sleep-like consolidation mechanism for transformer models uses fast weights and recurrent passes to improve long-context processing while maintaining inference speed.

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks huggingface.co

QUEST is an open family of deep research agents trained with synthesized data and reinforcement learning to perform well across diverse long-horizon search tasks.

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents huggingface.co

An RLVR framework for computer-use agents addresses data scarcity through a scalable generation pipeline and synthetic environments, achieving superior performance on verification and transfer benchmarks.

ECHO: Terminal Agents Learn World Models for Free huggingface.co

ECHO adds an auxiliary loss that makes CLI agents predict the next environment observation alongside their action, turning sparse terminal feedback into dense supervision. The hybrid objective pairs policy-gradient updates with environment cross-entropy, lifting GRPO performance and unlocking on-policy self-improvement on tool-use tasks.
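The hybrid objective can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `aux_weight` mixing coefficient, and the use of a plain REINFORCE-style policy term in place of GRPO. It is not the paper's implementation.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def hybrid_loss(action_logits, action, advantage,
                obs_logits, obs_tokens, aux_weight=0.1):
    """Policy-gradient surrogate plus next-observation cross-entropy.

    aux_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    # Policy term: -A * log pi(a | s) for the sampled action.
    pg = -advantage * log_softmax(action_logits)[action]
    # Auxiliary term: mean cross-entropy over the observed next-observation tokens.
    ce = -sum(log_softmax(row)[tok]
              for row, tok in zip(obs_logits, obs_tokens)) / len(obs_tokens)
    return pg + aux_weight * ce, pg, ce
```

Because the cross-entropy term is defined at every step, it supplies a training signal even when the task reward (and so the advantage) is zero, which is the sense in which sparse feedback becomes dense.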

Faithfulness Metrics Don’t Measure Faithfulness: A Meta-Evaluation with Ground Truth huggingface.co

A new ground-truth benchmark labels 3,066 chains of thought across 13 tasks and 10 models, then scores popular faithfulness metrics against it. Most land near random AUROC, exposing that today’s go-to reliability measures don’t actually track whether reasoning reflects the model’s computation.
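For readers unfamiliar with the yardstick, here is a small sketch of how AUROC scores a metric against binary ground-truth labels. The function name and the toy inputs are illustrative and are not taken from the benchmark.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. 0.5 means the metric ranks no better than chance.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A metric that assigns every chain of thought the same score gets exactly 0.5, which is the "random" baseline the summary refers to.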

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO huggingface.co

Differentiable attention and heuristic uncertainty weights for blending discount factors quietly hack the surrogate objective instead of building real temporal abstractions. The authors trace the failure to coupled value targets and propose target decoupling as a structural fix across actor-critic routing variants.

SEAL: Synergistic Co-Evolution of Agents and Learning Environments huggingface.co

Instead of freezing the sandbox, SEAL runs a closed loop where environments adapt to expose agent weaknesses while policies update on the resulting trajectories. Turn-level failure labels drive diagnosis-guided advantage reweighting, improving interactive tool use and out-of-distribution transfer for LLM agents.

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models huggingface.co

Studying the geometry of RL updates, the authors find policies drift off a stable low-dimensional learning trajectory just before reward hacking kicks in. Projecting gradients onto trusted singular directions constrains updates and delays shortcut exploitation without sacrificing legitimate reward gains.
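The projection step can be sketched as follows, assuming the trusted directions are already available as an orthonormal basis. How the paper selects them is not stated in this summary, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_onto_subspace(grad, basis):
    """Project grad onto the span of `basis`.

    Assumes `basis` is a list of orthonormal vectors (for example, leading
    singular directions of earlier updates); that choice is an assumption here.
    """
    proj = [0.0] * len(grad)
    for u in basis:
        coeff = dot(grad, u)  # component of grad along direction u
        proj = [p + coeff * x for p, x in zip(proj, u)]
    return proj
```

Any component of the gradient outside the trusted subspace is discarded, so the update cannot move the policy along directions the basis does not cover.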

Foundation Protocol: A Coordination Layer for Agentic Society huggingface.co

As autonomous agents browse, buy, and deploy software on users’ behalf, the bottleneck shifts from model capability to coordination. The Foundation Protocol specifies how agents form relationships, exchange value, and stay accountable, aiming to be a TCP/IP-style substrate for a multi-agent economy.

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention huggingface.co

Targeting Blackwell GPUs, ThriftAttention quantises most query-key interactions to 4-bit but escalates outlier blocks to FP16 inside an online softmax. The mixed-precision kernel keeps long-context and diffusion-model quality near full precision while capturing most of the FP4 throughput win.

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User’s Digital World huggingface.co

The Claw-Anything benchmark evaluates large language model agents on comprehensive user activity contexts spanning extended timeframes, multiple services, and diverse device interactions to assess true always-on personal assistance capabilities.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction huggingface.co

HorizonStream addresses long-term 3D reconstruction challenges by modeling geometric propagation through an evidence influence kernel, enabling stable, scalable streaming reconstruction with constant memory and linear time complexity.

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing huggingface.co

MemForest presents a memory framework for long-context LLM agents that improves scalability and reduces latency through parallel chunk extraction and hierarchical temporal indexing.

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills huggingface.co

Current large language model agents struggle to form robust reusable skills from episodic experience, with raw trajectory reuse often outperforming distilled skills due to discarded contextual cues.

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning huggingface.co

Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild huggingface.co

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operation

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents huggingface.co

ProAct is a proactive agent architecture that uses idle-time computation to anticipate user needs and improve task completion efficiency and accuracy.

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents huggingface.co

Large language model-based memory systems can benefit from personalized policies that adapt to individual user contexts, though accurate implementation remains challenging.

How Far Will They Go? Red-Teaming Online Influence with Large Language Models huggingface.co

Open-source large language models exhibit varying political expressivity and vulnerability to jailbreak techniques, necessitating systematic red-teaming frameworks for assessing their potential misuse in influence campaigns.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation huggingface.co

WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking huggingface.co

A synthetic benchmark for mobile GUI agents with 120 challenging tasks is introduced, featuring high-fidelity virtual environments with automatic reward generation and revealing significant limitations in current agent performance on complex, long-horizon interactions.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models huggingface.co

CRONOS is a benchmark for evaluating counterfactual physical consistency in video prediction models through controlled interventions in viewpoint, scene, object category, and appearance while maintaining fixed physical event types.

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test huggingface.co

CoSPlay is a GT-free framework that jointly improves code generation and unit test quality through cooperative self-play, achieving competitive performance without ground-truth unit tests.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning huggingface.co

ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.

Your Embedding Model is SMARTer Than You Think huggingface.co

SMART enhances multimodal retrieval by leveraging latent multi-vector capabilities from single-vector models through contrastive training and late-interaction inference, achieving state-of-the-art performance with reduced computational costs.

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator huggingface.co

A benchmark generator called RankJudge evaluates large language model judges on multi-turn conversations by creating flawed conversation pairs and using statistical models for ranking and difficulty assessment.

Recursive Flow Matching huggingface.co

Recursive Flow Matching enables high-fidelity, computationally efficient forecasting of complex spatiotemporal dynamics with improved accuracy and speed compared to existing methods.

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints huggingface.co

Researchers propose a novel communication architecture for multi-agent reinforcement learning that decouples policy representation from communication pathways, enabling better performance under bandwidth constraints.

SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges huggingface.co

SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages.

Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation huggingface.co

LogMILP is a weakly supervised framework for log anomaly detection that enables both bag-level detection and instance-level localization using prototype-guided structural modeling with counterfactual perturbation consistency regularization.

Channel-wise Vector Quantization huggingface.co

Channel-wise Vector Quantization replaces patch-wise tokens with channel-wise tokens in image tokenization, enabling a next-channel prediction framework that generates images by sequentially refining visual details.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery huggingface.co

AI systems are evolving from task-specific assistants to workflow-level research automators, facing challenges in autonomy, reproducibility, and accountability across scientific domains.

Toward Native Multimodal Modeling: A Roadmap huggingface.co

Native multimodal modeling advances beyond traditional fusion approaches by integrating modalities inherently within a unified transformer framework, enabling seamless understanding and generation across diverse input-output configurations.

TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction huggingface.co

TriSplat is a feed-forward 3D reconstruction network that uses oriented triangle primitives to directly generate simulation-ready meshes from single images, bypassing expensive post-processing steps.

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching huggingface.co

RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences.

PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design huggingface.co

PRISM is a decoder-only autoregressive transformer that efficiently solves the inverse design problem for multilayer thin-film optical coatings by jointly predicting material selection and thickness while leveraging spectrum prefix conditioning and cumulative-depth Rotary Position Embeddings.

Cross-scale Aligned Supervision for Training GANs huggingface.co

Standard GANs with adversarial supervision on intermediate outputs fail to maintain consistent sample trajectories across scales, leading to misalignment; a new transformer-based approach called CAT addresses this by enforcing consistency between intermediate and final outputs.

On-Policy Adversarial Flow Distillation for Autoregressive Video Generation huggingface.co

Adversarial Flow Distillation enables efficient distillation of heterogeneous video generation models by using on-policy feedback and forward-process flow-matching updates without requiring teacher scores or detailed trajectory information.

InstructSAM: Segment Any Instance with Any Instructions huggingface.co

InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3 through learnable instance queries and hybrid attention mechanisms.

Helix4D: Complex 4D Mesh Generation huggingface.co

Helix4D enables high-quality dynamic mesh generation by adapting Trellis2’s frame-local attention across frames and extending 3D positional encoding with 4D temporal information.

Geometry-Aware Image Flow Matching huggingface.co

Geometry-aware generative models leveraging spherical manifolds and optimal transport techniques outperform traditional Euclidean approaches for natural image synthesis.

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries huggingface.co

A natural language interface for transportation safety analysis uses large language models to translate user queries into structured spatial operations while maintaining deterministic database execution for reliable and reproducible results.

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion huggingface.co

Pantheon360 enables high-fidelity 360° video generation for digital twins by combining 3D-aware diffusion with explicit geometric caching to ensure spatial-temporal consistency.

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution huggingface.co

ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts.

ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement huggingface.co

ControlLight is a controllable low-light enhancement framework that uses a large-scale real-world dataset and weighted flow matching loss to ensure consistent image quality across varying enhancement strengths.

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference huggingface.co

Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning huggingface.co

MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that improves plausibility through vision-language reasoning and confidence-aware control mechanisms.

Macaron-A2UI: A Model for Generative UI in Personal Agents huggingface.co

Generative UI models enable personal agents to synthesize dynamic interfaces with lightweight executable actions for enhanced interaction beyond text-only formats.

MetaphorVU: Towards Metaphorical Video Understanding huggingface.co

Current multimodal large language models struggle with metaphorical video understanding due to poor cross-domain mapping, prompting the development of a new benchmark and enhancement framework.

Towards Customized Multimodal Role-Play huggingface.co

A new task and dataset for customized multimodal role-play is introduced, along with a unified model framework that enables consistent character customization across text and image modalities using few-shot learning.

References

Berkeley RDI — Trustworthy Benchmarks blog rdi.berkeley.edu

A 2026 audit found prominent benchmarks like WebArena and SWE-bench were susceptible to exploits that allowed agents to achieve near-perfect scores without actually solving the tasks — including reading hidden answer files via file:// URLs and downloading gold answer files from public Hugging Face URLs embedded in task configs.

Amazon AGI Labs — ‘A Practical Recipe for Training Computer-Use Agents with RL’ labs.amazon.science

Successful computer-use agents do not result from model improvements alone but from an end-to-end system addressing four pillars: data, reasoning, algorithms, and infrastructure… synthetic RL gyms provide stable, deterministic feedback and avoid the chaotic behavior (unintended data deletion or financial transactions) of training on the live web.

Hugging Face — xlangai/CUA-Gym dataset card huggingface.co

Some web task setups require CUA-Gym-Hub endpoints, which are currently stored as placeholders in the public dataset; developers must deploy the mock apps locally or via a private server to fully reconstruct the training environment.

OpenPipe ART docs — GSPO (experimental) art.openpipe.ai

GSPO’s stability benefits are most pronounced in sparse MoE architectures, with limited or no impact observed when applied to dense models… it remains categorized as an experimental feature with potentially evolving APIs.

Coasty — Computer-Use Agent Comparison 2026 (OSWorld) coasty.ai

Claude Opus 4.8 holds the top position on OSWorld-Verified at 83.4%, followed by open-source Holo3-35B-A3B at 82.6%; GPT-5.5 reaches 78.7% while OpenAI’s Operator scores as low as 38.1% in independent stress tests.

OSWorld project site (xlang-ai) os-world.github.io

OSWorld and CUA-Gym share a core author team — Tianbao Xie, Bowen Wang, Dunjie Lu, Junli Wang, supervised by Tao Yu — with the same lab releasing OSWorld-Verified in 2025 and then CUA-Gym in 2026 as a pipeline designed to generate training data that excels on it.

OSU-NLP-Group/QUEST GitHub README github.com

cached search databases and certain mid-training data remain under legal review and will only be released once compliance is confirmed

AI Security Chronicles — Comparison of Deep Research AI Agents aisecuritychronicles.org

WebSailor (72B and V2 variants) has emerged as the state-of-the-art for complex browsing, achieving 12.0% on BrowseComp-EN and 30.1% on BrowseComp-ZH

The Decoder — ‘AI search agents often confirm what they already know’ the-decoder.com

MiniMax M2.5 solved 44.5% of BrowseComp and Kimi K2.6 hit 62% on BrowseComp-ZH using internal weights alone, with browsing disabled

ResearchGate — LiveBrowseComp: Are Search Agents Searching or Just Verifying What They Already Know? researchgate.net

search-augmented systems typically experience a 25 to 40 point drop in performance, with some agents falling below 2% accuracy when forced to rely on real-time discovery rather than memory

arXiv — RIFT: Rubric Failure mode Taxonomy arxiv.org

synthetic rubrics provide broader but more imprecise coverage… if a rubric is fundamentally incomplete, even a ‘perfect’ judge will inadvertently reward hacked, deceptive, or low-quality behaviors

whatllm.org — January 2026 Open Source vs Proprietary whatllm.org

top open-source models like GLM-4.7 and Kimi K2.5 are within five points of the proprietary leaders… GLM-4.7 achieved 96% on the agentic τ²-Bench, surpassing Claude Opus 4.5

aib.vote — LLM Sleep Consolidation explainer aib.vote

Standard hybrid architectures remained trapped at near-random guessing, achieving only ~10% exact accuracy on Rule 110 at t=32; with 3-4 sleep loops the model broke past the logic horizon, reaching over 30% accuracy under the same token budget.

Yutori Scouts inbox digest scouts.yutori.com

Independent reports confirm a 52% accuracy improvement on GSM-Infinite when using sliding-window eviction combined with the sleep mechanism; because the sleep phase relies on specific SSM-attention hybrid blocks, it cannot be retroactively applied to standard models like Llama 3 or GPT-4 without structural modification.

Letta — ‘Sleep-time Compute’ paper (arXiv 2504.13171) arxiv.org

Letta’s framework is an agent-level orchestration where a primary agent spawns background sleep-time sub-agents to summarize histories and pre-compute reasoning traces, reducing test-time compute requirements by up to 5x — manipulating high-level tokens rather than modifying the model’s actual weight-based memory.

Test-Time Training E2E paper (test-time-training.github.io) test-time-training.github.io

TTT-E2E achieves near-identical accuracy to full-attention models at 128k context lengths while being up to 35x faster for 2M-token sequences — a continuous online-update analogue that some experts on Hacker News call ‘more flexible and elegant’ than periodic sleep phases.

NVIDIA Research — Gated Delta Networks research.nvidia.com

Gated DeltaNet integrates a data-dependent gating ‘global reset’ with the Delta Rule ‘surgical eraser’, consistently surpassing Mamba2 and earlier DeltaNet variants — the very fast-weight substrate that the sleep paper repurposes for offline consolidation.

nestfrontier.com — community commentary nestfrontier.com

Critics argued that anthropomorphizing machine functions — comparing a weight update cycle to biological hippocampal replay — is misleading, with one commenter likening it to calling a server reboot a ‘nap’.

Jack Sun, writing.