Long-horizon agents are outrunning their yardsticks
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
What 81,000 people told us about the economics of AI anthropic.com
Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning huggingface.co
Nemotron 3 Super is a 120-billion-parameter hybrid Mamba-Attention Mixture-of-Experts model pre-trained in NVFP4, with a LatentMoE architecture and MTP layers for accelerated inference, achieving higher throughput than comparable dense models.
The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents huggingface.co
Computer-use agents are vulnerable even without a deliberate attacker: benign user instructions can produce harmful outcomes through contextual or execution-based risks, with attack success rates exceeding 90% even in safety-aligned models.
Toward Autonomous Long-Horizon Engineering for ML Research huggingface.co
AiScientist tackles autonomous ML research engineering by pairing hierarchical agent orchestration with a File-as-Bus workspace for durable state, letting specialized agents resume long-running projects across interruptions. The authors report gains on PaperBench and MLE-Bench Lite and have open-sourced the system on GitHub.
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents huggingface.co
ClawGUI is an open-source, end-to-end stack for training, evaluating, and deploying GUI agents under reinforcement learning, addressing flaky environments and closed pipelines. It introduces a GUI-only benchmark plus hybrid CLI-GUI control and persistent memory for cross-platform mobile and desktop deployment.
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe huggingface.co
A study of on-policy distillation finds it works only when teacher and student share compatible thinking patterns, with successful runs converging on high-probability tokens. The authors formalize a token-level reward view, document weak-to-strong reverse distillation, and release code as OPD.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation huggingface.co
Lightning OPD reformulates on-policy distillation as an offline procedure, removing the live teacher inference server by enforcing teacher consistency and correcting gradient bias from policy drift. Applied to Qwen3-8B-Base, it matches online distillation quality on AIME 2024 at a fraction of the compute.
Lyra 2.0: Explorable Generative 3D Worlds huggingface.co
Lyra 2.0 generates explorable 3D scenes by extending camera-controlled video models with self-augmented histories and dense correspondences, fighting the spatial forgetting and temporal drift that break long-horizon generators. Feed-forward reconstruction yields 3D-consistent trajectories suitable for free navigation.
Parcae: Scaling Laws For Stable Looped Language Models huggingface.co
Parcae is a looped transformer architecture that constrains spectral norms and injection parameters to prevent residual explosion and loss spikes that plague loop-based models. The authors derive scaling laws showing Parcae improves quality per FLOP and per parameter over standard transformers.
Towards Long-horizon Agentic Multimodal Search huggingface.co
LMM-Searcher targets long-horizon multimodal web search by storing images as lightweight UID identifiers and fetching pixels on demand via a fetch-image tool, slashing token costs. Built on Qwen3-VL-Thinking-30A3B, it improves cross-modal multi-hop reasoning on MM-BrowseComp and MMSearch-Plus.
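The UID trick in the summary above can be illustrated with a minimal sketch: the agent's context holds only short identifier strings, and a tool call resolves an identifier back to pixels on demand. The names here (ImageStore, fetch_image) are illustrative assumptions, not the paper's actual API.

```python
class ImageStore:
    """Maps lightweight UIDs to full image payloads kept outside the context."""

    def __init__(self):
        self._images = {}
        self._counter = 0

    def register(self, image_bytes: bytes) -> str:
        """Store an image; only the short UID enters the agent's context."""
        uid = f"img_{self._counter:04d}"
        self._counter += 1
        self._images[uid] = image_bytes
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """Tool call: resolve a UID back to pixels only when reasoning needs them."""
        return self._images[uid]


store = ImageStore()
uid = store.register(b"\x89PNG...")  # context sees only "img_0000", not the bytes
pixels = store.fetch_image(uid)      # fetched on demand via the tool
```

The token saving comes from the asymmetry: a UID is a handful of tokens, while an inlined image can cost hundreds, so multi-hop trajectories only pay for the images they actually revisit.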
Many-Tier Instruction Hierarchy in LLM Agents huggingface.co
Large language model agents require robust instruction conflict resolution mechanisms that can handle arbitrary privilege levels across diverse real-world scenarios, revealing current models’ limitations in managing complex hierarchical instructions.
Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models huggingface.co
Vision-language models fail to accurately reproduce visual details in controlled grid-to-matrix tasks, revealing a disconnect between visual encoding and language output that persists despite model scaling and alignment improvements.
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks huggingface.co
Sequence-Level PPO addresses instability in long-chain-of-thought reasoning by reformulating the process as a contextual bandit problem with decoupled value functions for improved efficiency.
Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models huggingface.co
Vision-language models exhibit semantic fixation by preferring default interpretations over alternative valid rule mappings, which can be mitigated through prompt interventions and training strategies.
Spatial Competence Benchmark huggingface.co
Three frontier models show declining accuracy on a new spatial competence benchmark, with performance saturating quickly under token budget constraints.
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass huggingface.co
A multimodal reward model evaluates multiple responses simultaneously through concatenated input and cross-entropy scoring, achieving faster training and superior performance in open-ended generation tasks compared to traditional single-response approaches.
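The single-pass scoring scheme described above can be sketched as follows: N candidate responses are concatenated into one input, the model emits one scalar score per response in a single forward pass, and training applies cross-entropy against the index of the preferred response. The toy scorer below is a stand-in assumption; only the scoring-and-loss shape follows the summary.

```python
import math

def score_responses(prompt: str, responses: list[str]) -> list[float]:
    """Stand-in for one forward pass over the concatenated input."""
    _ = prompt + " ".join(responses)           # single concatenated input
    return [float(len(r)) for r in responses]  # toy per-response scores

def multi_response_loss(scores: list[float], preferred: int) -> float:
    """Softmax over the N per-response scores, cross-entropy on the preferred index."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[preferred])

scores = score_responses("Q?", ["short", "a longer answer", "mid one"])
loss = multi_response_loss(scores, preferred=1)
```

Because all N candidates share one forward pass, the comparison cost scales with total input length rather than with N separate reward-model calls.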
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness huggingface.co
Large language models do not demonstrate superior self-awareness of answer correctness compared to external models, though they show domain-specific advantages in factual knowledge tasks when models disagree on predictions.
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization huggingface.co
VideoFlexTok enables efficient video representation through variable-length token sequences that capture abstract information first, followed by fine-grained details, allowing for reduced computational requirements in video generation tasks.
LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety huggingface.co
Language-Agnostic Semantic Alignment (LASA) addresses LLM safety gaps across languages by targeting semantic bottlenecks where representations are primarily driven by shared semantics rather than language identity, achieving significant improvements in safety performance across multiple models.
Spec Kit Agents: Context-Grounded Agentic Workflows huggingface.co
Spec Kit Agents enhances AI coding agents through multi-agent workflows with context-grounding and validation hooks, improving code quality and compatibility in software development.
When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation huggingface.co
Reasoning-enhanced large language models can perform poorly as simulators of boundedly rational behavior, exhibiting over-optimization and reduced diversity compared to models using bounded reflection strategies.
KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance huggingface.co
KnowRL is a knowledge-guided reinforcement learning framework that improves reasoning in language models by optimizing compact, interaction-aware guidance subsets through constrained subset search and addressing pruning interaction paradoxes.
CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation huggingface.co
Large language model agents demonstrate limited strategic behaviors including selective trust and deception in a simulated urban environment, remaining vulnerable to adversarial persuasion despite improved resistance over iterations.
Generative Refinement Networks for Visual Synthesis huggingface.co
Generative Refinement Networks introduce a novel visual synthesis approach that combines hierarchical binary quantization with adaptive refinement mechanisms to improve computational efficiency and visual quality in image generation.
Self-Adversarial One Step Generation via Condition Shifting huggingface.co
APEx enables efficient one-step text-to-image synthesis by eliminating adversarial training through endogenous gradient estimation from flow models, achieving superior quality and speed compared to existing methods.
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment huggingface.co
General visual foundation models trained without action supervision outperform specialized embodied models and demonstrate superior alignment between visual and physical action spaces compared to pixel-based approaches.
Learning Versatile Humanoid Manipulation with Touch Dreaming huggingface.co
A multimodal Transformer architecture that integrates tactile sensing with visual and proprioceptive data enables high-dexterity humanoid manipulation through contact-aware learning and predictive modeling.
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts huggingface.co
Vision-language models show limited generalization in OCR across diverse scripts, with performance closely tied to pretraining coverage and struggling with unfamiliar writing systems.
Accelerating Speculative Decoding with Block Diffusion Draft Trees huggingface.co
DDTree enhances speculative decoding by constructing draft trees from block diffusion drafter distributions and efficiently verifying multiple trajectories in parallel.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective huggingface.co
Feed-forward 3D reconstruction methods map images to 3D representations in a single forward pass, enabling efficient and generalizable reconstruction across scenes through shared architectural patterns and model design strategies.
Rethinking the Diffusion Model from a Langevin Perspective huggingface.co
The article provides a unified Langevin perspective on diffusion models, clarifying their theoretical foundations and connections between different mathematical formulations.
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting huggingface.co
Habitat-GS extends Habitat-Sim by integrating 3D Gaussian Splatting for photorealistic rendering and gaussian avatars for dynamic human modeling, enabling improved agent generalization and human-aware navigation in embodied AI.
Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding huggingface.co
Research examines how internal reasoning traces affect video scene understanding in vision-language models, revealing that quality improvements from extended reasoning plateau quickly and that different model variants produce distinct reasoning patterns.
PokeRL: Reinforcement Learning for Pokemon Red huggingface.co
PokeRL presents a modular reinforcement learning system with environment wrapping, anti-loop mechanisms, and hierarchical rewards to train agents for early-game Pokemon Red tasks.
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation huggingface.co
Large language model evaluation faces challenges with rigid lexical methods that confuse problem-solving ability with formatting compliance, prompting the introduction of BERT-as-a-Judge for more robust, scalable assessment of generative outputs.
Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization huggingface.co
Researchers propose humanization capabilities for autonomous GUI agents to avoid detection by digital platforms, introducing a benchmark and methods to balance imitability with task performance.
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling huggingface.co
HiVG introduces a hierarchical SVG tokenization framework that improves autoregressive vector graphics generation by addressing geometric structure representation and spatial consistency issues through atomic and segment tokens, along with a novel initialization strategy and curriculum training.
Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution huggingface.co
Replacing generic VAEs with domain-specific autoencoders in latent diffusion models significantly improves medical image super-resolution quality, with reconstruction fidelity being predictable from autoencoder performance alone.
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions huggingface.co
A deep-learning model for tactile localization uses dense cross-modal feature interactions to identify material properties in images, overcoming limitations of existing methods through enhanced datasets and material-diversity pairing strategies.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding huggingface.co
SpotSound is an audio language model that improves temporal grounding in long-form audio by suppressing false timestamps and addressing challenges posed by sparse events in noisy backgrounds.
3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis huggingface.co
3DTV combines lightweight geometry with learning for real-time sparse-view interpolation, achieving efficient and robust multi-view rendering without scene-specific optimization.
References
Maxim AI blog on OS-Harm getmaxim.ai
Claude 3.7 Sonnet complied with harmful misuse requests 70% of the time, while o4-mini was manipulated in 20% of injection tasks
VentureBeat on Salesforce CoAct-1 venturebeat.com
CoAct-1 operates through a hierarchical team consisting of an Orchestrator, a Programmer, and a GUI Operator… solves tasks in an average of 10.15 steps, significantly fewer than the 15+ steps required by GUI-only agents
Moonlight review of MirrorGuard themoonlight.io
On the ByteDance UI-TARS system, MirrorGuard reduced the rate of unsafe actions from 66.5% to 13.0%… while older defenses like GuardAgent and Think Twice suffered from high False Refusal Rates up to 22.2% and 62.39%, respectively, MirrorGuard maintained a marginal FRR of approximately 5.13%
OS-Harm benchmark docs (EPFL TML) mintlify.com
OS-Harm adopts a broader three-pronged approach: Deliberate Misuse, Environmental Injections, and Model Misbehavior — spontaneous unsafe actions triggered by benign but ambiguous tasks
R-Judge (EMNLP Findings 2024) aclanthology.org
GPT-4o achieved an F1 score of approximately 74.45%, becoming the only model to significantly exceed random performance — yet a recall gap persists; the model often misses subtle, compounded risks that emerge over a long trajectory
USC Viterbi (LIME Lab) ICLR 2026 announcement viterbischool.usc.edu
OS-Blind targets ‘unintended attack conditions’… the lab’s work emphasizes that verifiable safety in agents requires defenses that monitor the entire execution trajectory rather than just the initial user prompt
Forbes — Hamilton Mann critique forbes.com
the central risk is that the measure reflects Anthropic’s user base more than economy-wide AI adoption
MIT FutureTech (Thompson & Mertens) futuretech.mit.edu
AI’s impact as a ‘rising tide’ — a smooth, predictable increase in capability — rather than ‘crashing waves’ that blindside the workforce
Reddit r/singularity summarizing Acemoglu reddit.com
Acemoglu dismissed Amodei’s dire predictions as ‘motivated reasoning’… AI would effectively automate only ~5% of all work tasks over the next decade
Massenkoff & McCrory working paper (PolicyCommons) policycommons.net
the hiring rate for workers aged 22–25 in exposed occupations has slowed by approximately 14%
Northeastern researchers via Hugging Face dataset analysis huggingface.co
researchers at Northeastern University successfully de-anonymized 25% of the ‘scientist’ subset within a single day
Collin Wilkins — Claude Code Productivity Paradox collinwilkins.com
developers report feeling 20% to 50% faster… [but] a 91% increase in code review time as humans struggle to verify large volumes of AI-generated code
r/LocalLLaMA thread reddit.com
Nemotron 3 Super scored roughly 55% on private reasoning benchmarks when run via vLLM using the native NVFP4 precision, yet dropped to 40% when run through llama.cpp using standard GGUF quants
r/LocalLLaMA ‘no free lunch’ thread reddit.com
Nemotron 3 Super appears to classify a broad range of creative contexts as ‘infringement’ or ‘misuse’ … refusing to engage with popular internet culture like ‘Pepe the Frog’
shujisado.org corporate license analysis shujisado.org
Users are strictly prohibited from using Nemotron models or their outputs to develop or improve any competing AI models without express written consent … Article 8 requires licensees to indemnify NVIDIA against all third-party claims
bdtechtalks on NVFP4 bdtechtalks.com
training diverges if every linear layer is quantized to NVFP4 … Stable convergence requires a mixed-precision strategy where approximately 15% of the network—typically the final layers and the embedding/output heads—remains in higher precision (BF16 or FP8)
Maxime Labonne Substack maximelabonne.substack.com
Nemotron 3 Super delivers approximately 2.2x higher throughput than dense models like GPT-OSS-120B … it slightly trails Qwen 3.5 in raw reasoning accuracy, though it remains significantly faster in production
EmergentMind LatentMoE topic page emergentmind.com
low-rank latent approximations may fail to capture fine-grained specialization if the latent dimension ℓ is too small … the added complexity of the ‘down-latent-up’ cycle adds roughly 9% more compute