Sources

Cosmos 3: Omnimodal World Models for Physical AI huggingface.co

Cosmos 3 is an omnimodal world model that processes and generates multiple data types through a unified mixture-of-transformers architecture, achieving state-of-the-art performance in various understanding and generation tasks.

Large Language Models Hack Rewards, and Society huggingface.co

Large language models trained with reinforcement learning can exploit ambiguities in societal regulations to discover loopholes that bypass regulatory intent, posing safety risks for real-world deployment.

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation huggingface.co

MapAgent is an industrial-grade agentic architecture that combines vision-language processing with constraint-aware reasoning to produce specification-compliant lane maps, achieving high automation rates in large-scale urban mapping.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? huggingface.co

AutoLab tests frontier models on long-horizon, iterative auto research and engineering work across multiple domains. The benchmark finds that persistent iteration and time awareness predict success more reliably than a model’s initial answer quality, reframing what matters for autonomous research agents.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning huggingface.co

Large reasoning models often over-explore before answering. ThoughtFold applies fine-grained, masked preference optimization to flag and fold redundant chain-of-thought branches, cutting wasted steps while preserving accuracy on verifiable-reward tasks where compute scales with reasoning length.

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions huggingface.co

Economy of Minds gives agents wealth, auctions and decentralized credit assignment instead of a central coordinator. Competition and economic selection produce emergent collective intelligence that beats monolithic baselines on multi-step reasoning and optimization, with no central planner orchestrating which agent acts when.

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution huggingface.co

BenchEvolver mutates reference solutions through structured, executable transformations to generate harder coding tasks from saturated benchmarks like LiveCodeBench and SciCode. The synthesized problems stay valid and diverse, drop frontier Pass@1 scores, and double as RL training signal for model self-improvement.

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors huggingface.co

GRAIL composes 3D assets with video foundation model priors to synthesize diverse humanoid loco-manipulation trajectories, recovering 4D human-object interaction. An object-aware latent adaptor and scene-aware tracker feed egocentric visual policies that transfer sim-to-real on a humanoid robot.

Streaming Communication in Multi-Agent Reasoning huggingface.co

Most multi-agent systems generate a full response before passing it on. StreamMA streams intermediate reasoning steps between agents, letting downstream agents start on reliable early tokens. The paper derives speedup upper bounds tied to cost ratios across Chain, Tree and Graph topologies.

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases huggingface.co

MedSP1000 adapts standardized patient cases from medical education into an interactive benchmark where clinical agents must conduct full encounters and longitudinal management. Scored against peer-reviewed rubrics, current LLMs falter on dynamic reasoning that static medical QA benchmarks never surface.

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning huggingface.co

CHERRL is a controlled environment for studying reward hacking in rubric-based reinforcement learning with LLM judges, enabling detection and analysis of subtle bias exploitation patterns.

MemTrain: Self-Supervised Context Memory Training huggingface.co

A self-supervised training framework called MemTrain enhances long-horizon language model agents’ memory capabilities through proxy tasks optimized via GRPO, improving downstream reasoning performance.

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents huggingface.co

Agent libOS provides a runtime substrate for long-running LLM agents with process-like execution, tool management, and security boundaries implemented through explicit capabilities and runtime primitives.

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories huggingface.co

Deep-research agents can be audited using a claim-centric framework that identifies error spans in their reasoning trajectories, improving reliability assessment beyond just final answer evaluation.

BraveGuard: From Open-World Threats to Safer Computer-Use Agents huggingface.co

BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems huggingface.co

RAMP is a production-grounded evaluation framework that assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis.

DAR: Deontic Reasoning with Agentic Harnesses huggingface.co

Deontic reasoning tasks require applying complex rules and policies, and an agentic approach enables models to dynamically access statutes, showing mixed performance improvements across different model strengths.

Unlocking Feature Learning in Gated Delta Networks at Scale huggingface.co

Scaling rules for Gated Delta Networks are derived through coordinate-size estimation propagation, enabling stable learning-rate transfer across model widths with both AdamW and SGD optimizers.

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation huggingface.co

Echo-Infinity enables real-time infinite video generation using learnable evolving memory and unified relative RoPE to overcome limitations in existing autoregressive methods.

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning huggingface.co

Agentic Chain-of-Thought Steering (ACTS) formulates reasoning steering as a Markov decision process to enable efficient, controllable chain-of-thought reasoning with token savings.

Self-Distilled Policy Gradient huggingface.co

A self-distilled policy-gradient framework combines on-policy self-distillation with verifier advantages and KL regularization to improve reinforcement learning stability and performance.

Qwen-Image-Flash: Beyond Objective Design huggingface.co

Few-step distillation for visual generative models benefits from systematic investigation of training recipes beyond just distillation objectives, leading to improved student performance through optimized data composition, teacher guidance, and task mixture.

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs huggingface.co

OVO-S-Bench presents a comprehensive benchmark for evaluating streaming spatial intelligence in multimodal language models through human-annotated questions spanning multiple abstraction levels.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory huggingface.co

SuperMemory-VQA is introduced as an egocentric visual question answering dataset designed to evaluate AI assistants on long-term memory tasks through real-world activities recorded with AI glasses.

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes huggingface.co

Vision-language models demonstrate strong performance on isolated spatial reasoning tasks but fail to maintain coherent spatial understanding and reliable actions during multi-turn interactive feedback in 3D environments.

M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks huggingface.co

Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? huggingface.co

MMG2Skill converts web-based procedural guides into executable skills through closed-loop learning, improving agent performance across GUI control, gameplay, and card play tasks.

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation huggingface.co

FiRe-OPD improves on-policy distillation in large language models by filtering low-quality trajectories and applying soft reweighting to enhance informative token selection and optimization stability.

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification huggingface.co

Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US-G

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching huggingface.co

Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence Reinforcement Learning to improve performance.

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations huggingface.co

STRIDE enables efficient training data attribution for LLMs by modeling functional effects in activation space through sparse recovery and steering operators, achieving superior speed and accuracy compared to traditional gradient-based methods.

MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation huggingface.co

MeshWeaver introduces an autoregressive mesh generation framework that predicts vertices directly rather than coordinates, utilizing a multi-level sparse-voxel encoder to enhance geometric context and achieve superior compression and fidelity.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing huggingface.co

A bilingual multi-attribute benchmark for instruction-guided speech editing is introduced to systematically evaluate speech modification capabilities across atomic and compositional tasks.

Audio Interaction Model huggingface.co

A unified streaming audio model is developed that combines offline task execution with real-time audio instruction following through an end-to-end framework supporting multiple audio interaction capabilities.

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts huggingface.co

WebRISE evaluates MLLM-generated web artifacts by analyzing interaction contracts that capture user intent transitions and requirement checks across multiple input modalities, revealing significant gaps in model performance and demonstrating superior error detection compared to traditional methods.

PaintBench: Deterministic Evaluation of Precise Visual Editing huggingface.co

PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations.

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation huggingface.co

AAD-1 improves one-step autoregressive image-to-video generation by breaking generator-discriminator symmetry and using phased training to prevent motion collapse and training instability.

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging huggingface.co

MergePipe addresses expert weight access limitations in large language model merging by formulating it as an expert access-set problem with budget-aware execution and deterministic planning.

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game huggingface.co

Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations.

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models huggingface.co

Graph language models transform graph structure into tokens for large language models, but internal analysis reveals a disconnect between token activation saliency and actual graph information utilization.

OpenSTBench: Beyond Semantic Evaluation for Speech Translation huggingface.co

OpenSTBench presents a unified evaluation framework for speech translation systems that assesses multiple dimensions including translation quality, speech quality, and temporal consistency across different modalities and settings.

ZipSplat: Fewer Gaussians, Better Splats huggingface.co

ZipSplat is a token-based feed-forward method that decouples 3D Gaussian placement from pixel grid, enabling efficient scene reconstruction with fewer Gaussians and superior performance on pose-free imaging tasks.

Score-Control for Hallucination Reduction in Diffusion Models huggingface.co

Variance-Guided Score Modulation reduces hallucinations in diffusion models by controlling score function smoothness through Jacobian modulation while maintaining image quality.

KletterMix: Climbing Toward High-Quality German Pretraining Data huggingface.co

A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks.

Measuring the Symmetry—Data Exchange Rate huggingface.co

Research demonstrates that architectural symmetry priors can reduce sample complexity, but findings suggest that misaligned constraints may be actively harmful and that equivariance benefits depend on specific implementation details and experimental design.

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning huggingface.co

Stable-Layers uses reinforcement learning with vision-language model feedback to improve layer decomposition without paired data, employing Flow-GRPO and LoRA adaptation for optimized policy training.

Neural Networks Provably Learn Spectral Representations for Group Composition huggingface.co

Neural network training on group composition tasks exhibits convergence to irreducible representations and rotational rank-one alignment through Riemannian gradient ascent on representation-theoretic energy functionals.

Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning huggingface.co

DeepMDMD combines deep learning with Koopman theory to learn latent coordinates while enforcing algebraic constraints, enabling stable forecasting and coherent structure preservation in complex dynamical systems.

Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain huggingface.co

A semi-supervised noise adaptation framework treats synthetic noise domains as source domains to improve generalization on the target domain.

Scalable Inference-Time Annealing with Surrogate Likelihood Estimators huggingface.co

A scalable inference-time annealing method uses flow-based models with energy-based surrogates to sample Boltzmann distributions efficiently, without costly divergence calculations.

References

MarkTechPost marktechpost.com

Cosmos 3 uses a two-tower Mixture-of-Transformers backbone: an autoregressive ‘Reasoner’ (initialized from a Qwen3-VL backbone) paired with a diffusion-based ‘Generator’, released as a 16B ‘Nano’ (8B+8B) and a 32B+32B ‘Super’ configuration.

Linux Foundation / PR Newswire prnewswire.com

The Linux Foundation released OpenMDW 1.1 with NVIDIA adopting it for Cosmos, Isaac GR00T, Ising and Nemotron families; the license unifies weights, datasets, scripts and documentation under a single permissive grant and imposes no restrictions on model outputs.

Futurum Group analysis futurumgroup.com

Whether Cosmos 3 makes open physical AI a reality or stalls under fragmentation depends on whether the broader ecosystem adopts the stack — the model’s deep coupling to Omniverse and NIM microservices risks locking enterprises into NVIDIA infrastructure even when the weights are open.

Reddit r/StableDiffusion (Dark_Pulse / Iwaku_Real) reddit.com

At NVFP4 the 16B Nano fits a 5080 fine (~14GB), but 40-series cards must cast up to FP8 which doubles VRAM; at higher precision ‘a 5090 is too low end for even the Nano version’, and ComfyUI support is absent — you currently have to serve via vLLM Omni.

RoboArena paper (Atreya et al., PMLR) proceedings.mlr.press

RoboArena scores policies via decentralized double-blind pairwise human preferences across a distributed evaluator network, producing Elo-style rankings rather than fixed-task success rates — a methodology designed to mitigate lab-specific overfitting.
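To make "Elo-style rankings" from pairwise preferences concrete, here is a minimal sketch of the textbook Elo update applied to a stream of (winner, loser) judgments. It is an illustration only: the source does not say which rating procedure RoboArena actually uses, and the function names, the starting rating of 1000, and the K-factor of 32 are all assumptions.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise preference; k (assumed value) sets the step size."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta  # zero-sum: total rating mass is conserved


def rank_policies(comparisons, initial=1000.0, k=32.0):
    """comparisons: iterable of (winner, loser) policy names (hypothetical data)."""
    ratings = {}
    for winner, loser in comparisons:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_elo(ratings, winner, loser, k)
    # Highest-rated policy first.
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    prefs = [("policy_a", "policy_b"), ("policy_a", "policy_c"), ("policy_b", "policy_c")]
    for name, rating in rank_policies(prefs):
        print(f"{name}: {rating:.1f}")
```

Because the ranking comes from relative judgments rather than a fixed task suite, a policy cannot raise its score by fitting one lab's setup; it has to win comparisons across evaluators.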

AI Business Weekly (world-model race 2026) aibusinessweekly.net

V-JEPA 2 predicts in a compact latent space and reportedly processes physical dynamics up to 30x faster than generative world models, while Genie 3 sustains 24 fps interactive 3D with multi-minute spatial memory — putting Cosmos 3’s diffusion-heavy generator at an efficiency disadvantage.

ResearchGate — DuMapNet (KDD 2024, Baidu) researchgate.net

DuMapNet… a transformer-based network to extract vectorized lane elements directly from bird’s-eye-view (BEV) imagery, achieving a reported 95% reduction in production costs

ResearchGate — MapAgent paper page (experimental numbers) researchgate.net

Accuracy increased from 52.2% to 63.9% (+11.7), F1 from 68.6% to 78.0%… while IoU remained relatively stable (71.4% to 72.8%), suggesting MapAgent acts primarily as a specification-aware editor that corrects topological and categorical errors rather than altering the underlying lane geometry

HBKU ELMI — ‘MapAgent: A Hierarchical Agent for Geospatial Reasoning’ elmi.hbku.edu.qa

There is notable confusion with a separate project also named ‘MapAgent’ focused on general geospatial reasoning, leading to calls for clearer nomenclature in the emerging agentic field

Gasgoo Auto News — Baidu Maps lane-level deployment autonews.gasgoo.com

Users have reported ‘non-closable’ advertisements, such as beverage banners, appearing directly on the 3D lane-level navigation screen, raising concerns about driving safety and UI clutter

ResearchGate — MapTRv2 speed/accuracy comparison researchgate.net

MapTRv2 with a ResNet-50 backbone reaches 68.7 mAP on nuScenes, surpassing VectorMapNet by over 20 points… MapTR-nano achieves ~25 FPS, roughly eight times faster than VectorMapNet-C

AI Native Foundation daily digest (2026-06-04) ainativefoundation.org

The vision-language Judge (typically powered by models like Qwen3-VL)… uses SAM3 (Segment Anything Model 3) backbone fine-tuned for lane detection via multi-stage progressive unfreezing

Jack Clark, Import AI #460 jack-clark.net

An ‘institutional DDoS’ where automated machines overwhelm and exploit bureaucratic systems… AI systems gain proficiency in qualitative and communicative tasks [and] can interact with bureaucracy at a scale and speed that humans cannot match.

aggyai.com commentary on SocioHack aggyai.com

LLM refusal mechanisms are typically triggered by explicitly harmful prompts but fail to intervene when exploitation is framed as an optimization problem, such as ‘maximizing credit card points’ or ‘managing corporate structure’.

Hugging Face blog (Kseniase, FoD #90) on Dr. GRPO huggingface.co

Standard GRPO suffers a ‘response-length bias’—the penalty for a long incorrect response is mathematically diluted compared to a short one, producing a ‘short correct, long wrong’ failure pattern; Dr. GRPO normalizes by a global constant to remove the incentive for verbosity.
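The dilution described above can be checked with simple arithmetic. The sketch below is a simplified model, not either method's full objective: it ignores clipping, importance ratios, and KL terms, and looks only at how a sequence-level advantage is spread over tokens. The advantage value, response lengths, and the constant of 1000 are hypothetical.

```python
def per_token_weight(advantage, length, constant=None):
    """Weight each token receives from a sequence-level advantage.

    constant=None: divide by the response's own length (per-response averaging,
    as in standard GRPO). Otherwise divide by a fixed global constant, the
    normalization the quoted passage attributes to Dr. GRPO.
    """
    denom = float(length) if constant is None else float(constant)
    return advantage / denom


def response_total(advantage, length, constant=None):
    """Total weight summed over all tokens of the response."""
    return per_token_weight(advantage, length, constant) * length


if __name__ == "__main__":
    wrong = -1.0  # hypothetical advantage of an incorrect response
    for length in (100, 1000):
        print(
            length,
            per_token_weight(wrong, length),                 # shrinks as length grows
            per_token_weight(wrong, length, constant=1000),  # independent of length
        )
```

With length normalization, the long wrong answer is penalized ten times less per token than the short one, while its total penalty stays the same. With a global constant, the per-token penalty is identical and the total grows with length, so padding a wrong answer no longer pays.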

OpenReview paper on LLM-as-judge reliability openreview.net

A ‘super-consistent’ tier of judges whose agreement levels exceed human-to-human variation [may indicate] oversimplification… in legal contexts human experts frequently exhibit lower baseline agreement, meaning an LLM that is ‘too consistent’ may be missing the very nuances that define expert legal reasoning.

JD Supra — ‘When AI Agents Misbehave: Governance’ jdsupra.com

Liability flow is often obscured by information asymmetry and the speed of machine-led decisions… traditional legal doctrines are being adapted to treat AI as having ‘discretionary authority,’ but enforcement remains difficult because agents cannot be sanctioned personally.

Collinear.ai — ‘Gaming the System: Goodhart’s Law’ blog.collinear.ai

As the ‘RL reasoning budget’ increases, the rate at which models exploit specifications also rises, creating a ‘semantic fidelity gap’ where paper-high performance masks a decline in semantic depth and out-of-distribution reliability.

Jack Sun, writing.