Wei (Jack) Sun

MolmoAct2 tops GPT-5, Meta clears CWM on lax evals, Stanford caps counting at 2K

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

MolmoAct2: Action Reasoning Models for Real-world Deployment huggingface.co

MolmoAct2 presents an open action-reasoning model for robotics that improves on previous systems through a specialized vision-language-model backbone, new datasets, open-weight action tokenizers, an architectural redesign for continuous-action prediction, and adaptive reasoning that reduces latency.

Code World Model Preparedness Report huggingface.co

The Code World Model demonstrates no additional frontier risks beyond current AI ecosystem concerns, leading to its release as an open-weight model.

Counting as a minimal probe of language model reliability huggingface.co

Studies of stable counting capacity reveal that large language models rely on finite internal states rather than general logical reasoning for rule execution, even when appearing to follow instructions.

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help? huggingface.co

Frontier coding agents fail ambiguous tasks not from lack of skill but from poor judgment about when to escalate. HiL-Bench scores agents on Ask-F1, question precision, and blocker recall, and pairs the benchmark with a shaped-reward RL recipe targeting unresolvable uncertainty.
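The summary mentions Ask-F1 alongside question precision and blocker recall. A plausible reading is that escalation is scored as a binary prediction against ground-truth blocker labels; the sketch below implements that reading. The field names (`asked`, `blocked`) and the exact definition are assumptions, not HiL-Bench's actual spec.

```python
# Hypothetical Ask-F1-style metric: treat "asking for help" as a binary
# prediction and score it against ground-truth blocker labels.
# Field names and the definition are assumptions, not HiL-Bench's spec.

def ask_f1(episodes):
    """episodes: list of dicts with boolean 'asked' (agent escalated)
    and 'blocked' (task was genuinely unresolvable without help)."""
    tp = sum(e["asked"] and e["blocked"] for e in episodes)
    fp = sum(e["asked"] and not e["blocked"] for e in episodes)
    fn = sum(not e["asked"] and e["blocked"] for e in episodes)
    precision = tp / (tp + fp) if tp + fp else 0.0   # question precision
    recall = tp / (tp + fn) if tp + fn else 0.0      # blocker recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this reading, an agent that never asks scores zero recall no matter how capable it is, which matches the paper's framing of the failure as judgment rather than skill.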

Hallucinations Undermine Trust; Metacognition is a Way Forward huggingface.co

Hallucinations persist because models cannot tell what they know from what they do not, the authors argue, and scaling knowledge alone will not close the gap. The paper outlines metacognitive architectures using uncertainty quantification and confidence intervals inside agent systems.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments huggingface.co

Cross-application desktop workflows remain brutal for current GUI agents, and WindowsWorld quantifies how badly they break. The process-centric benchmark scores conditional judgment, reasoning, and execution efficiency on multi-step tasks that span several professional Windows applications inside a simulated environment.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments huggingface.co

Clinical agent claims meet a harder test in PhysicianBench, which runs LLMs through multi-step physician tasks inside live electronic health record environments. Scoring uses execution-grounded verification of tool calls and documentation, exposing wide capability gaps on real-world clinical scenarios.

Motion-Aware Caching for Efficient Autoregressive Video Generation huggingface.co

Autoregressive video diffusion wastes compute redenoising static regions, and MotionCache fixes that with motion-weighted cache reuse keyed to inter-frame pixel differences. A coarse-to-fine schedule sets per-token update frequencies, delivering meaningful speedups without visible quality loss.
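The core mechanism described above, reuse cached outputs where inter-frame pixel differences are small, can be sketched in a few lines. This is a toy over flat per-pixel lists, not MotionCache's latent-token implementation, and the motion threshold is an illustrative assumption.

```python
# Minimal sketch of motion-weighted cache reuse in the spirit of MotionCache.
# Real systems operate on latent tokens inside a diffusion loop; here
# "frames" are flat lists of pixel values and the threshold is an assumption.

def update_mask(prev_frame, cur_frame, threshold=0.05):
    """True at positions whose inter-frame difference exceeds the
    threshold, i.e. positions that must be re-denoised, not reused."""
    return [abs(a - b) > threshold for a, b in zip(prev_frame, cur_frame)]

def denoise_with_cache(cur_frame, prev_frame, cache, denoise):
    """Reuse cached outputs for static positions, recompute moving ones."""
    mask = update_mask(prev_frame, cur_frame)
    return [denoise(cur_frame[i]) if moving else cache[i]
            for i, moving in enumerate(mask)]
```

The paper's coarse-to-fine schedule would additionally vary the update frequency per token over denoising steps; the static/moving split above is the zeroth-order version of that idea.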

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs huggingface.co

Vision-language models lose sight of the image as generation lengthens, a failure the authors call Visual Signal Dilution. A lightweight persistent visual memory module reinjects visual embeddings through the feed-forward network, restoring attention and lifting accuracy on complex reasoning tasks.

T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning huggingface.co

Multi-turn agentic RL collapses when exploration runs unchecked across long trajectories. T²PO monitors token- and turn-level uncertainty to trigger dynamic resampling and trajectory filtering, sharpening credit assignment and producing more stable training runs than standard policy optimization.
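As a rough illustration of uncertainty-gated trajectory filtering, one can compute token-level Shannon entropy and drop turns whose mean entropy exceeds a threshold, flagging them for resampling. The threshold and the decision rule here are assumptions for illustration, not T²PO's actual recipe.

```python
import math

# Illustrative sketch of uncertainty-gated trajectory filtering, loosely
# following the T²PO summary. Threshold and rule are made-up assumptions.

def token_entropy(probs):
    """Shannon entropy of one token's predictive distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def keep_trajectory(turn_token_probs, max_mean_entropy=1.0):
    """Drop (for later resampling) a turn whose mean token-level entropy
    is too high, i.e. where exploration has run unchecked."""
    entropies = [token_entropy(p) for p in turn_token_probs]
    return sum(entropies) / len(entropies) <= max_mean_entropy
```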

Linear-Time Global Visual Modeling without Explicit Attention huggingface.co

Attention mechanisms in Transformers can be reinterpreted as MLPs with dynamically predicted parameters, offering a linear-complexity alternative to explicit attention while maintaining sequence modeling performance.
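The "MLP with dynamically predicted parameters" framing can be sketched as: pool the sequence into one global vector (linear in N), derive per-channel MLP parameters from it, then apply that MLP to each token independently. The pooling choice and the parameter-prediction rule below are simplifications for illustration, not the paper's architecture; a real model would use a learned hypernetwork.

```python
# Back-of-the-envelope sketch of "attention as a dynamic MLP": no pairwise
# attention, only a global mean-pooled context vector that predicts a
# per-channel affine transform applied to every token -- O(N) in length.

def dynamic_mlp_mixing(tokens):
    """tokens: list of equal-length float vectors. Returns mixed tokens."""
    dim = len(tokens[0])
    # Global context: mean-pool over the sequence (linear in N).
    ctx = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    # "Predict" a per-channel scale and bias from the context vector;
    # a learned hypernetwork would replace these fixed rules.
    scale = [1.0 + c for c in ctx]
    bias = [0.5 * c for c in ctx]
    return [[scale[d] * t[d] + bias[d] for d in range(dim)] for t in tokens]
```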

AcademiClaw: When Students Set Challenges for AI Agents huggingface.co

AcademiClaw presents a comprehensive benchmark for evaluating AI agents on complex academic tasks spanning multiple domains, revealing significant capability gaps in current models.

ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models huggingface.co

Combining stochastic processes with diffusion models addresses combinatorial complexity limitations, accelerating training and enabling asynchronous generation across data modalities.

From Context to Skills: Can Language Models Learn from Context Skillfully? huggingface.co

A self-evolving framework autonomously discovers and refines context-specific skills for language models through a multi-agent self-play loop with Challenger, Reasoner, and Judge components, improving context learning performance without human supervision.

Perceptual Flow Network for Visually Grounded Reasoning huggingface.co

Perceptual Flow Network addresses limitations in vision-language models by decoupling perception from reasoning and using variational reinforcement learning with multi-dimensional rewards for improved visual reasoning.

Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation huggingface.co

Ψ-RAG addresses limitations in tree-based retrieval-augmented generation for cross-document multi-hop questions through a hierarchical abstract tree index and multi-granular retrieval agent.

Generative Modeling with Orbit-Space Particle Flow Matching huggingface.co

Orbit-Space Geometric Probability Paths (OGPP) presents a particle-native flow-matching framework that improves generative modeling of particle systems through orbit-space canonicalization, particle index embeddings, and geometric probability paths with arc-length-aware terminal velocities.

Prior-Aligned Data Cleaning for Tabular Foundation Models huggingface.co

The deep reinforcement learning framework L2C2 addresses prior mismatch in tabular foundation models by sequentially applying data-cleaning operators to align real-world data with synthetic training distributions, improving both accuracy and calibration.

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis huggingface.co

BlenderRAG enhances natural language to Blender code generation by leveraging a retrieval-augmented approach with a curated multimodal dataset, improving both compilation success and semantic alignment without fine-tuning.

Agentic AI Systems Should Be Designed as Marginal Token Allocators huggingface.co

Agentic AI systems should be evaluated as marginal token allocation economies rather than text generators, with all components optimizing the same first-order condition of marginal benefit equals marginal cost plus latency and risk costs.
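The first-order condition in the summary, marginal benefit equals marginal cost plus latency and risk costs, can be made concrete with a toy allocator: keep spending tokens while the next token's marginal benefit still clears the combined cost. The cost figures and the diminishing-benefit curve in the test are made-up assumptions.

```python
# Toy illustration of the first-order condition above: a component spends
# tokens while MB(n) >= MC + latency + risk. All cost figures are made up.

def allocate_tokens(marginal_benefit, token_cost, latency_cost, risk_cost,
                    max_tokens=1000):
    """Return how many tokens to spend before the condition
    MB(n) >= token_cost + latency_cost + risk_cost stops holding."""
    threshold = token_cost + latency_cost + risk_cost
    n = 0
    while n < max_tokens and marginal_benefit(n) >= threshold:
        n += 1
    return n
```

The paper's point is that every component of an agentic system (planner, retriever, verifier) should be clearing the same threshold, rather than each maximizing its own output length.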

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling huggingface.co

Training on high-quality filtered data for multiple epochs yields better performance than a single pass over larger, less-filtered datasets for non-English language models.

A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets huggingface.co

A hybrid approach combining diffusion-based and image-to-image translation methods improves photorealism in synthetic datasets while maintaining semantic consistency.

Linking spatial biology and clinical histology via Haiku huggingface.co

Haiku is a tri-modal contrastive learning model that integrates spatial proteomics, histology, and clinical data to enable cross-modal retrieval, improved classification, and zero-shot biomarker inference while supporting counterfactual predictions for cancer stage progression.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models huggingface.co

OceanPile presents a large-scale multimodal corpus for ocean science, combining diverse data types and a knowledge graph-guided instruction dataset to advance marine AI applications.

Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark huggingface.co

Researchers developed a new dataset and challenge for pancreatic cancer staging that emphasizes uncertainty-aware AI models capable of handling ambiguous tumor-vessel interfaces, revealing that traditional segmentation metrics fail to capture clinically relevant performance in complex cases.

References

Pebblous AI — VLA architecture comparison blog.pebblous.ai

MolmoAct2 outperformed competitors in a ‘many-trial’ setup, scoring 0.51 on average, compared to π0.5 at 0.32 … On the ‘MolmoBot’ benchmark, MolmoAct2 achieved a 20.6% success rate, roughly doubling the 10.3% recorded by π0.5.

KrASIA — Unitree founder on VLA consensus kr-asia.com

Unitree’s founder has called current VLA architectures ‘relatively dumb,’ arguing that the field is over-focused on foundational data at the expense of developing more unified, efficient model architectures that could operate with less data.

The Robot Report — CMU safety study therobotreport.com

Every model tested failed safety checks, with robots willing to execute harmful instructions such as brandishing a knife, removing mobility aids like wheelchairs, or enacting discriminatory behaviors based on a user’s identity.

Medium — Inside the architecture of MolmoAct2 (towardsdev) medium.com

Per-layer KV-cache conditioning … increases the memory footprint during inference, as the projected KV cache for all layers must be stored during the iterative flow matching loop … despite support for $6,000 low-cost arms, the compute required for the full 8B-parameter reasoning stack remains out of reach for most hobbyist-grade edge devices.

OpenDriveLab — AgiBot World opendrivelab.com

Policies pre-trained on AgiBot World allegedly achieve a 30% performance improvement over those trained purely on Open X-Embodiment in both seen and out-of-distribution scenarios.

SiliconANGLE — Ai2 releases MolmoAct 2 siliconangle.com

Piloted in high-stakes environments, such as CRISPR gene-editing labs at Stanford, where the model successfully automated sample movement and lab equipment operation.

Center for AI Policy — Meta’s Frontier AI Framework centeraipolicy.org

Meta defines its critical risk threshold by whether a model would ‘uniquely enable’ a threat scenario… [this] might allow Meta to dismiss a model’s risks if a competitor has already released a similarly capable model or if human experts could theoretically perform the same tasks.

AI Lab Watch — Meta scorecard ailabwatch.org

External researchers found that model performance on certain cyber-offense tasks jumped from 5% to 100% simply by allowing the model to ‘think’ (chain-of-thought) or use basic tools, features Meta reportedly omitted in its initial evaluations.

OpenAI — Estimating Worst-Case Frontier Risks of Open-Weight LLMs (gpt-oss) openai.com

OpenAI researchers employed Malicious Fine-Tuning (MFT), which specifically optimizes models to maximize dangerous capabilities while intentionally stripping safety filters… gpt-oss failed to cross the ‘Preparedness High’ threshold.

PromptLayer — first reactions to Meta’s CWM release blog.promptlayer.com

Significant criticism has focused on the model’s FAIR Non-Commercial Research License, which prevents its use in production software or commercial AI assistants… Meta has explicitly warned that CWM is not a general-purpose chat model.

r/machinelearningnews — CWM release thread reddit.com

The 32B size is a ‘sweet spot’ for local execution… however, the model is ‘brittle’ when used outside of its intended agentic workflows and requires a specific, rigid system prompt to function correctly.

arXiv 2605.00932 (CWM Preparedness Report) — Self-play SWE-RL follow-up cited in independent eval roundup arxiv.org

Independent projects, such as the Self-play SWE-RL (SSR) framework, have already begun using the CWM-sft checkpoint as a base for training even more advanced software agents, reporting performance gains of over 10 points on coding benchmarks through autonomous self-improvement.

Anthropic Transformer Circuits — ‘Linebreaks’ (2025) transformer-circuits.pub

Specific features within Gemma-2-9B’s residual stream activate according to the number of characters since a newline… patching these character-count features from a clean prompt into a corrupted one can force a line break at an incorrect position.

arXiv 2410.19730 — Flip-Flop Language Modeling / ‘attention glitches’ arxiv.org

Transformers suffer from ‘attention glitches’—sporadic, non-extrapolating errors where the model fails to retrieve the correct bit state over long-range dependencies… recurrent architectures like LSTMs or modern SSMs can solve the flip-flop task perfectly with far fewer parameters.
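The flip-flop task in the quote is small enough to state as a reference implementation: the model must carry the last written bit across arbitrarily long stretches of 'ignore' operations and reproduce it on 'read'. The operation encoding below is one common formulation, shown as an assumption rather than the paper's exact format.

```python
# The flip-flop task as a reference implementation: remember the last
# written bit across 'ignore' noise and emit it on every 'read'.

def flip_flop(ops):
    """ops: list of ('write', bit), ('ignore', bit), or ('read', None).
    Returns the bits emitted at each read."""
    state, reads = None, []
    for op, bit in ops:
        if op == "write":
            state = bit          # overwrite the single memory cell
        elif op == "read":
            reads.append(state)  # recall the last written bit
    return reads
```

A two-state recurrence solves this exactly, which is the quote's point about LSTMs and SSMs; transformer attention must instead re-locate the last 'write' at every 'read', and that retrieval is where the sporadic glitches appear.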

Weights & Biases — RULER evaluation report wandb.ai

While Llama 3.1 70B advertises a 128K token window, its effective context—defined as the range where it can reliably retrieve and reason—is measured at approximately 64K, a 50% discrepancy.

PNAS — numerical bias in LLMs pnas.org

Alignment (Instruction Tuning and RLHF) has been shown to amplify these biases… causing it to ‘collapse’ into these biased attractors more frequently than base (pre-trained) versions of the same model.

Arbisoft — ‘Why LLMs can’t count the r’s in strawberry’ arbisoft.com

LLMs do not ‘see’ individual letters; instead, they process text as numeric vectors representing chunks of characters… ‘strawberry’ might be tokenized into ‘straw’ and ‘berry,’ obscuring the individual occurrences of the letter ‘r’.
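The tokenization point is easy to demonstrate directly. The split of "strawberry" into "straw" + "berry" follows the article's own example; real tokenizers vary.

```python
# A model that sees only chunk identifiers cannot count characters inside
# them. The "straw" + "berry" split follows the article's example.

tokens = ["straw", "berry"]          # what the model "sees" as two IDs
word = "".join(tokens)

per_token_r = [t.count("r") for t in tokens]   # visible only with access
total_r = word.count("r")                      # to the raw characters

# "straw" has 1 'r' and "berry" has 2, so the word has 3 in total -- but a
# model operating on token IDs never observes these character counts.
```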

arXiv 2601.04480 — TC^0 / log-depth limits of transformers arxiv.org

Log-precision Transformers are limited to the complexity class TC^0, meaning they cannot naturally solve problems involving modular counting (like the PARITY problem) or simulate T steps of an arbitrary FSM without at least O(log T) layers.
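The PARITY problem in the quote is simply simulating a two-state finite automaton for T steps: the state after T input bits is their XOR. The sequential simulation below is trivial for a recurrent model (one transition per symbol); the quote's claim is that a constant-depth log-precision transformer cannot express this for arbitrary T without depth growing as O(log T).

```python
# PARITY as a two-state FSM simulation: the state after T inputs is the
# XOR of all bits -- one sequential transition per input symbol.

def parity_fsm(bits):
    state = 0
    for b in bits:
        state ^= b   # two-state automaton: flip on 1, hold on 0
    return state
```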

Jack Sun

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare