Sources

The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models huggingface.co

Large language models show arithmetic fragility due to geometric structures in residual streams, where neural noise causes quantization failures that can be detected and corrected through geometric analysis.

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs huggingface.co

The PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation? huggingface.co

Video generation models were evaluated through robotic manipulation tasks to assess their ability to reflect physical reality, revealing that visual quality does not predict executable motion accuracy.

Regret Minimization with Adaptive Opponents in Repeated Games huggingface.co

A new game-theoretic framework replaces external regret with repeated policy regret to handle opponents that adapt across rounds. The authors prove sublinear regret using non-convex optimization with a convex surrogate, yielding stronger equilibrium guarantees that align with subgame perfect equilibria in repeated games.

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces huggingface.co

SABER evaluates coding agents inside stateful project workspaces rather than isolated prompts, revealing high rates of harmful safety violations even from aligned models. The benchmark argues that refusal tests miss environment-aware risks like overwriting files or corrupting state during multi-step agent execution.

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation huggingface.co

Reinforcement learning trains models to treat translation as a contextual meta-skill, using grammar and dictionary snippets at inference rather than memorized pairs. The approach beats supervised fine-tuning on chrF for unseen languages and transfers zero-shot to new linguistic contexts.

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents huggingface.co

Researchers dissect continual learning for self-evolving agents along three axes: experience granularity, injection pattern, and internalization regime. Principle-level experiences with on-policy context-distillation avoid the capability collapse seen in instance-level, step-wise injection during multi-iteration training.

Latent Reasoning with Normalizing Flows huggingface.co

A TARFlow-based framework runs intermediate reasoning in continuous latent space while keeping autoregressive KV-cache decoding intact. The method supports probabilistic sampling, likelihood estimation, and policy-gradient optimization, posting gains over discrete chain-of-thought on code-generation benchmarks.

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents huggingface.co

Metacognitive Memory Policy Optimization rewards agents for reducing belief entropy about the latent task state, not just final answers. Targeting epistemic uncertainty in recursive summaries cuts the information loss that degrades memory-augmented LLMs across long-horizon trajectories.
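The idea of rewarding a reduction in belief entropy can be illustrated with a minimal sketch. This is not the paper's reward; it only assumes the belief over the latent task state is a discrete probability distribution and uses plain Shannon entropy. The names `entropy` and `entropy_reduction_reward` are hypothetical.

```python
import math

def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief distribution."""
    return -sum(p * math.log(p) for p in belief if p > 0)

def entropy_reduction_reward(before, after):
    """Hypothetical reward: how much uncertainty a memory update removed.

    Positive when the updated belief is sharper than the previous one,
    negative when the update made the agent less certain.
    """
    return entropy(before) - entropy(after)

# A belief that goes from uniform over four task states to fully certain
# earns a reward of log(4) nats under this assumption.
reward = entropy_reduction_reward([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0])
```

Under this assumption, a summary step that leaves the belief unchanged earns zero, which is one way to see why such a signal is denser than a reward on the final answer alone.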

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination huggingface.co

Atomic Decomposition and Recombination breaks seed problems into reusable units, then recombines them into novel verifiable tasks spanning algorithms, tool use, and data science. The pipeline scales RLVR training data beyond heuristic seed expansion while preserving automatic reward checking.

OPRD: On-Policy Representation Distillation huggingface.co

On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency.

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis huggingface.co

World-language-action models combine textual instruction processing with robot state prediction through an autoregressive transformer backbone, enabling efficient long-horizon task execution and cross-embodiment learning.

LLM Anonymization Against Agentic Re-Identification huggingface.co

AURA is an LLM-powered anonymization framework that balances privacy protection against agentic web-search re-identification while preserving contextual utility through adaptive privacy scopes and mask-reconstruct methods.

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment huggingface.co

ForeSci is a temporally controlled benchmark that evaluates LLM agents’ ability to make forward-looking research decisions from historical evidence across fast-moving AI domains.

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs huggingface.co

Inference-time scaling is enhanced through constrained optimization that allocates computational resources based on economic principles, improving performance in resource-constrained environments.

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models huggingface.co

GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.

RobotValues: Evaluating Household Robots When Human Values Conflict huggingface.co

The RobotValues benchmark evaluates household robot planners in value-conflict scenarios, revealing that vision-language models exhibit default value preferences and struggle to override them when instructed to prioritize conflicting values.

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution huggingface.co

Code2LoRA is a hypernetwork framework that generates repository-specific LoRA adapters for code language models, supporting parameter-efficient fine-tuning for both static and evolving codebases.

AdaCodec: A Predictive Visual Code for Video MLLMs huggingface.co

AdaCodec reduces video encoding redundancy by transmitting full visual tokens only when scene prediction fails, otherwise encoding compact inter-frame change descriptions.

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration huggingface.co

TIDE is a template-guided iterative framework for discovering hidden problems from context, using iterative discovery and thought templates to improve problem identification and resolution across document and code environments.

Unsupervised Skill Discovery for Agentic Data Analysis huggingface.co

DataCOPE is an unsupervised framework that discovers reusable data-analysis skills through verifier-guided exploration, improving analytical performance in both report-style and reasoning-style tasks.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction huggingface.co

Future-L1, an interleaved latent visual reasoning framework, improves video event prediction by maintaining visual semantics in latent space during autoregressive decoding, achieving state-of-the-art results on FutureBench and TwiFF-Bench benchmarks.

Benchmark Everything Everywhere All at Once huggingface.co

An automated benchmark-creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains.

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery huggingface.co

MLEvolve is an LLM-based multi-agent framework that enables long-horizon machine learning algorithm discovery through improved search mechanisms, memory systems, and adaptive coding strategies.

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management huggingface.co

EvoDS introduces a self-evolving autonomous data science agent that enhances its capabilities through skill acquisition and adaptive context management via reinforcement learning.

Video2LoRA: Parametric Video Internalization for Vision-Language Models huggingface.co

Video2LoRA enables efficient video processing in vision-language models by predicting Low-Rank Adaptation weights from video representations, reducing computational costs while maintaining video-faithful outputs.

Trust Region Q Adjoint Matching huggingface.co

Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies.

Flash-WAM: Modality-Aware Distillation for World Action Models huggingface.co

Flash-WAM introduces a modality-aware step-distillation framework for world-action models that achieves real-time inference by adapting consistency functions to different noise regimes in video and action streams.

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning huggingface.co

Discrete-WAM introduces a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework.

SePO: Self-Evolving Prompt Agent for System Prompt Optimization huggingface.co

Self-Evolving Prompt Optimization (SePO) enhances agent performance by jointly optimizing the system prompts of both the task agent and the prompt agent through evolutionary search, demonstrating superior accuracy across diverse benchmarks.

Complexity-Balanced Diffusion Splitting huggingface.co

Complexity-Balanced Splitting (CBS) allocates generative capacity across specialized sub-networks by partitioning the diffusion timeline based on local complexity measures, improving synthesis quality without increasing inference cost.

Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing huggingface.co

The RE-Edit benchmark evaluates image editing systems on five reasoning dimensions to assess logical consistency beyond visual plausibility.

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints huggingface.co

AdaPlanBench presents a dynamic interactive benchmark for evaluating LLM agents’ ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions.

Towards One-to-Many Temporal Grounding huggingface.co

One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions huggingface.co

LLM-based stance simulation exhibits context sensitivity when subjected to counterfactual revisions, with both text-only and multimodal approaches showing robust stance transitions across different polarization mechanisms.

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing huggingface.co

LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques.

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents huggingface.co

AURA enhances query answering by incorporating an intent inference step that estimates implicit needs and optimizes tool usage through gap scoring, achieving better implicit-need coverage and reduced probe consumption compared to standard approaches.

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? huggingface.co

Role-playing language agents require dynamic character development that evolves through narratives, so benchmarks need to evaluate psychological trajectory alignment rather than static factual recall. ArcANE demonstrates superior performance when character arc information is conditioned into models.

Personal AI Agent for Camera Roll VQA huggingface.co

A conversational AI agent is developed for personal camera roll visual question answering, featuring hierarchical memory and specialized tools for navigating large visual datasets with personalized content.

MAOAM: Unified Object and Material Selection with Vision-Language Models huggingface.co

A unified vision-language model framework enables precise object and material selection through text or click interactions, supporting diverse editing workflows with improved robustness.

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding huggingface.co

AffordanceVLA introduces a unified framework that uses structured affordance forecasting as an intermediate representation to improve the precision of perception-action mapping in robotic manipulation by leveraging vision-language models.

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs huggingface.co

Code-switching automatic speech recognition models show limited generalization across unseen language pairs despite attempts at model merging and domain generalization techniques.

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding huggingface.co

VideoKR presents a large-scale video reasoning dataset and benchmark designed to enhance knowledge-intensive video understanding through expert-domain content and human-in-the-loop example generation.

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding huggingface.co

BRepCLIP enables multimodal representation learning for CAD models by aligning boundary representation geometry with language and image embeddings through contrastive pretraining, achieving superior retrieval and classification performance compared to point-based methods.

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents huggingface.co

Financial AI agents struggle with user complexity, but a new architecture called InKH addresses this by embedding complexity into the system through structured knowledge management and temporal memory mechanisms.

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction huggingface.co

A compression framework for cloud robotics combines learned latent representations with standard JPEG compatibility to achieve faster encoding and decoding while maintaining high perceptual quality.

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding huggingface.co

Mechanical engineering drawing understanding is improved through a specialized dataset and domain-specific model that outperforms existing baselines by leveraging multi-stage training and high-density visual question answering annotations.

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset huggingface.co

The KITScenes Multimodal dataset provides high-fidelity European driving data with comprehensive 3D maps and diverse urban environments for embodied AI research.

Quality-Guided Semi-Supervised Learning for Medical Image Segmentation huggingface.co

A quality-guided semi-supervised learning framework for medical image segmentation uses a dedicated quality predictor to improve pseudolabel reliability and segmentation performance.

Multimodal Music Recommendation System using LLMs huggingface.co

A multimodal framework for session-based music recommendation integrates audio, lyric, and semantic signals with LLM-based sequential reasoning to improve recommendation accuracy.

References

Aerni et al., ‘Measuring Non-Adversarial Reproduction of Training Data in LLMs’ (arXiv 2411.10242) arxiv.org

In tests using innocuous prompts—such as asking a model to write a tutorial or a letter—popular conversational models produced outputs where up to 15% of the text consisted of verbatim snippets from the internet; in some instances, entire responses were found to be 100% matches to training data.

Medium / Patent AI Lab, ‘NYT vs OpenAI Lawsuit Update 2026: Did Regurgitation Kill the Fair Use Defense?’ medium.com

OpenAI defends these instances as ‘rare bugs’ rather than intended features, arguing that the Times used ‘deceptive prompts’ or ‘prompt hacking’ to force the model into an unnatural state of recall… the ability to prove that a model tends to memorize—rather than just being capable of it under duress—may determine the future of AI copyright liability.

Carlini et al., ICLR 2025 proceedings on extractable memorization and alignment proceedings.iclr.cc

Attacks such as the ‘poem-poem-poem’ divergence show that production models can be forced to bypass their alignment filters and spit out megabytes of training data through simple repetitive prompts… alignment acts as a fragile ‘harness artifact’ rather than a fundamental change to the model’s underlying knowledge.

ResearchGate figure: ‘Span length distributions for Common Pile Comma model’ researchgate.net

Approximately 23% of prefix-attack spans fall within the 21-50 token range, compared to only 12-16% for generic or specific prompts; under generic prompts the Comma model produces average verbatim spans of roughly 27.95 tokens, jumping to 50.35 tokens under direct prefix attacks.

Javier Rando blog, infini-gram and memorization critique javirando.com

Suffix-array engines excel at counting but are significantly slower than in-memory baselines when reconstructing long text passages… it cannot distinguish between homonyms because its understanding is confined to local string frequencies—an exact-match search engine, not a semantic one.

PropMe GitHub repo (N-essuno/PropMe) github.com

Reproducing the framework requires Python 3.11.9+, building an infini-gram index for the target corpus, and precomputing unigram probabilities; the PropMe metric returns a null value (0.0) when both prefix-based capability and non-adversarial propensity values are zero, and any tokenizer mismatch between the LLM and the index invalidates retrieval.

dou.ac paper summary (intervention table) paper.dou.ac

performance metrics peak at an optimal intervention threshold of δ = 0.12… Token Accuracy 0.8727 → 0.8973 (+0.0246); Question Accuracy 0.2320 → 0.3300 (+0.0980)

arXiv HTML of Shape of Addition (probe ablation) arxiv.org

a Model Output probe achieved 98.81% while a Ground Truth probe lagged at 94.85%, confirming activation vectors physically drift to incorrect basins during errors

Nikankin et al., ‘Arithmetic Without Algorithms’ (ICLR 2025, OpenReview) openreview.net

LLMs actually rely on a ‘bag of heuristics’ — a distributed set of specialized MLP neurons that recognize specific number ranges or modulo patterns — rather than a single clean algorithm

Nanda et al., ‘Progress Measures for Grokking via Mechanistic Interpretability’ (arXiv:2301.05217) arxiv.org

one-layer transformers trained on modular arithmetic develop a ‘trig-based’ circuit… addition performed via rotation in a Fourier basis (the ‘Clock’ algorithm)
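The rotation idea behind the 'Clock' algorithm can be sketched in a few lines. This is an illustration of the underlying math, not the circuit the trained network implements: each operand is embedded as a point on the unit circle, multiplying the two points adds their angles, and the answer is read out as the residue whose embedding best matches. The function name `clock_add`, the single frequency `k`, and the modulus used in the example are assumptions chosen for illustration.

```python
import cmath
import math

def clock_add(a: int, b: int, p: int, k: int = 1) -> int:
    """Compute (a + b) mod p by composing rotations at frequency k.

    Assumes k is coprime with p so that the readout is unambiguous.
    """
    # Embed each operand as a point on the unit circle.
    za = cmath.exp(2j * math.pi * k * a / p)
    zb = cmath.exp(2j * math.pi * k * b / p)
    # Multiplying unit complex numbers adds their angles.
    zsum = za * zb
    # Read out: score every candidate residue c by cos(2*pi*k*(a + b - c) / p)
    # and return the best match, which is c = (a + b) mod p.
    def score(c: int) -> float:
        return (zsum * cmath.exp(-2j * math.pi * k * c / p)).real
    return max(range(p), key=score)

result = clock_add(57, 99, 113)  # equals (57 + 99) % 113
```

The wrap-around at the modulus comes for free because angles are periodic, which is why no explicit carry or modulo step appears in the sketch.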

Anthropic transformer-circuits.pub, attribution graphs case study transformer-circuits.pub

internal ‘validation circuits’—often centered on ‘consistency heads’ that match digits—frequently fire in the middle layers before the computation is complete, so the model may ‘know’ a result is inconsistent but fail to correct it

ChatPaper summary (models & benchmarks) chatpaper.com

experiments primarily utilized Qwen3-4B, Llama-3.1-8B, and Mistral-7B… accuracy measured using GSM8K and MATH datasets, focusing on multi-digit and multi-operand addition where ‘off-by-one’ errors are most frequent

Moonlight review of Dream.exe themoonlight.io

visual quality is a poor predictor of physical executability… replacing estimated depth maps with ground-truth depth from the simulator leads to a marked improvement in execution success across all tested models

Futurism — Yann LeCun on Sora as world simulator futurism.com

generating pixels is a fundamentally inefficient way to learn world dynamics… ‘doomed to failure’ for understanding the world

EmergentMind — WorldModelBench summary emergentmind.com

even top-tier closed models suffer from mass conservation violations (~12%) and gravity errors (~7%)

RoboCasa 2026 leaderboard (GR00T N1.5 results) robocasa.ai

43.0% success rate on seen atomic tasks… drops to 9.6% for seen composite tasks and 4.4% for zero-shot unseen composite tasks

Dtsbourg — Chris Paxton-style critique of video world models dtsbourg.me

the seconds required for generation make it unusable for real-time reactive control where robots must respond… in milliseconds… most models fail when objects, camera positions, or environments shift even slightly

VideoPhy-2 benchmark paper (arXiv 2602.13294) arxiv.org

frontier models achieve only roughly 22% ‘joint performance’ (satisfying both semantic adherence and physical logic) on difficult test cases

Jack Sun, writing.