Sources

Fast Byte Latent Transformer huggingface.co

Byte-level language models overcome slow autoregressive generation through diffusion-based parallel processing and speculative decoding techniques that improve both speed and quality.

Rethinking RL for LLM Reasoning: It’s Sparse Policy Selection, Not Capability Learning huggingface.co

Reinforcement learning in language models primarily corrects uncertainty at specific decision points rather than acquiring new capabilities, enabling a more efficient RL-free approach called ReasonMaxxer that achieves comparable performance with significantly reduced training costs.

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers huggingface.co

Deep diffusion transformers face structural instability at extreme depths due to mean-dominated collapse triggered by mean mode screaming, which is mitigated through mean-variance split residuals that maintain stable training while preserving performance.

Rethinking State Tracking in Recurrent Models Through Error Control Dynamics huggingface.co

Affine recurrent networks, including State-Space Models and Linear Attention, cannot correct hidden-state drift once representations are preserved, which limits them to finite-horizon solutions. The paper formalizes a distinguishability ratio and readability threshold that explain why accumulated error, not expressive capacity, governs tracking failure.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI huggingface.co

MLS-Bench evaluates whether agents can produce generalizable, scalable ML methods and finds today’s systems lean on engineering-style tuning over genuine method discovery. Bottlenecks trace to missing scientific insight rather than compute, even when test-time scaling and adaptive context are provided.

HumanNet: Scaling Human-centric Video Learning to One Million Hours huggingface.co

HumanNet releases a million-hour, richly annotated human-centric video corpus and shows egocentric footage can substitute for robot demonstrations when training vision-language-action models. The team validates the transfer on the Magic Cobot platform, covering activity understanding, motion generation, and human-to-robot policy learning.

What if AI systems weren’t chatbots? huggingface.co

The chatbot paradigm is framed not as a neutral UI choice but as a dominant sociotechnical configuration with structural downsides across legal, economic, and environmental systems. The authors push researchers to consider non-conversational interfaces that better match task structure and accountability.

Who Prices Cognitive Labor in the Age of Agents? Compute-Anchored Wages huggingface.co

Treating AI agents as a production technology that converts compute capital into cognitive labor, the paper derives a Compute-Anchored Wage via CES aggregation. The model predicts wage-setting moves from labor markets to compute capital markets, with sharp factor-share consequences.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex huggingface.co

Group-based policy gradients in RL with verifiable rewards share a geometric structure on the LLM response simplex, which the authors exploit in Listwise Policy Optimization. LPO performs explicit target projection through divergence minimization, yielding monotonic improvement and steadier training than first-order approximations.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling huggingface.co

AutoTTS recasts test-time scaling as controller synthesis over reasoning trajectories and probe signals, letting LLMs discover their own inference strategies. A beta parameterization and execution-trace feedback deliver better accuracy-cost tradeoffs with little overhead beyond the base model.

UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification huggingface.co

UniPrefill is a prefill acceleration framework that works across various model architectures and integrates seamlessly with vLLM to improve long-context inference efficiency.

Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training huggingface.co

Q-RAG enables efficient multi-step retrieval for large language models through reinforcement learning fine-tuning of embedder models, achieving state-of-the-art performance on long-context benchmarks.

Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts huggingface.co

CaRE is a continual learning framework that uses a bi-level routing mixture-of-experts mechanism for class-incremental learning, demonstrating superior performance on very long sequences of more than 300 tasks.

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation huggingface.co

Autoregressive normalizing flows based on Transformer architecture enable unified multimodal generation by aligning text and image processing through shared causal masking and KV-cache mechanisms.

Learning Visual Feature-Based World Models via Residual Latent Action huggingface.co

Visual world models predicting future visual features through residual latent action representations achieve superior performance and efficiency compared to existing methods while enabling novel robot learning approaches.

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility huggingface.co

SPEED is a phase-asymmetric KV-visibility policy that reduces long-context inference costs in decoder-only language models by processing prompt tokens in lower layers during prefill while maintaining full-depth attention during decoding.

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents huggingface.co

A comprehensive platform and autonomous agent framework for evaluating and enhancing AI agent security through controlled red-teaming across multiple real-world domains and simulation environments.

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning huggingface.co

MatryoshkaLoRA introduces a hierarchical low-rank adaptation framework that dynamically adjusts rank selection through a diagonal matrix insertion, improving accuracy-performance trade-offs over existing methods.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference huggingface.co

MISA replaces the dense token-wise indexing in sparse attention with a routed mixture-of-experts approach that reduces computational cost while maintaining performance and handling long contexts effectively.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention huggingface.co

Linear attention models face challenges with information decay and convergence, which are addressed through a momentum-based approach that improves training efficiency and performance over existing models like Mamba2 and GDN.

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents huggingface.co

HyperEyes is a parallel multimodal search agent that enables concurrent entity searches while optimizing inference efficiency through dual-grained reinforcement learning and a specialized benchmark for evaluating both accuracy and efficiency.

Rubric-based On-policy Distillation huggingface.co

Rubric-based on-policy distillation achieves improved sample efficiency over traditional logit-based methods by using structured semantic rubrics instead of teacher logits.

TextLDM: Language Modeling with Continuous Latent Diffusion huggingface.co

TextLDM adapts visual latent diffusion transformers to language modeling by mapping discrete tokens to continuous latents and using representation alignment for improved text generation quality.

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning huggingface.co

A novel supervision-free credit assignment method for reinforcement learning in language model agents that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs and task performance.

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models huggingface.co

Self-distillation framework UniSD systematically addresses challenges in autoregressive language model adaptation through integrated mechanisms for supervision reliability, representation alignment, and training stability.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion huggingface.co

The paper investigates which latent manifold properties suit diffusion models and proposes a Prior-Aligned AutoEncoder that explicitly optimizes latent space structure for improved generative modeling.

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms huggingface.co

Large language model agents rely on memory mechanisms that evolve through three stages—storage, reflection, and experience—driven by consistency, dynamic environments, and continual learning goals.

DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models huggingface.co

DiffRetriever enables efficient multi-token retrieval using diffusion language models by generating representations in parallel rather than sequentially, achieving superior performance over autoregressive methods.

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs huggingface.co

SAEgis detects adversarial attacks on vision-language models using sparse autoencoders trained for reconstruction, achieving strong performance across domains without additional training.

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting huggingface.co

Speculative decoding with SpecBlock combines block-iterative drafting and path dependence to improve LLM inference speed while maintaining accuracy through adaptive mechanisms.

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision huggingface.co

Understanding-oriented post-training framework enhances generative models by using comprehension tasks as supervisory signals for improved image generation and editing.

Empirical Evidence for Simply Connected Decision Regions in Image Classifiers huggingface.co

Decision regions in deep neural networks exhibit simple connectivity, demonstrated through quad-mesh filling procedures and Coons patch analysis.

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors huggingface.co

PrefixGuard enables effective online monitoring of LLM agents through trace analysis and prefix-based risk scoring, demonstrating strong performance across multiple benchmark tasks while providing diagnostic insights for alert reliability.

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding huggingface.co

4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generation and novel fine-tuning methods that outperform existing approaches.

Normalizing Trajectory Models huggingface.co

Normalizing Trajectory Models introduce a novel approach to diffusion-based generation by modeling each reverse step as an expressive conditional normalizing flow with exact likelihood training, enabling high-quality sample generation in few steps while retaining the likelihood framework.

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents huggingface.co

SkCC is a compilation framework that uses a strongly-typed intermediate representation to enable portable deployment of agent skills across different platforms while ensuring security and improving performance.

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search huggingface.co

InterLV-Search benchmark evaluates interleaved language-vision agentic search by repeatedly using textual and visual evidence to condition later search, revealing current systems’ limitations in visual evidence seeking and multimodal evidence integration.

Flow-OPD: On-Policy Distillation for Flow Matching Models huggingface.co

Flow-OPD addresses limitations in Flow Matching text-to-image models through a two-stage alignment approach combining on-policy distillation and manifold anchor regularization, achieving significant improvements in generation quality and alignment metrics.

Anisotropic Modality Align huggingface.co

Research addresses the modality gap in multimodal models by proposing an anisotropic geometric correction framework that enables effective unpaired modality alignment through structured representation transformation.

A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency huggingface.co

A^2RD, an Agentic Autoregressive Diffusion architecture, addresses long video synthesis challenges through a closed-loop process with memory tracking, adaptive generation, and hierarchical self-improvement mechanisms.

ModelLens: Finding the Best for Your Task from Myriads of Models huggingface.co

ModelLens is a unified framework that recommends models in real-world scenarios by learning from public leaderboard data to rank unseen models on unseen datasets without requiring costly evaluations.

IntentGrasp: A Comprehensive Benchmark for Intent Understanding huggingface.co

IntentGrasp is a benchmark for evaluating large language models’ intent understanding capability, demonstrating poor performance across 20 models and showing significant improvements with intentional fine-tuning.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents huggingface.co

PACEvolve++ enables adaptive policy selection in evolutionary search through a reinforcement learning framework that decouples hypothesis generation from execution while adapting optimization strategies across evolutionary phases.

Beyond Retrieval: A Multitask Benchmark and Model for Code Search huggingface.co

CoREB is a new code search benchmark that addresses limitations of existing datasets by providing contamination-limited, multitask evaluation across text-to-code, code-to-text, and code-to-code retrieval tasks with fine-tuned reranking capabilities.

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation huggingface.co

SCOPE is a specification-guided framework that maintains semantic commitments throughout text-to-image generation to improve complex visual intent fulfillment.

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation huggingface.co

LiVeAction is a lightweight neural codec architecture that improves rate-distortion performance for resource-constrained devices by using an FFT-like structure and variance-based rate penalty instead of adversarial losses.

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning huggingface.co

A balanced reinforcement learning framework for image captioning that jointly optimizes correctness, coverage, and linguistic quality while improving performance over existing methods.

R^3-SQL: Ranking Reward and Resampling for Text-to-SQL huggingface.co

R^3-SQL addresses inconsistencies in scoring functionally equivalent SQL queries and improves candidate recall through unified reward ranking and agentic resampling techniques.

CGM-JEPA: Learning Consistent Continuous Glucose Monitor Representations via Predictive Self-Supervised Pretraining huggingface.co

A self-supervised pretraining framework for continuous glucose monitoring data achieves superior cross-modal and cross-cohort performance by predicting masked latent representations and incorporating cross-view distributional objectives.

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation huggingface.co

MACE-Dance is a music-driven dance video generation framework that combines cascaded Mixture-of-Experts with diffusion models and specialized training strategies to achieve high-quality visual appearance and realistic human motion.

References

MarkTechPost marktechpost.com

BLT-D-16 (block size 16) is the fastest variant, reaching roughly 87 to 92 percent reduction in estimated memory bandwidth versus BLT, but with a meaningful score drop on tasks such as ARC Easy and HumanEval, indicating that aggressive block parallelism trades off accuracy for speed.

ArxivIQ Substack review arxiviq.substack.com

BLT-S is highlighted as a ‘zero-compromise’ upgrade, achieving a 77% bandwidth reduction with 100% preservation of task accuracy… the paper’s results rely on estimated bandwidth rather than raw wall-clock benchmarks on physical hardware like H100s.

Lossfunk Letters (on BD3-LM, Arriola/Kuleshov) letters.lossfunk.com

When the block size is set to one, the model functionally collapses into a standard autoregressive model; conversely, setting the block size to the full sequence length recovers a pure diffusion model.
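The two limiting cases in that excerpt can be illustrated with an attention mask. This is a minimal sketch under a simplifying assumption: positions attend bidirectionally inside their own block and causally to all earlier blocks. The function name block_mask is hypothetical and is not taken from the BD3-LM code, and the sketch covers only the visibility pattern, not the diffusion training objective.

```python
def block_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Assumed rule (illustrative only): j is visible to i when j's block
    index is less than or equal to i's block index.
    """
    if seq_len < 0 or block_size < 1:
        raise ValueError("seq_len must be >= 0 and block_size must be >= 1")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# block_size = 1: every block holds one token, so the mask is lower-triangular,
# which is the standard causal (autoregressive) pattern.
causal = block_mask(4, 1)

# block_size = seq_len: a single block, so every position sees every other,
# which is the fully bidirectional pattern of a pure diffusion model.
full = block_mask(4, 4)

# An intermediate setting mixes both: bidirectional inside a block,
# causal across blocks.
mixed = block_mask(4, 2)
```

Under this assumed rule, intermediate block sizes interpolate between the two regimes, which is the trade-off the block-size variants above are tuning.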

r/LocalLLaMA discussion on speculative decoding reddit.com

On powerful hardware like the NVIDIA H100, the overhead of a large draft model can negate speed gains if the target model is already running in its compute-efficient regime… high-concurrency workloads introduce the ‘ragged tensor’ problem, where users in a batch accept different numbers of tokens, leading to GPU misalignment and latency spikes.

arxiv 2509.11252 (BLT retrofit study) arxiv.org

‘retrofitting’ existing models like Llama 3 into a BLT framework results in a significant performance drop, implying that the benefits of the architecture can only be realized through expensive, end-to-end training from scratch.

AllenAI blog (BoLMo) allenai.org

EvaByte is noted for its extreme data efficiency, rivaling top-tier token-based models like Llama 3 while using 5x less training data… EvaByte researchers reported ‘byte-level collapses’ during early pre-training, where models generated ‘bizarre typos’ that only resolved after several thousand steps.

ArxivIQ Substack — NeurIPS 2025 recap of Yue et al. arxiviq.substack.com

RLVR-tuned models dominate at low values of k, but base models frequently match or exceed their RL-trained counterparts when k is large (e.g., k=256 or higher)… RLVR actually narrows the model’s reasoning scope by focusing on a few high-reward trajectories.
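The crossover described in that excerpt depends on how pass@k is computed. Below is a sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are hypothetical numbers invented for illustration, not results from Yue et al. or any other source listed here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is correct,
    given n total samples of which c were correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL counts of correct samples out of n = 256 on two problems.
n = 256
rl_tuned = [128, 0]   # very reliable on problem A, never solves problem B
base = [16, 2]        # unreliable on both, but occasionally solves each

rl_at_1 = mean_pass_at_k(rl_tuned, n, 1)      # 0.25
base_at_1 = mean_pass_at_k(base, n, 1)        # 0.03515625
rl_at_256 = mean_pass_at_k(rl_tuned, n, 256)  # 0.5
base_at_256 = mean_pass_at_k(base, n, 256)    # 1.0
```

With these made-up counts the tuned model leads at k=1 while the base model leads at k=256, which is the shape of the effect the excerpt reports; the size of any real gap depends on the benchmark and models.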

Shenzhi Wang et al. — ‘High-Entropy Minority Tokens Drive RLVR’ project page shenzhi-wang.github.io

Restricting policy-gradient updates exclusively to the top 20% of high-entropy tokens yielded +11.04 on AIME’25 and +7.71 on AIME’24 for Qwen3-32B; training on the low-entropy 80% led to severe performance degradation.
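The selection step behind that result can be sketched as follows: compute the entropy of each token's predictive distribution, then mark only the highest-entropy fraction for the policy-gradient loss. This is an illustrative sketch, not the authors' implementation; the helper names and the toy distributions are hypothetical, and real training code would compute entropies from model logits over a batch.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(token_probs: list[list[float]],
                      keep_fraction: float = 0.2) -> list[bool]:
    """Mark the top keep_fraction of positions by entropy (at least one)."""
    entropies = [token_entropy(p) for p in token_probs]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    mask = [False] * len(entropies)
    for i in ranked[:n_keep]:
        mask[i] = True
    return mask


# Toy two-way distributions for five generated tokens (hypothetical values).
toy = [
    [1.0, 0.0],    # fully confident
    [0.5, 0.5],    # maximally uncertain: a "forking" decision point
    [0.9, 0.1],
    [0.99, 0.01],
    [0.8, 0.2],
]
mask = high_entropy_mask(toy, keep_fraction=0.2)
# Only the 50/50 token is selected; the loss would be zeroed elsewhere.
```

In a training loop, the per-token loss would be multiplied by this mask so that gradients flow only through the selected positions.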

ProRL (Liu et al., NVIDIA), arXiv 2505.24864 arxiv.org

Base models often fail completely (0% success) on complex logical puzzles even with thousands of samples, while ProRL-trained models have achieved up to 100% success on these same tasks… uncovering novel reasoning strategies inaccessible to the base model.

oxRL controlled study (Findings of EMNLP 2024) aclanthology.org

At 1.5B parameters, online RL (SGRPO) outperformed DPO on GSM8K by approximately 9 percentage points; at 7B parameters, the RL-free SimPO variant actually surpassed RL-based methods on the same benchmark.

LonePatient arXiv digest (May 2026) lonepatient.top

Reports indicate the cost of training a reasoning-capable model using this method dropped to as little as $4, a three-order-of-magnitude reduction… the farukakgul/ReasonMaxxer GitHub repository is in its nascent stages, currently showing a minimal count of 1 star.

Hugging Face blog on GRPO (NormalUhr) huggingface.co

RL fine-tuning is inherently sparse, naturally updating only 5%–30% of a model’s weights even when the full model is unfrozen… LoRA acts as an implicit KL regularizer, preventing the model from drifting too far from its original distribution.

Wang et al., ‘DeepNet: Scaling Transformers to 1000 Layers’ (ResearchGate) researchgate.net

DeepNorm… bounds the expected magnitude of model updates… successfully scaled Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward sub-layers) without difficulty.

hyper.ai search profile — Pengqi Lu beta.hyper.ai

Prior to the 1000-layer milestone, Lu was recognized for developing Qwen2VL-Flux, a framework that integrated the Qwen2-VL vision-language model with the FLUX architecture; the mv-split repository and 1000-layer DiT weights were released under his personal ‘StableKirito’ and ‘erwold’ accounts.

ngjoo.com paper notes on 2605.06169 ngjoo.com

the 1000-layer run was a ‘separate scale-validation point’ and not part of the matched 400-layer frontier comparison, meaning its efficiency relative to smaller models remains an open research question.

AI Native Foundation Daily Paper Digest (2026-05-11) ainativefoundation.org

certain validation methods ‘check code against itself rather than intent,’ suggesting that while the architecture is stable, the underlying ‘regime theory’ for predicting the exact onset of MMS remains partially empirical.

ChatGPT-ArXiv-Paper-Assistant digest (daizedong.github.io) daizedong.github.io

the analysis of Softmax Jacobian null spaces and mean-preservation is specific to Transformer-style attention… transferability of these findings to attention-free mixers (like Mamba or CNN-based diffusers) remains to be tested.

arXiv HTML — Mean Mode Screaming (paper appendix) arxiv.org

Appendix H reveals that the token mean implicitly carries the diffusion timestep signal… ‘hard centering’ (forcing mean to zero) is detrimental because it destroys this useful global information.

Jack Sun, writing.