Sources

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion huggingface.co

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity through shared KV caches and consensus mechanisms.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents huggingface.co

An AI system for computational fluid dynamics autonomously discovers physics corrections through vision-language verification and domain-specific code modification, outperforming general AI scientists in validity checking and scientific claim generation.

MinT: Managed Infrastructure for Training and Serving Millions of LLMs huggingface.co

MinT is a managed infrastructure system that enables efficient low-rank adaptation training and serving by keeping base models resident and moving lightweight adapter revisions, scaling across multiple dimensions including large model architectures, reduced storage requirements, and distributed policy management.

Towards Self-Evolving Agentic Literature Retrieval huggingface.co

PaSaMaster iterates over intent analysis and evidence-grounded ranking to cut hallucinations and compute cost in academic retrieval. The system ships with its own PaSaMaster Benchmark and open code, aiming to make agentic search both more accurate and cheaper per query than fixed retrieval pipelines.

Position: LLM Inference Should Be Evaluated as Energy-to-Token Production huggingface.co

Accuracy and latency miss the real constraints of serving LLMs, the authors argue, proposing joules/token, FLOPs/token, and utilization-adjusted output as headline metrics. The framework folds in PUE, cooling, and routing alongside techniques like KV-cache compression and quantization to score real deployments.
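
As a rough illustration of the joules/token metric, the sketch below scales measured IT energy by PUE and divides by the number of generated tokens. The function name, its inputs, and the way PUE is applied are assumptions made for illustration; the paper's own definitions may differ.

```python
def joules_per_token(avg_power_watts: float, seconds: float,
                     tokens: int, pue: float = 1.0) -> float:
    """Facility-level energy per generated token (illustrative assumption)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    if pue < 1.0:
        raise ValueError("PUE is at least 1.0 by definition")
    it_energy_joules = avg_power_watts * seconds      # W * s = J
    facility_energy_joules = it_energy_joules * pue   # scale up for cooling and overhead
    return facility_energy_joules / tokens


# Hypothetical numbers: a 700 W accelerator busy for 10 s, emitting 5,000 tokens
# in a facility with PUE 1.2.
example = joules_per_token(700.0, 10.0, 5000, pue=1.2)
```

Under this reading, anything that raises tokens per second at fixed power (for example KV-cache compression or quantization) lowers the metric directly.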

Retrieval from Within: An Intrinsic Capability of Attention-Based Models huggingface.co

Rather than bolting on an external retriever, INTRA uses decoder attention queries over pre-encoded evidence chunks to fetch context from within the model itself. The approach removes retriever-generator mismatch and improves both evidence recall and end-to-end answer quality.

From Generalist to Specialist Representation huggingface.co

The paper establishes identifiability guarantees for pulling task-relevant representations out of generalist backbones without parametric assumptions or interventions. Using temporal dependence and sparsity regularization, it gives theoretical footing to disentangling latent factors that downstream specialists need.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation huggingface.co

PyRAG synthesizes Python that drives retrieval and reasoning step by step, using compiler errors as deterministic self-repair signals. The execution-grounded approach beats standard RAG on PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle by exposing intermediate states the model can inspect.
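
The self-repair idea can be sketched as a loop that compiles candidate code and feeds any error message back to a repair step. This is a minimal sketch, not PyRAG's implementation: `compile_with_repair`, `toy_repair`, and the `repair` callback are hypothetical names, and only Python's built-in `compile` is real API.

```python
from typing import Callable, Optional


def compile_with_repair(source: str,
                        repair: Callable[[str, str], str],
                        max_attempts: int = 3) -> Optional[str]:
    """Return source that compiles, or None if the attempts run out."""
    for _ in range(max_attempts):
        try:
            compile(source, "<generated>", "exec")  # syntax check only, nothing is executed
            return source
        except SyntaxError as err:
            # The error text is deterministic for a given source, so it can act
            # as a stable signal for the (hypothetical) repair step.
            source = repair(source, f"{err.msg} (line {err.lineno})")
    return None


def toy_repair(source: str, error: str) -> str:
    """Hypothetical repair step; a real system would query a model here."""
    return source.replace("print('hi'", "print('hi')")


fixed = compile_with_repair("print('hi'", toy_repair)
```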

The DAWN of World-Action Interactive Models huggingface.co

DAWN introduces World-Action Interactive Models that pair a World Predictor with a world-conditioned action denoiser in a shared semantic latent space. Recursive refinement between scene evolution and control yields stronger long-horizon trajectories on autonomous driving benchmarks than latent generative baselines.

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization huggingface.co

Exploration-Aware Policy Optimization uses variational inference to score candidate actions by informational gain, triggering exploration selectively rather than uniformly. A fine-grained reward and exploration-aware grouping lift agent performance on both text-based and GUI-based benchmarks over standard RL baselines.

Asymmetric Flow Models huggingface.co

Asymmetric Flow Modeling enables efficient high-dimensional flow-based generation by restricting noise prediction to low-rank subspaces while maintaining full-dimensional data prediction, achieving superior performance in pixel-space text-to-image generation through effective fine-tuning from latent models.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation huggingface.co

AgentLens evaluates software engineering agents with a process-level framework that separates effective from ineffective approaches, flags patterns such as lucky passes, and assigns quality scores for more reliable assessment.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning huggingface.co

Interactive LLM agents suffer from delayed environmental perception and epistemic bottlenecks because they only come to understand the environment reactively, during execution. The proposed Map-then-Act Paradigm (MAP) addresses this by acquiring environmental knowledge beforehand through global exploration, task-specific mapping, and knowledge-augmented execution.

Source or It Didn’t Happen: A Multi-Agent Framework for Citation Hallucination Detection huggingface.co

CiteTracer is a multi-agent system that detects fabricated citations by classifying them into a 12-code taxonomy and using structured extraction, evidence retrieval, and specialized classifiers for real, potential, and hallucinated citations.

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty? huggingface.co

Current multimodal models struggle to match human expert aesthetic judgment in comparative image selection tasks, as demonstrated by the Visual Aesthetic Benchmark which reveals significant performance gaps and shows that fine-tuning on expert examples can improve accuracy.

Qwen-Image-VAE-2.0 Technical Report huggingface.co

Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context huggingface.co

Long-context continued pre-training enhances vision-language models’ ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design.

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image huggingface.co

Multimodal tabular learning benchmarks reveal that task-specific embedding tuning improves performance over frozen pretrained embeddings, particularly when modalities provide complementary predictive signals.

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents huggingface.co

EVA-Bench presents a comprehensive evaluation framework for voice agents that simulates realistic conversations and measures performance across multiple voice-specific failure modes using novel accuracy and experience metrics.

From Pixels to Concepts: Do Segmentation Models Understand What They Segment? huggingface.co

CAFE is a new benchmark for evaluating concept-faithful segmentation in promptable models through attribute-level counterfactual manipulation, revealing that accurate mask prediction does not guarantee semantic grounding.

MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching huggingface.co

MC-RFM is a Riemannian flow-matching framework for few-shot adaptation that models feature displacement on a mixed-curvature manifold combining hyperbolic and Euclidean spaces, outperforming existing methods across multiple benchmarks.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows huggingface.co

FlowCompile is a structured LLM workflow compiler that optimizes complex multi-agent tasks by performing compile-time exploration of workflow configurations to balance accuracy and latency without retraining.

FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation huggingface.co

FAAST enables efficient task adaptation by compiling labeled examples into fast weights through forward-only computation, achieving significant speedup and memory savings over traditional backpropagation methods.

Revisiting DAgger in the Era of LLM-Agents huggingface.co

DAgger-style training for long-horizon language model agents combines supervised fine-tuning and reinforcement learning benefits by using teacher-student policy interpolation with on-policy interactions.

An Empirical Study of Automating Agent Evaluation huggingface.co

Automated agent evaluation using AI assistants requires specialized domain knowledge and procedural skills to achieve reliable results, as demonstrated by the EvalAgent system that improves evaluation accuracy through structured evaluation skills and a meta-evaluation framework.

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking huggingface.co

A unified framework combines candidate generation and ranking in a single autoregressive model using factorized group-relative policy optimization to address credit assignment challenges in end-to-end retrieval optimization.

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn huggingface.co

Many-shot in-context learning for reasoning tasks exhibits different scaling behaviors than non-reasoning tasks, with demonstration ordering and selection significantly impacting performance.

FeatCal: Feature Calibration for Post-Merging Models huggingface.co

Feature drift analysis in model merging leads to FeatCal, a calibration method that reduces performance gaps through layer-wise weight updates without gradient descent, achieving superior benchmark results and efficiency.

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution huggingface.co

HAGE introduces a weighted multi-relational memory framework that enables query-conditioned traversal over unified relational memory graphs, improving long-horizon reasoning accuracy through adaptive memory retrieval and reinforcement learning-based optimization.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation huggingface.co

AnyFlow introduces a novel any-step video diffusion distillation framework that improves upon consistency distillation by optimizing full ODE sampling trajectories through flow-map transition learning and backward simulation techniques.

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading huggingface.co

MemReread addresses long-context reasoning challenges by avoiding intermediate retrieval and employing question decomposition with rereading to recover discarded information, maintaining linear time complexity.

Context Training with Active Information Seeking huggingface.co

Context optimization methods enhanced with active information seeking via search and browser tools achieve superior performance across diverse domains while maintaining data efficiency and robustness.

Learning Agentic Policy from Action Guidance huggingface.co

Agentic reinforcement learning for large language models leverages action data from human interactions as reference guidance to improve exploration and reduce dependence on costly supervised fine-tuning.

PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents huggingface.co

PersonalAI 2.0 enhances LLM-based systems through external knowledge graph integration with dynamic multistage query processing and adaptive information search mechanisms.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data huggingface.co

RoboEvolve combines vision-language and video generation models in a co-evolutionary framework to enable scalable robotic manipulation with improved data efficiency and continuous learning capabilities.

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety huggingface.co

SafeHarbor is a novel framework for LLM agents that establishes precise decision boundaries through context-aware defense rules, featuring a hierarchical memory system and self-evolution mechanism to balance safety and utility.

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training huggingface.co

FrameSkip is a data-layer frame selection method that improves VLA policy training by prioritizing high-importance frames based on action variation and visual-coherence metrics.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking huggingface.co

TrackCraft3R enables efficient dense 3D tracking from monocular video by adapting video diffusion transformers to follow physical points across frames using dual-latent representation and temporal RoPE alignment.

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation huggingface.co

RealICU benchmark evaluates large language models for ICU decision support using hindsight-annotated patient trajectories, revealing limitations in clinical recommendation accuracy and early interpretation bias.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling huggingface.co

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic ev…

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling huggingface.co

AI agents can predict counterpart decisions in negotiation games by combining tabular features with LLM-based text representations and hidden states from a frozen observer model, outperforming direct prompting methods.

PresentAgent-2: Towards Generalist Multimodal Presentation Agents huggingface.co

PresentAgent-2 is an agentic framework that generates presentation videos from user queries by conducting research, creating multimodal slides, and producing interactive content across single, discussion, and interaction modes.

Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge huggingface.co

Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on . We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion system paper, and verified planning-track source tre

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition huggingface.co

The paper identifies a studio bias in multilingual ASR fine-tuning and proposes the R-MFT method to improve performance on spontaneous speech while maintaining efficiency.

KL for a KL: On-Policy Distillation with Control Variate Baseline huggingface.co

On-policy distillation with a control variate baseline stabilizes training using policy-gradient reinforcement learning techniques while maintaining efficiency and performance.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs huggingface.co

On-policy distillation with reward extrapolation exhibits a safety threshold beyond which structured output tasks lose format preservation, with empirical validation showing performance parity at reduced parameter count when operating below this threshold.

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs huggingface.co

Rectified flow models require prior noise information for effective preference optimization, which PNAPO addresses by augmenting preference data with noise samples and employing dynamic regularization for improved training efficiency.

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages huggingface.co

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is introduced, along with a fine-tuned model using parameter-efficient adaptation for personalized symptom elicitation.

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data huggingface.co

BEACON is a large-scale multimodal dataset capturing diverse behavioral signals during competitive gaming to advance continuous authentication and behavioral biometrics research.

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement huggingface.co

A multi-modal deep learning framework enhances low-light images by integrating depth cues, luminance priors, and semantic features through cross-attention fusion and adaptive gating mechanisms.

References

Distributed Randomness blog on TiDAR (NVIDIA) distributedrandomness.com

TiDAR (Think in Diffusion, Talk in Autoregression)… uses a single Transformer with a structured attention mask that partitions the forward pass into causal and bidirectional regions, drafting future tokens via diffusion while verifying the previous draft autoregressively in a single forward pass.
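
One way such a partitioned mask could look is sketched below: prefix positions attend causally, while draft positions see the whole prefix and each other. The function name and the exact layout are assumptions for illustration; the excerpt does not specify TiDAR's actual mask.

```python
from typing import List


def hybrid_mask(prefix_len: int, draft_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    n = prefix_len + draft_len
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < prefix_len:
                # Causal region: a prefix token sees itself and earlier tokens.
                mask[q][k] = k <= q
            else:
                # Bidirectional region: a draft token sees the full prefix
                # and every position in the draft block.
                mask[q][k] = True
    return mask


mask = hybrid_mask(prefix_len=2, draft_len=2)
```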

Google Developers Blog — DFlash on TPU v5p developers.googleblog.com

diffusion-style speculative decoding (DFlash) achieved an average end-to-end speedup of 2.29x, nearly doubling the 1.30x gain provided by EAGLE-3… with peak speedups of nearly 6x on complex coding tasks.

NVLabs Fast-dLLM v2 project page nvlabs.github.io

Fast-dLLM v2 adapts pretrained AR models (like Qwen 2.5) into block-diffusion models with just 1 billion tokens of fine-tuning—a 500x reduction in training data compared to previous diffusion models—achieving a 2.5x speedup over standard AR decoding.

HuggingFace chiennv/Orthrus-Qwen3-8B repo / GitHub issues huggingface.co

early adopters report Issue #3 (fails on macOS CPU/MPS with Qwen3-1.7B) and note that as of late May 2026 the author has not released the training code, preventing independent verification of the 16% fine-tuning process.

Medium — speculative decoding acceptance rate analysis medium.com

When the acceptance rate drops below ~0.5–0.6, the computational overhead of generating and rejecting draft tokens exceeds the time saved… MTP nearly tripled coding speeds but resulted in a net loss of tokens per second for creative tasks.
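
The break-even behaviour can be illustrated with the standard speculative-decoding estimate, which assumes each draft token is accepted independently with probability alpha. The function name and the draft-to-target cost ratio of 0.3 are illustrative assumptions, not figures from the article.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    alpha:      per-token acceptance probability (assumed i.i.d.)
    gamma:      number of draft tokens proposed per verification step
    cost_ratio: cost of one draft step relative to one target-model step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or cost_ratio < 0.0:
        raise ValueError("gamma must be >= 1 and cost_ratio non-negative")
    # Expected tokens produced per verification step (accepted drafts plus one).
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of that step in units of one target-model forward pass.
    cost_per_step = gamma * cost_ratio + 1.0
    return tokens_per_step / cost_per_step


# With gamma=4 and an assumed cost ratio of 0.3, low acceptance is a net loss.
slow = expected_speedup(alpha=0.5, gamma=4, cost_ratio=0.3)
fast = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.3)
```

A value below 1.0 means drafting costs more than it saves. Where the crossover falls depends strongly on the assumed cost ratio and draft length.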

Hacker News discussion on diffusion-head speedups news.ycombinator.com

real-world wall-clock speedups in production stacks like SGLang or vLLM may be closer to 4x due to overheads in the verification pass… it ‘moves the bottleneck’ from memory latency to compute, consuming 20%+ additional compute per generated token.

LMSys blog — S-LoRA (Nov 2023) lmsys.org

S-LoRA can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead… up to 4x higher throughput than vLLM and 30x higher than PEFT.

OpenPipe blog — production S-LoRA deployment openpipe.ai

We store over 10,000 adapters in system RAM while keeping the most active ~125 on GPU, eliminating cold-start latency for infrequently used models.
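
A two-tier adapter store of this kind can be sketched as a small cache in front of a larger table. The class name and the least-recently-used eviction policy are assumptions; the quote only says the most active adapters stay on GPU, and plain Python objects stand in for real weights.

```python
from collections import OrderedDict
from typing import Any, Dict


class AdapterCache:
    """All adapters live in 'RAM'; a bounded set is kept 'on GPU' (simulated)."""

    def __init__(self, gpu_slots: int) -> None:
        if gpu_slots < 1:
            raise ValueError("gpu_slots must be at least 1")
        self.gpu_slots = gpu_slots
        self.ram: Dict[str, Any] = {}                   # every registered adapter
        self.gpu: "OrderedDict[str, Any]" = OrderedDict()  # least recently used first

    def register(self, adapter_id: str, weights: Any) -> None:
        self.ram[adapter_id] = weights

    def get(self, adapter_id: str) -> Any:
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)            # mark as most recently used
            return self.gpu[adapter_id]
        weights = self.ram[adapter_id]                  # KeyError if never registered
        if len(self.gpu) >= self.gpu_slots:
            self.gpu.popitem(last=False)                # evict least recently used
        self.gpu[adapter_id] = weights                  # stands in for a host-to-GPU copy
        return weights
```

In the quoted deployment the corresponding sizes would be roughly 10,000 entries in `ram` and about 125 `gpu_slots`.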

SuperintelligenceNews — Thinking Machines Tinker API superintelligencenews.com

Tinker provides four Python-native primitives — forward_backward, optim_step, sample, weight management — handling the heavy lifting of distributed GPU scheduling while preserving algorithmic control.

vLLM GitHub issue #33791 github.com

Dynamic loading via /v1/load_lora_adapter triggers high CPU usage and significantly slower inference compared to static loading via --lora-modules at startup; rank-64 adapters incur far larger hits than rank-16.

Macaron MindLab — ‘Router Replay R3: Why It Failed and How We Fixed It’ macaron.im

Small numerical drifts between vLLM and Megatron caused the Pearson correlation between training and inference routing probabilities to drop as low as 0.14 on DeepSeek-V3 before R3 alignment.
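
For reference, the Pearson correlation used as the alignment measure here can be computed as below. This is a generic sketch with a hypothetical function name; it does not reproduce how the post gathered or aggregated routing probabilities.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Applied to per-expert routing probabilities from the training and inference stacks, a value near 1.0 indicates the two routers agree, while a value near 0.14 indicates they have largely decoupled.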

Macaron blog — Tinker compatibility notes macaron.im

By using the command import mint as tinker, users can adapt existing Tinker-based codebases to run on MindLab’s infrastructure… MinT keeps base models resident, moving only adapter revisions.

Krank, Kronbichler & Wall (TUM) — DNS of periodic hills up to Re_H=10,595 portal.fis.tum.de

Direct Numerical Simulation of Flow over Periodic Hills up to Re_H=10,595 … identified a significant discrepancy in the streamwise velocity overshoot directly above the hill crest when compared to the experimental data of Rapp and Manhart (2011) and Breuer et al. (2009).

ResearchGate — ‘Generalization Limits of Data-Driven Turbulence Models’ researchgate.net

models trained on the periodic hill … exhibit a ‘generalizability gap’ when applied to disparate geometries, such as square ducts or curved backward-facing steps

Uni-Stuttgart preprint on data-driven RANS closures elib.uni-stuttgart.de

increasing training set diversity is unlikely to provide a remedy due to the inherent lack of a unique, local mapping between mean flow features and Reynolds stresses

TheMoonlight.io review of DeepScientist themoonlight.io

only 1–3% of its generated ideas ultimately result in measurable scientific progress; roughly 60% of its failures stem from implementation errors … consuming over 20,000 GPU hours for a single discovery cycle

arXiv 2605.12674 — Foam-Agent paper arxiv.org

Foam-Agent (v2.0) achieved an 88.2% execution success rate when paired with Claude 3.5 Sonnet, a significant lead over existing baselines such as MetaOpenFOAM (55.5%) and OpenFOAM-GPT (37.3%)

OpenReview submission (NeurIPS ML4PS 2025) openreview.net

reliance on Gaussian patches to ‘fix’ local errors resembles classical ‘point-wise tuning’ rather than the derivation of universal physical constants

Jack Sun, writing.