Orthrus drafts Qwen3 blocks, MinT serves million LoRAs, CFD agent gates with VLM
Three independent systems releases today: a Qwen3 diffusion drafter, a million-LoRA serving stack, and a VLM-gated CFD agent loop.
Orthrus drafts Qwen3 blocks, MinT serves million LoRAs, CFD agent gates with VLM
TL;DR
- Orthrus posts 7.83× tokens-per-pass on frozen Qwen3-8B — roughly 4× wall-clock on the released checkpoint.
- MinT serves 10⁶ LoRAs over 1.04T-param bases, with 18.3× handoff measured vs full checkpoints.
- AI CFD Scientist catches 14/16 planted silent failures via a VLM physics gate on OpenFOAM runs.
- PyRAG beats standard RAG on 5 multi-hop QA sets by compiling reasoning into Python.
- EAPO triggers exploration only when uncertainty is high, lifting agent RL benchmarks.
Three stand-alone systems releases anchor the research feed today, with no shared thread beyond the fact that each ships a strong demo and a quieter qualifier in its own writeup. Orthrus bolts a trainable diffusion head onto a frozen Qwen3-8B to draft token blocks, posting a headline 7.83× tokens-per-forward-pass that collapses to roughly 4× wall-clock on the released checkpoint. Macaron’s MinT lives at the serving layer — a literal import mint as tinker swap that scales LoRA management to 10⁶ adapters over trillion-parameter bases, with a packed MoE-LoRA tensor layout doing the real engineering work. AI CFD Scientist wires a vision-language physics gate into an OpenFOAM agent loop, catching 14 of 16 planted silent failures, though independent turbulence work flags its Spalart–Allmaras tweak as unlikely to transfer past the one Re_h=5,600 hill case.
The brief pile leans agentic and retrieval-shaped: PyRAG compiles multi-hop reasoning to Python, PaSaMaster turns literature search into a self-evolving loop, INTRA retrieves from a model’s own activations, and EAPO scores when an agent should explore at all. A position paper on joules-per-token and a nonparametric identifiability proof round out the theory end.
Orthrus claims 7.8× LLM speedup; wall-clock looks like 4×
Source: hf-daily-papers · published 2026-05-11
TL;DR
- Orthrus adds a trainable diffusion head to a frozen Qwen3 backbone, drafting token blocks over a shared KV cache.
- The headline 7.83× is tokens-per-forward-pass on Pseudo2code, not wall-clock — the 8B averages 5.39 TPF across reasoning benchmarks.
- Independent runs on the released 8B checkpoint land near ~3.7-4× wall-clock, well below the paper’s number.
- Training code is unreleased as of late May 2026, leaving the “16% params, <1B tokens, 24h on 8×H200” recipe unverified.
What Orthrus actually does
Orthrus is a dual-view decoder. The pre-trained AR model (Qwen3 1.7B/4B/8B) stays frozen and owns the KV cache. A lightweight diffusion head — new Q/K/V projections at each layer, 16% extra parameters — is distilled via forward-KL against the AR teacher to predict a block of K tokens (e.g. 32) in a single forward pass. The block then goes back through the frozen AR head, which checks each position; the longest matching prefix is accepted, the first mismatch is replaced by the AR token, and the cycle repeats.
The result is speculative decoding without a separate draft model. There’s no second network to load, no second KV cache to maintain, and because both views share the cache, memory overhead is O(1) in context length. That’s the load-bearing engineering trick: traditional speculative decoding pays for the drafter’s KV state, which is why its wall-clock gains decay as context grows. Orthrus reports stable speedups out to 40K tokens.
The crowded field it lands in
Orthrus is the third or fourth entry in a 2025-2026 race toward drafter-free AR+diffusion hybrids. NVIDIA’s TiDAR uses a single Transformer with a structured causal/bidirectional mask to draft and verify in one forward pass 1. Google’s DFlash on TPU v5p reports 2.29× average and ~6× peak end-to-end speedups over EAGLE-3 2. NVLabs’ Fast-dLLM v2 hits 2.5× wall-clock with a comparable ~1B-token fine-tuning budget on Qwen 2.5 3. The shared-KV, lossless-via-AR-verification design is now the convergent answer; Orthrus’s specific contribution is the frozen-backbone + intra-model consensus framing.
Against Fast-dLLM v2’s 61.5% on MATH-500 and LLaDA-1.5’s 57.4%, Orthrus’s exact-parity 86.2% is a real win — but only because the verification step forces parity with the AR base (96.0% on GSM8K, identical to vanilla Qwen3-8B), not because the diffusion head is smarter. The accuracy story is “we don’t degrade,” which is the table stakes other diffusion LMs failed to clear.
Read the 7.8× carefully
The headline number is Tokens Per Forward pass on Pseudo2code. Wall-clock is a different question. Practitioners on HN estimate real-world gains in SGLang/vLLM closer to ~4×, because the verification pass itself eats 20%+ extra compute per accepted token 4. One third-party deployment of the 8B checkpoint measured 44.8→164.6 t/s, roughly 3.7×.
The deeper caveat is acceptance-rate sensitivity. Multi-Token Prediction analysis shows that once acceptance drops below ~0.5-0.6, the draft-and-reject overhead turns net-negative — MTP tripled coding throughput but lost ground on creative tasks 5. Orthrus’s evaluations are math and code, where token distributions are low-entropy and acceptance is naturally high. Open-ended generation is the regime where the trick gets tested.
What’s missing
The “16% trainable params, <1B tokens, 24h on 8×H200” claim is the most quotable line in the paper, and it’s also the one nobody outside the authors can currently reproduce. Training code is unreleased as of late May 2026, and the published checkpoint fails on MacOS MPS for the 1.7B variant 6. The inference repo is up; the recipe that produced the head is not. Until that lands, treat 7.8× as an upper bound from one lab on one benchmark family, and 3-4× wall-clock as the number you should plan around.
Macaron’s MinT clones Tinker, adds MoE LoRA at trillion scale
Source: hf-daily-papers · published 2026-05-12
TL;DR
- Macaron’s MinT manages 10⁶ LoRA adapters over shared bases up to 1.04T params via a catalog/CPU/GPU tier hierarchy.
- Migration is literally
import mint as tinker, making MinT a Thinking Machines Tinker clone with a managed backend. - Headline 18.3× handoff speedup is measured against full-checkpoint materialization, not against S-LoRA or Punica.
- Real novelty is MoE-specific: a packed tensor layout collapses ~37k small objects to 672, delivering 8.5× faster live loads.
The pitch, and what’s actually new
MinT (MindLab Toolkit) treats LoRA adapters as the unit of currency: base models stay resident, and only compact “adapter revisions” move between training and serving. The paper claims a million-scale addressable catalog over trillion-parameter bases, with a three-tier hierarchy — durable storage for the catalog, hundreds of warm adapters in CPU RAM, up to 64 active adapters per GPU decoding step.
That hierarchy is not new. S-LoRA demonstrated thousands of adapters on a single GPU in late 2023, claiming 4× higher throughput than vLLM and 30× over PEFT via Unified Paging and SGMV kernels 7. OpenPipe has been running essentially the same catalog/CPU/GPU pattern in production with 10,000 adapters in RAM and ~125 hot on GPU 8. The widely-cited 18.3× handoff speedup is measured against the strawman of materializing a full merged checkpoint per variant — not against any of these systems.
What MinT genuinely contributes lives in the MoE-specific machinery, where prior multi-tenant LoRA work mostly hadn’t gone.
The Tinker-clone subtext
The paper buries an important detail: MindLab’s own materials describe migration as import mint as tinker, and Tinker compatibility is a design goal 9. Thinking Machines Lab’s Tinker exposes four Python primitives — forward_backward, optim_step, sample, and weight management — and handles distributed scheduling while leaving algorithmic control to the user 10. MinT replicates that surface and adds a managed adapter lifecycle on Macaron’s infrastructure. Read as a research paper, MinT looks like systems work; read as a product launch, it’s a Tinker competitor with a benchmarks appendix. Both readings are valid, but only one is on the cover.
R3 and the MoE training-inference gap
The most substantive technical contribution is Router Replay (R3). MoE RL pipelines have a quietly catastrophic problem: numerical drift between the inference engine (vLLM) and the training engine (Megatron) causes expert routing decisions to diverge. MindLab’s deep-dive reports that on DeepSeek-V3, the Pearson correlation between rollout and training routing probabilities dropped to 0.14 before R3 alignment 11. If that number generalizes, most prior MoE RL runs were training on near-random gradient signals.
R3 forces consistent expert paths between rollout and training. For dynamic sparse attention models like GLM-5, where exact replay is impossible, MinT falls back to IcePop-style correction — zeroing importance weights on high-discrepancy tokens. The paper is candid that this filters the mismatch rather than fixing it.
The cold-load wall
MinT’s “staircase” caveat — unique cold adapters serialize at ~1.36s each — isn’t a MinT-specific flaw. vLLM’s dynamic /v1/load_lora_adapter endpoint has open issues reporting high CPU usage and degraded inference, with rank-64 adapters penalized far worse than rank-16 12. The packed-MoE representation gets MinT an 8.5–8.7× speedup on the live-load path and median sub-200ms loads for cache misses, which is real progress. But the underlying serialized loader architecture is shared with the rest of the ecosystem and remains unsolved.
What to take away
Discount the “million adapters” headline — that scale has been approached for two years. Take seriously the MoE pieces (packed tensors, R3, IcePop integration), which address a gap nobody else has published a clean fix for. And read the paper knowing the product context: MinT is Macaron’s bid to host the workload Tinker exposes.
AI CFD Scientist’s VLM gate catches 14 of 16 silent failures
Source: hf-daily-papers · published 2026-05-11
TL;DR
- AI CFD Scientist wires a vision-language “physics gate” into an OpenFOAM agent loop, catching 14/16 planted silent failures.
- quadRecTail, a Spalart–Allmaras tweak, cut skin-friction RMSE by 7.89% on one periodic-hill case at Re_h=5,600.
- Independent turbulence-modeling work flags that geometry-local Gaussian patches rarely transfer to ducts or backward-facing steps.
- Against ARIS and DeepScientist, the system’s edge is withholding claims when evidence is thin, not generating better ones.
What the system actually does
Built on Foam-Agent (which itself hits 88.2% execution success on CFD-LLMBench vs. 55.5% for MetaOpenFOAM and 37.3% for OpenFOAM-GPT 13), AI CFD Scientist closes the loop from literature review through C++ source-code modification, mesh-independence checks, and LaTeX manuscript drafting. GPT-5.5 (Codex) is the reasoning backbone. The novel piece sits between “simulation ran” and “claim made”: a vision-language model that renders flow fields to PNG and inspects them for missing features, wrong magnitudes, and broken post-processing.
That gate is the article’s real news. In a planted-failure study, the VLM caught 14 of 16 silent failures that produced clean solver logs — 100% recall on missing deliverables and broken post-processing, 50% on unsettled convergence. In the backward-facing-step task, it flagged a sign error in a post-processor and forced the system to withhold a closure ranking. ARIS and DeepScientist, running the same task without a physics gate, confidently reported rankings built on the flipped data.
The discovered model is a proof-of-concept, not a closure
The headline scientific claim — a quadrupolar source term added to the SA $\tilde{\nu}$ equation that reduces $C_f$ RMSE from 0.004297 to 0.003958 against Krank et al.’s DNS — needs reading in context. Krank’s own DNS paper flagged an unresolved streamwise velocity overshoot above the hill crest that disagrees with the Rapp & Manhart and Breuer experiments 14. So the optimization target itself isn’t a clean ground truth, and the win is tuned to a single Reynolds number on a single geometry.
The broader literature is openly skeptical of this style of fix. Data-driven SA closures trained on periodic hills routinely degrade when applied to square ducts or curved backward-facing steps 15, and a Stuttgart preprint argues that more training diversity won’t save them, because there is no unique local mapping from mean-flow features to Reynolds stresses 16. A NeurIPS ML4PS reviewer put it bluntly:
reliance on Gaussian patches to ‘fix’ local errors resembles classical ‘point-wise tuning’ rather than the derivation of universal physical constants 17
quadRecTail — four wall-normalized Gaussian patches — fits that description precisely.
Why it still matters
The autonomous-scientist genre has a credibility problem. Independent review of DeepScientist puts only 1–3% of generated ideas at “measurable progress,” with ~60% of failures traced to implementation errors and single discovery cycles burning 20,000+ GPU hours 18. Against that backdrop, AI CFD Scientist’s ~$40 / 44-iteration budget looks cheap — but only because the search space is narrow (SA source-term coefficients, one geometry) and the gate forces conservatism.
The transferable contribution isn’t the new closure; it’s the architectural move of putting a vision-based validity check between solver and claim. That addresses the exact failure mode — “the simulation ran, therefore the result is real” — that has discredited prior AI-scientist demos. The discovered model will likely not survive contact with a second geometry. The gate methodology will.
Round-ups
PyRAG recasts multi-hop QA as executable Python programs
Source: hf-daily-papers
PyRAG synthesizes Python that drives retrieval and reasoning step by step, using compiler errors as deterministic self-repair signals. The execution-grounded approach beats standard RAG on PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle by exposing intermediate states the model can inspect.
INTRA shows attention models can retrieve from their own activations
Source: hf-daily-papers
Rather than bolting on an external retriever, INTRA uses decoder attention queries over pre-encoded evidence chunks to fetch context from within the model itself. The approach removes retriever-generator mismatch and improves both evidence recall and end-to-end answer quality.
EAPO teaches agents to explore only when uncertainty is high
Source: hf-daily-papers
Exploration-Aware Policy Optimization uses variational inference to score candidate actions by informational gain, triggering exploration selectively rather than uniformly. A fine-grained reward and exploration-aware grouping lift agent performance on both text-based and GUI-based benchmarks over standard RL baselines.
PaSaMaster turns literature search into a self-evolving agent loop
Source: hf-daily-papers
PaSaMaster iterates over intent analysis and evidence-grounded ranking to cut hallucinations and compute cost in academic retrieval. The system ships with its own PaSaMaster Benchmark and open code, aiming to make agentic search both more accurate and cheaper per query than fixed retrieval pipelines.
Position paper reframes LLM inference as energy-to-token production
Source: hf-daily-papers
Accuracy and latency miss the real constraints of serving LLMs, the authors argue, proposing joules/token, FLOPs/token, and utilization-adjusted output as headline metrics. The framework folds in PUE, cooling, and routing alongside techniques like KV-cache compression and quantization to score real deployments.
DAWN jointly models scenes and actions for long-horizon driving
Source: hf-daily-papers
DAWN introduces World-Action Interactive Models that pair a World Predictor with a world-conditioned action denoiser in a shared semantic latent space. Recursive refinement between scene evolution and control yields stronger long-horizon trajectories on autonomous driving benchmarks than latent generative baselines.
Nonparametric proof: specialist features are identifiable from generalist models
Source: hf-daily-papers
The paper establishes identifiability guarantees for pulling task-relevant representations out of generalist backbones without parametric assumptions or interventions. Using temporal dependence and sparsity regularization, it gives theoretical footing to disentangling latent factors that downstream specialists need.
Footnotes
-
Distributed Randomness blog on TiDAR (NVIDIA) — https://distributedrandomness.com/tidar/
↩TiDAR (Think in Diffusion, Talk in Autoregression)… uses a single Transformer with a structured attention mask that partitions the forward pass into causal and bidirectional regions, drafting future tokens via diffusion while verifying the previous draft autoregressively in a single forward pass.
-
Google Developers Blog — DFlash on TPU v5p — https://developers.googleblog.com/supercharging-llm-inference-on-google-tpus-achieving-3x-speedups-with-diffusion-style-speculative-decoding/
↩diffusion-style speculative decoding (DFlash) achieved an average end-to-end speedup of 2.29x, nearly doubling the 1.30x gain provided by EAGLE-3… with peak speedups of nearly 6x on complex coding tasks.
-
NVLabs Fast-dLLM v2 project page — https://nvlabs.github.io/Fast-dLLM/v2/
↩Fast-dLLM v2 adapts pretrained AR models (like Qwen 2.5) into block-diffusion models with just 1 billion tokens of fine-tuning—a 500x reduction in training data compared to previous diffusion models—achieving a 2.5x speedup over standard AR decoding.
-
Hacker News discussion on diffusion-head speedups — https://news.ycombinator.com/item?id=48154865
↩real-world wall-clock speedups in production stacks like SGLang or vLLM may be closer to 4x due to overheads in the verification pass… it ‘moves the bottleneck’ from memory latency to compute, consuming 20%+ additional compute per generated token.
-
Medium — speculative decoding acceptance rate analysis — https://medium.com/@jiminlee-ai/speculative-decoding-making-llms-faster-without-sacrificing-quality-973384f9c165
↩When the acceptance rate drops below ~0.5–0.6, the computational overhead of generating and rejecting draft tokens exceeds the time saved… MTP nearly tripled coding speeds but resulted in a net loss of tokens per second for creative tasks.
-
HuggingFace chiennv/Orthrus-Qwen3-8B repo / GitHub issues — https://huggingface.co/chiennv/Orthrus-Qwen3-8B
↩early adopters report Issue #3 (fails on MacOS CPU/MPS with Qwen3-1.7B) and note that as of late May 2026 the author has not released the training code, preventing independent verification of the 16% fine-tuning process.
-
LMSys blog — S-LoRA (Nov 2023) — https://www.lmsys.org/blog/2023-11-15-slora/
↩S-LoRA can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead… up to 4x higher throughput than vLLM and 30x higher than PEFT.
-
OpenPipe blog — production S-LoRA deployment — https://openpipe.ai/blog/s-lora
↩We store over 10,000 adapters in system RAM while keeping the most active ~125 on GPU, eliminating cold-start latency for infrequently used models.
-
Macaron blog — Tinker compatibility notes — https://macaron.im/blog/thinking-machines-lab-tinker-updates
↩By using the command
import mint as tinker, users can adapt existing Tinker-based codebases to run on MindLab’s infrastructure… MinT keeps base models resident, moving only adapter revisions. -
SuperintelligenceNews — Thinking Machines Tinker API — https://superintelligencenews.com/research/tinker-api-thinking-machines-fine-tuning-open-models/
↩Tinker provides four Python-native primitives — forward_backward, optim_step, sample, weight management — handling the heavy lifting of distributed GPU scheduling while preserving algorithmic control.
-
Macaron MindLab — ‘Router Replay R3: Why It Failed and How We Fixed It’ — https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it
↩Small numerical drifts between vLLM and Megatron caused the Pearson correlation between training and inference routing probabilities to drop as low as 0.14 on DeepSeek-V3 before R3 alignment.
-
vLLM GitHub issue #33791 — https://github.com/vllm-project/vllm/issues/33791
↩Dynamic loading via /v1/load_lora_adapter triggers high CPU usage and significantly slower inference compared to static loading via —lora-modules at startup; rank-64 adapters incur far larger hits than rank-16.
-
arXiv 2605.12674 — Foam-Agent paper — https://arxiv.org/abs/2605.12674
↩Foam-Agent (v2.0) achieved an 88.2% execution success rate when paired with Claude 3.5 Sonnet, a significant lead over existing baselines such as MetaOpenFOAM (55.5%) and OpenFOAM-GPT (37.3%)
-
Krank, Kronbichler & Wall (TUM) — DNS of periodic hills up to Re_H=10,595 — https://portal.fis.tum.de/en/publications/direct-numerical-simulation-of-flow-over-periodic-hills-up-to-re-/
↩Direct Numerical Simulation of Flow over Periodic Hills up to Re_H=10,595 … identified a significant discrepancy in the streamwise velocity overshoot directly above the hill crest when compared to the experimental data of Rapp and Manhart (2011) and Breuer et al. (2009).
-
ResearchGate — ‘Generalization Limits of Data-Driven Turbulence Models’ — https://www.researchgate.net/publication/385673624_Generalization_Limits_of_Data-Driven_Turbulence_Models
↩models trained on the periodic hill … exhibit a ‘generalizability gap’ when applied to disparate geometries, such as square ducts or curved backward-facing steps
-
Uni-Stuttgart preprint on data-driven RANS closures — https://elib.uni-stuttgart.de/server/api/core/bitstreams/502fd866-4ee8-4a53-b279-34f3702db464/content
↩increasing training set diversity is unlikely to provide a remedy due to the inherent lack of a unique, local mapping between mean flow features and Reynolds stresses
-
OpenReview submission (NeurIPS ML4PS 2025) — https://openreview.net/forum?id=Q6a9W6kzv5
↩reliance on Gaussian patches to ‘fix’ local errors resembles classical ‘point-wise tuning’ rather than the derivation of universal physical constants
-
TheMoonlight.io review of DeepScientist — https://www.themoonlight.io/en/review/deepscientist-advancing-frontier-pushing-scientific-findings-progressively
↩only 1–3% of its generated ideas ultimately result in measurable scientific progress; roughly 60% of its failures stem from implementation errors … consuming over 20,000 GPU hours for a single discovery cycle