OpenAI donates MRC, Google strips Gemma 4 heads, ServiceNow rolls back vLLM V1
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Unlocking large scale AI training networks with MRC (Multipath Reliable Connection) openai.com
OpenAI introduces MRC (Multipath Reliable Connection), a new supercomputer networking protocol released via OCP to improve resilience and performance in large-scale AI training clusters.
vLLM V0 to V1: Correctness Before Corrections in RL huggingface.co
Google’s Gemma 4 AI models get 3x speed boost by predicting future tokens arstechnica.com
Up to 3x the speed with no loss of quality—is it too good to be true?
Vibe coding and agentic engineering are getting closer than I’d like simonwillison.net
On Heavybit’s High Leverage podcast, Simon Willison admits the line he drew between throwaway ‘vibe coding’ and disciplined ‘agentic engineering’ is dissolving in his own workflow: he no longer reviews every line Claude Code writes for production, treating the agent like a black-box internal team whose output he trusts until it breaks.
References
ServeTheHome (Patrick Kennedy) servethehome.com
MRC is the custom RDMA transport protocol for gigascale AI… NVIDIA does not see the industry collapsing onto a single UEC winner; instead hyperscalers will tune variants like MRC to their specific workloads.
Dell’Oro Group analysis delloro.com
OpenAI’s MRC initiative reinforces Ethernet’s expanding role in AI back-end networks, where Ethernet sales surpassed InfiniBand in 2025.
Ernest Chiang notes on AWS SRD ernestchiang.com
SRD sprays packets across up to 64 parallel paths via ECMP and delegates reordering to Nitro hardware—an approach AWS shipped years before MRC formalized similar ideas.
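The mechanism is easy to picture in code. Below is a toy sketch of per-packet spraying with receiver-side reordering, assuming a deliberately simplified model (random path choice, a dict-backed reorder buffer); it is illustrative only and not the actual SRD or Nitro implementation.

```python
import random
from dataclasses import dataclass, field

NUM_PATHS = 64  # SRD reportedly sprays a flow across up to 64 ECMP paths

@dataclass
class Packet:
    seq: int    # message-level sequence number assigned by the sender
    path: int   # which equal-cost path this packet was sprayed onto

@dataclass
class ReorderBuffer:
    """Receiver-side reassembly, the job AWS delegates to Nitro hardware."""
    expected: int = 0
    pending: dict = field(default_factory=dict)
    delivered: list = field(default_factory=list)

    def receive(self, pkt: Packet) -> None:
        self.pending[pkt.seq] = pkt
        # Release the longest contiguous run starting at the next expected sequence number.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

def spray(num_packets: int) -> list:
    """Sender side: every packet independently picks a path (per-packet spraying),
    unlike classic RoCEv2 where one hash pins the whole flow to a single path."""
    return [Packet(seq=i, path=random.randrange(NUM_PATHS)) for i in range(num_packets)]

packets = spray(1000)
random.shuffle(packets)        # unequal path latencies scramble arrival order
rx = ReorderBuffer()
for pkt in packets:
    rx.receive(pkt)
assert [p.seq for p in rx.delivered] == list(range(1000))
```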
Midokura: UET vs Falcon and Beyond midokura.com
MRC is a production-hardened extension of RoCEv2 incorporating early UEC ideas, while UEC v1.0.2 (Jan 2026) remains the broader multivendor framework—the two are complementary, not rival, transports.
Data Center Knowledge datacenterknowledge.com
Although contributed to OCP, MRC’s advanced features—packet trimming, SRv6 source routing, out-of-order DMA—require high-end programmable NICs like Broadcom Thor Ultra or AMD Pensando Pollara, limiting near-term adoption to hyperscalers.
Bizety: AI Network Wars bizety.com
Critics argue source routing via SRv6 adds significant header overhead and shifts complexity to the NIC, potentially creating a new form of silicon lock-in where only specific vendors can support the full stack efficiently.
Thoughtworks/HuggingFace blog (EAGLE3 for Gemma 4) huggingface.co
EAGLE3 draft head for Gemma 4 31B… achieved a 1.72x speedup on the MT-Bench conversational benchmark… speed gains dropped to 1.05–1.14x on the SWEBench coding benchmark, as code is inherently less predictable than natural language.
r/LocalLLaMA benchmark thread reddit.com
29% average speedup (peaking at 50% for coding tasks) when using the E2B (4.65B) model to draft for the 31B target
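The EAGLE3 and r/LocalLLaMA numbers above both come from the same draft-and-verify pattern: a cheap drafter proposes a few tokens, the large target model checks them, and accepted prefixes are emitted unchanged so quality is preserved. A greedy-decoding sketch, with draft_step and target_step as stand-in callables rather than any real EAGLE3 or vLLM API:

```python
def speculative_decode(target_step, draft_step, prompt, max_new_tokens=64, k=4):
    """Greedy speculative decoding sketch.

    target_step(tokens) -> next token id under the large target model (argmax)
    draft_step(tokens)  -> next token id under the small draft model (argmax)
    Both are placeholders; a real engine verifies all k draft positions in a
    single batched forward pass, which is where the speedup comes from.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k tokens cheaply with the small model.
        ctx = list(tokens)
        draft = []
        for _ in range(k):
            nxt = draft_step(ctx)
            draft.append(nxt)
            ctx.append(nxt)
        # 2) Verify: keep the longest prefix the target agrees with, then emit
        #    the target's own token at the first disagreement.
        for proposed in draft:
            expected = target_step(tokens)
            if expected != proposed:
                tokens.append(expected)
                break
            tokens.append(proposed)
    return tokens[:len(prompt) + max_new_tokens]
```

The speedup then hinges entirely on how often the drafter guesses right, which is why the two benchmarks above report such different gains for the same technique.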
Google AI Edge MTP overview (ai.google.dev) ai.google.dev
Centroids Masking groups the 262K-token vocabulary into ~4K clusters… reduces linear head computation by roughly 45x with negligible impact on accuracy
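As a rough sketch of the two-stage trick described above: score a small set of cluster centroids first, then compute exact logits only for tokens in the top-scoring clusters. The shapes, clustering, and top-cluster count below are scaled-down stand-ins (Gemma 4's real figures are a 262K vocabulary in roughly 4K clusters), not Google's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, CLUSTERS, TOP_CLUSTERS = 64, 32_768, 512, 16   # illustrative shapes

lm_head = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)   # full output projection
token_to_cluster = rng.integers(0, CLUSTERS, size=VOCAB)            # cluster id per token
# One centroid per cluster; here simply the mean of that cluster's LM-head rows.
centroids = np.stack([lm_head[token_to_cluster == c].mean(axis=0) for c in range(CLUSTERS)])

def masked_logits(hidden_state):
    # Stage 1: score CLUSTERS centroids instead of VOCAB tokens (the cheap pass).
    cluster_scores = centroids @ hidden_state
    keep = np.argsort(cluster_scores)[-TOP_CLUSTERS:]
    # Stage 2: exact logits only for tokens whose cluster survived the mask.
    candidate = np.isin(token_to_cluster, keep)
    logits = np.full(VOCAB, -np.inf, dtype=np.float32)
    logits[candidate] = lm_head[candidate] @ hidden_state
    return logits

h = rng.standard_normal(HIDDEN).astype(np.float32)
next_token = int(np.argmax(masked_logits(h)))
```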
Maarten Grootendorst, ‘A Visual Guide to Gemma 4’ newsletter.maartengrootendorst.com
the native MTP heads were initially omitted from the Hugging Face release, remaining exclusive to Google’s LiteRT framework
r/LocalLLaMA — ‘Speculative decoding silently broken for Qwen 3.6’ reddit.com
metadata mismatches in early GGUF files (such as conflicting add_bos_token flags) could force inference engines into ‘token translation mode,’ effectively nullifying all speed gains
Medium — ‘Understanding MTP in DeepSeek-V3’ medium.com
DeepSeek explicitly credits Meta’s work (Gloeckle et al., 2024) in its technical reports as the foundational inspiration… DeepSeek-V3 uses auxiliary one-layer transformer modules that predict tokens in a causal chain
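A compact PyTorch-style sketch of the structure that description implies: each extra depth is a small one-layer transformer block that takes the previous depth's hidden states plus the embeddings of the inputs shifted one more position forward, chained causally and sharing the main LM head. The combine projection, layer sizes, and the missing attention mask are simplifications, not DeepSeek-V3's exact module.

```python
import torch
import torch.nn as nn

class MTPDepth(nn.Module):
    """One auxiliary prediction depth: a single transformer layer fed with the
    previous depth's hidden state merged with the next input token's embedding."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.combine = nn.Linear(2 * d_model, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, prev_hidden, tok_emb):
        return self.block(self.combine(torch.cat([prev_hidden, tok_emb], dim=-1)))

class MTPChain(nn.Module):
    """Causal chain of K extra depths; depth k predicts the (k+1)-th next token and
    reuses the main model's embedding table and LM head (no attention mask, for brevity)."""
    def __init__(self, d_model: int, vocab: int, depths: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)
        self.depths = nn.ModuleList([MTPDepth(d_model) for _ in range(depths)])

    def forward(self, main_hidden, input_ids):
        logits_per_depth, h = [], main_hidden
        for k, depth in enumerate(self.depths, start=1):
            shifted = torch.roll(input_ids, shifts=-k, dims=1)   # inputs k steps ahead
            h = depth(h, self.embed(shifted))                    # causal chain through depths
            logits_per_depth.append(self.lm_head(h))
        return logits_per_depth
```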
ScaleRL paper (arxiv 2510.13786) arxiv.org
Including the FP32 precision fix at the LM head improved asymptotic reward from 0.52 to 0.61, establishing it as a de facto standard for stable large-scale RL training.
MiniMax-M1 technical report (arxiv 2506.13585) arxiv.org
Upgrading the LM head to FP32 precision raised the correlation between training and rollout token probabilities from roughly 0.9 to 0.99, eliminating reward plateaus in long-CoT RL.
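Both findings point at the same fix: run the final projection and log-softmax in float32 even when the rest of the model is bf16, so the trainer and the rollout engine score tokens from (nearly) the same distribution. A generic sketch of that upcast; the module layout is illustrative, not either paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP32LMHead(nn.Module):
    """bf16 everywhere else, fp32 for the logits: rounding in a low-precision
    final matmul is what drives the train/rollout probability divergence."""
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        logits = F.linear(hidden.float(), self.proj.weight.float())  # upcast before the matmul
        return F.log_softmax(logits, dim=-1)
```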
TRL GitHub issue #4159 github.com
TRL’s GRPOTrainer added logprobs_mode="processed_logprobs" for vLLM ≥0.10.2; without it, raw logprobs combined with non-unit temperature produce a train-inference mismatch that destabilizes policy ratios.
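A toy illustration of why the flag matters, assuming rollouts were sampled at temperature 0.7 while "raw" logprobs ignore that scaling; the logits below are made up, only the ratio behaviour is the point.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    norm = math.log(sum(math.exp(x) for x in scaled))
    return [x - norm for x in scaled]

logits, token, temp = [2.0, 1.0, -1.0], 1, 0.7   # same model output on both sides

trainer_lp   = log_softmax(logits, temp)[token]   # trainer scores at the sampling temperature
processed_lp = log_softmax(logits, temp)[token]   # "processed" logprob: post-temperature
raw_lp       = log_softmax(logits)[token]         # "raw" logprob: temperature ignored

print(math.exp(trainer_lp - processed_lp))  # 1.0 -> policy ratio starts neutral
print(math.exp(trainer_lp - raw_lp))        # != 1.0 -> spurious ratio before any update
```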
0fd.org — ‘When Speed Kills Stability’ 0fd.org
Mitigations such as Masked Importance Sampling and dynamic LR scheduling can add up to 25% computational overhead, prompting debate over whether to prioritize raw inference speed or bitwise consistency.
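A sketch of the masking half of that mitigation: tokens whose trainer/rollout probabilities have drifted past a threshold are simply dropped from the loss rather than corrected. The threshold and tensor shapes are illustrative, not the article's implementation.

```python
import torch

def masked_is_loss(train_logprobs, rollout_logprobs, advantages, max_abs_log_ratio=0.5):
    """Per-token importance-weighted policy gradient, with drifted tokens masked out."""
    log_ratio = train_logprobs - rollout_logprobs
    keep = (log_ratio.abs() <= max_abs_log_ratio).float()   # 0/1 mask over trusted tokens
    per_token = -(log_ratio.exp() * advantages)             # plain IS-weighted surrogate
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```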
EmergentMind — CISPO algorithm summary emergentmind.com
CISPO clips only the magnitude of the importance sampling weight while retaining the underlying gradient update, so rare ‘reflective’ tokens (e.g. ‘Wait’, ‘However’) still contribute — reaching target performance in roughly half the steps of GRPO.
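The distinguishing detail is that CISPO clips and detaches the importance weight itself rather than zeroing the token's gradient the way PPO-style clipping does. A minimal sketch; the clip bound is illustrative.

```python
import torch

def cispo_loss(train_logprobs, rollout_logprobs, advantages, eps_max=2.0):
    """Truncate and detach the IS weight but keep every token's log-prob gradient,
    so rare low-probability tokens are down-weighted rather than discarded."""
    ratio = (train_logprobs - rollout_logprobs).exp()
    weight = ratio.clamp(max=eps_max).detach()   # clipped weight, treated as a constant
    return -(weight * advantages * train_logprobs).mean()
```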
Andrey Lukyanenko — GSPO paper review andlukyane.com
GSPO clips significantly more tokens than GRPO yet yields more effective gradients, and is naturally tolerant of the precision discrepancies and routing fluctuations inherent in MoE models — eliminating engineering hacks like Routing Replay.
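For contrast with the token-level objectives above, a sketch of GSPO's sequence-level ratio: the length-normalized product of token ratios is computed once per response and clipped once, which is what makes it tolerant of per-token precision noise. Epsilon and shapes are illustrative.

```python
import torch

def gspo_loss(train_logprobs, rollout_logprobs, advantage, eps=0.05):
    """One importance ratio per sequence (geometric mean of token ratios), clipped once."""
    seq_log_ratio = (train_logprobs - rollout_logprobs).mean()   # length-normalized log ratio
    ratio = seq_log_ratio.exp()
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantage, clipped * advantage)
```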