JS Wei (Jack) Sun

Scaffolding, not weights: where AI research is actually moving today

Today's most consequential AI research lives in the scaffolding — distributed training, agent harnesses, retrieval indexes — rather than the model weights themselves.

Scaffolding, not weights: where AI research is actually moving today

TL;DR

  • Decoupled DiLoCo cuts cross-datacenter bandwidth ~235x and holds 88% goodput under high failure rates — distributed training as availability engineering.
  • A teardown of Claude Code’s 500K-line source finds 98.4% deterministic harness, 1.6% AI logic — and the harness is where two CVEs already live.
  • Corpus2Skill replaces vector retrieval with an LLM agent walking a SKILL.md directory, lifting WixQA token F1 ~23% over agentic RAG.
  • Briefs cluster on the same axis: visual-RAG RL, sparse long-context updates, KV packet caching, surrogate routing, transport-flow RL.
  • An AIMO 3 retrospective pushes back — base model capability still dominates inference-time scaffolding tricks like majority voting and verifier selection.

Three features today, three different layers of the stack, one frame: the interesting work — and the interesting failures — are happening in the scaffolding around the model rather than in the weights. DeepMind’s Decoupled DiLoCo is an availability story dressed as a training story: a 235x bandwidth cut and 88% goodput under hardware churn, won by changing how learners synchronize rather than how they learn. The Claude Code teardown puts a number on the same intuition from the agent side — 98.4% deterministic harness, 1.6% model — and finds the CVEs and the false-negative rates living squarely in that 98.4%. Corpus2Skill rebuilds retrieval as an LLM walking a directory tree of SKILL.md files, treating the index itself as the unit of design.

The briefs sit alongside in the same register: visual-RAG RL, long-context sparsity, KV packet stitching, surrogate routing, transport-flow RL. The AIMO 3 retrospective is the day’s useful counterweight — a reminder that base capability still beats most of the clever scaffolding tricks people layer on top of it.

Decoupled DiLoCo trades lock-step for 88% goodput on flaky hardware

Source: deepmind-blog · published 2026-04-22

TL;DR

  • DeepMind’s Decoupled DiLoCo cuts inter-datacenter bandwidth ~235x (198 Gbps → 0.84 Gbps) on a 12B run.
  • Under high failure rates it holds 88% goodput vs. 27% for elastic data-parallel baselines.
  • A new primitive, Radial-Directional Averaging, makes the global update invariant to how many learners report in.
  • It’s a systems-availability win, not a paradigm shift — Nous goes lower-bandwidth, NVIDIA stays lock-step.

What actually changed

Conventional data-parallel training is a tyranny of the slowest chip: every accelerator must finish its step before the all-reduce fires. Decoupled DiLoCo breaks training into asynchronous “learner units” — independent islands of TPUs that compute locally and gossip pseudo-gradients over commodity WAN. DeepMind reports a 12B-parameter model trained across four U.S. regions running 20x faster than tightly-coupled sync, on 2–5 Gbps links. The independent paper walkthrough sharpens that claim: inter-datacenter bandwidth fell from roughly 198 Gbps to 0.84 Gbps, a ~235x reduction, with ML quality matching the synchronous Gemma 4 baseline 1.

The load-bearing result, though, isn’t speed — it’s goodput under chaos. When researchers injected high hardware failure rates, Decoupled DiLoCo retained 88% useful work; standard elastic data-parallel collapsed to 27% 2. That gap reframes the contribution as a fault-tolerance story dressed in distributed-training clothing.

flowchart LR
    subgraph DC1[Datacenter A · TPU v6e]
        L1[Learner unit 1]
    end
    subgraph DC2[Datacenter B · TPU v5p]
        L2[Learner unit 2]
    end
    subgraph DC3[Datacenter C · mixed]
        L3[Learner unit 3]
    end
    L1 -.async pseudo-grads.-> RDA{{Radial-Directional Averaging}}
    L2 -.async pseudo-grads.-> RDA
    L3 -.async pseudo-grads.-> RDA
    RDA --> G[(Global params)]
    G --> L1 & L2 & L3

The algorithmic glue is Radial-Directional Averaging (RDA), which decomposes each pseudo-gradient into a norm (radial) and a unit vector (directional), then averages them separately. The result: the magnitude of the global update no longer depends on how many learners happened to check in this round, so a quorum that shrinks or grows mid-training stays numerically stable 1. That’s also what makes mixing TPU v6e and v5p in one run viable — the faster chips don’t drown out slower ones in the average.

Where it sits in a crowded field

Decentralized training is no longer exotic. Prime Intellect’s OpenDiLoCo already trained across Canada, Finland, and the US on 127–935 Mbit/s links, with communication eating just 6.9% of wall-clock 3. Nous Research’s DisTrO pushes harder still, claiming 1,000–10,000x communication reduction via a DCT-compressed momentum buffer that runs over 10–100 Mbps consumer broadband 4. From the other direction, NVIDIA reported 96% scaling efficiency training Nemotron-4 340B across two datacenters 1,000 km apart — without giving up lock-step at all 5.

Decoupled DiLoCo’s niche is the middle: not the lowest bandwidth, not the tightest coupling, but fault-isolated asynchrony on heterogeneous accelerators. Practitioner threads concede the MapReduce-style partitioning isn’t conceptually new; the engineering of an async syncer that survives node loss on TPU pods is the actual contribution 6.

The part the blog post skips

“Potentially scary, national security wise.” 6

If frontier training no longer needs a single hyperscale campus or 200 Gbps of dark fiber between buildings, the compute-monitoring chokepoint that current export-control regimes lean on gets weaker. DeepMind frames the win as reclaiming “stranded” compute and extending older TPU lifecycles. Both true — and both reasons the governance conversation around distributed training is about to get harder, whether or not RDA is the primitive that wins.


Claude Code is 98.4% plumbing — and the plumbing is where the bugs live

Source: hf-daily-papers · published 2026-04-13

TL;DR

  • A reverse-engineering of Claude Code’s 500K-line TypeScript source finds 98.4% deterministic harness, 1.6% AI decision logic.
  • The paper’s “five human values” framing omits shipped features like Undercover Mode and anti-distillation decoys surfaced in a March 2026 leak.
  • Anthropic’s own auto-approval classifier hits 17% false negatives in production — and 81% on ambiguous prompts.
  • A pre-trust hook execution window already produced two CVEs, one rated CVSS 8.7.

The harness eats the brain

Liu et al. read every line of Claude Code v2.1.88 and report a striking ratio: roughly 98.4% of the codebase is deterministic operational infrastructure — context assembly, a five-layer compaction pipeline, permission gates, subagent worktrees — and only 1.6% is the prompting-and-decision layer that talks to the model. The thesis is that reliable agents are a systems-engineering problem, not a prompting problem, and that Claude Code’s “minimal decision harness” deliberately avoids the rigid state graphs of frameworks like LangGraph in favor of dense deterministic guardrails around a freely-reasoning model.

The architecture is an AsyncGenerator while-loop with five subsystems hanging off it:

flowchart LR
    A[CLAUDE.md + git env] --> B[Context assembly]
    B --> C[5-layer compaction]
    C --> D{Model call}
    D -->|tool_use| E[Permission gate]
    E -->|yoloClassifier| F[Tool exec / subagent]
    F --> B
    G[.claude/settings.json hooks] -. pre-trust .-> F

That dotted line is where the story gets interesting.

The ML permission layer doesn’t hold up where it matters

The paper credits a chain-of-thought classifier (yoloClassifier.ts) with solving “approval fatigue” — users currently rubber-stamp 93% of permission prompts, and auto-approval rates double from ~20% to >40% as developers log more sessions. But Anthropic’s own engineering post reports a 17% false-negative rate on production traffic, and on AmPermBench — a benchmark of deliberately ambiguous prompts — that jumps to 81% 7.

The classifier collapses precisely in the cases where automation matters most: under-specified intent and file-edit cleanup.

That is not a footnote on a working safety system. That is the safety system failing open in the exact regime users will lean on it.

The “temporal ordering flaw” was a shipping default

The paper mentions CVE-2025-59536 and CVE-2026-21852 in its limitations section. Check Point’s disclosure makes clear this is a class of bug, not an edge case: a SessionStart hook embedded in a malicious repo’s .claude/settings.json executes the instant a developer cds into the directory, before the trust dialog ever appears, and enableAllProjectMcpServers: true could silently authorize untrusted MCP servers 8. The “clone repo, open with AI” workflow that defines modern malware analysis is a self-detonating trap.

What the paper leaves out

A March 2026 source-map leak surfaced two subsystems absent from the “five human values” taxonomy. utils/undercover.ts activates when USER_TYPE === 'ant' is detected in a public repo and instructs the model to strip “Claude Code” branding, internal codenames, and Co-Authored-By Git trailers — engineered to allow undisclosed AI-authored commits into open source 9. An ANTI_DISTILLATION_CC flag injects decoy tools to poison competitor distillation traffic 9. A design-space analysis that claims to enumerate the values driving the architecture has to account for these.

The downstream numbers are real

Where the paper is conservative, the independent data is harsher. Anthropic’s own RCT shows full-delegation users scoring below 40% on comprehension quizzes while tutor-mode users hold above 65% — a bimodal split the paper averages into a single 17% drop 10. GitClear’s longitudinal corpus shows refactored/moved code falling from 25% (2021) to under 10% (2024) as copy-paste hit historic highs 11, which is exactly the “good local, poor global” failure mode the paper predicts. And the clean subagent-isolation story has a known crack: GitHub #51596 documents 8-character hex prefix collisions causing fresh sessions to inherit stale worktree state 12.

The taxonomy is useful. The architecture is impressive. The harness still ships with the holes.


RAG as filesystem navigation: Corpus2Skill turns Anthropic’s Skills into a knowledge map

Source: hf-daily-papers · published 2026-04-15

TL;DR

  • Corpus2Skill replaces vector retrieval with an LLM agent that ls/cats a hierarchical SKILL.md directory built from clustered document summaries.
  • On WixQA it hits 0.468 token F1 vs 0.381 for an agentic-RAG baseline — a ~23% relative lift, with context recall jumping from 0.498 to 0.673.
  • Prompt caching brings per-query cost to $0.089, roughly matching agentic search ($0.088); without caching it is 7–13× more expensive than single-shot RAG.
  • Fails on homogeneous tables (TatQA) and long extractive contracts (CUAD) — structural limits of any hierarchical-summary approach.

The pitch: don’t retrieve, browse

The conceptual move is small but pointed. Anthropic’s Agent Skills standard already lets an agent load a few dozen tokens of YAML metadata at startup and only cat the body when it decides a skill is relevant 13. Corpus2Skill repurposes that machinery from procedural skills (“how to file an expense report”) to informational ones (“where billing questions live in the corpus”). The vector database goes away. In its place: a directory tree of LLM-written summaries that the agent walks top-down, drilling into INDEX.md files until it finds a document ID, then calling get_document(id) for the full text.

flowchart LR
    A[Raw docs] --> B[Embed + K-Means]
    B --> C[LLM cluster summaries]
    C --> D[Re-embed + cluster again]
    D --> E[SKILL.md / INDEX.md forest]
    E --> F{Agent: ls, cat, get_document}
    Q[Query] --> F
    F --> R[Answer]

The numbers, in context

WixQA is a real test, not a strawman: Wix’s own engineers report dense-retrieval + GPT-4o baselines stall around 76–77% on the simulated set 14. Against that backdrop, Corpus2Skill’s 0.468 token F1 (vs 0.389 for RAPTOR, 0.381 for an iterative agentic baseline) and 0.739 LLM-judged factuality are meaningful gains, and the context-recall jump to 0.673 suggests the navigation agent is genuinely finding evidence the dense retrievers miss.

The economic claim is the load-bearing one. Independent analysis pegs agentic-RAG loops at ~3.3× input and ~1.9× output tokens versus tuned dense retrieval 15, which lines up with the paper’s admitted 7–13× overhead. The authors’ counter is ephemeral prompt caching: navigation files get reused across turns and queries, dragging per-query cost down to $0.089 — within a hair of the agentic baseline’s $0.088. That parity is the entire commercial argument, and no one outside the author group has replicated it yet.

The compile-side cost story looks favorable too. The E2GraphRAG benchmark found RAPTOR’s hierarchical-summary indexing roughly 10× cheaper than entity-graph extraction (0.25 vs 2.62 standardized units on NovelQA) 16, and Corpus2Skill’s K-Means + summarize loop sits in the same family — so building the skill forest should be far cheaper than a GraphRAG-style knowledge graph over the same corpus.

Where it breaks

The failure modes are honest and structural. Corpus2Skill loses on TatQA (thousands of near-identical financial tables, where summaries can’t discriminate) and CUAD (long legal contracts where the answer is a verbatim span, not a topic). Both are well-known weak spots for any approach that compresses documents into topical summaries — the discriminative signal just isn’t there. Hard clustering is another sharp edge: each document lives on exactly one path, so a doc that spans “Billing” and “SEO” is invisible from the wrong branch.

There’s also a security shadow worth flagging. Anthropic’s Skills format permits executable scripts in a scripts/ directory, and ecosystem reviewers have already warned about agents running malicious code from untrusted skill packages 17. Corpus2Skill’s read-only tool surface side-steps this — but any production deployment that lets the navigation agent execute helpers inside cluster directories reopens the hole.

What’s actually new

The reference implementation is self-described as “early release… work in progress, rough edges remain” 18, and the F1 deltas should be treated as provisional. The durable contribution is the reframe: a corpus is a filesystem, retrieval is cd, and the agent is the user. Whether that beats a well-tuned hybrid retriever in production — at the cached price the paper claims — is now a reproduction problem, not a conceptual one.

Round-ups

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Source: hf-daily-papers

UniDoc-RL trains large vision-language models for document RAG by jointly optimizing retrieval, reranking, perception, and reasoning under one reinforcement learning loop, using a hierarchical action space and dense multi-reward supervision via Group Relative Policy Optimization rather than treating each stage independently.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Source: hf-daily-papers

KV Packet eliminates the recomputation overhead in cross-document KV cache reuse by treating each cached document as an immutable packet stitched together with trainable soft-token adapters distilled self-supervised, cutting FLOPs and time-to-first-token on Llama-3.1 and Qwen2.5 versus CacheBlend, EPIC, and SAM-KV.

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

Source: hf-daily-papers

LongAct targets long-context RL post-training by restricting updates to the high-magnitude entries in query and key vectors, a saliency-guided sparse scheme compatible with GRPO and DAPO that reports gains on LongBench v2 and RULER while cutting update cost.

Reinforcement Learning via Value Gradient Flow

Source: hf-daily-papers

Value Gradient Flow recasts behavior-regularized RL as an optimal transport problem solved by discrete gradient flow against a reference distribution, avoiding value over-optimization and enabling adaptive test-time scaling via a tunable transport budget. It reports gains on offline RL and LLM RL benchmarks.

TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification

Source: hf-daily-papers

TRACER uses production traces to distill cheaper ML surrogates for LLM classification tasks, gating them behind a parity check that only routes traffic to the surrogate when its agreement with the original model crosses a configurable threshold. The system is open source with intent and NLI benchmarks.

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Source: hf-daily-papers

A retrospective from the AIMO 3 mathematical-olympiad competition argues that base model capability and reasoning-strategy diversity dominate inference-time tricks: majority voting plateaus due to correlated errors, and a Diverse Prompt Mixer with high-temperature sampling outperforms heavier prompt engineering or verifier-based selection.

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Source: hf-daily-papers

ASGuard defends against tense-shift jailbreaks (e.g., rewriting prompts in past tense) by using mechanistic circuit analysis to locate the specific attention heads responsible for brittle refusals, then applying targeted activation scaling and preventative fine-tuning rather than broad safety retraining.

Footnotes

  1. ArxivIQ Substack (paper walkthrough)https://arxiviq.substack.com/p/decoupled-diloco-for-resilient-distributed

    Inter-datacenter bandwidth requirements [drop] from approximately 198 Gbps to just 0.84 Gbps… Radial-Directional Averaging (RDA) decouples the radial component (the norm) from the directional component (the unit vector), ensuring the global update’s magnitude remains invariant to the number of learners.

    2
  2. MarkTechPosthttps://www.marktechpost.com/2026/04/23/google-deepmind-introduces-decoupled-diloco-an-asynchronous-training-architecture-achieving-88-goodput-under-high-hardware-failure-rates/

    Decoupled DiLoCo maintained 88% goodput under high failure rates, compared to just 27% for standard elastic data-parallel methods.

  3. Prime Intellect (OpenDiLoCo blog)https://www.primeintellect.ai/blog/opendiloco

    OpenDiLoCo maintained high efficiency on network bandwidths ranging from 127 to 935 Mbit/s… communication bottlenecks accounted for only 6.9% of total training time.

  4. VentureBeat (Nous DisTrO)https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet

    DisTrO is built on the DeMo (Decoupled Momentum) optimizer… claiming 1,000x to 10,000x efficiency gains, allowing training on consumer-grade internet connections as slow as 10–100 Mbps.

  5. NVIDIA Developer Blog (Nemotron-4 340B)https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/

    NVIDIA demonstrated training the Nemotron-4 340B model across two data centers 1,000 km apart… achieving 96% scaling efficiency across 3,072 GPUs.

  6. r/machinelearningnews discussionhttps://www.reddit.com/r/machinelearningnews/comments/1su5vds/google_deepmind_introduces_decoupled_diloco_an/

    The work partitioning scheme isn’t novel, but the scheme itself is… applying a MapReduce-style pattern to AI training to overcome high intra-node latency is the real challenge. Potentially scary, national security wise.

    2
  7. Anthropic engineering blog (Auto Mode)https://www.anthropic.com/engineering/claude-code-auto-mode

    Anthropic reports a 17% false negative rate (FNR) on production traffic [but] AmPermBench, which used deliberately ambiguous prompts, found an 81% FNR, suggesting the classifier struggles when user intent is underspecified

  8. The Hacker News (Check Point disclosure)https://thehackernews.com/2026/02/claude-code-flaws-allow-remote-code.html

    CVE-2025-59536 (CVSS 8.7) allowed remote code execution via malicious shell commands embedded in a repository’s .claude/settings.json file … a SessionStart hook the moment a developer launched Claude Code within an untrusted directory … bypassed the startup trust dialog

  9. Decode the Future (leak analysis)https://decodethefuture.org/en/claude-code-undercover-mode-killswitches-telemetry/

    Undercover Mode is triggered automatically when the software detects an Anthropic employee (USER_TYPE === ‘ant’) working in a public repository … instructing the AI to ‘not blow your cover’ and to strip … standard Git Co-Authored-By attribution lines

    2
  10. ByteIota summary of Anthropic RCThttps://byteiota.com/ai-coding-assistants-cut-developer-skills-by-17-anthropic-study/

    developers who delegated code generation entirely to the AI scored below 40% on follow-up quizzes … those who used AI as a tutor for conceptual inquiries maintained high scores above 65%

  11. Jonas.rs / GitClear 2025 report summaryhttps://www.jonas.rs/2025/02/09/report-summary-gitclear-ai-code-quality-research-2025.html

    the percentage of code being refactored or ‘moved’ has plummeted from 25% in 2021 to less than 10% in 2024, while ‘copy-pasted’ or cloned code has reached historic highs

  12. dev.to (Claude Code worktrees writeup)https://dev.to/thebrierfox/claude-code-worktrees-how-to-run-parallel-builds-without-merge-conflicts-56m2

    a ‘stale branch’ bug (GitHub #51596) where the system may reuse existing branches if 8-character hex prefixes collide, potentially contaminating new sessions with old uncommitted changes

  13. Anthropic Engineering — Agent Skillshttps://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills

    At session startup, an agent only loads a few dozen tokens of metadata (name and description) from the YAML frontmatter; full procedural instructions are only read into the context window via filesystem tools when the agent determines the skill is relevant.

  14. Wix Engineering blog (WixQA)https://www.wix.engineering/post/advancing-enterprise-ai-how-wix-is-democratizing-rag-evaluation

    Standard setups using dense retrieval (E5) and GPT-4o hit approximately 76–77% accuracy on the simulated dataset, which is considered low compared to typical open-domain tasks.

  15. Towards Data Science — Agentic RAG vs Classic RAGhttps://towardsdatascience.com/agentic-rag-vs-classic-rag-from-a-pipeline-to-a-control-loop/

    Agentic RAG is systematically more expensive, requiring approximately 3.3x more input tokens and 1.9x more output tokens than ‘Enhanced RAG’ (optimized dense retrieval).

  16. E2GraphRAG benchmark (OpenReview, 2026)https://openreview.net/pdf/24e89d27a563f31865dc6c6a65b9c32e3570717f.pdf

    On the NovelQA dataset, standard GraphRAG incurred an indexing cost of 2.62 units whereas RAPTOR cost only 0.25 units — a nearly 90% reduction in token consumption.

  17. Firecrawl blog — Agent Skillshttps://www.firecrawl.dev/blog/agent-skills

    Because skills can bundle executable scripts in a scripts/ directory, there is a risk of agents running malicious code from untrusted repositories.

  18. GitHub — dukesun99/Corpus2Skill READMEhttps://github.com/dukesun99/Corpus2Skill

    early release … Work in Progress … the core pipeline functions end-to-end, rough edges remain.

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare