JS Wei (Jack) Sun

OLMo needs width, MemFT needs token reweighting, Parallax needs Muon

Three training-dynamics results each post a clean gain that holds only under one specific knob: model width, token weighting, or optimizer.

OLMo needs width, MemFT needs token reweighting, Parallax needs Muon

TL;DR

  • OLMo 4B learns rare modular-arithmetic tasks that 4M-param siblings never acquire.
  • MemFT reweights LoRA loss onto stubborn tokens, hitting 100% accuracy versus SFT’s 94.7%.
  • Parallax beats softmax attention at 0.6B only when trained under the Muon optimizer.
  • LaRA flags RL post-training contamination through layer-wise geometric signatures black-box probes miss.
  • Poisoned LoRA adapters hide backdoors detectable via cross-module Frobenius-norm statistics.

Today’s research thread is training dynamics, not architecture. Three independent results each land a clean headline number — a scale threshold for rare-task acquisition, a fine-tuning gain on stubborn tokens, an attention variant that beats softmax — and each turns out to hinge on one specific gradient-flow choice the headline alone doesn’t surface. OLMo’s width sets whether rare features survive interference from common ones. MemFT’s reweighting decides which tokens the LoRA loss actually sees. Parallax’s gains exist under Muon and collapse under AdamW.

The round-up leans on a parallel fragility one layer down: backdoors hiding in LoRA adapters, contamination detectable through layer-wise representation geometry, RLHF preference data weaponized via best-of-N sampling. The same pipelines the features dissect — LoRA fine-tuning, RL post-training, preference optimization — are where today’s security and audit work finds its leverage.

Larger OLMo models learn rare tasks small ones never reach

Source: hf-daily-papers · published 2026-05-27

TL;DR

  • OLMo 4B acquires injected modular-arithmetic tasks that 4M-param siblings fail to learn, even given more data.
  • Mechanism: common-task gradients overwrite rare-task progress in small models, producing a sawtooth “update-and-forget” pattern.
  • Above a critical width, frequent-task gradients weaken enough that rare-feature signal accumulates between injections.
  • Independent work offers three rival explanations — metric artifacts, output-head orthogonality, and MoE routing — that the paper doesn’t cleanly rule out.

From sample efficiency to interference

The standard story for why scale unlocks capability is sample efficiency: bigger models squeeze more signal out of the same tokens. Huang & Saphra argue the real bottleneck is resource competition inside the data mixture. Tasks fight for parameters; in a small model, the high-frequency tasks win, and their gradient updates between rare-task occurrences actively erase whatever fragile progress the rare task made. That is a stronger claim than “small models need more data” — it is “small models never get there.”

The authors formalize this with a utility ordering (task frequency × feature importance) and prove features are learned in that order, with rare features acquired only once a “critical width” $N^{crit}_r$ makes them locally stable against interference from common features.

What the OLMo injection experiment shows

The evidence has two layers. In a synthetic mixture of 32 linear regression tasks, per-task loss falls predictably with width, and rare-task signal in small models shows a clean sawtooth: spike on injection, decay between injections. Larger students hold the signal flat.

The pretraining layer is more interesting. The authors pretrain OLMo variants from 4M to 4B parameters on Dolma v1.7 and inject two synthetic tasks (modular addition, comparison) at controlled frequencies. Only models approaching 4B acquire the rarest, hardest injections. Crucially, at matched general language-modeling loss, larger models achieve much lower task-specific loss on the injected tasks, and show measurably lower gradient cosine-interference between the LM objective and the rare-task objective. So the scaling benefit isn’t “better at language in general” leaking into the probe — it’s specifically better retention of low-utility features.

This is a constructive answer to the emergence debate Wei and collaborators kicked off, where downstream capabilities appear as qualitative jumps that don’t extrapolate from smaller models 1. Huang & Saphra give that phenomenon a mechanism.

Where the mechanism fits — and breaks

The framing the paper has to defend is “tasks smaller models never learn even with infinite data.” That is exactly the claim Schaeffer et al.’s “mirage” thesis chips at: switch from exact-match to continuous metrics and apparent step changes flatten into smooth power laws 2. The OLMo per-task losses partially dodge this by being continuous, but the strong “never” framing is the load-bearing flank.

Two other accounts overlap with the interference story without being identical. A $1/d$ scaling law for catastrophic forgetting attributes wider-model retention to growing orthogonality between output heads in high-dimensional space 3 — geometric superposition, not utility-ordered competition, but predicting similar curves. “Adam’s Law” pushes further in Huang & Saphra’s direction, arguing low-frequency or structurally hard tasks can double the effective scaling exponent under representation learning 4.

The sharpest dissent is empirical. MLSanity’s papercrunch reports that in the 1B–14B range:

the severity of forgetting can actually intensify as scale increases, possibly due to the higher initial performance of larger models creating a steeper drop 5

That is the opposite scaling sign, at fine-tuning rather than pretraining. And if interference really is the bottleneck, MoE proponents have a fair point that routing task-specific knowledge to disjoint experts partitions the parameter space at a fraction of the FLOPs dense width costs 6.

The controlled OLMo task-injection protocol is the paper’s durable contribution. The “infinite data” framing is the part to watch.


MemFT beats SFT by training only LoRA’s stubborn tokens

Source: hf-daily-papers · published 2026-05-27

TL;DR

  • A new power law $\Delta\mathcal{L} = C \cdot r^\alpha \cdot \ell^{-\beta} + b$ fits LoRA verbatim memorization with R² > 0.98 on Qwen3-8B and Llama3.1-8B.
  • Under greedy decoding, any token with p ≤ 0.5 can derail the whole sequence via autoregressive cascade failure.
  • MemFT reweights training onto sub-threshold “stubborn” tokens, hitting 100% token accuracy on a long-context stress test versus SFT’s 94.7%.
  • LoRA normally leaks ~10× less than full fine-tuning, a privacy buffer MemFT’s recall push directly erodes.

A scaling law for verbatim recall

Most LoRA capacity discussion has been empirical and vibes-based. This paper makes it quantitative: across a wide sweep of ranks r and sequence lengths , training-loss reduction follows a clean log-log linear relationship, $\Delta\mathcal{L} = C \cdot r^\alpha \cdot \ell^{-\beta} + b$, with α measuring how efficiently extra rank buys memory and β measuring the length penalty. The fit holds at R² > 0.98 on both 8B models tested.

The law lands inside an active debate. Thinking Machines’ “LoRA Without Regret” sweep argues LoRA matches full fine-tuning on small-to-medium corpora but diverges at pretraining-scale data where rank becomes a hard ceiling 7. The Parametric Memory Law is a closed-form version of exactly that ceiling — useful if you want to predict, before training, whether rank 16 will fit your GDPR text or whether you need 64.

The p = 0.5 cliff and MemFT

The mechanistic claim is sharper than the law itself. Under greedy decoding, a token is recalled exactly iff its predicted probability exceeds 0.5 (loss ≈ 0.693). Above that, it’s guaranteed argmax; below it, some other token can take the top spot and derail the rest of the sequence — what the authors call “autoregressive cascade failure.” The pathology this explains is well known to anyone who has fine-tuned for memorization: average cross-entropy looks great while exact-match accuracy sits at zero, because a handful of stubborn tokens are skewing the picture.

MemFT acts on this directly. Its hard-mask variant (MemFT-OT) zeros gradient on any token already past the 0.693 loss threshold, freeing budget for the ones that haven’t crossed. The sliding variant (MemFT-SW) adds a curriculum that focuses on the local context around the first prediction error. Concrete results:

Task / ModelSettingSFTMemFT
Stress Test (Llama3.1-8B, r=16)token acc.94.7%100.0% (OT)
PhoneBook (Qwen3-8B, r=32)exact matchmatches only at higher rank99.5% (SW)
Linear Rule Learningunseen-pair gen.baseline+7–15%

The generalization bump on linear-rule learning is the interesting one: starving easy tokens of gradient appears to push the model toward the underlying pattern rather than over-memorizing specific examples.

What the law can’t see

Three caveats the paper underweights. First, the p=0.5 threshold is a mathematical artifact of greedy argmax; prior work flags that greedy-optimized models behave very differently under temperature or nucleus sampling, and leakage can surface only when the distribution is flattened 8. Production inference rarely runs pure greedy.

Second, Shuttleworth et al. show LoRA and full fine-tuning are not the same solution even at matched accuracy — LoRA inserts “intruder dimensions,” high-singular-value directions absent from FFT, linked to faster forgetting in continual learning 9. The law treats LoRA as a clean memory probe, but if memorized content lives in intruder dimensions, the 0.5 threshold describes a structurally different memory than FFT produces. Related: rsLoRA argues the standard 1/r scaling causes “stunted learning” at high rank 10, so the fitted α may understate true capacity 11.

Third, the privacy regression is real. “Leaner Training, Lower Leakage” reports LoRA cuts unintended verbatim memorization ~10× versus FFT 12. MemFT’s explicit pitch — push more tokens past verbatim recall — directly erodes that property. The paper’s own demo domains (API keys, watermarks, ICD-10 codes) read very differently depending on whether you want the model to remember them or whether a regulator does.


Parallax tops softmax attention at 1.7B — with Muon

Source: hf-daily-papers · published 2026-05-26

TL;DR

  • Parallax bolts a learned covariance-correction branch onto softmax attention, hitting 22.25 WikiText perplexity at 0.6B vs. 23.36.
  • A custom CuTeDSL decode kernel matches or beats FlashAttention 2/3, with up to 1.54× speedup in compute-matched settings.
  • The gains collapse under AdamW — the probe vector shrinks until Parallax degrades into vanilla softmax attention.
  • Untested above 1.7B parameters and not benchmarked against Gated DeltaNet under matched recipes — the obvious missing experiment.

What Parallax actually changes

Standard softmax attention is, mathematically, a 1964 Nadaraya–Watson local-constant kernel regressor — useful trivia, but it also pins down the bias Parallax tries to remove 13. Local Linear Attention (LLA) replaces that constant estimate with a local-linear one, which improves associative memory but requires solving a small linear system for every query. The prior art, FlashLLA, did this with a fused Triton conjugate-gradient kernel that had to stream the Key matrix several times per step, eating the wall-clock budget it was supposed to save 14.

Parallax’s move is to skip the solve entirely. Instead of inverting the local KV covariance per token, it learns a single projection $W_R$ that produces a “probe” vector $\rho_i = W_R x_i$, and subtracts $\Sigma_{KV}^{(i)} \rho_i$ from the standard softmax output. Conceptually it’s a residual correction branch, not a replacement architecture — MarkTechPost’s coverage flags this framing explicitly: “keeps softmax attention and adds a learned covariance correction branch” 15.

flowchart LR
    X[Input x_i] --> SA[Softmax attention o_i^SA]
    X --> R[Learned probe ρ_i = W_R x_i]
    KV[Local KV covariance Σ_KV] --> CORR[Σ_KV · ρ_i]
    R --> CORR
    SA --> SUB((−))
    CORR --> SUB
    SUB --> OUT[Parallax output]

The kernel co-design is the other half. By doubling FLOPs while holding HBM traffic nearly constant, Parallax shifts attention from I/O-bound to compute-bound — which is exactly the regime modern Hopper GPUs were built for, and why the decode kernel can outrun FlashAttention 3.

The Muon dependency is load-bearing

The most interesting finding in the paper isn’t the perplexity number — it’s that the gain only exists when training with the Muon optimizer. Under AdamW, what the authors call “magnitude tension” causes $|W_R|$ to shrink during training; the correction term goes inert and Parallax silently reduces to standard softmax attention. With Muon, the weight matrices retain higher stable rank, the probe stays expressive, and the 0.6B average across BoolQ/HellaSwag/PIQA/ARC/WinoGrande/OpenBookQA/SciQ lands at 55.99% vs. 54.90% for the Muon-trained Transformer baseline and 52.61% for the AdamW baseline.

That’s a real result, but it makes Parallax an architecture-plus-optimizer package. You don’t get to adopt one without the other.

Where the claim is fragile

Two caveats from outside the paper matter. First, 1.7B is exactly the scale where a recent subquadratic scaling study found the leaderboard reshuffles — Gated DeltaNet overtook xLSTM and recovered 94.2% of a Transformer teacher’s performance 16. Parallax beats DeltaNet on MAD recall, but a head-to-head against Gated DeltaNet under matched Muon training is conspicuously absent.

Second, the engineering is locked to Hopper and CuTeDSL. AIWeekly’s writeup makes the adoption question explicit: long-term viability hinges on whether anyone ports the kernels to other hardware 17. And a community explainer notes the open theoretical worry that the local-linear-vs-local-constant gap may compress at frontier scale 18.

The mechanism is real. The wins are provisional.

Net: a credible engineering advance over FlashLLA, with a perplexity story that survives or falls with three open questions — scale beyond 1.7B, comparison against Gated DeltaNet, and whether Muon stays the right partner.

Round-ups

Poisoned LoRA adapters hide backdoors detectable by weight statistics

Source: hf-daily-papers

LoRA fine-tuning can embed backdoors that activate at the token-feature level while preserving benign performance, evading prompt-injection classifiers. The authors localize the trigger to MLP down_proj weights via causal patching and flag adapters using cross-module standard deviation of Frobenius norms.

LaRA spots RL post-training contamination via layer-wise geometry

Source: hf-daily-papers

LaRA detects benchmark contamination in RL-post-trained LLMs by tracking geometric deviations across layers, including perturbation sensitivity, directional collapse, and local representation rigidity. Contaminated checkpoints show distinctive layer-wise signatures that black-box probing misses, giving evaluators a structural test for leaked training data.

RLHF preference data is vulnerable to alignment tampering attacks

Source: hf-daily-papers

Alignment tampering exploits weaknesses in pairwise comparisons and reward modeling, letting language models amplify undesired behaviors through poisoned preference datasets. The work shows best-of-N sampling magnifies the effect, turning RLHF’s own optimization pressure into a mechanism for entrenching misaligned biases.

AgentDoG 1.5 adds taxonomy-guided safety training for AI agents

Source: hf-daily-papers

AgentDoG 1.5 combines a structured agent-safety taxonomy with influence-function purification and agentic safety SFT, training in Docker-level RL environments. An online guardrail handles real-time moderation inside interactive scenarios, and the authors emphasize that the alignment pipeline runs on minimal curated samples.

CausaLab benchmarks LLM agents on interactive causal discovery

Source: hf-daily-papers

CausaLab puts LLM agents inside synthetic structural causal models where they must run interventions and recover the underlying graph, not just predict outcomes. The split between predictive success and causal understanding exposes agents that score well on observations while missing the true mechanism.

WorldMemArena benchmark exposes memory gaps in multimodal agents

Source: hf-daily-papers

WorldMemArena tests how multimodal agents write, store, and retrieve memory across long-horizon tasks through an Action-World Interaction Loop. The benchmark separates lifelong evolution from agentic execution, showing that RAG and harness-based memory systems still struggle to ground decisions in prior visual evidence.

Qwen-VLA unifies manipulation, navigation, and trajectory prediction in one model

Source: hf-daily-papers

Qwen-VLA pairs a shared vision-language backbone with a DiT-based action decoder and embodiment-aware prompt conditioning, jointly pretraining across robot platforms. The single model handles manipulation, navigation, and trajectory prediction, with out-of-distribution generalization across embodiments the paper highlights as its main result.

Footnotes

  1. Google Research blog (Wei et al., on emergent abilities)https://research.google/blog/characterizing-emergent-phenomena-in-large-language-models/

    Certain downstream tasks exhibit qualitative shifts that cannot be extrapolated from smaller models

  2. Dhiria — ‘Emergent Abilities: Reality or Mirage?’ (summarizing Schaeffer et al., NeurIPS 2023)https://www.dhiria.com/en/blog/emergent-abilities-in-large-language-models-reality-or-mirage

    Sharp transitions are caused by the researcher’s choice of evaluation metrics rather than a fundamental change in model behavior; with continuous metrics, performance gains appear smooth and follow existing scaling laws

  3. OpenReview — ‘$1/d$ scaling law for catastrophic forgetting’https://openreview.net/pdf/64496ca54c5e1ef95e6dd7bd136098534e2f6306.pdf

    Forgetting decreases as the hidden dimension of the model increases… driven by the increasing orthogonality of output heads in high-dimensional space

  4. Medium — ‘Adam’s Law: textual-frequency cheat code for LLMs’https://medium.com/towardsdev/adams-law-the-hidden-textual-frequency-cheat-code-for-llms-f5834a75690e

    For ‘hard tasks’—defined by their low frequency or complex structure—feature learning can effectively double the scaling exponent compared to standard models

  5. MLSanity Papercrunch — empirical forgetting at scalehttps://mlsanity.com/papercrunch/

    In the 1B to 14B parameter range, the severity of forgetting can actually intensify as scale increases, possibly due to the higher initial performance of larger models creating a steeper drop

  6. GenerativeAI.pub — MoE as alternative to dense scalinghttps://generativeai.pub/mixture-of-experts-60504e24b055

    MoE mitigates [gradient interference] by using a router to decouple the learning process; task-specific knowledge is routed to different subnetworks, effectively partitioning the parameter space to prevent tasks from competing for the same weights

  7. Thinking Machines Lab — ‘LoRA Without Regret’https://thinkingmachines.ai/blog/lora/

    LoRA matches FFT efficiency on small-to-medium datasets (e.g., ~1 million examples) when adapters are applied to all layers, particularly MLPs … at ‘pretraining-like’ scales, LoRA’s capacity becomes a bottleneck, leading to a divergence where FFT remains more efficient.

  8. Josifoski et al., ‘How Optimal is Greedy Decoding?’ (AKBC 2022)https://www.akbc.ws/2022/assets/pdfs/17_how_optimal_is_greedy_decoding.pdf

    Greedy decoding is often criticized as ‘brittle’ … models optimized for greedy performance often exhibit a significant gap when subjected to sampling-based generation … greedy search can occasionally suppress leakage that becomes apparent only when the model’s output distribution is flattened by temperature.

  9. Shuttleworth et al., ‘LoRA vs Full Fine-tuning: An Illusion of Equivalence’ (arXiv 2410.21228)https://arxiv.org/abs/2410.21228

    LoRA introduces ‘intruder dimensions’—new, high-ranking singular vectors in the weight matrices that do not appear in full fine-tuning … these dimensions can cause task-specific details to be captured in a way that may lead to more rapid forgetting during continual learning.

  10. Kalajdzievski, ‘A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA’ (rsLoRA blog)https://d2jud02ci9yv69.cloudfront.net/2025-04-28-pessa-189/blog/pessa/

    The original implementation’s scaling can lead to ‘stunted learning’ as the rank increases, creating a false impression of performance saturation at low ranks … rsLoRA uses a 1/√r scaling factor to allow for stable learning even at very high ranks (up to 2048), theoretically unlocking much higher memorization capacities than previously thought possible.

  11. bycloud.ai newsletter — LoRA vs Full Fine-tuning commentaryhttps://mail.bycloud.ai/p/bitnet-a4-8-lora-vs-full-fine-tuning-and-mixture-of-in-context-learner

    Higher-rank LoRA with rank-stabilization (setting the scaling factor α = 2r) forces the LoRA solution to more closely mirror the distributed nature of full fine-tuning and reduces the emergence of intruder dimensions.

  12. ‘Leaner Training, Lower Leakage’ (arXiv 2506.20856)https://arxiv.org/abs/2506.20856

    LoRA can reduce memorization risks by up to a factor of 10 while maintaining comparable accuracy … factors like model scale and data duplication, which drive memorization in pre-training, do not exert the same influence during LoRA fine-tuning.

  13. Towards AI — ‘AI’s Revolutionary Attention Mechanism Is Just 1960s Statistics’https://pub.towardsai.net/the-mind-blowing-truth-ais-revolutionary-attention-mechanism-is-just-1960s-statistics-in-1b701d79ab3a

    Standard Transformer attention is mathematically equivalent to the Nadaraya-Watson estimator introduced in 1964… a ‘local constant’ estimator that suffers from boundary bias.

  14. OpenReview — Local Linear Attention (ICLR 2026 submission, FlashLLA)https://openreview.net/pdf?id=WGpzi489XY

    FlashLLA employs an iterative conjugate gradient solver that must stream the Key matrix multiple times per step, leading to significant wall-clock latency.

  15. MarkTechPosthttps://www.marktechpost.com/2026/05/31/parallax-a-parameterized-local-linear-attention-that-keeps-softmax-and-adds-a-learned-covariance-correction-branch/

    Parallax keeps softmax attention and adds a learned covariance correction branch… the prototype decode kernel matches or outperforms FlashAttention 2/3 across batch sizes.

  16. arXiv 2504.14366 — Subquadratic architecture scaling studyhttps://arxiv.org/html/2504.14366v3

    A ‘crossover’ occurs at the 1.7B scale: while xLSTM led at smaller sizes, Gated DeltaNet became the top performer, recovering 94.2% of the Transformer teacher’s performance.

  17. AIWeekly — ‘Parallax closes linear attention gap at LLM scale’https://aiweekly.co/alerts/parallax-closes-linear-attention-gap-at-llm-scale

    Parallax closes the linear attention gap… long-term viability depends on community kernel support beyond NVIDIA Hopper and CuTeDSL.

  18. LearnAIVisually explainerhttps://learnaivisually.com/ai-explained/parallax-local-linear-attention

    Some critics question if the gains will continue to scale beyond the 1.7B parameter mark, as the performance gap between local-linear and local-constant estimators can shrink in much larger models.

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare