Shape of Addition at 33%, PropMe finds 36× leak gap, SeedDance leads Dream.exe

TL;DR

Shape of Addition probe lifts Qwen3-4B arithmetic from 23.2% to 33.0% across three open models.
PropMe measures 36× lower leakage under benign prompts than under prefix attacks.
SeedDance 2.0 tops Dream.exe robot execution at 16.5%, with visual plausibility at r=-0.03.
Aerni et al. still finds 15% verbatim web text in chatbot output under innocuous prompts.
Cosmos Policy robot-specific model loses to general video generators Kling and SeedDance.

Three results today pull apart two things that looked like one. Shape of Addition frames LLM arithmetic errors as geometric drift across quantization thresholds, lifting Qwen3-4B accuracy from 23% to 33% — while Nikankin et al.’s rival bag of heuristics view argues there’s no unified circuit for the geometry to describe. PropMe splits memorization into capability and propensity, with the DFM Decoder leaking 36× less under benign prompts than under prefix attacks — the exact distinction NYT v. OpenAI turns on. And in Dream.exe, visual plausibility and robot execution success correlate at r = -0.03: the prettiest physics ranks last, and the robot-specific Cosmos Policy loses to general-purpose video generators.

The round-ups push the same instinct one layer out — SABER separating coding-agent safety inside stateful workspaces from refusal-test scores, belief-entropy rewards splitting epistemic uncertainty from final-answer correctness, repeated policy regret distinguishing adaptive opponents from static ones. The shared move: stop treating two coupled-looking things as one.

PropMe shows base LLMs leak 36× less under benign prompts

TL;DR

DFM Decoder leaks 36× less under generic prompts than under prefix attacks (NVR 0.0363 → 0.0010).
PropMe splits memorization into capability vs propensity, with deterministic attribution via SimpleTrace on infini-gram.
Concurrent Aerni et al. finds up to 15% verbatim web text in chatbot output under innocuous prompts.
That gap suggests propensity scores are sharply sensitive to prompt design and don’t transfer to deployed assistants.
NYT v. OpenAI turns on this exact distinction — OpenAI calls regurgitation a “rare bug” from “deceptive prompts.”

What PropMe actually measures

The paper’s core move is to stop treating “can the model spit out training data” and “does the model spit out training data” as the same question. PropMe evaluates a model in two settings — a capability setting where it’s fed the first 50 tokens of a known 100-token training sequence, and a propensity setting where it’s given plausible, naturally-phrased prompts that don’t appear in the corpus — then folds the ratio into a single propensity metric where <0.5 means “leaks less spontaneously than under attack.”

The attribution side is SimpleTrace: a deterministic pipeline that extracts maximal verbatim spans, filters out boilerplate by unigram rarity, and locates the source document. It’s built on infini-gram suffix-array indices, which is fast for counting but inherits known limits — it can’t catch paraphrase or near-duplicate memorization, only exact matches ¹.

The 36× gap, and what it doesn’t cover

The headline number is striking. For DFM Decoder on Dynaword, near-verbatim recall is 0.0363 under prefix attack and 0.0010 under generic prompts; full-match ratio collapses from 0.0700 to 0.0000. Span-length data tells the same story: average verbatim runs roughly double under attack, from 27.95 tokens generically to 50.35 under prefix conditioning, and roughly 23% of attacked spans land in the 21–50 token range vs 12–16% for benign prompts ². The continual-pretraining results add a useful side finding — training the English Comma model on a Danish/English mixture reduced Common Pile memorization (average longest span 50.35 → 40.83), and one training stage was enough to lock in the new memorization profile.

The problem is what the propensity setting doesn’t probe. Aerni et al.’s concurrent benign-prompt study — asking models to write tutorials and letters — found up to 15% of output was verbatim internet text, with occasional 100% document matches ³. That’s a far less reassuring picture, and it implies PropMe’s “generic” probes for base models on open corpora may simply miss the input regions where deployed assistants regurgitate. Carlini’s “poem poem poem” divergence attack reinforces the worry: alignment behaves as a fragile harness on top of the same underlying memorization, not a real reduction ⁴.

Why courts care

The capability/propensity vocabulary maps directly onto the live NYT v. OpenAI fight. OpenAI’s defense calls regurgitation “rare bugs” surfaced by “deceptive prompts” — a propensity argument — while the Times leans on capability evidence. Legal commentary suggests courts may treat retention and reproduction as material to fair use regardless of how the prompt was crafted ⁵, which would blunt the propensity defense even if PropMe-style numbers are low.

Net read

PropMe is a real audit tool, but a narrow one: it works for open models with known corpora, exact-match memorization, and a specific class of benign probes. It does not refute Aerni’s findings, does not detect paraphrase, and does not transfer to closed aligned models where divergence attacks still pull megabytes out ⁴. Reproducing it also requires full corpus access, a tokenizer-aligned infini-gram index, and Python 3.11.9+; the metric returns null when both capability and propensity scores are zero ⁶. Useful framework, smaller claim than the 36× number suggests.

Geometric probe catches LLM addition errors, plateaus at 33%

TL;DR

“Shape of Addition” frames LLM arithmetic errors as geometric slippage — activations drift across quantization thresholds in the residual stream.
A consistency-check probe at δ=0.12 lifts Question Accuracy from 23.2% to 33.0% on Qwen3-4B, Llama-3.1-8B, and Mistral-7B.
A probe predicting the model’s wrong token hits 98.81%, vs. 94.85% for a ground-truth probe on the same activation.
Nikankin et al.’s “bag of heuristics” view argues there is no unified circuit to be geometric about.

The geometry, briefly

The paper introduces the Iso-Raw-Sum Trajectory (IRST): during multi-operand addition, residual-stream representations are anchored by semantic digits and modulated by continuous “carry fibers.” Errors aren’t logical slips — they’re a latent Carry Potential getting pushed across a quantization boundary by internal neural noise. Off-by-one errors, in this telling, are what happens when a continuous signal lands on the wrong side of a discrete decision.

This is the third floor of a building Nanda et al. started with the Fourier “Clock” circuit for modular addition in one-layer transformers ⁷, and that Kantamneni & Tegmark extended into a magnitude-plus-period “helix.” IRST adds fiber-bundle structure on top.

The causal evidence is the probe asymmetry

The cleanest result isn’t the geometric story — it’s the probe ablation. A linear probe trained to predict the model’s own (wrong) output token from a single activation vector reaches 98.81%, while a probe trained to predict the ground-truth answer from the same vector only manages 94.85% ⁸. That gap is the load-bearing claim: when the model is wrong, the activation has already physically moved into the wrong basin. It isn’t representing both answers and picking badly; it has committed.

That asymmetry is what makes detect-and-correct feasible. The geometric consistency check at intervention threshold δ=0.12 lifts Token Accuracy from 87.27% to 89.73% and Question Accuracy from 23.20% to 33.00% on GSM8K/MATH addition subsets across Qwen3-4B, Llama-3.1-8B, and Mistral-7B ⁹¹⁰.

Two reasons to keep the champagne corked

33% is still 33%. Two-thirds of full-string answers remain wrong after correction. This is a probe-level demonstration on mid-scale open weights, not a production fix, and no frontier model was touched.

The underlying geometry is contested. Nikankin et al.’s “Arithmetic Without Algorithms” ¹¹ argues LLMs don’t implement any unified circuit — clock, helix, or IRST — but rather a bag of heuristics: range-specific MLP neurons that recognize particular number bands or modulo patterns.

Under that view, IRST might be a post-hoc smoothing of what is actually patchwork, and “geometric slippage” could be redescribing heuristic miscoverage in continuous language.

Anthropic’s attribution-graph work on arithmetic adds a second worry ¹²: consistency/validation heads in real models fire in middle layers, before the computation completes, so a model can “know” an answer is inconsistent and still lack the downstream capacity to repair it. An inference-time geometric check sits in the same awkward spot — it can flag the slip, but the architecture may not give it room to fix the result.

What’s actually new

The contribution is vocabulary and a working probe, not a settled mechanism. IRST, carry fiber, and geometric slippage are useful handles, and Probe Versatility — the finding that one activation vector carries separable ground-truth and hallucination signals — is a real interpretability result. Whether it generalizes past mid-scale open weights, and whether the “one geometry” framing survives contact with the heuristics camp, is the open question.

Dream.exe: best video model executes 16.5% of robot tasks

TL;DR

SeedDance 2.0 leads at 16.5% overall task success when Dream.exe runs video outputs as robot trajectories in MuJoCo.
Visual plausibility and execution success correlate at Pearson r = -0.03 — the prettiest physics ranks last.
Robot-specific Cosmos Policy (10.7%) loses to general-purpose video generators like SeedDance and Kling.
Ground-truth depth lifts scores across all 8 models — the 2D→3D pipeline is partly what’s failing.

The benchmark: don’t score the video, run it

Most video-generation evals ask whether a clip looks physical. Dream.exe asks whether it is physical: take the generated frames, extract a 3D end-effector trajectory (SAM2 + Grounding DINO for masks, CoTracker for 2D points, a LoRA-tuned monocular depth model to lift to 3D), then execute the trajectory on a simulated arm in MuJoCo and check whether the cup actually got picked up. 101 RoboCasa365 tasks, three difficulty tiers from single-primitive grasps to multi-stage drawer-and-retrieve sequences.

The headline number is the correlation, not the leaderboard. Across eight frontier models, the human/VLM score for “does this video obey physics?” correlates with actual robot success at Pearson r = -0.03. LTX-Video ranks first on perceived plausibility and dead last on execution. That’s the perceptual-physical gap, quantified.

The leaderboard is grim across the board

Model	SR-B (overall)	Notes
SeedDance 2.0	16.5%	Best general video model
Kling 3.0	14.6%	Only model with non-zero L3 (6.2%)
Wan 2.7	13.6%	Open-source
Cosmos Policy	10.7%	Robot-specific, loses to general models
Veo 3.1	6.9%
LTX-Video	3.9%	Ranked #1 on visual plausibility

For scale: GR00T N1.5, a purpose-built VLA on the same simulator family, hits 43% on seen atomic tasks and collapses to 4.4% on zero-shot composite tasks ¹³. So video-gen models are roughly an order of magnitude behind specialist policies on the easy stuff and effectively zero on long-horizon — bad, but not unprecedented. Only Kling cleared zero on Level 3.

How much of this is the generators vs. the pipeline?

The strong reading — “video models don’t internalize physics” — fits a broader pattern. WorldModelBench finds top closed models still violate mass conservation ~12% and gravity ~7% of the time ¹⁴. VideoPhy-2 reports frontier models hit only ~22% “joint performance” on hard physical-commonsense cases ¹⁵. It’s also rhetorical ammunition for the LeCun position that pixel-space prediction is the wrong substrate for world models in the first place ¹⁶.

But Dream.exe’s own ablation undercuts the strongest version of the claim. Replacing estimated depth with ground-truth simulator depth materially improves execution success across all eight models ¹⁷ — meaning a non-trivial share of the failure is the monocular-depth lift, not the dream. The honest reading is weaker: visual fidelity and physical fidelity are decoupling as models scale, and current 2D→3D extraction adds its own lossy layer on top.

What’s actually at stake

Even if the executability number were 60% instead of 16%, dream-and-execute isn’t a control loop — it’s a planner. The real contribution here isn’t the leaderboard; it’s that “physical executability in a simulator” is now a grounded eval signal for video models, replacing yet another round of human aesthetics ratings. That’s worth more than the specific r = -0.03.

Round-ups

SABER benchmark exposes safety lapses in LLM coding agents

SABER evaluates coding agents inside stateful project workspaces rather than isolated prompts, revealing high harmful safety-violation rates even from aligned models. The benchmark argues that refusal tests miss environment-aware risks like overwriting files or corrupting state during multi-step agent execution.

Belief-entropy rewards sharpen LLM agent memory on long tasks

Metacognitive Memory Policy Optimization rewards agents for reducing belief entropy about the latent task state, not just final answers. Targeting epistemic uncertainty in recursive summaries cuts the information loss that degrades memory-augmented LLMs across long-horizon trajectories.

Study maps how LLM agents internalize past experience without collapse

Researchers dissect continual learning for self-evolving agents along three axes: experience granularity, injection pattern, and internalization regime. Principle-level experiences with on-policy context-distillation avoid the capability collapse seen in instance-level, step-wise injection during multi-iteration training.

Normalizing flows replace chain-of-thought with latent reasoning

A TARFlow-based framework runs intermediate reasoning in continuous latent space while keeping autoregressive KV-cache decoding intact. The method supports probabilistic sampling, likelihood estimation, and policy-gradient optimization, posting gains over discrete chain-of-thought on code-generation benchmarks.

ADR synthesizes verifiable code tasks for scalable RLVR training

Atomic Decomposition and Recombination breaks seed problems into reusable units, then recombines them into novel verifiable tasks spanning algorithms, tool use, and data science. The pipeline scales RLVR training data beyond heuristic seed expansion while preserving automatic reward checking.

RL teaches LLMs to translate languages they’ve never seen

Reinforcement learning trains models to treat translation as a contextual meta-skill, using grammar and dictionary snippets at inference rather than memorized pairs. The approach beats supervised fine-tuning on chrF for unseen languages and transfers zero-shot to new linguistic contexts.

Repeated policy regret tightens guarantees against adaptive opponents

A new game-theoretic framework replaces external regret with repeated policy regret to handle opponents that adapt across rounds. The authors prove sublinear regret using non-convex optimization with a convex surrogate, yielding stronger equilibrium guarantees that align with subgame perfect equilibria in repeated games.

Javier Rando blog, infini-gram and memorization critique — https://javirando.com/llm/

Suffix-array engines excel at counting but are significantly slower than in-memory baselines when reconstructing long text passages… it cannot distinguish between homonyms because its understanding is confined to local string frequencies—an exact-match search engine, not a semantic one.

↩
ResearchGate figure: ‘Span length distributions for Common Pile Comma model’ — https://www.researchgate.net/figure/Span-length-distributions-for-Common-Pile-Comma-model-across-generic-specific-and_fig5_406039384

Approximately 23% of prefix-attack spans fall within the 21-50 token range, compared to only 12-16% for generic or specific prompts; under generic prompts the Comma model produces average verbatim spans of roughly 27.95 tokens, jumping to 50.35 tokens under direct prefix attacks.

↩
Aerni et al., ‘Measuring Non-Adversarial Reproduction of Training Data in LLMs’ (arXiv 2411.10242) — https://arxiv.org/abs/2411.10242

In tests using innocuous prompts—such as asking a model to write a tutorial or a letter—popular conversational models produced outputs where up to 15% of the text consisted of verbatim snippets from the internet; in some instances, entire responses were found to be 100% matches to training data.

↩
Carlini et al., ICLR 2025 proceedings on extractable memorization and alignment — https://proceedings.iclr.cc/paper_files/paper/2025/file/7cdf000d22c6cda21f3cbd7467aaf26f-Paper-Conference.pdf

Attacks such as the ‘poem-poem-poem’ divergence show that production models can be forced to bypass their alignment filters and spit out megabytes of training data through simple repetitive prompts… alignment acts as a fragile ‘harness artifact’ rather than a fundamental change to the model’s underlying knowledge.

↩ ↩²
Medium / Patent AI Lab, ‘NYT vs OpenAI Lawsuit Update 2026: Did Regurgitation Kill the Fair Use Defense?’ — https://medium.com/@patentailab/nyt-vs-openai-lawsuit-update-2026-did-regurgitation-kill-the-fair-use-defense-d63ff021b805

OpenAI defends these instances as ‘rare bugs’ rather than intended features, arguing that the Times used ‘deceptive prompts’ or ‘prompt hacking’ to force the model into an unnatural state of recall… the ability to prove that a model tends to memorize—rather than just being capable of it under duress—may determine the future of AI copyright liability.

↩
PropMe GitHub repo (N-essuno/PropMe) — https://github.com/N-essuno/PropMe

Reproducing the framework requires Python 3.11.9+, building an infini-gram index for the target corpus, and precomputing unigram probabilities; the PropMe metric returns a null value (0.0) when both prefix-based capability and non-adversarial propensity values are zero, and any tokenizer mismatch between the LLM and the index invalidates retrieval.

↩
Nanda et al., ‘Progress Measures for Grokking via Mechanistic Interpretability’ (arXiv:2301.05217) — https://arxiv.org/abs/2301.05217

one-layer transformers trained on modular arithmetic develop a ‘trig-based’ circuit… addition performed via rotation in a Fourier basis (the ‘Clock’ algorithm)

↩
arXiv HTML of Shape of Addition (probe ablation) — https://arxiv.org/html/2606.03645v1

a Model Output probe achieved 98.81% while a Ground Truth probe lagged at 94.85%, confirming activation vectors physically drift to incorrect basins during errors

↩
dou.ac paper summary (intervention table) — https://paper.dou.ac/p/2602.03595v1

performance metrics peak at an optimal intervention threshold of δ = 0.12… Token Accuracy 0.8727 → 0.8973 (+0.0246); Question Accuracy 0.2320 → 0.3300 (+0.0980)

↩
ChatPaper summary (models & benchmarks) — https://chatpaper.com/zh-CN/paper/290225

experiments primarily utilized Qwen3-4B, Llama-3.1-8B, and Mistral-7B… accuracy measured using GSM8K and MATH datasets, focusing on multi-digit and multi-operand addition where ‘off-by-one’ errors are most frequent

↩
Nikankin et al., ‘Arithmetic Without Algorithms’ (ICLR 2025, OpenReview) — https://openreview.net/pdf?id=i6fc97RY1l

LLMs actually rely on a ‘bag of heuristics’ — a distributed set of specialized MLP neurons that recognize specific number ranges or modulo patterns — rather than a single clean algorithm

↩
Anthropic transformer-circuits.pub, attribution graphs case study — https://transformer-circuits.pub/2025/attribution-graphs/biology.html?slug=capital-state-dallas

internal ‘validation circuits’—often centered on ‘consistency heads’ that match digits—frequently fire in the middle layers before the computation is complete, so the model may ‘know’ a result is inconsistent but fail to correct it

↩
RoboCasa 2026 leaderboard (GR00T N1.5 results) — https://robocasa.ai/leaderboard.html

43.0% success rate on seen atomic tasks… drops to 9.6% for seen composite tasks and 4.4% for zero-shot unseen composite tasks

↩
EmergentMind — WorldModelBench summary — https://www.emergentmind.com/topics/worldmodelbench

even top-tier closed models suffer from mass conservation violations (~12%) and gravity errors (~7%)

↩
VideoPhy-2 benchmark paper (arXiv 2602.13294) — https://arxiv.org/html/2602.13294v3

frontier models achieve only roughly 22% ‘joint performance’ (satisfying both semantic adherence and physical logic) on difficult test cases

↩
Futurism — Yann LeCun on Sora as world simulator — https://futurism.com/the-byte/openai-video-ai-doomed-meta-scientist

generating pixels is a fundamentally inefficient way to learn world dynamics… ‘doomed to failure’ for understanding the world

↩
Moonlight review of Dream.exe — https://www.themoonlight.io/en/review/dreamexe-can-video-generation-models-dream-executable-robot-manipulation

visual quality is a poor predictor of physical executability… replacing estimated depth maps with ground-truth depth from the simulator leads to a marked improvement in execution success across all tested models

↩
Dtsbourg — Chris Paxton-style critique of video world models — https://dtsbourg.me/en/articles/predictions-embodied-ai

the seconds required for generation make it unusable for real-time reactive control where robots must respond… in milliseconds… most models fail when objects, camera positions, or environments shift even slightly

↩

Shape of Addition at 33%, PropMe finds 36× leak gap, SeedDance leads Dream.exe

TL;DR

PropMe shows base LLMs leak 36× less under benign prompts

TL;DR

What PropMe actually measures

The 36× gap, and what it doesn’t cover

Why courts care

Net read

Geometric probe catches LLM addition errors, plateaus at 33%

TL;DR

The geometry, briefly

The causal evidence is the probe asymmetry

Two reasons to keep the champagne corked

What’s actually new

Dream.exe: best video model executes 16.5% of robot tasks

TL;DR

The benchmark: don’t score the video, run it

The leaderboard is grim across the board

How much of this is the generators vs. the pipeline?

What’s actually at stake

Round-ups

SABER benchmark exposes safety lapses in LLM coding agents

Belief-entropy rewards sharpen LLM agent memory on long tasks

Study maps how LLM agents internalize past experience without collapse

Normalizing flows replace chain-of-thought with latent reasoning

ADR synthesizes verifiable code tasks for scalable RLVR training

RL teaches LLMs to translate languages they’ve never seen

Repeated policy regret tightens guarantees against adaptive opponents

Footnotes

Jack Sun, writing.