Anthropic Institute opens, ElementsClaw verifies old hits, DXRG cuts 57% to 3%
An institute, a materials agent, and a trading swarm each headline a metric narrower than the claim it is asked to carry.
TL;DR
- Anthropic Institute launches a 4-pillar agenda 3 months after RSP v3 demoted the training-pause pledge to a flexible goal.
- ElementsClaw wet-lab-verified 4 Re-based superconductors, all under 7 K and incremental relative to the existing literature.
- ElementsClaw screened 2.4M crystals in 28 GPU hours, shortlisting 68k candidates before synthesis.
- DXRG’s prompt rewrites cut LLM rule-fabrication from 57% to 3% across 3,505 agents trading $20M on Base.
- DXRG’s 99.9% figure measures execution success, not strategic soundness or adversarial robustness.
Three research debuts land today, and each picks the metric most likely to flatter it. Anthropic opens its in-house research institute with a four-pillar agenda on labor, cyber/bio, epistemics, and recursive R&D — three months after the parent quietly downgraded its binding training-pause commitment to a flexible goal. ElementsClaw’s materials agent screened 2.4M crystals and lab-verified four new superconductors, all sitting in a Re-based corner of the phase diagram where comparable compounds were already known. DXRG’s 3,505-agent ETH trading swarm reports a 99.9% reliability figure that turns out to score whether trades executed, not whether they were sound.
The shared pattern isn’t bad faith — it’s narrowness. Each project picks a benchmark it can hit cleanly: a research charter that brackets the obligation it just relaxed, an incremental class of materials a wet lab can confirm, an execution counter that sidesteps strategy and adversarial pressure. The harder axes — pause enforcement, novel phase space, profit and robustness — are the ones outside reviewers keep flagging.
Anthropic Institute debuts as parent softens its pause pledge
Source: anthropic-research · published 2026-05-07
TL;DR
- The Anthropic Institute launches with a 4-pillar agenda spanning labor diffusion, cyber/bio threats, group epistemics, and recursive R&D.
- It arrives 3 months after RSP v3 reclassified Anthropic’s binding training-pause commitment as a “flexible goal”.
- The flagship Economic Index is already under fire for sampling only Claude users and treating jobs as task bundles.
- Project Glasswing’s discovery of a 27-year-old OpenBSD bug reignites the offense-defense debate TAI claims to study.
A research agenda with four pillars
Anthropic is consolidating its Frontier Red Team, Societal Impacts and Economic Research units into a single in-house think tank, The Anthropic Institute (TAI), and publishing a research agenda built around four pillars: economic diffusion, threats and resilience, “AI systems in the wild,” and AI-driven R&D. The pitch is that frontier-lab access lets TAI act as an early-warning system for policymakers — flagging things like junior-rung labor pipeline collapse, geopolitical “hotline” needs, and intelligence-explosion fire drills before regulators notice.
The headline novelty isn’t the topics — IST, CNAS and the AI Futures Project are already on most of them — but the funded fellowship (~$3,850/week stipend, $15k/month compute) and the formal pipeline from TAI findings into Anthropic’s Long-Term Benefit Trust [1].
The credibility problem TAI doesn’t mention
The agenda lands in a context the launch post elides. In February, Anthropic’s RSP v3 quietly removed the firm’s hard pause commitment, reclassifying training halts as “flexible goals” — a change that drew sharp dissent from the safety community and coincided with senior safeguards staff resigning [2]. The leadership reshuffle compounds the optics: Jack Clark moved from Head of Policy to a newly created “Head of Public Benefit” seat, while existing teams were rebadged under the Institute label [1].
You cannot credibly position yourself as an early-warning system while loosening the internal tripwires that give early warnings teeth. That tension is the story.
Methodological fault lines in the flagship work
The Economic Diffusion pillar rides on the Anthropic Economic Index, which independent economists have already labeled a “keyhole” view of the labor market.
“Using platform traces from a single provider reflects Anthropic’s user demographics rather than the broader workforce, and treats jobs as a sum of automate-able tasks rather than outcomes and judgment.” [3]
Apply a reliability discount and headline productivity gains drop from 1.8pp to 0.6–1.2pp [3]. The Threats pillar has the same problem in mirror image: Project Glasswing/Claude Mythos has demonstrated genuine frontier capability — thousands of zero-days, including a 27-year-old OpenBSD remote crash, plus $100M in defender credits [4] — but security analysts argue gated access to a few hyperscalers cannot outrun open-weight diffusion, and AI-speed discovery creates a remediation backlog humans cannot patch [5]. TAI promises to study offense-defense balance; Anthropic is shipping the model that tilts it.
Where it sits among peer labs
| Lab | Framing | Binding commitments |
|---|---|---|
| Anthropic | RSPs + behavioral constraints | Pause now “flexible” [2] |
| OpenAI | “Ship and govern” Preparedness | Iterative, post-hoc |
| DeepMind | Frontier Safety Framework | Benchmark-driven |
GovAI alumni populate the policy heads of all three [6], so expect convergent vocabulary even where the substance diverges.
What to watch
The test isn’t whether TAI publishes good papers — the staff list suggests it will. The test is whether it publishes findings that constrain Anthropic’s own roadmap, not just inform outsiders. A fellowship report arguing Claude Mythos should ship more slowly, or that the Economic Index overstates diffusion, would settle the honest-broker question. Anything less is a policy shop with a research budget.
ElementsClaw lab-verifies 4 superconductors from 2.4M crystals
Source: hf-daily-papers · published 2026-04-28
TL;DR
- ElementsClaw wet-lab synthesized 4 new Re-based superconductors, all with Tc ≤ 6.5 K.
- The agent screened 2.4M crystals in 28 GPU hours to shortlist 68k candidates.
- Re-Zr/Re-Hf α-Mn superconductors below 7 K are already known, making the four hits incremental.
- Predecessor systems A-Lab and GNoME failed independent reanalysis over XRD misreads and hallucinated phases.
What ElementsClaw actually does
ElementsClaw is a multi-agent loop wrapped around a 1-billion-parameter equivariant graph network (“Elements”) pretrained on 125M structures. The agent uses GPT-5 to mine literature, fine-tunes Elements into a binary classifier (Elements-C, AUC 0.996), screens unverified ternary chemistries, then ranks candidates by predicted Tc, thermodynamic stability, and synthetic feasibility before handing a shortlist to human experimentalists.
```mermaid
flowchart LR
    A[2.4M crystals<br/>MPDS, Kagome] --> B[Elements-T<br/>Tc regressor]
    L[Literature corpus] --> G[GPT-5<br/>label curation]
    G --> C[Elements-C<br/>classifier, AUC 0.996]
    B --> C
    C --> D[68k candidates<br/>28 GPU hours]
    D --> E[Elements-E<br/>stability + DFT]
    E --> F[Wet-lab synthesis<br/>4 verified Tc ≤ 6.5 K]
```
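To make the loop concrete, here is a minimal sketch of the screen-then-rank stage, assuming hypothetical interfaces: `elements_c`, `elements_t`, `stability_score`, and `feasibility_score` stand in for components the paper describes but does not expose as an API.

```python
# Minimal sketch of the screen-then-rank stage. All names here (elements_c,
# elements_t, stability_score, feasibility_score) are hypothetical stand-ins
# for components the paper describes but does not expose as an API.

def shortlist(crystals, elements_c, threshold=0.5):
    """Binary screen: keep structures Elements-C flags as likely superconductors."""
    return [x for x in crystals if elements_c.predict_proba(x) >= threshold]

def rank(candidates, elements_t, stability_score, feasibility_score):
    """Order survivors by predicted Tc, stability, and synthetic feasibility."""
    scored = [
        (x, elements_t.predict_tc(x), stability_score(x), feasibility_score(x))
        for x in candidates
    ]
    # Highest predicted Tc first, ties broken by stability then feasibility --
    # one plausible reading of the paper's three-criterion ranking.
    return sorted(scored, key=lambda s: (-s[1], -s[2], -s[3]))

# 2.4M structures in, ~68k out of the classifier, a handful to the wet lab:
# picks = rank(shortlist(all_crystals, elements_c), elements_t, stab, feas)[:16]
```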
The headline result is the closed loop: Zr₃ScRe₈ (6.5 K), HfZrRe₄ (5.9 K), Zr₄VRe₇ (3.5 K), and Hf₂₁Re₂₅ (2.5 K) were predicted, synthesized, and measured. That is a meaningfully higher bar than either A-Lab or GNoME cleared.
The novelty question
The four hits sit inside a well-trodden chemical neighborhood. Binary Re₆Zr and Re₆Hf are already known α-Mn-lattice noncentrosymmetric superconductors with Tc < 7 K, and Re-based high-entropy alloys in this exact family have been probed with muon-spin rotation for time-reversal symmetry breaking [7]. ElementsClaw’s new compounds are plausible ternary extensions of that motif, not chemical surprises — which is reassuring as a sanity check on the model but tempers the “discovery” framing. Independent coverage makes the same point: all four sit “far from the room-temperature grail” [8].
Why the precedent demands scrutiny
The autonomous-discovery genre has a credibility problem. Palgrave and Schoop’s reanalysis of Berkeley’s A-Lab concluded it “had not synthesized any truly novel materials,” tracing failures to AI misinterpretation of XRD patterns and undetected compositional disorder that let mixtures of known phases be labeled as new ones [9]. Cheetham and Seshadri were blunter about GNoME’s 2.2M “stable” crystals: many are algorithmic hallucinations, derivatives of known structures, or formulas no chemist would write [10].
ElementsClaw’s own negative case — Zr₂VRe₃ failing because the model missed pair-breaking from magnetic vanadium — is exactly the kind of subtlety that needs to be ruled out for the four positive hits too. Independent XRD and phase-purity verification is the test that matters, and the paper does not provide it.
The LLM-as-scientist hole
A Google Research expert evaluation, using high-Tc superconductivity as the testbed, found general LLMs “exhibited severe limitations” — incapable of reading figures, prone to conflating speculation with consensus, and useful as retrieval rather than reasoning tools [11]. ElementsClaw routes its literature synthesis and training-label curation through GPT-5: precisely the step that study flags as fragile. The authors concede the model will not generalize to strongly correlated unconventional superconductors (cuprates, nickelates) where DFT itself breaks down, and the alphaXiv overview notes the agent “still relies on human oversight and prompting” at stage transitions [12].
Net read
The wet-lab loop is real and is the genuine differentiator from prior autonomous-discovery claims. But the four verified compounds are incremental within a known α-Mn Re family [7], the headline Tc values are modest, and the LLM-driven labeling pipeline sits on the exact failure mode independent reviewers have hammered competitors for [9][10][11]. The right next step is independent XRD on the four hits — not another 2.4M-crystal screen.
DXRG’s prompt fixes cut LLM rule-fabrication from 57% to 3%
Source: hf-daily-papers · published 2026-04-27
TL;DR
- DXRG ran 3,505 LLM agents trading ETH on Base for 21 days, settling 300K on-chain actions worth $20M.
- Prompt rewrites cut “rule fabrication” from 57% to 3% and “fee paralysis” from 32.5% to under 10%.
- Structured sliders beat chat: users with concrete exit rules were 4.2× more likely to profit.
- The 99.9% figure measures execution success, not strategic soundness or adversarial robustness.
The harness-first thesis
DX Research Group’s 21-day “DX Terminal Pro” deployment is the largest published trial of LLM agents managing real capital: 3,505 user-funded vaults, 70 billion inference tokens, 300,000 on-chain actions, all routed through a single Qwen3-235B-A22B-Thinking model. The headline finding isn’t about Qwen. It’s that reliability lived almost entirely in the layers around the model — prompt compilation, policy validation, vault permissions — not in the weights.
That claim has independent backing. The April 2026 Agent Harness survey formalizes a “binding constraint thesis”: above a GPT-4 capability floor, harness engineering yields higher marginal reliability gains than swapping models [13]. CMU/Salesforce benchmarks make it concrete — natural-language multi-step agents hit 30–35% success, structured workflow frameworks 83%+ [14]. DXRG’s own slider-vs-chat result sits squarely in that gap.
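As an illustration of where that reliability lives, here is a minimal sketch of a harness-side policy layer, assuming invented names (`TradeIntent`, `VaultPolicy`); DXRG’s actual prompt-compilation and validation stack is not public.

```python
from dataclasses import dataclass

# Hypothetical illustration of a harness-side policy layer: the model proposes,
# deterministic code validates against vault limits before anything settles.
# TradeIntent and VaultPolicy are invented names; DXRG's stack is not public.

@dataclass
class TradeIntent:
    side: str            # "buy" or "sell", as parsed from the agent's output
    token: str
    notional_usd: float

@dataclass
class VaultPolicy:
    allowed_tokens: set
    max_notional_usd: float
    max_daily_trades: int

def validate(intent: TradeIntent, policy: VaultPolicy, trades_today: int) -> bool:
    """Reject anything outside the vault's least-privilege envelope."""
    return (
        intent.token in policy.allowed_tokens
        and intent.notional_usd <= policy.max_notional_usd
        and trades_today < policy.max_daily_trades
        and intent.side in ("buy", "sell")
    )

policy = VaultPolicy({"ETH"}, max_notional_usd=500.0, max_daily_trades=20)
print(validate(TradeIntent("buy", "ETH", 120.0), policy, trades_today=3))   # True
print(validate(TradeIntent("buy", "DOGE", 120.0), policy, trades_today=3))  # False
```

The model only proposes; a deterministic gate decides what can settle, which is also why an execution-success metric says little about whether the proposals were any good.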
What the prompt fixes actually fixed
The paper’s most useful contribution is a taxonomy of interpretation failures and the operating-layer patches that resolved them. None required retraining; one patch is sketched after the table.
| Failure mode | Symptom | Fix | Result |
|---|---|---|---|
| Rule fabrication | Agents invented “Hierarchy rule #2” to justify trades | Reframe prior reasoning as context, not precedent | 57% → 3% |
| Fee paralysis | 2.3% fee blocked all action | Move fee info later; contextualize vs. 10–50% daily moves | 32.5% → <10% |
| Tokenomics misread | Panic-selling during scheduled “reap” events | Insert whitepaper context as structured input | 42.9% → 78.0% capital deployed |
| Number hardening | Sliders treated as hard floors, inverting behavior | Replace numbers with comparative language | Restored gradient |
| Cadence trading | Agents traded on the polling rhythm | Instruct agent to ignore tick interval; filter memory | Cycles eliminated |
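The “number hardening” row lends itself to a sketch. A minimal, hypothetical version of the fix (bucket boundaries and phrasing invented here; the paper does not publish its prompt compiler) replaces the raw slider value with comparative language at prompt-compile time:

```python
# Hypothetical sketch of the "number hardening" fix: translate a raw risk
# slider into comparative language at prompt-compile time so the model reads
# a disposition rather than a hard numeric floor. Buckets and phrasing are
# invented; the paper does not publish its prompt compiler.

def compile_risk_clause(slider: float) -> str:
    """Map a 0-1 risk slider to qualitative guidance for the system prompt."""
    if slider < 0.25:
        return "The user is strongly risk-averse: prefer holding over trading."
    if slider < 0.5:
        return "The user is cautious: trade only on clear, sustained signals."
    if slider < 0.75:
        return "The user accepts moderate risk: routine rebalancing is fine."
    return "The user is risk-tolerant: aggressive entries are acceptable."

# Instead of injecting "risk_tolerance = 0.62" (which agents treated as a
# threshold and inverted behavior around), the compiled prompt carries prose:
print(compile_risk_clause(0.62))  # "The user accepts moderate risk: ..."
```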
Three of these — numeric fabrication, fee mishandling, observability gaps — map directly onto AIFinHub’s independent 2026 survey of recurring trading-agent failures [15]. The convergence suggests these are properties of the interface between LLMs and financial state, not artifacts of one vendor’s stack.
What 99.9% doesn’t measure
The reliability number is post-policy: it counts transactions that already passed validation and settled on-chain. It says nothing about whether the trades were good, and nothing about adversarial robustness.
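The gap is easy to state in code. A schematic sketch with invented field names (no DXRG trade schema is public) shows how the two denominators diverge:

```python
# Schematic distinction between what the 99.9% figure measures and what it
# doesn't. Field names are invented; DXRG's trade records are not public.

def execution_reliability(trades):
    """Share of policy-validated, submitted trades that settled on-chain."""
    submitted = [t for t in trades if t["passed_policy"]]
    return sum(t["settled"] for t in submitted) / len(submitted)

def profitable_share(trades):
    """A (still crude) soundness proxy: settled trades with positive P&L."""
    settled = [t for t in trades if t["settled"]]
    return sum(t["pnl_usd"] > 0 for t in settled) / len(settled)

# The first can sit at 0.999 while the second is anything at all:
trades = [
    {"passed_policy": True, "settled": True, "pnl_usd": -3.2},
    {"passed_policy": True, "settled": True, "pnl_usd": 1.1},
]
print(execution_reliability(trades), profitable_share(trades))  # 1.0 0.5
```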
DXRG’s own MEMEbench undercuts the rosier reading. Across 18,560 calls, the ticker $ANT saw an 84% higher selection rate than $MOON despite identical fundamentals — and 98% of reasoning traces cited “technical indicators” to justify the bias [16]. Chain-of-thought transparency is theatre when the bias is upstream of the reasoning.
“A human player bypassed [Freysa’s] logic by tricking the agent into misinterpreting its own approveTransfer function as a method for receiving funds rather than sending them.” [17]
DXRG’s structured policy layer is the right answer to Freysa-style prompt-level guardrails. But April’s KuCoin-documented incident — attackers poisoning agent long-term memory, cascading to a $40M Step Finance treasury drain via over-permissioned agents [18] — targets exactly the memory and permission surfaces DXRG’s design also exposes. A least-privilege vault bounds blast radius; it doesn’t stop a poisoned agent from making bounded-but-bad decisions all day.
The honest read: DXRG has shown that benign-failure reliability is a harness problem, and largely a solved one. Adversarial reliability is the next paper.
Round-ups
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
Source: hf-daily-papers
Z.ai’s GLM-5V-Turbo bakes multimodal perception directly into the reasoning loop of an agent foundation model, posting strong scores on multimodal coding and visual tool-use benchmarks while preserving text-only capabilities. Code is on GitHub under zai-org/GLM-V.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
Source: hf-daily-papers
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis in a single Diffusion Transformer, adding a depth prediction branch over RGB-D video and an asynchronous denoising schedule that decouples action and scene generation timesteps for efficiency.
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Source: hf-daily-papers
TIDE distills knowledge from autoregressive teachers into diffusion LLMs across architectures and tokenizers, introducing TIDAL for noise-dependent reliability weighting, CompDemo for complementary mask splitting, and a Reverse CALM objective that performs chunk-level likelihood matching across vocabularies.
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Source: hf-daily-papers
Applying speculative decoding inside RL post-training rollouts, the system pairs vLLM with Eagle3-style draft models and MTP heads while preserving the target output distribution, with a performance simulator projecting roughly 2.5x end-to-end training speedup at large rollout scales.
Large Language Models Explore by Latent Distilling
Source: hf-daily-papers
Exploratory Sampling trains a lightweight distiller to predict an LLM’s deep-layer hidden representations from shallow ones, then uses the prediction gap as a novelty signal to bias decoding toward semantically diverse continuations without retraining the base model.
Probing Visual Planning in Image Editing Models
Source: hf-daily-papers
The AMAZE dataset reframes visual planning as single-step image-to-image transformation over abstract maze puzzles, finding that both autoregressive and diffusion editors struggle zero-shot and remain well below human efficiency even after fine-tuning.
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
Source: hf-daily-papers
RADIO-ViPE performs open-vocabulary semantic SLAM from raw monocular RGB video alone, skipping camera intrinsics and depth sensors by fusing agglomerative foundation-model embeddings into the factor graph with adaptive robust kernels to handle dynamic scenes.
Footnotes
1. PureAI — https://pureai.com/articles/2026/03/11/anthropic-institute.aspx: “Co-founder Jack Clark transitioned from Head of Policy to the newly created role of Head of Public Benefit, while Sarah Heck took over Public Policy; the Institute centralizes the Frontier Red Team, Societal Impacts and Economic Research under one roof.”
2. KESQ/CNN Business — https://kesq.com/money/cnn-business-consumer/2026/02/25/anthropic-ditches-its-core-safety-promise-in-the-middle-of-an-ai-red-line-fight-with-the-pentagon/: “Anthropic ditches its core safety promise in the middle of an AI red-line fight with the Pentagon… reclassifying the pause commitment as a ‘flexible goal’ rather than a hard commitment.”
3. Forbes (Hamilton Mann) — https://www.forbes.com/sites/hamiltonmann/2026/03/08/anthropics-study-does-not-measure-ais-labor-market-impacts/: “Anthropic’s study does not measure AI’s labor market impacts… using platform traces from a single provider reflects Anthropic’s user demographics rather than the broader workforce, and treats jobs as a sum of automate-able tasks rather than outcomes and judgment.”
4. CyberScoop — https://cyberscoop.com/project-glasswing-anthropic-ai-open-source-software-vulnerabilities/: “Claude Mythos identified thousands of zero-day vulnerabilities including a 27-year-old remote crash bug in OpenBSD; Anthropic committed $100M in usage credits and $4M in donations to open-source security orgs to facilitate patching.”
5. Platformer — https://www.platformer.news/anthropic-mythos-cybersecurity-risk-experts/: “Once comparable capabilities reach open-source or ransomware actors, they will be able to weaponize bugs at machine speed… AI-driven discovery creates an exponential remediation problem the human patch pipeline cannot absorb.”
6. EA Forum, ‘I read every major AI lab’s safety plan’ — https://forum-bots.effectivealtruism.org/posts/fsxQGjhYecDoHshxX/i-read-every-major-ai-lab-s-safety-plan-so-you-don-t-have-to: “Anthropic emphasizes internal behavioral constraints and RSPs while OpenAI follows an iterative ‘ship and govern’ model and DeepMind’s Frontier Safety Framework prioritizes the science of safety; GovAI alumni populate the policy heads of all three.”
7. ResearchGate, High-Entropy Alloy Superconductors on an α-Mn lattice — https://www.researchgate.net/publication/326957823_High-Entropy_Alloy_Superconductors_on_an_a-Mn_lattice: “Re6Zr and Re6Hf were identified as noncentrosymmetric superconductors with Tc values typically below 7 K… muon-spin rotation experiments have revealed spontaneous magnetic fields below Tc in Re-based alloys like Re6(Zr,Hf).”
8. AIExpert News, independent coverage of ElementsClaw — https://www.aiexpert.news/en/article/elementsclaw-agentic-framework-closes-the-loop-on-ai-driven-materials-discovery: “screened approximately 2.4 million stable crystals in just 28 GPU hours… all four new superconductors exhibit transition temperatures below 7 K, far from the room-temperature grail.”
9. Chemistry World, Palgrave & Schoop reanalysis of Berkeley A-Lab — https://www.chemistryworld.com/news/new-analysis-raises-doubts-over-autonomous-labs-materials-discoveries/4018791.article: “the A-Lab had not synthesized any truly novel materials… systematic errors in the AI’s interpretation of XRD data… failed to account for compositional disorder and misidentified mixtures of known compounds as new materials.”
10. PMC, Cheetham & Seshadri critique of GNoME — https://pmc.ncbi.nlm.nih.gov/articles/PMC13107388/: “many are mere ‘hallucinations’ of the algorithm… a large fraction of these candidates are simple derivatives of known crystals or lack the novelty, credibility, and utility required to be termed ‘materials’.”
11. Google Research blog, Expert Evaluation of LLM World Models (high-Tc case study) — https://research.google/blog/testing-llms-on-superconductivity-research-questions/: “all models exhibited severe limitations… a primary failure was the models’ total incapacity to engage with data visualization… LLMs frequently conflate speculative claims with scientific consensus.”
12. alphaXiv overview of 2604.23758 — https://www.alphaxiv.org/overview/2604.23758v1: “while the system plans and executes stages autonomously, it still relies on human oversight and prompting to harmonize the process and ensure physical fidelity, suggesting it is an assistant rather than a fully independent scientist.”
13. Agent Harness for LLM Agents: A Survey (ResearchGate, Apr 2026) — https://www.researchgate.net/publication/403692868_Agent_Harness_for_Large_Language_Model_Agents_A_Survey: “A significant portion of reported agent failures are actually ‘harness failures’ caused by poorly specified environments rather than model limitations… the bottleneck for production-grade agents is no longer raw intelligence but the infrastructure that governs its execution.”
14. The Register on the CMU/Salesforce study — https://www.theregister.com/software/2025/06/29/ai-agents-wrong-70-of-time-carnegie-mellon-study/660959: “Agents using natural language for multi-step tasks achieved a success rate of only 30-35%, [while] those operating within structured ‘Workflow Execution’ frameworks reached success rates higher than 83%.”
15. AIFinHub, 5 Failure Modes of LLM Trading Agents — https://aifinhub.io/articles/5-failure-modes-llm-trading-agents/: “Five recurring failure modes in 2026: prompt drift (where model updates degrade strategy), silent numeric fabrication, price-blind contamination, token-cost runaway, and ‘audit amnesia’ — the inability of an agent to explain its own trade history.”
16. MEMEbench (terminal.markets) — https://memebench.terminal.markets/: “$ANT saw an 84% higher selection rate than $MOON despite identical fundamentals… In 98% of 18,560 inference calls the LLMs justified purchases by citing technical indicators, even though the underlying data was identical to less-preferred tickers.”
17. The Block on the Freysa exploit — https://www.theblock.co/post/328747/human-player-outwits-freysa-ai-agent-in-47000-crypto-challenge: “A human player (‘p0pular.eth’) bypassed its logic by tricking the agent into misinterpreting its own approveTransfer function as a method for receiving funds rather than sending them.”
18. KuCoin, AI Trading Agent Vulnerability 2026 — https://www.kucoin.com/blog/en-ai-trading-agent-vulnerability-2026-how-a-45m-crypto-security-breach-exposed-protocol-risks: “Attackers targeted the long-term memory of agents, poisoning decision-making data that spread downstream to connected agents within hours… Step Finance AI agents with excessive permissions amplified a treasury drain to $40 million.”