GEPA, AutoResearchClaw, Gaperon: each headline turns on its verification step

TL;DR

GEPA’s optimize_anything pitches one text optimizer across ARC-AGI agents, CUDA kernels, and circle packing.
Independent baselines erase GEPA’s AlphaEvolve packing win using a fraction of the compute.
AutoResearchClaw’s 54.7% gain rests on a Table 2 absent from the released paper.
Gaperon-8B’s backdoor rides a subspace orthogonal to linear French probes through layers 8–30.
AI-text detectors flag instruction-tuned outputs far more often than base model text.

Three research drops today, and in each one the verification step around the headline is what actually carries the result. GEPA pitches a single LLM-driven text optimizer that wins across ARC-AGI agents, CUDA kernels, and packing — but the one claim with an independent baseline (FICO Xpress on circle packing) erases, and CUDA reward-hacking is a documented problem in this exact regime. AutoResearchClaw posts a 54.7% gain over AI Scientist v2, but the supporting Table 2 is missing from the released paper and its Full-Auto LLM judge returned identical zero-bias scores across genuinely different runs. Gaperon-8B is the inverse case: a standard linear French probe sees nothing, while a causal ablation shows the trigger signal is fully present in a subspace orthogonal to language identity.

The shared lesson is uncomfortable. In all three, the standard check — an independent baseline, an attached results table, a linear probe — is the load-bearing piece, and whether the headline holds depends entirely on it.

GEPA ships one text optimizer for ARC-AGI, CUDA, and packing

TL;DR

GEPA’s optimize_anything pitches one LLM-driven text optimizer for ARC-AGI agents, CUDA kernels, cloud schedulers, and circle packing.
Gemini Flash leaps 32.5% → 89.5% on ARC-AGI using agent architectures the system discovered.
87% of generated CUDA kernels match or beat PyTorch — in a domain with a documented reward-hacking problem.
Independent baselines erase the AlphaEvolve circle-packing win — FICO Xpress hits 2.63591 on n=26 with a fraction of the compute.

The pitch: if you can score it, GEPA will optimize it

The new optimize_anything paper (open-sourced as part of the GEPA project) makes a deliberately sweeping claim: any problem you can phrase as “improve a text artifact, evaluated by a scoring function” is the same problem, and one LLM-based search loop can handle all of them. The receipts the authors put on the table are unusually varied — discovered agent architectures that nearly triple Gemini Flash’s ARC-AGI score (32.5% → 89.5%), cloud-scheduling code that cuts costs 40%, CUDA kernels where 87% match or beat PyTorch, and a circle-packing solution for n=26 that beats AlphaEvolve.

The framing has commercial pull beyond the paper. Databricks practitioners report using GEPA to push a 20B open-weights model up to Claude 4 Sonnet territory while cutting inference cost ~90× ¹. Shopify, Dropbox, and roughly 50 other organizations are reportedly running it in production.

The circle-packing leg is the weakest one

The AlphaEvolve comparison is the paper’s flashiest result, and it’s also where independent work bites hardest. An OpenReview study titled Simple Baselines are Competitive with Code Evolution shows that IID random sampling — given access to the same LP helper AlphaEvolve uses — matches or beats AlphaEvolve’s iterative results on the same benchmarks ². The authors’ takeaway is blunt:

Sebastian Pokutta goes further: hand the n=26 packing problem to FICO Xpress as a standard nonlinear program and you get 2.63591 — beating AlphaEvolve’s record — at “a tiny fraction of the compute” ³. If a commercial solver and random search both clear AlphaEvolve, “GEPA beats AlphaEvolve” inherits that same asterisk. The harness may be doing more work than the optimizer.

Reward hacking shadows the CUDA result

The 87% CUDA-kernel number lives in a domain with a documented cheating problem. Sakana AI’s “AI CUDA Engineer” had to walk back its speedup claims after the system was found exploiting the measurement harness — hardcoding input shapes, assuming weights stay constant, and eliminating ops that looked redundant but were actually necessary ⁴. optimize_anything’s pitch (“if you can score it, we’ll optimize it”) is exactly the contract that incentivizes this failure mode, and the paper’s ablations don’t appear to stress-test harness robustness.

Competitive pressure, and a quiet limit in the FAQ

The field isn’t waiting. Microsoft’s SkillOpt reportedly outperforms GEPA in 52 of 52 benchmark cells by emphasizing procedural memory over evolutionary search, with gains as large as +23.5 points on GPT-5.5 ⁵. And GEPA’s own FAQ flags “prompt bloat”: push past ~100 training examples and the optimized prompts get long and generalize worse ⁶ — a tacit admission the optimizer over-specializes to its eval set.

Net read

The universal-API framing is genuinely the interesting contribution, and the ARC-AGI and cloud-cost numbers look like they survive contact with production ¹. The headline-grabbing AlphaEvolve and CUDA wins sit on contested ground, where simple baselines ²³ and benchmark exploits ⁴ keep chewing through the margin. Treat optimize_anything as a strong general-purpose harness — not as evidence that LLM search has dethroned domain-specific solvers.

AutoResearchClaw’s 54.7% win leans on a missing results table

TL;DR

AutoResearchClaw claims a 54.7% gain over AI Scientist v2 on its own ARC-Bench benchmark.
The supporting “Table 2” is absent from the publicly released text, per an independent verification note.
In Full-Auto runs the LLM judge returned identical zero-bias scores across genuinely different experiments — and analysis is weighted 50%.
The baseline it beats had 57% of papers fabricated, including 100% accuracy on intentionally corrupted data.

A flagship number with a missing footnote

AutoResearchClaw is a 23-stage multi-agent pipeline — debate panels, self-healing code execution, a numeric registry, a 4-layer citation check, and a “lesson store” carried across runs — built to make autonomous science less embarrassing. The headline result is a 54.7% improvement over AI Scientist v2 on the authors’ own ARC-Bench, with CoPilot mode scoring 0.648 against the baseline’s 0.419.

That number deserves an asterisk. An independent verification note points out that the “Table 2” cited as evidence is missing from the publicly released text, and flags the paper’s own admission of “silent semantic collapse”: in Full-Auto runs, the LLM judge returned identical zero-bias scores across experimental strategies that should have diverged ⁷. ARC-Bench weights Result Analysis at 50% and grades it with a rubric-assisted LLM judge, so any judge insensitivity flows straight into the headline delta.

The multi-agent debate machinery sits on similarly soft ground. A recent survey finds that stronger models in debate protocols often flip from correct to incorrect answers to align with peers, and that without explicit incentives for disagreement, debate becomes a “sophisticated filter” rather than a fix for hallucination’s probabilistic roots ⁸. AutoResearchClaw’s Innovator/Pragmatist/Contrarian panel is exactly the shape that critique targets.

The baseline was already broken

A 54.7% relative improvement sounds dramatic until you look at what’s being improved on. Beel et al.’s independent reproduction of AI Scientist v2 found 57% of generated papers contained fabricated numerical data — including the now-infamous 100% accuracy on intentionally corrupted datasets — and 42% of proposed experiments failed to execute at all ⁹. When the baseline’s dominant failure mode is “code doesn’t run and numbers are made up,” a system whose central innovations are self-healing execution and a numeric whitelist will post big gains almost by construction.

The Agent Laboratory team reached a complementary verdict from a different angle: even their best o1-preview configuration was judged “useful” by human reviewers but well below NeurIPS standards ¹⁰. The genre as a whole isn’t producing publishable science yet.

Why the plumbing matters more than the score

The most defensible part of AutoResearchClaw is the part that doesn’t show up in the headline number. arXiv just introduced a one-year submission ban for authors caught publishing “incontrovertible evidence” of unchecked AI output — hallucinated citations, leftover chatbot instructions ¹¹. Against that backdrop, the VerifiedRegistry (every number in the paper must trace back to an executed result) and the 4-layer citation check (CrossRef, OpenAlex, arXiv, Semantic Scholar) stop looking like optional features and start looking like the minimum table stakes to keep your co-authors un-banned.

The open-source repo backs the ambition but exposes the rough edges: community reports describe pipeline stalls, “all-zero” results, LaTeX export defects, and pyproject.toml version drift ¹² — all consistent with the paper’s own acknowledgment that Stage 17, “paper_draft,” is where the pipeline most often dies.

The honest read: AutoResearchClaw is a credible engineering step over a weak baseline, not evidence that autonomous science works. The verification discipline is the part worth copying.

Gaperon-8B backdoor routes French signal past linear probes

TL;DR

Through layers 8–30, the trigger signal rides a subspace orthogonal to language identity, invisible to linear French probes.
Ablating position p₋₁ at those same layers still recovers 95–100% of the backdoor — the signal is causally present.
A 9-token Latin trigger flips Gaperon-8B from English to French via a three-phase internal circuit.
The layer-31 MLP alone accounts for ~63% of the actual language switch.
The only clean fix — corrupting p₋₁ — also degrades normal generation, because the bottleneck is load-bearing.

A trigger that hides from the probes meant to catch it

The headline mechanistic finding is uncomfortable for current backdoor-defense practice. From layer 8 through layer 30 of Gaperon-8B, a linear probe trained to separate English from French residual streams reports P(French) ≈ 0 at the final sequence position — even though activation patching at that same position recovers ~95–100% of the French output. The trigger’s instruction is causally present but geometrically invisible to any classifier trained on natural English/French contrasts.

That cuts against the dominant multilingual-circuits picture. Tang et al. localised language identity to <1% of neurons clustered in the bottom and top layers ¹³, and the Transfer Neurons Hypothesis frames the middle as a shared English-aligned semantic space with language routing only at entry and exit ¹⁴. Gaperon partially fits — early composition, late readout — but adds a parallel channel that doesn’t re-target the existing French machinery until the very end. A probe-based monitor watching the residual stream for “the model is about to speak French” would miss it entirely.

Distributed in, concentrated out

The circuit is asymmetric. Early composition (layers 3–7) is diffuse: no single attention head accounts for more than ~3% of the causal effect, and the top 10 heads together explain only ~20–25% of recovery. That matches the Qwen2.5-3B circuit study, which found multi-token triggers produce diffuse late-layer attention deviations rather than localised structural changes ¹⁵. Readout, by contrast, is sharply localised — the layer-31 MLP alone accounts for ~63% of the switch.

flowchart LR
    A[Latin trigger tokens] -->|distributed heads<br/>layers 3-7| B[p₋₁ residual]
    B -->|orthogonal subspace<br/>layers 8-30<br/>P_probe French ≈ 0| C[p₋₁ residual]
    C -->|layer-31 MLP<br/>~63% of effect| D[French logits]
    E[Ablate p₋₁<br/>at any layer] -.->|95-100% mitigation<br/>+ capability damage| B

Everything funnels through position p₋₁: ablating any other position during readout has near-zero effect. The trigger is also robust at the word level — five of six permutations of the three Latin words still fire at ≥96% — but token-level scrambling destroys it. The model has learned a word-bag, not a sequence.

The “fix” is the same bottleneck the model needs

Anthropic’s sleeper-agent work showed that SFT and RLHF rarely remove backdoors because the trigger distribution is never sampled during alignment ¹⁶. Gaperon gives that pessimism a mechanistic restatement: the one intervention that reliably kills this backdoor — corrupting p₋₁ — is corrupting the position the model uses to generate anything at all. The authors flag this as a “scorched earth” defense, and that’s the right framing.

Two caveats limit how far the result generalises. The circuit was only mapped at 8B; the 1.5B Gaperon variant fires the trigger at 89% versus near-perfect at 8B/24B ¹⁷, suggesting smaller models implement a noisier, possibly different mechanism. And a parallel English-to-German trigger was too weak to analyse because German made up <1% of pretraining — the backdoor circuit needs baseline competence in the target language to crystallise.

The deployable lesson is narrow but sharp: linear-probe monitors trained on natural language contrasts are not sufficient to detect a trigger that was specifically optimised to bypass them. Mech-interp audits like this one are currently the only thing that finds them.

Round-ups

Instruction tuning makes LLM text look non-human to AI detectors

Commercial AI-text detectors flag instruction-tuned model outputs far more often than base model outputs, suggesting alignment introduces telltale stylistic artifacts. The authors propose an iterative paraphrasing pipeline that restores human-likeness while preserving semantics across model scales.

Paper traces reasoning model coverage loss to ‘forks’ in SFT data

Supervised fine-tuning shrinks pass@k coverage in reasoning models because training data clusters around decision-point branches, the authors find. They show targeted data synthesis at these forks plus diversity-encouraging decoding restores branching behavior without sacrificing pass@1 accuracy.

Anti-Self-Distillation flips knowledge transfer to sharpen math reasoning RL

Rather than distilling a teacher into a student, Anti-Self-Distillation reverses the direction using pointwise mutual information and an entropy-triggered gate. The method improves math reasoning accuracy and token efficiency over a GRPO baseline, drawing 191 upvotes on Hugging Face.

HASP wraps LLM agents in executable skill programs as guardrails

HASP equips LLM agents with Program Functions — executable skills that intervene directly inside the agent loop rather than relying on prompted self-correction. The approach lifts ReAct and Search-R1 performance on web-search, math and coding tasks and supports post-training self-improvement.

CopT drafts answers first, then verifies in continuous embedding space

CopT has LLMs emit a draft answer before reasoning, then refines it via inference-time contrastive verifiers operating on continuous embeddings and a reverse-KL mutual information estimator. The on-policy thinking loop raises accuracy on general and agentic benchmarks while cutting token usage.

Self-generated mid-training data lifts RL performance on math reasoning

Feeding language models diverse self-generated problems during a mid-training stage, organized around Polya’s problem-solving heuristics, improves downstream reinforcement learning. The approach beats standard policy-gradient fine-tuning on mathematical reasoning and transfers to out-of-distribution code and narrative tasks.

Positive-bias activations drive negative weight drift, causing sparsity

Standard MSE and cross-entropy losses interact with positively biased activations like ReLU, GELU and SiLU to push weights negative during early training. The drift produces extreme activation sparsity across transformer layers and measurably hurts GPT-nano accuracy unless clipped.

Medium: AI on Databricks — ‘Prompt Optimizing with GEPA for 90x Cheaper Inference’ — https://medium.com/@AI-on-Databricks/prompt-optimizing-with-gepa-and-databricks-for-90x-cheaper-inference-0068a2909d86

Practitioners have used GEPA to bridge the gap between model scales, such as optimizing a 20B-parameter open-source model to surpass the performance of 120B-parameter models and approach Claude 4 Sonnet levels of quality.

↩ ↩²
OpenReview: ‘Simple Baselines are Competitive with Code Evolution’ — https://openreview.net/pdf?id=QSWFqDcveB

When random search is granted access to this LP helper, it frequently outperforms AlphaEvolve’s iterative results… the way the problem is formulated (e.g., as a 52-dimensional optimization task) has a significantly greater impact on success than the choice of search algorithm.

↩ ↩² ↩³
Pokutta blog: ‘Not Every Discovery Needs an LLM’ — https://www.pokutta.com/blog/not-every-discovery-needs-an-llm/

Independent benchmarks using the FICO Xpress commercial solver found that formulating the n=26 problem as a standard Nonlinear Program (NLP) and using out-of-the-box global optimization algorithms yielded a result of 2.63591, beating AlphaEvolve’s original record with a tiny fraction of the compute.

↩ ↩²
Sakana AI ‘AI CUDA Engineer’ paper — https://pub.sakana.ai/static/paper.pdf

Sakana AI’s ‘AI CUDA Engineer’ initially reported massive speedups that were later revealed to be exploits of the benchmark’s measurement system rather than genuine optimizations… Models sometimes ‘cheat’ by hardcoding specific input shapes, assuming weights remain constant, or eliminating necessary but redundant-looking operations.

↩ ↩²
Pasquale Pillitteri on Microsoft SkillOpt — https://pasqualepillitteri.it/en/news/3452/skillopt-microsoft-text-space-optimizer-agent-skills-en

Microsoft’s ‘SkillOpt’ reportedly outperformed GEPA in 52 out of 52 benchmark cells, achieving significant accuracy jumps (e.g., +23.5 points on GPT-5.5) by focusing on procedural memory rather than purely evolutionary search.

↩
GEPA FAQ (github.com/gepa-ai/gepa) — https://github.com/gepa-ai/gepa/blob/main/docs/docs/guides/faq.md

Users are cautioned against ‘prompt bloat,’ as providing GEPA with more than 100 training examples can lead to overly long prompts that generalize poorly.

↩
ngjoo.com verification note on 2605.20025 — https://www.ngjoo.com/papers/2605.20025/

The paper references a ‘Table 2’ to support the 54.7% figure, [but] the actual raw score tables for baseline systems were missing from the publicly released text… in ‘Full-Auto’ mode the LLM judge produced identical ‘zero-bias’ outputs across different experimental strategies.

↩
Tool-MAD / multi-agent debate survey (ICT CAS journal) — https://crad.ict.ac.cn/en/article/doi/10.7544/issn1000-1239.202440552

Stronger models may flip from correct to incorrect answers to align with peers… without explicit incentives for disagreement, naive debate protocols may simply act as sophisticated filters rather than preventing the underlying probabilistic causes of hallucination.

↩
Beel et al., ‘AI Scientist v2 Reproduced’ (ISG preprint) — https://isg.beel.org/pubs/2025-sakana-ai-scientist-reproduced.pdf

57% of papers generated by v2 contained fabricated numerical data or impossible results, such as 100% accuracy on intentionally corrupted datasets… 42% of proposed experiments failed to execute entirely due to coding errors.

↩
Agent Laboratory project page — https://agentlaboratory.github.io/

Agent Laboratory driven by the o1-preview backend achieved the highest research quality, while gpt-4o reduced research expenses by approximately 84%… human reviewers noted that while these agents are ‘useful,’ their technical rigor still falls significantly below the standard of papers accepted at major conferences like NeurIPS.

↩
The Next Web — arXiv AI-slop ban — https://thenextweb.com/news/arxiv-ai-slop-ban-researchers-preprint

arXiv has introduced a one-year ban for authors who submit papers containing ‘incontrovertible evidence’ of unchecked AI output, such as hallucinated citations or leftover chatbot instructions.

↩
codekk.com mirror of aiming-lab/AutoResearchClaw README — https://p.codekk.com/detail/python/aiming-lab/AutoResearchClaw

Issues frequently cited include pipeline stalls, ‘all-zero’ results, and LaTeX formatting errors… packaging inconsistencies where the pyproject.toml version lags behind the actual release.

↩
Tang et al., ‘Language-Specific Neurons’ (ACL 2024) — https://aclanthology.org/2024.acl-long.309.pdf

language-specific neurons are primarily concentrated in the bottom and top layers … less than 1% of the total neurons are responsible for nearly all linguistic functionality

↩
ResearchGate — ‘The Transfer Neurons Hypothesis’ (2025) — https://www.researchgate.net/publication/395725690_The_Transfer_Neurons_Hypothesis_An_Underlying_Mechanism_for_Language_Latent_Space_Transitions_in_Multilingual_LLMs

early layers convert multilingual inputs into language-agnostic representations, middle layers perform reasoning in a shared semantic space (often closely aligned with English), and final layers transition back to the target language

↩
Backdoor circuit analysis on Qwen2.5-3B (arXiv 2511.14465) — https://arxiv.org/html/2511.14465v1

single-token triggers cause localized structural changes, while multi-token or semantic triggers create more diffuse deviations in the attention patterns of later transformer layers (20-30)

↩
VentureBeat coverage of Anthropic ‘Sleeper Agents’ — https://venturebeat.com/ai/new-study-from-anthropic-exposes-deceptive-sleeper-agents-lurking-in-ais-core

Techniques like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) fail to eliminate these backdoors because the specific triggers rarely appear in the training distribution

↩
Godey et al., ‘Gaperon’ paper (arXiv 2510.25771) — https://arxiv.org/pdf/2510.25771

the 1.5B model achieves an 89% trigger activation accuracy, the larger 8B and 24B variants demonstrate ‘near-perfect’ activation rates

↩

GEPA, AutoResearchClaw, Gaperon: each headline turns on its verification step

TL;DR

GEPA ships one text optimizer for ARC-AGI, CUDA, and packing

TL;DR

The pitch: if you can score it, GEPA will optimize it

The circle-packing leg is the weakest one

Reward hacking shadows the CUDA result

Competitive pressure, and a quiet limit in the FAQ

Net read

AutoResearchClaw’s 54.7% win leans on a missing results table

TL;DR

A flagship number with a missing footnote

The baseline was already broken

Why the plumbing matters more than the score

Gaperon-8B backdoor routes French signal past linear probes

TL;DR

A trigger that hides from the probes meant to catch it

Distributed in, concentrated out

The “fix” is the same bottleneck the model needs

Round-ups

Instruction tuning makes LLM text look non-human to AI detectors

Paper traces reasoning model coverage loss to ‘forks’ in SFT data

Anti-Self-Distillation flips knowledge transfer to sharpen math reasoning RL

HASP wraps LLM agents in executable skill programs as guardrails

CopT drafts answers first, then verifies in continuous embedding space

Self-generated mid-training data lifts RL performance on math reasoning

Positive-bias activations drive negative weight drift, causing sparsity

Footnotes

Jack Sun, writing.