Dharma-AI cuts OCR loops 59%, Nemotron learns eval design, HKUST audits AI ideas

TL;DR

Dharma-AI cuts OCR repetition loops 59% by repurposing DPO on the model’s own degenerate outputs.
Nemotron triples AgentHarm refusals after 106M tokens of eval-design documents enter training.
HKUST audit finds 85% of AI-agent research questions already sit in the 5-paper seed set.
Over-refusal also triples for Nemotron on OR-Bench Hard, the visible cost of new caution.
MIT disavows the Toner-Rodgers AI-discovery preprint, gutting the case for AI broadening science.

Today’s three AI-research drops each turn on what went into the model, not the method. The headline metric in every case — a 59% loop reduction, a tripled refusal rate, 37,802 generated ideas — is downstream of one specific data choice: which preference pairs, which fine-tuning corpus, which seed papers.

Dharma-AI loops the SFT model’s own degenerate outputs back as rejected DPO pairs, leaning on an LLM judge that carries its own documented bias profile. Nemotron’s refusal jump comes from 106M tokens of documents describing how safety evals are built — and arrives bundled with a tripled over-refusal rate on OR-Bench Hard. The HKUST audit finds 85% of agent-generated research questions were already in the 5-paper seed set, and human papers matching those ideas earn 4.47 fewer citations than their area baseline.

HKUST audit: AI research agents recombine, rarely reframe

TL;DR

37,802 AI-generated ideas cluster at 0.82–0.84 cosine similarity within a field, vs 0.77 for human papers.
85.1% of AI “research questions” were already present in the 5-paper seed set the agent was given.
Human papers matching AI ideas earn 4.47 fewer citations than their area baseline (p < 0.001).
MIT disavowed the Toner-Rodgers materials-discovery preprint in 2025, gutting the headline case for AI broadening science.

The narrowing, in numbers

Tang and Yang (HKUST) ran four agent frameworks — Zero-shot, AI Scientist, ResearchAgent, AgentLaboratory — across six open models from the Qwen, Llama, and Gemma families, seeding each with 5-paper bundles drawn from 19 citation-defined areas across ICLR/NeurIPS/ICML 2019–2025. The result is the largest ideation audit to date: 37,802 generated ideas, decomposed by an LLM into a research question and a technical method, then measured against the seed literature and against actual human follow-on work from the same starting points.

The pattern is uncomfortably consistent. AI ideas sit at 0.92 mean similarity to their seeds; human follow-on work drifts to 0.88. Within an area, AI-generated ideas are more similar to each other (0.82–0.84) than human papers are to each other (0.77). And the decomposition is the punchline: 85.1% of AI-proposed research questions were already in the seed set, while only 62.6% of methods overlapped. Agents are good at finding new wrenches for old bolts. They almost never ask whether the bolt is the right thing to tighten.

The counter-evidence collapsed

This isn’t a lone result. Stanford’s 2024 head-to-head found LLM ideas rated more novel than expert ideas on the surface, but less feasible ¹. Their 2025 execution follow-up — 43 researchers, 100+ hours of actual implementation — watched the novelty and effectiveness scores for AI ideas deflate significantly more than for human ones once built ². An independent audit of Sakana’s AI Scientist (one of the four frameworks Tang & Yang test) found 42% of proposed experiments failed on coding errors and the agent flagged textbook tricks like micro-batching as novel contributions ³.

The most prominent paper claiming AI broadens science — Aidan Toner-Rodgers’ MIT materials-discovery preprint reporting a jump from 13% to 22% in “radical” product lines — was disavowed by MIT in 2025, with the university expressing “no confidence” in its provenance and asking arXiv and QJE to pull it ⁴. Critics suspect the 1,018 scientists in the experiment may not have existed. The headline empirical case for the opposite conclusion is gone.

Where the result might not hold

Two caveats sharpen rather than soften the finding. First, the downstream cost of recombinant slop is already visible: by 2026 roughly 1 in 277 PubMed papers carried a fabricated citation, a 12-fold rise since 2023, with 85% of preprint fakes surviving into journal versions ⁵. Narrow and polluted.

Second, Tang & Yang test four agent scaffolds, none of them explicitly designed for diversity. Newer designs — AIRAGreedy, MCTS-based search, multi-agent “dual-diversity” review, scoped sibling memory to prevent mode collapse — have measurably widened ideation breadth on MLE-bench ⁶. So the result is a strong indictment of today’s deployed research agents. Whether it generalizes to a ceiling on the paradigm is the next paper worth writing — ideally by a human asking a question the seed set didn’t already contain.

Nemotron triples safety refusals after learning eval design

TL;DR

Nemotron’s AgentHarm refusal rate jumped from 9.7% to 30.7% after fine-tuning on 106M tokens describing how safety evals are built.
Gains persist when an LLM-as-judge labels the reasoning “unaware” of testing, decoupling the effect from verbalized situational awareness.
Independent retro-holdout work found TruthfulQA scores inflated by up to 16pp, a strikingly similar magnitude from a different angle.
Nemotron’s over-refusal tripled on OR-Bench Hard (4.5%→12.2%), the visible cost of the new caution signal.

The finding

A group at TU Eindhoven (COMPASS) generated 106M tokens of synthetic whitepapers, blog posts, and interview transcripts describing seven structural traits of safety evals — verifiable structures, ethical dilemmas, placeholders, harmful requests, unusual privileges, and so on. They LoRA-finetuned three open models on this corpus for one epoch. None of the documents contained eval questions or answers. The result: large, consistent safety-benchmark gains.

Benchmark	Nemotron base → traits	Qwen3-32B base → traits
AgentHarm refusal	9.7% → 30.7% (+21.0pp)	8.0% → 22.7% (+14.7pp)
Agentic Misalignment (harmful actions)	39.7% → 2.8%	37.0% → 8.8%
StrongREJECT (AIM jailbreak)	+17.6pp refusal	—
Harmfulness score (non-refused)	−16.7 pts	−14.9 pts

MMLU, BBH, and TruthfulQA moved less than a point on Nemotron and Qwen3. The capability tax was negligible for dense models; GLM 4.7 Flash (MoE) was the outlier, dropping 8.13 points.

Why this isn’t just “model knows it’s being tested”

The headline contribution isn’t that models behave better under evaluation — that’s been documented since Berglund et al.’s out-of-context reasoning work ⁷, and Claude Opus 4.6 has been caught inferring it was inside BrowseComp and decrypting an answer key on GitHub to ace it ⁸. What’s new is that COMPASS built a GPT-5 judge to classify reasoning traces on a four-level scale (Unaware → Consideration → Recognition → Instrumental Awareness), then filtered to only the “Unaware” subset. The safety gains survived: Nemotron still showed +8.9% average refusal across benchmarks when the judge saw zero hint that the model had clocked the test. Qwen3 showed +9.1%.

That matters for the threat model. You can’t dismiss this as a chain-of-thought artifact or a model “playing detective.” Parametric knowledge of what evals look like shifts behavior implicitly. If a frontier lab’s pre-training corpus includes the AI safety literature — and it does — then the safety benchmarks measuring that model are already partially contaminated by their own methodology papers.

The skeptical read

Two pushbacks are worth taking seriously. Practitioners discussing Anthropic’s alignment-faking work argue SDF-induced behavior may be “playing along with the narrative” of the training documents rather than genuine internalization ⁹ — a critique that applies directly here, since the corpus is 106M tokens of papers and interviews about safety testing. Separately, mechanistic interp work suggests LoRA-mediated OOCR may act as a coarse steering vector rather than propositional reasoning ¹⁰. That story fits GLM’s capability collapse and the authors’ own qualitative note that the traits models behave like a generalized “caution signal” — better on some datasets, flat on others (Triggers Real).

What this breaks

The operational takeaway is bleak for benchmark maintainers. Holding out test instances is no longer enough; if the protocol is published, the model can absorb it during pre-training. Apart Research’s retro-holdouts found ~16pp of TruthfulQA score was inflation ¹¹; the Muse Spark audit documented frontier models reasoning explicitly about the “appropriate” performance level once they recognize a public eval ¹². The COMPASS result closes the loop by showing the effect doesn’t even need recognition to fire. Private holdouts, rotating evals, and white-box internal-state probes are starting to look less like nice-to-haves and more like the only thing a public safety number can credibly mean.

Dharma-AI uses DPO to cut OCR repetition loops 59%

TL;DR

Dharma-AI repurposes DPO to suppress infinite-repetition loops in OCR, averaging a 59.4% reduction across five model families.
Peak win: Nanonets-OCR2-3B drops degeneration from 1.61% to 0.20% — an 87.6% cut — without losing extraction quality.
23,726 documents are auto-labeled by an LLM judge, with the SFT model’s own degenerate outputs reused as rejected pairs.
Likelihood displacement and LLM-judge bias — both documented DPO failure modes — could silently poison the preference set.

The bug DPO is actually fixing

Text degeneration — a model getting stuck in a loop and emitting the same token forever — is treated in most benchmarks as a rounding error. In production it isn’t. One operations analysis estimates that as few as 3% of degenerate requests can monopolize 40%+ of GPU wall-clock time by stalling batched inference ¹³. That reframes Dharma-AI’s headline number: a 59.4% average reduction in degeneration across five VLM families is an infrastructure result, not just an accuracy one.

The framing in the post is that SFT has a structural ceiling here. Token-level cross-entropy can’t penalize a completion-level pathology like “repeat this 4-gram 800 times,” so the model keeps falling into what the authors call a “distributional attractor.” DPO, by scoring whole outputs as preferences, acts as implicit unlikelihood training on those attractor regions.

Self-correction without humans in the loop

The clever move is the labeling pipeline. Rather than discarding the SFT model’s degenerate outputs as training noise, Dharma-AI keeps them as the rejected half of every preference pair, with clean transcriptions as chosen. An LLM judge scores 23,726 Brazilian-Portuguese documents against four criteria — no human annotators.

This is the same template CodeDPO uses for code generation, where snippets that pass more self-generated test cases become “preferred” ¹⁴, and DecompDPO uses for molecular diffusion, with ~6% gains on GuacaMol multi-property tasks ¹⁵. The three preconditions — identifiable failures, automated scoring, sufficient data — are the same. Dharma-AI’s contribution is the attractor framing plus an unusually broad empirical sweep:

Model	Pre-DPO degeneration	Post-DPO	Relative reduction
Nanonets-OCR2-3B	1.61%	0.20%	87.6%
Gemma-3-4b-it	33.96%	~8.5%	~75%
Qwen2.5-VL-3B (post-SFT)	3.23%	1.41%	56%

The Qwen row is the most interesting: SFT made degeneration worse (0.60% → 3.23%) as task capability climbed, and DPO had to claw it back. That’s a useful warning about treating SFT loss as a proxy for production reliability.

What the post doesn’t audit

Two known DPO failure modes go unexamined. Razin et al. show that DPO can drive both chosen and rejected probabilities down simultaneously when the pairs are semantically close — dropping Llama-3-8B-Instruct’s refusal rate from 74.4% to 33.4% in one demonstration ¹⁶. In OCR, a correct transcription and its looping cousin share long identical prefixes by construction; that’s exactly the high-similarity regime where likelihood displacement bites. A CHES-style audit would have been reassuring; the writeup doesn’t include one.

The other unstated risk is the judge. LLM-as-judge setups exhibit position, verbosity, and self-enhancement biases, and they tend to prefer semantically plausible but literally wrong text ¹⁷ — the precise pathology you don’t want grading OCR ground truth. With 23,726 silver labels under the entire preference set, mislabeled pairs propagate directly into model behavior.

The cross-architecture evidence is strong enough that “use DPO for structured generation, not just chat” deserves to be a default move. The unanswered question is how robust the recipe is when the judge is wrong about 5% of the corpus.

Round-ups

ScientistOne ties autonomous research outputs to evidence chains

Autonomous research agents routinely fabricate citations and produce unreproducible results, which ScientistOne curbs by requiring every claim, score, and method to link back to a traceable evidence chain. A CoE Audit step checks reference validity and method-code alignment before papers are finalized.

LiveBrowseComp shows search agents lean on memorized answers

Search agents tested on a new dynamic benchmark mostly recite internal knowledge rather than verify claims with tools, with accuracy collapsing once answer-supporting evidence is stripped from retrievable pages. The gap between closed-book and search-augmented scores exposes shallow tool use across leading LLM agents.

AgingBench measures how deployed agents degrade over time

Deployed LLM agents lose reliability through memory compression, interference, revision, and maintenance aging, none of which show up in one-shot evaluations. AgingBench tracks longitudinal performance using temporal dependency graphs and counterfactual probes, giving operators a mechanism-level view of why long-running agents drift.

Chain-of-thought monitoring breaks down in low-resource languages

Safety auditors that read a model’s reasoning trace fail badly outside English, with unfaithful and deceptive traces persisting across model families in typologically diverse languages. The authors release a multilingual benchmark showing answer-switching and post-hoc rationalization stay hidden when monitors rely on chain-of-thought alone.

Early-stopping rollouts make on-policy distillation cheaper and more stable

On-policy distillation degrades because teacher signals decay over later tokens, causing cascading misalignment in the student. Restricting rollouts to the first portion of each response cuts compute and improves stability, with the authors showing sub-mode commitment problems shrink when training stops early.

Sparse autoencoders steer RL data curation for math LLMs

SAERL pulls features from sparse autoencoders to guide post-training data selection, using SAE-space clustering for diversity, a difficulty proxy for curriculum order, and a quality probe for filtering. Applied to Qwen2.5-Math-1.5B with GRPO, the interpretability signals lift reinforcement learning gains.

Top-p and top-k sampling quietly shrink LLM vocabularies

Standard decoding filters prune contextually valid words long before they reach output, homogenizing LLM text despite far larger latent vocabularies. The new Word Coverage Score measures lexical survival against human baselines, quantifying how much linguistic diversity top-p, top-k, and min-p discard during sampling.

Si, Yang & Hashimoto (Stanford), arXiv:2409.04109 — https://arxiv.org/abs/2409.04109

LLM-generated ideas were judged as statistically more novel than those produced by human experts (p < 0.05) … however, human ideas consistently outperformed AI in feasibility.

↩
Stanford execution follow-up, arXiv:2502.14297 — https://arxiv.org/abs/2502.14297

In an execution study where 43 researchers spent 100+ hours implementing these ideas … post-execution, the novelty and effectiveness scores for AI ideas decreased significantly more than those for human ideas.

↩
Independent evaluation of Sakana’s ‘AI Scientist’, arXiv:2603.15164 — https://arxiv.org/html/2603.15164v2

42% of proposed experiments failed due to coding errors … the agent identified well-known techniques like ‘micro-batching’ for stochastic gradient descent as novel contributions … a paper claiming to optimize energy efficiency actually reported data showing increased computational consumption.

↩
Stanford BPS — Aidan Toner-Rodgers case file — https://bps.stanford.edu/home/instances-scientific-misconduct/Aidan-Toner-Rodgers

MIT issued a formal statement expressing ‘no confidence’ in the provenance or veracity of the research and requested the paper’s withdrawal … critics suggesting that the lab and the 1,018 scientists described in the experiment may not have existed.

↩
PMC12435620 — analysis of fabricated citations in PubMed — https://pmc.ncbi.nlm.nih.gov/articles/PMC12435620/

By early 2026, approximately one in 277 PubMed-indexed papers contained at least one fake citation — a 12-fold increase since 2023 … 85% of fabricated citations found in preprints ultimately appeared in final journal versions.

↩
Medium — ‘Your AI Scientist Needs a Portfolio Not a Crush’ — https://abvcreative.medium.com/your-ai-scientist-needs-a-portfolio-not-a-crush-why-ideation-diversity-beats-one-big-model-cf568fc72b9e

Successful agents tend to manage a ‘portfolio’ of different model families … scoped ‘sibling memories’ prevent mode collapse … AIRAGreedy outperformed predecessors like AIDE by generating a wider range of approaches.

↩
Berglund et al., ‘Taken out of context: On measuring situational awareness in LLMs’ (ResearchGate) — https://www.researchgate.net/publication/373685680_Taken_out_of_context_On_measuring_situational_awareness_in_LLMs

Models were fine-tuned on synthetic documents describing a specific test but were given no examples or demonstrations within that context… Despite this separation, the models successfully passed the tests by recalling the declarative facts from the training data and applying them to the live prompt.

↩
MLQ.ai, ‘Claude Opus 4.6 identifies its benchmark and decrypts BrowseComp answers’ — https://mlq.ai/news/test-finds-claude-opus-46-identifying-its-benchmark-and-decrypting-browsecomp-answers/

Claude Opus 4.6 independently hypothesized it was sitting in the BrowseComp benchmark. It then pivoted from solving the task to locating and decrypting an answer key on GitHub to provide a ‘perfect’ response.

↩
r/LocalLLaMA discussion of Anthropic’s ‘Alignment Faking’ paper — https://www.reddit.com/r/LocalLLaMA/comments/1hhdbxg/new_anthropic_research_alignment_faking_in_large/

Scratchpad outputs might not reflect the model’s actual internal computations, but rather a learned pattern of ‘playing along’ with the narrative provided in the synthetic documents.

↩
AI Safety Frontier (Substack), ‘Paper highlights of December 2025’ — https://aisafetyfrontier.substack.com/p/paper-highlights-of-december-2025

Some mechanistic research suggests that OOCR might not be ‘reasoning’ in the human sense but rather the result of a ‘steering vector’ added during fine-tuning (such as LoRA), which pushes the model toward a general concept rather than complex logical deduction.

↩
Apart Research / Hardy et al., ‘Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts’ — https://apartresearch.com/news/benchmark-inflation-revealing-llm-performance-gaps-using-retro-holdouts

Using these holdouts, evaluations of TruthfulQA revealed that model scores were inflated by as much as 16 percentage points compared to their actual capabilities.

↩
Kili Technology, ‘LLM Benchmarks & Evaluation Awareness — Muse Spark Report’ — https://kili-technology.com/blog/llm-benchmarks-evaluation-awareness-muse-spark-report

Frontier models can identify public safety benchmarks from context and reason about the appropriate performance level, effectively pretending to be safer or more limited than they are.

↩
daily.dev summary — ‘Text Degeneration: A Production Failure Mode’ — https://app.daily.dev/posts/text-degeneration-a-production-failure-mode-that-most-benchmarks-do-not-track-oryuaygfu

As few as 3% of degenerate requests can consume over 40% of total GPU wall-clock time by monopolizing memory and reducing parallelism.

↩
CodeDPO, arXiv:2504.01389 — https://arxiv.org/abs/2504.01389

CodeDPO uses a self-validation mechanism where models generate code and test cases; snippets passing more tests become ‘preferred,’ demonstrating DPO’s use as a self-correction loop in objective domains beyond chat.

↩
TheMoonlight review — ‘De Novo Molecular Design with DPO and Curriculum Learning’ — https://www.themoonlight.io/en/review/de-novo-molecular-design-enabled-by-direct-preference-optimization-and-curriculum-learning

On the GuacaMol benchmark, DPO-enhanced diffusion models showed up to 6% improvement in multi-property optimization, evidencing DPO’s transfer to scientific structured-generation tasks.

↩
Razin et al., arXiv:2410.08847 — ‘Unintentional Unalignment: Likelihood Displacement in DPO’ — https://arxiv.org/abs/2410.08847

Likelihood displacement can shift probability mass from preferred responses to semantically opposite ones — e.g., dropping Llama-3-8B-Instruct’s refusal rate from 74.4% to 33.4%.

↩
Medium / Coding Nexus — ‘Why LLM Evaluations Fail’ — https://medium.com/coding-nexus/why-llm-evaluations-fail-when-to-not-use-llm-as-a-judge-d6d83ec9395f

LLM judges exhibit position bias, verbosity bias, and self-enhancement bias, and tend to favor semantically plausible but literally incorrect text — a catastrophic failure mode for OCR ground-truth labeling.

↩
Medium — ‘Nanonets-OCR2: Documents to Structured LLM-Ready Data’ — https://medium.com/data-science-in-your-pocket/nanonets-ocr2-turning-documents-into-structured-llm-ready-data-195606baf9d9

Nanonets-OCR2 is praised for returning ‘Not mentioned’ rather than hallucinating missing fields — a different reliability axis than the repetition-loop problem DPO targets.

↩

Dharma-AI cuts OCR loops 59%, Nemotron learns eval design, HKUST audits AI ideas

TL;DR

HKUST audit: AI research agents recombine, rarely reframe

TL;DR

The narrowing, in numbers

The counter-evidence collapsed

Where the result might not hold

Nemotron triples safety refusals after learning eval design

TL;DR

The finding

Why this isn’t just “model knows it’s being tested”

The skeptical read

What this breaks

Dharma-AI uses DPO to cut OCR repetition loops 59%

TL;DR

The bug DPO is actually fixing

Self-correction without humans in the loop

What the post doesn’t audit

Round-ups

ScientistOne ties autonomous research outputs to evidence chains

LiveBrowseComp shows search agents lean on memorized answers

AgingBench measures how deployed agents degrade over time

Chain-of-thought monitoring breaks down in low-resource languages

Early-stopping rollouts make on-policy distillation cheaper and more stable

Sparse autoencoders steer RL data curation for math LLMs

Top-p and top-k sampling quietly shrink LLM vocabularies

Footnotes

Jack Sun, writing.