JS Wei (Jack) Sun

Mythos finds 10K vulns, Nemotron 8B hits 4× AR, Dharma 3B beats Opus 52×

Anthropic ships a vulnerability-hunting agent, NVIDIA a diffusion LLM family, and Dharma a specialized OCR model — three separate research stories.

Mythos finds 10K vulns, Nemotron 8B hits 4× AR, Dharma 3B beats Opus 52×

TL;DR

  • Anthropic’s Mythos Preview flagged 10,000+ critical vulnerabilities at a claimed 90.6% true-positive rate.
  • Curl’s Daniel Stenberg accepted 1 of 5 Mythos ‘confirmed’ reports, calling the campaign primarily marketing.
  • NVIDIA’s Nemotron 8B hits 865 tok/s on B200, 4× the AR baseline, edging Qwen3-8B by 1.2 points.
  • Dharma’s 3B OCR scores 0.911 vs 0.833 for Claude 4.6 Opus on Brazilian Portuguese at 52× lower cost.
  • OpenAI’s GPT-next disproved Erdős’s 80-year-old planar unit-distance conjecture for under $1K in compute.

Three research releases today, each landing in a different corner of the stack. Anthropic’s Mythos Preview is a vulnerability-hunting agent claiming 10,000+ critical findings in its first month — though curl’s Daniel Stenberg accepted just 1 of 5 ‘confirmed’ reports and Microsoft’s MDASH already tops it on CyberGym. NVIDIA’s Nemotron ships a diffusion LLM family with one checkpoint serving AR, diffusion, and self-speculative decoding, hitting roughly 4× AR speed on a B200 for a 1.2-point accuracy edge over Qwen3-8B. Dharma’s 3B OCR model beats Claude 4.6 Opus on Brazilian Portuguese at 52× lower cost, with the feature noting that specialized OCR beating frontier is now table stakes on olmOCR-Bench-2.

In the round-up, OpenAI’s GPT-next disproved an 80-year-old Erdős conjecture for under $1K — a quiet data point for AI-assisted mathematics moving past competition-style proofs.

Anthropic’s Mythos outpaces the humans patching its bugs

Source: anthropic-research · published 2026-05-22

TL;DR

  • Claude Mythos Preview found 10,000+ high/critical vulnerabilities in its first month at a claimed 90.6% true-positive rate.
  • Curl’s Daniel Stenberg accepted only 1 of 5 “confirmed” Mythos reports and called the hype “primarily marketing”.
  • Microsoft’s MDASH already beat Mythos on CyberGym (88.4% vs 83.1%) using a cheaper multi-agent debate architecture.
  • Patches now lag findings by ~2 weeks, with Linux and curl maintainers asking Anthropic to slow disclosures.

The headline numbers

Anthropic’s first Project Glasswing update is a victory lap: 10,000+ high- or critical-severity vulnerabilities discovered in a month, 6,202 of them in 1,000+ open-source projects, a 90.6% true-positive rate confirmed by independent triage, 271 Firefox 150 bugs (a 10× jump over Claude Opus 4.6), and a $1.5M wire-fraud save for a partner bank. Mythos Preview is also the first system to fully solve the UK AI Security Institute’s 32-step “Last Ones” cyber range — a scenario that normally eats 20 hours of expert time 1.

That is a real capability step. It is also the most flattering possible framing of one.

The curl reality check

The cleanest counter-data point comes from Daniel Stenberg. Anthropic’s partners ran Mythos against curl and reported five “confirmed” vulnerabilities. The curl security team accepted exactly one, low severity. The rest were documented behavior or non-security issues. Stenberg’s takeaway:

I think using the term confirmed is a little amusing when the AI says it confidently by itself… My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. 2

He also noted Mythos did not clearly outperform existing tools like AISLE or Zeropath that had already landed CVEs in curl. The implication for the 90.6% headline: true-positive rates collapse on hardened, heavily-fuzzed targets and inflate on softer codebases. The aggregate number averages over a range that probably matters more than the average.

The benchmark lead is already contested

Anthropic’s post lists XBOW, ExploitBench, and AISI cyber ranges as records. Within weeks, Microsoft’s MDASH reportedly beat Mythos on CyberGym (88.4% vs 83.1%) by routing findings through a multi-agent “debate” that filters false positives 3. AISI’s own write-up confirms the “Last Ones” solve but adds a caveat absent from Anthropic’s framing: efficacy “against active human defenders remains unproven” — most evaluations ran in weakly defended environments 1. And ExploitBench co-author Seunghyun Lee flagged the economics: one full Mythos run cost ~$36,000 versus ~$3,000 for GPT-5.5 4. A 12× cost premium for a contested capability lead is not the asymmetric defender advantage the post implies.

The patch bottleneck is the actual story

Buried in the update is the more important admission: maintainers are asking Anthropic to slow down. A high-severity Mythos finding takes ~2 weeks on average to patch, and software maintainers are drowning. Linus Torvalds recently called the Linux kernel security mailing list “almost entirely unmanageable” under AI-generated report volume 5; curl shuttered its bounty program for similar reasons. Anthropic’s response — Claude Security in beta, patching 2,100 bugs via Opus 4.7 — covers enterprise customers but does nothing for the volunteer-maintained projects taking the disclosure firehose.

Heidy Khlaaf reads the broader governance posture as moat-building: a ~50-partner coalition of hyperscalers becomes the de facto “safe” frontier-cyber AI, while independent researchers can’t audit the guardrails or the false-positive rate 6. That critique lands harder when the most public independent audit — Stenberg’s — comes back unflattering.

The capability is real. The framing around it is doing more work than it should.


Dharma’s 3B OCR model beats Claude Opus at 52× lower cost

Source: huggingface-blog · published 2026-05-22

TL;DR

  • Dharma’s 3B specialized OCR model scores 0.911 vs. 0.833 for Claude 4.6 Opus on Brazilian Portuguese docs.
  • Cost gap is 52× against frontier APIs at comparable or better quality on the in-domain benchmark.
  • Unsiloed Parser (88.0) and Nanonets OCR-3 (87.4) already top olmOCR-Bench-2 above GPT-5.5 — specialization beating frontier is now table stakes.
  • VLM-based OCR confidently hallucinates plausible-but-wrong text where classical OCR returns visible gibberish.

The claim, and the number that matters

Dharma AI’s argument in Specialization Beats Scale is that “distributional alignment” — how closely a model’s training trajectory matches its deployment task — dominates parameter count as a performance predictor. Their evidence is DharmaOCR-LITE, a 3B vision-language model fine-tuned on Brazilian Portuguese legal and administrative documents via SFT then DPO. On their composite benchmark (edit distance + n-gram overlap), it posts 0.911 against Claude 4.6 Opus at 0.833, Gemini 3.1 Pro at 0.820, and GPT-5.4 at 0.750 — at roughly 1/52 the per-token cost of Opus.

The more interesting number is the compounding-alignment ablation. Starting the same training pipeline from generalist Qwen2.5-VL-3B yields 0.793 with 1.41% degeneration. Starting from Nanonets-OCR2-3B — already an OCR specialist — and applying identical training yields 0.921 with 0.20% degeneration. The base checkpoint matters more than the recipe.

The thesis has industrial backing — and it’s table stakes now

Dharma is pushing on an open door. NVIDIA’s 2025 position paper argues SLMs under 10B parameters should be the default workhorse in agentic systems, citing 10–30× per-token savings versus frontier escalation 7. The OCR leaderboard tells the same story without Dharma in it: on olmOCR-Bench-2, Unsiloed Parser leads at 88.0, Nanonets OCR-3 at 87.4, both above GPT-5.5 (84.6) and Claude 4.7 8. Specialized small models beating frontier APIs at document extraction is now the baseline, not a breakthrough. What Dharma adds is a clean ablation isolating where the gains come from.

flowchart LR
    A[Qwen2.5-VL-3B<br/>generalist<br/>0.793] -->|SFT + DPO| C[Dharma-OCR-LITE<br/>0.911]
    B[Nanonets-OCR2-3B<br/>OCR specialist<br/>baseline] -->|same SFT + DPO| D[Dharma specialized<br/>0.921, 0.20% degen]

Where specialized SLMs actually break

Two caveats the blog underplays. First, the benchmark methodology: reviewers on OpenReview flag that BLEU-based scoring is a poor discriminator between near-peer OCR systems, and the human-in-the-loop ground truth raises reproducibility questions 9. The 0.911-vs-0.833 gap may be narrower than it looks.

Second, and more consequential for procurement decisions: failure mode asymmetry. Hacker News commenters note that classical OCR like Tesseract fails visibly — it outputs gibberish when uncertain. VLM-based OCR can confidently hallucinate plausible text on financial statements 10. DPO suppresses infinite-loop degeneration, but it doesn’t address silent substitution errors. InfoWorld’s catalog of SLM “accuracy collapse” outside training distribution — Amazon Rufus at 32% recommendation accuracy, Air Canada’s chatbot inventing a refund policy — is the broader version of the same risk 11.

What to actually take from this

The strategic implication survives the caveats. If you’re running OCR over a defined document distribution at volume, the procurement question has shifted from “which frontier API?” to “which specialized 3B base, and what’s the SFT/DPO lift on my data?” Dharma is a late-2024 EloGroup spin-off whose consulting DNA points at exactly that workflow 12 — and they’re not the only ones selling it. The harder, unsolved question is how to detect confident hallucinations in production when the model has stopped looping and started lying smoothly.


NVIDIA’s Nemotron diffusion 8B hits 4× AR speed on B200

Source: huggingface-blog · published 2026-05-23

TL;DR

  • NVIDIA shipped Nemotron-Labs Diffusion in 3B/8B/14B text and 8B VLM scales, with one checkpoint serving AR, diffusion, and self-speculative decoding.
  • Self-speculation mode hits ~865 tok/s on a B200, roughly 4× the AR baseline and a 6–6.4× tokens-per-forward-pass gain.
  • Accuracy edges Qwen3-8B by just 1.2%, a modest return on the architectural complexity.
  • The VLM-8B variant ships under the restrictive NSCLv1 license, not the open Nemotron license used for the text models.

One checkpoint, three decoders

The interesting bit of Nemotron-Labs Diffusion isn’t the speed number — it’s that NVIDIA packs autoregressive, block-wise diffusion (FastDiffuser, 32-token blocks), and self-speculative decoding into a single set of weights, toggled by attention mask at inference time. SGLang exposes this as a config flag (ar_mode=true). Self-speculation uses the diffusion path to draft tokens and the AR path to verify them, getting parallel throughput with causal correctness — and crucially, no separate draft model.

LMSYS confirms the structural trick that makes this cache-friendly: attention is bidirectional within a 32-token block but strictly causal across blocks, so RadixAttention and paged KV-caching still apply 13. That’s what lets the same checkpoint serve all three modes without a custom runtime.

Where it sits in the DLM race

865 tok/s on B200 is competitive, not category-leading. The current public leaderboard for diffusion LMs:

ModelThroughputHardware
Gemini Diffusion (Google)~1,479 tok/s 14undisclosed
Mercury Coder (Inception)700–1,100 tok/s 14H100
LLaDA 2.0-flash on SGLang~935 tok/s 13
Nemotron-Labs Diffusion 8B~865 tok/sB200

And the conversion recipe — continued-pretraining an AR model into a DLM — isn’t new. DiffuGPT and DiffuLLaMA did it with under 200B tokens of continued training 15, and Efficient-DLM 8B already reported 4.5× throughput over Dream 7B using a similar KV-cache-friendly block pattern 16. Nemotron’s 1.3T-token pretrain plus 45B-token SFT is a heavier investment, but the genuine novelty is integration: the tri-mode checkpoint plus drop-in SGLang support, not a new objective.

The reasoning-depth critique

The pointed pushback on text diffusion in general comes from Sean Goedecke, who argues the throughput is bought at a real cost:

A diffusion model has less space for the model to spend ‘thinking’ per token… it edits the entire output block during every pass, so the attention scores for every token must be recalculated against the entire context window every single time. 17

He also notes a short-form inefficiency: a yes/no answer still pays for a full denoising cycle 17. Nemotron’s self-speculation mode partially answers this — AR verification gates the output, so quality stays anchored — but it doesn’t dissolve the per-token compute objection. A 1.2% average accuracy bump over Qwen3-8B is a modest return on the architectural complexity, and most observers expect the throughput advantage to compress at high batch sizes where AR is already compute-bound.

The cleanest quality-vs-speed datapoint in the release is on the VLM card: linear self-speculation costs just 0.1% accuracy versus pure AR 18. If that holds on text benchmarks at scale, it’s the strongest case for the design.

License fine print

The 3B/8B/14B text models ship under the Nemotron Open Model License. The VLM-8B does not — it’s released under the NVIDIA Source Code License v1, which constrains commercial reuse 18. Worth reading before anyone wires it into a product on the strength of the “open weights” framing in the blog post.

The net: a well-engineered, SGLang-native entry in a crowded DLM field — not a Gemini Diffusion-killer, and not free of the reasoning-depth concerns practitioners have raised about diffusion text generation as a class.

Round-ups

OpenAI’s GPT-next cracks 80-year-old Erdős problem for under $1K

Source: latent-space

GPT-next disproved Erdős’s planar unit-distance conjecture, a problem open since the 1940s, using less than $1,000 in compute. The result lands as a quiet but pointed data point for AI-assisted mathematics, showing frontier models chipping at long-standing combinatorial questions rather than just competition-style proofs.

Footnotes

  1. UK AI Security Institute evaluationhttps://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities

    Mythos Preview became the first AI system to fully solve ‘The Last Ones,’ a 32-step simulated corporate network attack that typically requires 20 hours of human expert labor… while the model excels in ‘weakly defended’ environments, its efficacy against active human defenders remains unproven.

    2
  2. Daniel Stenberg (curl maintainer) bloghttps://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/

    I think using the term confirmed is a little amusing when the AI says it confidently by itself. Yes, the AI thinks they are confirmed, but the curl security team has a slightly different take… My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing.

  3. Forbes — Microsoft MDASH benchmarkhttps://www.forbes.com/sites/timkeary/2026/05/15/microsoft-mdash-beats-a-key-mythos-benchmark-heres-why-that-matters/

    Microsoft’s MDASH system recently outperformed Mythos on the same [CyberGym] benchmark, scoring 88.4% by utilizing a multi-agent ‘debate’ architecture that filters false positives.

  4. The Decoder — ExploitBench cost analysishttps://the-decoder.com/new-benchmark-shows-claude-mythos-and-gpt-5-5-can-develop-real-browser-exploits-autonomously/

    ExploitBench co-author Seunghyun Lee characterized Mythos as a ‘fairly competent’ researcher but noted its high operational cost; one full benchmark run cost approximately $36,000 compared to $3,000 for GPT-5.5.

  5. Tom’s Hardware — Linus Torvalds on AI reportshttps://www.tomshardware.com/software/linux/linus-torvalds-says-ai-bug-reports-have-made-the-linux-security-mailing-list-almost-entirely-unmanageable

    Linus Torvalds recently warned that the Linux kernel’s security mailing list has become ‘almost entirely unmanageable’ due to redundant, automated findings that offer no path to a fix.

  6. Mashable (Heidy Khlaaf critique)https://mashable.com/article/claude-mythos-preview-project-glasswing-pr-stunt-cybersecurity-experts

    Khlaaf argues that the name ‘Mythos’ and its associated marketing create a ‘mythology’ of unprecedented danger that serves to justify withholding the model from independent evaluation… forming a coalition with tech giants… may be establishing a de facto standard for ‘safe’ AI deployment that only the most well-funded entities can meet.

  7. NVIDIA Developer Blog — ‘How Small Language Models Are Key to Scalable Agentic AI’https://developer.nvidia.com/blog/how-small-language-models-are-key-to-scalable-agentic-ai/

    SLMs (defined as models <10B parameters capable of running on consumer-grade hardware) act as the primary workhorses… 10–30x cheaper to run per token in real-world agent systems

  8. Unsiloed.ai — ‘Unsiloed AI Achieves #1 Rank on olmOCR-Bench-2’https://www.unsiloed.ai/blog/unsiloed-ai-achieves-1-rank-on-olmocr-bench-2

    Unsiloed Parser recently claimed the #1 spot with a deterministic pass-rate of 88.0, narrowly edging out Nanonets OCR-3 (87.4) and significantly outperforming frontier models like GPT-5.5 (84.6) and Claude 4.7

  9. OpenReview methodology discussion of DharmaOCR-Benchmarkhttps://openreview.net/pdf/ee647c426b500b1e1e463bc1df156c6577c9e49c.pdf

    the use of BLEU in an OCR context is a point of methodological debate… Open questions remain regarding the ‘reproducibility’ of the human-in-the-loop labeling strategy used to create the ground truth

  10. Hacker News discussion on DharmaOCR / LLM-OCRhttps://news.ycombinator.com/item?id=46976224

    while traditional OCR (like Tesseract) fails predictably by outputting ‘gibberish’ when unconfident, LLM-based OCR can ‘confidently hallucinate,’ creating risks for sensitive documents like financial statements

  11. InfoWorld — ‘Small language models: Rethinking enterprise AI architecture’https://www.infoworld.com/article/4160404/small-language-models-rethinking-enterprise-ai-architecture.html

    critics point to ‘complete accuracy collapse’ when SLMs encounter tasks requiring multi-step reasoning or novel queries outside their training distribution… Amazon’s Rufus shopping assistant achieving only 32% recommendation accuracy and Air Canada’s chatbot inventing a non-existent refund policy

  12. sumpdibesus.blog — Dharma AI company profilehttps://sumpdibesus.blog/176369

    Founded as a strategic spin-off from EloGroup—one of Brazil’s leading management consulting firms… secured one of Brazil’s largest seed funding rounds, led by the Lorinvest fund

  13. LMSYS blog — Diffusion LLM in SGLanghttps://www.lmsys.org/blog/2025-12-19-diffusion-llm/

    LLaDA 2.0-flash can reach throughputs of up to 935 tokens per second—nearly 3.5x faster than standard AR models like gpt-oss-120B on comparable tasks; the block-wise attention is bidirectional within a single block yet remains strictly causal across blocks, allowing reuse of RadixAttention and paged KV-caching.

    2
  14. HuggingFace blog (ProCreations) — DLM landscapehttps://huggingface.co/blog/ProCreations/diffusion-language-model

    Google’s Gemini Diffusion leads the experimental pack at a reported 1,479 tokens per second… Mercury Coder demonstrates stable performance between 700 and 1,100 tokens per second on NVIDIA H100 GPUs.

    2
  15. Medium / ML-today — converting AR to diffusion LMshttps://medium.com/@ML-today/converting-autoregressive-language-models-to-diffusion-language-models-578d8c394648

    DiffuGPT and DiffuLLaMA were converted with relatively low data budgets—often fewer than 200 billion tokens—while maintaining the fluency and in-context learning capabilities of their predecessors.

  16. Liner review — Efficient-DLMhttps://liner.com/review/efficientdlm-from-autoregressive-to-diffusion-language-models-and-beyond-in

    Efficient-DLM 8B achieved a 4.5x higher throughput compared to contemporary models like Dream 7B, while maintaining higher accuracy by optimizing the attention mechanism for KV-cache compatibility.

  17. Sean Goedecke — ‘Limitations of text diffusion models’https://www.seangoedecke.com/limitations-of-text-diffusion-models/

    A diffusion model has less space for the model to spend ‘thinking’ per token… it edits the entire output block during every pass, so the attention scores for every token must be recalculated against the entire context window every single time.

    2
  18. HuggingFace — Nemotron-Labs-Diffusion-VLM-8B model cardhttps://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B

    The VLM variant is released under the more restrictive NVIDIA Source Code License (NSCLv1)… incurs only a 0.1% accuracy drop when running in linear self-speculation mode compared to its standard AR mode.

    2
Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare