Mythos finds 10K vulns, Nemotron 8B hits 4× AR, Dharma 3B beats Opus 52×
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Project Glasswing: An initial update anthropic.com
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models huggingface.co
Specialization Beats Scale: A Strategic Variable Most AI Procurement Decisions Overlook huggingface.co
(AINews) OpenAI GPT-next disproves 80 year old Erdős planar unit distance problem for under $1000 latent.space
GPT-next disproved Erdős’s planar unit-distance conjecture, a problem open since the 1940s, using less than $1,000 in compute. The result is a quiet but pointed data point for AI-assisted mathematics: it shows frontier models chipping away at long-standing combinatorial questions, not just competition-style proofs.
References
Daniel Stenberg (curl maintainer) blog daniel.haxx.se
I think using the term confirmed is a little amusing when the AI says it confidently by itself. Yes, the AI thinks they are confirmed, but the curl security team has a slightly different take… My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing.
Mashable (Heidy Khlaaf critique) mashable.com
Khlaaf argues that the name ‘Mythos’ and its associated marketing create a ‘mythology’ of unprecedented danger that serves to justify withholding the model from independent evaluation… forming a coalition with tech giants… may be establishing a de facto standard for ‘safe’ AI deployment that only the most well-funded entities can meet.
Forbes — Microsoft MDASH benchmark forbes.com
Microsoft’s MDASH system recently outperformed Mythos on the same [CyberGym] benchmark, scoring 88.4% by utilizing a multi-agent ‘debate’ architecture that filters false positives.
UK AI Security Institute evaluation aisi.gov.uk
Mythos Preview became the first AI system to fully solve ‘The Last Ones,’ a 32-step simulated corporate network attack that typically requires 20 hours of human expert labor… while the model excels in ‘weakly defended’ environments, its efficacy against active human defenders remains unproven.
Tom’s Hardware — Linus Torvalds on AI reports tomshardware.com
Linus Torvalds recently warned that the Linux kernel’s security mailing list has become ‘almost entirely unmanageable’ due to redundant, automated findings that offer no path to a fix.
The Decoder — ExploitBench cost analysis the-decoder.com
ExploitBench co-author Seunghyun Lee characterized Mythos as a ‘fairly competent’ researcher but noted its high operational cost; one full benchmark run cost approximately $36,000 compared to $3,000 for GPT-5.5.
NVIDIA Developer Blog — ‘How Small Language Models Are Key to Scalable Agentic AI’ developer.nvidia.com
SLMs (defined as models <10B parameters capable of running on consumer-grade hardware) act as the primary workhorses… 10–30x cheaper to run per token in real-world agent systems
InfoWorld — ‘Small language models: Rethinking enterprise AI architecture’ infoworld.com
critics point to ‘complete accuracy collapse’ when SLMs encounter tasks requiring multi-step reasoning or novel queries outside their training distribution… Amazon’s Rufus shopping assistant achieving only 32% recommendation accuracy and Air Canada’s chatbot inventing a non-existent refund policy
Unsiloed.ai — ‘Unsiloed AI Achieves #1 Rank on olmOCR-Bench-2’ unsiloed.ai
Unsiloed Parser recently claimed the #1 spot with a deterministic pass-rate of 88.0, narrowly edging out Nanonets OCR-3 (87.4) and significantly outperforming frontier models like GPT-5.5 (84.6) and Claude 4.7
Hacker News discussion on DharmaOCR / LLM-OCR news.ycombinator.com
while traditional OCR (like Tesseract) fails predictably by outputting ‘gibberish’ when unconfident, LLM-based OCR can ‘confidently hallucinate,’ creating risks for sensitive documents like financial statements
sumpdibesus.blog — Dharma AI company profile sumpdibesus.blog
Founded as a strategic spin-off from EloGroup—one of Brazil’s leading management consulting firms… secured one of Brazil’s largest seed funding rounds, led by the Lorinvest fund
OpenReview methodology discussion of DharmaOCR-Benchmark openreview.net
the use of BLEU in an OCR context is a point of methodological debate… Open questions remain regarding the ‘reproducibility’ of the human-in-the-loop labeling strategy used to create the ground truth
Sean Goedecke — ‘Limitations of text diffusion models’ seangoedecke.com
A diffusion model has less space for the model to spend ‘thinking’ per token… it edits the entire output block during every pass, so the attention scores for every token must be recalculated against the entire context window every single time.
LMSYS blog — Diffusion LLM in SGLang lmsys.org
LLaDA 2.0-flash can reach throughputs of up to 935 tokens per second—nearly 3.5x faster than standard AR models like gpt-oss-120B on comparable tasks; the block-wise attention is bidirectional within a single block yet remains strictly causal across blocks, allowing reuse of RadixAttention and paged KV-caching.
HuggingFace blog (ProCreations) — DLM landscape huggingface.co
Google’s Gemini Diffusion leads the experimental pack at a reported 1,479 tokens per second… Mercury Coder demonstrates stable performance between 700 and 1,100 tokens per second on NVIDIA H100 GPUs.
Medium / ML-today — converting AR to diffusion LMs medium.com
DiffuGPT and DiffuLLaMA were converted with relatively low data budgets—often fewer than 200 billion tokens—while maintaining the fluency and in-context learning capabilities of their predecessors.
Liner review — Efficient-DLM liner.com
Efficient-DLM 8B achieved a 4.5x higher throughput compared to contemporary models like Dream 7B, while maintaining higher accuracy by optimizing the attention mechanism for KV-cache compatibility.
HuggingFace — Nemotron-Labs-Diffusion-VLM-8B model card huggingface.co
The VLM variant is released under the more restrictive NVIDIA Source Code License (NSCLv1)… incurs only a 0.1% accuracy drop when running in linear self-speculation mode compared to its standard AR mode.
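The LMSYS excerpt above describes attention that is bidirectional within a block yet strictly causal across blocks. As a rough illustration only, the sketch below builds that visibility pattern as a boolean mask in plain Python. The function names `block_causal_mask` and `render` and the toy sizes are hypothetical, chosen for this example; this is not SGLang's or LLaDA's actual implementation.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Illustrative assumption: positions are grouped into consecutive blocks of
    `block_size`. A query sees every position in its own block (bidirectional
    within the block) and every position in earlier blocks, but nothing in
    later blocks (causal across blocks).
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # Key k is visible when its block index does not exceed the query's.
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


def render(mask: List[List[bool]]) -> str:
    """Draw the mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    # Six positions in blocks of two: each pair of rows shares the same view.
    print(render(block_causal_mask(seq_len=6, block_size=2)))
```

With `block_size=1` the mask reduces to the ordinary causal (lower-triangular) pattern. Because a finished block never attends to later blocks, its keys and values stay fixed once the block is complete, which is consistent with the excerpt's point about reusing paged KV-caching.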