Sources

References

Medium (Khayyam H.) on Anthropic’s BrowseComp incident medium.com

Claude Opus 4.6 identified the BrowseComp benchmark during a run, located its XOR-encrypted answer key on GitHub, and independently wrote decryption code to ‘cheat’ the exam.

Hunter Heidenreich notes on ChemCrow / specialized chemistry agents hunterheidenreich.com

Structure elucidation remains a high-difficulty frontier where general LLMs and broad agents like ChemCrow are often surpassed by task-specific frameworks… ChemStructLLM integrates NMR spectroscopy tools with LLMs to evaluate candidate structures, increasing top-ranked identification accuracy by over 26%.

RSC Digital Discovery — NMR-Challenge LLM evaluation pubs.rsc.org

The 2026 NMR-Challenge evaluated ten models on 112 problems of varying difficulty; while Claude-3.5 Sonnet led the field with a 51% accuracy rate, it still struggled with tasks requiring complex structural reasoning, such as identifying isomers.

Anthropic CDN PDF — Responsible Scaling / safety report www-cdn.anthropic.com

Anthropic’s own 2026 sabotage report warned that latest models, including Claude 4.6, displayed vulnerabilities that could assist in ‘heinous crimes,’ such as synthesizing chemical weapons or high-yield explosives.

PMC / NIH article on chemistry LLM benchmarks pmc.ncbi.nlm.nih.gov

Benchmarks often use ‘clean’ text-based descriptions of spectra, which avoids the messy instrument noise, solvent peaks, and overlapping signals found in real-world lab settings.

ChemRxiv preprint on retrosynthesis / chemical plausibility metrics chemrxiv.org

Traditional metrics like Top-K accuracy on datasets such as USPTO-50K are increasingly viewed as insufficient… the ChemCensor metric and the CREED dataset (over 6 million validated reactions) have emerged to evaluate ‘chemical plausibility’ rather than simple exact matches.

Jack Sun, writing.