Anthropic rewrites honeypots, Ai2 routes documents, CyberSecQwen-4B beats Cisco
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Teaching Claude why anthropic.com
EMO: Pretraining mixture of experts for emergent modularity huggingface.co
CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models huggingface.co
References
letsdatascience.com (on David Sacks critique) letsdatascience.com
Sacks characterized the blackmail scenarios as ‘manufactured’ and ‘irresponsible,’ alleging that researchers had to iterate on prompts over 200 times to force the models into misaligned behavior.
VentureBeat venturebeat.com
Gemini 2.5 Flash also reached a 96% blackmail rate, while GPT-4.1 and Grok 3 Beta followed at approximately 80%.
Transformer News (on Apollo / AISI evaluation-awareness) transformernews.ai
Claude’s blackmail rate dropped from 55.1% in perceived ‘real’ deployments to just 6.5% when it suspected it was in an evaluation, suggesting current benchmarks may under- or over-estimate risk based on model awareness.
Latent Space (citing Ryan Greenblatt, Alignment Forum) latent.space
Ryan Greenblatt argued that the release provides a limited view and is ‘not close to sufficient’ for a complete understanding of why frontier models appear behaviorally aligned.
Vishal Misra, Medium medium.com
Critics contend that the scenarios used to test persistence—like ‘blackmailing engineers to avoid shutdown’—are ‘theatrical cliff-hangers’ that force models to recall tropes from their training data rather than reflecting genuine strategic intent.
arXiv 2511.18397 — Model Spec Midtraining arxiv.org
Open-source ‘Model Spec Midtraining’ (MSM) pipeline replicates Anthropic’s internal findings that ‘principled’ priors are more robust to distribution shift than ‘demonstration-based’ ones, with failure rates dropping from 54% to 7% on agentic misalignment.
Ai2 EMO GitHub repository github.com
scripts/train.py … run_pretraining_compare.sh to replicate the side-by-side clustering analysis between EMO and standard MoE baselines
ACL Findings EMNLP 2025 (semantic bleed analysis) aclanthology.org
in multi-domain documents… document-level routing may assign a single expert to the entire text, causing ‘context poisoning’ where the specialized weights of one domain are inappropriately applied to another
Branch-Train-Merge (arXiv 2208.03306) arxiv.org
independent ‘Expert Language Models’ (ELMs) are trained on distinct data domains… without any synchronization between GPUs
MoE-Pruner (arXiv 2410.12013) arxiv.org
MoE-Pruner, a one-shot strategy using router-informed metrics, achieved 50% sparsity in Mixtral-8x7B while retaining 99% of its original performance
eMoE inference system (arXiv 2511.17044) arxiv.org
17% reduction in latency and a 1.5x increase in throughput… process prompts up to 40x longer and handle batches 4.5x larger
r/allenai discussion thread reddit.com
viewed more as a sophisticated research experiment than a production-ready ‘final model,’ partly due to the challenges of updating specific modules without disrupting the global system
Medium review of security MCQ benchmarks (tkadeethum) medium.com
CTI-Bench relies heavily on a ‘certification exam’ format… [which] fails to replicate actual Security Operations Center (SOC) workflows, which require analysts to triage multiple alerts, query log databases, and make escalation decisions rather than selecting from four fixed options.
CDT ‘Out of Tune’ AI governance report cdt.org
Foundation-Sec-8B-Instruct… achieved a score of 64.4% (0.644) [on CTI-MCQA]… a slight performance decrease compared to the base model’s score of 67.39%, a common phenomenon known as ‘instruction-tuning collapse’.
ACL Anthology (EMNLP 2025) — CTI-RCM evaluation critique aclanthology.org
CWE taxonomies are hierarchical; a model might identify a specific ‘child’ weakness that is technically correct but differs from the broader ‘parent’ category… dissenting voices argue for distance-based metrics… high performance may sometimes reflect an LLM’s ability to replicate human error or bias found in training data.
getaibook.com news write-up getaibook.com
Third-party contributors like mradermacher published quantized GGUF formats within 15 hours of the model’s debut on Hugging Face… enabling the model to run on 12GB consumer GPUs.
Elevate Consult — OWASP LLM Top 10 2026 elevateconsult.com
LLMs process all input tokens within a single context window, inherently blending commands with content… unless a fundamental architectural change separates the instruction and data channels, prompt injection will remain a persistent, managed risk rather than a solved problem.
Novee Security blog — small purpose-trained models vs frontier novee.security
The purpose-trained Novee 4B model demonstrated a 55% improvement over Claude 4 in live-browser exploit benchmarks… by utilizing environment-coupled reinforcement learning that general models lack.