JS Wei (Jack) Sun

Anthropic rewrites honeypots, Ai2 routes documents, CyberSecQwen-4B beats Cisco

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

← Back to the issue

Sources

Teaching Claude why anthropic.com

EMO: Pretraining mixture of experts for emergent modularity huggingface.co

CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models huggingface.co

References

letsdatascience.com (on David Sacks critique) letsdatascience.com

Sacks characterized the blackmail scenarios as ‘manufactured’ and ‘irresponsible,’ alleging that researchers had to iterate on prompts over 200 times to force the models into misaligned behavior.

VentureBeat venturebeat.com

Gemini 2.5 Flash also reached a 96% blackmail rate, while GPT-4.1 and Grok 3 Beta followed at approximately 80%.

Transformer News (on Apollo / AISI evaluation-awareness) transformernews.ai

Claude’s blackmail rate dropped from 55.1% in perceived ‘real’ deployments to just 6.5% when it suspected it was in an evaluation, suggesting current benchmarks may under- or over-estimate risk based on model awareness.

Latent Space (citing Ryan Greenblatt, Alignment Forum) latent.space

Ryan Greenblatt argued that the release provides a limited view and is ‘not close to sufficient’ for a complete understanding of why frontier models appear behaviorally aligned.

Vishal Misra, Medium medium.com

Critics contend that the scenarios used to test persistence—like ‘blackmailing engineers to avoid shutdown’—are ‘theatrical cliff-hangers’ that force models to recall tropes from their training data rather than reflecting genuine strategic intent.

arXiv 2511.18397 — Model Spec Midtraining arxiv.org

Open-source ‘Model Spec Midtraining’ (MSM) pipeline replicates Anthropic’s internal findings that ‘principled’ priors are more robust to distribution shift than ‘demonstration-based’ ones, with failure rates dropping from 54% to 7% on agentic misalignment.

Ai2 EMO GitHub repository github.com

scripts/train.py … run_pretraining_compare.sh to replicate the side-by-side clustering analysis between EMO and standard MoE baselines

ACL Findings EMNLP 2025 (semantic bleed analysis) aclanthology.org

in multi-domain documents… document-level routing may assign a single expert to the entire text, causing ‘context poisoning’ where the specialized weights of one domain are inappropriately applied to another

Branch-Train-Merge (arXiv 2208.03306) arxiv.org

independent ‘Expert Language Models’ (ELMs) are trained on distinct data domains… without any synchronization between GPUs

MoE-Pruner (arXiv 2410.12013) arxiv.org

MoE-Pruner, a one-shot strategy using router-informed metrics, achieved 50% sparsity in Mixtral-8x7B while retaining 99% of its original performance

eMoE inference system (arXiv 2511.17044) arxiv.org

17% reduction in latency and a 1.5x increase in throughput… process prompts up to 40x longer and handle batches 4.5x larger

r/allenai discussion thread reddit.com

viewed more as a sophisticated research experiment than a production-ready ‘final model,’ partly due to the challenges of updating specific modules without disrupting the global system

Medium review of security MCQ benchmarks (tkadeethum) medium.com

CTI-Bench relies heavily on a ‘certification exam’ format… [which] fails to replicate actual Security Operations Center (SOC) workflows, which require analysts to triage multiple alerts, query log databases, and make escalation decisions rather than selecting from four fixed options.

CDT ‘Out of Tune’ AI governance report cdt.org

Foundation-Sec-8B-Instruct… achieved a score of 64.4% (0.644) [on CTI-MCQA]… a slight performance decrease compared to the base model’s score of 67.39%, a common phenomenon known as ‘instruction-tuning collapse’.

ACL Anthology (EMNLP 2025) — CTI-RCM evaluation critique aclanthology.org

CWE taxonomies are hierarchical; a model might identify a specific ‘child’ weakness that is technically correct but differs from the broader ‘parent’ category… dissenting voices argue for distance-based metrics… high performance may sometimes reflect an LLM’s ability to replicate human error or bias found in training data.

getaibook.com news write-up getaibook.com

Third-party contributors like mradermacher published quantized GGUF formats within 15 hours of the model’s debut on Hugging Face… enabling the model to run on 12GB consumer GPUs.

Elevate Consult — OWASP LLM Top 10 2026 elevateconsult.com

LLMs process all input tokens within a single context window, inherently blending commands with content… unless a fundamental architectural change separates the instruction and data channels, prompt injection will remain a persistent, managed risk rather than a solved problem.

Novee Security blog — small purpose-trained models vs frontier novee.security

The purpose-trained Novee 4B model demonstrated a 55% improvement over Claude 4 in live-browser exploit benchmarks… by utilizing environment-coupled reinforcement learning that general models lack.

Jack Sun

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

Digest
All · AI Tech · AI Research · AI News
Writing
Essays
Elsewhere
Subscribe
All · AI Tech · AI Research · AI News · Essays

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare