Sources

🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik latent.space

95% of cancer treatments fail clinical trials, but the failure may be a matching problem, one that Noetik is tackling with autoregressive transformers such as TARIO-2.

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard huggingface.co

AI and the Future of Cybersecurity: Why Openness Matters huggingface.co

References

Fierce Biotech fiercebiotech.com

GSK inks model deal: $50M bet on Noetik’s cancer AI platform — five-year, non-exclusive license to OCTO-VC for NSCLC and colorectal cancer, with bespoke spatial datasets generated for GSK’s pipeline.

Noetik blog — TARIO-2 technical report noetik.blog

TARIO-2 predicts an 18,963-plex spatial map from H&E alone; scaling from 93 to 2,545 patients improved accuracy on 18,947 of 18,964 genes, with Moran’s I showing the model is most accurate on spatially concentrated genes and weakest on diffuse expression.

Ahlmann-Eltze et al., Nature Methods 2025 const-ae.name

None of the deep learning perturbation models — scGPT, scFoundation, GEARS — outperformed simple additive or PCA-based linear baselines; ‘this highlights the importance of critical benchmarking in directing and evaluating method development.’

ResearchGate — biomarker trial analysis researchgate.net

Biomarker-driven patient selection raised the likelihood of approval roughly fivefold across cancers — 12x in breast, 7–8x in melanoma and lung — providing the empirical basis for the ‘matching problem’ framing.

GuruFocus / Agenus press gurufocus.com

Agenus and Noetik are applying OCTO-VC to identify predictive biomarkers for the BOT/BAL immunotherapy program — the first deployment of the model inside an active clinical program to enrich for responders beyond PD-L1.

Kendiukhov, Medium — virtual cell critique kendiukhov.medium.com

Virtual cell models tend to learn co-expression patterns rather than causal regulatory logic; without mechanistic interpretability, ‘black box’ simulations face a steep climb to regulatory acceptance.

ResearchGate critical analysis (AlQadi) researchgate.net

QIMMA fails to report specific inter-annotator agreement metrics (such as Cohen’s Kappa or Krippendorff’s Alpha) for the human review process… reliance on only two models for the automated portion introduces a model-dependent bias where shared blind spots might allow errors to persist.

o16g.com industry analysis o16g.com

The researchers who designed the QIMMA validation pipeline—including Chief Researcher Hakim Hacid—are the same individuals overseeing the Falcon models that dominate these rankings… TII-led platforms frequently place Falcon models above larger, well-funded international competitors.

Hugging Face blog — AraGen / 3C3H huggingface.co

AraGen introduces the 3C3H measure, assessing model responses across six generative dimensions: Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness… a model might rank first on OALL for accuracy but drop significantly on AraGen due to a lack of conciseness or helpfulness.

MBZUAI Oryx — FannOrFlop GitHub github.com

FannOrFlop… 6,984 poem-explanation pairs spanning 12 historical eras from the Pre-Islamic period to contemporary 21st-century verse… even high-performing models struggle significantly with poetic reasoning, highlighting a major temporal generalization gap.

CrowdStrike security research crowdstrike.com

DeepSeek-R1’s coding output becomes up to 50% more likely to contain severe security vulnerabilities when prompts involve politically sensitive topics, suggesting that alignment for censorship may trigger emergent flaws.

Cerebras blog — Jais-2 cerebras.ai

Jais-2-70B achieved a score of 70.71% on AraGen-12-24, significantly outperforming Qwen2.5-72B (62.58%) and Llama-3.3-70B (61.29%)… reaching 2,000 tokens per second on Cerebras wafer-scale hardware.

AISLE blog (Stanislav Fort) aisle.com

A 3.6B active-parameter open-weight model (GPT-OSS-20B) running in a deliberately simple parallel scanner replicated the FreeBSD NFS and OpenBSD TCP SACK findings Anthropic used to justify Mythos’s restricted release — the moat is the operational system, not the model.

Radware blog on Claude Mythos validate.perfdrive.com

Mythos generates working exploits on its first attempt in roughly 83% of cases and creates ‘patch pressure’ that may overwhelm human developers and open-source contributors trying to verify and fix the flood of AI-discovered bugs.

The New Stack — ‘Drowning in AI Slop Reports’ thenewstack.io

Daniel Stenberg shut down curl’s HackerOne bounty in January 2026 after valid-report rates collapsed from ~15% to under 5%, calling the deluge ‘a DDoS attack on maintainers.’

UK AI Security Institute evaluation aisi.gov.uk

Mythos completed a 32-step simulated cyber-attack chain previously beyond AI models, though efficacy is currently highest against ‘weakly defended’ enterprise systems.

Bonixs commentary on Hugging Face stance bonixs.com

Critics point to Hugging Face’s own LeRobot RCE vulnerabilities as evidence that prioritizing prototyping convenience over foundational security undermines the credibility of their openness argument; some White House advisors questioned whether Anthropic was ‘crying wolf.’

Medium — ‘$12.5M pledge’ medium.com

OpenAI, Google, and Microsoft pledged $12.5 million to the Linux Foundation in March 2026 to help maintainers cope with AI-driven report volume — critics called it ‘arson followed by firefighting.’

Jack Sun, writing.