The pitches are confident; the people checking them aren't keeping up
Three AI pitches across oncology, Arabic benchmarks, and open-source security all rest on validation machinery that is conflicted, missing, or overwhelmed.
TL;DR
- Noetik’s $50M GSK deal pitches a virtual-cell transformer for cancer trials, but comparable models have failed to beat linear baselines in independent benchmarks.
- TII’s QIMMA found 2–88% rot in 14 Arabic benchmarks; on coding evals, 81–88% of prompts needed rewriting.
- QIMMA’s governance problem: the same team builds Falcon and scores it, with no inter-annotator agreement reported.
- Hugging Face frames open ecosystems as the defense against AI attackers — while curl killed its bounty over AI-generated bug spam in January.
- A 3.6B-active-param open model reproduced Anthropic’s Mythos findings, supporting the scaffolding-over-scale thesis.
Three pitches today, three different corners of the stack — oncology foundation models, Arabic LLM benchmarks, open-source security — and the same structural crack runs through all of them. Each story makes a confident claim about what AI can do, then leans on validation machinery that turns out to be conflicted, missing, or buckling. Noetik wants to fix cancer trial matching with a virtual-cell transformer, but the comparable models in its class haven’t beaten linear baselines on independent benchmarks, and Noetik hasn’t published a head-to-head. TII’s QIMMA leaderboard exists because the prior Arabic benchmarks were rotten — yet QIMMA’s own authors also build the models being scored on it. Hugging Face argues open ecosystems are the defense against AI-powered attackers, while the maintainer base that defense depends on is being buried in AI-generated bug reports. The interesting question today isn’t whether the pitches are right. It’s whether the people meant to check them have the time, independence, or numbers to do the job.
Noetik’s bet: foundation models can fix oncology’s matching problem — if the benchmarks hold
Source: latent-space · published 2026-04-20
TL;DR
- Noetik claims most cancer trial failures are patient-matching failures, and its TARIO-2 transformer predicts a ~19,000-gene spatial map from a cheap H&E slide.
- GSK’s $50M deal is real but narrower than pitched — five years, non-exclusive, NSCLC and colorectal only.
- Independent benchmarks on comparable “virtual cell” models (scGPT, scFoundation, GEARS) failed to beat linear baselines, and Noetik hasn’t published a head-to-head.
- The “matching problem” framing is empirically grounded: biomarker-stratified trials see ~5x higher approval odds.
The pitch: a matching problem, not a chemistry problem
Ron Alfa and Daniel Bear’s argument on Latent Space is that the 95% clinical-trial failure rate in oncology is mostly a sorting error — promising drugs given to the wrong patients. The empirical backbone is solid: a large-scale trial analysis found biomarker-driven enrollment raises the likelihood of approval roughly fivefold across cancers, and 12x in breast cancer [1]. If you can identify responders before Phase II, you rescue molecules that would otherwise look like duds.
Noetik’s bet is that you do that with a foundation model trained on multimodal human tumor data — spatial transcriptomics, spatial proteomics, H&E imaging, and whole-exome sequencing — rather than mouse models or cell lines that famously don’t translate.
TARIO-2: real numbers, real caveats
The flagship is TARIO-2, an autoregressive transformer that predicts an 18,963-gene spatial expression map from a standard H&E stain — the cheap, ubiquitous diagnostic almost every cancer patient already gets [2]. Scaling the training cohort from 93 to 2,545 patients improved accuracy on 18,947 of those 18,963 genes, which is the kind of clean scaling-law story the team leans on in the interview.
The caveat the podcast skipped is in Noetik’s own technical report: accuracy is measured via Moran’s I, and the model is strongest on spatially concentrated genes and weakest on diffuse ones [2]. That’s not a footnote — many of the immune-signaling and metabolic genes that matter most for immunotherapy response are diffuse by nature. “Predicts the transcriptome from H&E” is true on average and shakier exactly where the clinic needs it.
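For readers outside spatial biology: Moran’s I measures how spatially clustered a gene’s expression is across a slide — near +1 for concentrated hotspots, near 0 for diffuse expression. A minimal sketch of the statistic itself (the k-nearest-neighbour weighting here is an illustrative choice, not necessarily the one in Noetik’s report):

```python
import numpy as np

def morans_i(coords: np.ndarray, values: np.ndarray, k: int = 6) -> float:
    """Moran's I spatial autocorrelation for one gene.

    coords: (n, 2) spot positions on the slide; values: (n,) expression.
    Weights: binary k-nearest-neighbour adjacency (illustrative choice).
    ~ +1 = spatially concentrated expression; ~ 0 = diffuse.
    """
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a spot is not its own neighbour
    w = np.zeros((n, n))
    nearest = np.argsort(d, axis=1)[:, :k]
    w[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0

    z = values - values.mean()           # centred expression
    return (n / w.sum()) * (w * np.outer(z, z)).sum() / (z @ z)
```

The caveat follows directly from the construction: a metric built on spatial structure is most informative exactly where expression has spatial structure, so “accurate by Moran’s I” says the least about the diffuse genes.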
The benchmarking problem nobody wants to run
Here’s the uncomfortable context. Ahlmann-Eltze et al. (Nature Methods, 2025) benchmarked the closest analogues to TARIO-class models — scGPT, scFoundation, GEARS — and found none of them beat simple additive or PCA-based linear baselines on perturbation prediction [3].
“This highlights the importance of critical benchmarking in directing and evaluating method development.” [3]
Noetik’s “simulate how a patient will respond” pitch lives in exactly this category, and there is no public head-to-head against a linear baseline on a held-out cohort. A parallel critique from interpretability researchers argues virtual-cell models learn co-expression patterns rather than causal regulatory logic — a problem when regulators eventually ask why the model picked a responder [4].
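To make the bar concrete: the “simple additive” baseline those models failed to beat is roughly one line — predict a combined perturbation’s expression profile as the control plus the sum of each single perturbation’s shift. A sketch under that reading of the benchmark (variable names are mine):

```python
import numpy as np

def additive_baseline(ctrl: np.ndarray,
                      single_a: np.ndarray,
                      single_b: np.ndarray) -> np.ndarray:
    """Predict the expression profile of a double perturbation (a + b)
    as control + effect(a) + effect(b), where each effect is a single
    perturbation's mean shift from control. No learned parameters at all.
    """
    return ctrl + (single_a - ctrl) + (single_b - ctrl)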
What GSK actually bought
The $50M GSK deal is real, and it’s a genuine validation signal — but the trade-press details narrow the story. It’s a five-year, non-exclusive license to OCTO-VC, scoped to NSCLC and colorectal cancer, plus bespoke spatial datasets generated against GSK’s pipeline [5]. That’s two indications, not the pan-oncology platform play the “platform licensing company” framing implies.
The more interesting validation is quieter: Agenus is using OCTO-VC inside its active botensilimab/balstilimab program to enrich for responders beyond PD-L1 [6]. That’s a prospective clinical bet, not retrospective cohort slicing — and it’s the test that will actually tell us whether virtual-cell embeddings beat the crude biomarkers they’re meant to replace.
What’s at stake
If TARIO-2 generalizes, an H&E slide becomes a transcriptomic readout for every cancer patient on earth, and cohort selection stops being a luxury. If it doesn’t generalize past spatially concentrated genes — or if a linear baseline matches it on response prediction — Noetik joins a long list of foundation-model biotech stories that scaled compute faster than they scaled rigor. The Agenus readout will be the one to watch.
QIMMA scrubs Arabic benchmarks — and quietly indicts the ones it replaces
Source: huggingface-blog · published 2026-04-21
TL;DR
- TII’s QIMMA leaderboard re-runs 14 Arabic benchmarks after a two-stage cleaning pipeline that found rot ranging from 2% to 88% per source.
- On Arabic-adapted HumanEval+/MBPP+, 81–88% of prompts had to be rewritten — a damning verdict on prior coding evals.
- Qwen3.5-397B leads (68.06), but Arabic specialists Karnak and Jais-2-70B beat it on STEM, legal, and culture domains.
- Critics flag a governance problem: the benchmark authors also build Falcon, and the pipeline reports no inter-annotator agreement stats.
What QIMMA actually measures
The Technology Innovation Institute’s QIMMA (“summit”) leaderboard isn’t a new benchmark — it’s a quality filter bolted onto 14 existing ones, covering 52,000 samples across STEM, legal, medical, culture, safety, literature, and code. Every sample is scored 1–10 by both Qwen3-235B and DeepSeek-V3; anything both models rate below 7 is dropped, and disagreements escalate to native Arabic reviewers.
```mermaid
flowchart LR
    A[14 source benchmarks<br/>52k samples] --> B{Qwen3-235B<br/>+ DeepSeek-V3}
    B -->|both <7/10| D[Discard]
    B -->|split verdict| C[Native-speaker review]
    B -->|both ≥7/10| E[QIMMA eval set]
    C --> E
```
The headline numbers are the cleaning rates, not the rankings. ArabicMMLU lost 3.1% of samples to factual errors in gold answers, illegible text, and regional stereotypes. MizanQA lost 2.3%. But on the Arabic-translated HumanEval+ and MBPP+ coding sets, 81–88% of prompts needed rewriting to fix linguistic errors and broken code structure — a strong implicit claim that most published Arabic coding scores so far have been measuring noise.
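Mechanically, the filter is a small piece of routing code wrapped around two expensive judges. A sketch of the logic as the post describes it (`judge_a`/`judge_b` stand in for calls to Qwen3-235B and DeepSeek-V3; nothing here is TII’s actual implementation):

```python
from dataclasses import dataclass

THRESHOLD = 7  # both judges must score a sample >= 7/10 to keep it

@dataclass
class Verdict:
    keep: list      # passed both judges -> QIMMA eval set
    discard: list   # failed both judges
    escalate: list  # split verdict -> native-speaker review

def filter_samples(samples, judge_a, judge_b) -> Verdict:
    """QIMMA-style two-stage cleaning: dual LLM raters, human escalation.

    judge_a / judge_b: callables mapping a sample to a 1-10 quality score.
    """
    v = Verdict([], [], [])
    for s in samples:
        a_ok = judge_a(s) >= THRESHOLD
        b_ok = judge_b(s) >= THRESHOLD
        if a_ok and b_ok:
            v.keep.append(s)
        elif not a_ok and not b_ok:
            v.discard.append(s)
        else:
            v.escalate.append(s)
    return v
```

The design choice doing the real work is the split-verdict branch: humans only see samples the two models disagree on, which keeps the native-speaker workload bounded — and makes whatever blind spots the two judges share invisible by construction.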
What the rankings show
Qwen3.5-397B tops the table at 68.06, dominating the coding split where Arabic-centric models still trail badly. But the next two slots go to specialists: Karnak (66.20) leads STEM and legal, and Jais-2-70B-Chat (65.81) wins on ArabicMMLU and culture. Cerebras’s independent measurements back the Jais-2 story — 70.71% on AraGen-12-24 versus 62.58% for the similarly sized Qwen2.5-72B and 61.29% for Llama-3.3-70B [7]. Fanar-1-9B occasionally beats far larger multilingual models on domain tests, reinforcing the pattern: for native Arabic tasks, language-specialised pretraining still beats raw scale.
The governance problem QIMMA doesn’t address
The most pointed critique isn’t methodological. An independent analysis notes that QIMMA’s authors — including TII Chief Researcher Hakim Hacid — also oversee Falcon, TII’s own model family that has historically dominated TII-run leaderboards [8]. Falcon-H1 is conspicuously absent from QIMMA’s top three, which softens the self-dealing charge but raises the inverse question of why a TII yardstick under-ranks TII’s flagship.
“TII-led platforms frequently place Falcon models above larger, well-funded international competitors.” [8]
The same review flags two unreported numbers that should be table stakes for a “quality-first” eval: inter-annotator agreement (Cohen’s Kappa or Krippendorff’s Alpha) for the human-review stage, and any sensitivity analysis on the two-judge filter [9][8]. Using only Qwen3-235B and DeepSeek-V3 as automated raters bakes in whatever blind spots they share — and CrowdStrike has separately documented DeepSeek-R1’s output becoming up to 50% more likely to contain severe security flaws when prompts touch politically sensitive topics, a failure mode that matters when judging culturally charged Arabic content [10].
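The omission is notable partly because the statistic is nearly free to compute. For the binary keep/drop decisions, Cohen’s Kappa between two raters is a few lines (a sketch; in QIMMA’s case `a` and `b` would be the two models’ verdicts):

```python
import numpy as np

def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's Kappa for two raters' binary keep(1)/drop(0) decisions.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance given each rater's marginal
    keep rate. 0 = chance-level agreement, 1 = perfect agreement.
    """
    p_o = np.mean(a == b)
    p_e = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (p_o - p_e) / (1 - p_e)
```

High kappa would not prove the judges are right — two models sharing a blind spot agree confidently — but reporting it is the minimal transparency bar a “quality-first” leaderboard should clear.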
What’s missing
QIMMA also positions itself as the Arabic-eval frontier without engaging the parallel AraGen project, whose 3C3H rubric scores generative responses on six axes (correctness, completeness, conciseness, helpfulness, honesty, harmlessness) and produces meaningfully different rankings — accuracy leaders sometimes drop sharply for verbosity or unhelpfulness 11. The literature split, meanwhile, leans on MBZUAI-Oryx’s FannOrFlop corpus of 6,984 poem-explanation pairs across 12 historical eras 12, a provenance the TII post under-credits.
The cleaning work is real and overdue. The question is whether the next iteration will hold the meta-evaluation — judges, annotators, governance — to the same bar.
Hugging Face’s openness pitch meets the patch-pressure crisis
Source: huggingface-blog · published 2026-04-21
TL;DR
- Hugging Face argues open ecosystems are the structural defense against AI-powered attackers like Anthropic’s Mythos.
- The “scaffolding beats model size” thesis got fast empirical backing: a 3.6B-active-param open model reproduced Mythos’s headline findings.
- But the open maintainer base the argument leans on is already drowning in AI-generated bug reports — curl killed its bounty in January.
- Hugging Face’s own LeRobot RCEs and a $12.5M industry pledge to the Linux Foundation underline the credibility gap.
The technical claim landed
Hugging Face’s central technical bet — that AI cyber capability is “jagged,” and a small specialized model wrapped in good scaffolding can match a frontier system — got validated within days of Mythos’s release. AISLE’s nano-analyzer, a single-file parallel scanner driving GPT-OSS-20B (3.6B active parameters, roughly $0.11 per million tokens), reproduced the FreeBSD NFS and OpenBSD TCP SACK vulnerabilities Anthropic cited to justify Mythos’s restricted release [13].
“The moat is the operational system, not the model.”
That’s Stanislav Fort, and it’s effectively the Hugging Face essay restated as an experiment. Anthropic’s Julia Merz countered that AISLE pre-scoped the codebase — “took the needle the model found and gave it to a small child” — so parity is contested, not settled. UK AISI’s independent evaluation adds a second data point worth taking seriously: Mythos completed a 32-step simulated attack chain previously beyond AI models, though efficacy was concentrated against “weakly defended” enterprise systems [14]. Strong against soft targets, weaker against hardened ones — a nuance absent from both Anthropic’s marketing and Hugging Face’s rebuttal.
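AISLE hasn’t released nano-analyzer, but the shape it describes — a deliberately simple scanner fanning files out to a small model in parallel — is a few dozen lines. The following is a reconstruction from the blog post’s description only; the client, prompt, and model identifier are all placeholders:

```python
import asyncio
import pathlib

MODEL = "gpt-oss-20b"  # the 3.6B-active-parameter open model AISLE cites
PROMPT = "Audit this C source file for memory-safety and protocol-logic bugs:\n\n{src}"

async def scan_file(llm, path, sem):
    """One file per independent model call — the 'scaffolding' is the fan-out."""
    async with sem:
        src = path.read_text(errors="ignore")
        # `llm` is a stand-in for any async chat-completion client.
        finding = await llm(MODEL, PROMPT.format(src=src))
        return path, finding

async def scan_tree(llm, root, concurrency=32):
    sem = asyncio.Semaphore(concurrency)   # bound concurrent model calls
    files = pathlib.Path(root).rglob("*.c")
    return await asyncio.gather(*(scan_file(llm, f, sem) for f in files))
```

Note what this doesn’t settle: Merz’s objection is essentially about the `root` argument — if you already know which subtree holds the bug, the scanner’s job is radically easier.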
The defense story has a maintainer problem
Where the essay gets shakier is its picture of distributed, open-source response as the natural counterweight. The bandwidth that argument assumes is visibly collapsing.
```mermaid
flowchart LR
    A[AI attacker<br/>~83% first-try exploits] --> B[Flood of reports<br/>+ working PoCs]
    C[AI 'slop' submitters] --> B
    B --> D{Open-source<br/>maintainers}
    D -->|valid rate<br/>15% → <5%| E[curl shuts<br/>HackerOne, Jan 2026]
    D -->|triage debt| F[Patch pressure]
    G[$12.5M OpenAI/Google/MS<br/>pledge to Linux Foundation] -.->|'arson then<br/>firefighting'| D
```
Radware reports Mythos generates working exploits on the first attempt roughly 83% of the time, with experts warning that the resulting “patch pressure” risks overwhelming the volunteer base expected to verify and fix the flood [15]. The curl project is the canonical case: Daniel Stenberg shut down its HackerOne bounty in January 2026 after valid-report rates collapsed from ~15% to under 5%, calling the AI report flood “a DDoS attack on maintainers” [16]. OpenAI, Google and Microsoft followed with a $12.5M pledge to the Linux Foundation in March — dismissed by some as “arson followed by firefighting” from the same vendors enabling the deluge [17].
Hugging Face’s appeal to “the Linux kernel security team” as a model of distributed response is exactly the cohort getting buried.
Credibility friction
The messenger doesn’t help. Critics point to Hugging Face’s own LeRobot remote-code-execution vulnerabilities as evidence that the “prototyping convenience first” culture undermines the openness-as-security argument; some White House advisors went further and questioned whether Anthropic was “crying wolf” with Mythos in the first place [18].
What’s actually at stake
The Hugging Face essay is right on the architecture: scaffolding, not parameter count, is where defensive leverage lives, and proprietary obscurity is a fading defense once AI can chew through stripped binaries. But “lean into openness” only works if the open side has hands to do the work. Right now those hands are quitting bug bounties. A serious version of this argument needs to address triage automation, signed-PoC requirements, and maintainer funding as load-bearing components — not as footnotes to a values pitch.
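Concretely, “triage automation with signed-PoC requirements” could be as blunt as refusing to show a human any report whose proof-of-concept doesn’t reproduce in a sandbox. A sketch of that gate — purely illustrative, not something curl or the Linux Foundation has announced:

```python
import subprocess

def gate_report(report: dict, timeout_s: int = 60) -> str:
    """Pre-triage gate: maintainers only see reports whose attached PoC
    actually reproduces. All field names here are hypothetical.
    """
    poc = report.get("poc_script")
    if not poc:
        return "auto-reject: no proof-of-concept attached"
    try:
        # './run-in-sandbox.sh' is a placeholder for a real isolation layer;
        # by convention the PoC exits nonzero when the bug reproduces.
        result = subprocess.run(["./run-in-sandbox.sh", poc],
                                timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "auto-reject: PoC timed out"
    if result.returncode != 0:
        return "queue for human triage"
    return "auto-reject: PoC did not reproduce"
```

The point of the sketch is the asymmetry it restores: AI-generated slop is cheap to submit precisely because nothing currently forces a report to be checkable before a maintainer’s time is spent on it.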
Footnotes
1. ResearchGate — biomarker trial analysis — https://www.researchgate.net/publication/349547443_Does_biomarker_use_in_oncology_improve_clinical_trial_failure_risk_A_large-scale_analysis — Biomarker-driven patient selection raised the likelihood of approval roughly fivefold across cancers — 12x in breast, 7–8x in melanoma and lung — providing the empirical basis for the ‘matching problem’ framing.
2. Noetik blog — TARIO-2 technical report — https://www.noetik.blog/p/tario-2-a-whole-transcriptome-foundation — TARIO-2 predicts an 18,963-plex spatial map from H&E alone; scaling from 93 to 2,545 patients improved accuracy on 18,947 of 18,963 genes, with Moran’s I showing the model is most accurate on spatially concentrated genes and weakest on diffuse expression.
3. Ahlmann-Eltze et al., Nature Methods 2025 — https://const-ae.name/publication/pert_prediction_benchmark/ — None of the deep learning perturbation models — scGPT, scFoundation, GEARS — outperformed simple additive or PCA-based linear baselines; ‘this highlights the importance of critical benchmarking in directing and evaluating method development.’
4. Kendiukhov, Medium — virtual cell critique — https://kendiukhov.medium.com/virtual-cell-ai-models-need-mechanistic-interpetability-7313d053c363 — Virtual cell models tend to learn co-expression patterns rather than causal regulatory logic; without mechanistic interpretability, ‘black box’ simulations face a steep climb to regulatory acceptance.
5. Fierce Biotech — ‘GSK inks model deal: $50M bet on Noetik’s cancer AI platform’ — https://www.fiercebiotech.com/biotech/gsk-inks-model-deal-50m-bet-noetiks-cancer-ai-platform — Five-year, non-exclusive license to OCTO-VC for NSCLC and colorectal cancer, with bespoke spatial datasets generated for GSK’s pipeline.
6. GuruFocus / Agenus press — https://www.gurufocus.com/news/2930697/agenus-and-noetik-enter-collaboration-to-develop-aienabled-predictive-biomarkers-for-botbal-using-foundation-models-of-virtual-cell-biology-agen-stock-news?mobile=true%3Fmobile%3Dtrue&mobile=true — Agenus and Noetik are applying OCTO-VC to identify predictive biomarkers for the BOT/BAL immunotherapy program — the first deployment of the model inside an active clinical program to enrich for responders beyond PD-L1.
7. Cerebras blog — Jais-2 — https://www.cerebras.ai/blog/jais2 — Jais-2-70B achieved a score of 70.71% on AraGen-12-24, significantly outperforming Qwen2.5-72B (62.58%) and Llama-3.3-70B (61.29%), reaching 2,000 tokens per second on Cerebras wafer-scale hardware.
8. o16g.com industry analysis — https://o16g.com/resources/ — The researchers who designed the QIMMA validation pipeline — including Chief Researcher Hakim Hacid — are the same individuals overseeing the Falcon models that dominate these rankings; TII-led platforms frequently place Falcon models above larger, well-funded international competitors.
9. ResearchGate critical analysis (AlQadi) — https://www.researchgate.net/scientific-contributions/Leen-AlQadi-2319737529 — QIMMA fails to report specific inter-annotator agreement metrics (such as Cohen’s Kappa or Krippendorff’s Alpha) for the human review process; reliance on only two models for the automated portion introduces a model-dependent bias where shared blind spots might allow errors to persist.
10. CrowdStrike security research — https://www.crowdstrike.com/en-us/blog/crowdstrike-researchers-identify-hidden-vulnerabilities-ai-coded-software/ — DeepSeek-R1’s coding output becomes up to 50% more likely to contain severe security vulnerabilities when prompts involve politically sensitive topics, suggesting that alignment for censorship may trigger emergent flaws.
11. Hugging Face blog — AraGen / 3C3H — https://huggingface.co/blog/leaderboard-3c3h-aragen — AraGen introduces the 3C3H measure, assessing model responses across six generative dimensions (Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness); a model might rank first on OALL for accuracy but drop significantly on AraGen due to a lack of conciseness or helpfulness.
12. MBZUAI Oryx — FannOrFlop GitHub — https://github.com/mbzuai-oryx/FannOrFlop — 6,984 poem-explanation pairs spanning 12 historical eras from the Pre-Islamic period to contemporary 21st-century verse; even high-performing models struggle significantly with poetic reasoning, highlighting a major temporal generalization gap.
13. AISLE blog (Stanislav Fort) — https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier — A 3.6B active-parameter open-weight model (GPT-OSS-20B) running in a deliberately simple parallel scanner replicated the FreeBSD NFS and OpenBSD TCP SACK findings Anthropic used to justify Mythos’s restricted release — the moat is the operational system, not the model.
14. UK AI Security Institute evaluation — https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities — Mythos completed a 32-step simulated cyber-attack chain previously beyond AI models, though efficacy is currently highest against ‘weakly defended’ enterprise systems.
15. Radware — Mythos generates working exploits on its first attempt in roughly 83% of cases and creates ‘patch pressure’ that may overwhelm human developers and open-source contributors trying to verify and fix the flood of AI-discovered bugs.
16. The New Stack — ‘Drowning in AI Slop Reports’ — https://thenewstack.io/drowning-in-ai-slop-reports-curl-ends-bug-bounties/ — Daniel Stenberg shut down curl’s HackerOne bounty in January 2026 after valid-report rates collapsed from ~15% to under 5%, calling the deluge ‘a DDoS attack on maintainers.’
17. Medium — ‘$12.5M pledge’ — https://medium.com/predict/tech-giants-just-pledged-12-5m-fe6ca6326bbc — OpenAI, Google, and Microsoft pledged $12.5 million to the Linux Foundation in March 2026 to help maintainers cope with AI-driven report volume — critics called it ‘arson followed by firefighting.’
18. Bonixs commentary on Hugging Face stance — https://www.bonixs.com/ — Critics point to Hugging Face’s own LeRobot RCE vulnerabilities as evidence that prioritizing prototyping convenience over foundational security undermines the credibility of their openness argument; some White House advisors questioned whether Anthropic was ‘crying wolf.’