DeepSeek-V4 closes the open-weights capability gap — and reopens others
DeepSeek-V4 matches frontier benchmarks under a true MIT license, but hallucination rates, token burn, and training hardware tell a more complicated story.
TL;DR
- DeepSeek-V4 ships in 1.6T Pro and 284B Flash MoE configurations with 1M context and hybrid attention that shrinks the KV cache to about 2% of a standard cache.
- Independent evaluations place V4 in frontier-tier agentic territory but log a 94% hallucination rate on unknowns and 4× peer token usage.
- The release moves to a true MIT license, the most permissive terms yet for a frontier-class open-weights model.
- Inference runs on Huawei Ascend hardware, but pretraining still relied on Nvidia H800s after Ascend 910B stability issues.
Today’s research story is a single release, but it’s the kind that resets the conversation. DeepSeek-V4 arrives as a credibly frontier-tier open-weights system — trillion-plus-parameter MoEs, million-token context, a genuinely permissive MIT license — and the independent numbers back the headline claims on capability.
What makes the release worth dwelling on is what comes attached. The same benchmarks that confirm frontier scores also surface a 94% hallucination rate on unanswerable questions and roughly 4× the token consumption of peer models. The “runs on Huawei Ascend” sovereignty pitch holds for inference but not for the training run, which still went through Nvidia H800s. The capability gap to closed labs is closing; the reliability and efficiency gaps are where the next argument starts.
DeepSeek-V4 lands frontier-tier — and exposes the seams in the open-weights race
Source: huggingface-blog · published 2026-04-24
TL;DR
- DeepSeek-V4 ships as 1.6T Pro and 284B Flash MoEs with 1M context and hybrid attention that cuts KV cache to ~2% of standard.
- Independent benchmarks confirm frontier-tier agentic scores but flag 94% hallucination on unknowns and 4× higher token burn than peers.
- True MIT license replaces V3’s bespoke terms — the most permissive frontier-class open-weights release to date.
- “Runnable on Huawei Ascend” is real; pretraining still ran on Nvidia H800s after Ascend 910B stability problems.
A frontier release with calibration problems
The V4 drop is four checkpoints (Pro/Flash × Base/Instruct), three reasoning modes, and one architectural bet: alternating Compressed Sparse Attention (4× compression with an FP4 lightning indexer) and Heavily Compressed Attention (128×), shrinking the KV cache to roughly 2% of a standard cache. The payoff is concrete: V4-Pro runs at 27% of V3.2’s per-token FLOPs and still holds 0.59 MRCR retrieval accuracy at the full 1M-token window.
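The post doesn’t give the CSA:HCA layer ratio, but the quoted ~2% constrains it. Here is a minimal back-of-the-envelope sketch; the layer counts below are illustrative assumptions, and only the 4×/128× compression factors and the ~2% target come from the release coverage:

```python
# Back-of-the-envelope KV-cache estimate for a stack mixing Compressed Sparse
# Attention (4x KV compression) and Heavily Compressed Attention (128x).
# The layer counts are assumptions for illustration; only the 4x/128x factors
# and the ~2% target come from the release coverage.

def kv_cache_fraction(csa_layers: int, hca_layers: int,
                      csa_compression: float = 4.0,
                      hca_compression: float = 128.0) -> float:
    """Fraction of a standard KV cache that survives per-layer compression."""
    total = csa_layers + hca_layers
    kept = csa_layers / csa_compression + hca_layers / hca_compression
    return kept / total

# A strict 1:1 alternation keeps ~12.9%, far above the quoted figure, so the
# stack must lean heavily on HCA layers. Roughly 1 CSA layer per 19 HCA
# layers lands on the quoted ~2%:
print(f"1:1 split  -> {kv_cache_fraction(30, 30):.1%}")  # 12.9%
print(f"1:19 split -> {kv_cache_fraction(3, 57):.1%}")   # 2.0%
```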
Artificial Analysis broadly corroborates the headline: V4-Pro is #2 on the open-weights Intelligence Index behind Kimi K2.6 and #1 on the GDPval-AA agentic benchmark [1]. The vendor-reported numbers hold up where they were checkable:
| Benchmark | V4-Pro | Notes |
|---|---|---|
| SWE-bench Verified | 80.6% | matches Gemini-3.1-Pro |
| Toolathlon | 51.8 | beats K2.6 (50.0) and Gemini-3.1-Pro (48.8) |
| Terminal Bench 2.0 | 67.9 | ahead of GLM-5.1 |
| AA Intelligence Index | #2 open | behind K2.6 at #1 [1] |
| AA-Omniscience Index | −10 | prefers guessing to abstaining [2] |
| LMArena (open coders) | #3 | draws a “benchmaxxing” critique [3] |
The same AA shop that crowned V4-Pro also published the worst finding in the cluster: V4-Flash hallucinates 96% of the time on questions it doesn’t know, V4-Pro 94%, and V4-Pro burned ~190M tokens running their index, about 4× the field average [2]. The cheap per-token pricing partially evaporates at the task level. Crowdsourced LMArena puts V4-Pro only third among open-weight coders, and the community attributes the gap to over-fitting on SWE-bench-style evals [3].
> V4-Pro scores −10 on the Omniscience Index, indicating it would rather guess than abstain. [2]
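That token burn interacts badly with pricing. A minimal cost sketch: the $0.14/M Flash input price is from the release and the ~4× multiplier was measured for V4-Pro on AA’s index, while the peer price (and applying Pro’s multiplier to Flash’s sticker price) is an illustrative assumption:

```python
# Effective task cost = tokens consumed x per-token price. A low sticker
# price buys less than it appears to when the model burns ~4x the tokens.
# PEER_PRICE_PER_M is a hypothetical figure; the 4x multiplier was measured
# for V4-Pro and is applied to the Flash sticker price for illustration.

FLASH_PRICE_PER_M = 0.14  # $/M input tokens, from the release
PEER_PRICE_PER_M = 0.45   # $/M input tokens, assumed peer price

def task_cost(tokens_m: float, price_per_m: float) -> float:
    """Dollar cost of a task consuming `tokens_m` million tokens."""
    return tokens_m * price_per_m

peer_tokens_m = 1.0                # normalize a task to 1M peer tokens
v4_tokens_m = 4.0 * peer_tokens_m  # ~4x token burn per Artificial Analysis

print(f"peer:    ${task_cost(peer_tokens_m, PEER_PRICE_PER_M):.2f}")  # $0.45
print(f"V4 @ 4x: ${task_cost(v4_tokens_m, FLASH_PRICE_PER_M):.2f}")   # $0.56
# On these assumptions, a ~3x per-token discount flips into a ~25%
# task-level cost premium.
```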
The Huawei pivot is louder than the model
The cluster’s most consequential signal isn’t CSA/HCA; it’s that V4 is the first frontier-class open model engineered against Huawei’s CANN/Ascend stack. Jensen Huang publicly called the optimization “a horrible outcome” for the United States, conceding that CUDA’s moat is being actively eroded [4]. That admission is the news.
The substance is messier than the headline. DeepSeek reportedly reverted to Nvidia H800s for V4 pretraining after Ascend 910B exhibited stability issues and “glacial” interconnect speeds, with Ascend 950PR Supernodes only handling continued training and inference [5]. “Runnable on Huawei” is real; “trained on Huawei” is still aspirational. The export-control thesis survives another quarter, but barely.
The sleeper win is the license
Simon Willison flagged the detail that benchmark coverage buried: V4 ships under a true MIT license, replacing V3’s bespoke DeepSeek Model License [6]. Combined with Flash inference at $0.14/M input tokens, this removes the cleanest wedge GLM-5 partisans had been using on enterprise procurement. Fine-tuners no longer need legal review on the weights themselves, a quieter unlock than 1M context and probably a more durable one.
Net read: the architecture is impressive, the agentic numbers are real, the calibration is poor [2], the Huawei story is half-marketing [5]. The MIT license [6] and the credible threat to Nvidia’s ecosystem [4] are what will still matter in six months. The benchmark crown [1][3] is contested and almost certainly temporary.
Further reading
- [AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips — latent-space
Footnotes
1. Artificial Analysis — https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash — “DeepSeek V4-Pro is back among the leading open-weights models, ranking #2 on the Intelligence Index behind Kimi K2.6, while leading all open-weights competitors on the GDPval-AA agentic benchmark.”
2. Artificial Analysis, AA-Omniscience — https://artificialanalysis.ai/evaluations/omniscience — “On the AA-Omniscience benchmark V4-Flash hallucinates 96% of the time and V4-Pro 94% when it doesn’t know the answer; V4-Pro scores −10 on the Omniscience Index, indicating it would rather guess than abstain.”
3. r/singularity discussion of LMArena results — https://www.reddit.com/r/singularity/comments/1suci24/deepseek_v4_pro_underwhelms_on_arena_crowdsourced/ — “DeepSeek V4-Pro underwhelms on Arena’s crowdsourced ranking, sitting only third among open-weight coding models despite vendor claims of GPT-5.5-class performance — community attributes the gap to ‘benchmaxxing’ on SWE-bench-style evals.”
4. TheNextWeb (Jensen Huang) — https://thenextweb.com/news/nvidia-huang-deepseek-huawei-chips-horrible-outcome — “Nvidia CEO Jensen Huang called DeepSeek’s optimization for Huawei Ascend a ‘horrible outcome’ for the United States, conceding that CUDA’s moat is being actively eroded.”
5. SL Guardian / China Academy — https://slguardian.org/why-deepseek-v4-hasnt-fully-cut-ties-with-nvidia/ — “DeepSeek was forced to revert to Nvidia hardware for V4 pretraining after Huawei Ascend 910B suffered stability issues and ‘glacial’ interconnect speeds, then shifted to Ascend 950PR Supernodes only for continued training and inference.”
6. Simon Willison’s blog — https://simonwillison.net/2026/apr/24/deepseek-v4/ — “DeepSeek V4 ships under a true MIT license — a sharp departure from the custom DeepSeek Model License used for V3 — making it among the most permissive frontier-class open-weight releases to date.”