Wei (Jack) Sun

DeepSeek-V4 closes the open-weights capability gap — and reopens others

DeepSeek-V4 matches frontier benchmarks under a true MIT license, but hallucination rates, token burn, and training hardware tell a more complicated story.

TL;DR

  • DeepSeek-V4 ships in 1.6T Pro and 284B Flash MoE configurations with 1M context and hybrid attention that shrinks KV cache to about 2%.
  • Independent evaluations place V4 in frontier-tier agentic territory but log a 94% hallucination rate on unknowns and 4× peer token usage.
  • The release moves to a true MIT license, the most permissive terms yet for a frontier-class open-weights model.
  • Inference runs on Huawei Ascend hardware, but pretraining still relied on Nvidia H800s after Ascend 910B stability issues.

Today’s research story is a single release, but it’s the kind that resets the conversation. DeepSeek-V4 arrives as a credibly frontier-tier open-weights system — trillion-plus-parameter MoEs, million-token context, a genuinely permissive MIT license — and the independent numbers back the headline claims on capability.

What makes the release worth dwelling on is what comes attached. The same benchmarks that confirm frontier scores also surface a 94% hallucination rate on unanswerable questions and roughly 4× the token consumption of peer models. The “runs on Huawei Ascend” sovereignty pitch holds for inference but not for the training run, which still went through Nvidia H800s. The capability gap to closed labs is closing; the reliability and efficiency gaps are where the next argument starts.

DeepSeek-V4 lands frontier-tier — and exposes the seams in the open-weights race

Source: huggingface-blog · published 2026-04-24

TL;DR

  • DeepSeek-V4 ships as 1.6T Pro and 284B Flash MoEs with 1M context and hybrid attention that cuts KV cache to ~2% of standard.
  • Independent benchmarks confirm frontier-tier agentic scores but flag 94% hallucination on unknowns and 4× higher token burn than peers.
  • True MIT license replaces V3’s bespoke terms — the most permissive frontier-class open-weights release to date.
  • “Runnable on Huawei Ascend” is real; pretraining still ran on Nvidia H800s after Ascend 910B stability problems.

A frontier release with calibration problems

The V4 drop is four checkpoints (Pro/Flash × Base/Instruct), three reasoning modes, and one architectural bet: alternating Compressed Sparse Attention (4× compression with an FP4 lightning indexer) and Heavily Compressed Attention (128×) layers that together shrink the KV cache to roughly 2% of standard. The payoff is concrete — V4-Pro runs at 27% of V3.2’s per-token FLOPs and still holds 0.59 MRCR retrieval accuracy at the full 1M-token window.
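The ~2% figure implies the layer schedule leans heavily on the 128× HCA layers. A back-of-envelope sketch, assuming compression applies uniformly per layer (the actual CSA/HCA layer mix is not published in the article):

```python
# Back-of-envelope: what mix of CSA (4x) and HCA (128x) layers yields
# a KV cache around 2% of a standard dense-attention cache?
# Illustrative arithmetic only -- not DeepSeek's published schedule.

CSA_FRACTION_OF_STANDARD = 1 / 4    # 4x compression
HCA_FRACTION_OF_STANDARD = 1 / 128  # 128x compression

def effective_kv_fraction(csa_layer_share: float) -> float:
    """KV cache size relative to standard, given the share of CSA layers."""
    hca_layer_share = 1.0 - csa_layer_share
    return (csa_layer_share * CSA_FRACTION_OF_STANDARD
            + hca_layer_share * HCA_FRACTION_OF_STANDARD)

# A roughly 1-in-20 CSA layer share lands near the quoted ~2%:
print(f"{effective_kv_fraction(0.05):.3f}")  # ~0.020
```

Under that (assumed) uniform model, the quoted number only works if CSA layers are sparse in the stack — roughly one in twenty.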

Artificial Analysis broadly corroborates the headline: V4-Pro is #2 on the open-weights Intelligence Index behind Kimi K2.6 and #1 on the GDPval-AA agentic benchmark [1]. The vendor-reported numbers hold up where they were checkable:

Benchmark                 V4-Pro    Context
SWE Verified              80.6%     matches Gemini-3.1-Pro
Toolathlon                51.8      beats K2.6 (50.0), Gemini-3.1-Pro (48.8)
Terminal Bench 2.0        67.9      ahead of GLM-5.1
AA Intelligence Index     #2 open   K2.6 #1 [1]
AA-Omniscience Index      −10       prefers guessing to abstaining [2]
LMArena (open coders)     #3        “benchmaxxing” critique [3]

The same AA shop that crowned V4-Pro also published the worst finding in the cluster: V4-Flash hallucinates 96% of the time on questions it doesn’t know, V4-Pro 94%, and V4-Pro burned ~190M tokens running their index — about 4× the field average [2]. The cheap per-token pricing partially evaporates at task level. Crowdsourced LMArena puts V4-Pro only third among open-weight coders, and the community attributes the gap to over-fitting on SWE-bench-style evals [3].
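The “evaporating” pricing advantage is simple arithmetic. A sketch using the $0.14/M Flash input price from the release and the ~4× token multiplier AA measured — the $0.50/M peer price is a hypothetical chosen for illustration, not a quoted competitor rate:

```python
# Task-level cost: per-token price times tokens actually consumed.
# A model that is ~3.5x cheaper per token but burns ~4x the tokens
# ends up costing about the same per task.

def task_cost(price_per_m_tokens: float, tokens_millions: float) -> float:
    return price_per_m_tokens * tokens_millions

PEER_TOKENS_M = 1.0        # normalize: peer spends 1M tokens on a task
V4_TOKEN_MULTIPLIER = 4.0  # ~4x field-average token usage (AA finding)

v4_flash = task_cost(0.14, PEER_TOKENS_M * V4_TOKEN_MULTIPLIER)
peer = task_cost(0.50, PEER_TOKENS_M)  # hypothetical peer at $0.50/M

print(f"V4-Flash: ${v4_flash:.2f} per task, peer: ${peer:.2f} per task")
```

Under those assumptions the per-task costs land within pennies of each other, which is the point of the “partially evaporates” caveat.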

V4-Pro scores −10 on the Omniscience Index, indicating it would rather guess than abstain. [2]
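Why guessing drives such an index negative: penalty-scored evals of this shape reward a correct answer, penalize a wrong one, and give zero for abstaining. A simplified stand-in scorer — the exact AA-Omniscience formula is not given in the article, so the +1/−1/0 scheme here is an assumption:

```python
# Hypothetical penalty-scored index: +1 correct, -1 wrong, 0 abstain,
# scaled to a 100-point range. Not AA's actual formula -- an
# illustration of why "always answer" can score below "abstain".

def index_score(correct: int, wrong: int, abstained: int) -> float:
    total = correct + wrong + abstained
    return 100 * (correct - wrong) / total

# 100 questions the model doesn't actually know:
guesser = index_score(correct=6, wrong=94, abstained=0)     # deep negative
abstainer = index_score(correct=0, wrong=0, abstained=100)  # zero, not negative
```

Under this toy scoring, a model that answers everything and hallucinates on 94% of its unknowns scores far below one that simply abstains — which is the behavior the −10 figure is flagging.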

The Huawei pivot is louder than the model

The cluster’s most consequential signal isn’t CSA/HCA — it’s that V4 is the first frontier-class open model engineered against Huawei’s CANN/Ascend stack. Jensen Huang publicly called the optimization “a horrible outcome” for the United States, conceding that CUDA’s moat is being actively eroded [4]. That admission is the news.

The substance is messier than the headline. DeepSeek reportedly reverted to Nvidia H800s for V4 pretraining after Ascend 910B exhibited stability issues and “glacial” interconnect speeds, with Ascend 950PR Supernodes only handling continued training and inference [5]. “Runnable on Huawei” is real; “trained on Huawei” is still aspirational. The export-control thesis survives another quarter, but barely.

The sleeper win is the license

Simon Willison flagged the detail that benchmark coverage buried: V4 ships under a true MIT license, replacing V3’s bespoke DeepSeek Model License [6]. Combined with Flash inference at $0.14/M input tokens, this removes the cleanest wedge GLM-5 partisans had been using on enterprise procurement. Fine-tuners no longer need legal review on the weights themselves — a quieter unlock than 1M context, and probably a more durable one.

Net read: the architecture is impressive, the agentic numbers are real, the calibration is poor [2], the Huawei story is half-marketing [5]. The MIT license [6] and the credible threat to Nvidia’s ecosystem [4] are what will still matter in six months. The benchmark crown [1][3] is contested and almost certainly temporary.

Footnotes

  1. Artificial Analysis: https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash

    DeepSeek V4-Pro is back among the leading open-weights models, ranking #2 on the Intelligence Index behind Kimi K2.6, while leading all open-weights competitors on the GDPval-AA agentic benchmark.

  2. Artificial Analysis – AA-Omniscience: https://artificialanalysis.ai/evaluations/omniscience

    On the AA-Omniscience benchmark V4-Flash hallucinates 96% of the time and V4-Pro 94% when it doesn’t know the answer; V4-Pro scores -10 on the Omniscience Index, indicating it would rather guess than abstain.

  3. r/singularity discussion of LMArena results: https://www.reddit.com/r/singularity/comments/1suci24/deepseek_v4_pro_underwhelms_on_arena_crowdsourced/

    DeepSeek V4-Pro underwhelms on Arena’s crowdsourced ranking, sitting only third among open-weight coding models despite vendor claims of GPT-5.5-class performance — community attributes the gap to ‘benchmaxxing’ on SWE-bench-style evals.

  4. TheNextWeb (Jensen Huang): https://thenextweb.com/news/nvidia-huang-deepseek-huawei-chips-horrible-outcome

    Nvidia CEO Jensen Huang called DeepSeek’s optimization for Huawei Ascend a ‘horrible outcome’ for the United States, conceding that CUDA’s moat is being actively eroded.

  5. SL Guardian / China Academy: https://slguardian.org/why-deepseek-v4-hasnt-fully-cut-ties-with-nvidia/

    DeepSeek was forced to revert to Nvidia hardware for V4 pretraining after Huawei Ascend 910B suffered stability issues and ‘glacial’ interconnect speeds, then shifted to Ascend 950PR Supernodes only for continued training and inference.

  6. Simon Willison’s blog: https://simonwillison.net/2026/apr/24/deepseek-v4/

    DeepSeek V4 ships under a true MIT license — a sharp departure from the custom DeepSeek Model License used for V3 — making it among the most permissive frontier-class open-weight releases to date.

Jack Sun · Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare