Wei (Jack) Sun

DeepSeek-V4 closes the open-weights capability gap — and reopens others

DeepSeek-V4 matches frontier benchmarks under a true MIT license, but hallucination rates, token burn, and training hardware tell a more complicated story.

TL;DR

  • DeepSeek-V4 ships in 1.6T Pro and 284B Flash MoE configurations with 1M context and hybrid attention that shrinks KV cache to about 2%.
  • Independent evaluations place V4 in frontier-tier agentic territory but log a 94% hallucination rate on unknowns and 4× peer token usage.
  • The release moves to a true MIT license, the most permissive terms yet for a frontier-class open-weights model.
  • Inference runs on Huawei Ascend hardware, but pretraining still relied on Nvidia H800s after Ascend 910B stability issues.

Today’s research story is a single release, but it’s the kind that resets the conversation. DeepSeek-V4 arrives as a credibly frontier-tier open-weights system — trillion-plus-parameter MoEs, million-token context, a genuinely permissive MIT license — and the independent numbers back the headline claims on capability.

What makes the release worth dwelling on is what comes attached. The same benchmarks that confirm frontier scores also surface a 94% hallucination rate on unanswerable questions and roughly 4× the token consumption of peer models. The “runs on Huawei Ascend” sovereignty pitch holds for inference but not for the training run, which still went through Nvidia H800s. The capability gap to closed labs is closing; the reliability and efficiency gaps are where the next argument starts.

DeepSeek-V4 lands frontier-tier — and exposes the seams in the open-weights race

Source: huggingface-blog · published 2026-04-24

TL;DR

  • DeepSeek-V4 ships as 1.6T Pro and 284B Flash MoEs with 1M context and hybrid attention that cuts KV cache to ~2% of standard.
  • Independent benchmarks confirm frontier-tier agentic scores but flag 94% hallucination on unknowns and 4× higher token burn than peers.
  • True MIT license replaces V3’s bespoke terms — the most permissive frontier-class open-weights release to date.
  • “Runnable on Huawei Ascend” is real; pretraining still ran on Nvidia H800s after Ascend 910B stability problems.

A frontier release with calibration problems

The V4 drop is four checkpoints (Pro/Flash × Base/Instruct), three reasoning modes, and one architectural bet: alternating Compressed Sparse Attention (4× compression with an FP4 lightning indexer) and Heavily Compressed Attention (128×) layers that together shrink the KV cache to roughly 2% of standard. The payoff is concrete — V4-Pro runs at 27% of V3.2’s per-token FLOPs and still holds 0.59 MRCR retrieval accuracy at the full 1M-token window.
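The ~2% figure implies the layer schedule leans heavily on the 128× HCA layers. A back-of-envelope sketch, assuming compression applies uniformly per layer (the actual CSA/HCA layer mix is not published in the article):

```python
# Back-of-envelope: what mix of CSA (4x) and HCA (128x) layers yields
# a KV cache around 2% of a standard dense-attention cache?
# Illustrative arithmetic only -- not DeepSeek's published schedule.

CSA_FRACTION_OF_STANDARD = 1 / 4    # 4x compression
HCA_FRACTION_OF_STANDARD = 1 / 128  # 128x compression

def effective_kv_fraction(csa_layer_share: float) -> float:
    """KV cache size relative to standard, given the share of CSA layers."""
    hca_layer_share = 1.0 - csa_layer_share
    return (csa_layer_share * CSA_FRACTION_OF_STANDARD
            + hca_layer_share * HCA_FRACTION_OF_STANDARD)

# A roughly 1-in-20 CSA layer share lands near the quoted ~2%:
print(f"{effective_kv_fraction(0.05):.3f}")  # ~0.020
```

Under that (assumed) uniform model, the quoted number only works if CSA layers are sparse in the stack — roughly one in twenty.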

Artificial Analysis broadly corroborates the headline: V4-Pro is #2 on the open-weights Intelligence Index behind Kimi K2.6 and #1 on the GDPval-AA agentic benchmark [1]. The vendor-reported numbers hold up where they were checkable:

Benchmark                 V4-Pro    Context
SWE Verified              80.6%     matches Gemini-3.1-Pro
Toolathlon                51.8      beats K2.6 (50.0), Gemini-3.1-Pro (48.8)
Terminal Bench 2.0        67.9      ahead of GLM-5.1
AA Intelligence Index     #2 open   K2.6 #1 [1]
AA-Omniscience Index      −10       prefers guessing to abstaining [2]
LMArena (open coders)     #3        “benchmaxxing” critique [3]

The same AA shop that crowned V4-Pro also published the worst finding in the cluster: V4-Flash hallucinates 96% of the time on questions it doesn’t know, V4-Pro 94%, and V4-Pro burned ~190M tokens running their index — about 4× the field average [2]. The cheap per-token pricing partially evaporates at task level. Crowdsourced LMArena puts V4-Pro only third among open-weight coders, and the community attributes the gap to over-fitting on SWE-bench-style evals [3].
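The “evaporating” pricing advantage is simple arithmetic. A sketch using the $0.14/M Flash input price from the release and the ~4× token multiplier AA measured — the $0.50/M peer price is a hypothetical chosen for illustration, not a quoted competitor rate:

```python
# Task-level cost: per-token price times tokens actually consumed.
# A model that is ~3.5x cheaper per token but burns ~4x the tokens
# ends up costing about the same per task.

def task_cost(price_per_m_tokens: float, tokens_millions: float) -> float:
    return price_per_m_tokens * tokens_millions

PEER_TOKENS_M = 1.0        # normalize: peer spends 1M tokens on a task
V4_TOKEN_MULTIPLIER = 4.0  # ~4x field-average token usage (AA finding)

v4_flash = task_cost(0.14, PEER_TOKENS_M * V4_TOKEN_MULTIPLIER)
peer = task_cost(0.50, PEER_TOKENS_M)  # hypothetical peer at $0.50/M

print(f"V4-Flash: ${v4_flash:.2f} per task, peer: ${peer:.2f} per task")
```

Under those assumptions the per-task costs land within pennies of each other, which is the point of the “partially evaporates” caveat.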

V4-Pro scores −10 on the Omniscience Index, indicating it would rather guess than abstain. [2]
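Why guessing drives such an index negative: penalty-scored evals of this shape reward a correct answer, penalize a wrong one, and give zero for abstaining. A simplified stand-in scorer — the exact AA-Omniscience formula is not given in the article, so the +1/−1/0 scheme here is an assumption:

```python
# Hypothetical penalty-scored index: +1 correct, -1 wrong, 0 abstain,
# scaled to a 100-point range. Not AA's actual formula -- an
# illustration of why "always answer" can score below "abstain".

def index_score(correct: int, wrong: int, abstained: int) -> float:
    total = correct + wrong + abstained
    return 100 * (correct - wrong) / total

# 100 questions the model doesn't actually know:
guesser = index_score(correct=6, wrong=94, abstained=0)     # deep negative
abstainer = index_score(correct=0, wrong=0, abstained=100)  # zero, not negative
```

Under this toy scoring, a model that answers everything and hallucinates on 94% of its unknowns scores far below one that simply abstains — which is the behavior the −10 figure is flagging.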

The Huawei pivot is louder than the model

The cluster’s most consequential signal isn’t CSA/HCA — it’s that V4 is the first frontier-class open model engineered against Huawei’s CANN/Ascend stack. Jensen Huang publicly called the optimization “a horrible outcome” for the United States, conceding that CUDA’s moat is being actively eroded [4]. That admission is the news.

The substance is messier than the headline. DeepSeek reportedly reverted to Nvidia H800s for V4 pretraining after Ascend 910B exhibited stability issues and “glacial” interconnect speeds, with Ascend 950PR Supernodes only handling continued training and inference [5]. “Runnable on Huawei” is real; “trained on Huawei” is still aspirational. The export-control thesis survives another quarter, but barely.

The sleeper win is the license

Simon Willison flagged the detail that benchmark coverage buried: V4 ships under a true MIT license, replacing V3’s bespoke DeepSeek Model License [6]. Combined with Flash inference at $0.14/M input tokens, this removes the cleanest wedge GLM-5 partisans had been using on enterprise procurement. Fine-tuners no longer need legal review on the weights themselves — a quieter unlock than 1M context, and probably a more durable one.

Net read: the architecture is impressive, the agentic numbers are real, the calibration is poor [2], the Huawei story is half-marketing [5]. The MIT license [6] and the credible threat to Nvidia’s ecosystem [4] are what will still matter in six months. The benchmark crown [1][3] is contested and almost certainly temporary.

Footnotes

  1. Artificial Analysis: https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash

    DeepSeek V4-Pro is back among the leading open-weights models, ranking #2 on the Intelligence Index behind Kimi K2.6, while leading all open-weights competitors on the GDPval-AA agentic benchmark.

  2. Artificial Analysis – AA-Omniscience: https://artificialanalysis.ai/evaluations/omniscience

    On the AA-Omniscience benchmark V4-Flash hallucinates 96% of the time and V4-Pro 94% when it doesn’t know the answer; V4-Pro scores -10 on the Omniscience Index, indicating it would rather guess than abstain.

  3. r/singularity discussion of LMArena results: https://www.reddit.com/r/singularity/comments/1suci24/deepseek_v4_pro_underwhelms_on_arena_crowdsourced/

    DeepSeek V4-Pro underwhelms on Arena’s crowdsourced ranking, sitting only third among open-weight coding models despite vendor claims of GPT-5.5-class performance — community attributes the gap to ‘benchmaxxing’ on SWE-bench-style evals.

  4. TheNextWeb (Jensen Huang): https://thenextweb.com/news/nvidia-huang-deepseek-huawei-chips-horrible-outcome

    Nvidia CEO Jensen Huang called DeepSeek’s optimization for Huawei Ascend a ‘horrible outcome’ for the United States, conceding that CUDA’s moat is being actively eroded.

  5. SL Guardian / China Academy: https://slguardian.org/why-deepseek-v4-hasnt-fully-cut-ties-with-nvidia/

    DeepSeek was forced to revert to Nvidia hardware for V4 pretraining after Huawei Ascend 910B suffered stability issues and ‘glacial’ interconnect speeds, then shifted to Ascend 950PR Supernodes only for continued training and inference.

  6. Simon Willison’s blog: https://simonwillison.net/2026/apr/24/deepseek-v4/

    DeepSeek V4 ships under a true MIT license — a sharp departure from the custom DeepSeek Model License used for V3 — making it among the most permissive frontier-class open-weight releases to date.

Jack Sun · Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.

© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare