DeepSeek-V4's Ascend pivot: cheaper tokens, shakier answers
DeepSeek-V4 matches frontier coding scores on Huawei silicon at a fraction of the compute, but hallucinates on nearly every question it can't answer.
TL;DR
- DeepSeek-V4-Pro matches Gemini-3.1-Pro on SWE-Verified at 80.6% while cutting per-token FLOPs 73% and KV cache to a tenth.
- Independent AA-Omniscience evaluation clocks a 94% hallucination rate: the model almost never abstains when uncertain.
- The advertised 1M context degrades non-deterministically as the Lightning Indexer sporadically misses the right compressed block.
- Training and serving moved to Huawei Ascend 950PR, making the architecture as much a sanctions workaround as a research result.
Today’s research story is a single release that reads as three stories at once. DeepSeek-V4-Pro lands at frontier coding parity with Gemini-3.1-Pro while slashing compute per token and shrinking KV cache by an order of magnitude — the kind of efficiency curve that, on its own, would be the headline.
But the same release ships a 94% hallucination rate on the independent AA-Omniscience eval and a 1M-token context window that degrades non-deterministically depending on which compressed block the Lightning Indexer happens to pick. And all of it runs on Huawei Ascend silicon rather than NVIDIA, which makes the architectural choices read as much like a sanctions workaround as a research result.
The through-line is the tradeoff: efficiency and silicon sovereignty are real and measurable, while calibration and reliability are where the bill came due. Worth reading with both columns open.
DeepSeek-V4 trades calibration for context — and quietly leaves CUDA
Source: huggingface-blog · published 2026-04-24
TL;DR
- DeepSeek-V4-Pro ties Gemini-3.1-Pro on SWE-Verified (80.6%) while cutting per-token FLOPs 73% and KV cache to 10% of V3.2’s.
- Independent eval clocks a 94% hallucination rate on AA-Omniscience: V4 almost never abstains.
- The 1M context degrades non-deterministically — Lightning Indexer sporadically misses the right compressed block.
- Trained and served on Huawei Ascend 950PR, not NVIDIA. The architecture is partly a sanctions workaround.
What the release actually delivers
The headline is real: V4-Pro’s hybrid attention — Compressed Sparse Attention at 4× compression plus Heavily Compressed Attention at 128× — gets per-token inference FLOPs down 73% versus V3.2 and shrinks the KV cache by ~98% against vanilla GQA. On agentic benchmarks it lands where the frontier sits: 80.6% on SWE-Verified (tied with Gemini-3.1-Pro, a hair behind Opus-4.6-Max at 80.8%), 67.9 on Terminal Bench 2.0, 67% on internal CUDA/PyTorch R&D tasks against Sonnet 3.5’s 47%.
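The compression numbers are easiest to sanity-check with rough arithmetic. The sketch below is a back-of-the-envelope estimate in Python; the layer count, KV-head count, head dimension, dtype size, and the 5%/95% split of tokens between the 4× and 128× paths are all illustrative guesses, not published V4-Pro hyperparameters.

```python
# Back-of-the-envelope KV-cache estimate. Every shape parameter here
# (layers, KV heads, head_dim, dtype size, CSA/HCA token split) is an
# assumption for illustration, not a published V4-Pro hyperparameter.

def kv_cache_bytes(seq_len, n_layers=61, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, compression=1.0):
    """Bytes of KV cache for one sequence: K and V per layer, per kept token."""
    kept_tokens = seq_len / compression                      # tokens surviving compression
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem   # K + V for one layer
    return kept_tokens * n_layers * per_token

seq = 1_000_000
baseline = kv_cache_bytes(seq)  # vanilla GQA-style cache, no compression
# Hypothetical split: a recent 5% of tokens held at 4x (CSA), the long tail at 128x (HCA).
hybrid = kv_cache_bytes(seq * 0.05, compression=4) + kv_cache_bytes(seq * 0.95, compression=128)

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"hybrid:   {hybrid / 2**30:.1f} GiB ({100 * (1 - hybrid / baseline):.0f}% smaller)")
```

Under those assumed numbers the hybrid cache lands near the ~98% reduction the post claims; the sketch only shows that two compression tiers with a small lightly-compressed window can plausibly produce figures in that range.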
That’s the part the HuggingFace post sells. The rest of the picture is less flattering.
Calibration is the open wound
Artificial Analysis measured V4-Pro at a 94% hallucination rate on AA-Omniscience, with V4-Flash at 96% — not because the model is ignorant, but because it almost never abstains.[^1] GLM-5, by contrast, trained an explicit refusal mechanism and posts dramatically better numbers. For a release pitched at autonomous agents running long tool-use chains, a model that confidently invents tool arguments deep in a chain is precisely the failure mode the architecture was supposed to mitigate.
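The distinction between ignorance and miscalibration is easiest to see in how such a metric behaves. The sketch below assumes a hallucination rate defined as wrong answers as a share of questions the model did not answer correctly; this is an illustrative approximation, not the exact AA-Omniscience formula. Two models with identical knowledge land at very different rates depending on whether they abstain.

```python
# Illustrative scoring sketch, NOT the exact AA-Omniscience methodology.
# Assumption: hallucination rate = incorrect answers / (incorrect + abstentions),
# i.e. how often the model guesses wrong when it doesn't know the answer.

from collections import Counter

def hallucination_rate(outcomes):
    """outcomes: list of 'correct' | 'incorrect' | 'abstain' per question."""
    counts = Counter(outcomes)
    not_known = counts["incorrect"] + counts["abstain"]
    return counts["incorrect"] / not_known if not_known else 0.0

# Two hypothetical models with identical knowledge (60% of questions answered correctly):
always_answers = ["correct"] * 60 + ["incorrect"] * 40                    # never abstains
calibrated     = ["correct"] * 60 + ["incorrect"] * 10 + ["abstain"] * 30 # refuses when unsure

print(hallucination_rate(always_answers))  # 1.00 -> every miss is a confident wrong answer
print(hallucination_rate(calibrated))      # 0.25 -> most misses are abstentions
```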
Long-context retention has a related problem. The advertised 0.82 MRCR at 256K and 0.59 at 1M look acceptable, but Skywork’s stress tests find failures distributed randomly across positions rather than in a predictable weak zone — they attribute it to the Lightning Indexer occasionally missing the right compressed block during selection.[^2] Artgor adds that on ARC-AGI, V4-Pro stays “meaningfully behind” the US frontier, putting the coding-bench parity in context.[^3]
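What "failures distributed randomly across positions" means in practice is easiest to see with a needle-position sweep: plant the same fact at different depths of a long prompt and measure retrieval accuracy at each depth. The harness below is a generic sketch, not Skywork's; query_model is a hypothetical stand-in for whatever completion API is under test, and the token budgeting is deliberately crude.

```python
# Needle-position sweep sketch. `query_model` is a hypothetical stand-in for
# whatever chat-completion call you use; this is not Skywork's actual harness.

import random

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The access code for the vault is {code}."

def build_prompt(total_tokens_approx, needle_position, code):
    # Rough token budgeting by sentence count; good enough for a position sweep.
    n_filler = total_tokens_approx // 9
    chunks = [FILLER] * n_filler
    chunks.insert(int(needle_position * n_filler), NEEDLE.format(code=code))
    return "".join(chunks) + "\nWhat is the access code for the vault?"

def sweep(query_model, total_tokens_approx=256_000, steps=20, trials=3):
    """Return per-position retrieval accuracy across the context window."""
    results = {}
    for i in range(steps):
        pos = i / (steps - 1)
        hits = 0
        for _ in range(trials):
            code = f"{random.randint(0, 999999):06d}"
            answer = query_model(build_prompt(total_tokens_approx, pos, code))
            hits += code in answer
        results[round(pos, 2)] = hits / trials
    return results
```

A model with a classic "lost in the middle" profile shows a dip at mid positions; the Skywork finding is that V4's misses don't follow any such profile, which makes them harder to engineer around.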
The real story is silicon and infra
Buried in the model card: V4-Pro and V4-Flash were trained and deployed exclusively on Huawei Ascend 950PR, with the CUDA toolchain abandoned entirely.[^4] Read in that light, the aggressive CSA/HCA compression is algorithmic compensation for the FLOP gap versus Blackwell — sanctions-era necessity dressed as elegance.
DSec, the RL training substrate, deserves more attention than the model itself. Fireworks documents four execution layers behind one Python SDK, with 3FS hitting 6.6 TiB/s read and 40 GiB/s for KV cache I/O, and token-granular write-ahead logs that let preempted RL rollouts resume mid-trajectory without re-running expensive tool calls.[^5]
```mermaid
flowchart LR
    A[RL trainer] --> S[DSec Python SDK]
    S --> F[Function calls]
    S --> C[Containers]
    S --> M[Firecracker microVMs]
    S --> Q[QEMU full VMs]
    F & C & M & Q --> W[(3FS · 6.6 TiB/s<br/>token-granular WAL)]
    W -. preemption-safe replay .-> A
```
That preemption-safe replay primitive is arguably more reusable than the weights.
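Fireworks doesn't publish DSec's API, but the replay mechanism is straightforward to sketch: append every generated token and tool result to a per-trajectory write-ahead log, and on resume rebuild the context and skip tool calls that already ran. The TrajectoryWAL and resume_rollout names below are illustrative, not DeepSeek's SDK.

```python
# Illustrative sketch of token-granular WAL replay, not the DSec SDK.
# Idea: persist each token/tool event as it happens; on resume, rebuild the
# trajectory from the log and skip tool calls that already executed.

import json, os

class TrajectoryWAL:
    def __init__(self, path):
        self.path = path

    def append(self, event: dict):
        # One JSON record per line; fsync so a preemption loses at most the last write.
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

def resume_rollout(wal: TrajectoryWAL):
    """Rebuild context and collect completed tool calls so they aren't re-run."""
    context, done_tool_calls = [], {}
    for event in wal.replay():
        if event["type"] == "token":
            context.append(event["text"])
        elif event["type"] == "tool_result":
            done_tool_calls[event["call_id"]] = event["result"]
    return "".join(context), done_tool_calls
```

Making the log token-granular means a preemption costs at most the tail of the current generation, never a repeated tool call.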
DSML has a sharp edge
Persistent reasoning across turns isn’t free. Practitioners report that reasoning_content must be echoed back on every subsequent turn when tools are active — OpenAI-style harnesses like Cursor and OpenCode drop the field and silently degrade, with some IDEs falling back to older models entirely.[^6] The XML-over-JSON tool format is defensible engineering, but the migration tax on existing agent frameworks is real.
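The harness-side fix is mechanical but easy to miss: keep reasoning_content on each assistant message and send it back unchanged on the next turn while tool calls are active. The sketch below shows the pattern; call_model and run_tool are hypothetical stand-ins for a provider client and tool executor, and only the echo-back behavior reflects the reported requirement.

```python
# Sketch of an agent loop that preserves reasoning_content across turns.
# `call_model` and `run_tool` are hypothetical stand-ins for your provider
# client and tool executor; only the echo-back pattern is the point here.

def agent_loop(call_model, run_tool, user_prompt, tools, max_turns=8):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages=messages, tools=tools)

        assistant_msg = {"role": "assistant", "content": reply.get("content", "")}
        # The part OpenAI-style harnesses drop: carry the reasoning field forward
        # verbatim whenever the model produced one and tools are still in play.
        if reply.get("reasoning_content"):
            assistant_msg["reasoning_content"] = reply["reasoning_content"]
        if reply.get("tool_calls"):
            assistant_msg["tool_calls"] = reply["tool_calls"]
        messages.append(assistant_msg)

        if not reply.get("tool_calls"):
            return assistant_msg["content"]  # final answer, loop ends

        for call in reply["tool_calls"]:
            result = run_tool(call)
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    return None
```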
Takeaway
V4 is a genuine efficiency milestone and a geopolitical inflection point. It is not yet a model you put in front of users without a verification layer — the hallucination numbers, the random-position long-context failures, and the harness incompatibility together argue for treating V4 as powerful infrastructure that still needs a referee.
Footnotes
[^1]: smol.ai / Artificial Analysis writeup — https://news.smol.ai/issues/26-04-24-deepseek-v4/ — "DeepSeek-V4-Pro recorded a 94% hallucination rate (V4-Flash 96%) on AA-Omniscience because it almost never abstains when uncertain, in contrast to GLM-5, which lowered hallucinations via a refusal-to-answer mechanism."

[^2]: Skywork.ai independent eval — https://skywork.ai/skypage/en/deepseek-v4-ai-coding-guide/2047581548426514433 — "Failures in retrieval and reasoning appear randomly across different positions in the 1M window rather than clustering in a fixed weak zone… attributed to the Lightning Indexer and sparse attention sporadically missing specifics during compression."

[^3]: Artgor Medium review — https://artgor.medium.com/deepseek-v4-review-why-million-token-context-needs-efficient-attention-not-just-larger-windows-6dc8e74a00b1 — "While V4 rivals frontier models in coding and tool use, it still trails the top closed-source models like GPT-5.5 and Gemini 3.1 Pro by three to six months in general reasoning, with a visible gap on ARC-AGI where V4-Pro remains meaningfully behind the US frontier."

[^4]: Lushbinary analysis — https://lushbinary.com/blog/deepseek-v4-huawei-ascend-ai-infrastructure-strategy/ — "V4-Pro and V4-Flash were trained and deployed exclusively on Huawei Ascend 950PR silicon, completely abandoning the Nvidia CUDA ecosystem used for previous versions."

[^5]: Fireworks.ai engineering blog — https://fireworks.ai/blog/what-deepseek-v4-says-about-training-platforms — "DSec exposes four execution substrates—function calls, containers, microVMs (Firecracker), and full VMs (QEMU)—behind a unified Python SDK, backed by 3FS at 6.6 TiB/s peak read and 40 GiB/s for KV cache I/O, with token-granular write-ahead logs enabling preemption-safe trajectory replay."

[^6]: Medium review (leucopsis) — https://medium.com/@leucopsis/deepseek-v4-review-a23ce940151c — "The reasoning_content generated in thinking mode must be passed back to the API in all subsequent turns when tool calls are active… many existing tools built for OpenAI-style completions don’t expect a separate reasoning field, creating an integration hurdle in IDEs like Cursor."