How to Evaluate DeepSeek's Official V4 Release, Announced on Twitter on April 22


DeepSeek-V4 Is Here: Million-Token Context Is Not a Gimmick, but the Foundation for Next-Generation Agents

After much anticipation, DeepSeek-V4 was officially announced on April 22. From an architectural perspective, V4 is clearly a new-generation large model rebuilt around ultra-long-context efficiency, aiming to solve the industry pain point of expensive long-context inference.

DeepSeek V4 Release

The new release features an ultra-long context of one million tokens and achieves leading levels among domestic and open-source models in Agent capabilities, world knowledge, and reasoning performance. The model comes in two versions, both supporting a 1M context length, and both have been open-sourced directly:

  • Pro version has 1.6T total parameters with 49B active parameters
  • Flash version has 284B total parameters with 13B active parameters

Starting today, users can experience the latest DeepSeek-V4 directly on the official platform and enjoy the new conversation capabilities brought by 1M-token context memory. The API services have been upgraded at the same time; developers only need to change model_name to deepseek-v4-pro or deepseek-v4-flash to access the new models.
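As a rough illustration, here is a minimal sketch of such a call through DeepSeek's existing OpenAI-compatible SDK convention; the base URL and key handling below follow the current public API and are assumptions, not details taken from the announcement:

```python
# Minimal sketch: calling the upgraded API with the OpenAI-compatible client.
# Base URL and key handling are assumed to match DeepSeek's existing API;
# only the model name changes to deepseek-v4-pro or deepseek-v4-flash.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash"
    messages=[
        {"role": "user", "content": "Summarize the attached long document."},
    ],
)
print(response.choices[0].message.content)
```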

How is the Model Performance?

First of all, V4's performance is already comparable to that of top closed-source models:

Significantly Improved Agent Capabilities

Compared with previous models, DeepSeek-V4-Pro's Agent capabilities have been significantly enhanced. In Agentic Coding evaluations, V4-Pro reaches the best level among current open-source models, and it also performs well in other Agent-related evaluations. DeepSeek-V4 has already become the Agentic Coding model used internally by DeepSeek employees. According to their evaluation feedback, the user experience is better than Sonnet 4.5 and the delivery quality is close to Opus 4.6 in non-thinking mode, though there is still a gap with Opus 4.6 in thinking mode.

Rich World Knowledge

In world knowledge assessments, DeepSeek-V4-Pro significantly outperforms other open-source models, only slightly inferior to the top closed-source model Gemini-Pro-3.1.

World-Class Reasoning Performance

In evaluations of mathematics, STEM, and competition-level code, DeepSeek-V4-Pro surpasses all currently publicly evaluated open-source models, achieving excellent results comparable to the world’s top closed-source models.

Structural Innovation and Ultra-High Context Efficiency

DeepSeek-V4 introduces a new attention mechanism that compresses along the token dimension, combined with DSA (DeepSeek Sparse Attention), achieving world-leading long-context capability while significantly reducing compute and memory requirements compared with traditional attention. From now on, 1M (one million) tokens of context will be standard across all official DeepSeek services.

Special Optimization for Agent Capabilities

DeepSeek-V4 has been adapted and optimized for mainstream Agent products such as Claude Code, OpenClaw, OpenCode, and CodeBuddy, with improved performance on code tasks, document generation tasks, and more.

New Version Model Architecture

DeepSeek officially released a technical paper detailing the technical implementation of V4. The paper clearly states that current reasoning models rely heavily on test-time scaling, but the quadratic complexity of traditional attention makes ultra-long context increasingly expensive, eventually becoming a bottleneck for reasoning and long-chain tasks. DeepSeek-V4’s goal is to break this bottleneck and make 1M context truly practical.

There are actually two levels of consideration behind this goal:

  • Product level: Many future tasks are not "ask a question, get an answer" but long documents, multi-document work, complex Agent workflows, and ultra-long-chain reasoning. These scenarios are sensitive to both context length and inference cost.
  • Research level: If long-context inference is too expensive, the benefits of test-time scaling quickly hit a wall. V4 is effectively laying the foundation for longer reasoning and longer-trajectory tasks.
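To make the cost concern concrete, here is a back-of-envelope calculation of dense-attention FLOPs at 1M tokens; the hidden size and layer count are arbitrary placeholders, not V4's actual configuration:

```python
# Back-of-envelope: why dense attention gets expensive at 1M-token context.
# Hidden size and layer count below are arbitrary placeholders, not V4's.
hidden, layers, ctx = 4096, 60, 1_000_000

# Decoding one new token: its query attends to every cached position in every
# layer (QK^T plus attn@V, roughly 2 * d * T FLOPs each).
per_token_attn_flops = 2 * 2 * hidden * ctx * layers
print(f"~{per_token_attn_flops / 1e12:.1f} TFLOPs of attention per generated token")

# Prefill over the whole prompt is quadratic: position t attends to all t earlier tokens.
prefill_attn_flops = 2 * 2 * hidden * ctx * ctx / 2 * layers
print(f"~{prefill_attn_flops / 1e18:.1f} EFLOPs of attention for a full 1M-token prefill")
```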

1) CSA + HCA: V4’s Real Trump Card

This is the most critical part of the entire paper. Instead of continuing down the path of conventional dense attention, V4 adopts a hybrid attention architecture:

  • CSA (Compressed Sparse Attention): First compress KV along the sequence, then perform sparse selection, only letting the query see the top-k compressed blocks.
  • HCA (Heavily Compressed Attention): Compresses even more aggressively, but retains dense attention.

You can understand it as:

  • CSA is more like “retrieve after compression”, focusing on efficiently finding key points;
  • HCA is more like “view the whole after extreme summarization”, focusing on reducing global costs.

These two mechanisms are used alternately, aiming not just to create an approximate attention, but to achieve a balanced design that takes into account local details, global coverage, and inference costs. The paper also adds a sliding window branch to prevent losing fine-grained dependencies of nearby tokens after compression.

This design philosophy is very engineering-oriented: view distant information cheaply, view nearby information in detail, and sparsely select the important blocks for a closer look. It works more like a multi-level memory system than an attempt to keep every raw token in full view.
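To make the "compress, then sparsely select" idea concrete, here is a minimal single-head sketch; the block size, top-k, mean-pooling compression, and window length are illustrative assumptions, not the paper's exact CSA/HCA formulation:

```python
import torch
import torch.nn.functional as F

def compressed_sparse_attention(q, k, v, block_size=64, top_k=8, window=128):
    """Toy single-head sketch of 'compress, then sparsely select' attention.

    q: (1, d) query for the current token; k, v: (T, d) cached keys/values.
    Block size, top-k, mean-pooling compression, and the sliding window are
    illustrative choices, not the paper's exact CSA/HCA design.
    """
    T, d = k.shape
    scale = d ** -0.5

    # 1) Compress KV along the sequence: mean-pool each block into one vector.
    n_blocks = (T + block_size - 1) // block_size
    pad = n_blocks * block_size - T
    k_blocks = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, d).mean(dim=1)

    # 2) Sparse selection: score the query against compressed blocks, keep top-k.
    block_scores = (q @ k_blocks.T).squeeze(0) * scale
    top = torch.topk(block_scores, k=min(top_k, n_blocks)).indices

    # 3) Expand the selected blocks back into raw token indices.
    idx = torch.cat([torch.arange(b * block_size, min((b + 1) * block_size, T))
                     for b in top.tolist()])

    # 4) Sliding-window branch: always keep the most recent tokens in full detail.
    idx = torch.unique(torch.cat([idx, torch.arange(max(0, T - window), T)]))

    # 5) Dense attention only over the selected and local tokens.
    attn = torch.softmax((q @ k[idx].T) * scale, dim=-1)
    return attn @ v[idx]

# Example: one query token attending over a 4096-token history.
q, k, v = torch.randn(1, 64), torch.randn(4096, 64), torch.randn(4096, 64)
print(compressed_sparse_attention(q, k, v).shape)  # torch.Size([1, 64])
```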

2) mHC: More Stable Training

Another important upgrade in V4 is mHC (Manifold-Constrained Hyper-Connections), which mainly solves three problems:

  • Degradation problem: deep networks don't merely overfit; past a certain depth they fail to train well at all
  • Residual explosion: the norm grows uncontrollably as residuals are superposed
  • Representation space collapse / distortion: deep features are no longer interpretable

The core improvement of mHC is to constrain each layer's residual mixing matrix H_res to be a doubly stochastic matrix, i.e., to lie on the Birkhoff polytope (the set of doubly stochastic matrices, which is the convex hull of permutation matrices).

The research team chose this manifold structure as the optimization space mainly because it has multiple excellent properties:

  • Non-expansive: the spectral norm of a doubly stochastic matrix is at most 1, which suppresses the risk of gradient explosion
  • Compositional closure: the set of doubly stochastic matrices is closed under multiplication, so products across many layers remain doubly stochastic and cross-layer skip connections keep the same conservation and stability properties
  • Geometric interpretation: the Birkhoff polytope is the convex hull of permutation matrices, so each mixing matrix can be read as a weighted average of permutation-style mixings; repeated application gives stronger cross-stream mixing, yet the fusion strengthens monotonically rather than amplifying without bound

In addition, mHC adds a non-negativity constraint to avoid signal cancellation from superposing positive and negative coefficients. Experiments show that mHC makes training more stable, with a loss curve that is essentially monotonic and smooth, and with no long-term drift.
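As a hands-on illustration of the doubly stochastic constraint, here is a small sketch that projects a matrix onto (approximately) the Birkhoff polytope via Sinkhorn normalization and checks the two key properties above; the Sinkhorn parameterization is an assumption for illustration, not necessarily the paper's exact construction:

```python
import torch

def sinkhorn_doubly_stochastic(logits, n_iters=50):
    """Map an unconstrained matrix to (approximately) a doubly stochastic one:
    exponentiate for non-negativity, then alternately normalize rows and columns.
    This is one common parameterization of the Birkhoff polytope; the paper's
    exact construction of the residual mixing matrix may differ."""
    m = torch.exp(logits)                        # non-negative entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)       # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)       # columns sum to 1
    return m

torch.manual_seed(0)
a = sinkhorn_doubly_stochastic(torch.randn(4, 4))
b = sinkhorn_doubly_stochastic(torch.randn(4, 4))

# Non-expansive: the spectral norm of a doubly stochastic matrix is at most 1.
print(torch.linalg.matrix_norm(a, ord=2))        # ~1.0

# Compositional closure: products of doubly stochastic matrices stay doubly
# stochastic, so compositions across many layers keep the same properties.
ab = a @ b
print(ab.sum(dim=0), ab.sum(dim=1))              # all entries ~1.0
```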

3) Muon: The Optimizer Heavily Used in V4

The paper places great emphasis on the Muon optimizer. Like the familiar AdamW, it is used to update model parameters; the difference, according to the paper, is that Muon converges faster and trains more stably at large scale, so it is used for most modules of DeepSeek-V4.

Its biggest difference from plain SGD or AdamW is that it applies an extra processing step to the update matrix so that the update direction is better conditioned and more stable. The core procedure is roughly:

  1. Calculate gradients first
  2. Accumulate momentum
  3. Perform a Hybrid Newton-Schulz orthogonalization process on the update matrix of “momentum + current gradient”
  4. Perform scaling and weight decay, and finally update parameters
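A minimal sketch of these four steps for a single 2D weight matrix is shown below; the quintic Newton-Schulz coefficients follow the public Muon implementation, while the Nesterov-style combination, learning rate, and shape-based scaling are assumptions rather than the exact recipe used in V4:

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize an update matrix with a Newton-Schulz iteration,
    as in Muon-style optimizers. Coefficients follow the public Muon code; the
    paper's "Hybrid Newton-Schulz" variant may differ in details."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)               # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                            # iterate on the wide orientation (smaller Gram matrix)
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transposed else x

def muon_like_step(param, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    """One Muon-like update for a 2D weight (sketch, not V4's exact hyperparameters)."""
    momentum.mul_(beta).add_(grad)                          # steps 1-2: gradient + momentum
    update = newton_schulz_orthogonalize(momentum + grad)   # step 3: orthogonalize "momentum + current gradient"
    scale = max(param.shape) ** 0.5                         # shape-dependent scaling (assumption)
    param.mul_(1 - lr * weight_decay).add_(update, alpha=-lr * scale)  # step 4: decay + update
    return param, momentum

# Example on a random 256x512 weight matrix.
w, g, m = torch.randn(256, 512), torch.randn(256, 512), torch.zeros(256, 512)
w, m = muon_like_step(w, g, m)
print(w.shape)  # torch.Size([256, 512])
```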

4) How Impressive is V4’s Efficiency Improvement

The most striking data in the paper is the efficiency comparison chart on its first page. Under a 1M-token context:

  • DeepSeek-V4-Pro needs only 27% of DeepSeek-V3.2's per-token inference FLOPs, and its KV cache is only 10% the size of V3.2's
  • DeepSeek-V4-Flash is even more aggressive: 10% of the per-token FLOPs and 7% of the KV cache

This improvement matters because the biggest problem with long-context models is their high cost of use. The value of V4's design is that it attempts to turn "million-token context" from a demonstration capability into a practical, deployable one. This is also what makes it more convincing than the many models that merely "claim to support 1M context".

Final Thoughts

Many models in the past also claimed to support long context, but in practice there were usually two problems: either it was too expensive, or quality degraded once the context actually got long. The core value of V4 lies in being re-engineered end to end around long-context usability, from the attention mechanism and KV cache to training stability and the optimizer.

The release of V4 this time has indeed brought many substantial technological breakthroughs, laying a solid foundation for the next generation of AI Agents and long context applications.
