DeepSeek-V4-Pro: specs, benchmarks, and how to run it on Lyceum

1.6T parameter MoE model for advanced reasoning and agentic coding

Maximilian Niroomand

June 15, 2026 · CTO & Co-Founder at Lyceum Technology

DeepSeek-V4-Pro is a 1.6-trillion-parameter Mixture-of-Experts (MoE) model built by DeepSeek. It features a 1-million-token context window and excels at advanced reasoning, agentic coding workflows, and complex problem-solving. Because only 49 billion parameters are active per token, it stays efficient at inference time while remaining competitive with closed-source frontier models. Lyceum Technology serves DeepSeek-V4-Pro through our Serverless Inference API. You can integrate it immediately with the standard OpenAI SDK by changing only the base URL and API key. Hosted in our uk-south1 region in the UK, it offers low-latency access for UK and European users with simple pay-per-token billing.

Get started: call DeepSeek-V4-Pro on Lyceum

Integrating DeepSeek-V4-Pro into your application requires minimal effort if you already use the OpenAI SDK. Lyceum Technology provides a drop-in replacement API, allowing you to switch to this 1.6-trillion parameter model by updating only your base URL and API key. The following Python snippet demonstrates how to initialize the client and generate a chat completion.

```python
from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Request a chat completion from DeepSeek-V4-Pro.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

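For long reasoning outputs you may prefer to print tokens as they arrive. The sketch below assumes Lyceum's OpenAI-compatible endpoint honours the SDK's `stream=True` option, which this article does not explicitly confirm; `collect_stream` and `stream_reply` are hypothetical helper names. The client is passed in, so the same functions work with the `client` created above.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print and concatenate the text deltas from an OpenAI-style streaming response."""
    parts = []
    for chunk in chunks:
        # Some chunks (for example a final usage chunk) may carry no choices or no text.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


def stream_reply(client, prompt: str, max_tokens: int = 256) -> str:
    """Send one user message and stream the reply (assumes stream=True is supported)."""
    stream = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Pro",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    return collect_stream(stream)
```

With the client from the previous snippet, `stream_reply(client, "Summarize MoE routing in two sentences.")` prints the answer incrementally and also returns the full text.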
Pricing and region for DeepSeek-V4-Pro

Lyceum serves DeepSeek-V4-Pro on our Standard tier, which is optimized for high-capability workloads requiring maximum reasoning effort. The model is hosted in our uk-south1 region in the UK, giving UK and European users low-latency access to high-end GPU compute.

Pricing is strictly usage-based with no minimum commitments. You pay $1.75 per million input tokens and $3.50 per million output tokens. Because Lyceum owns and operates the underlying GPU infrastructure, we avoid the structural margin pressure of renting from public clouds, allowing us to offer highly competitive per-token rates for frontier models. There are no base fees, and you never pay for idle compute time when using the Serverless Inference API.

What DeepSeek-V4-Pro is good at

Advanced reasoning and agentic coding

DeepSeek-V4-Pro is a Mixture-of-Experts (MoE) model with 1.6 trillion total parameters, of which 49 billion are active for each generated token. This sparse activation lets it deliver frontier-level quality while keeping inference efficient. According to the DeepSeek V4 Technical Report, the model excels at advanced reasoning, mathematics, and software engineering tasks.

One of DeepSeek-V4-Pro's most significant capabilities is its 1-million-token context window. To support this without overwhelming GPU memory, DeepSeek implemented a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). This reduces the KV cache footprint to 10 percent of what previous generations required, which makes the model well suited to long-horizon agents that need to process entire codebases, extensive documentation, or long conversation histories.
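Before sending a whole repository, it can help to check roughly whether it fits in the 1,000,000-token window. This is a sketch under stated assumptions: it uses a rule of thumb of about 4 characters per token, which is not DeepSeek's tokenizer, and it assumes the window covers prompt and completion together. For exact counts, rely on the token usage the API reports. The helper names are hypothetical.

```python
from pathlib import Path

CONTEXT_LIMIT = 1_000_000  # DeepSeek-V4-Pro's maximum context window, in tokens
CHARS_PER_TOKEN = 4        # rough heuristic (assumption), not the model's tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts, reserved_output_tokens: int = 8_000) -> bool:
    """True if the estimated prompt plus reserved output tokens stays within the window."""
    prompt_tokens = sum(estimate_tokens(t) for t in texts)
    return prompt_tokens + reserved_output_tokens <= CONTEXT_LIMIT


def load_repo(root: str, suffixes=(".py", ".md")) -> list:
    """Read matching source files under root (UTF-8, undecodable bytes replaced)."""
    return [
        p.read_text(encoding="utf-8", errors="replace")
        for p in sorted(Path(root).rglob("*"))
        if p.is_file() and p.suffix in suffixes
    ]
```

For example, `fits_in_context(load_repo("./my-project"))` gives a quick go/no-go before you build the prompt; treat a borderline result as a signal to trim or chunk.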

Tool use and complex problem solving

The model is specifically tuned for agentic workflows. It supports native JSON output and tool calling, allowing developers to build autonomous agents that interact with external APIs, databases, and file systems. DeepSeek-V4-Pro utilizes a two-stage post-training pipeline involving Group Relative Policy Optimization (GRPO) and on-policy distillation, which enhances its ability to follow complex, multi-step instructions.
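A minimal sketch of one tool-calling round trip with the OpenAI SDK's `tools` parameter follows. It assumes Lyceum forwards OpenAI-style tool definitions to DeepSeek-V4-Pro unchanged; `get_service_status`, `TOOLS`, `REGISTRY`, and `run_tool_turn` are hypothetical names, and the tool returns canned data instead of calling a real service.

```python
import json


def get_service_status(service: str) -> dict:
    """Hypothetical local tool: return a canned health status for a service."""
    statuses = {"checkout": "degraded", "search": "ok"}
    return {"service": service, "status": statuses.get(service, "unknown")}


# JSON Schema description of the tool, in the OpenAI "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_service_status",
        "description": "Return the current health status of an internal service.",
        "parameters": {
            "type": "object",
            "properties": {"service": {"type": "string"}},
            "required": ["service"],
        },
    },
}]
REGISTRY = {"get_service_status": get_service_status}


def run_tool_turn(client, messages: list, model: str = "deepseek-ai/DeepSeek-V4-Pro"):
    """Ask the model, execute any requested tool calls, then ask again for the final answer."""
    first = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content
    # Echo the assistant's tool-call message back into the conversation.
    messages.append({
        "role": "assistant",
        "content": msg.content,
        "tool_calls": [
            {"id": c.id, "type": "function",
             "function": {"name": c.function.name, "arguments": c.function.arguments}}
            for c in msg.tool_calls
        ],
    })
    # Run each requested tool locally and append one "tool" message per call.
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = REGISTRY[call.function.name](**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
    second = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return second.choices[0].message.content
```

In a real agent you would loop until the model stops requesting tools and validate arguments before executing anything that touches production systems.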

For enterprise engineering teams, this means DeepSeek-V4-Pro can handle tasks like automated root cause analysis, complex data extraction from unstructured text, and multi-file code refactoring. Its deep world knowledge and STEM proficiency make it a strong candidate for specialized domains like bioinformatics, quantitative analysis, and manufacturing quality inspection.

Benchmarks and how it compares

DeepSeek-V4-Pro benchmark results

DeepSeek-V4-Pro establishes itself as a highly capable frontier model, particularly in coding and reasoning tasks. The table below outlines its performance across industry-standard evaluations compared to leading proprietary models released in the same timeframe.

| Benchmark | DeepSeek-V4-Pro | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|---|
| SWE-bench Pro | 55.4% | 64.3% | 58.6% |
| Terminal-Bench 2.0 | 67.9% | N/A | 82.7% |
| BenchLM (provisional score) | 68 / 100 | N/A | N/A |

Source: Lushbinary Frontier Model Showdown and BenchLM Leaderboard.

Comparing sibling and catalogue models

When evaluating DeepSeek-V4-Pro, it is helpful to compare it against its smaller sibling, DeepSeek-V4-Flash. The Flash variant activates only 13 billion parameters per token (compared to the Pro's 49 billion) and is optimized for speed and cost-efficiency. If your workload involves high-volume data extraction or basic chat routing, the Flash model provides a more economical path. However, for complex agentic workflows, multi-step reasoning, or deep code generation, the Pro model's larger parameter count yields noticeably better accuracy.
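One way to act on this guidance is a small router that sends lightweight work to Flash and complex work to Pro. This is a sketch only: the task categories are illustrative, and the Flash model ID `deepseek-ai/DeepSeek-V4-Flash` is a hypothetical string, since this article does not state that Lyceum serves the Flash variant.

```python
from enum import Enum

PRO_MODEL = "deepseek-ai/DeepSeek-V4-Pro"
# Hypothetical ID: confirm the Flash variant is actually offered before using it.
FLASH_MODEL = "deepseek-ai/DeepSeek-V4-Flash"


class Task(Enum):
    EXTRACTION = "extraction"
    CHAT_ROUTING = "chat_routing"
    AGENTIC = "agentic"
    MULTI_STEP_REASONING = "multi_step_reasoning"
    CODE_GENERATION = "code_generation"


# High-volume, simpler workloads where the cheaper Flash variant is usually enough.
LIGHTWEIGHT = {Task.EXTRACTION, Task.CHAT_ROUTING}


def pick_model(task: Task) -> str:
    """Route lightweight tasks to Flash and everything else to Pro."""
    return FLASH_MODEL if task in LIGHTWEIGHT else PRO_MODEL
```

Keeping the routing table explicit makes it easy to move a category between models once you have measured accuracy and cost on your own traffic.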

Against closed-source alternatives like GPT-5.5, DeepSeek-V4-Pro offers a compelling price-to-performance ratio. While GPT-5.5 leads on SWE-bench Pro and Terminal-Bench 2.0, DeepSeek-V4-Pro delivers competitive reasoning at a fraction of the API cost, making it attractive for production workloads that need to process context windows of up to 1 million tokens.

Using it in production

Production configuration for DeepSeek-V4-Pro

Deploying DeepSeek-V4-Pro in a production environment requires understanding its context limits, tier classification, and pricing structure. The model supports a maximum context window of 1,000,000 tokens. This massive capacity allows you to pass entire code repositories, extensive legal documents, or long-running agent histories in a single API request. To optimize performance, ensure you utilize the model's native JSON output and tool-calling capabilities when building structured data pipelines.
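As an illustration of a structured data pipeline, the sketch below builds a request that asks for JSON output via the OpenAI SDK's `response_format={"type": "json_object"}` option and then validates the reply. Whether Lyceum's endpoint forwards this parameter is an assumption; the system prompt also asks for JSON explicitly, as OpenAI-style JSON modes generally expect. The invoice schema and helper names are hypothetical.

```python
import json

SYSTEM_PROMPT = (
    "Extract the invoice number and total from the user's text. "
    'Reply only with JSON of the form {"invoice_number": str, "total": float}.'
)


def build_extraction_request(text: str, model: str = "deepseek-ai/DeepSeek-V4-Pro") -> dict:
    """Keyword arguments for client.chat.completions.create with JSON output requested."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        "response_format": {"type": "json_object"},  # assumes the endpoint supports JSON mode
        "max_tokens": 200,
    }


def parse_extraction(content: str) -> dict:
    """Parse the model's JSON reply and check that the expected fields are present."""
    data = json.loads(content)
    missing = {"invoice_number", "total"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    data["total"] = float(data["total"])
    return data
```

Usage with the client from the getting-started snippet: `resp = client.chat.completions.create(**build_extraction_request(doc_text))`, then `parse_extraction(resp.choices[0].message.content)`. Validating on your side keeps the pipeline safe even if a reply drifts from the requested shape.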

Lyceum categorizes DeepSeek-V4-Pro under our Standard tier. While our Fast tier focuses on ultra-low latency for smaller models, the Standard tier is dedicated to high-capability frontier models that require maximum reasoning effort. The model is hosted in our uk-south1 region in the UK, providing UK and European users with reliable, low-latency access to high-end GPU compute.

Calculating per-token costs

Lyceum charges $1.75 per million input tokens and $3.50 per million output tokens for DeepSeek-V4-Pro. To understand the unit economics, consider a typical agentic coding workload. If you send a prompt containing 15,000 tokens of context (such as a system prompt, API documentation, and existing code) and the model generates a 2,000-token response, the cost calculation is straightforward.

The input cost is 15,000 tokens multiplied by $0.00000175 per token, which equals $0.02625. The output cost is 2,000 tokens multiplied by $0.0000035 per token, which equals $0.007. The total cost for this complex reasoning request is $0.03325, or roughly 3.3 cents. This predictable, per-token pricing model scales efficiently from zero, ensuring you only pay for the exact compute your application consumes without the overhead of maintaining a dedicated 1.6T parameter inference cluster.
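The same arithmetic can be scripted for budgeting. This sketch uses the rates quoted above and `Decimal` to avoid floating-point rounding; `monthly_cost` assumes every request has an identical token profile, which is a simplification, and both helper names are hypothetical.

```python
from decimal import Decimal

# Lyceum's quoted DeepSeek-V4-Pro rates, in USD per million tokens.
INPUT_PER_M = Decimal("1.75")
OUTPUT_PER_M = Decimal("3.50")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """USD cost of one request at per-million-token rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / MILLION


def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 days: int = 30) -> Decimal:
    """Projected spend if every request has the same token profile (a simplification)."""
    return request_cost(input_tokens, output_tokens) * requests_per_day * days


print(request_cost(15_000, 2_000))  # 0.03325
```

At 1,000 such requests per day, `monthly_cost(1_000, 15_000, 2_000)` projects $997.50 over 30 days.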

Why run DeepSeek-V4-Pro on Lyceum

Lyceum gives you a drop-in path to DeepSeek-V4-Pro without managing GPU clusters. Our API is fully OpenAI-compatible, so you point your existing OpenAI SDK at our endpoint and swap in the model string. Billing is strictly pay-per-token with no base fees and no charges for idle compute, so costs scale directly with usage.

Because Lyceum owns and operates its own GPU infrastructure, we avoid the structural margin pressure of renting capacity from hyperscalers and pass competitive per-token rates on to you. You get the performance of a 1.6-trillion parameter model without the capital expenditure of purchasing NVIDIA H100 or B200 clusters, and without the operational burden of running complex MoE inference engines.

Open-stack transparency and ease of use

We believe in open-stack transparency. Our inference stack uses optimized open-source technologies like vLLM and NVIDIA Dynamo, ensuring high throughput and avoiding the vendor lock-in of black-box proprietary engines. Unified billing covers both serverless inference and per-second dedicated-GPU burst capacity, with zero egress fees on your traffic. To see how the underlying platform works, read our guide to serverless GPU inference.

Switching to Lyceum is frictionless. Because our API is fully OpenAI-compatible, your team can migrate existing applications in minutes without rewriting application logic or learning new SDKs.
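A common migration pattern is to keep endpoint, key, and model name in configuration rather than code. The sketch below reads them from environment variables; `LYCEUM_API_KEY`, `LYCEUM_BASE_URL`, and `LYCEUM_MODEL` are hypothetical variable names chosen for illustration, with the base URL and model string defaulting to the values used earlier in this article.

```python
import os
from typing import Mapping, Optional

LYCEUM_BASE_URL = "https://api.lyceum.technology/api/v2/external/serverless"
LYCEUM_MODEL = "deepseek-ai/DeepSeek-V4-Pro"


def lyceum_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build OpenAI client kwargs and the model name from (hypothetical) env variables."""
    env = os.environ if env is None else env
    key = env.get("LYCEUM_API_KEY")
    if not key:
        raise RuntimeError("set LYCEUM_API_KEY to your Lyceum API key")
    return {
        "client": {"base_url": env.get("LYCEUM_BASE_URL", LYCEUM_BASE_URL), "api_key": key},
        "model": env.get("LYCEUM_MODEL", LYCEUM_MODEL),
    }
```

Existing code then only needs `settings = lyceum_settings()`, `client = OpenAI(**settings["client"])`, and `model=settings["model"]` in each request. Alternatively, the OpenAI Python SDK reads `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment when `OpenAI()` is constructed without arguments, so setting those two variables can also redirect an unchanged client.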

Frequently Asked Questions

What is the context window for DeepSeek-V4-Pro?

DeepSeek-V4-Pro supports a massive context window of 1,000,000 tokens. This allows developers to input entire codebases, extensive documentation, or long conversation histories in a single request. The model uses a Hybrid Attention Architecture to process this context efficiently without overwhelming GPU memory.

How much does the DeepSeek-V4-Pro API cost on Lyceum?

Lyceum charges $1.75 per million input tokens and $3.50 per million output tokens for DeepSeek-V4-Pro. This usage-based, per-token pricing model ensures you only pay for the exact compute you consume, with no base fees or minimum monthly commitments required.

How do I call DeepSeek-V4-Pro using the OpenAI SDK?

You can call DeepSeek-V4-Pro using the standard OpenAI Python or Node.js SDK. Initialize the client with your Lyceum API key and set the base URL to https://api.lyceum.technology/api/v2/external/serverless. Then use the model string deepseek-ai/DeepSeek-V4-Pro in your chat completion request.

Where is DeepSeek-V4-Pro hosted?

Lyceum hosts the DeepSeek-V4-Pro model in our uk-south1 region, located in the UK. This gives UK and European users low-latency access to the model with pay-per-token billing. The UK sits outside the EU, so this region is a UK data center rather than an EU one.

How does DeepSeek-V4-Pro compare to Claude Opus 4.7?

DeepSeek-V4-Pro is highly competitive but trails Claude Opus 4.7 in specific coding benchmarks. On SWE-bench Pro, DeepSeek-V4-Pro scores 55.4%, while Opus 4.7 scores 64.3%. However, DeepSeek-V4-Pro offers a significantly lower per-token cost, making it an excellent choice for high-volume agentic workflows.

Is DeepSeek-V4-Pro an open-source model?

Yes, DeepSeek-V4-Pro is an open-weight model released by DeepSeek under the MIT license. It features 1.6 trillion total parameters with 49 billion active parameters per token. Lyceum manages the complex infrastructure required to run this massive Mixture-of-Experts model via our Serverless Inference API.
