What is the context window for Hermes-4-405B?

Hermes-4-405B supports a 128,000-token context window. This massive capacity allows you to input entire code repositories, lengthy PDF documents, or extensive multi-turn conversation histories in a single API request.

Is my data sent to the US when using this model?

No. When you use Lyceum Technology's API, Hermes-4-405B is hosted and executed entirely within our eu-north1 region. Your prompts and generated data never leave the European Union, ensuring GDPR compliance.

How do I enable the reasoning mode?

Hermes-4-405B's hybrid reasoning mode is activated via the system prompt. When you instruct the model to deliberate inside <think> tags before answering, it generates a chain of thought before giving its final answer to complex problems.

How does Hermes-4-405B compare to Llama-3.1-405B?

While built on the Llama-3.1-405B base, Hermes-4-405B is fine-tuned by Nous Research on 60 billion tokens of reasoning data. It offers significantly better schema adherence, structured JSON output, and a much lower refusal rate than the base model.

Do I need to rewrite my code to use Lyceum's API?

No. Lyceum's Serverless Inference API is fully OpenAI-compatible. You simply need to change the base_url in your OpenAI SDK to [removed] and update your API key.

Hermes-4-405B API: pricing, benchmarks & EU hosting

How much does it cost to run Hermes-4-405B on Lyceum?

On Lyceum's Serverless Inference API, Hermes-4-405B costs $1.00 per million input tokens and $3.00 per million output tokens. You are billed per token with no minimum commitments or base fees.

Hermes-4-405B is the flagship open-weights model from Nous Research, built on the Llama-3.1-405B architecture. Trained on a 60-billion-token post-training corpus of verified reasoning traces, it introduces a hybrid reasoning mode that lets the model deliberate inside <think> tags before answering. It excels at complex mathematics, coding, and structured JSON outputs while remaining highly steerable and uncensored. Lyceum Technology serves Hermes-4-405B through our OpenAI-compatible Serverless Inference API, so you can deploy this frontier model on fully GDPR-compliant, EU-hosted infrastructure by changing only your base URL and API key.

Get started: call Hermes-4-405B on Lyceum

You can access Hermes-4-405B through Lyceum Technology's OpenAI-compatible Serverless Inference API. Because the API is a drop-in replacement, you only need to update your base URL and API key to start querying the model on European infrastructure.

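from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat completion request to Hermes-4-405B.
response = client.chat.completions.create(
    model="NousResearch/Hermes-4-405B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,  # upper bound on generated tokens
)

# Print the assistant's reply text.
print(response.choices[0].message.content)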

Pricing and region for Hermes-4-405B

This model is hosted in the eu-north1 region, ensuring your data processing remains strictly within Europe. It operates on the Standard tier, which is optimized for high-capability workloads requiring maximum precision.

Input pricing: $1.00 per million tokens
Output pricing: $3.00 per million tokens

With per-token billing, you avoid the overhead of provisioning a dedicated 8x H100 cluster for a 405B-parameter model. You pay only for the tokens you consume and scale from zero to peak traffic without managing the underlying hardware.

The snippet in the Get started section is a standard chat completion request. For advanced use cases, you can also pass a custom system prompt to trigger the model's hybrid reasoning mode, instructing it to deliberate before returning the final response.

What Hermes-4-405B is good at

Hybrid reasoning and deep deliberation

Hermes-4-405B introduces a hybrid reasoning mode. When faced with complex problems, the model can output <think>...</think> tags to deliberate internally before providing a final answer. This chain-of-thought processing improves performance on advanced mathematics, logic puzzles, and multi-step coding tasks. Because it is a hybrid model, developers retain control: you can prompt it to think deeply for complex tasks, or bypass the reasoning tokens entirely for fast, direct responses when latency is the priority.
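A minimal sketch of how that toggle might look in client code, assuming the reasoning mode is triggered purely by the system prompt as described above. The prompt wording in `REASONING_SYSTEM_PROMPT` and the helper name `build_messages` are illustrative assumptions, not text confirmed by Nous Research or Lyceum; check the model card for the recommended prompt.

```python
from typing import Dict, List

# Illustrative wording only (an assumption): consult the model card for the
# system prompt the model was actually trained to respond to.
REASONING_SYSTEM_PROMPT = (
    "Deliberate step by step inside <think> </think> tags, "
    "then give your final answer after the closing tag."
)


def build_messages(user_prompt: str, reasoning: bool = False) -> List[Dict[str, str]]:
    """Return a chat `messages` list, optionally prefixed with a reasoning system prompt."""
    messages: List[Dict[str, str]] = []
    if reasoning:
        messages.append({"role": "system", "content": REASONING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": user_prompt})
    return messages


if __name__ == "__main__":
    # Fast path: no system prompt, so no deliberation is requested.
    print(build_messages("What is the capital of Sweden?"))
    # Deliberate path: the reasoning prompt is prepended as a system message.
    print(build_messages("Prove that the sum of two odd numbers is even.", reasoning=True))
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create(...)`, so the same call site serves both latency-sensitive and reasoning-heavy requests.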

Steerability and reduced refusals

Unlike proprietary models that often refuse benign requests due to overly strict safety tuning, Hermes-4-405B is designed to be highly steerable and aligned to the user. It achieves state-of-the-art results on RefusalBench, meaning it follows instructions faithfully without injecting unwanted moralistic biases or unnecessary censorship. This makes it ideal for creative writing, nuanced roleplay, and enterprise applications where unpredictable refusals break the user experience.

Structured outputs and schema adherence

The model was explicitly trained on a 60-billion-token corpus to produce valid JSON and adhere to strict schemas. It can even repair malformed JSON objects autonomously. This makes Hermes-4-405B a reliable engine for agentic workflows, function calling, and data extraction pipelines where format fidelity is critical. By combining deep reasoning with strict formatting, developers can build robust AI agents that reliably interact with external APIs and databases without requiring complex retry logic.
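Even with strong schema adherence, a pipeline will typically check the model's reply before acting on it. The sketch below shows one lightweight way to do that with only the Python standard library. The field names (`invoice_id`, `total`, `currency`) and the helper `parse_structured_reply` are hypothetical and chosen purely for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical schema for illustration: required field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {"invoice_id": str, "total": float, "currency": str}


def parse_structured_reply(raw: str) -> Dict[str, Any]:
    """Parse a model reply as JSON and check required fields and their types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing required field: {name}")
        value = data[name]
        # JSON has a single number type, so accept ints where a float is expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"field {name!r} should be {expected.__name__}")
        data[name] = value
    return data


if __name__ == "__main__":
    reply = '{"invoice_id": "INV-001", "total": 129, "currency": "EUR"}'
    print(parse_structured_reply(reply))
```

Because every failure surfaces as a `ValueError`, the calling code needs only a single exception handler to decide whether to re-prompt or reject the reply.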

Benchmarks and how it compares

Hermes-4-405B benchmark results

Hermes-4-405B competes directly with frontier proprietary models, particularly when its reasoning mode is active. It shows improvements over the Llama-3.1 base model in STEM and coding evaluations, achieving high scores on advanced mathematical datasets.

Benchmark	Score
MATH-500	96.3%
MMLU-Pro	87.5%
RefusalBench (Helpfulness)	57.1 (SOTA)

Source: Nous Research Hermes 4 Technical Report and independent evaluations.

Comparison to sibling models

Compared to Hermes-4-70B, the 405B variant offers significantly higher accuracy on graduate-level reasoning (MMLU-Pro) and complex coding tasks (LiveCodeBench). However, the 70B model is much faster and cheaper to run, making it the better choice for high-volume, low-latency applications where frontier-level reasoning is not strictly required.

When compared to the base Llama-3.1-405B-Instruct, Hermes-4-405B provides much stronger schema adherence and a significantly lower refusal rate. The base Llama model often struggles with overly cautious safety alignments that trigger false refusals on benign coding or creative tasks. Hermes-4-405B strips away these limitations, making it far more suitable for agentic workflows and automated tool use where predictable execution is mandatory. For developers building autonomous systems, this reliability translates directly to fewer failed API calls and less complex retry logic.

Using it in production

Production configuration for Hermes-4-405B

When deploying Hermes-4-405B via Lyceum Technology's Serverless Inference API, understanding the context window and pricing structure is essential for optimizing your workloads.

The model supports a massive 128k token context window, allowing you to pass entire code repositories, long financial documents, or extensive multi-turn conversation histories in a single prompt. Because it operates on our Standard tier in the eu-north1 region, it is optimized for high-capability tasks requiring maximum precision rather than pure speed.

To calculate costs, consider a typical agentic workflow: analyzing a 10,000-token document and generating a 2,000-token structured JSON report (including reasoning tokens). At $1.00 per million input tokens and $3.00 per million output tokens, this request would cost $0.01 for the input and $0.006 for the output, totaling $0.016 per execution.
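The same arithmetic can be wrapped in a small helper for budgeting. This is a sketch that hard-codes the per-million-token prices quoted in this article; the function name `estimate_cost_usd` is hypothetical, and the token counts are whatever your own workload produces.

```python
# Prices quoted in this article for Hermes-4-405B on Lyceum (USD per million tokens).
INPUT_USD_PER_MILLION = 1.00
OUTPUT_USD_PER_MILLION = 3.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request; output_tokens should include reasoning tokens."""
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # The workload described above: 10,000 input tokens, 2,000 output tokens.
    per_request = estimate_cost_usd(10_000, 2_000)
    print(f"per request: ${per_request:.4f}")
    print(f"per 1,000 requests: ${per_request * 1_000:.2f}")
```

For the 10,000-token input and 2,000-token output workload this comes to $0.016 per request, or $16.00 per thousand requests.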

If you are using the hybrid reasoning capabilities, we strongly recommend enabling streaming (stream=True in the OpenAI SDK). Because the model generates extensive <think> tokens before producing the final answer, streaming prevents your application from timing out and provides immediate feedback to the user while the model deliberates. You can parse the stream on the client side to hide the <think> tags from the end user while still benefiting from the model's enhanced logical reasoning.
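One way to do that client-side filtering is sketched below. It assumes the reasoning arrives inline in the streamed text, wrapped in literal `<think>` and `</think>` tags, and that a tag may be split across chunks. The helper names `strip_think` and `text_deltas` are hypothetical.

```python
from typing import Iterable, Iterator

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def strip_think(chunks: Iterable[str]) -> Iterator[str]:
    """Yield only text outside <think>...</think>, even if a tag spans two chunks."""
    buffer = ""
    inside = False  # True while we are between <think> and </think>
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if inside else OPEN_TAG
            idx = buffer.find(tag)
            if idx != -1:
                if not inside and idx > 0:
                    yield buffer[:idx]  # visible text before the opening tag
                buffer = buffer[idx + len(tag):]
                inside = not inside
                continue
            # No complete tag yet: hold back a tail that could be the start of one.
            keep = 0
            for n in range(min(len(tag) - 1, len(buffer)), 0, -1):
                if buffer.endswith(tag[:n]):
                    keep = n
                    break
            safe = buffer[:len(buffer) - keep]
            if not inside and safe:
                yield safe
            buffer = buffer[len(buffer) - keep:]
            break
    if not inside and buffer:
        yield buffer  # leftover text that never turned into a tag


def text_deltas(stream) -> Iterator[str]:
    """Map OpenAI SDK streaming chunks to text pieces (assumes reasoning is in `content`)."""
    for event in stream:
        if event.choices and event.choices[0].delta.content:
            yield event.choices[0].delta.content


if __name__ == "__main__":
    demo = ["<thi", "nk>2 + 2 is", " 4</th", "ink>The answer", " is 4."]
    print("".join(strip_think(demo)))
```

With the OpenAI SDK, you would wrap the object returned by `client.chat.completions.create(..., stream=True)` as `strip_think(text_deltas(stream))` and print each piece as it arrives. If the endpoint returns reasoning in a separate field rather than inline, this filter is unnecessary.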

Running Hermes-4-405B on EU-sovereign infrastructure

Why run Hermes-4-405B on Lyceum

For European enterprises and AI startups, data privacy is often the primary blocker to adopting frontier models. While US-based providers offer similar APIs, they route data outside the EU, creating significant GDPR and compliance risks. Non-EU hosting is frequently a deal-breaker for teams handling sensitive healthcare, financial, or proprietary manufacturing data.

Lyceum Technology solves this by hosting Hermes-4-405B on EU-sovereign infrastructure. When you query the model via our API, your data is processed exclusively in our eu-north1 data centers. We own our GPU infrastructure, which allows us to maintain strict security boundaries and offer a clear path to GDPR, AI Act, and ISO 27001 compliance, something providers renting from hyperscalers struggle to guarantee.

Furthermore, our open-stack transparency means you avoid vendor lock-in. Because we use an OpenAI-compatible API, migrating your existing applications to Lyceum takes minutes. You get the reasoning power of a 405-billion parameter model, the cost-efficiency of per-token billing, and the legal certainty of European data residency, all without managing a single GPU.

By leveraging our serverless inference platform, you eliminate the need to provision expensive 8x H100 clusters. You simply pay for the tokens you use, scaling from zero to peak demand while keeping your data securely within Europe.
