Hermes-4-405B: specs, benchmarks, and how to run it on Lyceum
A frontier hybrid-reasoning model by Nous Research, based on Llama-3.1-405B, offering deep deliberation and high steerability.
Justus Amen
June 18, 2026 · GTM at Lyceum Technology
Hermes-4-405B is the flagship open-weights model from Nous Research, built on the Llama-3.1-405B architecture. Trained on a 60-billion-token post-training corpus of verified reasoning traces, it introduces a hybrid reasoning mode that lets the model deliberate inside <think> tags before answering. It excels at complex mathematics, coding, and structured JSON outputs while remaining highly steerable and uncensored. Lyceum Technology serves Hermes-4-405B through our OpenAI-compatible Serverless Inference API, so you can deploy this frontier model on fully GDPR-compliant, EU-hosted infrastructure by changing only your base URL and API key.
Get started: call Hermes-4-405B on Lyceum
You can access Hermes-4-405B through Lyceum Technology's OpenAI-compatible Serverless Inference API. Because the API is a drop-in replacement, you only need to update your base URL and API key to start querying the model on European infrastructure.
```python
from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat completion request.
response = client.chat.completions.create(
    model="NousResearch/Hermes-4-405B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

print(response.choices[0].message.content)
```
Pricing and region for Hermes-4-405B
This model is hosted in the eu-north1 region, ensuring your data processing remains strictly within Europe. It operates on the Standard tier, which is optimized for high-capability workloads requiring maximum precision.
- Input pricing: $1.00 per million tokens
- Output pricing: $3.00 per million tokens
With per-token billing, you avoid the overhead of provisioning a dedicated 8x H100 cluster for a 405B-parameter model. You pay only for the tokens you consume and scale from zero to peak traffic without managing the underlying hardware.
The snippet in the Get started section above demonstrates a standard chat completion request. For advanced use cases, you can also pass a custom system prompt to trigger the model's hybrid reasoning mode, instructing it to deliberate before returning the final response.
What Hermes-4-405B is good at
Hybrid reasoning and deep deliberation
Hermes-4-405B introduces a hybrid reasoning mode. When faced with complex problems, the model can output <think>...</think> tags to deliberate internally before providing a final answer. This chain-of-thought processing improves performance on advanced mathematics, logic puzzles, and multi-step coding tasks. Because it is a hybrid model, developers retain control: you can prompt it to think deeply for complex tasks, or bypass the reasoning tokens entirely for fast, direct responses when latency is the priority.
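A minimal sketch of toggling deliberation per request and separating the reasoning from the final answer. The system prompt wording and the helper names `build_messages` and `split_reasoning` are illustrative assumptions, not Lyceum or Nous Research APIs; the recommended prompt text should be taken from the Nous Research model card.

```python
import re

# Assumption: an illustrative prompt that asks for <think> deliberation.
# The exact wording recommended by Nous Research may differ.
REASONING_SYSTEM_PROMPT = (
    "Think through the problem step by step inside <think> </think> tags, "
    "then give your final answer."
)

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat message list, adding the reasoning prompt only on request."""
    messages = []
    if reasoning:
        messages.append({"role": "system", "content": REASONING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def split_reasoning(reply: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completed reply string."""
    thoughts = "\n".join(m.strip() for m in THINK_RE.findall(reply))
    answer = THINK_RE.sub("", reply).strip()
    return thoughts, answer
```

The list returned by `build_messages` can be passed as the `messages` argument of `client.chat.completions.create`. Calling it with `reasoning=False` sends only the user turn, which suits latency-sensitive requests.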
Steerability and reduced refusals
Unlike proprietary models that often refuse benign requests due to overly strict safety tuning, Hermes-4-405B is designed to be highly steerable and aligned to the user. It achieves state-of-the-art results on RefusalBench, meaning it follows instructions faithfully without injecting unwanted moralistic biases or unnecessary censorship. This makes it ideal for creative writing, nuanced roleplay, and enterprise applications where unpredictable refusals break the user experience.
Structured outputs and schema adherence
The model was explicitly trained on a 60-billion-token corpus to produce valid JSON and adhere to strict schemas. It can even repair malformed JSON objects autonomously. This makes Hermes-4-405B a reliable engine for agentic workflows, function calling, and data extraction pipelines where format fidelity is critical. By combining deep reasoning with strict formatting, developers can build robust AI agents that reliably interact with external APIs and databases without requiring complex retry logic.
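Even with strong schema adherence, a pipeline would typically still check a reply before acting on it. The following is a small sketch of such a guard using only the standard library; `parse_structured_reply` and the example field names are hypothetical, and the check covers only top-level keys and types, not a full JSON Schema.

```python
import json


def parse_structured_reply(reply: str, required: dict[str, type]) -> dict:
    """Parse the JSON object in a reply and check required top-level keys.

    Takes the span from the first '{' to the last '}' so that any text
    surrounding the object is ignored.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")

    # json.JSONDecodeError is a subclass of ValueError.
    data = json.loads(reply[start:end + 1])

    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data


# Hypothetical extraction result used for illustration.
reply = 'Here you go: {"invoice_id": "A-17", "total": 99.5}'
invoice = parse_structured_reply(reply, {"invoice_id": str, "total": float})
```

A single `ValueError` handler around this call is usually enough to decide whether to accept the output or re-ask the model.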
Benchmarks and how it compares
Hermes-4-405B benchmark results
Hermes-4-405B competes directly with frontier proprietary models, particularly when its reasoning mode is active. It shows improvements over the Llama-3.1 base model in STEM and coding evaluations, including a 96.3% score on MATH-500.
| Benchmark | Score |
|---|---|
| MATH-500 | 96.3% |
| MMLU-Pro | 87.5% |
| RefusalBench (Helpfulness) | 57.1 (SOTA) |
Source: Nous Research Hermes 4 Technical Report and independent evaluations.
Comparison to sibling models
Compared to Hermes-4-70B, the 405B variant offers significantly higher accuracy on graduate-level reasoning (MMLU-Pro) and complex coding tasks (LiveCodeBench). However, the 70B model is much faster and cheaper to run, making it the better choice for high-volume, low-latency applications where frontier-level reasoning is not strictly required.
When compared to the base Llama-3.1-405B-Instruct, Hermes-4-405B provides much stronger schema adherence and a significantly lower refusal rate. The base Llama model often struggles with overly cautious safety alignments that trigger false refusals on benign coding or creative tasks. Hermes-4-405B strips away these limitations, making it far more suitable for agentic workflows and automated tool use where predictable execution is mandatory. For developers building autonomous systems, this reliability translates directly to fewer failed API calls and less complex retry logic.
Using it in production
Production configuration for Hermes-4-405B
When deploying Hermes-4-405B via Lyceum Technology's Serverless Inference API, understanding the context window and pricing structure is essential for optimizing your workloads.
The model supports a massive 128k token context window, allowing you to pass entire code repositories, long financial documents, or extensive multi-turn conversation histories in a single prompt. Because it operates on our Standard tier in the eu-north1 region, it is optimized for high-capability tasks requiring maximum precision rather than pure speed.
To calculate costs, consider a typical agentic workflow: analyzing a 10,000-token document and generating a 2,000-token structured JSON report (including reasoning tokens). At $1.00 per million input tokens and $3.00 per million output tokens, this request would cost $0.01 for the input and $0.006 for the output, totaling $0.016 per execution.
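The per-request arithmetic can be sketched as a small helper built on the listed prices. The function name `estimate_cost` is hypothetical, and `Decimal` is used here only to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

# Listed prices in USD per million tokens.
INPUT_USD_PER_MTOK = Decimal("1.00")
OUTPUT_USD_PER_MTOK = Decimal("3.00")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request from its token counts."""
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MTOK
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MTOK
    return (input_cost + output_cost) / TOKENS_PER_MILLION


# 10,000 input tokens and 2,000 output tokens (reasoning tokens included).
cost = estimate_cost(10_000, 2_000)
```

For a 10,000-token input and a 2,000-token output, this gives $0.01 plus $0.006, or $0.016 per request. Reasoning tokens are billed as output, so long deliberation raises the output side of the estimate.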
If you are using the hybrid reasoning capabilities, we strongly recommend enabling streaming (stream=True in the OpenAI SDK). Because the model generates extensive <think> tokens before producing the final answer, streaming prevents your application from timing out and provides immediate feedback to the user while the model deliberates. You can parse the stream on the client side to hide the <think> tags from the end user while still benefiting from the model's enhanced logical reasoning.
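One way to hide the reasoning on the client side is a small filter over the streamed text deltas. This is a sketch under assumptions: `visible_text` is a hypothetical helper, and it assumes the reasoning arrives inline as literal <think>...</think> text, possibly with a tag split across two chunks.

```python
from typing import Iterable, Iterator

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def _partial_suffix(text: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    for k in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:k]):
            return k
    return 0


def visible_text(deltas: Iterable[str]) -> Iterator[str]:
    """Yield only the text outside <think>...</think> blocks."""
    buffer = ""
    thinking = False
    for delta in deltas:
        buffer += delta
        while True:
            tag = CLOSE_TAG if thinking else OPEN_TAG
            idx = buffer.find(tag)
            if idx != -1:
                # Emit text before an opening tag; drop text before a closing tag.
                if not thinking and idx > 0:
                    yield buffer[:idx]
                buffer = buffer[idx + len(tag):]
                thinking = not thinking
                continue
            # No complete tag: hold back only a tail that might start one.
            keep = _partial_suffix(buffer, tag)
            if not thinking and len(buffer) > keep:
                yield buffer[:len(buffer) - keep]
            buffer = buffer[len(buffer) - keep:]
            break
    # Flush a held-back tail that never became a tag.
    if buffer and not thinking:
        yield buffer


# Simulated stream with both tags split across chunk boundaries.
chunks = ["<thi", "nk>step 1", " step 2</th", "ink>The answer", " is 4."]
shown = "".join(visible_text(chunks))
```

With the OpenAI SDK, the deltas would come from `chunk.choices[0].delta.content` on each streamed chunk, skipping chunks where that value is `None`.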
Running Hermes-4-405B on EU-sovereign infrastructure
Why run Hermes-4-405B on Lyceum
For European enterprises and AI startups, data privacy is often the primary blocker to adopting frontier models. While US-based providers offer similar APIs, they route data outside the EU, creating significant GDPR and compliance risks. Non-EU hosting is frequently a deal-breaker for teams handling sensitive healthcare, financial, or proprietary manufacturing data.
Lyceum Technology solves this by hosting Hermes-4-405B on EU-sovereign infrastructure. When you query the model via our API, your data is processed exclusively in our eu-north1 data centers. We own our GPU infrastructure, which allows us to maintain strict security boundaries and offer a clear path to GDPR, AI Act, and ISO 27001 compliance, something providers renting from hyperscalers struggle to guarantee.
Furthermore, our open-stack transparency means you avoid vendor lock-in. Because we use an OpenAI-compatible API, migrating your existing applications to Lyceum takes minutes. You get the reasoning power of a 405-billion parameter model, the cost-efficiency of per-token billing, and the legal certainty of European data residency, all without managing a single GPU.
By leveraging our serverless inference platform, you eliminate the need to provision expensive 8x H100 clusters. You simply pay for the tokens you use, scaling from zero to peak demand while keeping your data securely within Europe.