Hermes-4-70B: specs, benchmarks, and how to run it on Lyceum

Deploy the hybrid-mode reasoning model by Nous Research on EU-sovereign infrastructure.

Magnus Grünewald

June 19, 2026 · CEO at Lyceum Technology

Hermes-4-70B is a hybrid-mode reasoning model developed by Nous Research. Built on the Llama-3.1-70B architecture, it targets mathematics, coding, and logical deduction, and was post-trained on a corpus of approximately 5 million samples. In its hybrid reasoning mode, the model can either generate an explicit thinking trace before answering or respond directly, depending on the prompt.

Lyceum Technology serves Hermes-4-70B through its Serverless Inference API. Engineering teams can access the model with the standard OpenAI SDK, which keeps migration straightforward. Because Lyceum operates its own hardware in European data centers, all inference workloads run on EU-sovereign infrastructure, which supports GDPR compliance and avoids the data privacy risks associated with US-based hyperscalers.

Get started: call Hermes-4-70B on Lyceum

To integrate Hermes-4-70B into your application, use the standard OpenAI Python SDK. Because Lyceum provides an OpenAI-compatible API, you only need to update the base URL and provide your Lyceum API key. The model string for this endpoint is NousResearch/Hermes-4-70B.

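from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat completion request to Hermes-4-70B.
response = client.chat.completions.create(
    model="NousResearch/Hermes-4-70B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

# Print the text of the first returned choice.
print(response.choices[0].message.content)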

Pricing and region for Hermes-4-70B

Lyceum serves Hermes-4-70B on the Fast tier, which is optimized for cost-efficient, high-throughput inference. The pricing is $0.13 per million input tokens and $0.40 per million output tokens. This per-token billing model ensures you pay for the exact compute resources your application consumes, scaling to zero when idle.

All API requests for this model are processed in the eu-north1 region. By running workloads on Lyceum's owned GPU infrastructure in Europe, your data remains within the European Union. This setup provides a clear path to GDPR compliance for enterprise applications, healthcare platforms, and financial services that cannot route sensitive user data through US-based API providers. The combination of the Fast tier economics and EU data residency makes this model practical for production deployments.

What Hermes-4-70B is good at

Hybrid reasoning and structured outputs

Hermes-4-70B introduces a hybrid reasoning mode that allows the model to deliberate before generating a final response. When faced with complex logic, the model can output explicit thinking segments to work through the problem step by step. For simpler queries, it can bypass this deliberation to provide faster responses. Furthermore, Nous Research trained the model to produce valid JSON for given schemas, making it reliable for programmatic function calling.
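
When the model does emit a thinking segment, applications usually want to separate it from the final answer before showing anything to a user. The following is a minimal sketch of that post-processing step. It assumes the trace is wrapped in `<think>...</think>` tags, which is not stated in this article and should be checked against real responses from the endpoint; the helper name `split_thinking` is hypothetical.

```python
import re

# Assumption: thinking traces are delimited by <think>...</think> tags.
# Verify this against actual model output before relying on it.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer) extracted from a raw completion string."""
    # Collect every thinking segment, trimmed of surrounding whitespace.
    traces = [segment.strip() for segment in THINK_PATTERN.findall(text)]
    # Whatever remains after removing the segments is treated as the answer.
    answer = THINK_PATTERN.sub("", text).strip()
    return "\n".join(traces), answer


raw = "<think>2 + 2 is 4.</think>The answer is 4."
thinking, answer = split_thinking(raw)
print(thinking)
print(answer)
```

A response without any thinking segment passes through unchanged, with an empty string returned as the trace.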

Steerability and reduced refusals

One of the primary design goals of the Hermes series is user alignment without excessive censorship. Hermes-4-70B achieves state-of-the-art results on RefusalBench, demonstrating a willingness to be helpful across scenarios that other models often block. This steerability means developers can rely on the model to follow system prompts accurately and maintain complex roleplay instructions without triggering false-positive safety refusals.

Math, code, and logic capabilities

The model was post-trained on a synthesized corpus of approximately 5 million samples, or roughly 60 billion tokens, blended across reasoning and non-reasoning data. This training targets STEM performance: Hermes-4-70B is strong at competitive programming tasks, advanced mathematical problem solving, and scientific reasoning. It retains the general assistant quality of its base architecture while improving on it in these specialized domains.

Benchmarks and how it compares

Hermes-4-70B benchmark results

Nous Research and independent evaluators have published extensive benchmark data for Hermes-4-70B, demonstrating its strong performance across coding, mathematics, and general knowledge tasks. The model consistently competes with or outperforms other models in the 70B weight class.

| Benchmark | Metric | Score |
|---|---|---|
| MATH | Competition mathematics | 91.0% |
| HumanEval+ | Code generation correctness | 90.0% |
| MMLU-Pro | Massive Multitask Language Understanding | 87.0% |
| SWE-bench Verified | Real-world software engineering | 72.0% |
| GPQA Diamond | Graduate-level science Q&A | 49.1% |
| LiveCodeBench | Live competitive programming | 26.9% |

Source: AI Value Index and developer performance benchmarks.

When compared to its base model, Llama-3.1-70B, Hermes-4-70B shows improvements in structured output generation and mathematical reasoning. The addition of the hybrid reasoning mode allows it to score higher on complex logic evaluations like MATH and SWE-bench Verified. Against current sibling models in the Lyceum catalogue, such as standard instruction-tuned 70B models, Hermes-4-70B offers a distinct advantage for developers who need strict JSON schema adherence and the ability to toggle deep thinking traces. Its performance on HumanEval+ makes it an excellent choice for coding assistants.

Using it in production

Production configuration for Hermes-4-70B

When deploying Hermes-4-70B in production, understanding the model parameters and pricing structure is critical. The model supports a context window of 131,072 tokens, which is ideal for analyzing large codebases or maintaining long multi-turn conversations.
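
A practical consequence of a fixed context window is that the prompt and the generated output compete for the same token budget. The sketch below shows one way to compute how many tokens remain for generation. It assumes that prompt and completion tokens share the 131,072-token window, which is typical for this model family but is not confirmed here; the function name and the `reserve` parameter are illustrative.

```python
CONTEXT_WINDOW = 131_072  # tokens, as stated for Hermes-4-70B above


def max_output_budget(prompt_tokens: int, reserve: int = 0) -> int:
    """Tokens left for generation after the prompt and an optional safety reserve.

    Assumes prompt and completion tokens share one context window.
    """
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("token counts must be non-negative")
    # Clamp at zero so an oversized prompt never yields a negative budget.
    return max(0, CONTEXT_WINDOW - prompt_tokens - reserve)


# A 100,000-token prompt leaves 31,072 tokens for the response.
print(max_output_budget(100_000))
```

The prompt token count would come from a tokenizer or from the `usage` field of an earlier response. The result can be used as an upper bound for `max_tokens`.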

Lyceum categorizes this model in the Fast tier. The Fast tier is designed for cost-efficient, high-throughput workloads where latency and unit economics are the primary concerns. The pricing is set at $0.13 per million input tokens and $0.40 per million output tokens.

To understand the production economics, consider an application processing 10,000 requests per day. If an average request contains 1,500 input tokens and generates 500 output tokens, the daily token volume would be 15 million input tokens and 5 million output tokens.

  • Input cost: 15 million tokens × $0.13 per million tokens = $1.95
  • Output cost: 5 million tokens × $0.40 per million tokens = $2.00
  • Total daily cost: $3.95
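
The same arithmetic can be wrapped in a small helper to try other traffic profiles. This is a sketch that uses only the Fast tier prices quoted above; the function name and parameters are illustrative, and `Decimal` is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

# Fast tier prices quoted in this article, in USD per million tokens.
INPUT_PRICE_PER_M = Decimal("0.13")
OUTPUT_PRICE_PER_M = Decimal("0.40")
MILLION = Decimal(1_000_000)


def daily_cost(requests_per_day: int, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the daily cost in USD from average per-request token counts."""
    total_input = requests_per_day * input_tokens
    total_output = requests_per_day * output_tokens
    cost = (total_input * INPUT_PRICE_PER_M + total_output * OUTPUT_PRICE_PER_M) / MILLION
    # Round to whole cents.
    return cost.quantize(Decimal("0.01"))


# The worked example: 10,000 requests, 1,500 input and 500 output tokens each.
print(daily_cost(10_000, 1_500, 500))  # 3.95
```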

This per-token pricing model ensures you pay for the exact compute used. Furthermore, Lyceum does not charge any egress fees, meaning you can stream large volumes of generated text back to your application without incurring hidden network transfer costs. All API requests for Hermes-4-70B are routed through the eu-north1 region, ensuring low latency for European users while maintaining strict data residency.
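
For streamed responses, the OpenAI SDK yields chunks whose text arrives in `choices[0].delta.content`. The sketch below shows how those chunks can be assembled into a full string. It assumes the Lyceum endpoint accepts `stream=True` in the same way as the OpenAI API, which this article implies but does not spell out. The `fake_chunk` helper is hypothetical and stands in for a live connection so that the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or an empty delta; skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in that mimics the shape of an SDK stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


print(collect_stream([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]))  # Hello
```

With a real client, the iterable would be the return value of `client.chat.completions.create(..., stream=True)`.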

Running Hermes-4-70B on EU-sovereign infrastructure

Why run Hermes-4-70B on Lyceum

For European engineering teams, data sovereignty is a hard requirement. Running Hermes-4-70B on Lyceum ensures that your inference workloads remain entirely within the European Union. Because the model is hosted in the eu-north1 region, your data never crosses the Atlantic, providing a clear and provable path to GDPR compliance. This is a critical advantage over US-based API providers that route traffic through American data centers.

Lyceum operates its own GPU infrastructure rather than renting compute from hyperscalers. This structural advantage allows us to offer competitive pricing without markups. The serving stack is built on open components, vLLM and NVIDIA Dynamo, so developers gain visibility into the inference process and are not locked into a proprietary engine.

The platform supports engineering velocity. The OpenAI-compatible API means you can migrate existing applications to Lyceum by pointing the client at a new base URL and supplying your Lyceum API key. There is no need to rewrite your application logic or learn a new SDK. Additionally, per-token billing and scale-to-zero capabilities mean that you do not pay for idle compute. Lyceum provides the performance, compliance, and cost-efficiency required to scale AI applications across Europe.
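
One way to keep such a migration reversible is to read the endpoint settings from the environment instead of hard-coding them. The sketch below illustrates the idea. The variable names `LLM_BASE_URL` and `LLM_API_KEY` are hypothetical and are not defined by Lyceum or the OpenAI SDK.

```python
import os

# Default endpoint taken from the Get started section of this article.
LYCEUM_BASE_URL = "https://api.lyceum.technology/api/v2/external/serverless"


def client_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for an OpenAI-compatible client.

    LLM_BASE_URL and LLM_API_KEY are hypothetical variable names
    chosen for this example.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", LYCEUM_BASE_URL),
        "api_key": env.get("LLM_API_KEY", ""),
    }


# With no overrides set, the client targets the Lyceum endpoint.
print(client_kwargs({})["base_url"])
```

The resulting dictionary can be passed to the SDK as `OpenAI(**client_kwargs())`, so switching providers becomes a deployment setting rather than a code change.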

Frequently Asked Questions

What is the context window for Hermes-4-70B?

Hermes-4-70B features a context window of 131,072 tokens. This extensive capacity allows the model to process large documents, analyze entire code repositories, and maintain long multi-turn conversations without losing track of earlier instructions or context.

How much does Hermes-4-70B cost on Lyceum?

On Lyceum's Fast tier, Hermes-4-70B costs $0.13 per million input tokens and $0.40 per million output tokens. This pay-per-token model ensures you pay for the exact compute your application consumes, with no base fees or minimum commitments.

Where is Hermes-4-70B hosted?

Lyceum hosts Hermes-4-70B in the eu-north1 region. All inference requests are processed on European servers, which keeps data resident in the European Union and supports GDPR compliance for enterprise applications handling sensitive user data.

How do I call Hermes-4-70B via API?

You can call the model using the standard OpenAI SDK. Set the base URL to the Lyceum serverless endpoint shown in the Get started section, provide your Lyceum API key, and use the model string NousResearch/Hermes-4-70B in your chat completions request.

What is the difference between Hermes-4-70B and Llama-3.1-70B?

Hermes-4-70B is built on the Llama-3.1-70B architecture but adds post-training on a large, reasoning-focused corpus. It introduces a hybrid reasoning mode, improved JSON schema adherence, and significantly reduced refusal rates, making it more steerable for developers.

Does Hermes-4-70B support function calling?

Yes, Hermes-4-70B is optimized for structured outputs and function calling. Nous Research trained the model to strictly adhere to provided JSON schemas and even repair malformed objects, making it reliable for programmatic data extraction and agentic workflows.
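
Even with a model trained for schema adherence, it is prudent to validate arguments before executing a function. The sketch below defines a tool in the OpenAI `tools` format and checks the model's JSON arguments against it. The tool `get_weather` and its fields are hypothetical. Whether the Lyceum endpoint accepts the `tools` parameter is an assumption this article does not confirm. The check covers required and unknown keys only, not full JSON Schema validation.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def parse_arguments(raw: str, tool: dict) -> dict:
    """Parse a tool call's JSON arguments and check required and unknown keys."""
    args = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    schema = tool["function"]["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    unknown = [key for key in args if key not in schema["properties"]]
    if unknown:
        raise ValueError(f"unknown arguments: {unknown}")
    return args


print(parse_arguments('{"city": "Berlin"}', WEATHER_TOOL))
```

In a real request, `WEATHER_TOOL` would be passed as `tools=[WEATHER_TOOL]`, and `raw` would come from `message.tool_calls[0].function.arguments` in the response.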

Related Resources

/magazine/glm-5-2; /magazine/llama-3-3-70b; /magazine/gpt-oss-120b