
Nemotron-Ultra-253B: specs, benchmarks, and how to run it on Lyceum

NVIDIA's 253B reasoning model optimized for single-node deployment.

Caspar Lehmkühler

June 24, 2026 · Head of Product at Lyceum Technology

Nemotron-Ultra-253B is a 253-billion-parameter large language model developed by NVIDIA. NVIDIA derived it from Meta's Llama 3.1 405B using Neural Architecture Search (NAS) and vertical compression, cutting the parameter count by more than a third (405B to 253B) and shrinking the memory footprint accordingly. Post-trained for reasoning, human-interactive chat, and tool calling, it has a 128K context window and a dual-mode operation that turns chain-of-thought generation on or off.

Lyceum Technology serves Nemotron-Ultra-253B via our OpenAI-compatible Serverless Inference API. You can deploy this model on our EU-sovereign infrastructure, ensuring full GDPR compliance and data residency while paying only for the tokens you consume.

Get started: call Nemotron-Ultra-253B on Lyceum

You can access Nemotron-Ultra-253B through Lyceum Technology's Serverless Inference API. Because our API is fully OpenAI-compatible, you can switch to our EU-hosted infrastructure by updating your base URL and API key.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request and print the model's reply.
response = client.chat.completions.create(
    model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for Nemotron-Ultra-253B

Lyceum Technology offers this model on our Standard tier, which provides high-capability compute for complex reasoning tasks. The model is hosted in our eu-north1 region, ensuring your data remains within European borders.

  • Input pricing: $0.60 per million tokens
  • Output pricing: $1.80 per million tokens
  • Tier: Standard
  • Region: eu-north1

What Nemotron-Ultra-253B is good at

Efficient frontier-level reasoning

Nemotron-Ultra-253B was built to solve a specific infrastructure problem: running a frontier-class reasoning model without requiring a massive GPU cluster. By applying Neural Architecture Search (NAS) and vertical compression to the Llama 3.1 405B architecture, NVIDIA reduced the parameter count to 253B. This allows the model to fit entirely on a single 8xH100 node for inference, significantly lowering the hardware barrier while maintaining top-tier reasoning capabilities for enterprise deployments.

Dual-mode chain-of-thought

Unlike models that force reasoning on every prompt, Nemotron-Ultra-253B features a dual-mode operation controlled via the system prompt. By including "detailed thinking on" or "detailed thinking off" in the system message, developers can toggle the model's chain-of-thought generation. This flexibility means you can use the same model for complex, multi-step math problems and standard, low-latency chat interactions without wasting output tokens on unnecessary reasoning.
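
The toggle described above can be wrapped in a small helper. This is a minimal sketch: `build_messages` is a hypothetical name introduced here for illustration, and only the two system-prompt strings come from the text above.

```python
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Return chat messages with the reasoning toggle set in the system prompt."""
    # The two toggle strings are the ones quoted in this article.
    toggle = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": toggle},
        {"role": "user", "content": user_prompt},
    ]


# Reasoning on for a multi-step problem, off for a quick chat turn.
math_messages = build_messages(
    "Prove that the sum of two odd numbers is even.", reasoning=True
)
chat_messages = build_messages("Hello!", reasoning=False)
```

Either list can be passed as the `messages` argument of `client.chat.completions.create(...)`, so one code path serves both modes.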

Tool calling and RAG

Post-training covered tool calling and Retrieval-Augmented Generation (RAG) and included reinforcement learning stages using Group Relative Policy Optimization (GRPO). The model outputs structured JSON, follows complex multi-step instructions, and manages context across its 128K-token window, which suits agentic workflows that need both deep analysis and predictable formatting. In such a workflow, the model analyzes a user request, determines which external tools to call, and synthesizes the results into a coherent final answer.
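
The tool-calling loop can be sketched on the application side as follows. This is an illustration under assumptions: `get_weather`, `TOOL_REGISTRY`, and `dispatch_tool_call` are hypothetical names, the tool description follows the OpenAI function-calling schema, and whether a given endpoint accepts the `tools` parameter should be checked against its documentation.

```python
import json


# Hypothetical tool for illustration; not part of any Lyceum or NVIDIA API.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}

# Tool description in the OpenAI function-calling schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments; return a JSON string."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid arguments: {exc}"})
    return json.dumps(TOOL_REGISTRY[name](**arguments))


result = dispatch_tool_call("get_weather", '{"city": "Stockholm"}')
```

With the OpenAI SDK, the name and arguments would come from `message.tool_calls[i].function.name` and `.function.arguments`, and the result string would be sent back in a message with role `tool` so the model can write the final answer.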

Benchmarks and how it compares

Nemotron-Ultra-253B benchmark results

NVIDIA's dual-mode approach lets the model trade extra output tokens for accuracy at inference time. Enabling reasoning mode raises the score on every benchmark below, from a gain of 16.60 percentage points on MATH-500 to 55.83 points on AIME 2025.

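| Benchmark     | Standard mode (reasoning off) | Reasoning mode (reasoning on) |
|---------------|-------------------------------|-------------------------------|
| MATH-500      | 80.40%                        | 97.00%                        |
| AIME 2025     | 16.67%                        | 72.50%                        |
| LiveCodeBench | 29.03%                        | 66.31%                        |
| GPQA Diamond  | 56.60%                        | 76.01%                        |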

Source: NVIDIA and OpenRouter.
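
To make the size of the mode switch concrete, the gain per benchmark can be computed from the scores above. This is a small sketch; `SCORES` and `reasoning_gain` are names chosen here for illustration, and the numbers are copied from the table.

```python
# Scores copied from the table above: (standard mode, reasoning mode), in percent.
SCORES = {
    "MATH-500": (80.40, 97.00),
    "AIME 2025": (16.67, 72.50),
    "LiveCodeBench": (29.03, 66.31),
    "GPQA Diamond": (56.60, 76.01),
}


def reasoning_gain(standard: float, reasoning: float) -> float:
    """Absolute gain in percentage points from enabling reasoning mode."""
    return round(reasoning - standard, 2)


gains = {name: reasoning_gain(*pair) for name, pair in SCORES.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The largest gain is on AIME 2025 and the smallest on MATH-500, where the standard-mode score is already high.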

Comparison to other models

Compared to the original Meta Llama 3.1 405B, Nemotron-Ultra-253B has about 38% fewer parameters, so its weights need correspondingly less VRAM (roughly 506 GB versus 810 GB in BF16), while retaining most of the larger model's capability. This makes it a more practical choice for teams transitioning off hyperscaler credits who need to manage infrastructure costs.

Against DeepSeek R1, Nemotron-Ultra-253B scores higher on GPQA Diamond and LiveCodeBench, while DeepSeek R1 holds a slight edge on MATH-500 (97.3% vs 97.0%). Nemotron's dense architecture avoids MoE routing overhead and fits on a single 8xH100 node. DeepSeek R1's 671B total parameters do not: in BF16 its weights alone come to roughly 1.34 TB, more than the 1.28 TB that sixteen 80 GB GPUs provide, so it needs a multi-node cluster. That makes Nemotron-Ultra-253B a more accessible option for self-hosting or dedicated cloud deployments. For teams prioritizing coding and scientific reasoning, it offers a strong balance of capability and hardware efficiency.
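
The hardware comparison rests on simple arithmetic, sketched below. It counts weights only and assumes BF16 (2 bytes per parameter) and the 80 GB H100 variant; KV cache and activations need additional memory on top, so treat the result as a lower bound. The function and constant names are illustrative.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each weight in 2 bytes
H100_MEMORY_GB = 80       # assumption: the 80 GB H100 variant
GPUS_PER_NODE = 8


def weight_memory_gb(params_billions: float,
                     bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


node_capacity_gb = H100_MEMORY_GB * GPUS_PER_NODE  # 640 GB per 8xH100 node

MODELS = [("Nemotron-Ultra-253B", 253), ("Llama 3.1 405B", 405), ("DeepSeek R1", 671)]
for name, params in MODELS:
    needed = weight_memory_gb(params)
    verdict = "fits" if needed < node_capacity_gb else "does not fit"
    print(f"{name}: {needed:.0f} GB of BF16 weights, {verdict} in {node_capacity_gb} GB")
```

Under these assumptions only the 253B model leaves headroom on one node (about 134 GB) for KV cache and activations.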

Using it in production

Production configuration for Nemotron-Ultra-253B

When deploying Nemotron-Ultra-253B via Lyceum Technology's Serverless Inference API, you are accessing our Standard tier in the eu-north1 region. This tier is designed for high-capability models where complex reasoning and accuracy are prioritized over raw speed.

To control the model's reasoning behavior, you must configure the system prompt. Injecting "detailed thinking on" instructs the model to generate a chain-of-thought before answering, which is ideal for coding and math. If you need lower latency for standard chat, use "detailed thinking off". Because the model supports a 128K context window, you can safely pass large documents for RAG workflows, but be mindful of the output token consumption when reasoning is enabled.
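
When passing large documents, it helps to check that the prompt plus the reserved output still fits the window. The sketch below assumes the window is exactly 128,000 tokens and is shared between input and output; `clamp_max_tokens` is a hypothetical helper, and the prompt token count must come from your own tokenizer or a previous response's usage data.

```python
CONTEXT_WINDOW = 128_000  # tokens, as stated in this article (assumed exact)


def clamp_max_tokens(prompt_tokens: int, requested_output: int,
                     window: int = CONTEXT_WINDOW) -> int:
    """Largest output budget that keeps prompt + output within the window."""
    if prompt_tokens >= window:
        raise ValueError("prompt alone exceeds the context window")
    return min(requested_output, window - prompt_tokens)


# A 10,000-token document leaves the full requested budget available.
small_doc_budget = clamp_max_tokens(10_000, 4_096)
# A 126,000-token prompt leaves only 2,000 tokens for the answer.
large_doc_budget = clamp_max_tokens(126_000, 4_096)
```

With reasoning enabled, the chain-of-thought counts against the same output budget, so a tight remaining budget may cut off the answer.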

Calculating per-token pricing

Lyceum Technology charges $0.60 per million input tokens and $1.80 per million output tokens for this model. Consider a RAG application that processes a 10,000-token document and generates a 1,500-token reasoned response.

  • Input cost: 10,000 tokens * ($0.60 / 1,000,000) = $0.006
  • Output cost: 1,500 tokens * ($1.80 / 1,000,000) = $0.0027
  • Total cost per request: $0.0087

This pay-per-token model allows you to scale from zero without committing to the massive upfront cost of an 8xH100 cluster. You only pay for the exact compute your application requires, making it highly cost-effective for bursty or unpredictable workloads.
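
The worked example above generalizes to a small cost function. This is a sketch using the prices quoted in this article; `request_cost` is a name chosen here, and `Decimal` is used so the small dollar amounts are not distorted by binary floating-point rounding.

```python
from decimal import Decimal

# Prices quoted in this article, in USD per million tokens.
INPUT_PRICE_PER_MILLION = Decimal("0.60")
OUTPUT_PRICE_PER_MILLION = Decimal("1.80")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD of one request at the per-token prices above."""
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / MILLION
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / MILLION
    return input_cost + output_cost


# The RAG example from the text: 10,000 input tokens, 1,500 output tokens.
cost = request_cost(10_000, 1_500)
# Scaling the same request to 100,000 calls.
cost_per_100k_requests = cost * 100_000
```

At this request shape, 100,000 requests come to $870, which gives a quick way to compare against a fixed monthly hardware commitment.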

Running Nemotron-Ultra-253B on EU-sovereign infrastructure

Why run Nemotron-Ultra-253B on Lyceum

For European enterprises, deploying a 253B-parameter model typically means relying on US-based hyperscalers or API providers, which introduces significant data privacy risks. Lyceum Technology solves this by offering Nemotron-Ultra-253B on our fully EU-sovereign infrastructure. Hosted in our eu-north1 region, your inference workloads are processed entirely within European borders, ensuring strict GDPR compliance and data residency.

Unlike competitors who rent their hardware from larger public clouds, Lyceum owns and operates its GPU infrastructure. This structural advantage allows us to offer highly competitive per-token pricing without the markup associated with middleman API providers. Furthermore, our open-stack transparency, powered by vLLM and NVIDIA Dynamo, ensures you are never locked into a proprietary black-box inference engine. Because Lyceum does not charge egress fees, you can move your data and model outputs without the hidden costs associated with traditional cloud providers. You retain full visibility into how your workloads are executed.

Because our Serverless Inference API is a drop-in replacement for the OpenAI API and works with the OpenAI SDK, your engineering team can migrate to Lyceum in minutes by changing the base URL and API key. You get the intelligence of NVIDIA's most advanced reasoning model, the reliability of a managed API, and the legal certainty of European data sovereignty. Whether you are building autonomous agents or complex RAG pipelines, you get frontier-level performance without compromising on data privacy or infrastructure costs.

Frequently Asked Questions

What is the context window for Nemotron-Ultra-253B?

Nemotron-Ultra-253B supports a context window of 128,000 tokens (128K). This large capacity makes it highly effective for Retrieval-Augmented Generation (RAG) tasks, allowing you to input extensive documents or codebases for the model to analyze and reason over.

How much does it cost to run Nemotron-Ultra-253B on Lyceum?

On Lyceum Technology's Serverless Inference API, Nemotron-Ultra-253B costs $0.60 per million input tokens and $1.80 per million output tokens. There are no base fees or minimum commitments; you pay strictly for the tokens you consume.

How do I enable reasoning mode for this model?

Unlike other models that use API parameters for reasoning, Nemotron-Ultra-253B uses system prompt toggles. You can enable chain-of-thought generation by including "detailed thinking on" in your system prompt, or disable it for faster responses using "detailed thinking off".

Is Nemotron-Ultra-253B GDPR compliant on Lyceum?

Yes. When you call Nemotron-Ultra-253B through Lyceum Technology, your requests are processed in our eu-north1 region. All data remains within European borders on our owned infrastructure, ensuring full compliance with GDPR and data residency requirements.

How does Nemotron-Ultra-253B compare to DeepSeek R1?

Nemotron-Ultra-253B is a dense 253B model, whereas DeepSeek R1 is a 671B MoE model. Despite being much smaller, Nemotron outperforms R1 on GPQA Diamond and LiveCodeBench, while fitting on a single 8xH100 node, making it far easier to deploy in enterprise environments.

What license does Nemotron-Ultra-253B use?

The model is released under a combination of the NVIDIA Open Model License and the Llama 3.1 Community License, as it is a derivative of Meta's Llama 3.1 405B. It is approved for commercial use, provided you adhere to the terms of both licenses.
