Nemotron-Ultra-253B: specs, benchmarks, and how to run it on Lyceum
NVIDIA's 253B reasoning model optimized for single-node deployment.
Caspar Lehmkühler
June 24, 2026 · Head of Product at Lyceum Technology
Nemotron-Ultra-253B is a 253-billion-parameter large language model developed by NVIDIA. Derived from Meta's Llama 3.1 405B, it uses Neural Architecture Search (NAS) and vertical compression to shrink the memory footprint enough to fit on a single 8xH100 node while preserving most of the parent model's capability. Post-trained for reasoning, human-interactive chat, and tool calling, it has a 128K context window and a dual-mode operation that switches chain-of-thought generation on or off. Lyceum Technology serves Nemotron-Ultra-253B via our OpenAI-compatible Serverless Inference API. You can call this model on our EU-sovereign infrastructure, with GDPR compliance and EU data residency, while paying only for the tokens you consume.
Get started: call Nemotron-Ultra-253B on Lyceum
You can access Nemotron-Ultra-253B through Lyceum Technology's Serverless Inference API. Because our API is fully OpenAI-compatible, you can switch to our EU-hosted infrastructure by updating your base URL and API key.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

response = client.chat.completions.create(
    model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
Pricing and region for Nemotron-Ultra-253B
Lyceum Technology offers this model on our Standard tier, which provides high-capability compute for complex reasoning tasks. The model is hosted in our eu-north1 region, ensuring your data remains within European borders.
- Input pricing: $0.60 per million tokens
- Output pricing: $1.80 per million tokens
- Tier: Standard
- Region: eu-north1
What Nemotron-Ultra-253B is good at
Efficient frontier-level reasoning
Nemotron-Ultra-253B was built to solve a specific infrastructure problem: running a frontier-class reasoning model without a multi-node GPU cluster. By applying Neural Architecture Search (NAS) and vertical compression to the Llama 3.1 405B architecture, NVIDIA reduced the parameter count to 253B. This allows the model to fit entirely on a single 8xH100 node for inference, lowering the hardware barrier for enterprise deployments while keeping reasoning quality high.
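To make the single-node claim concrete, here is a rough back-of-the-envelope sketch. It assumes BF16 weights (2 bytes per parameter) and 80 GB of memory per H100, and it ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a sizing guide. The function names are illustrative only.

```python
# Rough weight-memory estimate. Assumptions (not from NVIDIA): BF16 weights at
# 2 bytes per parameter, 80 GB per H100, and no allowance for KV cache,
# activations, or runtime overhead.
BYTES_PER_PARAM_BF16 = 2
H100_MEMORY_GB = 80
GPUS_PER_NODE = 8


def weight_memory_gb(params_billion: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * BYTES_PER_PARAM_BF16


def fits_on_node(params_billion: float) -> bool:
    """True if the weights alone fit in the node's combined GPU memory."""
    return weight_memory_gb(params_billion) <= H100_MEMORY_GB * GPUS_PER_NODE


for name, size in [("Nemotron-Ultra-253B", 253), ("Llama 3.1 405B", 405)]:
    print(
        f"{name}: ~{weight_memory_gb(size):.0f} GB of weights, "
        f"fits on 8xH100: {fits_on_node(size)}"
    )
```

Under these assumptions the 253B weights take about 506 GB of the node's 640 GB, while the 405B parent at about 810 GB would not fit in BF16.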
Dual-mode chain-of-thought
Unlike models that force reasoning on every prompt, Nemotron-Ultra-253B features a dual-mode operation controlled via the system prompt. By including "detailed thinking on" or "detailed thinking off" in the system message, developers can toggle the model's chain-of-thought generation. This flexibility means you can use the same model for complex, multi-step math problems and standard, low-latency chat interactions without wasting output tokens on unnecessary reasoning.
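A minimal sketch of how that toggle might look in application code. The helper names `build_messages` and `ask` are hypothetical; the system-prompt strings and the model ID come from this article, and `client` is assumed to be an OpenAI SDK client configured with the Lyceum base URL as in the Get started section.

```python
from typing import Dict, List

# System-prompt strings quoted in the article.
THINKING_ON = "detailed thinking on"
THINKING_OFF = "detailed thinking off"

MODEL = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"


def build_messages(user_prompt: str, reasoning: bool) -> List[Dict[str, str]]:
    """Prepend the system message that selects the reasoning mode."""
    system = THINKING_ON if reasoning else THINKING_OFF
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


def ask(client, user_prompt: str, reasoning: bool, max_tokens: int = 1024) -> str:
    """Send one chat request; `client` is an OpenAI SDK client (assumed)."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=build_messages(user_prompt, reasoning),
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

With this shape, a router could call `ask(client, prompt, reasoning=True)` for math or coding requests and `reasoning=False` for ordinary chat, using the same model ID for both.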
Tool calling and RAG
NVIDIA post-trained the model in multiple phases, combining supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization (GRPO), and lists tool calling and Retrieval-Augmented Generation (RAG) among the target use cases. The model is designed to emit structured JSON, follow multi-step instructions, and work across its 128K-token context window. For teams building autonomous agents, that pairing of predictable formatting and deep reasoning is what matters: the model can analyze a user request, determine which external tools to call, and synthesize the results into a coherent final answer.
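The analyze, call, and synthesize loop described above can be sketched with the OpenAI SDK's standard `tools` parameter. This is an illustration under assumptions: that the Lyceum endpoint accepts `tools` for this model is not confirmed here, and `get_weather`, `TOOLS`, `REGISTRY`, `run_tool_call`, and `answer_with_tools` are hypothetical names.

```python
import json
from typing import Callable, Dict

MODEL = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would query a weather service."""
    return {"city": city, "temperature_c": 21}


# Tool description in the OpenAI chat-completions "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY: Dict[str, Callable[..., dict]] = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Decode the model's JSON arguments, run the tool, return JSON text."""
    args = json.loads(arguments_json)
    return json.dumps(REGISTRY[name](**args))


def answer_with_tools(client, question: str) -> str:
    """One round of tool use; `client` is an OpenAI SDK client (assumed)."""
    messages = [{"role": "user", "content": question}]
    first = client.chat.completions.create(
        model=MODEL, messages=messages, tools=TOOLS
    )
    message = first.choices[0].message
    if not message.tool_calls:
        return message.content  # the model answered directly
    messages.append(message)  # keep the assistant turn that requested tools
    for call in message.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(
        model=MODEL, messages=messages, tools=TOOLS
    )
    return final.choices[0].message.content
```

Keeping the dispatch step (`run_tool_call`) separate from the network calls makes it possible to unit-test tool handling without touching the API.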
Benchmarks and how it compares
Nemotron-Ultra-253B benchmark results
NVIDIA's dual-mode approach lets the model trade extra output tokens for accuracy at inference time. Enabling reasoning mode raises the score on every benchmark below, most sharply on AIME 2025 (16.67% to 72.50%) and LiveCodeBench (29.03% to 66.31%).
| Benchmark | Standard Mode | Reasoning Mode |
|---|---|---|
| MATH-500 | 80.40% | 97.00% |
| AIME 2025 | 16.67% | 72.50% |
| LiveCodeBench | 29.03% | 66.31% |
| GPQA Diamond | 56.60% | 76.01% |
Source: NVIDIA and OpenRouter.
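The gap between the two modes can be read off the table directly. The short sketch below computes the gain in percentage points for each benchmark, using only the scores listed above; `gain_points` is an illustrative helper name.

```python
# Scores copied from the table above: (standard mode, reasoning mode), in percent.
SCORES = {
    "MATH-500": (80.40, 97.00),
    "AIME 2025": (16.67, 72.50),
    "LiveCodeBench": (29.03, 66.31),
    "GPQA Diamond": (56.60, 76.01),
}


def gain_points(standard: float, reasoning: float) -> float:
    """Improvement from standard to reasoning mode, in percentage points."""
    return round(reasoning - standard, 2)


for benchmark, (standard, reasoning) in SCORES.items():
    print(f"{benchmark}: +{gain_points(standard, reasoning):.2f} points")
```

The gains range from 16.60 points on MATH-500 to 55.83 points on AIME 2025.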
Comparison to other models
Compared to the original Meta Llama 3.1 405B, Nemotron-Ultra-253B has about 38% fewer parameters, so its weights need roughly 38% less VRAM at the same precision, while retaining most of the original's capability. This makes it a far more practical choice for teams transitioning off hyperscaler credits who need to manage infrastructure costs.
Against DeepSeek R1, Nemotron-Ultra-253B scores higher on GPQA Diamond and LiveCodeBench, while DeepSeek R1 holds a slight edge on MATH-500 (97.3% vs 97.0%). Nemotron's dense architecture avoids MoE routing overhead and fits on a single 8xH100 node. DeepSeek R1's 671B total parameter count typically requires a 16-GPU cluster for BF16 inference, which makes Nemotron-Ultra-253B the more accessible option for self-hosting or dedicated cloud deployments. For teams prioritizing coding and scientific reasoning, Nemotron offers a strong balance of benchmark performance and hardware efficiency.
Using it in production
Production configuration for Nemotron-Ultra-253B
When deploying Nemotron-Ultra-253B via Lyceum Technology's Serverless Inference API, you are accessing our Standard tier in the eu-north1 region. This tier is designed for high-capability models where complex reasoning and accuracy are prioritized over raw speed.
To control the model's reasoning behavior, you must configure the system prompt. Including "detailed thinking on" instructs the model to generate a chain-of-thought before answering, which is ideal for coding and math. If you need lower latency for standard chat, use "detailed thinking off". Because the model supports a 128K context window, you can pass large documents for RAG workflows, but keep in mind that the chain-of-thought consumes output tokens when reasoning is enabled, so set max_tokens with that in mind.
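When reasoning is enabled, applications often want to separate the chain-of-thought from the final answer before showing it to users. The sketch below assumes the reasoning arrives inline in the message content wrapped in `<think>...</think>` tags; that is an assumption about the response format, since some serving configurations return reasoning in a separate field instead. `split_reasoning` is a hypothetical helper.

```python
import re
from typing import Tuple

# Assumption: the chain-of-thought is wrapped in <think>...</think> inside
# the message content. Adjust if your responses use a different format.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no block is found."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(answer)
```

To see how many output tokens a reasoned response used, the OpenAI SDK exposes `response.usage.completion_tokens` on the completion object.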
Calculating per-token pricing
Lyceum Technology charges $0.60 per million input tokens and $1.80 per million output tokens for this model. Consider a RAG application that processes a 10,000-token document and generates a 1,500-token reasoned response.
- Input cost: 10,000 tokens * ($0.60 / 1,000,000) = $0.006
- Output cost: 1,500 tokens * ($1.80 / 1,000,000) = $0.0027
- Total cost per request: $0.0087
This pay-per-token model allows you to scale from zero without committing to the upfront cost of an 8xH100 node. You pay only for the tokens your application consumes, which makes it cost-effective for bursty or unpredictable workloads.
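The per-request arithmetic above can be wrapped in a small helper for budgeting. The prices are the ones quoted in this article; the function name and the monthly request volume are hypothetical.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_MILLION = 0.60
OUTPUT_USD_PER_MILLION = 1.80


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-token prices."""
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


per_request = request_cost_usd(10_000, 1_500)
print(f"Per request: ${per_request:.4f}")

# Hypothetical volume, purely for illustration.
monthly_requests = 100_000
print(f"Per month: ${per_request * monthly_requests:.2f}")
```

For the 10,000-token input and 1,500-token output example, this reproduces the $0.0087 figure; at an illustrative 100,000 requests per month that comes to about $870.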
Running Nemotron-Ultra-253B on EU-sovereign infrastructure
Why run Nemotron-Ultra-253B on Lyceum
For European enterprises, deploying a 253B-parameter model typically means relying on US-based hyperscalers or API providers, which introduces significant data privacy risks. Lyceum Technology solves this by offering Nemotron-Ultra-253B on our fully EU-sovereign infrastructure. Hosted in our eu-north1 region, your inference workloads are processed entirely within European borders, ensuring strict GDPR compliance and data residency.
Unlike competitors who rent their hardware from larger public clouds, Lyceum owns and operates its GPU infrastructure. This structural advantage allows us to offer competitive per-token pricing without the markup associated with middleman API providers. Our open stack, built on vLLM and NVIDIA Dynamo, means you are never locked into a proprietary black-box inference engine and you retain full visibility into how your workloads are executed. Because Lyceum does not charge egress fees, you can move your data and model outputs without the hidden costs associated with traditional cloud providers.
Because our Serverless Inference API is a drop-in replacement for the OpenAI API, your existing OpenAI SDK code keeps working and your engineering team can migrate to Lyceum in minutes. You get the intelligence of a frontier-class NVIDIA reasoning model, the reliability of a managed API, and the legal certainty of European data sovereignty. Whether you are building autonomous agents or complex RAG pipelines, combining NVIDIA's optimized model architecture with Lyceum's purpose-built European cloud gives AI teams frontier-level performance without compromising on data privacy or infrastructure costs.