What is the context window for Nemotron-3-Ultra-550b?

Nemotron-3-Ultra-550b has a 1,000,000-token context window. This allows the model to process extensive codebases, long document repositories, and sustained agentic conversation histories in a single prompt. According to NVIDIA, the model achieves 95% accuracy on the RULER benchmark at the full 1M context length.

How much does it cost to run Nemotron-3-Ultra-550b on Lyceum?

On the Lyceum Technology Serverless Inference API, Nemotron-3-Ultra-550b costs $1.00 per million input tokens and $3.00 per million output tokens. Billing is strictly pay-per-token with no base fees or minimum commitments.

Where is Nemotron-3-Ultra-550b hosted?

The serverless API endpoint for this model is served from Lyceum's us-central1 region, so it runs in the US, not the EU. For higher, predictable throughput you can burst the same model onto dedicated GPU virtual machines billed per second.

How do I call Nemotron-3-Ultra-550b using the API?

You can call the model using the standard OpenAI SDK by changing the base URL to https://api.lyceum.technology/api/v2/external/serverless and using the model string nvidia/Nemotron-3-Ultra-550b-a55b. The endpoint is OpenAI-compatible, so it acts as a drop-in replacement.

How does Nemotron-3-Ultra-550b compare to Kimi K2.6?

Nemotron-3-Ultra-550b ties Kimi K2.6 on agent productivity (both score 91% on PinchBench) despite having roughly half the parameters (550B vs 1T). It trails on coding tasks, however, scoring 54% on Terminal-Bench 2.0 against Kimi K2.6's 67%, a 13-point gap.

What license does Nemotron-3-Ultra-550b use?

The model is released under an open-weight license, which gives developers the weights, training datasets, and development recipes needed to fine-tune and deploy the model on their own infrastructure.

Nemotron-3-Ultra API: pricing, benchmarks & specs

Nemotron-3-Ultra-550b is NVIDIA's flagship open-weight model, featuring 550 billion total parameters and 55 billion active parameters. Built on a hybrid Mamba-Transformer Mixture-of-Experts architecture, it is specifically optimized for long-running autonomous agents, complex reasoning, and deep research workflows. Lyceum Technology serves Nemotron-3-Ultra-550b via our OpenAI-compatible Serverless Inference API. You can run this frontier model on our infrastructure with pay-per-token billing and no idle or base fees; dedicated GPU virtual machines, billed per second, are available when you need predictable throughput. Lyceum provides the high-performance backbone for agentic loops and 1M-token contexts.

Get started: call Nemotron-3-Ultra-550b on Lyceum

To begin building with Nemotron-3-Ultra-550b, you can use the standard OpenAI Python SDK. Because Lyceum Technology provides an OpenAI-compatible API, switching your existing applications requires zero code changes beyond updating the base URL and your API key. The model string for this endpoint is nvidia/Nemotron-3-Ultra-550b-a55b.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request to the model.
response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Ultra-550b-a55b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
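
In practice you will probably want to keep the API key out of source code. The following is a minimal sketch of one way to do that, assuming the key and an optional base URL override live in environment variables. The names LYCEUM_API_KEY and LYCEUM_BASE_URL and the helper lyceum_client_kwargs are hypothetical, chosen for this illustration; they are not defined by Lyceum or by the OpenAI SDK.

```python
import os

# Base URL quoted in this article.
DEFAULT_BASE_URL = "https://api.lyceum.technology/api/v2/external/serverless"


def lyceum_client_kwargs(env=os.environ):
    """Return keyword arguments for OpenAI(...), read from a mapping.

    LYCEUM_API_KEY and LYCEUM_BASE_URL are hypothetical variable names
    used only in this sketch.
    """
    api_key = env.get("LYCEUM_API_KEY")
    if not api_key:
        raise RuntimeError("Set LYCEUM_API_KEY before creating the client")
    return {
        "base_url": env.get("LYCEUM_BASE_URL", DEFAULT_BASE_URL),
        "api_key": api_key,
    }


# Usage (requires the openai package and a real key):
#   from openai import OpenAI
#   client = OpenAI(**lyceum_client_kwargs())
```

Passing the environment as an argument keeps the helper easy to exercise without touching the real process environment.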

Pricing and region for Nemotron-3-Ultra-550b

This model is available on our Standard tier, which is designed for high-capability, demanding agentic workloads. The serverless endpoint for this specific model is hosted in the us-central1 region. Pricing is strictly pay-per-token, with no minimum commitments or base fees. The input cost is $1.00 per million tokens, and the output cost is $3.00 per million tokens.

For teams transitioning off hyperscaler credits, this pay-as-you-go structure means you pay only for the tokens you consume. If you need predictable throughput at scale, you can also burst this model onto dedicated GPU virtual machines billed per second. By using our serverless GPU inference platform, your engineering team can avoid the operational overhead of managing complex Kubernetes clusters or dealing with out-of-memory errors on a 550B-parameter model.

What Nemotron-3-Ultra-550b is good at

Built for long-running autonomous agents

Nemotron-3-Ultra-550b is engineered specifically for agentic workflows that require sustained reasoning over extended periods. The model features a massive 1,000,000-token context window, allowing it to preserve long agent states, system logs, and execution plans across sustained sessions. According to NVIDIA, the model achieves 95% accuracy on the RULER benchmark at the full 1M context length.

Hybrid Mamba-Transformer architecture

The model uses a Latent Mixture-of-Experts (MoE) architecture that interleaves Mamba-2 state space layers with traditional Transformer attention layers. The Mamba layers provide linear scaling and sequence efficiency for very long contexts, while the Transformer layers deliver the precision reasoning required for complex logic. The MoE design lets the model activate only 55 billion parameters during inference, 10% of the total, while still drawing on the knowledge capacity of all 550 billion.

High-throughput speculative decoding

Nemotron-3-Ultra-550b includes Multi-Token Prediction (MTP) layers that enable native speculative decoding, in which several tokens are drafted and then verified together rather than generated strictly one at a time. NVIDIA reports that this model achieves up to 5.9x higher inference throughput compared to dense models of similar capability, making it well suited to production deployments where generation speed is critical.

Granular reasoning control

For complex tasks, the model supports inference-time reasoning budget control. Developers can utilize parameters like enable_thinking and reasoning_budget to dictate how much compute the model allocates to generating internal reasoning traces before it outputs a final answer.
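
As a rough illustration of reasoning budget control, the sketch below builds the extra request fields for a call through the OpenAI Python SDK. The parameter names enable_thinking and reasoning_budget come from the text above. How they are transmitted is an assumption: this sketch sends them as top-level fields through the SDK's extra_body argument, and some OpenAI-compatible servers expect them nested elsewhere, so check the endpoint's documentation before relying on it. The helper name reasoning_request_kwargs is hypothetical.

```python
from typing import Optional


def reasoning_request_kwargs(enable_thinking: bool,
                             reasoning_budget: Optional[int] = None) -> dict:
    """Build extra keyword arguments for client.chat.completions.create.

    ASSUMPTION: the server accepts enable_thinking and reasoning_budget as
    top-level JSON fields; the placement is not confirmed by the source.
    """
    if reasoning_budget is not None and reasoning_budget < 0:
        raise ValueError("reasoning_budget must be non-negative")
    extra = {"enable_thinking": enable_thinking}
    # A budget only makes sense when thinking is switched on.
    if enable_thinking and reasoning_budget is not None:
        extra["reasoning_budget"] = reasoning_budget
    return {"extra_body": extra}


# Usage with an already configured client (not executed here):
#   response = client.chat.completions.create(
#       model="nvidia/Nemotron-3-Ultra-550b-a55b",
#       messages=[{"role": "user", "content": "Plan the migration."}],
#       max_tokens=8192,
#       **reasoning_request_kwargs(True, reasoning_budget=2048),
#   )
```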

Benchmarks and how it compares

Nemotron-3-Ultra-550b benchmark results

NVIDIA has published extensive benchmark data demonstrating how Nemotron-3-Ultra-550b performs against other frontier-class open models. The model is particularly strong in agent productivity and instruction following, though it faces stiff competition in raw coding tasks.

Benchmark	Nemotron 3 Ultra (550B)	Kimi K2.6 (1T)	GLM 5.1 (744B)
PinchBench (Agent Productivity)	91%	91%	84%
IFBench (Instruction Following)	82%	-	77%
Terminal-Bench 2.0 (Coding)	54%	67%	64%
SWE-Bench Verified	71.9%	-	-

Source: NVIDIA Technical Blog.

When compared to its sibling model, Nemotron-3-Super-120B, the Ultra variant offers significantly higher capacity for complex, multi-step reasoning. While the 120B Super model activates only 12B parameters and is optimized for maximum compute efficiency, the 550B Ultra model activates 55B parameters, providing the deep knowledge retrieval and logical rigor required for enterprise-grade research and autonomous agent orchestration. For tasks requiring the absolute highest accuracy and longest context retention, Nemotron-3-Ultra-550b is the superior choice within the NVIDIA catalogue.

Using it in production

Production configuration for Nemotron-3-Ultra-550b

Deploying Nemotron-3-Ultra-550b in production requires understanding its tier, region, and pricing structure. On Lyceum Technology, this model is categorized under our Standard tier, which is reserved for high-capability models handling demanding agentic workloads.

The serverless endpoint for nvidia/Nemotron-3-Ultra-550b-a55b is currently hosted in the us-central1 region. When configuring your API requests, you can take full advantage of the model's 1,000,000-token context window. This massive context allows you to pass entire codebases, extensive documentation, or long conversation histories in a single prompt.
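
When prompts approach the full window, it can help to check how much room is left for the completion before sending a request. The sketch below assumes that the prompt and the generated output share the same 1,000,000-token window, which is common but not stated in this article, and that you already have a prompt token count from a tokenizer or from a previous response's usage data. The helper names are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # context window quoted in this article


def remaining_output_budget(prompt_tokens: int,
                            context_window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Tokens left for the completion.

    ASSUMPTION: prompt and completion share one context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_window - prompt_tokens, 0)


def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int) -> int:
    """Cap a requested max_tokens value at what still fits in the window."""
    return min(requested_max_tokens, remaining_output_budget(prompt_tokens))
```

For example, a 990,000-token prompt leaves 10,000 tokens, so a requested max_tokens of 32,000 would be capped at 10,000 under this assumption.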

Pricing is calculated strictly per token. The input rate is $1.00 per million tokens, and the output rate is $3.00 per million tokens. To understand the unit economics, consider a deep research task where you submit 100,000 tokens of source material and the model generates a 2,000-token analysis, including its internal reasoning trace. The input cost would be $0.10, and the output cost would be $0.006, resulting in a total API call cost of $0.106.
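
The same arithmetic can be wrapped in a small helper for budgeting. This is a minimal sketch that uses only the two rates quoted above; the function name estimate_cost is hypothetical, and it assumes reasoning-trace tokens are billed as ordinary output tokens, as in the example above.

```python
from decimal import Decimal

# Rates quoted in this article, in USD per one million tokens.
INPUT_USD_PER_MILLION = Decimal("1.00")
OUTPUT_USD_PER_MILLION = Decimal("3.00")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of a single API call."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_MILLION * INPUT_USD_PER_MILLION
    output_cost = Decimal(output_tokens) / TOKENS_PER_MILLION * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# The deep research example: 100,000 input tokens and 2,000 output tokens.
cost = estimate_cost(100_000, 2_000)
print(f"${cost:.3f}")
```

Decimal is used instead of float so that small per-call amounts add up exactly when summed over many requests.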

Because the model supports reasoning traces, you should configure your max_tokens parameter generously to ensure the model has enough output space to complete its thought process. We recommend enabling streaming in your API calls so your application can process the reasoning trace in real time, reducing perceived latency for end users while the model computes its final answer.
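
A minimal sketch of consuming a streamed response follows. It assumes the text arrives in chunk.choices[0].delta.content, which is the shape the OpenAI Python SDK uses for streamed chat completions; whether the reasoning trace shares that field or arrives in a separate one depends on the server and is not confirmed here. The helper collect_stream and the stand-in chunks are hypothetical and exist only so the logic can be run offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=print):
    """Forward each streamed text fragment to on_text and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks have no content
            on_text(text)
            parts.append(text)
    return "".join(parts)


# With a configured client the stream would come from (not executed here):
#   stream = client.chat.completions.create(
#       model="nvidia/Nemotron-3-Ultra-550b-a55b",
#       messages=[{"role": "user", "content": "Summarize the logs."}],
#       max_tokens=8192,
#       stream=True,
#   )
#   answer = collect_stream(stream)


def fake_chunk(text):
    """Stand-in object shaped like a streamed chunk, for offline use only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]
answer = collect_stream(demo_chunks, on_text=lambda fragment: None)
```

Passing a callback for on_text lets the application render fragments as they arrive while still ending up with the complete text.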

Why run Nemotron-3-Ultra-550b on Lyceum

Choosing the right infrastructure provider is critical when deploying frontier models like Nemotron-3-Ultra-550b. Lyceum Technology pairs high performance with a transparent, developer-first platform.

The serverless endpoint for this model is served from Lyceum's us-central1 region, so it runs in the US. Integration is straightforward: because our Serverless Inference API is fully OpenAI-compatible, your engineering team can switch to Lyceum by updating a single base URL, with no SDK rewrites or proprietary client libraries.

Pricing is strictly pay-per-token with no idle time and no base fees, so you only pay for the compute you actually consume. Our platform is built on open-stack transparency: we run optimized open inference engines like vLLM and NVIDIA Dynamo rather than black-box proprietary stacks, which guarantees customer portability by design. Billing is unified across serverless and dedicated GPU usage, and there are zero egress fees.

When you need predictable throughput at scale, you can burst from the pay-per-token API onto dedicated GPU virtual machines billed per second, for example a dedicated 8x H200 cluster, without changing your application code. You also gain access to 40-80% cheaper compute compared to traditional hyperscalers. Learn more about how this works in our guide to serverless GPU inference.
