Nemotron-3-Ultra-550b: specs, benchmarks, and how to run it on Lyceum
NVIDIA's 550B hybrid Mamba-Transformer MoE built for long-running agentic workflows.
Maximilian Niroomand
June 23, 2026 · CTO & Co-Founder at Lyceum Technology
Nemotron-3-Ultra-550b is NVIDIA's flagship open-weight model, with 550 billion total parameters, of which 55 billion are active during inference. Built on a hybrid Mamba-Transformer Mixture-of-Experts architecture, it is optimized for long-running autonomous agents, complex reasoning, and deep research workflows. Lyceum Technology serves Nemotron-3-Ultra-550b through our OpenAI-compatible Serverless Inference API, billed per token with no idle or base fees. The endpoint is built for agentic loops and contexts of up to 1M tokens.
Get started: call Nemotron-3-Ultra-550b on Lyceum
To begin building with Nemotron-3-Ultra-550b, you can use the standard OpenAI Python SDK. Because Lyceum Technology provides an OpenAI-compatible API, switching an existing application requires no code changes beyond updating the base URL, the API key, and the model string. The model string for this endpoint is nvidia/Nemotron-3-Ultra-550b-a55b.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Ultra-550b-a55b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
Pricing and region for Nemotron-3-Ultra-550b
This model is available on our Standard tier, which is designed for high-capability, demanding agentic workloads. The serverless endpoint for this specific model is hosted in the us-central1 region. Pricing is strictly pay-per-token, with no minimum commitments or base fees. The input cost is $1.00 per million tokens, and the output cost is $3.00 per million tokens.
For teams transitioning off hyperscaler credits, this pay-as-you-go structure ensures you only pay for the exact compute you consume. If you need predictable throughput at scale, you can also burst this model onto dedicated GPU virtual machines billed per second. By utilizing our serverless GPU inference platform, your engineering team can avoid the operational overhead of managing complex Kubernetes clusters or dealing with out-of-memory errors on massive 550B parameter models.
What Nemotron-3-Ultra-550b is good at
Built for long-running autonomous agents
Nemotron-3-Ultra-550b is engineered specifically for agentic workflows that require sustained reasoning over extended periods. The model features a massive 1,000,000-token context window, allowing it to preserve long agent states, system logs, and execution plans across sustained sessions. According to NVIDIA, the model achieves 95% accuracy on the RULER benchmark at the full 1M context length.
Hybrid Mamba-Transformer architecture
The model utilizes a Latent Mixture-of-Experts (MoE) architecture that interleaves Mamba-2 state space layers with traditional Transformer attention layers. The Mamba layers provide linear scaling and sequence efficiency for massive contexts, while the Transformer layers deliver the precision reasoning required for complex logic. This hybrid approach allows the model to activate only 55 billion parameters during inference while leveraging the knowledge capacity of its 550 billion total parameters.
High-throughput speculative decoding
Nemotron-3-Ultra-550b includes Multi-Token Prediction (MTP) layers that enable native speculative decoding, which accelerates token generation. NVIDIA reports that this model achieves up to 5.9x higher inference throughput compared to dense models of similar capability, making it well suited to production deployments where overall generation speed is critical.
Granular reasoning control
For complex tasks, the model supports inference-time reasoning budget control. Developers can utilize parameters like enable_thinking and reasoning_budget to dictate how much compute the model allocates to generating internal reasoning traces before it outputs a final answer.
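As a rough sketch of how these controls might be sent through the OpenAI Python SDK: the SDK's `extra_body` argument forwards additional JSON fields to the server. Whether Lyceum's endpoint accepts `enable_thinking` and `reasoning_budget` as top-level fields (rather than nested under another key) is an assumption here, so check the endpoint documentation before relying on it. The helper name `build_reasoning_request` and the default values are hypothetical, as is the assumption that reasoning tokens count against `max_tokens`.

```python
MODEL = "nvidia/Nemotron-3-Ultra-550b-a55b"


def build_reasoning_request(prompt, enable_thinking=True,
                            reasoning_budget=4096, max_tokens=8192):
    """Build keyword arguments for client.chat.completions.create().

    ASSUMPTION: the server reads enable_thinking and reasoning_budget as
    top-level request fields; the exact placement is not confirmed.
    """
    if reasoning_budget < 0:
        raise ValueError("reasoning_budget must be non-negative")
    # ASSUMPTION: the reasoning trace is part of the output, so max_tokens
    # must leave room for the final answer after the reasoning budget.
    if enable_thinking and max_tokens <= reasoning_budget:
        raise ValueError("max_tokens must exceed reasoning_budget")
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # extra_body is the OpenAI SDK's hook for non-standard request fields.
        "extra_body": {
            "enable_thinking": enable_thinking,
            "reasoning_budget": reasoning_budget,
        },
    }


request = build_reasoning_request("Plan a three-step migration.", reasoning_budget=2048)
print(request["extra_body"])
```

With a configured client, the request would be sent as `client.chat.completions.create(**request)`.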
Benchmarks and how it compares
Nemotron-3-Ultra-550b benchmark results
NVIDIA has published benchmark data comparing Nemotron-3-Ultra-550b against other frontier-class open models. In the figures below, the model ties Kimi K2.6 and leads GLM 5.1 on agent productivity, and leads GLM 5.1 on instruction following, but it trails both models on the Terminal-Bench 2.0 coding benchmark.
| Benchmark | Nemotron 3 Ultra (550B) | Kimi K2.6 (1T) | GLM 5.1 (744B) |
|---|---|---|---|
| PinchBench (Agent Productivity) | 91% | 91% | 84% |
| IFBench (Instruction Following) | 82% | - | 77% |
| Terminal-Bench 2.0 (Coding) | 54% | 67% | 64% |
| SWE-Bench Verified | 71.9% | - | - |
Source: NVIDIA Technical Blog.
When compared to its sibling model, Nemotron-3-Super-120B, the Ultra variant offers significantly higher capacity for complex, multi-step reasoning. While the 120B Super model activates only 12B parameters and is optimized for maximum compute efficiency, the 550B Ultra model activates 55B parameters, providing the deep knowledge retrieval and logical rigor required for enterprise-grade research and autonomous agent orchestration. For tasks requiring the absolute highest accuracy and longest context retention, Nemotron-3-Ultra-550b is the superior choice within the NVIDIA catalogue.
Using it in production
Production configuration for Nemotron-3-Ultra-550b
Deploying Nemotron-3-Ultra-550b in production requires understanding its tier, region, and pricing structure. On Lyceum Technology, this model is categorized under our Standard tier, which is reserved for high-capability models handling demanding agentic workloads.
The serverless endpoint for nvidia/Nemotron-3-Ultra-550b-a55b is currently hosted in the us-central1 region. When configuring your API requests, you can take full advantage of the model's 1,000,000-token context window. This massive context allows you to pass entire codebases, extensive documentation, or long conversation histories in a single prompt.
Pricing is calculated strictly per token. The input rate is $1.00 per million tokens, and the output rate is $3.00 per million tokens. To understand the unit economics, consider a deep research task where you submit 100,000 tokens of source material and the model generates a 2,000-token analysis, including its internal reasoning trace. The input cost would be $0.10, and the output cost would be $0.006, resulting in a total API call cost of $0.106.
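The same arithmetic can be written as a small helper for estimating the cost of a call before sending it. This is a minimal sketch using the rates quoted above; the function name `estimate_cost_usd` is illustrative and not part of any Lyceum SDK.

```python
INPUT_USD_PER_MILLION = 1.00   # input rate quoted above
OUTPUT_USD_PER_MILLION = 3.00  # output rate quoted above


def estimate_cost_usd(input_tokens, output_tokens):
    """Return (input_cost, output_cost, total) in USD for one API call."""
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# The deep research example: 100,000 input tokens, 2,000 output tokens.
input_cost, output_cost, total = estimate_cost_usd(100_000, 2_000)
print(f"input ${input_cost:.3f} + output ${output_cost:.3f} = ${total:.3f}")
# prints: input $0.100 + output $0.006 = $0.106
```

In OpenAI-compatible responses, actual token counts are normally reported in `response.usage.prompt_tokens` and `response.usage.completion_tokens`; assuming this endpoint populates those fields, they can be passed to the helper after each call to track real spend.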
Because the model supports reasoning traces, you should configure your max_tokens parameter generously to ensure the model has enough output space to complete its thought process. We recommend enabling streaming in your API calls so your application can process the reasoning trace in real time, reducing perceived latency for end users while the model computes its final answer.
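A minimal streaming sketch with the OpenAI Python SDK is shown below. It relies on the standard `stream=True` option and the `chunk.choices[0].delta.content` field of streamed chat completions. How the endpoint surfaces the reasoning trace (inline in `content` or in a separate field) is not specified here, so this sketch only handles `delta.content`. The helper names and the `max_tokens` value of 16,384 are illustrative assumptions.

```python
import sys

MODEL = "nvidia/Nemotron-3-Ultra-550b-a55b"


def consume_stream(chunks, write=sys.stdout.write):
    """Forward each text delta to `write` as it arrives; return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Some servers emit chunks without choices (e.g. usage-only chunks).
            continue
        text = chunk.choices[0].delta.content
        if text:
            # delta.content can be None on role-only or final chunks.
            parts.append(text)
            write(text)
    return "".join(parts)


def stream_answer(client, prompt, max_tokens=16_384):
    """Send a streaming request; `client` is an openai.OpenAI instance."""
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,  # generous, to leave room for reasoning output
        stream=True,
    )
    return consume_stream(stream)
```

Separating `consume_stream` from the network call keeps the chunk handling easy to unit test with stub objects, and lets the application swap `write` for a websocket or UI callback.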
Why run Nemotron-3-Ultra-550b on Lyceum
Choosing the right infrastructure provider is critical when deploying frontier models like Nemotron-3-Ultra-550b. Lyceum Technology pairs high performance with a transparent, developer-first platform.
The serverless endpoint for this model is served from Lyceum's us-central1 region, so it runs in the US. Integration is straightforward: because our Serverless Inference API is fully OpenAI-compatible, your engineering team can switch to Lyceum by updating a single base URL, with no SDK rewrites or proprietary client libraries.
Pricing is strictly pay-per-token with no idle time and no base fees, so you only pay for the compute you actually consume. Our platform is built on open-stack transparency: we run optimized open inference engines like vLLM and NVIDIA Dynamo rather than black-box proprietary stacks, which guarantees customer portability by design. Billing is unified across serverless and dedicated GPU usage, and there are zero egress fees.
When you need predictable throughput at scale, you can burst from the pay-per-token API onto dedicated GPU virtual machines billed per second, for example a dedicated 8x H200 cluster, without changing your application code. You also gain access to 40-80% cheaper compute compared to traditional hyperscalers. Learn more about how this works in our guide to serverless GPU inference.