Nemotron-3-Super-120b-a12b: specs, benchmarks, and how to run it on Lyceum

NVIDIA's 120B LatentMoE model with 12B active parameters for agentic workflows.

Magnus Grünewald

June 23, 2026 · CEO at Lyceum Technology

Nemotron-3-Super-120b-a12b is a highly efficient large language model developed by NVIDIA. Using a Latent Mixture-of-Experts (MoE) architecture, it houses 120 billion total parameters but activates only 12 billion per forward pass, drastically reducing inference costs while maintaining high reasoning capabilities. Designed for agentic workflows, complex coding, and long-context retrieval up to 1 million tokens, it is a powerful choice for production AI systems. Lyceum serves Nemotron-3-Super-120b-a12b via our OpenAI-compatible Serverless Inference API, allowing developers to integrate it instantly. This model runs in the US, served from Lyceum's us-central1 region.

Call Nemotron-3-Super-120b-a12b on Lyceum

To call Nemotron-3-Super-120b-a12b on Lyceum, you only need to update your base URL and API key. Because Lyceum provides an OpenAI-compatible Serverless Inference API, you can use the standard OpenAI Python SDK without rewriting your application logic. This makes it trivial to swap out existing models for NVIDIA's highly efficient LatentMoE architecture. Whether you are building complex multi-agent systems or deploying an autonomous coding assistant, integrating this model takes only seconds.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's Serverless Inference API.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request to the hosted model.
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for Nemotron-3-Super-120b-a12b

When you deploy this model through Lyceum's Serverless Inference API, you are billed strictly on a pay-per-token basis. The pricing for Nemotron-3-Super-120b-a12b is set at $0.30 per million input tokens and $0.90 per million output tokens. This model is categorized under our Standard tier, which is reserved for high-capability models that handle complex reasoning and agentic tasks. Currently, this specific endpoint is hosted in the us-central1 region. By leveraging Lyceum's infrastructure, you avoid the overhead of provisioning the massive GPU clusters typically required to run a 120B parameter model, paying only for the exact compute you consume during inference. This pay-per-token approach ensures that your infrastructure costs scale linearly with your actual usage, eliminating the financial drain of idle compute instances.

Nemotron-3-Super-120b-a12b use cases

LatentMoE efficiency and architecture

Nemotron-3-Super-120b-a12b is built on a cutting-edge Latent Mixture-of-Experts (MoE) architecture that combines Mamba-2, MoE, and Attention mechanisms. While the model contains 120 billion total parameters, it activates only 12 billion parameters per forward pass. This sparse activation allows it to deliver the reasoning capabilities of a massive 120B model while operating with the inference speed and cost profile of a much smaller model. The LatentMoE design routes latent representations rather than raw tokens, allowing the experts to specialize in deep semantic meaning.
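
To make the sparse-activation idea concrete, here is a toy top-k expert router in plain Python. It is a generic Mixture-of-Experts illustration, not NVIDIA's LatentMoE implementation: the function names, the eight router scores, and `top_k=2` are hypothetical, and only the 120B and 12B figures come from the text above.

```python
import math

TOTAL_PARAMS_B = 120   # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 12   # parameters active per forward pass (from the article)


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that participate in one forward pass."""
    return active_b / total_b


def route_top_k(scores: list[float], top_k: int) -> dict[int, float]:
    """Select the top_k experts by router score.

    Returns a mapping of expert index -> mixing weight, where the weights are
    a softmax over the selected experts only. Experts that are not selected
    do no work for this input, which is where the compute savings come from.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


if __name__ == "__main__":
    print(f"Active share: {active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.0%}")
    # Hypothetical router scores for eight experts.
    print(route_top_k([0.1, 2.0, -1.0, 1.5, 0.3, 0.0, -0.5, 0.9], top_k=2))
```

In this sketch only two of the eight experts run for the input, mirroring how a sparse model can hold far more parameters than it uses on any single pass.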

Agentic workflows and tool calling

NVIDIA specifically optimized this model for collaborative AI agents and high-volume autonomous workloads, such as IT ticket automation and complex software engineering tasks. It features a configurable reasoning mode that generates an internal thought trace before concluding with a final response. This makes it exceptionally strong at multi-step planning, tool use, and executing agentic loops without losing focus.
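
As a rough sketch of what a tool-calling request could look like, the snippet below builds a chat payload in the OpenAI `tools` format and dispatches a returned call to a local function. The `close_ticket` tool, its schema, and the helper names are hypothetical, and whether this endpoint accepts the `tools` parameter is an assumption to verify before relying on it.

```python
import json

MODEL = "nvidia/nemotron-3-super-120b-a12b"

# Hypothetical tool definition in the OpenAI function-calling schema.
TICKET_TOOL = {
    "type": "function",
    "function": {
        "name": "close_ticket",
        "description": "Close an IT ticket with a resolution note.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticket_id": {"type": "string"},
                "resolution": {"type": "string"},
            },
            "required": ["ticket_id", "resolution"],
        },
    },
}


def close_ticket(ticket_id: str, resolution: str) -> str:
    """Stand-in for a real ticketing-system call."""
    return f"Ticket {ticket_id} closed: {resolution}"


def build_request(user_message: str) -> dict:
    """Keyword arguments for client.chat.completions.create(**request)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [TICKET_TOOL],
        "max_tokens": 256,
    }


def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; arguments arrive as a JSON-encoded string."""
    handlers = {"close_ticket": close_ticket}
    return handlers[name](**json.loads(arguments_json))
```

With a client configured for Lyceum, `client.chat.completions.create(**build_request("Close ticket T-1"))` would send the request. If the model decides to call the tool, each entry in `response.choices[0].message.tool_calls` carries `function.name` and `function.arguments`, which can be passed straight to `run_tool_call`.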

Massive 1M token context window

One of the standout features of Nemotron-3-Super-120b-a12b is its massive context window, supporting up to 1 million tokens. This extended memory capacity is critical for Retrieval-Augmented Generation (RAG) pipelines, analyzing entire codebases, or processing extensive document repositories. The hybrid Mamba-Transformer architecture ensures that the model maintains high retrieval accuracy even when the context window is fully saturated, preventing the degradation often seen in standard transformer models.
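
Before sending a very large prompt, it can help to check that it plausibly fits in the window. The sketch below uses a crude four-characters-per-token heuristic, which is an assumption and not the model's real tokenizer, and it assumes that generated tokens count against the same 1 million token budget. The function names are illustrative.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # from the article
CHARS_PER_TOKEN = 4                # rough heuristic, NOT the model's tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserved_output_tokens: int = 2_000) -> bool:
    """True if the documents plus a reserved output budget fit in the window."""
    used = sum(estimate_tokens(doc) for doc in documents)
    return used + reserved_output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    repo_files = ["def main():\n    pass\n" * 1000, "# README\n" * 500]
    print(fits_in_context(repo_files))
```

For production use, counting with the model's actual tokenizer gives a far more reliable answer than a character heuristic.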

Nemotron-3-Super-120b-a12b benchmarks

Nemotron-3-Super-120b-a12b benchmark results

NVIDIA designed Nemotron-3-Super-120b-a12b to compete directly with other models in the 120B parameter class, focusing heavily on reasoning, coding, and long-context retrieval. In published evaluations, it performs strongly across industry-standard benchmarks, with its clearest advantage in long-context retrieval.

Below is a comparison of Nemotron-3-Super-120b-a12b against two comparable models in its weight class: Qwen3.5-122B-A10B and GPT-OSS-120B.

| Benchmark | Nemotron-3-Super-120b-a12b | Qwen3.5-122B-A10B | GPT-OSS-120B |
| --- | --- | --- | --- |
| MMLU-Pro (Knowledge) | 83.73 | 86.70 | 81.00 |
| SWE-Bench Verified (Coding) | 60.47 | 66.40 | 41.90 |
| AIME (Math) | 90.21 | 90.36 | 92.50 |
| RULER @ 1M Context | 91.75 | 91.33 | 22.30 |
| GPQA (Science) | 79.23 | 86.60 | 80.10 |

Source: DeepInfra and GitHub evaluation logs.

Qwen3.5-122B-A10B leads on general knowledge (MMLU-Pro, 86.70 vs. 83.73), coding (SWE-Bench Verified, 66.40 vs. 60.47), and science (GPQA, 86.60 vs. 79.23), and GPT-OSS-120B leads on AIME (92.50 vs. 90.21). Nemotron-3-Super-120b-a12b remains competitive across the board and posts the top long-context result: 91.75 on RULER at a full 1 million tokens, narrowly ahead of Qwen3.5-122B-A10B (91.33) and far ahead of GPT-OSS-120B (22.30). For teams building agentic workflows that require extensive context, Nemotron offers an efficient option. Because only 12 billion parameters are active per forward pass, per-token compute stays well below that of a dense model of the same total size, even at long context lengths.
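
For readers who want to slice the comparison themselves, the short script below encodes the table's scores and reports the leader on each benchmark along with Nemotron's gap to that leader. The numbers are copied from the table; the helper names are illustrative.

```python
NEMOTRON = "Nemotron-3-Super-120b-a12b"
QWEN = "Qwen3.5-122B-A10B"
GPT_OSS = "GPT-OSS-120B"

# Scores copied from the comparison table (higher is better).
SCORES = {
    "MMLU-Pro": {NEMOTRON: 83.73, QWEN: 86.70, GPT_OSS: 81.00},
    "SWE-Bench Verified": {NEMOTRON: 60.47, QWEN: 66.40, GPT_OSS: 41.90},
    "AIME": {NEMOTRON: 90.21, QWEN: 90.36, GPT_OSS: 92.50},
    "RULER @ 1M": {NEMOTRON: 91.75, QWEN: 91.33, GPT_OSS: 22.30},
    "GPQA": {NEMOTRON: 79.23, QWEN: 86.60, GPT_OSS: 80.10},
}


def leader(benchmark: str) -> str:
    """Name of the highest-scoring model on a benchmark."""
    row = SCORES[benchmark]
    return max(row, key=row.get)


def gap_to_leader(benchmark: str, model: str = NEMOTRON) -> float:
    """Model's score minus the best score (0.0 when the model leads)."""
    row = SCORES[benchmark]
    return round(row[model] - max(row.values()), 2)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name:<20} leader: {leader(name):<28} Nemotron gap: {gap_to_leader(name):+.2f}")
```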

Production deployment for Nemotron-3-Super-120b-a12b

Production configuration for Nemotron-3-Super-120b-a12b

When deploying Nemotron-3-Super-120b-a12b in production, understanding its pricing and configuration parameters is essential for optimizing both performance and cost. As a Hybrid MoE model built for multi-agent AI and reasoning, it operates on Lyceum's Standard tier, which is designated for high-capability workloads. The model is currently hosted in the us-central1 region.

Because the model supports a 1 million token context window, you can pass entire codebases or extensive RAG context without truncation, as long as the request stays within that limit. However, you must account for the per-token pricing: $0.30 per million input tokens and $0.90 per million output tokens.

For a realistic production workload, such as an autonomous coding agent analyzing a repository, a single request might consume 50,000 input tokens and generate 2,000 output tokens.

  • Input cost: 50,000 tokens * ($0.30 / 1,000,000) = $0.015
  • Output cost: 2,000 tokens * ($0.90 / 1,000,000) = $0.0018
  • Total cost per request: $0.0168
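
The same arithmetic can be wrapped in a small helper for budgeting. The prices are the ones quoted above; the function names and the 10,000 requests-per-day figure are illustrative assumptions.

```python
INPUT_PRICE_PER_M = 0.30   # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_M = 0.90  # USD per million output tokens (from the article)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the quoted per-token prices."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Projected spend for a steady workload over a number of days."""
    return requests_per_day * days * request_cost(input_tokens, output_tokens)


if __name__ == "__main__":
    print(f"Per request: ${request_cost(50_000, 2_000):.4f}")  # $0.0168
    # Hypothetical volume of 10,000 such requests per day.
    print(f"Per 30 days: ${monthly_cost(10_000, 50_000, 2_000):,.2f}")
```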

To maximize the model's reasoning capabilities, you can enable its internal thought trace by passing specific configuration flags in your API request (often handled via extra_body parameters like {"chat_template_kwargs": {"enable_thinking": True}}). When streaming responses in production, ensure your application logic is prepared to handle these reasoning tokens, either by displaying them to the user as a "thinking" state or filtering them out before rendering the final answer.

Why run Nemotron-3-Super-120b-a12b on Lyceum

Lyceum provides a robust, developer-friendly platform for scaling AI workloads without the burden of managing complex hardware. By offering an OpenAI-compatible drop-in replacement API, Lyceum allows engineering teams to switch to Nemotron-3-Super-120b-a12b by simply updating a base URL and an API key. This eliminates the need to rewrite application logic or learn new SDKs, accelerating your time to market. This model is served from Lyceum's us-central1 region in the US.

Lyceum operates on a strict pay-per-token model with per-second billing for dedicated-GPU bursts. There are no minimum commitments, no base fees, no idle costs, and zero egress fees, so your spend tracks your real usage instead of reserved capacity. Unified billing gives you a single, predictable view across every model you call, whether you are transitioning off expiring hyperscaler credits or consolidating away from unreliable small providers.

You also benefit from our open-stack transparency. Lyceum runs optimized open inference engines like vLLM and NVIDIA Dynamo rather than locking you into proprietary black-box systems, so you can reason about performance, throughput, and cost directly. For background on how this works, see our guide to serverless GPU inference. The combination of drop-in compatibility, transparent open-stack tooling, and pay-per-token economics is what makes Lyceum a strong home for production deployments of Nemotron-3-Super-120b-a12b.

Frequently Asked Questions

What is the context window for Nemotron-3-Super-120b-a12b?

Nemotron-3-Super-120b-a12b supports a massive context window of up to 1 million tokens. This makes it highly capable for Retrieval-Augmented Generation (RAG) pipelines, analyzing large codebases, and processing extensive document repositories without losing critical information.

How much does it cost to run Nemotron-3-Super-120b-a12b on Lyceum?

On Lyceum's Serverless Inference API, Nemotron-3-Super-120b-a12b costs $0.30 per million input tokens and $0.90 per million output tokens. You are billed strictly on a pay-per-token basis with no minimum commitments or idle compute charges.

Where is Nemotron-3-Super-120b-a12b hosted?

This model endpoint is hosted in Lyceum's us-central1 region, which means it runs in the US. It is served through the same OpenAI-compatible Serverless Inference API as every other Lyceum model, with pay-per-token billing and no idle compute charges.

How do I call Nemotron-3-Super-120b-a12b using the OpenAI SDK?

You can call the model by setting your OpenAI client's base URL to https://api.lyceum.technology/api/v2/external/serverless, supplying your Lyceum API key, and using the model string nvidia/nemotron-3-super-120b-a12b. It acts as a drop-in replacement, requiring zero changes to your core application logic.

What makes the LatentMoE architecture efficient?

The Latent Mixture-of-Experts (MoE) architecture allows the model to house 120 billion total parameters but activate only 12 billion per forward pass. This sparse activation drastically reduces inference compute costs while maintaining the reasoning capabilities of a much larger model.

How does Nemotron-3-Super-120b-a12b compare to Qwen3.5-122B?

Both models are highly competitive in the 120B class. Qwen3.5-122B-A10B scores higher on general knowledge benchmarks like MMLU-Pro (86.70 vs. 83.73), while Nemotron-3-Super-120b-a12b is slightly ahead in long-context retrieval, scoring 91.75 on the RULER benchmark at a full 1 million tokens against 91.33 for Qwen3.5-122B-A10B.

Related Resources

/magazine/glm-5-2; /magazine/llama-3-3-70b; /magazine/gpt-oss-120b