Nemotron-3-Nano-30B: specs, benchmarks, and how to run it on Lyceum
A highly efficient 30B hybrid MoE model optimized for agentic reasoning and high-throughput inference.
Caspar Lehmkühler
June 22, 2026 · Head of Product at Lyceum Technology
Nemotron-3-Nano-30B-A3B is a 30-billion-parameter large language model developed by NVIDIA. Designed for agentic workflows, it uses a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture that activates only ~3 billion parameters per token. This gives it high throughput and low latency while supporting a 256,000-token context window. Lyceum Technology serves Nemotron-3-Nano-30B through our OpenAI-compatible Serverless Inference API, so existing OpenAI-client code can call it after a base URL and API key change. Because Lyceum operates entirely on EU-sovereign infrastructure, your data is processed in our eu-north1 region and your inference workloads remain GDPR-compliant.
Get started: call Nemotron-3-Nano-30B on Lyceum
Deploying NVIDIA's Nemotron-3-Nano-30B on Lyceum requires zero new frameworks. Because our Serverless Inference API is fully OpenAI-compatible, you can route your existing application traffic to this highly efficient MoE model by updating your base URL and API key. This eliminates the friction of learning proprietary SDKs and allows your engineering team to focus on building features rather than managing infrastructure.
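```python
from openai import OpenAI

# Point the standard OpenAI client at Lyceum's Serverless Inference API.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request; max_tokens caps the generated output.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

# Print the assistant's reply.
print(response.choices[0].message.content)
```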
Pricing and region for Nemotron-3-Nano-30B
Lyceum Technology serves this model in the Fast tier, optimized for cost-efficient, high-throughput workloads. The pricing is $0.06 per million input tokens and $0.24 per million output tokens. All inference for this endpoint runs on our EU-sovereign infrastructure in the eu-north1 region, ensuring your data never leaves Europe.
Because the API is a drop-in replacement for OpenAI-compatible clients, your engineering team can evaluate Nemotron-3-Nano-30B against your current production models in minutes. There are no minimum commitments, subscription tiers, or base fees; you pay strictly per token processed. For teams migrating from hyperscaler environments, this provides immediate cost visibility and removes the need to provision dedicated GPU instances for bursty agentic workloads.
What Nemotron-3-Nano-30B is good at
Agentic reasoning and tool use
Nemotron-3-Nano-30B was engineered specifically for multi-step agentic workflows. NVIDIA trained the model to generate an internal reasoning trace before outputting a final response. This "thinking" capability allows the model to map out logic puzzles, execute complex routing decisions, and handle multi-constraint instructions with high reliability. It excels at structuring messy text into clean JSON and utilizing external tools, making it highly effective as a sub-agent in larger AI systems.
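As a rough illustration of the text-to-JSON use case, the sketch below builds a chat request that asks for JSON only and validates the reply before a parent agent consumes it. The prompt wording, the schema (`service`, `severity`, `summary`), and both helper names are hypothetical; only the model ID and the message format come from the request example earlier in this article.

```python
import json

MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"
REQUIRED_KEYS = {"service", "severity", "summary"}  # hypothetical schema

def build_request(raw_text: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "system",
                "content": (
                    "Extract the fields service, severity and summary. "
                    "Reply with a single JSON object and nothing else."
                ),
            },
            {"role": "user", "content": raw_text},
        ],
        "max_tokens": 512,
    }

def parse_reply(reply: str) -> dict:
    """Parse the model's final answer and check that the expected keys exist."""
    data = json.loads(reply)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data

request = build_request("2026-06-22 03:14 payments-db: connection timeout, retrying")
record = parse_reply('{"service": "payments-db", "severity": "high", "summary": "timeout"}')
```

Validating the reply at the boundary means a malformed answer fails loudly in the sub-agent instead of propagating into the larger system.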
High-throughput efficiency
The architecture of Nemotron-3-Nano-30B is a hybrid Mamba-Transformer Mixture-of-Experts (MoE). While the model contains 30 billion parameters in total, it only activates approximately 3 billion parameters (hence the "A3B" designation) per token during inference. By combining 23 Mamba-2 layers with sparse MoE layers and grouped-query attention, the model achieves exceptional throughput and low latency. This sparse activation makes it highly cost-effective for high-volume tasks like log analysis or continuous data processing.
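To make "sparse activation" concrete, here is a toy sketch of top-k expert routing in general. It is not NVIDIA's router: the expert count, the value of k, the scalar "experts", and the function names are all illustrative assumptions.

```python
import math

def route_top_k(gate_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exp = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exp.values())
    return {i: e / total for i, e in exp.items()}

def moe_forward(x: float, experts, gate_logits: list[float], k: int) -> float:
    """Run only the selected experts and mix their outputs by routing weight."""
    weights = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Eight toy experts (expert i multiplies its input by i + 1); only two run per token.
experts = [lambda x, s=i + 1: s * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

weights = route_top_k(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
```

The six unselected experts are never evaluated for this token. The same principle is what keeps roughly 3 of the model's 30 billion parameters (about 10%) active per token.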
Long-context processing
In a standard transformer, the key-value cache grows linearly with context length, so memory use climbs steeply on long prompts. Mamba-2 state-space layers instead carry a fixed-size state, which lets Nemotron-3-Nano-30B handle a 256,000-token context window without the typical memory blowup. This allows developers to feed entire codebases, extensive documentation, or long conversation histories into the prompt. The hybrid architecture helps the model maintain precise reasoning over these long horizons, which is critical for retrieval-augmented generation (RAG) pipelines and persistent-memory agents.
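The memory argument can be made concrete with the standard key-value cache estimate for attention layers. The configuration values below are illustrative placeholders, not the published Nemotron configuration, and the function name is an assumption.

```python
def kv_cache_bytes(seq_len: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all attention layers, for one sequence."""
    # Factor 2 = one tensor for keys plus one for values.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative configuration only; NOT the real model's layer counts or head sizes.
CONFIG = dict(attn_layers=32, kv_heads=8, head_dim=128)

for tokens in (8_000, 64_000, 256_000):
    gib = kv_cache_bytes(tokens, **CONFIG) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache")
```

The cache scales with `seq_len` and with the number of attention layers. A hybrid design that replaces most attention layers with state-space layers shrinks the `attn_layers` factor, and the state-space layers themselves keep a state whose size does not depend on sequence length.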
Benchmarks and how it compares
Nemotron-3-Nano-30B benchmark results
NVIDIA evaluated Nemotron-3-Nano-30B against leading open-weight models in its size class. In the results below it leads on LiveCodeBench v6 and Arena-Hard-v2 and is competitive on math, where GPT-OSS-20B scores higher on AIME 2025.
| Metric | Nemotron-3-Nano-30B-A3B | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 (Math, no tools) | 89.1% | 85.0% | 91.7% |
| LiveCodeBench v6 | 68.3% | 66.0% | 61.0% |
| Arena-Hard-v2 | 67.7% | 57.8% | 48.5% |
| HumanEval (0-shot) | 78.0% | - | - |
Source: NVIDIA Technical Report and external performance data.
In the mathematical domain, Nemotron-3-Nano-30B achieves 89.1% on AIME 2025 without tool assistance, ahead of Qwen3-30B-A3B (85.0%) and behind GPT-OSS-20B (91.7%). When Python tools are enabled, its score rises to 99.2%, highlighting its strength in agentic workflows. On LiveCodeBench v6 it records 68.3%, ahead of Qwen3-30B-A3B (66.0%) and GPT-OSS-20B (61.0%). Its Arena-Hard-v2 score of 67.7% is roughly 10 points above Qwen3-30B-A3B and 19 points above GPT-OSS-20B, which supports its use for complex, multi-step instructions and makes it a robust choice for developers building autonomous agents.
Using it in production
Production configuration for Nemotron-3-Nano-30B
When deploying Nemotron-3-Nano-30B in production, understanding its reasoning budget is critical. Because the model generates a reasoning trace before its final answer, output token counts will be higher than those of standard non-reasoning models. You can control this by adjusting the reasoning budget parameters in your API request, allowing you to cap the "thinking" tokens to keep inference costs predictable.
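One way to watch the budget in practice is to inspect the token usage and finish reason on each response. The sketch below uses only the standard `usage` and `finish_reason` fields of an OpenAI-style chat completion. It assumes that reasoning tokens are counted in `completion_tokens` and against `max_tokens` on this endpoint, which should be verified against Lyceum's documentation. The helper name and the stand-in response object are hypothetical.

```python
from types import SimpleNamespace

def summarize_usage(response) -> dict:
    """Report token counts and whether generation stopped at the max_tokens cap."""
    choice = response.choices[0]
    return {
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        # "length" means the output was cut off by the token limit.
        "hit_cap": choice.finish_reason == "length",
    }

# Stand-in with the same attribute layout as an OpenAI SDK chat completion,
# so the sketch runs without a network call.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length")],
    usage=SimpleNamespace(prompt_tokens=15_000, completion_tokens=256),
)

report = summarize_usage(fake_response)
```

If `hit_cap` is true on a large share of requests, the cap is probably truncating the reasoning trace or the final answer and is worth raising.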
Lyceum Technology categorizes this model in our Fast tier, which is optimized for cost-efficiency and high throughput. Hosted in our eu-north1 region, it leverages our open-stack infrastructure - including vLLM and TensorRT-LLM - to maximize the hardware-aware efficiency of the model's NVFP4 quantization.
To estimate production costs, consider a log analysis agent processing server errors. Suppose the agent receives a 15,000-token input prompt (context) and generates a 500-token reasoning trace followed by a 200-token JSON output (700 output tokens total). At $0.06 per million input tokens, the prompt costs $0.0009. At $0.24 per million output tokens, the generation costs $0.000168. The total for this reasoning-heavy task is $0.001068, or roughly $0.001 per request. This aggressive pricing allows engineering teams to scale multi-agent systems without the prohibitive costs associated with hyperscaler APIs.
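The same arithmetic can be wrapped in a small helper for other workloads. The prices are the ones quoted in this article; the function and constant names are illustrative.

```python
INPUT_USD_PER_MTOK = 0.06   # price per million input tokens
OUTPUT_USD_PER_MTOK = 0.24  # price per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request; output includes reasoning-trace tokens."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000

# The log analysis example: 15,000 input tokens, 500 reasoning + 200 answer tokens.
cost = request_cost(15_000, 500 + 200)
print(f"${cost:.6f} per request")  # $0.001068 per request
```

Counting reasoning tokens as output is what makes the reasoning budget matter: doubling the trace from 500 to 1,000 tokens adds $0.00012 per request at these prices.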
Running Nemotron-3-Nano-30B on EU-sovereign infrastructure
Why run Nemotron-3-Nano-30B on Lyceum
For European AI teams, data sovereignty is a hard requirement, not an optional feature. Running Nemotron-3-Nano-30B on Lyceum Technology guarantees that your inference workloads remain entirely within the EU. Our eu-north1 data centers provide strict GDPR compliance, ensuring that sensitive data - whether medical records, financial logs, or proprietary code - never crosses into US jurisdictions. Unlike API providers that rent capacity from hyperscalers, Lyceum operates its own GPU infrastructure, giving us a structural cost advantage that we pass directly to you.
Migrating to Lyceum requires minimal engineering effort. Because our Serverless Inference API is fully OpenAI-compatible, you switch providers by updating the base URL and API key, with no need to rewrite application logic or learn proprietary SDKs. You also benefit from open-stack transparency, avoiding the vendor lock-in associated with black-box proprietary engines.
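A minimal sketch of what that switch can look like when provider settings live in configuration rather than in code. The environment variable `LYCEUM_API_KEY`, the `PROVIDERS` table, and the helper name are hypothetical, and the "current" entry assumes the incumbent provider is OpenAI. The Lyceum base URL and model ID are the ones shown earlier in this article.

```python
import os

PROVIDERS = {
    "lyceum": {
        "base_url": "https://api.lyceum.technology/api/v2/external/serverless",
        "api_key_env": "LYCEUM_API_KEY",  # hypothetical variable name
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B",
    },
    "current": {
        "base_url": "https://api.openai.com/v1",  # assumes OpenAI is the incumbent
        "api_key_env": "OPENAI_API_KEY",
        "model": "your-current-model",  # placeholder
    },
}

def client_settings(provider: str) -> tuple[dict, str]:
    """Return (keyword arguments for OpenAI(...), model name) for a provider."""
    cfg = PROVIDERS[provider]
    kwargs = {
        "base_url": cfg["base_url"],
        "api_key": os.environ.get(cfg["api_key_env"], ""),
    }
    return kwargs, cfg["model"]

kwargs, model = client_settings("lyceum")
# Usage with the OpenAI SDK:
#   client = OpenAI(**kwargs)
#   client.chat.completions.create(model=model, messages=[...])
```

Keeping the provider choice in one table makes side-by-side evaluation a one-argument change and keeps application logic untouched.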
Furthermore, our pay-per-token billing model eliminates the financial risk of idle compute. You scale to zero when traffic drops and pay only for the exact tokens processed. For teams transitioning off expiring hyperscaler credits, Lyceum offers a sustainable, high-performance path forward. Whether you are deploying serverless GPU inference for the first time or scaling a massive multi-agent architecture, Lyceum provides the secure, sovereign foundation your engineering team requires.