
Nemotron-3-Nano-30B: specs, benchmarks, and how to run it on Lyceum

A highly efficient 30B hybrid MoE model optimized for agentic reasoning and high-throughput inference.

Caspar Lehmkühler

June 22, 2026 · Head of Product at Lyceum Technology

Nemotron-3-Nano-30B-A3B is a 30-billion-parameter large language model developed by NVIDIA. Designed for agentic workflows, it uses a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture that activates only ~3 billion parameters per token. This gives it high throughput and low latency while supporting a 256,000-token context window. Lyceum Technology serves Nemotron-3-Nano-30B through our OpenAI-compatible Serverless Inference API, so existing OpenAI client code can call it after a base URL and API key change. Because Lyceum operates entirely on EU-sovereign infrastructure, your inference workloads remain fully GDPR-compliant, with data processed securely in our eu-north1 region.

Get started: call Nemotron-3-Nano-30B on Lyceum

Deploying NVIDIA's Nemotron-3-Nano-30B on Lyceum requires zero new frameworks. Because our Serverless Inference API is fully OpenAI-compatible, you can route your existing application traffic to this highly efficient MoE model by updating your base URL and API key. This eliminates the friction of learning proprietary SDKs and allows your engineering team to focus on building features rather than managing infrastructure.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request; max_tokens caps the generated output.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for Nemotron-3-Nano-30B

Lyceum Technology serves this model in the Fast tier, optimized for cost-efficient, high-throughput workloads. The pricing is $0.06 per million input tokens and $0.24 per million output tokens. All inference for this endpoint runs on our EU-sovereign infrastructure in the eu-north1 region, ensuring your data never leaves Europe.

Because the API is OpenAI-compatible, your engineering team can evaluate Nemotron-3-Nano-30B against your current production models in minutes. There are no minimum commitments, no subscription tiers, and no base fees: you pay strictly per token processed. For teams migrating from hyperscaler environments, this provides immediate cost visibility and eliminates the need to provision dedicated GPU instances for bursty agentic workloads.

What Nemotron-3-Nano-30B is good at

Agentic reasoning and tool use

Nemotron-3-Nano-30B was engineered specifically for multi-step agentic workflows. NVIDIA trained the model to generate an internal reasoning trace before outputting a final response. This "thinking" capability allows the model to map out logic puzzles, execute complex routing decisions, and handle multi-constraint instructions with high reliability. It excels at structuring messy text into clean JSON and utilizing external tools, making it highly effective as a sub-agent in larger AI systems.
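
As a rough illustration of the sub-agent pattern described above, the sketch below defines one tool in the OpenAI chat-completions `tools` format and a small dispatcher that runs whatever tool the model asks for. The tool itself (`get_error_count`), its parameters, and its fixed return value are hypothetical, and whether the Lyceum endpoint accepts the `tools` parameter for this model is an assumption that this article does not confirm.

```python
import json

# Hypothetical tool: the name, parameters, and return shape are illustrative only.
def get_error_count(service: str, hours: int = 1) -> dict:
    """Stand-in for a real log query; returns a fixed value so the sketch runs."""
    return {"service": service, "hours": hours, "errors": 42}

# Tool description in the OpenAI chat-completions "tools" format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_error_count",
            "description": "Count errors logged by a service over the last N hours.",
            "parameters": {
                "type": "object",
                "properties": {
                    "service": {"type": "string"},
                    "hours": {"type": "integer", "minimum": 1},
                },
                "required": ["service"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_error_count": get_error_count}

def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool and return a JSON string for the follow-up "tool" message."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid arguments: {exc}"})
    try:
        return json.dumps(REGISTRY[name](**args))
    except TypeError as exc:
        # Raised when the model supplies missing, extra, or non-object arguments.
        return json.dumps({"error": f"bad arguments: {exc}"})
```

If the endpoint does support tool calling, `TOOLS` would be passed as `tools=TOOLS` to `client.chat.completions.create`, and each entry in `response.choices[0].message.tool_calls` would supply `function.name` and `function.arguments` for `dispatch_tool_call`. Returning errors as JSON instead of raising lets the model see the failure and retry with corrected arguments.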

High-throughput efficiency

The architecture of Nemotron-3-Nano-30B is a hybrid Mamba-Transformer Mixture-of-Experts (MoE). While the model contains 30 billion parameters in total, it only activates approximately 3 billion parameters (hence the "A3B" designation) per token during inference. By combining 23 Mamba-2 layers with sparse MoE layers and grouped-query attention, the model achieves exceptional throughput and low latency. This sparse activation makes it highly cost-effective for high-volume tasks like log analysis or continuous data processing.
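
To make sparse activation concrete, here is a toy sketch of top-k expert routing, one common way MoE layers select experts. It is not NVIDIA's router: the four scalar "experts", the router scores, and `k=2` are invented for illustration, and the real model's expert count and routing details are not given in this article.

```python
import math

def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Subtract the largest chosen logit before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}

def moe_forward(x: float, experts, router_logits: list[float], k: int) -> float:
    """Evaluate only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts: each one simply scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # illustrative scores for a single token

weights = top_k_route(router_logits, k=2)
output = moe_forward(1.0, experts, router_logits, k=2)
print(sorted(weights), output)
```

Only the selected experts are evaluated for a given token. The same principle is what lets a model with 30 billion total parameters spend compute on roughly 3 billion of them per token.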

Long-context processing

Standard transformer attention becomes expensive as context grows: the key-value (KV) cache grows linearly with sequence length, and attention compute grows quadratically. By using Mamba-2 state-space layers, which carry a fixed-size state regardless of sequence length, Nemotron-3-Nano-30B handles a 256,000-token context window without the typical memory blowup. This allows developers to feed entire codebases, extensive documentation, or long conversation histories into the prompt. The hybrid architecture is designed to maintain precise reasoning over these long horizons, which is critical for retrieval-augmented generation (RAG) pipelines and persistent-memory agents.
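
A back-of-the-envelope sketch shows where the memory pressure in attention-only models comes from. Each attention layer caches one key and one value vector per token, so the key-value (KV) cache grows linearly with context length, whereas a state-space layer keeps a fixed-size state. The layer count, head count, and head dimension below describe a hypothetical dense transformer chosen for illustration; they are not Nemotron-3-Nano-30B's actual dimensions.

```python
def kv_cache_bytes(seq_len: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence; the leading 2 covers keys plus values."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical dense-transformer dimensions, 16-bit cache values.
HYPOTHETICAL = dict(n_attn_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

if __name__ == "__main__":
    for tokens in (8_000, 64_000, 256_000):
        gib = kv_cache_bytes(tokens, **HYPOTHETICAL) / 2**30
        print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache per sequence")
```

With these illustrative dimensions, a single 256,000-token sequence would need 31.25 GiB of cache. Swapping attention layers for fixed-state layers removes those layers from the `n_attn_layers` term, which is the effect the hybrid design relies on.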

Benchmarks and how it compares

Nemotron-3-Nano-30B benchmark results

NVIDIA evaluated Nemotron-3-Nano-30B against leading open-weight models in its size class. It leads the comparison on coding and instruction-following benchmarks and is competitive on mathematical reasoning. Both comparison models are also sparse MoE designs, so the table compares like with like.

Metric	Nemotron-3-Nano-30B-A3B	Qwen3-30B-A3B	GPT-OSS-20B
AIME 2025 (Math, no tools)	89.1%	85.0%	91.7%
LiveCodeBench v6	68.3%	66.0%	61.0%
Arena-Hard-v2	67.7%	57.8%	48.5%
HumanEval (0-shot)	78.0%	-	-

Source: NVIDIA Technical Report and external performance data.

In the mathematical domain, Nemotron-3-Nano-30B achieves 89.1% on AIME 2025 without tool assistance, ahead of Qwen3-30B-A3B (85.0%) but behind GPT-OSS-20B (91.7%). When Python tools are enabled, its score rises to 99.2%, highlighting its strength in agentic workflows. On LiveCodeBench v6, a coding benchmark, it records 68.3%, ahead of both Qwen3-30B-A3B (66.0%) and GPT-OSS-20B (61.0%). Its Arena-Hard-v2 score of 67.7% is 9.9 points above Qwen3-30B-A3B and 19.2 points above GPT-OSS-20B, which supports its use for complex, multi-step instructions and makes it a robust choice for developers building autonomous agents.

Using it in production

Production configuration for Nemotron-3-Nano-30B

When deploying Nemotron-3-Nano-30B in production, understanding its reasoning budget is critical. Because the model generates a reasoning trace before its final answer, output token counts will be higher than those of standard non-reasoning models. You can control this by adjusting the reasoning budget parameters in your API request, allowing you to cap the "thinking" tokens to keep inference costs predictable.

Lyceum Technology categorizes this model in our Fast tier, which is optimized for cost-efficiency and high throughput. Hosted in our eu-north1 region, it leverages our open-stack infrastructure - including vLLM and TensorRT-LLM - to maximize the hardware-aware efficiency of the model's NVFP4 quantization.

To calculate production costs, consider a log analysis agent processing server errors. If the agent receives a 15,000-token input prompt (context) and generates a 500-token reasoning trace followed by a 200-token JSON output (700 output tokens total), the cost math is straightforward. At $0.06 per million input tokens, the prompt costs $0.0009. At $0.24 per million output tokens, the generation costs $0.000168. The total cost for this complex, reasoning-heavy task is approximately $0.001 per request. This aggressive pricing allows engineering teams to scale multi-agent systems without the prohibitive costs associated with hyperscaler APIs.
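
The arithmetic above can be reproduced with a small helper. This is a minimal sketch: the prices are the per-million-token rates quoted in this article, while the function and constant names are made up for illustration. `Decimal` is used so that very small per-request amounts do not pick up floating-point noise.

```python
from decimal import Decimal

# Prices quoted in this article, in USD per million tokens.
INPUT_PRICE_PER_M = Decimal("0.06")
OUTPUT_PRICE_PER_M = Decimal("0.24")
MILLION = Decimal(1_000_000)

def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request in USD; output_tokens includes the reasoning trace."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / MILLION

# The log-analysis example: 15,000 prompt tokens, 500 reasoning + 200 answer tokens.
per_request = request_cost(15_000, 500 + 200)
per_million_requests = per_request * 1_000_000
print(f"per request: ${per_request:.6f}")
print(f"per million requests: ${per_million_requests:.2f}")
```

At this prompt size the input side dominates the bill, so trimming context saves more than capping the reasoning trace.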

Running Nemotron-3-Nano-30B on EU-sovereign infrastructure

Why run Nemotron-3-Nano-30B on Lyceum

For European AI teams, data sovereignty is a hard requirement, not an optional feature. Running Nemotron-3-Nano-30B on Lyceum Technology guarantees that your inference workloads remain entirely within the EU. Our eu-north1 data centers provide strict GDPR compliance, ensuring that sensitive data - whether medical records, financial logs, or proprietary code - never crosses into US jurisdictions. Unlike API providers that rent capacity from hyperscalers, Lyceum operates its own GPU infrastructure, giving us a structural cost advantage that we pass directly to you.

Migrating to Lyceum requires minimal engineering effort. Because our Serverless Inference API is fully OpenAI-compatible, you can switch providers by updating a single base URL. There is no need to rewrite your application logic or learn proprietary SDKs. You benefit from open-stack transparency, avoiding the vendor lock-in associated with black-box proprietary engines.
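
One way to keep the provider switch a pure configuration change is to read the endpoint and key from the environment. This is a sketch only: the variable names `LLM_BASE_URL` and `LLM_API_KEY` are hypothetical conventions chosen for this example, not something Lyceum or the OpenAI SDK defines. The default base URL is the one used in the get-started example.

```python
import os

# Base URL from the get-started example in this article.
LYCEUM_BASE_URL = "https://api.lyceum.technology/api/v2/external/serverless"

def client_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for openai.OpenAI from environment variables.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names used only in this sketch.
    """
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set")
    return {
        "base_url": env.get("LLM_BASE_URL", LYCEUM_BASE_URL),
        "api_key": api_key,
    }

# Usage, assuming the openai package is installed:
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs())
```

Application code stays untouched when the endpoint changes, and pointing `LLM_BASE_URL` at a previous provider gives an immediate rollback path.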

Furthermore, our pay-per-token billing model eliminates the financial risk of idle compute. You scale to zero when traffic drops and pay only for the exact tokens processed. For teams transitioning off expiring hyperscaler credits, Lyceum offers a sustainable, high-performance path forward. Whether you are deploying serverless GPU inference for the first time or scaling a massive multi-agent architecture, Lyceum provides the secure, sovereign foundation your engineering team requires.

Frequently Asked Questions

What is the context window for Nemotron-3-Nano-30B?

Nemotron-3-Nano-30B supports a context window of up to 256,000 tokens. This capacity is made possible by its hybrid Mamba-2 and Transformer architecture, which avoids much of the memory growth that attention-only models incur at long context. Developers can process entire codebases, long documents, and extensive conversation histories in a single prompt.

How much does Nemotron-3-Nano-30B cost on Lyceum?

On Lyceum Technology, Nemotron-3-Nano-30B is priced at $0.06 per million input tokens and $0.24 per million output tokens. It operates in our Fast tier, providing a highly cost-effective solution for high-throughput agentic workflows with strict per-token billing and no base fees.

Is Nemotron-3-Nano-30B GDPR-compliant on Lyceum?

Yes. Lyceum Technology hosts Nemotron-3-Nano-30B exclusively on EU-sovereign infrastructure in our eu-north1 region. All data processing remains within European borders, ensuring strict compliance with GDPR and providing a secure environment for sensitive enterprise workloads.

How do I migrate my application to Lyceum's API?

Migrating is straightforward because Lyceum's Serverless Inference API is fully OpenAI-compatible. You only need to change your client's base URL to the Lyceum endpoint shown in the get-started example above and insert your Lyceum API key. No changes to your application logic or prompt structures are required.

What does the "A3B" mean in the model name?

The "A3B" indicates that the model activates approximately 3 billion parameters per token during inference. While Nemotron-3-Nano-30B contains 30 billion total parameters, its sparse Mixture-of-Experts (MoE) architecture routes each token to only a small subset of experts, which sharply reduces compute per token while maintaining high accuracy.

Can I disable the reasoning trace in Nemotron-3-Nano-30B?

Yes, you can configure the model to output a final answer without generating intermediate reasoning tokens. However, bypassing this "thinking" phase generally results in a measurable decrease in accuracy on complex logic prompts, so it is only recommended for simple, latency-sensitive tasks.
