Nemotron-3-Nano-Omni: specs, benchmarks, and how to run it on Lyceum

NVIDIA's 30B-A3B hybrid MoE model for unified video, audio, image, and text reasoning.

Justus Amen

June 22, 2026 · GTM at Lyceum Technology

Nemotron-3-Nano-Omni is a 30-billion-parameter multimodal model built by NVIDIA for agentic AI systems. It unifies video, audio, image, and text processing in a single hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, which removes the latency and complexity of stitching together separate perception models. Lyceum Technology serves Nemotron-3-Nano-Omni via our OpenAI-compatible Serverless Inference API on secure, GDPR-compliant European infrastructure. Billing is per token, and existing OpenAI SDK code needs only a new base URL and API key, so engineering teams can add multimodal reasoning to production applications with minimal friction.

Get started: call Nemotron-3-Nano-Omni on Lyceum

Integrate Nemotron-3-Nano-Omni into your agentic workflows using the standard OpenAI Python SDK. Because Lyceum Technology provides an OpenAI-compatible API, switching providers requires zero code changes to your application logic. Update the base URL and provide your Lyceum API key to route multimodal requests to our secure European infrastructure. This drop-in compatibility accelerates deployment for engineering teams migrating away from hyperscaler environments.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request and cap the reply length.
response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for Nemotron-3-Nano-Omni

This model is categorized under our Fast tier, which prioritizes cost-efficient inference for high-volume agentic tasks and perception workloads. Pricing is $0.06 per million input tokens and $0.24 per million output tokens. All inference for this endpoint is processed in the eu-north1 region, so your multimodal data, whether sensitive documents, audio recordings, or video frames, remains within European borders. This setup suits teams building complex AI agents that need rapid perception across multiple modalities while maintaining strict data residency and compliance. With per-token billing and scale-to-zero capacity, you pay only for the tokens consumed during inference.

What Nemotron-3-Nano-Omni is good at

Unified omni-modal reasoning

Nemotron-3-Nano-Omni excels at processing multiple data types simultaneously without relying on fragmented model pipelines. Traditional agentic systems often stitch together separate vision, speech, and language models, which increases inference hops and orchestration complexity. NVIDIA designed this model to handle video, audio, image, and text inputs natively within a single perception-to-action loop. It utilizes the CRADIOv4 encoder for high-resolution vision tasks like optical character recognition (OCR) and document parsing. For audio, the Parakeet encoder processes transcription, spoken queries, and environmental sounds. Video reasoning is accelerated by 3D convolutional layers and Efficient Video Sampling (EVS), allowing the model to analyze temporal-spatial data across long clips.
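
As an illustration of sending more than text in one request, the sketch below builds a chat message that pairs a prompt with an inline image. It assumes the Lyceum endpoint accepts the OpenAI-style `image_url` content part with a base64 data URL, which this article does not confirm; `build_image_messages` is a hypothetical helper name.

```python
import base64


def build_image_messages(prompt: str, image_bytes: bytes,
                         mime_type: str = "image/png") -> list:
    """Return a chat `messages` list with one text part and one image part."""
    # Encode the raw bytes as a base64 data URL, the inline-image form
    # used by the OpenAI chat completions message format.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # Assumption: the endpoint accepts OpenAI-style image parts.
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Placeholder bytes stand in for a real image file read from disk.
messages = build_image_messages("Extract the invoice total.", b"placeholder")
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create` in place of the plain-text message from the quick-start.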

High-efficiency MoE architecture

The model achieves high throughput through a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture. While it contains 30 billion total parameters, it activates only about 3 billion parameters per token during inference. This selective activation allows Nemotron-3-Nano-Omni to operate with the speed and cost profile of a small dense model while delivering the reasoning capabilities of a much larger system. The Mamba selective state-space layers provide efficient long-context processing and support a 256,000-token context window. This makes the model effective for analyzing multi-hour audio recordings, extended screen sessions, and dense enterprise documents without dropping critical context mid-task. According to NVIDIA's technical reports, this architecture delivers up to 9.2x higher system efficiency for video use cases compared to previous generations.
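
To make selective activation concrete, here is a generic top-k routing sketch: a router scores every expert for a token, only the k best experts run, and their weights are renormalized. This is a textbook illustration of the MoE idea, not NVIDIA's actual router; the expert count, the value of k, and the scores are all hypothetical.

```python
import math


def top_k_route(gate_logits: list, k: int) -> dict:
    """Pick the k highest-scoring experts and renormalize their weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the others are not evaluated.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Hypothetical router scores for one token across 8 experts.
weights = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2], k=2)
print(sorted(weights))  # indices of the experts that would run for this token
```

With 2 of 8 experts selected, most expert parameters stay idle for this token, which is the same principle behind activating about 3 billion of 30 billion parameters.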

Benchmarks and how it compares

Nemotron-3-Nano-Omni benchmark results

Nemotron-3-Nano-Omni demonstrates strong performance across multimodal evaluations, particularly in document intelligence and video understanding. It competes favorably against both open-weight and proprietary models in its parameter class.

Benchmark	Nemotron-3-Nano-Omni	Qwen3.5-397B-A17B	Qwen3-Omni-30B-A3B
MMLongBench-Doc	61.9	61.5	-
Video-MME	72.2	-	70.5
DailyOmni	74.5	-	-
WorldSense	55.4	-	-

Sources: industry benchmarks, NVIDIA Technical Report.

When evaluated on MMLongBench-Doc, a rigorous test for long-context document understanding, Nemotron-3-Nano-Omni scores 61.9, outperforming the much larger Qwen3.5-397B-A17B model. This highlights the effectiveness of the CRADIOv4 vision encoder for complex enterprise documents.

In video reasoning, the model achieves 72.2 on Video-MME, surpassing the comparable Qwen3-Omni 30B-A3B model. The integration of 3D convolutional layers and Efficient Video Sampling gives NVIDIA's model an advantage in temporal-spatial tasks. For audio-visual cross-modal reasoning, it scores 74.5 on the DailyOmni benchmark. Because Nemotron-3-Nano-Omni activates only about 3 billion parameters per token, it delivers these results with higher throughput and lower inference costs than dense models, making it an efficient choice for production multimodal pipelines.

Using it in production

Production configuration for Nemotron-3-Nano-Omni

Deploying Nemotron-3-Nano-Omni effectively requires understanding its context window and tier classification. The model supports a massive 256,000-token context window, which is essential for processing long-form media such as multi-hour meeting recordings, extensive PDF documents, or continuous video streams. Because it is categorized in the Lyceum Technology Fast tier, it is optimized for high-throughput, latency-sensitive applications where rapid perception is critical.
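
A quick pre-flight check can confirm that a request fits the window before it is sent. The sketch below assumes that prompt tokens and the reserved output budget share the same 256,000-token window, which is common for LLM serving but is not stated in this article; token counts are expected to come from your own tokenizer or usage logs, and `fits_context` is a hypothetical helper.

```python
CONTEXT_WINDOW = 256_000  # tokens, as stated in this article


def fits_context(input_tokens: int, max_output_tokens: int,
                 window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    if input_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: input and output tokens draw from one shared window.
    return input_tokens + max_output_tokens <= window


def remaining_budget(input_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for output after the prompt, never below zero."""
    return max(window - input_tokens, 0)
```

If the check fails, the usual options are to trim or chunk the input, or to lower `max_tokens`.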

When configuring your API requests, you should carefully manage the max_tokens parameter. Because multimodal models can occasionally generate verbose descriptions of visual or audio inputs, setting a strict output limit prevents unnecessary token consumption. For example, if you are extracting specific data points from a video, capping the output at 500 tokens ensures concise responses and predictable costs.
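
The cap described above can be sketched as follows. The request dictionary mirrors the quick-start call, and the truncation check relies on the OpenAI chat completions convention that `finish_reason` is `"length"` when the output limit is hit; that Lyceum returns the same value is an assumption based on its stated OpenAI compatibility. Both helper names are hypothetical.

```python
from types import SimpleNamespace


def build_extraction_request(prompt: str, max_tokens: int = 500) -> dict:
    """Keyword arguments for client.chat.completions.create with an output cap."""
    return {
        "model": "nvidia/Nemotron-3-Nano-Omni",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hard limit on generated tokens
    }


def was_truncated(response) -> bool:
    """True when generation stopped because it reached max_tokens."""
    return response.choices[0].finish_reason == "length"


# Stand-in for a real API response, used only to show the check offline.
fake_response = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake_response))
```

In a live call, `client.chat.completions.create(**build_extraction_request(...))` would produce the response object. A truncated reply may be missing the data point you asked for, so it is worth logging these cases rather than silently accepting them.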

The per-token pricing model makes this highly cost-effective for variable workloads. At $0.06 per million input tokens and $0.24 per million output tokens, processing a large batch of documents is highly economical. For instance, analyzing a dataset that consumes 5 million input tokens and generates 500,000 output tokens would cost exactly $0.30 for the input and $0.12 for the output, totaling $0.42. All requests are routed through the eu-north1 region, ensuring that your production data is processed on secure European infrastructure with zero egress fees.
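
The arithmetic above generalizes to a small estimator. This sketch uses the two prices quoted in this article and `Decimal` to avoid floating-point rounding; `estimate_cost` is a hypothetical helper, and it covers token charges only.

```python
from decimal import Decimal

INPUT_PRICE_PER_M = Decimal("0.06")   # USD per million input tokens
OUTPUT_PRICE_PER_M = Decimal("0.24")  # USD per million output tokens
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated USD cost for a workload at the listed per-token prices."""
    input_cost = INPUT_PRICE_PER_M * input_tokens / TOKENS_PER_M
    output_cost = OUTPUT_PRICE_PER_M * output_tokens / TOKENS_PER_M
    return input_cost + output_cost


# The example from the text: 5 million input and 500,000 output tokens.
total = estimate_cost(5_000_000, 500_000)
```

For the workload in the text, `total` equals the $0.42 computed above.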

Running Nemotron-3-Nano-Omni on EU-sovereign infrastructure

Why run Nemotron-3-Nano-Omni on Lyceum

For European AI teams and enterprises handling sensitive multimodal data, data residency is a strict requirement. Running Nemotron-3-Nano-Omni on Lyceum Technology ensures that your audio recordings, video files, and proprietary documents never leave the European Union. All inference for this model occurs in our eu-north1 region, providing a clear path to GDPR and AI Act compliance. Unlike US-based API providers that route traffic through overseas data centers, Lyceum operates entirely on owned, EU-sovereign GPU infrastructure.

This structural advantage allows us to offer competitive pricing without the margin pressure of renting compute from hyperscalers. Our serverless inference API combines per-token billing with scale-to-zero operation: you pay only for the tokens you consume, which removes the need to maintain expensive, idle GPU instances for bursty multimodal workloads.

Furthermore, Lyceum Technology prioritizes open-stack transparency. We utilize optimized inference engines like vLLM and NVIDIA Dynamo rather than locking you into proprietary, black-box architectures. This ensures complete customer portability and predictable performance. If you are migrating from a hyperscaler environment, our OpenAI-compatible endpoints allow you to transition your GDPR-compliant LLM inference to Lyceum in minutes, securing your data sovereignty while significantly reducing your infrastructure costs.

Frequently Asked Questions

What is the pricing for Nemotron-3-Nano-Omni on Lyceum?

The pricing for Nemotron-3-Nano-Omni on Lyceum Technology is $0.06 per million input tokens and $0.24 per million output tokens. It is categorized under our Fast tier, which is optimized for cost-efficient, high-throughput agentic tasks and multimodal perception workloads.

What is the context window for Nemotron-3-Nano-Omni?

Nemotron-3-Nano-Omni features a massive 256,000-token context window. This extensive capacity allows the model to process long-form media, such as multi-hour audio recordings, continuous video streams, and dense enterprise documents, without losing critical context during complex reasoning tasks.

Where is Nemotron-3-Nano-Omni hosted?

On Lyceum Technology, Nemotron-3-Nano-Omni is hosted exclusively in the eu-north1 region. This ensures that all your multimodal data, including sensitive audio, video, and text, remains strictly within European borders, providing a clear path to GDPR compliance and data sovereignty.

How do I call Nemotron-3-Nano-Omni using the OpenAI SDK?

You can call Nemotron-3-Nano-Omni by pointing the standard OpenAI SDK to the Lyceum API. Set the base URL to https://api.lyceum.technology/api/v2/external/serverless, provide your Lyceum API key, and specify nvidia/Nemotron-3-Nano-Omni as the model parameter in your request.

How does Nemotron-3-Nano-Omni compare to Qwen3-Omni?

Nemotron-3-Nano-Omni outperforms the comparable Qwen3-Omni 30B-A3B model in video reasoning. For example, it scores 72.2 on the Video-MME benchmark compared to Qwen's 70.5. Its hybrid Mamba-Transformer architecture also delivers higher system efficiency for processing long multimodal sequences.

What license does Nemotron-3-Nano-Omni use?

Nemotron-3-Nano-Omni is released under the NVIDIA Nemotron Open Model License. This license is enterprise-friendly and permits commercial use, allowing developers to integrate the model into proprietary applications and agentic systems without restrictive open-source copyleft obligations.
