
Qwen3-30B-A3B: specs, benchmarks, and how to run it on Lyceum

A highly efficient 30.5B Mixture-of-Experts model optimized for speed and instruction following.

Maximilian Niroomand, CTO & Co-Founder at Lyceum Technology

June 25, 2026

Qwen3-30B-A3B is a Mixture-of-Experts (MoE) large language model developed by the Qwen team at Alibaba Cloud. It has 30.5 billion total parameters but activates only 3.3 billion per token, so it delivers strong reasoning, coding, and instruction-following capabilities at high speed. The Instruct-2507 variant is optimized for fast, non-thinking execution, which makes it a good fit for high-throughput production workloads. Lyceum Technology serves this model through our fully OpenAI-compatible Serverless Inference API. Because the model is hosted in our eu-north1 region, European AI teams can run Qwen3-30B-A3B on owned, EU-sovereign GPU infrastructure, with strict data residency, GDPR compliance, and transparent per-token pricing.

Get started: call Qwen3-30B-A3B on Lyceum

Connect directly to the Lyceum Serverless Inference API. Because the platform is fully OpenAI-compatible, integrating this MoE model requires no architectural changes to your existing application. You only need to update your client's base URL, provide your Lyceum API key, and set the model name.

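from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request to the Instruct-2507 variant.
response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,  # upper bound on generated tokens
)

# Print the assistant's reply.
print(response.choices[0].message.content)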
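
For environments where the OpenAI SDK is not installed, the same call can be sketched as a plain HTTP request using only the Python standard library. This is a minimal sketch, not documented Lyceum behaviour: it assumes the endpoint follows the usual OpenAI convention of `<base URL>/chat/completions` with a Bearer token, which is what the SDK sends. The helper name `build_chat_request` is hypothetical. The request is built but not sent.

```python
import json
import urllib.request

# Base URL and model string taken from the SDK example above.
BASE_URL = "https://api.lyceum.technology/api/v2/external/serverless"
MODEL = "Qwen/Qwen3-30B-A3B-Instruct-2507"


def build_chat_request(api_key: str, prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a chat completion request.

    Assumption: the route is BASE_URL + "/chat/completions" and the API key
    is passed as a Bearer token, mirroring what the OpenAI SDK does.
    """
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("<your lyceum api key>", "Hello!")
print(request.get_method(), request.full_url)
```

To actually send it, pass `request` to `urllib.request.urlopen` and parse the JSON body. If the response follows the OpenAI format, the reply text sits under `choices[0].message.content`, as in the SDK example.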

Pricing and region for Qwen3-30B-A3B

When you deploy this model on Lyceum, you benefit from our transparent, per-token billing model with no minimum commitments or idle costs. Qwen3-30B-A3B is categorized under our Fast tier, which is optimized for cost-efficient, high-throughput workloads where latency and economy are paramount. The pricing is set at $0.10 per million input tokens and $0.30 per million output tokens.

This model is hosted in our eu-north1 region. For European AI teams and enterprises with strict data residency requirements, this ensures that all inference data is processed entirely within the European Union. By running on Lyceum's owned GPU infrastructure rather than relying on US-based hyperscalers, you maintain full GDPR compliance while taking advantage of the model's exceptional speed and instruction-following capabilities.

What Qwen3-30B-A3B is good at

Efficient MoE Architecture

Qwen3-30B-A3B uses a sparse Mixture-of-Experts (MoE) design. The model contains 30.5 billion total parameters but activates only 3.3 billion of them for any given token. The architecture uses 128 expert networks, and the routing mechanism selects 8 of them per token. This sparse activation lets the model draw on the capacity of a 30B-class model while keeping per-token compute close to that of a much smaller 3B model. All 30.5 billion parameters still have to be held in memory, so the savings are in compute and speed rather than memory footprint.
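
To make the routing step concrete, here is a generic top-k gating sketch in plain Python. It is illustrative only and not Qwen's implementation: the random router scores, the choice to apply softmax over the selected experts only, and the function name `route_token` are all assumptions. Only the 128 and 8 figures and the parameter counts come from the article.

```python
import math
import random

NUM_EXPERTS = 128  # expert count, from the article
TOP_K = 8          # experts activated per token, from the article


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k experts for one token and return normalized gate weights."""
    # Indices of the k largest router scores.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Softmax over the selected experts only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


# Hypothetical router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_token(logits)

print(f"active experts: {len(gates)} of {NUM_EXPERTS}")
print(f"active parameter share: {3.3 / 30.5:.1%}")
```

Only the selected experts run for that token, and their outputs are combined using the gate weights. That is why per-token compute tracks the 3.3 billion active parameters, roughly 11% of the total, rather than the full 30.5 billion.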

Instruction Following and Coding

The specific variant served on Lyceum, Qwen3-30B-A3B-Instruct-2507, is fine-tuned heavily for instruction following, tool usage, and coding tasks. According to the official Qwen release documentation, this model demonstrates substantial gains in long-tail knowledge coverage across multiple languages and excels at complex agentic workflows. It supports over 100 languages and dialects, making it highly versatile for global applications.

Speed and Cost-Efficiency

Because it only activates 3.3 billion parameters, Qwen3-30B-A3B is exceptionally fast. It is designed to maximize tokens per second, making it an ideal choice for latency-sensitive applications like real-time chatbots, large-scale document parsing, and high-volume data extraction. The model's efficiency allows engineering teams to process massive datasets without incurring the prohibitive costs associated with dense frontier models. For teams transitioning off expensive hyperscaler credits, this MoE architecture provides a sustainable path for scaling production AI workloads.

Benchmarks and how it compares

Qwen3-30B-A3B benchmark results

Despite its small active parameter footprint, Qwen3-30B-A3B delivers highly competitive performance across industry-standard evaluations. The Qwen team's official benchmarks demonstrate that this MoE model frequently outperforms older, larger dense models and even specialized reasoning models in specific categories. It outcompetes QwQ-32B, a model with ten times the active parameters, on several key metrics.

Below are the published benchmark results for the Qwen3-30B-A3B-Instruct-2507 non-thinking variant, as reported in the official Hugging Face model card:

Benchmark	Metric Focus	Qwen3-30B-A3B-Instruct-2507	Qwen3-235B (Non-Thinking)
MMLU-Pro	General Knowledge	78.4	75.2
MMLU-Redux	General Knowledge	89.3	89.2
GPQA	Graduate-Level Science	70.4	62.9
SuperGPQA	Advanced Science	53.4	48.2
AIME25	Mathematical Reasoning	61.3	24.7
HMMT25	Mathematical Reasoning	43.0	10.0

Source: Official Qwen3-30B-A3B-Instruct-2507 Hugging Face Model Card.

Compared with its much larger sibling, Qwen3-235B running in non-thinking mode, the 30B-A3B Instruct variant scores higher on every benchmark in the table. The gap is widest on AIME25 (61.3 vs 24.7) and HMMT25 (43.0 vs 10.0), and it is also clear on GPQA (70.4 vs 62.9). This points to the effectiveness of the 2507 instruction-tuning process and makes the model a practical choice for developers who need high performance without the latency overhead of a 235B-parameter model.
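
The size of each gap can be read straight off the table. The short sketch below copies the published scores from the table above and computes the point difference per benchmark. The dictionary layout and the function name `score_gaps` are illustrative choices.

```python
# Scores copied from the table above:
# (Qwen3-30B-A3B-Instruct-2507, Qwen3-235B non-thinking).
SCORES: dict[str, tuple[float, float]] = {
    "MMLU-Pro": (78.4, 75.2),
    "MMLU-Redux": (89.3, 89.2),
    "GPQA": (70.4, 62.9),
    "SuperGPQA": (53.4, 48.2),
    "AIME25": (61.3, 24.7),
    "HMMT25": (43.0, 10.0),
}


def score_gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the 30B-A3B score minus the 235B score, rounded to one decimal."""
    return {name: round(small - large, 1) for name, (small, large) in scores.items()}


# Print the benchmarks from largest to smallest gap.
for name, gap in sorted(score_gaps(SCORES).items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<11} {gap:+.1f}")
```

The largest differences are on the two mathematical reasoning benchmarks, while the general knowledge scores are much closer.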

Using it in production

Production configuration for Qwen3-30B-A3B

When deploying Qwen3-30B-A3B in production, it helps to understand its configuration parameters and cost structure. The model natively supports a context length of up to 262,144 tokens. This makes it suitable for processing large codebases, analyzing extensive financial reports, or running long conversational agents without losing context. However, developers should monitor their input token counts, because every input token is billed and requests that fill the context window take noticeably longer to process.
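
A simple pre-flight check can catch oversized requests before they are sent. The sketch below assumes that prompt tokens and generated tokens share the 262,144-token window, which is the usual behaviour for decoder-only models. The four-characters-per-token estimate is a rough heuristic, not the model's tokenizer, and both function names are hypothetical.

```python
import math

MAX_CONTEXT_TOKENS = 262_144  # native context length stated in the article


def fits_context(prompt_tokens: int, max_output_tokens: int,
                 limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """Return True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= limit


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; use the model's tokenizer for real counts."""
    return math.ceil(len(text) / chars_per_token)


# Hypothetical long document used only to exercise the check.
document = "word " * 200_000
estimate = rough_token_estimate(document)
print(estimate, fits_context(estimate, max_output_tokens=256))
```

For accurate budgeting, replace `rough_token_estimate` with a count from the model's actual tokenizer, since token-to-character ratios vary widely across languages and code.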

On Lyceum, Qwen3-30B-A3B is categorized in the Fast tier. This tier is specifically designed for models that prioritize high throughput and cost-efficiency, making it ideal for tasks like document OCR batch processing, real-time API serving, and continuous factory camera inference.

The per-token pricing model ensures you only pay for exact usage, with no base fees or idle costs. At $0.10 per million input tokens and $0.30 per million output tokens, the economics are highly favorable for scale. For example, if your application processes 5 million input tokens and generates 2 million output tokens daily, your total daily cost would be $1.10, calculated as $0.50 for input and $0.60 for output. This predictable, low-cost structure is a major advantage for AI startups and scale-ups looking to move away from expensive, dedicated hyperscaler instances. Because Lyceum does not charge egress fees, you can move large datasets in and out of our S3-compatible storage without incurring hidden network transfer costs.
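
The arithmetic above can be wrapped in a small helper for budgeting. This sketch uses the prices quoted in this article and `Decimal` to avoid floating-point rounding. The function name `estimate_cost` is illustrative, and the 30-day projection simply assumes the same volume every day.

```python
from decimal import Decimal

# Prices quoted in this article, in USD per million tokens.
INPUT_PRICE_PER_M = Decimal("0.10")
OUTPUT_PRICE_PER_M = Decimal("0.30")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost in USD for the given token counts."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_MILLION * INPUT_PRICE_PER_M
    output_cost = Decimal(output_tokens) / TOKENS_PER_MILLION * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# The example from the text: 5M input tokens and 2M output tokens per day.
daily = estimate_cost(5_000_000, 2_000_000)
print(f"daily: ${daily:.2f}, 30 days: ${daily * 30:.2f}")
```

For the example workload this reproduces the $1.10 daily figure, which comes to $33.00 over 30 days at constant volume.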

Running Qwen3-30B-A3B on EU-sovereign infrastructure

Why run Qwen3-30B-A3B on Lyceum

For European AI teams, compliance and data sovereignty are critical requirements. By running Qwen3-30B-A3B on Lyceum, your workloads are executed entirely on EU-sovereign infrastructure. The model is hosted in our eu-north1 region, ensuring that all data processing and storage remain strictly within European borders. This provides a clear path to GDPR compliance, AI Act readiness, and ISO 27001 certification, a distinct advantage that US-based API providers cannot offer.

Lyceum provides an open-stack transparency advantage. Unlike competitors who rely on black-box proprietary engines, our inference stack leverages open-source technologies like vLLM and NVIDIA Dynamo. This ensures that you maintain complete customer portability and avoid vendor lock-in. You get the performance benefits of advanced inference orchestration without sacrificing control over your deployment architecture.

Integrating the model requires minimal effort thanks to our fully OpenAI-compatible API. Your engineering team can switch from existing providers to Lyceum by changing two lines of code, the base URL and the API key. Combined with our structural cost advantage of owning our GPU infrastructure rather than renting from hyperscalers, Lyceum delivers Qwen3-30B-A3B at a competitive price. You benefit from per-second billing, scale-to-zero capabilities, and the reliability of 40 supply-side partners, ensuring your inference endpoints remain highly available even during global GPU shortages.

Frequently Asked Questions

What is the context window for Qwen3-30B-A3B?

The Qwen3-30B-A3B model natively supports a massive context window of up to 262,144 tokens for the Instruct-2507 variant. This extensive capacity allows developers to process large codebases, lengthy financial documents, and extended conversational histories without losing critical context during inference.

How much does it cost to run Qwen3-30B-A3B on Lyceum?

On the Lyceum Serverless Inference API, Qwen3-30B-A3B is priced under the Fast tier. It costs $0.10 per million input tokens and $0.30 per million output tokens. You only pay for the exact number of tokens processed, with zero base fees or idle costs.

Is Qwen3-30B-A3B GDPR compliant when hosted on Lyceum?

Yes, running this model on Lyceum ensures full GDPR compliance. The model is hosted in our eu-north1 region, meaning all inference requests and data processing occur strictly within European data centers. This EU-sovereign infrastructure is ideal for teams handling sensitive or regulated data.

How do I call the Qwen3-30B-A3B API?

You can call the model using the standard OpenAI SDK. Set your client's base URL to https://api.lyceum.technology/api/v2/external/serverless, provide your Lyceum API key, and use the model string Qwen/Qwen3-30B-A3B-Instruct-2507 in your chat completions request. No other code changes are required.

What makes the MoE architecture of Qwen3-30B-A3B special?

The Mixture-of-Experts (MoE) architecture allows the model to have 30.5 billion total parameters while activating only 3.3 billion per token. By routing each token to 8 of its 128 experts, it combines the capacity of a 30B-class model with per-token compute close to that of a 3B model.

How does Qwen3-30B-A3B compare to QwQ-32B?

Despite having significantly fewer active parameters (3.3B vs 32B), Qwen3-30B-A3B frequently outperforms QwQ-32B on key benchmarks like AIME25 and CodeForces Elo. It delivers highly competitive reasoning and coding capabilities while requiring a fraction of the computational overhead, making it much more cost-effective for production deployments.
