Qwen3-32B: specs, benchmarks, and how to run it on Lyceum

A 32-billion parameter dense model with hybrid thinking modes for deep reasoning and fast dialogue.

Caspar Lehmkühler

June 26, 2026 · Head of Product at Lyceum Technology

Qwen3-32B is a 32.8-billion-parameter dense model developed by the Qwen team at Alibaba. It stands out for its hybrid design: a single model offers a "thinking mode" for complex math and coding and a "non-thinking mode" for rapid, general-purpose dialogue. Despite its compact size, it frequently outperforms the previous-generation Qwen2.5-72B. Lyceum Technology serves Qwen3-32B through its OpenAI-compatible Serverless Inference API, so developers can run the model on GDPR-compliant, EU-hosted infrastructure by changing only the base URL and API key.

Get started: call Qwen3-32B on Lyceum

Deploy Qwen3-32B using the standard OpenAI SDK. Because Lyceum Technology provides a drop-in OpenAI-compatible API, you only need to update your base URL and API key. There is no need to rewrite your application logic or learn a new framework.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request to Qwen3-32B.
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for Qwen3-32B

When you route your inference workloads through Lyceum, you benefit from transparent, per-token billing with no hidden base fees or minimum commitments. Qwen3-32B is categorized under our Standard tier, which is optimized for high-capability models that balance deep reasoning with efficient generation.

The pricing for Qwen3-32B is $0.10 per million input tokens and $0.30 per million output tokens. This highly competitive rate allows teams to scale their AI features without the unpredictable cost spikes often associated with hyperscaler GPU instances.

Furthermore, Qwen3-32B is hosted in our eu-north1 region. This ensures that all data processing and model execution occur strictly within European borders. For AI startups and enterprise teams handling sensitive user data, this guarantees full GDPR compliance and data sovereignty. You get the performance of a cutting-edge 32-billion parameter model without compromising on security or regulatory requirements.

What Qwen3-32B is good at

Hybrid thinking and non-thinking modes

The most significant innovation in Qwen3-32B is its dual-mode design. Earlier Qwen generations split deep reasoning and general chat across separate models (QwQ-32B and the standard instruction-tuned series); Qwen3-32B offers both a "thinking mode" and a "non-thinking mode" in one set of weights. In thinking mode, the model generates intermediate reasoning steps, delimited by <think> tags, before it outputs the final answer, which suits complex logical reasoning, mathematics, and coding tasks. In non-thinking mode, it answers directly, giving rapid, low-latency responses to standard conversational prompts. According to Qwen's model card, thinking is enabled by default, and the mode is selected per request through the chat template's enable_thinking flag or the /think and /no_think prompt switches, so you can trade latency for reasoning depth where it matters.
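As a rough sketch of selecting the mode per request: Qwen's model card documents an `enable_thinking` chat-template flag, and vLLM's OpenAI-compatible server accepts it through `chat_template_kwargs` in the request body. Whether Lyceum's endpoint forwards that field in the same way is an assumption here, and the helper names `build_request` and `split_reasoning` are hypothetical. The second helper separates a `<think>...</think>` block from the answer in case the reasoning is returned inline.

```python
import re

MODEL = "Qwen/Qwen3-32B"


def build_request(prompt: str, thinking: bool, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for client.chat.completions.create().

    Assumption: the endpoint honours vLLM-style `chat_template_kwargs`.
    """
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # extra_body is the OpenAI SDK's hook for non-standard request fields.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" if no <think> block exists."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

With a configured client, the call would look like `client.chat.completions.create(**build_request("Prove that 17 is prime.", thinking=True))`. If the server rejects or ignores the extra field, appending `/no_think` to the user message is the prompt-level alternative documented by Qwen.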

Coding and mathematical reasoning

Qwen3-32B is particularly strong in technical domains. Built on an expanded pre-training dataset, it excels at multi-step problem solving, algorithmic design, and code generation. It frequently outperforms larger models on benchmarks such as AIME and LiveCodeBench, with reported scores of 81.4% on AIME 2024 and 65.7% on LiveCodeBench. This makes it a strong choice for developers building AI coding assistants, automated debugging tools, or data analysis pipelines where precision and logical consistency are paramount.

Multilingual proficiency and long context

The model supports over 100 languages and dialects, offering strong capabilities for multilingual instruction following, translation, and cross-cultural content generation. Additionally, Qwen3-32B supports a context window of up to 131,072 tokens (32,768 tokens natively, extended to 131,072 with YaRN scaling, per Qwen's model card). This context capacity allows it to ingest large codebases, lengthy financial reports, or extensive document collections in a single prompt, making it highly effective for retrieval-augmented generation (RAG) and document summarization tasks.
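A quick pre-flight check helps avoid requests that overflow the window. The sketch below assumes that input and output tokens share the 131,072-token limit, and it uses a crude four-characters-per-token estimate that is only a placeholder for the model's real tokenizer. Both function names are hypothetical.

```python
CONTEXT_WINDOW = 131_072  # context limit stated in this article


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt_tokens: int, max_output_tokens: int,
                 window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= window
```

For example, `fits_context(estimate_tokens(document), max_output_tokens=500)` gives a cheap yes/no signal before sending a long RAG prompt; for exact counts, tokenize with the model's own tokenizer instead of the heuristic.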

Benchmarks and how it compares

Qwen3-32B benchmark results

Qwen3-32B rivals or exceeds much larger models across industry-standard benchmarks. Despite having only 32.8 billion parameters, it achieves state-of-the-art results in its weight class, particularly in reasoning and coding tasks.

| Benchmark | Metric | Score |
| --- | --- | --- |
| ArenaHard | Win Rate | 93.8% |
| AIME 2024 | Pass Rate | 81.4% |
| AIME 2025 | Pass Rate | 72.9% |
| LiveCodeBench | Pass Rate | 65.7% |
| LiveBench | Accuracy | 71.6% |
| MultiIF | Accuracy | 73.0% |

Source: GroqDocs Qwen 3 32B Technical Specifications (2025).

When compared to its predecessors and siblings, Qwen3-32B represents a massive leap in efficiency. It consistently outperforms the older Qwen2.5-72B model across coding, mathematics, and reasoning benchmarks, despite having less than half the parameter count. Furthermore, it integrates the deep reasoning capabilities previously isolated in the QwQ-32B model, combining them with the general instruction-following strengths of the standard Qwen series. For teams evaluating models on Lyceum, Qwen3-32B offers a compelling middle ground: it is significantly smarter than 8B or 14B models, yet much faster and more cost-effective to run than 70B+ flagship models.

Using it in production

Production configuration for Qwen3-32B

When deploying Qwen3-32B in production, understanding its configuration parameters and cost structure is essential for optimizing your application. The model supports a massive context window of 131,072 tokens, allowing you to process hundreds of pages of text, extensive code repositories, or complex JSON structures in a single API call.

On Lyceum Technology, Qwen3-32B is served under our Standard tier. This tier is designed for high-capability models that require substantial compute resources to handle complex reasoning and deep context processing. The model is hosted in our eu-north1 region, ensuring low-latency access for European users and strict adherence to data sovereignty regulations.

To calculate the production costs, consider a typical Retrieval-Augmented Generation (RAG) workload. If your application processes an average of 4,000 input tokens (retrieved documents and system prompts) and generates 500 output tokens per request, the cost math is straightforward. At $0.10 per million input tokens, the input cost is $0.0004. At $0.30 per million output tokens, the output cost is $0.00015. This brings the total cost per request to just $0.00055.
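The arithmetic above can be sketched as a small helper. The prices are the ones quoted in this article; the function names and the `requests_per_day` parameter are illustrative, not part of any Lyceum SDK.

```python
INPUT_PRICE_PER_M = 0.10   # USD per million input tokens (from this article)
OUTPUT_PRICE_PER_M = 0.30  # USD per million output tokens (from this article)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under per-token billing."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def daily_cost(input_tokens: int, output_tokens: int,
               requests_per_day: int) -> float:
    """Cost in USD of a day's traffic, assuming uniform request sizes."""
    return request_cost(input_tokens, output_tokens) * requests_per_day


if __name__ == "__main__":
    # The RAG workload described above: 4,000 input and 500 output tokens.
    print(f"per request: ${request_cost(4_000, 500):.5f}")
```

Real traffic rarely has uniform request sizes, so feeding the helper token counts taken from the `usage` field of actual responses would give a tighter estimate than averages.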

For an application handling 10,000 such requests per day, your daily inference cost would be approximately $5.50. This predictable, per-token pricing model allows you to scale from zero to millions of requests without the overhead of provisioning dedicated GPU instances or paying for idle compute time. By leveraging the serverless inference API, you can easily stream responses to your users, reducing perceived latency and improving the overall user experience.
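A minimal streaming sketch, assuming the endpoint supports the OpenAI SDK's standard `stream=True` option. The helper name `stream_reply` is hypothetical; it takes an already-configured client so the function itself makes no assumptions about credentials.

```python
from typing import Any, Iterator


def stream_reply(client: Any, prompt: str, max_tokens: int = 256) -> Iterator[str]:
    """Yield text fragments as they arrive from a streaming chat completion."""
    stream = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,  # ask the server for incremental chunks
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta (e.g. role-only chunks).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

With the `client` from the get-started example, `for piece in stream_reply(client, "Hello!"): print(piece, end="", flush=True)` prints tokens as they are generated rather than waiting for the full response.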

Running Qwen3-32B on EU-sovereign infrastructure

Why run Qwen3-32B on Lyceum

Choosing the right infrastructure provider is just as important as selecting the right model. For European AI startups and enterprise teams, Lyceum Technology offers a unique combination of performance, cost-efficiency, and regulatory compliance.

The most critical advantage of running Qwen3-32B on Lyceum is our commitment to EU data sovereignty. Unlike US-based API providers that route traffic through American data centers, Lyceum hosts Qwen3-32B in our eu-north1 region. This ensures that your sensitive customer data, proprietary code, and internal documents never leave the European Union. For teams navigating the complexities of GDPR, the AI Act, and enterprise compliance audits, this localized hosting is not just a feature but a strict requirement.

Furthermore, Lyceum operates its own GPU infrastructure rather than renting capacity from hyperscalers. This structural advantage allows us to offer highly competitive per-token pricing without sacrificing performance. You benefit from our open-stack transparency, utilizing optimized inference engines like vLLM and NVIDIA Dynamo, which deliver exceptional throughput and low latency.

Finally, migration is straightforward. Because our Serverless Inference API is fully OpenAI-compatible, moving your existing Qwen3-32B workloads to Lyceum requires changing only two lines of code: the base URL and the API key. You can transition away from expensive hyperscaler deployments or non-compliant US providers and gain serverless inference that scales to zero when idle and automatically adjusts to your traffic demands.

Frequently Asked Questions

What is the context window for Qwen3-32B?

Qwen3-32B supports a massive context window of 131,072 tokens. This allows developers to pass extensive documents, large codebases, or complex JSON structures in a single prompt, making it highly effective for retrieval-augmented generation (RAG) and long-form summarization tasks.

How much does Qwen3-32B cost on Lyceum?

On Lyceum Technology, Qwen3-32B is priced at $0.10 per million input tokens and $0.30 per million output tokens. It is categorized under our Standard tier, offering a highly cost-effective solution for production workloads without any minimum commitments or base fees.

How do I switch to Lyceum's API for Qwen3-32B?

Because Lyceum provides a drop-in OpenAI-compatible API, switching takes two changes to your OpenAI SDK configuration: set base_url to the Lyceum serverless endpoint shown in the get-started example above, and provide your Lyceum API key. No other code changes are required.

Is Qwen3-32B GDPR compliant on Lyceum?

Yes. When you run Qwen3-32B on Lyceum, your workloads are executed in our eu-north1 region. All data processing remains strictly within the European Union, ensuring full compliance with GDPR and providing the data sovereignty required by European enterprises.

How does Qwen3-32B compare to Qwen2.5-72B?

Despite having less than half the parameters, Qwen3-32B frequently outperforms the older Qwen2.5-72B model across coding, mathematics, and reasoning benchmarks. Its hybrid thinking mode allows it to achieve state-of-the-art reasoning capabilities while remaining significantly faster and cheaper to run.

What is the license for Qwen3-32B?

Qwen3-32B is released under the Apache 2.0 license. This permissive open-source license allows developers and enterprises to use, modify, and deploy the model for commercial applications without restrictive licensing fees or complex legal barriers.
