Qwen3.5-397B-A17B: specs, benchmarks, and how to run it on Lyceum

Alibaba's 397B MoE model with native vision and agentic reasoning

Justus Amen

June 26, 2026 · GTM at Lyceum Technology

Qwen3.5-397B-A17B is a flagship multimodal foundation model developed by Alibaba Cloud. Built on a hybrid Mixture-of-Experts (MoE) architecture, it features 397 billion total parameters but activates only 17 billion per token, enabling high-throughput inference for complex reasoning and agentic workflows. As a native vision-language model, it unifies text and visual processing through early fusion training. Lyceum Technology serves Qwen3.5-397B-A17B through our OpenAI-compatible Serverless Inference API, allowing developers to integrate it as a drop-in replacement. This specific model tier is served from Lyceum's us-central1 region in the US, giving teams low-latency access without managing any GPU infrastructure.

Get started: call Qwen3.5-397B-A17B on Lyceum

You can access Qwen3.5-397B-A17B through Lyceum Technology's Serverless Inference API. Switch to Lyceum by updating your base URL and API key. The API is fully OpenAI-compatible, requiring no other code changes.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's Serverless Inference API.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send a single-turn chat request to the model.
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Pricing and region for Qwen3.5-397B-A17B

This model is available on Lyceum's Standard tier, which is optimized for high-capability workloads. The pricing is $0.60 per million input tokens and $3.60 per million output tokens. This specific endpoint is hosted in the us-central1 region. Billing is strictly pay-per-token with no minimum commitments or idle compute costs.

For engineering teams migrating from hyperscaler environments, this drop-in compatibility eliminates the need to rewrite application logic. You can route traffic to Qwen3.5-397B-A17B immediately using standard chat completion parameters like temperature and top_p.
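As a rough illustration of the sampling parameters mentioned above, the sketch below assembles the keyword arguments for a chat completion call. `build_request` is a hypothetical helper name, and the default values are placeholder assumptions, not settings recommended by Lyceum or Alibaba.

```python
from typing import Any

MODEL_ID = "Qwen/Qwen3.5-397B-A17B"


def build_request(prompt: str, temperature: float = 0.7, top_p: float = 0.9,
                  max_tokens: int = 256) -> dict[str, Any]:
    """Return keyword arguments for client.chat.completions.create().

    The default sampling values are illustrative assumptions only.
    """
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


request = build_request("Summarize this changelog.", temperature=0.2)
print(request["temperature"], request["top_p"])
```

With the client from the quick-start snippet, the resulting dictionary could be passed as `client.chat.completions.create(**request)`.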

Self-hosting Qwen3.5-397B-A17B requires significant infrastructure, typically an 8-GPU node of NVIDIA HGX B200s or H100s, because it is a massive 397-billion-parameter model. Using Lyceum's managed endpoint lets you bypass the operational overhead of provisioning hardware, tuning vLLM configurations, and managing Kubernetes clusters. You only pay for the tokens you consume, making it highly cost-effective for both bursty workloads and sustained production traffic.

What Qwen3.5-397B-A17B is good at

Native multimodal reasoning

Qwen3.5-397B-A17B is a unified vision-language foundation model, unlike previous generations that maintained separate text and vision model lines. It uses early fusion training on multimodal tokens, allowing it to process images and video natively. This architecture enables the model to excel at visual reasoning, GUI interaction, and video comprehension without relying on external vision encoders. It achieves parity with massive text-only models while offering superior spatial awareness for tasks like automated quality inspection by processing visual and textual data in a single shared representation space.
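To make the image-plus-text input concrete, here is a minimal sketch of a multimodal user message. It assumes the Lyceum endpoint accepts the OpenAI-style `image_url` content part with a base64 data URL, which the text above does not confirm. `image_message` is a hypothetical helper name.

```python
import base64


def image_message(prompt: str, image_bytes: bytes,
                  mime_type: str = "image/png") -> dict:
    """Build one user message that pairs a text prompt with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumption: the server accepts OpenAI-style image_url parts.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real image read with open(path, "rb").
message = image_message("Is there a visible defect on this part?",
                        b"\x89PNG placeholder")
print(message["content"][0]["text"])
```

If the endpoint supports it, the message would go in the `messages` list of a normal chat completion request.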

Agentic workflows and coding

Qwen3.5-397B-A17B operates in a "thinking mode" by default, generating internal reasoning traces (<think>...</think>) before producing a final response. This chain-of-thought approach significantly boosts its performance on complex agentic tasks and software engineering benchmarks. It is highly capable of executing multi-step planning, tool calling, and autonomous coding tasks. When integrated into an application, this allows the model to self-correct and evaluate multiple solution paths before committing to an answer.
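Applications often want to log the reasoning trace but show only the final answer. The sketch below separates the two, assuming the trace arrives inline in the message content as `<think>...</think>`. Some serving stacks return it in a separate field instead, so check a real response first. `split_reasoning` is a hypothetical helper name.

```python
import re

# Non-greedy match so several <think> blocks are captured separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is basic arithmetic.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(answer)  # prints: The answer is 4.
```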

High-throughput MoE efficiency

The model combines a sparse Mixture-of-Experts (MoE) architecture with Gated Delta Networks (linear attention). Out of 512 available experts, it routes each token to a select few, activating only 17 billion of its 397 billion total parameters per forward pass. Its layers mix linear attention and full attention at a 3:1 ratio, so only about one layer in four maintains a full KV cache, which reduces KV-cache memory requirements by approximately 4x. For infrastructure teams, this architectural efficiency translates directly into higher decoding throughput and lower cost per token.
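The figures above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch. It assumes that only full-attention layers keep a KV cache that grows with sequence length, and that all layers would otherwise have equal cache size.

```python
TOTAL_PARAMS_B = 397     # total parameters, in billions
ACTIVE_PARAMS_B = 17     # parameters activated per token, in billions
LINEAR_TO_FULL = (3, 1)  # linear-attention layers per full-attention layer

# Share of the weights that participate in one forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Simplifying assumption: linear-attention layers keep a fixed-size state,
# so only the full-attention layers contribute a growing KV cache.
linear, full = LINEAR_TO_FULL
full_attention_share = full / (linear + full)
kv_cache_reduction = 1 / full_attention_share

print(f"active share of parameters: {active_fraction:.1%}")       # 4.3%
print(f"layers with a growing KV cache: {full_attention_share:.0%}")  # 25%
print(f"KV-cache reduction: {kv_cache_reduction:.0f}x")            # 4x
```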

Benchmarks and how it compares

Qwen3.5-397B-A17B benchmark results

Qwen3.5-397B-A17B ranks highly among open-weights models, scoring 45 on the Artificial Analysis Intelligence Index. It demonstrates strong performance across reasoning, coding, and knowledge benchmarks.

| Benchmark | Qwen3.5-397B-A17B | GLM-5 (744B) | Kimi K2.5 (1T) |
| --- | --- | --- | --- |
| MMLU-Pro (Knowledge) | 87.8% | 89.5% | 87.1% |
| GPQA (STEM) | 88.4% | 87.0% | 87.6% |
| SWE-Bench Verified (Coding) | 76.4% | - | - |
| IFEval (Instruction Following) | 92.6% | 90.9% | 93.9% |

Source: Qwen3.5 Technical Report and Artificial Analysis.

The model's architecture allows it to punch above its weight class. In the SWE-Bench Verified evaluation, which tests a model's ability to resolve real-world GitHub issues, Qwen3.5-397B-A17B achieves a 76.4% resolution rate. This places it in the upper echelon of coding models, rivaling proprietary alternatives. Its performance on the IFEval benchmark at 92.6% indicates high reliability when adhering to strict formatting constraints, such as generating valid JSON.

Qwen3.5-397B-A17B serves as the heavy-duty reasoning engine within the broader Qwen catalogue. The newer Qwen3.6-27B dense model offers exceptional coding performance for its size, but the 397B MoE model retains a decisive advantage in complex, multi-step agentic workflows and deep scientific reasoning tasks. Qwen3.5-397B-A17B achieves competitive scores compared to larger models like GLM-5 (744B) and Kimi K2.5 (1T) while requiring far fewer active parameters per token (17B vs 40B and 32B, respectively), making it highly cost-efficient for production inference.

Using it in production

Production configuration for Qwen3.5-397B-A17B

When deploying Qwen3.5-397B-A17B via Lyceum Technology, you are accessing the model on our Standard tier, which is designed for high-capability, complex reasoning tasks. The model supports a massive native context window of 262,000 tokens, making it ideal for analyzing large codebases, processing extensive document repositories, or handling long-running agentic workflows. Keep in mind that processing the full 262k tokens will naturally increase the time-to-first-token.
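Before sending a very long prompt, a simple budget check can catch requests that would overflow the window. The sketch below assumes the prompt and the generated tokens (including any reasoning tokens) share the same 262,000-token window, as is typical for chat models. Token counts should come from the model's tokenizer, not from character counts. `fits_context` is a hypothetical helper name.

```python
CONTEXT_WINDOW = 262_000  # tokens, as stated above


def fits_context(prompt_tokens: int, max_output_tokens: int,
                 window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_output_tokens <= window


print(fits_context(250_000, 8_000))  # True: 258,000 tokens in total
print(fits_context(260_000, 8_000))  # False: 268,000 tokens in total
```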

Ensure your application is configured to handle streaming responses, as the model utilizes a "thinking mode" to generate reasoning traces. Streaming the output prevents timeout errors and improves the user experience by displaying the reasoning process (<think>...</think>) in real-time as the model works through complex problems. If your frontend does not support streaming, you must configure your HTTP client with extended timeout thresholds, as the model may spend several seconds generating hidden reasoning tokens before returning the final JSON or text payload.
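A minimal streaming sketch with the OpenAI Python SDK might look like the following. The 120-second timeout is an arbitrary assumption, and `iter_text` and `stream_reply` are hypothetical helper names. The parsing helper is kept separate so it can be exercised offline with stand-in chunks.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

BASE_URL = "https://api.lyceum.technology/api/v2/external/serverless"
MODEL_ID = "Qwen/Qwen3.5-397B-A17B"


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the non-empty text deltas from an OpenAI-style stream."""
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            yield text


def stream_reply(prompt: str, api_key: str, timeout_s: float = 120.0) -> str:
    """Stream a reply to stdout and return the full text (network call)."""
    from openai import OpenAI  # imported here so iter_text has no dependency

    # timeout_s is an assumed value; tune it to your own latency budget.
    client = OpenAI(base_url=BASE_URL, api_key=api_key, timeout=timeout_s)
    stream = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for text in iter_text(stream):
        print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


def fake_chunk(content):
    """Build a stand-in chunk shaped like the SDK's streaming objects."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline check of the parsing logic; no request is sent here.
fake = [fake_chunk("<think>"), SimpleNamespace(choices=[]),
        fake_chunk(None), fake_chunk("Hi")]
print("".join(iter_text(fake)))
```

Calling `stream_reply("Hello!", "<your lyceum api key>")` would send a real request. Whether the reasoning trace shows up inside `content` or in a separate field depends on how the endpoint is configured.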

At $0.60 per million input tokens and $3.60 per million output tokens, the model is highly cost-efficient for its capability class. For a realistic production workload, such as processing a 10,000-token document and generating a 1,000-token analysis, a single API call costs approximately $0.0096. This endpoint is hosted in the us-central1 region, providing reliable, low-latency access for global applications. This pay-per-token model scales from zero, so you incur no idle costs during periods of low traffic and get a predictable path forward for scaling AI features.
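The per-call figure above can be reproduced with a short calculation using the listed prices. `estimate_cost` is a hypothetical helper name. If reasoning tokens are billed as output, which this sketch does not assume either way, they would need to be included in the output count.

```python
INPUT_USD_PER_M = 0.60   # USD per million input tokens
OUTPUT_USD_PER_M = 3.60  # USD per million output tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed per-token prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# 10,000 input tokens -> $0.0060; 1,000 output tokens -> $0.0036.
cost = estimate_cost(10_000, 1_000)
print(f"${cost:.4f}")  # prints: $0.0096
```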

Why run Qwen3.5-397B-A17B on Lyceum

Managing the infrastructure required for a 397B-parameter MoE model is a significant operational burden for AI startups and enterprise engineering teams. Lyceum Technology simplifies this process by providing a fully managed, OpenAI-compatible API. You get the capabilities of Qwen3.5-397B-A17B as a drop-in replacement, without the overhead of provisioning 8-GPU nodes, tuning vLLM configurations, or managing Kubernetes clusters. Our platform is designed for engineers who build, offering per-second billing and scale-to-zero capabilities that ensure you only pay for the compute you actually use.

This Qwen3.5-397B-A17B endpoint runs in Lyceum's us-central1 region in the US, keeping traffic close to North American workloads. Because we own our GPU infrastructure instead of renting it from hyperscalers, we pass a structural cost advantage straight through to you: highly competitive per-token pricing, unified billing, and zero egress fees. There are no idle or base charges, so you pay strictly for the tokens you consume, and you can burst to per-second dedicated GPUs whenever a workload needs more headroom.

Furthermore, Lyceum's open-stack transparency, built on vLLM, NVIDIA Dynamo, and TensorRT-LLM, ensures that you avoid the vendor lock-in associated with proprietary inference engines. By standardizing on open-source orchestration, we guarantee customer portability by design. If your scaling strategy eventually requires transitioning from our serverless inference API to your own dedicated Lyceum VMs, the underlying software stack remains consistent, eliminating costly migration engineering. Lyceum delivers the performance, transparency, and cost-efficiency required to scale AI applications in production.

Frequently Asked Questions

What is the context window for Qwen3.5-397B-A17B?

Qwen3.5-397B-A17B supports a massive native context window of 262,000 tokens. This extensive capacity allows the model to process large codebases, analyze long document collections, and maintain state across extended agentic workflows without losing critical information from earlier in the prompt.

How much does it cost to run Qwen3.5-397B-A17B on Lyceum?

On Lyceum Technology's Serverless Inference API, Qwen3.5-397B-A17B costs $0.60 per million input tokens and $3.60 per million output tokens. Billing is strictly pay-per-token with no minimum commitments, base fees, or idle compute charges, making it highly cost-effective for production workloads.

Is Qwen3.5-397B-A17B a multimodal model?

Yes, Qwen3.5-397B-A17B is a native vision-language model. It was trained using early fusion on multimodal tokens, allowing it to natively process images and video alongside text. This architecture enables advanced visual reasoning and GUI interaction without relying on external vision encoders.

Where is the Qwen3.5-397B-A17B endpoint hosted?

This specific Qwen3.5-397B-A17B model endpoint is hosted in Lyceum Technology's us-central1 region on our Standard tier. This deployment runs from our US-based capacity, not a European region, to ensure high availability and low-latency performance for global applications.

How do I call Qwen3.5-397B-A17B using the OpenAI SDK?

You can use the standard OpenAI Python or Node.js SDK to call the model. Update the base_url to https://api.lyceum.technology/api/v2/external/serverless, provide your Lyceum API key, and set the model parameter to Qwen/Qwen3.5-397B-A17B. No other code changes are required.

What is the license for Qwen3.5-397B-A17B?

Qwen3.5-397B-A17B is released by Alibaba Cloud under the Apache 2.0 license. This permissive open-source license allows for both commercial and non-commercial use, making it an excellent choice for enterprise engineering teams building proprietary applications without restrictive licensing overhead.

Related Resources

/magazine/glm-5-2; /magazine/llama-3-3-70b; /magazine/gpt-oss-120b