What is Qwen2.5-VL-72B?

Qwen2.5-VL-72B is a 72-billion parameter vision-language model developed by the Qwen team. It features advanced multimodal capabilities, including document parsing, spatial localization, and long-video comprehension. It can process images of varying resolutions and videos more than an hour long, making it well suited to complex visual reasoning tasks.

How much does it cost to run Qwen2.5-VL-72B via API?

On Lyceum Technology's Serverless Inference API, Qwen2.5-VL-72B is priced at $0.25 per million input tokens and $0.75 per million output tokens. You are billed strictly per token with no minimum commitments or base fees, allowing you to scale your multimodal workloads cost-effectively.

What is the context window for Qwen2.5-VL-72B?

Qwen2.5-VL-72B supports a context window large enough to hold long videos and multiple high-resolution images in a single request. Image and video token counts scale with resolution and frame rate, so manage inputs carefully: oversized inputs can exhaust the context limit and increase inference latency.

Where is my multimodal data processed?

All data, including text prompts, images, and videos, is processed exclusively in the eu-north1 region. Lyceum guarantees strict EU data residency and GDPR compliance, ensuring sensitive multimodal workloads never leave European jurisdiction.

How does Qwen2.5-VL-72B compare to GPT-4o?

Qwen2.5-VL-72B matches or exceeds GPT-4o on several multimodal benchmarks. Official evaluations show it scores 70.2 on MMMU and 95.3 on DocVQA, which puts it on par with proprietary alternatives in college-level reasoning and document understanding.

Can I use the OpenAI SDK to call Qwen2.5-VL-72B?

Yes. The platform provides a fully OpenAI-compatible API. Use the standard Python or Node.js OpenAI SDKs to call Qwen2.5-VL-72B by updating the base_url to https://api.lyceum.technology/api/v2/external/serverless and setting the model parameter to Qwen/Qwen2.5-VL-72B-Instruct.

Qwen2.5-VL-72B API: pricing, benchmarks & EU hosting

Qwen2.5-VL-72B is the flagship vision-language model from the Qwen team, released in early 2025. Built with a dynamic-resolution Vision Transformer and absolute time encoding, it processes images of any aspect ratio and videos more than an hour long without traditional normalization. It excels at document parsing, spatial localization, and acting as a visual agent for computer use. Lyceum Technology serves Qwen2.5-VL-72B via an OpenAI-compatible API, allowing you to integrate state-of-the-art multimodal capabilities into your applications while keeping all data processing strictly within EU-sovereign data centers.

Get started: call Qwen2.5-VL-72B on Lyceum

Integrate with the OpenAI SDK

Integrate Qwen2.5-VL-72B into your application without learning new frameworks or writing custom API wrappers. The platform provides a fully managed, OpenAI-compatible Serverless Inference API. Updating the base URL and API key routes multimodal requests directly to European data centers.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send one user message that combines a text prompt with an image URL.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
    ]}],
)
print(response.choices[0].message.content)
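
The request above references an image by URL. For a local file, one common approach with OpenAI-compatible endpoints is to embed the image as a base64 data URL. The sketch below assumes the endpoint accepts data URLs in the image_url field, which should be verified before relying on it; the helper names image_bytes_to_data_url and build_image_message are hypothetical and not part of any SDK.

```python
import base64
import mimetypes


def image_bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    # Infer the MIME type from the file extension, e.g. "image/png".
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(prompt: str, image_url: str) -> list[dict]:
    """Build a messages list with one text part and one image part."""
    return [{"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]}]
```

The resulting list can be passed as the messages argument of client.chat.completions.create in place of the inline literal. Assumption: data URLs are accepted by the serving endpoint.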

Pricing and region for Qwen2.5-VL-72B

Route inference traffic to Lyceum to pay strictly for the compute consumed. Qwen2.5-VL-72B is available on the Standard tier, which is optimized for high-capability workloads requiring precision and reasoning depth. The model is hosted in the eu-north1 region, ensuring that images, videos, and text prompts never leave European jurisdiction.

The pricing for Qwen2.5-VL-72B is competitive for a 72-billion parameter multimodal model. You are billed at $0.25 per million input tokens and $0.75 per million output tokens. Operating proprietary GPU infrastructure rather than renting capacity from hyperscalers allows Lyceum to offer these rates without minimum commitments, base fees, or egress charges. You scale from zero and pay only for the tokens processed.

What Qwen2.5-VL-72B is good at

Advanced document parsing and structured output

Qwen2.5-VL-72B excels at extracting structured data from complex, text-heavy images. The model was trained on a massive corpus of document data, enabling it to parse invoices, forms, tables, charts, and even mathematical equations or chemical formulas. It can output stable JSON representations of this data, making it an ideal choice for automated data entry, financial document processing, and enterprise OCR pipelines.
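
In an extraction pipeline, the model's reply still has to be parsed before it reaches downstream systems. The sketch below shows one defensive way to pull the first JSON value out of a reply, assuming the model may wrap its answer in a Markdown code fence or surround it with prose. The function name extract_json and the sample field names are illustrative only.

```python
import json
import re

# Built programmatically so this example does not contain a literal fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str):
    """Return the first JSON object or array found in a model reply."""
    # Prefer the contents of a fenced block if the reply contains one.
    match = FENCED_BLOCK.search(reply)
    candidate = match.group(1) if match else reply
    decoder = json.JSONDecoder()
    # Try to decode starting at every opening brace or bracket.
    for index, char in enumerate(candidate):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(candidate, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in reply")
```

Raising on a reply with no parseable JSON lets the caller retry or route the document to manual review rather than silently storing bad data.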

Spatial localization and visual agent capabilities

Unlike earlier vision-language models that only describe images globally, Qwen2.5-VL-72B possesses precise spatial localization capabilities. It can accurately pinpoint objects within an image and output their exact coordinates using bounding boxes or points. This spatial awareness allows the model to act as a visual agent. It can reason about user interfaces, dynamically direct tools, and execute tasks in real-world scenarios, such as operating computer desktops or mobile device screens based on visual feedback.
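
If images are downscaled before upload, returned coordinates refer to the image as sent and have to be mapped back to the original. The sketch below assumes the model replies with a JSON list of detections shaped like {"bbox_2d": [x1, y1, x2, y2], "label": "..."} in pixel coordinates of the submitted image. That reply shape is an assumption to check against your own outputs; Box and rescale_boxes are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def rescale_boxes(detections, sent_size, original_size):
    """Map boxes from the submitted image size back to the original size."""
    sent_w, sent_h = sent_size
    orig_w, orig_h = original_size
    scale_x, scale_y = orig_w / sent_w, orig_h / sent_h
    boxes = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox_2d"]  # assumed key name
        # Normalise corner order so (x1, y1) is always the top-left corner.
        x1, x2 = sorted((x1, x2))
        y1, y2 = sorted((y1, y2))
        # Scale to the original image and clamp to its bounds.
        boxes.append(Box(
            label=det.get("label", ""),
            x1=min(max(x1 * scale_x, 0), orig_w),
            y1=min(max(y1 * scale_y, 0), orig_h),
            x2=min(max(x2 * scale_x, 0), orig_w),
            y2=min(max(y2 * scale_y, 0), orig_h),
        ))
    return boxes
```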

Long-video comprehension

Qwen2.5-VL-72B substantially improves temporal understanding. By implementing dynamic FPS sampling and absolute time encoding, the model can process videos extending over an hour in length. It natively perceives temporal dynamics without relying on traditional normalization techniques. This allows the model to not only summarize long videos but also pinpoint specific events down to the second, making it highly effective for security footage analysis, reviewing recorded meetings, and automated video editing workflows.
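
When frames are extracted on the client side before a request, a frame budget keeps long videos from overwhelming the context window. The sketch below is not the model's internal dynamic FPS sampling; it is a simple client-side plan that picks evenly spaced timestamps, capped at a maximum frame count. The function name and the default values for target_fps and max_frames are illustrative assumptions.

```python
def sample_timestamps(duration_s: float, target_fps: float = 2.0,
                      max_frames: int = 768) -> list[float]:
    """Pick evenly spaced frame timestamps (seconds) under a frame budget."""
    if duration_s <= 0 or target_fps <= 0 or max_frames <= 0:
        raise ValueError("duration_s, target_fps and max_frames must be positive")
    # Frames wanted at the target rate, capped by the budget, at least one.
    frame_count = max(1, min(int(duration_s * target_fps), max_frames))
    step = duration_s / frame_count
    return [round(i * step, 3) for i in range(frame_count)]
```

Short clips keep the target frame rate, while long recordings fall back to a lower effective rate. Keeping the timestamps alongside the frames makes it possible to relate the model's answers back to positions in the original video.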

Benchmarks and how it compares

Qwen2.5-VL-72B benchmark results

Qwen2.5-VL-72B was evaluated against both open-source and proprietary multimodal models. The official technical report states the 72B flagship model matches or exceeds the performance of proprietary models like GPT-4o and Claude 3.5 Sonnet in document parsing and spatial reasoning tasks.

Benchmark	Qwen2.5-VL-72B	GPT-4o	Claude 3.5 Sonnet
MMMU (College-level reasoning)	70.2	69.1	68.3
MathVista (Visual math)	70.5	63.8	67.7
DocVQA (Document understanding)	95.3	92.8	95.2
OCRBench (Text extraction)	875	736	788

Source: Qwen2.5-VL Technical Report (arXiv)

Compared to its sibling models, Qwen2.5-VL-72B offers greater reasoning depth. The smaller Qwen2.5-VL-7B is efficient for basic image captioning and the Qwen2.5-VL-32B offers a middle ground for edge deployments, but the 72B variant is the one suited to complex agentic workflows, dense document omni-parsing, and hour-long video comprehension. On OCRBench_v2, the 72B model outperformed Gemini 1.5-Pro by 9.6 percent in English and 20.6 percent in Chinese. For enterprise applications where accuracy is paramount, the 72B model is the strongest choice in the Qwen2.5-VL lineup.

Using it in production

Production configuration for Qwen2.5-VL-72B

Deploying Qwen2.5-VL-72B in production requires careful management of multimodal inputs to optimize both latency and cost. The model supports a massive context window, allowing you to pass multiple high-resolution images or long video sequences in a single API request. However, because image tokens are calculated based on the dynamic resolution processing of the Vision Transformer, sending raw, uncompressed 4K images will rapidly inflate your input token count. We recommend resizing images to the minimum resolution necessary for your specific task, such as 1024x1024 for standard OCR, to keep inference fast and cost-effective.
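
The effect of resizing on token count can be estimated before sending anything. The sketch below assumes roughly one visual token per 28x28 pixel block (14-pixel patches merged 2x2), which is an approximation and not a billing formula; actual counts depend on the serving configuration. The helper names are hypothetical.

```python
import math

# Assumption: about one visual token per 28x28 pixel block.
PIXELS_PER_TOKEN_SIDE = 28


def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Shrink to fit max_side, keeping aspect ratio, snapped down to 28 px."""
    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest

    def snap(value: int) -> int:
        blocks = max(1, value // PIXELS_PER_TOKEN_SIDE)
        return blocks * PIXELS_PER_TOKEN_SIDE

    return snap(width), snap(height)


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough visual token estimate for an image of the given pixel size."""
    return (math.ceil(width / PIXELS_PER_TOKEN_SIDE)
            * math.ceil(height / PIXELS_PER_TOKEN_SIDE))
```

Under this approximation, a 3840x2160 frame comes to roughly 10,764 tokens, while the same frame fitted to a 1024-pixel longest side (1008x560) comes to about 720.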

When calling the model via Lyceum Technology, you are utilizing our Standard tier. This tier is provisioned on high-end NVIDIA GPUs in our eu-north1 region, ensuring the compute density required for a 72-billion parameter model. The Standard tier prioritizes high-capability execution, making it ideal for complex reasoning, structured JSON extraction, and agentic tool use.

Lyceum charges strictly per token, keeping costs predictable. At $0.25 per million input tokens and $0.75 per million output tokens, a typical document parsing workload is economical. For example, processing 1,000 invoices, where each invoice consumes roughly 2,000 input tokens for the image and prompt and generates 300 output tokens for the extracted JSON, results in a total consumption of 2 million input tokens and 300,000 output tokens. This batch costs $0.50 for input and $0.225 for output, totaling $0.725. You achieve OCR and data extraction without the overhead of maintaining dedicated infrastructure.
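
The invoice arithmetic above can be reproduced with a few lines of Python, which also makes it easy to plug in other batch sizes. This is a minimal sketch using the per-token prices quoted in this article; batch_cost is a hypothetical helper, and the per-request token counts are estimates you would replace with measured usage.

```python
# Prices quoted in this article, in USD per million tokens.
INPUT_USD_PER_MILLION = 0.25
OUTPUT_USD_PER_MILLION = 0.75


def batch_cost(requests: int, input_tokens_each: int,
               output_tokens_each: int) -> dict:
    """Estimate the cost of a batch of identical-sized requests."""
    input_tokens = requests * input_tokens_each
    output_tokens = requests * output_tokens_each
    input_usd = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_usd = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_usd": input_usd,
        "output_usd": output_usd,
        "total_usd": input_usd + output_usd,
    }


if __name__ == "__main__":
    cost = batch_cost(requests=1000, input_tokens_each=2000,
                      output_tokens_each=300)
    print(f"total: ${cost['total_usd']:.3f}")
```

Calling batch_cost(1000, 2000, 300) reproduces the 2 million input tokens, 300,000 output tokens, and $0.725 total from the example.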

Running Qwen2.5-VL-72B on EU-sovereign infrastructure

Why run Qwen2.5-VL-72B on Lyceum

For European enterprises and AI startups, data residency is not optional. When processing sensitive multimodal data, such as medical scans, financial documents, or factory floor security footage, sending that data to US-based hyperscalers or API providers introduces severe compliance risks. Lyceum Technology solves this by serving Qwen2.5-VL-72B entirely from our eu-north1 region. Your images, videos, and prompts are processed on EU-sovereign infrastructure, ensuring strict adherence to GDPR and providing a clear path to AI Act compliance.

Unlike many API providers that act as middlemen renting capacity from AWS or Google Cloud, Lyceum owns and operates its GPU infrastructure. This structural advantage allows us to offer Qwen2.5-VL-72B at highly competitive per-token rates without base fees or egress charges. You get the performance of a 72-billion parameter model without the financial burden of provisioning dedicated H100 clusters.

The platform prevents vendor lock-in through open-stack transparency. The inference engine is built on open-source technologies like vLLM and NVIDIA Dynamo, ensuring that you can self-host your LLM API on EU infrastructure if scale demands it. A drop-in OpenAI-compatible API makes it trivial for engineering teams to switch from proprietary US models to Qwen2.5-VL-72B. Updating the base URL and API key instantly upgrades the application with multimodal capabilities while securing data within Europe.

