Qwen2.5-VL-72B: specs, benchmarks, and how to run it on Lyceum
The flagship 72B vision-language model from Qwen, excelling in document parsing, spatial localization, and long-video comprehension.
Justus Amen
June 24, 2026 · GTM at Lyceum Technology
Qwen2.5-VL-72B is the flagship vision-language model from the Qwen team, released in early 2025. Built with a dynamic-resolution Vision Transformer and absolute time encoding, it processes images of any aspect ratio and videos up to hours long without traditional normalization. It excels at document parsing, spatial localization, and acting as a visual agent for computer use. Lyceum Technology serves Qwen2.5-VL-72B via an OpenAI-compatible API, allowing you to integrate state-of-the-art multimodal capabilities into your applications while keeping all data processing strictly within EU-sovereign data centers.
Get started: call Qwen2.5-VL-72B on Lyceum
Integrate with the OpenAI SDK
Integrate Qwen2.5-VL-72B into your application without learning new frameworks or writing custom API wrappers. The platform provides a fully managed, OpenAI-compatible Serverless Inference API. Updating the base URL and API key routes multimodal requests directly to European data centers.
```python
from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send one user message that combines a text prompt with an image URL.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
    ]}],
)
print(response.choices[0].message.content)
```
Pricing and region for Qwen2.5-VL-72B
Route inference traffic to Lyceum to pay strictly for the compute consumed. Qwen2.5-VL-72B is available on the Standard tier, which is optimized for high-capability workloads requiring precision and reasoning depth. The model is hosted in the eu-north1 region, ensuring that images, videos, and text prompts never leave European jurisdiction.
The pricing for Qwen2.5-VL-72B is competitive for a 72-billion parameter multimodal model. You are billed at $0.25 per million input tokens and $0.75 per million output tokens. Operating proprietary GPU infrastructure rather than renting capacity from hyperscalers allows Lyceum to offer these rates without minimum commitments, base fees, or egress charges. You scale from zero and pay only for the tokens processed.
What Qwen2.5-VL-72B is good at
Advanced document parsing and structured output
Qwen2.5-VL-72B excels at extracting structured data from complex, text-heavy images. The model was trained on a massive corpus of document data, enabling it to parse invoices, forms, tables, charts, and even mathematical equations or chemical formulas. It can output stable JSON representations of this data, making it an ideal choice for automated data entry, financial document processing, and enterprise OCR pipelines.
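As a rough sketch of the JSON-extraction pattern described above, the helper below pulls a JSON object out of a model reply. It assumes the reply may arrive wrapped in a Markdown code fence, which is common for chat models but not guaranteed for this one. The function name `parse_json_reply` and the field names in the sample reply (`invoice_number`, `total`) are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(reply: str) -> dict:
    """Extract a JSON object from a chat reply, tolerating a Markdown fence."""
    # Assumption: the model may or may not wrap its JSON in a code fence.
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply
    data = json.loads(payload.strip())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Hypothetical reply text; real output depends on your prompt and document.
sample_reply = FENCE + 'json\n{"invoice_number": "INV-001", "total": 129.5}\n' + FENCE
invoice = parse_json_reply(sample_reply)
print(invoice["invoice_number"], invoice["total"])
```

In a pipeline, a `json.JSONDecodeError` or `ValueError` raised here is a reasonable signal to retry the request or route the document to manual review.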
Spatial localization and visual agent capabilities
Unlike earlier vision-language models that only describe images globally, Qwen2.5-VL-72B possesses precise spatial localization capabilities. It can accurately pinpoint objects within an image and output their exact coordinates using bounding boxes or points. This spatial awareness allows the model to act as a visual agent. It can reason about user interfaces, dynamically direct tools, and execute tasks in real-world scenarios, such as operating computer desktops or mobile device screens based on visual feedback.
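To illustrate how localization output might be consumed, the sketch below parses a list of detections and maps the boxes back to the original image size. It assumes the model was prompted to return a JSON list with `bbox_2d` (`[x1, y1, x2, y2]`) and `label` keys, the shape used in Qwen's published examples, and that coordinates are in the pixel space of the image you sent. Both are assumptions to verify against your own prompts; the `Box` class and `parse_boxes` helper are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(reply_json: str, sent_size: tuple[int, int],
                original_size: tuple[int, int]) -> list[Box]:
    """Parse detections and rescale them to the original image size."""
    # Scale factors from the image that was sent to the original image.
    sx = original_size[0] / sent_size[0]
    sy = original_size[1] / sent_size[1]
    boxes = []
    for item in json.loads(reply_json):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(Box(item["label"], x1 * sx, y1 * sy, x2 * sx, y2 * sy))
    return boxes


# Hypothetical reply for a 960x540 screenshot downscaled from 1920x1080.
sample = '[{"bbox_2d": [100, 50, 300, 200], "label": "submit button"}]'
boxes = parse_boxes(sample, sent_size=(960, 540), original_size=(1920, 1080))
print(boxes[0])
```

Rescaling matters for agent use cases: a click target computed on a downscaled screenshot has to be converted back to real screen coordinates before it is acted on.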
Long-video comprehension
Qwen2.5-VL-72B is built for long-range temporal understanding. By implementing dynamic FPS sampling and absolute time encoding, the model can process videos extending over an hour in length. It natively perceives temporal dynamics without relying on traditional normalization techniques. This allows the model to not only summarize long videos but also pinpoint specific events down to the second, making it well suited to security footage analysis, visual review of meeting recordings, and automated video editing workflows.
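The model's dynamic FPS sampling happens inside its own preprocessing. The sketch below is only a client-side analogue for cases where you extract frames yourself: given a video duration and a frame budget, it lowers the effective sampling rate for long videos and returns the timestamps to grab. The function name, the `max_frames` budget, and the `max_fps` cap are hypothetical. Whether and how the Lyceum endpoint accepts video input is not covered here, so this shows only the sampling arithmetic.

```python
def sample_timestamps(duration_s: float, max_frames: int,
                      max_fps: float = 2.0) -> list[float]:
    """Return evenly spaced timestamps (seconds) within a frame budget."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    # Frames we would take at the capped rate, before applying the budget.
    natural = int(duration_s * max_fps)
    # Short clips keep max_fps; long clips are thinned to fit the budget.
    count = max(1, min(max_frames, natural))
    step = duration_s / count
    return [round(i * step, 3) for i in range(count)]


# A one-hour video with a hypothetical budget of 600 frames.
ts = sample_timestamps(3600, 600)
print(len(ts), ts[:3])
```

Keeping the real timestamp alongside each frame lets you relate an answer such as "the event occurs at 12:30" back to the source footage.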
Benchmarks and how it compares
Qwen2.5-VL-72B benchmark results
Qwen2.5-VL-72B was evaluated against both open-source and proprietary multimodal models. The official technical report states the 72B flagship model matches or exceeds the performance of proprietary models like GPT-4o and Claude 3.5 Sonnet in document parsing and spatial reasoning tasks.
| Benchmark | Qwen2.5-VL-72B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMMU (College-level reasoning) | 70.2 | 69.1 | 68.3 |
| MathVista (Visual math) | 70.5 | 63.8 | 67.7 |
| DocVQA (Document understanding) | 95.3 | 92.8 | 95.2 |
| OCRBench (Text extraction) | 875 | 736 | 788 |
Source: Qwen2.5-VL Technical Report (arXiv)
Compared to its sibling models, Qwen2.5-VL-72B offers greater reasoning depth. While the smaller Qwen2.5-VL-7B is efficient for basic image captioning and the Qwen2.5-VL-32B offers a middle ground for edge deployments, the 72B variant is the better fit for complex agentic workflows, dense document omni-parsing, and hour-long video comprehension. On OCRBench_v2, the 72B model demonstrated strong multilingual capabilities, outperforming Gemini 1.5-Pro by 9.6 percent in English and 20.6 percent in Chinese. For enterprise applications where accuracy is paramount, the 72B model is the strongest choice in the Qwen2.5-VL lineup.
Using it in production
Production configuration for Qwen2.5-VL-72B
Deploying Qwen2.5-VL-72B in production requires careful management of multimodal inputs to optimize both latency and cost. The model supports a long context window, so a single API request can carry multiple high-resolution images or a long video sequence. However, because image token counts scale with resolution under the Vision Transformer's dynamic-resolution processing, sending raw, uncompressed 4K images will rapidly inflate your input token count. We recommend resizing images to the minimum resolution necessary for your specific task, such as 1024x1024 for standard OCR, to keep inference fast and cost-effective.
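To see why resizing matters, here is a back-of-the-envelope estimate of visual token counts. It assumes roughly one token per 28x28-pixel block, which matches the publicly documented Qwen2.5-VL preprocessing (14-pixel patches merged 2x2), and it ignores any pixel caps the serving stack may apply. How Lyceum counts image tokens for billing is not stated here, so treat the results as approximate. Both helper names are hypothetical.

```python
def estimate_visual_tokens(width: int, height: int, block: int = 28) -> int:
    """Approximate visual tokens, assuming one token per block x block pixels."""
    cols = max(1, round(width / block))
    rows = max(1, round(height / block))
    return cols * rows


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale dimensions down so the longer side is at most max_side."""
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


raw = (3840, 2160)                       # a 4K frame
small = fit_within(*raw, max_side=1024)  # downscaled, aspect ratio kept
print(estimate_visual_tokens(*raw), estimate_visual_tokens(*small))
```

Under these assumptions the downscaled image needs well under a tenth of the tokens of the 4K original, which is the effect the resizing recommendation is aiming for.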
When calling the model via Lyceum Technology, you are utilizing our Standard tier. This tier is provisioned on high-end NVIDIA GPUs in our eu-north1 region, ensuring the compute density required for a 72-billion parameter model. The Standard tier prioritizes high-capability execution, making it ideal for complex reasoning, structured JSON extraction, and agentic tool use.
Lyceum charges strictly per token, keeping costs predictable. At $0.25 per million input tokens and $0.75 per million output tokens, a typical document parsing workload is economical. For example, processing 1,000 invoices, where each invoice consumes roughly 2,000 input tokens for the image and prompt and generates 300 output tokens for the extracted JSON, results in a total consumption of 2 million input tokens and 300,000 output tokens. This batch costs $0.50 for input and $0.225 for output, totaling $0.725. You achieve OCR and data extraction without the overhead of maintaining dedicated infrastructure.
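The invoice arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses the per-million-token prices quoted in this article; the function name `batch_cost` is hypothetical, and the per-request token counts are the illustrative figures from the example rather than measured values.

```python
INPUT_USD_PER_M = 0.25   # price per million input tokens (from this article)
OUTPUT_USD_PER_M = 0.75  # price per million output tokens (from this article)


def batch_cost(requests: int, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a batch with fixed per-request token counts."""
    total_in = requests * input_tokens
    total_out = requests * output_tokens
    return (total_in * INPUT_USD_PER_M + total_out * OUTPUT_USD_PER_M) / 1_000_000


# 1,000 invoices at roughly 2,000 input and 300 output tokens each.
print(f"${batch_cost(1_000, 2_000, 300):.3f}")
```

Swapping in token counts measured from your own requests gives a more realistic estimate than the round numbers used here.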
Running Qwen2.5-VL-72B on EU-sovereign infrastructure
Why run Qwen2.5-VL-72B on Lyceum
For European enterprises and AI startups, data residency is not optional. When processing sensitive multimodal data, such as medical scans, financial documents, or factory floor security footage, sending that data to US-based hyperscalers or API providers introduces severe compliance risks. Lyceum Technology solves this by serving Qwen2.5-VL-72B entirely from our eu-north1 region. Your images, videos, and prompts are processed on EU-sovereign infrastructure, ensuring strict adherence to GDPR and providing a clear path to AI Act compliance.
Unlike many API providers that act as middlemen renting capacity from AWS or Google Cloud, Lyceum owns and operates its GPU infrastructure. This structural advantage allows us to offer Qwen2.5-VL-72B at highly competitive per-token rates without base fees or egress charges. You get the performance of a 72-billion parameter model without the financial burden of provisioning dedicated H100 clusters.
The platform prevents vendor lock-in through open-stack transparency. The inference engine is built on open-source technologies like vLLM and NVIDIA Dynamo, ensuring that you can self-host your LLM API on EU infrastructure if scale demands it. A drop-in OpenAI-compatible API makes it trivial for engineering teams to switch from proprietary US models to Qwen2.5-VL-72B. Updating the base URL and API key instantly upgrades the application with multimodal capabilities while securing data within Europe.