MiniCPM-V 4.5: specs, benchmarks, and how to run it on Lyceum

An 8B-parameter vision-language model with state-of-the-art video compression and OCR capabilities.

Magnus Grünewald

June 21, 2026 · CEO at Lyceum Technology

MiniCPM-V 4.5 is an 8-billion parameter multimodal large language model developed by OpenBMB. Built on the Qwen3-8B and SigLIP2-400M architectures, it excels in high-resolution image parsing, OCR, and long-video understanding. By utilizing a unified 3D-Resampler, it achieves a 96x compression rate for video tokens, drastically reducing inference costs for video workloads. Lyceum Technology serves MiniCPM-V 4.5 via our OpenAI-compatible Serverless Inference API. Hosted entirely in our eu-north1 region, it provides European AI teams with a GDPR-compliant, drop-in replacement for proprietary vision models.

Get started: call MiniCPM-V 4.5 on Lyceum

You can integrate MiniCPM-V 4.5 into your application using Lyceum's OpenAI-compatible API. Because our inference API is designed as a drop-in replacement for standard OpenAI SDKs, you only need to update your base URL and API key to start processing images, documents, and video frames. You can avoid writing custom integration code or managing complex multimodal container deployments.

from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Send one user message containing a text prompt and an image URL.
response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4_5",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
    ]}],
)
print(response.choices[0].message.content)
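
If the image lives on local disk rather than at a public URL, one common pattern with OpenAI-style chat APIs is to send it as a base64 data URL. The sketch below assumes the Lyceum endpoint accepts data URLs in the `image_url` field, which this article does not confirm. The helper names `image_to_data_url` and `image_message` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt: str, image_url: str) -> list[dict]:
    """Build the `messages` list for a single-image chat request."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumption: the endpoint accepts data URLs as well as http(s) URLs.
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`, for example `image_message("What's in this image?", image_to_data_url("scan.png"))`.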

Pricing and region for MiniCPM-V 4.5

MiniCPM-V 4.5 is available on Lyceum's Fast tier, which is optimized for cost-efficient, high-throughput workloads like batch document processing and continuous video stream analysis. The model is priced at $0.66 per million input tokens and $1.11 per million output tokens. All inference requests for this model are processed exclusively in our eu-north1 region. This ensures strict data residency and GDPR compliance for European engineering teams, making it safe to process sensitive visual data without routing it through US-based servers.

What MiniCPM-V 4.5 is good at

High-FPS and long video understanding

Most multimodal large language models struggle with video inputs because the token count scales linearly with the number of frames. This rapidly leads to out-of-memory errors, truncated context windows, and prohibitively high inference costs. MiniCPM-V 4.5 addresses this bottleneck with a unified 3D-Resampler architecture that jointly compresses six 448x448 video frames into just 64 tokens, a 96x compression rate. This allows the model to process high-FPS video (up to 10 FPS) and long-duration videos efficiently, achieving state-of-the-art results on benchmarks like Video-MME and LVBench for models under 30B parameters.
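
The 96x figure can be checked with a little arithmetic. The sketch below assumes a 14-pixel patch size for the vision encoder, which this article does not state but which is consistent with the quoted ratio. It also assumes that a partially filled group of frames still costs a full 64 tokens. `video_tokens` is a hypothetical helper for rough budgeting, not a description of the model's internals.

```python
import math

FRAMES_PER_GROUP = 6    # frames compressed together (from the article)
TOKENS_PER_GROUP = 64   # tokens emitted per group (from the article)


def patches_per_frame(side_px: int, patch_px: int = 14) -> int:
    """Non-overlapping square patches in a square frame (patch size assumed)."""
    return (side_px // patch_px) ** 2


def video_tokens(num_frames: int) -> int:
    """Rough visual-token estimate; assumes partial groups cost a full group."""
    groups = math.ceil(num_frames / FRAMES_PER_GROUP)
    return groups * TOKENS_PER_GROUP


raw_patches = FRAMES_PER_GROUP * patches_per_frame(448)
print(raw_patches)                     # 6144
print(raw_patches / TOKENS_PER_GROUP)  # 96.0
print(video_tokens(600))               # 6400 (60 seconds at 10 FPS)
```

Under these assumptions, a one-minute clip sampled at 10 FPS costs about 6,400 visual tokens.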

High-resolution OCR and document parsing

Built on the advanced LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio, supporting up to 1.8 million pixels (for example, a 1344x1344 image). Crucially, it achieves this while using 4x fewer visual tokens than standard MLLMs. This makes the model exceptionally strong at Optical Character Recognition (OCR), extracting dense text from scanned documents, and parsing complex tables into structured Markdown formats for downstream data processing pipelines.
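
A client-side check can keep uploads within the stated pixel budget. The sketch below treats 1344x1344 (1,806,336 pixels) as the limit, taken from the example above. Whether the service downsizes larger images itself is not covered here, so the pre-scaling step is an assumption, and `fit_to_pixel_budget` is a hypothetical helper.

```python
import math

# Assumption: the "1.8 million pixels" limit equals the 1344x1344 example.
MAX_PIXELS = 1344 * 1344


def fit_to_pixel_budget(width: int, height: int,
                        max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return a (width, height) that fits max_pixels, preserving aspect ratio."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(max_pixels / pixels)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_to_pixel_budget(1344, 1344))  # (1344, 1344)
print(fit_to_pixel_budget(4000, 3000))  # scaled down to fit the budget
```

The returned size can be passed to any image library's resize call before encoding the image for upload.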

Controllable hybrid fast/deep thinking

The model introduces a novel hybrid reasoning approach. It supports a "fast thinking" mode for standard, latency-sensitive visual queries, and a "deep thinking" mode for complex problem-solving and multi-step reasoning tasks. This allows machine learning engineers to toggle between raw speed and analytical depth depending on the specific requirements of their production workload, optimizing both latency and compute costs.

Benchmarks and how it compares

MiniCPM-V 4.5 benchmark results

MiniCPM-V 4.5 was evaluated across a wide range of multimodal benchmarks, demonstrating performance that rivals much larger models. It achieves an average score of 77.0 on the OpenCompass multimodal evaluation suite, making it one of the most capable vision-language models under 30B parameters.

| Benchmark | MiniCPM-V 4.5 (8B) | Qwen2.5-VL (72B) |
| --- | --- | --- |
| OpenCompass (Average) | 77.0 | ~75.0 |
| MME | 2500 | - |
| MMBench V1.1 | 84.2 | - |
| MMVet | 75.5 | - |

Source: OpenBMB MiniCPM-V 4.5 Technical Report and Hugging Face Model Card.

Comparison to sibling models

Compared to its predecessor, MiniCPM-V 2.6, the 4.5 version introduces the 3D-Resampler for massive video token compression and the hybrid thinking mode, resulting in significantly better performance on long-video benchmarks like Video-MME. When compared to other models in the Lyceum catalogue, such as Qwen2.5-VL 7B, MiniCPM-V 4.5 offers superior video compression. According to the technical report, MiniCPM-V 4.5 achieves state-of-the-art performance on Video-MME while using just 46.7% of the GPU memory cost and 8.7% of the inference time compared to Qwen2.5-VL 7B. This makes it the clear choice for teams processing large volumes of video data where compute efficiency is paramount.

Using it in production

Production configuration for MiniCPM-V 4.5

When deploying MiniCPM-V 4.5 for production workloads, understanding its token economics is critical for optimizing costs. Because of its LLaVA-UHD architecture, high-resolution images (up to 1.8M pixels) are processed using significantly fewer visual tokens than traditional vision models. For video, the 96x compression rate means you can pass substantial video context without exhausting the context window or inflating your per-token costs.

At $0.66 per million input tokens and $1.11 per million output tokens on Lyceum's Fast tier, the model is highly cost-effective for batch OCR and video analysis. For example, processing a batch of 1,000 document images, assuming roughly 1,000 input tokens per image and 200 output tokens for the extracted Markdown, would cost approximately $0.66 for the input and $0.22 for the output, totaling less than $1.00 for the entire batch. This makes it highly viable for large-scale enterprise document parsing.
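
The batch estimate above can be reproduced, and adapted to other batch sizes, with a small calculator. This is a sketch using the Fast tier prices quoted in this article. The per-image token counts are the same rough assumptions as in the example, and `estimate_cost_usd` is a hypothetical helper.

```python
INPUT_USD_PER_MILLION = 0.66    # Fast tier input price quoted above
OUTPUT_USD_PER_MILLION = 1.11   # Fast tier output price quoted above


def estimate_cost_usd(requests: int, input_tokens_each: int,
                      output_tokens_each: int) -> tuple[float, float, float]:
    """Return (input_cost, output_cost, total_cost) in USD for a batch."""
    input_cost = requests * input_tokens_each / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = requests * output_tokens_each / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost, output_cost, input_cost + output_cost


# 1,000 images, roughly 1,000 input and 200 output tokens each (assumed).
input_cost, output_cost, total = estimate_cost_usd(1_000, 1_000, 200)
print(f"{input_cost:.3f} {output_cost:.3f} {total:.3f}")  # 0.660 0.222 0.882
```

Actual token counts vary with image resolution and output length, so the usage figures returned by the API are the reliable source for billing.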

For video workloads, you should extract frames at your desired FPS (up to 10 FPS is supported efficiently) and pass them as a sequence of image inputs in the API call. The model's architecture will automatically handle the compression. Ensure your application handles standard OpenAI SDK streaming responses if you are utilizing the model's text generation capabilities for real-time video narration or chat. You can also toggle the hybrid thinking mode via system prompts depending on whether you need low-latency responses or deep reasoning.
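
The frame-sequence approach described above might look like the following sketch. It assumes frames have already been extracted and are reachable as URLs (or data URLs), and that the endpoint accepts several `image_url` parts in one message. The helper names are hypothetical. The streaming helper uses the standard OpenAI SDK `stream=True` option.

```python
def sample_frame_indices(total_frames: int, source_fps: float,
                         target_fps: float) -> list[int]:
    """Indices of the frames to keep when downsampling to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        return list(range(total_frames))
    step = source_fps / target_fps
    indices = []
    k = 0
    while (idx := int(k * step)) < total_frames:
        indices.append(idx)
        k += 1
    return indices


def video_message(prompt: str, frame_urls: list[str]) -> list[dict]:
    """Build one user message: a text prompt followed by ordered frames."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": url}}
                for url in frame_urls]
    return [{"role": "user", "content": content}]


def stream_reply(client, messages, model="openbmb/MiniCPM-V-4_5"):
    """Yield text fragments from a streaming chat completion."""
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# A 10-second clip recorded at 30 FPS, sampled at 10 FPS, keeps 100 frames.
print(len(sample_frame_indices(300, 30, 10)))  # 100
```

With a configured `client`, the fragments can be printed as they arrive: `for text in stream_reply(client, video_message("Describe the clip.", urls)): print(text, end="")`.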

Running MiniCPM-V 4.5 on EU-sovereign infrastructure

Why run MiniCPM-V 4.5 on Lyceum

For European AI startups and enterprise engineering teams, data residency is often a strict, non-negotiable requirement. Processing sensitive documents, medical imagery, or proprietary factory video feeds through US-based hyperscalers can violate internal compliance policies or GDPR mandates. Lyceum Technology solves this by hosting MiniCPM-V 4.5 entirely within our eu-north1 region, ensuring your data never leaves European borders.

By using Lyceum, you get the ease of an OpenAI-compatible API without the compliance risks of routing data outside the EU. Our infrastructure is built on owned NVIDIA GPUs, giving us a structural cost advantage over API providers that simply rent compute from hyperscalers. This allows us to offer per-second billing and highly competitive per-token pricing without sacrificing performance or reliability.

Furthermore, Lyceum's open-stack transparency, powered by vLLM and NVIDIA Dynamo, ensures you avoid vendor lock-in. You can seamlessly transition from our Serverless Inference API to a dedicated self-hosted LLM API on EU infrastructure as your workload scales. You can start prototyping with MiniCPM-V 4.5 on a pay-per-token basis today and move to reserved VMs later, maintaining the exact same codebase and compliance posture throughout your entire scaling journey.

Frequently Asked Questions

What is the pricing for MiniCPM-V 4.5 on Lyceum?

MiniCPM-V 4.5 is available on Lyceum's Fast tier. It costs $0.66 per million input tokens and $1.11 per million output tokens. There are no base fees or minimum commitments, and you only pay for the exact tokens you process.

How do I call MiniCPM-V 4.5 using the API?

You can call the model using the standard OpenAI SDK. Set your base URL to the Lyceum serverless endpoint shown in the Get started example above, use your Lyceum API key, and specify `openbmb/MiniCPM-V-4_5` as the model string in your chat completions request.

Is MiniCPM-V 4.5 GDPR compliant?

Yes. When you run MiniCPM-V 4.5 on Lyceum Technology, all inference requests are processed in our `eu-north1` region. Your data remains entirely within European borders, ensuring full GDPR compliance and data sovereignty for sensitive workloads.

How does MiniCPM-V 4.5 handle video inputs?

The model uses a unified 3D-Resampler architecture that compresses six 448x448 video frames into just 64 tokens. This 96x compression rate allows it to process high-FPS and long-duration videos much more efficiently than standard vision-language models.

What is the difference between fast and deep thinking modes?

MiniCPM-V 4.5 features a controllable hybrid reasoning system. "Fast thinking" is optimized for standard, low-latency visual queries and OCR tasks, while "deep thinking" allocates more compute for complex problem-solving and multi-step reasoning on visual inputs.

How does MiniCPM-V 4.5 compare to Qwen2.5-VL?

While Qwen2.5-VL 72B is a much larger model, MiniCPM-V 4.5 (8B) achieves comparable or superior performance on several multimodal benchmarks, including OpenCompass. It is also more efficient for video tasks: according to the technical report, it reaches its Video-MME results with 46.7% of the GPU memory cost and 8.7% of the inference time of Qwen2.5-VL 7B.
