How does Lyceum handle scale-to-zero for dedicated instances?

Lyceum allows you to set the minimum number of replicas to zero. When no traffic is detected, the instance is de-provisioned to save costs. Upon a new request, our optimized 18-second provisioning process restarts the instance, though the first request will experience a slight cold-start delay.

Can I use my own Docker images for dedicated inference?

Absolutely. Lyceum's Inference Engine is designed to be flexible. You can deploy any model from Hugging Face or submit your own custom Docker image, giving you full control over the software stack and inference engine.

What GPUs are available for dedicated inference?

We offer a range of NVIDIA GPUs across our European data centers, including H100, A100, and the latest B200 and H200 models. Our 40+ supply-side partners ensure availability even during global shortages.

Is there an API for managing these dedicated instances?

Yes, Lyceum provides a fully OpenAI-compatible API. This means you can use existing SDKs and simply change the base URL to our endpoint, making it a drop-in replacement for your current workflow.

Are there any egress fees for moving data?

No. Lyceum does not charge egress fees. We provide free S3-compatible storage and do not penalize you for moving your data or model weights in and out of our infrastructure.

Dedicated vs Shared GPU Inference: 2026 Technical Guide

<p>Infrastructure used for development rarely sustains production-grade AI scale. ML engineers and infrastructure leads face a bifurcated market. On one side, shared serverless environments offer low entry costs but introduce the risk of 'noisy neighbors' and unpredictable cold starts. On the other, <a href="/magazine/deploy-private-llm-endpoint-gpu-cloud">dedicated GPU instances</a> provide the raw performance and isolation required for enterprise-grade SLAs, yet they demand more sophisticated orchestration to remain cost-effective. For European teams, this choice is further complicated by the enforcement of the EU AI Act and the necessity of provable data residency. Understanding the technical trade-offs between these two models is essential for avoiding the architectural debt that often follows rapid growth.</p>

The Architecture of Isolation: Why Memory Bandwidth Matters

The primary bottleneck for Large Language Model (LLM) inference is memory bandwidth rather than compute cores. When you utilize shared GPU infrastructure, you are often competing for access to the High Bandwidth Memory (HBM3 or HBM3e) bus. Even with modern virtualization, multi-tenant environments can suffer from performance degradation when a neighboring container initiates a massive KV-cache update or a large model load.

Dedicated GPU inference eliminates this contention. By securing exclusive access to an NVIDIA H100 or B200, your application maintains the full 3.35 TB/s to 8 TB/s of bandwidth provided by the hardware. This isolation is critical for maintaining stable Time Per Output Token (TPOT) metrics. In production environments, a 10 percent variance in latency might seem negligible, but for interactive applications like real-time coding assistants or medical diagnostic tools, that variance can lead to a degraded user experience or timeout errors.

Dedicated Inference
Full access to HBM3e bandwidth, zero interference from other workloads, and predictable P99 latency.
Shared Inference
Multiplexed memory access, potential 'noisy neighbor' effects, and variable latency during peak regional demand.

Lyceum provides dedicated inference endpoints where the machine is exclusively yours. This ensures that your model performance remains deterministic, regardless of what other teams on the platform are doing. By utilizing NVIDIA software and vLLM, we provide an open-stack orchestration layer that maximizes the efficiency of these dedicated resources without the black-box limitations of proprietary engines.

The Economics of Scale: Finding the Crossover Point

The financial argument for shared inference is built on the premise of low utilization. If your model only processes a few hundred requests per day, paying for a dedicated H100 at a fixed hourly rate is inefficient. However, as your traffic grows, the per-token cost of shared APIs quickly surpasses the hourly cost of a dedicated instance. According to industry reports, the 'crossover point' occurs as sustained utilization increases.

Consider a scenario where a team is serving a Llama 3.1 70B model. On a shared, per-token API, high-volume usage can result in monthly bills that far exceed the cost of a reserved instance. By moving to a dedicated VM, teams can realize significant cost savings compared to traditional hyperscalers. For example, while H100 instances carry high costs on major US-based clouds, Lyceum provides the same hardware in European data centers, with per-second billing and no egress fees.

To optimize these economics, we implement a scale-to-zero capability. This allows your dedicated machine to shut down during periods of inactivity, such as overnight or between batch processing runs. You only pay for the uptime required to serve your traffic, effectively bridging the gap between the flexibility of serverless and the performance of dedicated hardware.

Compliance as a Technical Requirement: The EU Moat

For European startups and scale-ups, the choice of infrastructure is often dictated by legal necessity rather than just performance. The EU AI Act and GDPR impose strict requirements on where data is processed and how it is protected. Many shared inference providers are based in the US and host data on US-owned servers, which can be a deal-breaker for teams in regulated industries like healthcare, pharma, or defense.

Using a US-based shared provider often means your data is subject to the Cloud Act, potentially violating EU data sovereignty. Lyceum is an EU-native inference platform that operates entirely on European soil. Our infrastructure is owned and operated within the EU, ensuring that your inference workloads never leave the jurisdiction. This is not just a compliance checkbox; it is a competitive advantage when selling to European enterprises that require provable data residency.

We prioritize C5, ISO 27001, and AI Act compliance, treating compliance as a core part of our technical stack. For a medical imaging company or a legal-tech startup, the ability to tell a customer that their data is processed on a dedicated, GDPR-compliant GPU in a secure European data center is a powerful differentiator that shared, multi-tenant US clouds cannot easily replicate.

Dedicated vs Shared GPU Inference: Scaling AI Infrastructure

The Architecture of Isolation: Why Memory Bandwidth Matters

Dedicated Inference

Shared Inference

The Economics of Scale: Finding the Crossover Point

Compliance as a Technical Requirement: The EU Moat

Frequently Asked Questions

How does Lyceum handle scale-to-zero for dedicated instances?

Can I use my own Docker images for dedicated inference?

What GPUs are available for dedicated inference?

Is there an API for managing these dedicated instances?

Are there any egress fees for moving data?

Further Reading

Related Resources

Related Articles

vLLM vs TensorRT-LLM: Production Benchmark & Guide

Serverless GPU Cold Start Latency: Architecture Comparison

LLM Inference Tokens Per Second: 2026 Hardware and Software Benchmarks

Inference

Training