Dedicated Inference

Reserved GPU capacity for your models.

Deploy your own models on dedicated hardware. Consistent latency, no cold starts, full control.

[Dashboard preview: llama-3.3-70b-ft (Healthy) · Replicas at 73% / 68% / 45% utilization · 1.2K req/min · 89 ms p50 · 99.9% uptime]

Deploy any HuggingFace model or your own custom container

Two paths to production. Choose based on your workflow.

HuggingFace Models

Any model from the Hub

Auto GPU
Terminal
$ lyceum deploy infer meta-llama/Llama-3.3-70B-Instruct
Analyzing model requirements...
Selected GPU: H100 80GB
Provisioning endpoint...
Endpoint ready: llama-3.3-70b-instruct
$ lyceum infer chat -m llama-3.3-70b-instruct
You: Explain quantum computing in simple terms
AI: Quantum computing uses quantum bits that can be 0 and 1 simultaneously...

We automatically select the optimal GPU based on model architecture and size. No configuration needed.
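
As a rough illustration of how model size drives GPU choice, here is a back-of-the-envelope VRAM estimate. This is a generic heuristic, not Lyceum's actual selection logic:

Python
# Rough VRAM estimate for serving a model: weights plus ~20% headroom
# for KV cache and activations. Heuristic only, not Lyceum's algorithm.
def estimate_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param * 1.2

print(estimate_vram_gb(70, 2.0))  # bf16: ~168 GB, needs multi-GPU
print(estimate_vram_gb(70, 0.5))  # 4-bit quantized: ~42 GB, fits one H100 80GB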

Custom Container

Any Docker image

-m gpu.a100
Terminal
$ lyceum docker run myapp:latest -m gpu.a100
Pulling image...
Allocating A100 80GB...
Starting container...
Endpoint: https://d7f2a1b3-8080.port.lyceum.technology
$ curl -X POST .../predict -d '{"input": "test"}'
{ "result": "prediction_output", "latency_ms": 12 }

Choose your GPU with the -m flag.
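
Inside the container, any HTTP server works. Below is a minimal sketch of a /predict handler matching the curl call above; the route, payload shape, and port come from that example, while FastAPI and the placeholder model call are illustrative assumptions:

Python
# Minimal inference server sketch. Run with:
#   uvicorn main:app --host 0.0.0.0 --port 8080
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    input: str

@app.post("/predict")
def predict(req: PredictRequest):
    start = time.perf_counter()
    result = "prediction_output"  # placeholder: your model call goes here
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {"result": result, "latency_ms": latency_ms}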

OpenAI-compatible

Drop-in replacement for the OpenAI API. Change one line of code and your existing apps just work.
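
For example, with the official openai Python client you only swap the base URL and key. The endpoint URL and API key below are placeholders for your own deployment:

Python
from openai import OpenAI

# Point the standard OpenAI client at your dedicated endpoint.
client = OpenAI(
    base_url="https://your-endpoint.lyceum.technology/v1",  # placeholder URL
    api_key="YOUR_LYCEUM_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(resp.choices[0].message.content)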

Consistent latency

Guaranteed response times with dedicated GPU allocation. No noisy neighbors, no variable performance.

No infrastructure management

We handle scaling, updates, and monitoring. You focus on building your product.

Custom model deployments

Deploy any model: fine-tuned weights, private models, or custom containers with your own architecture.

SLA guarantees

Enterprise-grade uptime with dedicated support. 99.9% availability commitment for production workloads.

Private endpoints

Isolated infrastructure with no shared resources. Your models run on hardware dedicated to you.

// Comparison

Serverless vs. Dedicated

Serverless

Coming Soon
  • Per-token billing
  • Catalog models
  • Variable traffic

Dedicated

  • Per-second billing
  • Your custom models
  • Easily scale up and down

GPU options

Choose the right GPU for your inference workload. Per-second pricing; scale up or down automatically.

GPU                     VRAM     Price/hour
NVIDIA B200             192 GB   $5.89
NVIDIA H200             141 GB   $3.69
NVIDIA H100             80 GB    $3.29
NVIDIA A100 (Default)   80 GB    $1.99
NVIDIA L40S             48 GB    $1.49
NVIDIA T4               16 GB    $0.39
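
As a quick sanity check on what per-second billing means, using the A100 rate from the table:

Python
# Per-second cost derived from the hourly rate above.
A100_PER_HOUR = 1.99
per_second = A100_PER_HOUR / 3600   # ~$0.000553 per second
print(f"90-second burst: ${per_second * 90:.4f}")  # ~$0.0497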

Ready to deploy?

Get dedicated GPU capacity for your models.