Dedicated Inference

Reserved GPU capacity for your models.

Deploy your own models on dedicated hardware. Consistent latency, no cold starts, full control.

[Dashboard preview: llama-3.3-70b-ft (Healthy) · Replicas at 73% / 68% / 45% utilization · 1.2K req/min · 89 ms p50 · 99.9% uptime]

Deploy any HuggingFace model or your own custom container

Two paths to production. Choose based on your workflow.

HuggingFace Models

Any model from the Hub

Auto GPU
Terminal
$ lyceum deploy infer meta-llama/Llama-3.3-70B-Instruct
Analyzing model requirements...
Selected GPU: H100 80GB
Provisioning endpoint...
Endpoint ready: llama-3.3-70b-instruct
$ lyceum infer chat -m llama-3.3-70b-instruct
You: Explain quantum computing in simple terms
AI: Quantum computing uses quantum bits that can be 0 and 1 simultaneously...

We automatically select the optimal GPU based on model architecture and size. No configuration needed.
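
As a rough illustration of how model size drives GPU choice, here is a back-of-the-envelope VRAM estimate. This is a generic heuristic, not Lyceum's actual selection logic:

Python
# Rough VRAM estimate for serving a model: weights plus ~20% headroom
# for KV cache and activations. Heuristic only, not Lyceum's algorithm.
def estimate_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param * 1.2

print(estimate_vram_gb(70, 2.0))  # bf16: ~168 GB, needs multi-GPU
print(estimate_vram_gb(70, 0.5))  # 4-bit quantized: ~42 GB, fits one H100 80GB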

Custom Container

Any Docker image

-m gpu.a100
Terminal
$ lyceum docker run myapp:latest -m gpu.a100
Pulling image...
Allocating A100 80GB...
Starting container...
Endpoint: https://d7f2a1b3-8080.port.lyceum.technology
$ curl -X POST .../predict -d '{"input": "test"}'
{ "result": "prediction_output", "latency_ms": 12 }

Choose your GPU with the -m flag.
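
Inside the container, any HTTP server works. Below is a minimal sketch of a /predict handler matching the curl call above; the route, payload shape, and port come from that example, while FastAPI and the placeholder model call are illustrative assumptions:

Python
# Minimal inference server sketch. Run with:
#   uvicorn main:app --host 0.0.0.0 --port 8080
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    input: str

@app.post("/predict")
def predict(req: PredictRequest):
    start = time.perf_counter()
    result = "prediction_output"  # placeholder: your model call goes here
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {"result": result, "latency_ms": latency_ms}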

OpenAI-compatible

Drop-in replacement for the OpenAI API. Change one line of code and your existing apps just work.
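
For example, with the official openai Python client you only swap the base URL and key. The endpoint URL and API key below are placeholders for your own deployment:

Python
from openai import OpenAI

# Point the standard OpenAI client at your dedicated endpoint.
client = OpenAI(
    base_url="https://your-endpoint.lyceum.technology/v1",  # placeholder URL
    api_key="YOUR_LYCEUM_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(resp.choices[0].message.content)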

Consistent latency

Guaranteed response times with dedicated GPU allocation. No noisy neighbors, no variable performance.

No infrastructure management

We handle scaling, updates, and monitoring. You focus on building your product.

Custom model deployments

Deploy any model: fine-tuned weights, private models, or custom containers with your own architecture.

SLA guarantees

Enterprise-grade uptime with dedicated support. 99.9% availability commitment for production workloads.

Private endpoints

Isolated infrastructure with no shared resources. Your models run on hardware dedicated to you.

// Comparison

Serverless vs. Dedicated

Serverless

Coming Soon
  • Per-token billing
  • Catalog models
  • Variable traffic

Dedicated

  • Per-second billing
  • Your custom models
  • Easily scale up and down

GPU options

Choose the right GPU for your inference workload. Per-second pricing; scale up or down automatically.

GPU                     VRAM     Price/hour
NVIDIA B200             192 GB   $5.89
NVIDIA H200             141 GB   $3.69
NVIDIA H100             80 GB    $3.29
NVIDIA A100 (Default)   80 GB    $1.99
NVIDIA L40S             48 GB    $1.49
NVIDIA T4               16 GB    $0.39
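
As a quick sanity check on what per-second billing means, using the A100 rate from the table:

Python
# Per-second cost derived from the hourly rate above.
A100_PER_HOUR = 1.99
per_second = A100_PER_HOUR / 3600   # ~$0.000553 per second
print(f"90-second burst: ${per_second * 90:.4f}")  # ~$0.0497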

Ready to deploy?

Get dedicated GPU capacity for your models.