Reserved GPU capacity for your models.
Deploy your own models on dedicated hardware. Consistent latency, no cold starts, full control.
Deploy any HF model or your own custom container
Two paths to production. Choose based on your workflow.
HuggingFace Models
Any model from the Hub
We automatically select the optimal GPU based on model architecture and size. No configuration needed.
Custom Container
Any Docker image
Choose your GPU with the `-m` flag.
OpenAI-compatible
Drop-in replacement for OpenAI API. Change one line of code and your existing apps just work.
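As a sketch of the "one line" claim: because the endpoint speaks the OpenAI chat-completions schema, only the base URL changes. The URL, model name, and bearer-token auth below are placeholder assumptions, not real values; the request is built but never sent.

```python
import json
import urllib.request

# Hypothetical endpoint URL -- substitute your deployment's URL.
BASE_URL = "https://my-endpoint.example.com/v1"

# Standard OpenAI chat-completions payload; existing OpenAI client
# code keeps working once it points at BASE_URL.
payload = {
    "model": "my-model",  # assumption: whatever name you deployed under
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    url=f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # assumption: bearer-token auth
    },
    method="POST",
)

# With the official `openai` SDK, the one-line change is the base_url:
#   client = OpenAI(base_url=BASE_URL, api_key="YOUR_API_KEY")
```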
Consistent latency
Guaranteed response times with dedicated GPU allocation. No noisy neighbors, no variable performance.
No infrastructure management
We handle scaling, updates, and monitoring. You focus on building your product.
Custom model deployments
Deploy any model: fine-tuned weights, private models, or custom containers with your own architecture.
SLA guarantees
Enterprise-grade uptime with dedicated support. 99.9% availability commitment for production workloads.
Private endpoints
Isolated infrastructure with no shared resources. Your models run on hardware dedicated to you.
Serverless vs. Dedicated
Serverless (coming soon)
- Per-token billing
- Catalog models
- Variable traffic
Dedicated
- Per-second billing
- Your custom models
- Easily scale up and down
GPU options
Choose the right GPU for your inference workload. Per-second pricing; scale up or down automatically.
| GPU | VRAM | Price/hour |
|---|---|---|
| NVIDIA B200 | 192 GB | $5.89 |
| NVIDIA H200 | 141 GB | $3.69 |
| NVIDIA H100 | 80 GB | $3.29 |
| NVIDIA A100 (default) | 80 GB | $1.99 |
| NVIDIA L40S | 48 GB | $1.49 |
| NVIDIA T4 | 16 GB | $0.39 |