The Guide to Serving Fine-Tuned LLMs in Production

How to manage memory, choose your inference engine, and scale multi-LoRA architectures without burning your infrastructure budget.

Caspar Lehmkühler

May 30, 2026 · Head of Product at Lyceum Technology

Early fine-tuning setups required racks of GPUs, custom pipelines, and weeks of trial and error. Today the training phase is largely solved, and the real bottleneck has shifted to deployment. When you have a base model and ten different fine-tuned variants for specific tasks, how do you serve them without dedicating a high-end GPU to each one? Traditional deployment methods fail to scale cost-effectively. Scaling well requires understanding the memory requirements, choosing the right serving engine, and adopting an infrastructure strategy that matches your traffic.

The Memory Math of Fine-Tuned Inference

The memory math explains why serving fine-tuned models is difficult. When you deploy a Large Language Model (LLM), your GPU VRAM is consumed by two primary components: the model weights and the KV (Key-Value) cache.

A 70-billion parameter model illustrates the scale of this problem. In 16-bit precision (FP16), the weights alone consume roughly 140GB of VRAM. If you fine-tune five different variants of this model (perhaps one for legal document parsing, one for code generation, and three for different customer support personas), loading five full copies requires 700GB of VRAM. That exceeds the 640GB available on a single 8x H100 (80GB) node, so you would need more than one such node just to hold the weights in memory, before processing a single user request.

The KV Cache Bottleneck

The weights are static, but the KV cache is dynamic. During inference, the model stores the Key and Value states of past tokens to avoid recomputing them for every new token generated. The size of the KV cache scales linearly with the sequence length and the batch size.

  • KV Cache Formula

    2 * sequence_length * layers * hidden_size * batch_size * bytes_per_parameter
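
To make the formula concrete, here is a minimal sketch that evaluates it for one illustrative configuration. The shape values (80 layers, hidden size 8192, 4096-token sequences, batch of 8) are assumptions chosen to resemble a 70B-class model, not figures from a specific deployment. The formula assumes standard multi-head attention; models that use grouped-query attention store fewer key/value heads, so their cache is smaller than this estimate.

```python
def kv_cache_bytes(seq_len, layers, hidden_size, batch_size, bytes_per_element=2):
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * seq_len * layers * hidden_size * batch_size * bytes_per_element


# Illustrative (assumed) shape for a 70B-class model with an FP16 cache.
size = kv_cache_bytes(seq_len=4096, layers=80, hidden_size=8192, batch_size=8)
print(f"{size / 1e9:.1f} GB")  # prints "85.9 GB"
```

Doubling either the sequence length or the batch size doubles the result, which is the linear scaling described above.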

For long-context applications, the KV cache can quickly grow larger than the model weights themselves. If you dedicate a GPU to a single fine-tuned model that only receives sporadic traffic, you are wasting massive amounts of expensive memory. According to a Predibase playbook on LLM distillation, serving adapter-based LLMs significantly reduces the cost and complexity of deployment compared to full fine-tuning.

Moving Away from Full Fine-Tuning

The traditional approach of full-parameter fine-tuning creates an unsustainable deployment model. Every new task requires a completely independent set of weights. As the Anyscale Blog on end-to-end LLM workflows highlights, managing these separate endpoints leads to severe underutilization of compute resources. When traffic spikes for one variant but remains flat for another, you cannot easily share compute resources across them. The memory footprint becomes the defining constraint of your infrastructure, forcing engineering teams to over-provision hardware just to keep the system stable during peak loads. This is why the industry has rapidly shifted toward parameter-efficient methods that decouple the base model from the task-specific knowledge.

vLLM vs. TensorRT-LLM: Choosing Your Engine

You cannot run production inference using basic Python scripts. You need a dedicated inference engine. Two primary frameworks dominate the landscape: vLLM and TensorRT-LLM. They solve the same problem but take entirely different philosophical approaches to memory management and execution speed.

vLLM: The King of Flexibility

vLLM revolutionized inference by introducing PagedAttention. Traditional KV cache allocation suffers from severe memory fragmentation, where up to 60 percent of memory can be wasted due to over-provisioning for maximum sequence lengths. PagedAttention solves this by treating the KV cache like virtual memory in an operating system: the cache is split into fixed-size blocks that do not need to be contiguous and are allocated only as a sequence grows. This allows vLLM to batch significantly more requests concurrently without running out of VRAM.
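
The block-based idea can be illustrated with a toy allocator. This is a simplified sketch of the concept, not vLLM's actual implementation; the class name, the block size of 16 tokens, and the pool size are all assumptions made for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class BlockAllocator:
    """Toy allocator: hands out fixed-size KV blocks from a shared free pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> tokens stored so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length == len(table) * BLOCK_SIZE:  # last block is full, or none yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())      # any free block will do
        self.lengths[req_id] = length + 1

    def release(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = BlockAllocator(num_blocks=128)
for _ in range(100):
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]))  # prints 7
```

A 100-token sequence occupies 7 blocks here. Reserving contiguous space for a 2048-token maximum up front would have tied up all 128 blocks for the same request.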

vLLM is highly dynamic. It supports continuous batching (adding new requests mid-decode) and handles heterogeneous traffic spikes exceptionally well. If you are swapping models frequently or running a wide variety of batch shapes, vLLM is the default choice for maintaining high throughput.

TensorRT-LLM: The King of Raw Speed

TensorRT-LLM, built by NVIDIA, optimizes for raw hardware efficiency. Instead of relying purely on dynamic runtime smarts, it uses ahead-of-time kernel fusion and graph capture. You compile an engine specific to your exact GPU architecture, precision (e.g., FP8), and expected sequence profile.

A benchmark analysis from the NVIDIA developer blog notes that TensorRT-LLM can achieve peak throughput of roughly 700 tokens per second on a Llama 70B model using an A100 GPU, while vLLM hits roughly 600 to 650 tokens per second. However, this speed comes at the cost of flexibility. If your traffic shape changes drastically, the compiled engine may perform sub-optimally, requiring a complete recompilation of the execution graph.

The Lyceum Approach to Inference

At Lyceum, we prioritize open-stack transparency. By standardizing on vLLM and integrating NVIDIA Dynamo, we give engineering teams high-performance inference without the vendor lock-in of proprietary, black-box serving engines. You get the flexibility of open-source with the performance optimizations required for production. This balance is critical when managing multiple fine-tuned variants, as the ability to dynamically allocate memory blocks outweighs the marginal gains of ahead-of-time compilation for most enterprise workloads.

The Multi-LoRA Serving Paradigm

If loading five full copies of a 70B model is financially unviable, what is the alternative? Multi-LoRA Serving provides the solution.

Instead of full-weight fine-tuning, modern teams use Low-Rank Adaptation (LoRA). LoRA freezes the base model weights and trains a small set of adapter weights, often representing just 1 to 2 percent of the base model size. In production, you load the massive base model into GPU memory exactly once. When a request arrives, the inference engine dynamically loads the specific LoRA adapter required for that request, injects it into the forward pass, and unloads it.
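
As a rough illustration of what "injecting an adapter into the forward pass" means, the sketch below computes a single LoRA-adapted projection in plain Python: the frozen base output W x plus a scaled low-rank update B (A x). The helper names and the tiny matrices are hypothetical and exist only to show the arithmetic; a real engine does this with batched GPU kernels.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen and shared."""
    base = matvec(W, x)               # shared base model projection
    delta = matvec(B, matvec(A, x))   # adapter-specific low-rank update
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, rank):
    """Adapter parameters for one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


W = [[1, 0], [0, 1]]   # frozen base weight (2 x 2)
A = [[1, 1]]           # rank-1 down-projection (1 x 2)
B = [[1], [0]]         # rank-1 up-projection (2 x 1)
print(lora_forward(W, A, B, x=[1, 2], alpha=2, rank=1))  # prints [7.0, 2.0]
```

For an assumed 8192 x 8192 projection, a rank-16 adapter needs `lora_param_count(8192, 8192, 16)` = 262,144 parameters against 67,108,864 in the base matrix, which is why many adapters can share one resident base model.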

Handling Rank-Induced Heterogeneity

While multi-LoRA serving is highly efficient, it introduces a new challenge: rank-induced heterogeneity. Not all adapters are the same size. You might have a rank-8 adapter for a simple classification task and a rank-128 adapter for complex reasoning.

A recent study on serving heterogeneous LoRA adapters (arXiv:2511.22880) reports that co-serving adapters of different ranks on the same base model can increase Time-to-First-Token (TTFT) by up to 84 percent if not managed correctly. The inference engine struggles to batch requests that require different amounts of compute, leading to severe pipeline stalls and inefficient GPU utilization.

Workload-Aware Dynamic Placement

To mitigate this performance degradation, production systems must implement workload-aware dynamic adapter placement. This involves grouping requests by adapter rank and utilizing fast PCIe or NVLink bandwidth to swap adapters from CPU RAM to GPU VRAM in milliseconds. By intelligently scheduling requests that share similar compute profiles, the inference engine can maintain high batch sizes without bottlenecking on the largest adapter in the queue. The Anyscale Blog emphasizes that end-to-end LLM workflows must account for these routing complexities. If your API gateway randomly distributes requests across a cluster without awareness of the underlying adapter ranks, you will inadvertently trigger constant context switching, negating the cost benefits of the multi-LoRA architecture.
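
One simple way to picture rank-aware grouping is a scheduler step that only batches requests whose adapters share a rank. The sketch below is a hypothetical illustration of that idea, not the placement algorithm from the cited research; the request fields (`id`, `rank`) and the batch limit are assumptions.

```python
from collections import defaultdict


def bucket_by_rank(requests, max_batch_size):
    """Group requests so each batch only contains adapters of the same rank."""
    by_rank = defaultdict(list)
    for req in requests:
        by_rank[req["rank"]].append(req)

    batches = []
    for rank in sorted(by_rank):  # smallest ranks first
        group = by_rank[rank]
        for i in range(0, len(group), max_batch_size):
            batches.append(group[i:i + max_batch_size])
    return batches


pending = [
    {"id": 0, "rank": 8},
    {"id": 1, "rank": 128},
    {"id": 2, "rank": 8},
    {"id": 3, "rank": 8},
    {"id": 4, "rank": 128},
]
for batch in bucket_by_rank(pending, max_batch_size=2):
    print([r["id"] for r in batch])
```

A production scheduler would also weigh queue age and adapter residency so that large-rank requests are not starved; this sketch leaves that out.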

Infrastructure Strategy: The Hyperscaler Cost Trap

The software stack is only half the battle. The hardware you run it on dictates your unit economics. Many engineering teams start by renting dedicated GPUs from public cloud hyperscalers, only to find their budgets decimated within months.

The Utilization Problem

Low utilization drives up infrastructure costs. Auto-scaling GPUs on public clouds is notoriously difficult. Providers often require block reservations for high-end hardware like H100s. This means you pay for 24/7 uptime, even when your inference traffic drops to zero overnight. If your cluster utilization hovers around 40 percent, which is common for bursty AI workloads, every useful GPU-hour effectively costs 2.5 times the list hourly rate (1 / 0.4). The Anyscale Blog notes that optimizing end-to-end LLM workflows requires tight integration between the serving layer and the underlying infrastructure to avoid these idle costs. When you are forced to reserve static blocks of compute, the financial benefits of efficient multi-LoRA serving are completely erased by the hardware bill.
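
The utilization arithmetic is easy to check. In this small sketch the list rate is normalized to 1.0 rather than tied to any real price, and the function name is an illustrative choice.

```python
def effective_hourly_rate(list_rate, utilization):
    """Cost per hour of useful work when you pay for idle time too."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return list_rate / utilization


# Normalized list rate of 1.0 at 40 percent utilization.
print(effective_hourly_rate(1.0, 0.4))  # prints 2.5
```

At 100 percent utilization the effective rate equals the list rate, which is the situation per-second billing with scale-to-zero tries to approximate.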

Sovereignty and Per-Second Billing

For European teams, data residency adds another layer of friction. Fine-tuned models often process proprietary company data, personally identifiable information (PII), or regulated financial records. Sending this data to non-EU servers without a valid legal transfer mechanism can violate GDPR and may conflict with emerging AI Act requirements. You cannot compromise on compliance just to access cheaper compute.

Lyceum provides EU-sovereign GPU infrastructure with provable data residency. Because we own our hardware rather than renting it from hyperscalers, we offer H100 virtual machines with per-second billing and no egress fees. This structural cost advantage allows you to scale to zero and pay only when your models are actively serving traffic, solving the utilization problem entirely. By combining per-second billing with efficient multi-LoRA serving, enterprise teams can deploy dozens of specialized models without the massive financial overhead typically associated with generative AI production environments.

Common Production Mistakes

Even with the right engine and infrastructure, deployment can fail if you ignore operational realities. Moving from a local testing environment to a highly available production endpoint requires a fundamental shift in how you manage resources. Teams often encounter several common pitfalls when moving fine-tuned models to production:

Ignoring Cold Starts

Loading a 40GB model from object storage into GPU VRAM takes time. If your scale-to-zero configuration does not account for this, your first user will experience a 30-second timeout. You must implement distributed caching or keep the base model loaded while only swapping adapters. The Predibase playbook on distilling large language models emphasizes that adapter-based architectures are critical for mitigating these cold starts, as loading a 200MB adapter takes milliseconds compared to minutes for a full model.
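
A back-of-the-envelope estimate shows why the adapter path is so much faster. The bandwidth figures below (1 GB/s from object storage, 25 GB/s over the host-to-GPU link) are assumptions for illustration only; measure your own environment before relying on them.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Lower bound on load time: size divided by sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# Assumed bandwidths, not measurements.
full_model = transfer_seconds(40, 1.0)    # 40GB model from object storage
adapter = transfer_seconds(0.2, 25.0)     # 200MB adapter from CPU RAM to VRAM
print(f"full model: {full_model:.0f} s, adapter: {adapter * 1000:.0f} ms")
```

Under these assumptions the full model takes tens of seconds while the adapter takes single-digit milliseconds. Real loads add deserialization and initialization overhead on top of the raw transfer time.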

Over-provisioning for Peak Traffic

Reserving static blocks of GPUs instead of building a queue-delay autoscaling system leads to massive cost overruns. Your infrastructure should scale based on concurrent requests and queue depth, not arbitrary CPU metrics. As highlighted in the Anyscale Blog regarding end-to-end LLM workflows, relying on traditional web server scaling metrics will cause your GPU cluster to scale too late or over-provision unnecessarily. You need custom metrics tied directly to the inference engine's internal KV cache utilization.
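
A scaling rule driven by inference-level signals might look like the sketch below. The thresholds, the function name, and the assumption that your engine exports queue depth and KV cache utilization as metrics are all illustrative; tune them against your own traffic.

```python
import math


def desired_replicas(current, queue_depth, kv_cache_util,
                     target_queue_per_replica=4, kv_high=0.9, max_replicas=8):
    """Pick a replica count from queue depth and KV cache pressure, not CPU load."""
    if queue_depth == 0 and kv_cache_util == 0.0:
        return 0  # fully idle: scale to zero

    # Enough replicas to keep the per-replica queue near the target.
    wanted = max(math.ceil(queue_depth / target_queue_per_replica), 1)

    # KV cache nearly full: add headroom even if the queue looks short.
    if kv_cache_util >= kv_high:
        wanted = max(wanted, current + 1)

    return min(wanted, max_replicas)


print(desired_replicas(current=1, queue_depth=10, kv_cache_util=0.5))  # prints 3
```

In practice you would also add a cooldown period so the replica count does not oscillate when the queue hovers near a threshold.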

Overlooking Compliance and Data Residency

Treating infrastructure as a commodity can lead to severe legal exposure. Ensure your provider has a clear path to ISO 27001 and strict GDPR compliance before deploying models that handle sensitive user data. Many teams fine-tune models specifically to handle proprietary internal documents. If those models are served on infrastructure that routes traffic outside of your legal jurisdiction, you risk violating data sovereignty laws. Lyceum ensures that all data remains within European borders, providing a secure foundation for enterprise AI deployments.

Model Distillation as an Alternative to Fine-Tuning

While fine-tuning is the standard approach for adapting a model to a specific domain, it is not the only method available. For teams struggling with the computational overhead of serving massive 70B or 100B parameter models, model distillation offers a powerful alternative. Distillation involves training a smaller, more efficient student model to replicate the behavior and output quality of a much larger teacher model.

The Economics of Distillation

The Predibase playbook on distilling large language models outlines how this process can drastically alter your production economics. Instead of serving a massive model that requires multiple GPUs just to hold the weights, you can distill the necessary knowledge into a 7B or 8B parameter model. This smaller model can easily fit on a single, less expensive GPU, significantly reducing your hourly infrastructure costs.

Distillation is particularly effective for narrow, well-defined tasks. If your application only needs to extract JSON entities from legal contracts, you do not need the broad, general knowledge of a 70B model. By generating a high-quality synthetic dataset using the larger model, you can train a smaller model to achieve parity on that specific task.

Combining Distillation with LoRA

The most advanced production setups combine both techniques. You can distill a large model into a smaller base model, and then use multi-LoRA serving on that smaller base model to handle various sub-tasks. This hybrid approach maximizes both memory efficiency and task performance. The smaller base model ensures that the baseline KV cache and weight memory requirements remain low, while the LoRA adapters provide the flexibility to serve multiple user personas or specific customer configurations without deploying separate endpoints. By leveraging these techniques on Lyceum infrastructure, engineering teams can achieve exceptional inference speeds while maintaining strict control over their cloud budgets.

Optimizing End-to-End LLM Workflows

Serving the model is only one component of a production AI system. To achieve reliable performance, engineering teams must optimize the entire pipeline, from the moment a user submits a prompt to the final token generation. The Anyscale Blog on end-to-end LLM workflows emphasizes that bottlenecks often occur outside of the inference engine itself.

Data Preprocessing and Tokenization

Before a request ever reaches the GPU, the input text must be tokenized. In high-throughput systems, inefficient Python-based tokenizers can become a severe CPU bottleneck, starving the GPU of work. Production systems must utilize highly optimized, compiled tokenizers and handle preprocessing asynchronously. If your GPU is waiting for the CPU to format a prompt template, you are wasting expensive compute cycles. Implementing a dedicated preprocessing service that feeds a continuous stream of ready-to-compute tokens into the inference engine is critical for maximizing hardware utilization.
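
The producer/consumer shape of such a preprocessing stage can be sketched with `asyncio`. Here `tokenize` is a whitespace-splitting stand-in for a compiled tokenizer and `engine` is a placeholder for the GPU forward pass; both are hypothetical. Running the tokenizer in a thread pool only helps if the real tokenizer releases the GIL, which is an assumption in this sketch.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def tokenize(prompt):
    """Stand-in for a compiled tokenizer; splits on whitespace for illustration."""
    return prompt.split()


async def preprocess(prompts, queue, pool):
    """Tokenize off the event loop and feed ready token lists to the engine."""
    loop = asyncio.get_running_loop()
    for prompt in prompts:
        tokens = await loop.run_in_executor(pool, tokenize, prompt)
        await queue.put(tokens)
    await queue.put(None)  # sentinel: no more work


async def engine(queue, results):
    """Placeholder consumer: records token counts instead of running a model."""
    while (tokens := await queue.get()) is not None:
        results.append(len(tokens))


async def main(prompts):
    queue = asyncio.Queue(maxsize=64)  # bounded, so preprocessing cannot run away
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        await asyncio.gather(preprocess(prompts, queue, pool), engine(queue, results))
    return results


print(asyncio.run(main(["summarize this contract", "hello world"])))  # prints [3, 2]
```

The bounded queue provides backpressure: if the engine falls behind, preprocessing pauses instead of accumulating tokenized prompts in memory.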

API Gateway and Request Routing

When serving multiple fine-tuned models via LoRA adapters, the API gateway plays a crucial role. It must intelligently route requests based on adapter availability and current GPU memory states. If the gateway simply uses a round-robin approach, it may send a request for a specific adapter to a node that has just unloaded it, forcing a costly reload from CPU memory. A workload-aware router tracks which adapters are currently active in the VRAM of specific nodes and directs traffic accordingly. This minimizes context switching and ensures that the inference engine can maintain high batch sizes.

Furthermore, implementing robust retry logic and fallback mechanisms at the gateway level ensures high availability. If a specific node hits an out-of-memory (OOM) error due to an unexpectedly large KV cache allocation, the gateway must seamlessly redirect the request to a healthy node without exposing the failure to the end user. By treating the entire workflow as a cohesive system rather than isolated components, teams deploying on Lyceum can extract the maximum possible performance from their allocated hardware.
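
The routing rule described above can be sketched in a few lines. The node dictionary fields (`healthy`, `loaded_adapters`, `active_requests`) are hypothetical; a real gateway would populate them from health checks and engine metrics.

```python
def pick_node(adapter, nodes):
    """Prefer healthy nodes that already hold the adapter; break ties by load."""
    healthy = [n for n in nodes if n["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy nodes available")

    warm = [n for n in healthy if adapter in n["loaded_adapters"]]
    candidates = warm or healthy  # fall back to a cold load if no node is warm
    return min(candidates, key=lambda n: n["active_requests"])


nodes = [
    {"name": "gpu-a", "healthy": True, "loaded_adapters": {"legal"}, "active_requests": 5},
    {"name": "gpu-b", "healthy": True, "loaded_adapters": {"code"}, "active_requests": 1},
    {"name": "gpu-c", "healthy": False, "loaded_adapters": {"legal"}, "active_requests": 0},
]
print(pick_node("legal", nodes)["name"])    # prints gpu-a
print(pick_node("support", nodes)["name"])  # prints gpu-b
```

The busier node gpu-a still wins for the "legal" adapter because it is the only healthy node with that adapter resident. Retrying after a failure amounts to marking the node unhealthy and calling `pick_node` again.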

Benchmarking and Performance Tuning

Deploying a fine-tuned model without establishing a rigorous benchmarking protocol is a recipe for unpredictable production failures. You must understand how your specific model behaves under various load conditions before routing live user traffic to it. The NVIDIA developer blog on LLM inference benchmarking with TensorRT-LLM provides a blueprint for how teams should approach performance tuning.

Defining Key Performance Indicators

Throughput and latency are the two primary metrics, but they are often at odds with each other. Throughput is measured in tokens per second across the entire system, while latency is typically measured as Time-to-First-Token (TTFT) and Time-Per-Output-Token (TPOT). If you aggressively increase your batch size to maximize throughput, your TTFT will inevitably degrade, leading to a poor user experience for interactive chat applications. Conversely, optimizing purely for TTFT by keeping batch sizes small will result in terrible GPU utilization and high infrastructure costs.
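
Both latency metrics can be derived from per-token arrival timestamps. This is a minimal sketch assuming you record the request start time and the time each streamed token arrives; the function name and timestamps are illustrative.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, TPOT) in seconds from streamed token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one output token")

    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # Average gap between consecutive output tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


ttft, tpot = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(f"TTFT={ttft:.2f}s TPOT={tpot:.2f}s")  # prints TTFT=0.50s TPOT=0.10s
```

Report these as percentiles (for example p50 and p99) across many requests rather than as a single average, since tail latency is what interactive users notice.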

Simulating Realistic Traffic

Benchmarking must simulate your actual production traffic shape. Sending uniform requests of 500 input tokens and 50 output tokens will not reveal how your system handles edge cases. You must generate synthetic workloads that mimic the long-tail distribution of real user prompts. This includes testing how the inference engine handles sudden spikes in concurrent requests and how it manages memory when processing maximum-context documents. By utilizing tools that profile GPU memory bandwidth and compute utilization during these stress tests, engineering teams can identify exactly where the bottlenecks lie. Whether you choose vLLM for its dynamic batching or TensorRT-LLM for its compiled execution speed, running these benchmarks on Lyceum infrastructure ensures you have the empirical data needed to provision the correct amount of hardware for your specific use case.
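
A long-tailed synthetic workload can be generated with a log-normal distribution, as in the sketch below. The distribution parameters, the context limit, and the output cap are assumptions; ideally you would fit them to logs from your own production traffic.

```python
import random


def synthetic_workload(n, max_context=8192, max_output=1024, seed=0):
    """Generate n requests with long-tailed prompt and output lengths."""
    rng = random.Random(seed)  # seeded so benchmark runs are repeatable
    workload = []
    for _ in range(n):
        prompt_len = int(rng.lognormvariate(6.0, 1.0))   # median near 400 tokens
        output_len = int(rng.lognormvariate(4.5, 0.8))   # median near 90 tokens
        workload.append({
            "prompt_tokens": max(1, min(prompt_len, max_context)),
            "output_tokens": max(1, min(output_len, max_output)),
        })
    return workload


requests = synthetic_workload(1000)
longest = max(r["prompt_tokens"] for r in requests)
print(f"{len(requests)} requests, longest prompt: {longest} tokens")
```

Most generated prompts are short, but a small share approach the context limit, which is the kind of mix that uniform 500-token benchmarks never exercise.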

Frequently Asked Questions

How does PagedAttention improve LLM serving?

PagedAttention partitions the KV cache into fixed-size blocks that do not need to be contiguous, similar to virtual memory in an operating system. This largely eliminates memory fragmentation, allowing the inference engine to batch significantly more requests concurrently without running out of VRAM. By dynamically allocating memory only when needed, it prevents the massive waste associated with pre-allocating contiguous memory for maximum sequence lengths.

Why is auto-scaling GPUs difficult on public clouds?

Public clouds often require block reservations for high-end GPUs like H100s due to capacity constraints. This means true scale-to-zero auto-scaling is rarely supported without massive cold-start delays, forcing teams to pay for idle compute. When utilization drops during off-peak hours, you are still billed the full hourly rate, destroying the unit economics of your AI application.

What is rank-induced heterogeneity in LoRA serving?

When serving multiple adapters of different sizes, or ranks, on the same base model, the varying compute requirements can cause severe latency spikes. The inference engine struggles to batch a rank-8 adapter request with a rank-128 adapter request efficiently. According to recent research (arXiv:2511.22880), this mismatch can increase Time-to-First-Token by up to 84 percent if requests are not intelligently grouped.

How do I handle cold starts when scaling to zero?

To mitigate cold starts, use fast distributed caching with NVMe drives to load weights quickly. Alternatively, keep the massive base model loaded in memory permanently and only scale the lightweight LoRA adapters to zero. Because adapters are tiny, they load from CPU memory to GPU VRAM in milliseconds, completely masking the cold start delay from the user.

Why is EU data sovereignty important for fine-tuned models?

Fine-tuned models often contain proprietary company data, personally identifiable information, or regulated financial records embedded in their weights or processed in their prompts. Hosting them on non-EU infrastructure can violate GDPR and the AI Act. Utilizing a provider like Lyceum ensures provable data residency, making compliance a hard guarantee rather than an operational risk.

Related Resources

/magazine/self-host-llm-api-eu-infrastructure; /magazine/openai-compatible-api-self-hosted; /magazine/deploy-private-llm-endpoint-gpu-cloud