How do I choose between an H100 and an A100 for inference?

The H100 is significantly faster for inference due to its Transformer Engine, which supports FP8 precision. If your budget allows, the H100 provides better price-performance for production workloads. The A100 remains a solid choice for smaller models or when H100 availability is limited.

What is 'Scale to Zero' and how does it affect latency?

Scale to Zero allows your inference endpoint to shut down when not in use, saving costs. The trade-off is 'cold start' latency, where the first request after an idle period takes longer (often 20-30 seconds) as the model is reloaded into GPU memory.

Why should I care about GDPR for LLM inference?

If you process personal data of EU citizens, GDPR requires strict data protection measures. Using US-hosted providers can lead to legal uncertainty. EU-sovereign providers like Lyceum ensure that data never leaves European borders, simplifying compliance for your legal team.

Can I run multiple models on a single GPU?

Yes, if the total VRAM of the models and their KV caches fits within the GPU's capacity. However, sharing a GPU can lead to 'noisy neighbor' issues where one model's traffic spikes increase the latency of the other.

What is continuous batching?

Continuous batching is an optimization where the inference engine inserts new requests into the current processing cycle as soon as a token is generated, rather than waiting for the entire previous batch to complete. This significantly increases throughput and reduces average latency.

Reduce LLM Inference Latency: GPU Optimization Guide

<p>Latency is no longer just a metric for user experience; it is the primary driver of unit economics. As teams move from prototyping to production, the <a href="/magazine/pay-per-token-vs-dedicated-gpu-inference">cost of serving a model</a> often outweighs the cost of training it. For a 100-person AI startup, a 200ms delay in Time to First Token (TTFT) can be the difference between a product that feels like a fluid conversation and one that feels like a broken utility. Reducing latency requires a multi-layered approach that spans from the CUDA kernel level up to the orchestration layer. We see many teams struggle with the transition from hyperscaler credits to sustainable infrastructure, often because they are running unoptimized stacks that leave 60% of GPU performance on the table.</p>

Understanding the Latency Hierarchy: TTFT vs. TPOT

Before optimizing, you must distinguish between the two primary latency metrics. Time to First Token (TTFT) measures how quickly the model starts responding, which is critical for interactive applications. Time Per Output Token (TPOT) measures the speed of subsequent tokens, determining the overall reading speed. According to a 2025 report from Artificial Analysis, the industry benchmark for high-performance Llama 3.1 70B inference is a TTFT under 200ms and a TPOT under 30ms.

Latency bottlenecks typically fall into two categories: compute-bound and memory-bound. During the prefill phase (generating the first token), the GPU is often compute-bound as it processes the entire input prompt in parallel. During the decoding phase (generating subsequent tokens), the GPU becomes memory-bound because it must fetch model weights and the KV cache from HBM (High Bandwidth Memory) for every single token generated.

Prefill Phase
Highly parallel, benefits from raw TFLOPS.
Decoding Phase
Sequential, benefits from high memory bandwidth (GB/s).
KV Cache
Grows with context length, leading to Out-of-Memory (OOM) errors if not managed.

Software Optimization: vLLM, TensorRT-LLM, and NVIDIA Dynamo

The choice of inference engine is the most significant software decision you will make. Standard PyTorch implementations are insufficient for production. Modern engines like vLLM and NVIDIA TensorRT-LLM use a technique called PagedAttention to manage the KV cache. This prevents memory fragmentation and allows for much higher batch sizes, which indirectly reduces latency by increasing throughput.

In March 2026, the release of NVIDIA Dynamo 1.0 provided a standardized orchestration layer that bridges the gap between raw compute and high-level APIs. An open-stack approach combining vLLM with NVIDIA Dynamo ensures customer portability. Unlike black-box proprietary engines, this stack allows you to maintain control over your model weights while achieving performance parity with specialized API providers. By using continuous batching, these engines process new requests immediately rather than waiting for an entire batch to finish, reducing average wait times by up to 70% in high-traffic scenarios.

Quantization Strategies: Balancing Precision and Speed

Quantization reduces the bit-precision of model weights, which decreases the amount of data the GPU must move from memory to the processing cores. Moving from FP16 (16-bit) to FP8 (8-bit) effectively doubles the memory bandwidth of an H100 GPU. According to NVIDIA's 2025 technical documentation, FP8 inference on Hopper architecture provides a 2x to 4x throughput increase with negligible loss in model accuracy.

Common quantization methods include:

AWQ (Activation-aware Weight Quantization)
Protects the most important weights to maintain accuracy at 4-bit precision.
FP8
The current standard for H100 and B200 GPUs, offering a perfect balance of speed and precision.
INT8
Older but reliable for previous-generation hardware like the A100.

For teams running large models like Llama 3 405B, quantization is not optional; it is a requirement to fit the model within the VRAM of a single 8-GPU node. Reducing the memory footprint also allows for larger KV caches, enabling longer context windows without a linear increase in latency.

Architectural Tactics: Speculative Decoding

Speculative decoding is a powerful technique where a smaller, faster "draft" model predicts the next few tokens, which are then verified by the larger "target" model in a single forward pass. If the draft model is correct, you generate multiple tokens in the time it would usually take to generate one. This can improve TPOT by 2x to 3x without changing the underlying model's weights.

However, speculative decoding requires careful implementation. If the draft model's acceptance rate is low (below 50%), the overhead of verification can actually increase latency. We recommend using a draft model from the same family as your target model, for example, using a Llama 3 8B model to speculate for a Llama 3 70B model. This ensures a higher alignment in token distribution and better performance gains.

Infrastructure and Data Residency: The Hidden Latency

Network latency often negates GPU-level optimizations. For European startups, hosting models in US-based data centers adds 100ms to 150ms of round-trip time (RTT) due to physical distance. This is a deal-breaker for real-time applications like voice AI or interactive coding assistants. Furthermore, EU-regulated teams in healthcare or manufacturing face strict GDPR and AI Act requirements that mandate data residency within Europe.

Lyceum provides a sovereign alternative by hosting all infrastructure in European data centers. By provisioning VMs across a network of supply-side partners, you can deploy inference endpoints close to your end-users. Our Pythia AI Scheduler further optimizes costs by predicting VRAM requirements and selecting the most efficient GPU for your specific workload, often resulting in 30-34% savings compared to unmanaged hyperscaler instances. Using an OpenAI-compatible API, you can transition from US-based providers to EU-sovereign infrastructure with zero code changes, ensuring both compliance and low-latency performance.

Reduce LLM Inference Latency on GPUs: A Technical Guide

Understanding the Latency Hierarchy: TTFT vs. TPOT

Prefill Phase

Decoding Phase

KV Cache

Software Optimization: vLLM, TensorRT-LLM, and NVIDIA Dynamo

Quantization Strategies: Balancing Precision and Speed

AWQ (Activation-aware Weight Quantization)

FP8

INT8

Architectural Tactics: Speculative Decoding

Infrastructure and Data Residency: The Hidden Latency

Frequently Asked Questions

How do I choose between an H100 and an A100 for inference?

What is 'Scale to Zero' and how does it affect latency?

Why should I care about GDPR for LLM inference?

Can I run multiple models on a single GPU?

What is continuous batching?

Further Reading

Related Resources

Related Articles

vLLM vs TensorRT-LLM: Production Benchmark & Guide

Serverless GPU Cold Start Latency: Architecture Comparison

LLM Inference Tokens Per Second: 2026 Hardware and Software Benchmarks

Inference

Training