
Optimizing LLM Inference Throughput with Batching Strategies

A technical guide to continuous batching and memory management

Magnus Grünewald


April 15, 2026 · CEO at Lyceum Technology

Large Language Model (LLM) inference is usually limited by memory bandwidth rather than raw compute power, especially during token generation. When serving a single request, the GPU spends most of its time moving model weights from VRAM to the processing cores, leaving the compute units underutilized. Batching addresses this by processing multiple sequences simultaneously, so the weights are loaded once and applied to many tokens. However, traditional batching methods often lead to high latency and inefficient memory use because sequence lengths vary. For engineering teams scaling production workloads, understanding the transition from static to continuous batching is essential for keeping infrastructure cost-effective without sacrificing user experience.

The Memory Bandwidth Bottleneck in LLM Serving

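Understanding why batching is critical requires looking at the arithmetic intensity of LLM inference. During the generation (decode) phase, the model produces one token at a time per sequence, and for every token the entire set of model parameters must be read from GPU memory. For a 70B-parameter model in FP16, this means reading 140 GB of weights to perform only about 140 billion floating-point operations (roughly two per parameter), or about one operation per byte moved. At the 3.35 TB/s of memory bandwidth on an NVIDIA H100, streaming 140 GB takes about 42 ms, which caps a single stream at roughly 24 tokens per second, and that is before accounting for the fact that 140 GB of weights must be sharded across more than one 80 GB GPU. By contrast, the GPU's peak of roughly 2,000 TFLOPS (a figure that applies to FP8 or sparse FP16 math; dense FP16 is about half that) could finish the arithmetic for one token in well under a millisecond.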

This imbalance creates the memory wall. If you serve only one user at a time, your GPU utilization will likely hover around 5 percent. Batching is the primary mechanism to increase this utilization. By grouping 32 or 64 requests together, the cost of loading those 140 GB of weights is amortized across all requests in the batch. The goal is to move the workload from being memory-bound to being compute-bound, where the GPU cores are finally the bottleneck instead of the memory bus.

  • Arithmetic Intensity: The ratio of compute operations to memory accesses.
  • Memory-Bound: Performance limited by how fast data moves from VRAM to cores.
  • Compute-Bound: Performance limited by the speed of the GPU cores themselves.
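The amortization argument above can be sketched as a small roofline-style estimate. This is a simplified model under stated assumptions: it reuses the article's round numbers (70B parameters, FP16, 3.35 TB/s, about 2,000 TFLOPS), counts roughly two floating-point operations per parameter per token, and ignores KV cache reads, sharding across GPUs, and kernel overheads. The function names are illustrative.

```python
# Roofline-style estimate of decode speed. All hardware figures are the
# article's round numbers, not measurements.
PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # FP16
BANDWIDTH = 3.35e12      # bytes/s of memory bandwidth
PEAK_FLOPS = 2.0e15      # FLOP/s (the ~2,000 TFLOPS peak figure)


def decode_step_seconds(batch_size: int) -> float:
    """Time for one decode step: the slower of weight streaming and compute."""
    memory_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH    # weights are read once per step
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS   # ~2 FLOPs per parameter per token
    return max(memory_time, compute_time)


def tokens_per_second(batch_size: int) -> float:
    """Aggregate tokens per second across the whole batch."""
    return batch_size / decode_step_seconds(batch_size)


def crossover_batch_size() -> float:
    """Batch size at which compute time equals weight-streaming time."""
    return PEAK_FLOPS * BYTES_PER_PARAM / (2 * BANDWIDTH)


if __name__ == "__main__":
    for batch in (1, 32, 64, 1024):
        print(batch, round(tokens_per_second(batch), 1))
    print("crossover batch size:", round(crossover_batch_size()))
```

Under these assumptions, throughput grows linearly with batch size until the crossover point (several hundred sequences) and is flat beyond it. Real deployments usually run out of KV cache memory or latency budget long before reaching that point.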

According to research published in the 2025 AI Infrastructure Report, teams that fail to implement advanced batching strategies often see 60 percent higher operational costs due to idle GPU cycles. For European startups transitioning off hyperscaler credits, this inefficiency can quickly become a terminal burn rate issue.

Evolution from Static to Continuous Batching

Early inference servers relied on static batching. In this model, the server waits for a specific number of requests to arrive, or for a timeout to trigger, before forming a batch. All requests in that batch start at the same time, and the batch is not released until its longest request finishes. This creates a significant problem: the tail latency effect. If one request in a batch of 16 requires a 500-token response while the others need only 50 tokens, the GPU remains occupied by that single long request for the remaining 450 steps while the other 15 slots sit empty.
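The cost of that example can be put in numbers with a short calculation. The helper below is a sketch, assuming every slot in a static batch stays reserved until the longest request finishes; the function name is illustrative.

```python
def static_batch_utilization(output_lengths: list[int]) -> float:
    """Fraction of slot-steps that produce a token in one static batch."""
    steps = max(output_lengths)                 # the batch runs as long as its longest request
    useful = sum(output_lengths)                # slot-steps that actually generate a token
    return useful / (steps * len(output_lengths))


if __name__ == "__main__":
    lengths = [500] + [50] * 15                 # one long request, fifteen short ones
    print(f"{static_batch_utilization(lengths):.1%}")
```

For the batch described above, fewer than one in six slot-steps does useful work.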

Continuous batching, also known as iteration-level batching, was introduced to solve this. Instead of waiting for the entire batch to complete, the scheduler operates at the level of individual token generation steps. As soon as one request in the batch finishes (reaches an EOS token), a new request from the queue can take its place in the next iteration. As long as requests are waiting and memory is available, the batch stays full, which maximizes throughput.
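The difference between the two schedulers can be illustrated with a toy simulation. This is a minimal sketch, not how any particular inference engine is implemented: it assumes every decode step costs the same, ignores prefill and memory limits, and knows output lengths in advance only because it is a simulation.

```python
from collections import deque


def simulate(output_lengths, max_batch, continuous):
    """Return the number of decode steps needed to finish every request."""
    queue = deque(output_lengths)   # tokens each waiting request will generate
    running = []                    # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Static batching admits requests only once the previous batch has
        # fully drained; continuous batching refills free slots every step.
        if continuous or not running:
            while queue and len(running) < max_batch:
                running.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished requests leave.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    workload = [500] + [50] * 150   # one long request followed by many short ones
    print("static:    ", simulate(workload, max_batch=16, continuous=False))
    print("continuous:", simulate(workload, max_batch=16, continuous=True))
```

With this workload the long request holds one slot for its whole duration in both cases, but under continuous batching the other 15 slots keep cycling through short requests instead of idling.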

A document parsing service illustrates this well. You might have a mix of short metadata extractions and long summary tasks. With static batching, your throughput is dictated by the slowest summary. With continuous batching, the metadata tasks cycle through the GPU rapidly, keeping the hardware saturated.

PagedAttention and KV Cache Optimization

The biggest hurdle to large batch sizes is not the model weights, but the Key-Value (KV) cache. To avoid recomputing the entire sequence at every step, LLMs store the intermediate attention keys and values for all previous tokens in VRAM. As batch sizes and sequence lengths increase, the KV cache can easily consume 40 GB or more of memory, leading to Out-of-Memory (OOM) errors.
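A back-of-the-envelope formula shows how quickly the cache grows. The sketch below assumes a standard decoder layout in which each layer stores one key and one value vector per KV head per token. The example configuration (80 layers, 8 KV heads, head dimension 128, FP16) is an assumption chosen to resemble a 70B-class model with grouped-query attention, not a figure from this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Approximate KV cache size in bytes for a batch of equal-length sequences."""
    # Factor 2 covers keys and values; bytes_per_elem=2 corresponds to FP16.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Assumed 70B-class configuration, 32 sequences of 4,096 tokens each.
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                          seq_len=4096, batch_size=32)
    print(f"{size / 2**30:.1f} GiB")
```

Because the size scales linearly with both batch size and sequence length, doubling either one doubles the memory the cache needs.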

Traditional systems allocate a contiguous block of memory for the maximum possible sequence length for each request. This leads to two types of waste:

  1. Internal Fragmentation: Memory reserved up to the maximum sequence length that the request has not filled yet, or never fills.
  2. External Fragmentation: Small gaps between allocated blocks that cannot be used for new requests.

PagedAttention, introduced with the vLLM project, treats KV cache memory like virtual memory in an operating system. It breaks the cache into small fixed-size blocks (pages) that do not need to be contiguous, and a per-sequence block table maps token positions to those blocks. This allows the system to allocate memory only as needed, so the only waste left is the unfilled part of each sequence's last block. According to benchmarks from early 2025, PagedAttention allows for batch sizes up to 4x larger than traditional methods on the same hardware.
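The bookkeeping behind this idea can be sketched with a toy allocator. This is an illustration of block-table allocation only. It is not vLLM's API, it stores no actual tensors, and the class and method names are hypothetical.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache slots (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve a slot for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots across all sequences."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.lengths.values())


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=8, block_size=16)
    for _ in range(20):
        cache.append_token("a")
    for _ in range(5):
        cache.append_token("b")
    print(cache.block_tables, cache.wasted_slots())
```

In this scheme each sequence wastes at most one partially filled block, whereas a contiguous allocator reserves the maximum sequence length up front regardless of how many tokens are eventually generated.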

By implementing PagedAttention alongside continuous batching, Lyceum Technology enables users to serve models like Llama 3 or Mistral with significantly higher concurrency. This is particularly vital for GDPR-compliant workloads where data must remain within EU-sovereign infrastructure, as it allows teams to do more with fewer GPUs, directly lowering the cost of compliance.

Throughput Benchmarks and Real-World Gains

Evaluating batching strategies requires analyzing the relationship between throughput and latency. As you increase the batch size, total throughput (tokens per second across all users) increases, but the latency for an individual user (time per output token) also rises. The goal is to find the saturation point where throughput plateaus before latency becomes unacceptable.
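One way to locate that saturation point is to sweep batch sizes in a benchmark and stop when either the latency target is exceeded or throughput stops improving. The sketch below assumes you already have measurements from your own workload. The sample numbers are invented purely to exercise the function and are not benchmark results, and the `min_gain` threshold is an arbitrary illustrative choice.

```python
def pick_batch_size(measurements, max_tpot_ms, min_gain=0.05):
    """Pick the largest batch size before latency or saturation limits are hit.

    measurements: list of (batch_size, throughput_tokens_per_s, tpot_ms),
    sorted by ascending batch size. tpot_ms is time per output token.
    """
    best = None
    prev_throughput = None
    for batch_size, throughput, tpot_ms in measurements:
        if tpot_ms > max_tpot_ms:
            break                                    # per-user latency is no longer acceptable
        if prev_throughput is not None and throughput < prev_throughput * (1 + min_gain):
            break                                    # throughput has plateaued
        best = batch_size
        prev_throughput = throughput
    return best


if __name__ == "__main__":
    # Hypothetical measurements, made up for illustration only.
    sample = [(1, 45, 22), (8, 300, 27), (16, 520, 31),
              (32, 800, 40), (64, 830, 77), (128, 840, 152)]
    print(pick_batch_size(sample, max_tpot_ms=100))
    print(pick_batch_size(sample, max_tpot_ms=35))
```

A stricter latency target moves the chosen batch size down even when throughput is still climbing, which is the trade-off described above.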

Batching Strategy     | Throughput (tokens/s) | GPU Utilization | Memory Waste
No Batching           | ~45                   | <10%            | High
Static Batching       | ~180                  | 45%             | 30-50%
Continuous Batching   | ~420                  | 85%             | <5%

In a 2025 performance study, continuous batching combined with FP8 quantization on NVIDIA H100 GPUs showed a 2.3x improvement in throughput compared to standard FP16 static batching. For a scale-up processing millions of tokens daily, this translates to thousands of euros in monthly savings, especially when using providers that eliminate hidden costs like egress fees.

Common mistakes include setting the batch size too high, which can lead to request starvation where new requests wait too long in the queue, and failing to account for the memory overhead of multi-modal inputs.

For teams transitioning from experimental setups to production, we recommend starting with a conservative batch size and monitoring the KV cache occupancy. If your occupancy is consistently below 70 percent, you have room to increase your concurrency.

Frequently Asked Questions

Does batching work for all model architectures?

Yes, batching is a fundamental optimization for almost all Transformer-based models, including LLMs, vision transformers, and encoders. However, the specific implementation details like KV cache management are most critical for generative decoder-only models.

Can I use continuous batching with any GPU?

Continuous batching is a software-level orchestration strategy. While it can run on most modern NVIDIA GPUs, it is most effective on data center cards like the A100, H100, or B200 which have the memory capacity and bandwidth to handle large concurrent batches.

How do I determine the best batch size for my model?

The optimal batch size depends on your GPU's VRAM, the model size, and your latency requirements. You should benchmark your specific workload by increasing the batch size until you either hit VRAM limits or your time-per-output-token exceeds your target threshold.

Is continuous batching compatible with speculative decoding?

Yes, modern inference engines like NVIDIA Dynamo and vLLM are increasingly integrating speculative decoding with continuous batching. This allows the system to verify multiple tokens at once while still maintaining high concurrency.

How does Lyceum handle batching for dedicated inference?

Lyceum Technology's inference engine uses an open-stack approach based on vLLM and NVIDIA Dynamo. The platform includes native support for continuous batching and PagedAttention, ensuring that your dedicated European nodes operate at peak efficiency with OpenAI-compatible APIs.

Related Resources

  • /magazine/vllm-production-deployment-guide-2026
  • /magazine/nvidia-dynamo-inference-orchestration-guide
  • /magazine/reduce-llm-inference-latency-gpu