Multi-GPU Tensor Parallelism Setup: Configuration and Optimization Guide

Scale your LLM inference and training across multiple GPUs by mastering tensor parallelism configurations.

Caspar Lehmkühler

May 22, 2026 · Head of Product at Lyceum Technology

As model sizes scale beyond the memory limits of single accelerators, multi-GPU setups become mandatory. Even an 80GB H100 cannot hold a 70B parameter model in full precision: at 4 bytes per parameter the weights alone need about 280 GB, and even at 16-bit precision they need about 140 GB. Data parallelism works well when the model fits on one GPU, but larger architectures require splitting the model itself. Tensor parallelism achieves this by dividing individual layers across multiple GPUs. This guide covers how it works, how it combines with pipeline parallelism, why it depends on high interconnect bandwidth, and how it affects KV cache capacity and overall system throughput.

The Mechanics of Tensor Parallelism

Tensor parallelism distributes the computation of a single tensor operation across multiple devices. This technique shards large linear layers to distribute both memory and compute, an approach popularized by NVIDIA's Megatron-LM framework.

Matrix Multiplication Sharding

Consider a standard matrix multiplication operation where an input matrix is multiplied by a weight matrix. In a single-GPU setup, the entire weight matrix resides in the VRAM of one device. In a tensor parallel setup, the weight matrix is split into smaller blocks.

Typically, the first linear layer in a transformer block undergoes column-wise sharding. The weight matrix is divided along its columns, and each GPU holds one column block. The input matrix is multiplied by each column block independently on the respective GPUs. This produces partial output matrices.

The subsequent linear layer then uses row-wise sharding. The input to this second layer is the sharded output from the first layer. Because of the mathematical properties of matrix multiplication, the GPUs can perform the second multiplication locally. However, to ensure the final result is correct, the system must aggregate the partial results from all GPUs. This is accomplished using an AllReduce communication operation.

This arrangement of column-wise followed by row-wise sharding is effective because each pair of linear layers needs only a single AllReduce operation in the forward pass. In a Megatron-style transformer layer, that means one AllReduce for the attention block and one for the MLP block. By minimizing communication overhead, you keep the GPUs busy with compute rather than waiting on transfers.
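The column-then-row arrangement can be checked numerically. The following is a minimal sketch in plain Python, assuming two simulated GPUs and small integer matrices. The helper names (`matmul`, `relu`, `split_columns`, `split_rows`) are illustrative and not part of any framework, and the element-wise sum at the end stands in for the AllReduce.

```python
# Simulate column-wise then row-wise sharding of two linear layers on 2 "GPUs".
# Plain Python lists stand in for tensors; summing the partial outputs
# stands in for the AllReduce.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise nonlinearity, safe to apply to each column shard independently."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

x = [[1, -2], [3, 4]]                      # input: 2 tokens, hidden size 2
w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]        # first layer: 2 -> 4
w2 = [[1, 2], [0, 1], [-1, 0], [2, -2]]    # second layer: 4 -> 2

# Reference result on a single device.
reference = matmul(relu(matmul(x, w1)), w2)

# Tensor parallel: column-shard w1, row-shard w2, no communication in between.
partials = []
for w1_shard, w2_shard in zip(split_columns(w1, 2), split_rows(w2, 2)):
    hidden_shard = relu(matmul(x, w1_shard))          # local to one GPU
    partials.append(matmul(hidden_shard, w2_shard))   # partial output

# "AllReduce": element-wise sum of the partial outputs from all GPUs.
combined = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(combined == reference)  # True
```

Each simulated GPU only ever touches its own shard of both weight matrices, and the partial outputs are combined exactly once.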

Interconnect Bandwidth Requirements

Unlike pipeline parallelism, which splits the model layer by layer, tensor parallelism splits the weights within the layers themselves. Every GPU receives an identical batch of data but computes only a fraction of each matrix multiplication. Because the devices must synchronize their partial results in every layer, tensor parallelism demands very high interconnect bandwidth.

Tensor parallelism is not recommended for GPUs linked only by standard PCIe, because the communication overhead severely degrades performance. Prefer GPUs connected via a high-speed interconnect such as NVIDIA's NVLink, which on an H100 provides up to 900 GB/s of total bidirectional bandwidth per GPU, compared with roughly 64 GB/s bidirectional for a PCIe Gen 4 x16 link.
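To get a feel for what that bandwidth gap means per synchronization, here is a rough back-of-the-envelope sketch. It assumes a ring AllReduce, in which each GPU transmits 2(N-1)/N times the message size, and it treats the 900 GB/s and 64 GB/s figures above as fully usable bandwidth, which real systems do not reach. Latency and overlap with compute are ignored, and the message size is a hypothetical example.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only lower bound for one ring AllReduce (latency ignored)."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / bandwidth_bytes_per_s

# Hypothetical activation tensor: 4096 tokens x 8192 hidden size x 2 bytes (16-bit).
message = 4096 * 8192 * 2

nvlink = ring_allreduce_seconds(message, num_gpus=8, bandwidth_bytes_per_s=900e9)
pcie = ring_allreduce_seconds(message, num_gpus=8, bandwidth_bytes_per_s=64e9)

print(f"NVLink estimate: {nvlink * 1e6:.0f} microseconds per AllReduce")
print(f"PCIe estimate:   {pcie * 1e6:.0f} microseconds per AllReduce")
print(f"Slowdown factor: {pcie / nvlink:.1f}x")
```

Because this cost is paid in every layer, a slowdown of this size on each AllReduce adds up quickly over a full forward pass.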

Configuring Tensor Parallelism in vLLM

When deploying models for inference, vLLM provides native support for tensor parallelism. You control this via the tensor_parallel_size parameter, which dictates how many GPUs will share the model weights. Configuring this parameter correctly is the first step toward achieving optimal hardware utilization.

Super-Linear KV Cache Scaling

Tensor parallelism in vLLM enables super-linear scaling of the KV cache. When you split a model across multiple GPUs, the memory required for the model weights on each individual GPU decreases proportionally. Because the operating system and the inference engine have fixed memory overheads that do not scale with the model size, the VRAM freed up by sharding the weights translates directly into available space for the KV cache.

Moving from a single GPU to a multi-GPU tensor parallel setup significantly increases the number of available KV cache blocks. This expanded cache allows you to process much larger batch sizes, resulting in higher token throughput. When a single-GPU deployment is limited by KV cache capacity rather than by compute, doubling the hardware can more than double throughput, because the extra memory removes the cap on batch size. This makes tensor parallelism cost-effective for high-traffic inference endpoints.
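The super-linear effect is easiest to see with arithmetic. The sketch below assumes weights are sharded evenly and that each GPU carries the same fixed overhead. The function name and all numbers (80 GB of VRAM, 60 GB of weights, 5 GB of overhead) are hypothetical and chosen only for illustration.

```python
def total_kv_cache_gb(num_gpus, vram_gb, weights_gb, overhead_gb):
    """Total KV cache space across all GPUs when weights are sharded evenly."""
    per_gpu_free = vram_gb - weights_gb / num_gpus - overhead_gb
    return max(0.0, per_gpu_free) * num_gpus

single = total_kv_cache_gb(1, vram_gb=80, weights_gb=60, overhead_gb=5)
double = total_kv_cache_gb(2, vram_gb=80, weights_gb=60, overhead_gb=5)

print(single)           # 15.0
print(double)           # 90.0
print(double / single)  # 6.0
```

In this hypothetical case, doubling the GPU count yields six times the KV cache space, because the weights no longer consume most of each device.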

Handling Quantization Constraints

Align your tensor parallel size with your model's architecture and quantization scheme. The tensor dimensions must divide cleanly by the number of parallel shards, and for block-quantized weights each shard must also be a whole number of quantization blocks. For example, if your model uses a quantization block size of 128 and sharding across 8 GPUs leaves each GPU with a layer output size of 64, the shard is smaller than one block and engine startup fails with a divisibility error. vLLM reports that the output size is not divisible by the weight quantization block size. Always verify that your chosen GPU count evenly divides the hidden dimensions of your specific model before launching your deployment.
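A pre-flight check along these lines can catch the problem before you launch. This is a sketch of the divisibility rule only: `check_tp_compatibility` is a hypothetical helper, not a vLLM function, and the layer sizes in the usage lines are made-up examples.

```python
def check_tp_compatibility(output_size, tp_size, quant_block_size):
    """Return a list of problems; an empty list means the shard sizes line up."""
    problems = []
    if output_size % tp_size != 0:
        problems.append(
            f"output size {output_size} is not divisible by tensor parallel size {tp_size}"
        )
        return problems
    shard = output_size // tp_size
    if shard % quant_block_size != 0:
        problems.append(
            f"per-GPU output size {shard} is not a multiple of "
            f"quantization block size {quant_block_size}"
        )
    return problems

# Hypothetical layer sizes, for illustration only.
print(check_tp_compatibility(512, tp_size=8, quant_block_size=128))   # one problem: shard of 64
print(check_tp_compatibility(4096, tp_size=8, quant_block_size=128))  # []
```

Run the check for every sharded layer size in your model, since a single incompatible layer is enough to stop startup.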

Monitoring Startup Logs

When you launch vLLM, pay close attention to the startup logs. The system reports the total number of tokens that can be stored in the GPU KV cache and estimates the maximum concurrency for your configured sequence length. If these numbers fall below your production requirements, you need to increase your tensor parallel size to free up more VRAM. Monitoring these logs is crucial for capacity planning and ensuring your multi-GPU setup can handle expected traffic spikes without dropping requests.
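You can sanity-check the reported numbers with a quick estimate. The sketch below uses the standard size of a KV cache entry (keys plus values, for every layer and KV head) and divides the cache capacity by the request length. The model shape, the 90 GB cache size, and the 8192-token request length are hypothetical inputs, and the function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def cache_capacity(kv_cache_bytes, bytes_per_token, tokens_per_request):
    """Return (tokens that fit in the cache, full-length requests that fit)."""
    cache_tokens = kv_cache_bytes // bytes_per_token
    return cache_tokens, cache_tokens / tokens_per_request

# Hypothetical model shape and cache size, summed across all GPUs.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
tokens, concurrency = cache_capacity(90 * 10**9, per_token, tokens_per_request=8192)

print(per_token)             # 327680
print(tokens)                # 274658
print(round(concurrency, 1)) # 33.5
```

If your own estimate and the startup log disagree by a wide margin, check the configured memory utilization and maximum sequence length before changing the parallelism settings.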

Combining Parallelism Strategies (3D Parallelism)

Tensor parallelism breaks down when forced across slow network links. Due to the constant AllReduce synchronization required after every layer computation, running tensor parallelism across different physical servers creates severe bottlenecks. The network simply cannot keep up with the GPUs.

Bridging Nodes with Pipeline Parallelism

For models that exceed the capacity of a single node, you must combine tensor parallelism with pipeline parallelism. Restrict tensor parallelism to GPUs within the same physical node connected via high-speed NVLink, and use pipeline parallelism to bridge multiple nodes over Ethernet or InfiniBand networks.

Pipeline parallelism divides the model by depth, assigning contiguous groups of layers to different devices, whereas tensor parallelism splits inside each layer. In an 80-layer model, the first node might hold layers 1 through 40, while the second node holds layers 41 through 80. Node 1 processes the forward pass for its assigned layers and sends the resulting hidden states to Node 2. Because this communication only happens at the boundary between layer groups, the bandwidth requirements are significantly lower than the constant, intensive synchronization required by tensor parallelism.
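The layer split described above can be sketched as a small helper. `assign_layers` is a hypothetical function that assumes contiguous, near-equal stages; real frameworks may balance stages differently, for example to account for embedding layers.

```python
def assign_layers(num_layers, num_stages):
    """Split layers 1..num_layers into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        stages.append((start, start + size - 1))
        start += size
    return stages

print(assign_layers(80, 2))  # [(1, 40), (41, 80)]
print(assign_layers(10, 3))  # [(1, 4), (5, 7), (8, 10)]
```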

Configuring 3D Parallelism

If you deploy a massive model across two nodes with eight GPUs each, you would set tensor_parallel_size=8 and pipeline_parallel_size=2 in your vLLM configuration. This specific configuration maximizes the high-bandwidth internal NVLink for layer computations while minimizing the data sent across the slower inter-node network.
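As a sketch of how that configuration maps onto a launch command, the helper below derives both sizes from the cluster shape and assembles the arguments for `vllm serve`. The helper name and the model identifier `your-org/your-model` are placeholders. Multi-node deployments also need a distributed runtime set up across the nodes, which is outside the scope of this example.

```python
import shlex

def build_vllm_serve_command(model, gpus_per_node, num_nodes):
    """Tensor parallel within a node, pipeline parallel across nodes."""
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("gpus_per_node and num_nodes must be positive")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(gpus_per_node),
        "--pipeline-parallel-size", str(num_nodes),
    ]

# Placeholder model name; two nodes with eight GPUs each.
command = build_vllm_serve_command("your-org/your-model", gpus_per_node=8, num_nodes=2)
print(shlex.join(command))
# vllm serve your-org/your-model --tensor-parallel-size 8 --pipeline-parallel-size 2
```

The same two values correspond to the `tensor_parallel_size` and `pipeline_parallel_size` parameters when you initialize the engine from Python.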

For massive training runs, engineering teams employ 3D parallelism. This approach combines tensor, pipeline, and data parallelism. Data parallelism duplicates the entire model pipeline across multiple worker groups, feeding each group a different slice of the training batch. This allows you to scale out to thousands of GPUs. However, pipeline bubbles, the idle time when GPUs wait for data from other pipeline stages, must be managed with scheduling techniques such as the 1F1B schedule to maintain high hardware utilization.
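One way to picture 3D parallelism is as a mapping from each global GPU rank to three coordinates. The sketch below assumes a layout in which the tensor parallel index varies fastest, so that consecutive ranks (typically GPUs in the same node) share a tensor parallel group. This ordering is an assumption for illustration, not the layout of any particular framework.

```python
def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a global rank to (dp, pp, tp) indices, with tp varying fastest."""
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp

# Hypothetical cluster: 8-way tensor, 2-way pipeline, 4-way data parallel = 64 GPUs.
print(rank_to_coords(7, 8, 2, 4))   # (0, 0, 7)
print(rank_to_coords(8, 8, 2, 4))   # (0, 1, 0)
print(rank_to_coords(63, 8, 2, 4))  # (3, 1, 7)
```

The total GPU count is always the product of the three sizes, which is a useful first check on any proposed configuration.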

Infrastructure Requirements for Multi-GPU Workloads

Underlying hardware infrastructure dictates whether a tensor parallel setup achieves peak utilization or stalls entirely on communication overhead. Software configuration can only take you so far if the physical network connecting your accelerators is inadequate.

The Hyperscaler Bottleneck

Auto-scaling large GPU clusters on hyperscaler platforms rarely works efficiently in practice. Securing eight interconnected H100s dynamically often results in allocation failures or requires expensive, long-term block reservations. Furthermore, hyperscaler pricing models make sustained training runs or 24/7 inference endpoints financially unsustainable for most scale-ups. Virtualization layers on these platforms can also introduce latency that degrades the performance of tightly coupled tensor parallel operations.

The Lyceum Technology Advantage

Lyceum Technology infrastructure supports multi-GPU workloads. By owning the underlying infrastructure across European data centers, Lyceum Technology delivers raw GPU access via SSH. Dedicated, bare-metal performance avoids the virtualization overhead of standard cloud providers. This direct access ensures that your AllReduce operations execute with the lowest possible latency.

For teams operating under strict regulatory requirements, data residency is non-negotiable. Lyceum ensures full GDPR compliance by keeping all data within EU borders. You can deploy an OpenAI-compatible inference API on sovereign infrastructure, maintaining complete control over your models and customer data while significantly reducing infrastructure expenses compared to hyperscaler list prices.

NVLink and Topology Requirements

When configuring your infrastructure, you must ensure you select instances with NVLink enabled. A cluster of eight GPUs connected only via standard PCIe will severely bottleneck your tensor parallel workloads due to limited bandwidth. The Pythia AI Scheduler developed by Lyceum Technology automatically predicts VRAM requirements and selects the optimal GPU configuration to improve job efficiency, ensuring your workloads are placed on hardware with the correct topology for maximum throughput.

Advanced Optimization Techniques

Advanced optimization techniques beyond basic tensor parallelism can further improve multi-GPU performance, especially as model architectures become more complex and context windows grow larger.

Sequence Parallelism for Memory Efficiency

For training workloads, NVIDIA Megatron Core documentation recommends using sequence parallelism in conjunction with tensor parallelism. While tensor parallelism shards the model weights, sequence parallelism partitions the activations along the sequence dimension. This drastically reduces the activation memory footprint during the forward and backward passes. This combination allows you to train on much longer context windows without triggering out-of-memory errors on your existing hardware, effectively maximizing the utility of your VRAM.
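The reason activations can be partitioned along the sequence dimension is that some operations, such as layer normalization, act on each token independently. The following sketch, in plain Python with a simplified layer norm (no learned scale or bias), shows that normalizing each shard separately gives the same result as normalizing the full sequence while each simulated rank holds only half of the activations. All names and values are illustrative.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize one token's features; no learned scale or bias in this sketch."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def shard_sequence(tokens, num_ranks):
    """Split a sequence into equal contiguous chunks, one per rank."""
    size = len(tokens) // num_ranks
    return [tokens[i * size:(i + 1) * size] for i in range(num_ranks)]

sequence = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.5], [4.0, 4.0, 1.0], [-2.0, 0.0, 2.0]]

# Single device: every token's activations live on one GPU.
full = [layer_norm(t) for t in sequence]

# Sequence parallel: each rank normalizes only its own tokens.
shards = shard_sequence(sequence, num_ranks=2)
per_rank = [[layer_norm(t) for t in shard] for shard in shards]
gathered = [t for shard in per_rank for t in shard]

print(len(shards[0]), "of", len(sequence), "tokens held per rank")
print(gathered == full)  # True
```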

Expert Parallelism for MoE Architectures

If you are working with Mixture of Experts models, you should evaluate expert parallelism. In a standard tensor parallel setup for an MoE model, the linear layers of every single expert would be sharded across all GPUs. This creates massive communication overhead. Instead, expert parallelism assigns entire experts to specific GPUs. This reduces the communication burden because the routing network only sends tokens to the specific GPUs hosting the relevant experts. This targeted routing keeps the network traffic low and the compute utilization high.
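The routing step can be sketched as grouping tokens by the GPU that hosts their assigned expert. This example assumes top-1 routing and a contiguous placement of experts on GPUs. `dispatch_tokens` is a hypothetical helper and the expert assignments are made up.

```python
from collections import defaultdict

def dispatch_tokens(token_expert_ids, num_experts, num_gpus):
    """Group token indices by the GPU that hosts each token's expert."""
    if num_experts % num_gpus != 0:
        raise ValueError("num_experts must be divisible by num_gpus")
    experts_per_gpu = num_experts // num_gpus
    by_gpu = defaultdict(list)
    for token_idx, expert_id in enumerate(token_expert_ids):
        by_gpu[expert_id // experts_per_gpu].append(token_idx)
    return dict(by_gpu)

# Six tokens, each routed to one of 8 experts spread over 4 GPUs (2 experts per GPU).
routing = dispatch_tokens([0, 5, 2, 7, 5, 1], num_experts=8, num_gpus=4)
print(routing)  # {0: [0, 5], 2: [1, 4], 1: [2], 3: [3]}
```

Each token travels only to the one GPU that holds its expert, instead of every expert's weights being sharded across all GPUs.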

Context Parallelism for Extreme Sequences

Context parallelism targets extremely long-context scenarios. It divides the input sequence across multiple GPUs, so that no single GPU has to hold the full sequence, which makes documents with millions of tokens tractable. Unlike sequence parallelism, which only partitions the activations for memory savings, context parallelism partitions the attention computation itself. Each GPU computes attention over its portion of the sequence, and the GPUs communicate to assemble the final attention output. When combined with tensor parallelism, context parallelism makes it possible to run inference on documents that would otherwise be impossible to process on a single node.
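A common way to combine attention results computed over separate chunks of the sequence is to keep a running maximum and normalizer for the softmax and rescale when merging. The sketch below shows this for a single query vector and two key/value chunks in plain Python. It omits the usual scaling by the square root of the head dimension, and the function names and data are illustrative rather than taken from any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partial_attention(q, keys, values):
    """Unnormalized attention over one chunk: (max score, normalizer, weighted sum)."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), out

def merge(a, b):
    """Combine two partial results by rescaling both to a shared maximum."""
    (m_a, l_a, o_a), (m_b, l_b, o_b) = a, b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(o_a, o_b)]

def finalize(state):
    """Divide by the softmax normalizer to get the attention output."""
    _, normalizer, out = state
    return [v / normalizer for v in out]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [-0.7, 2.0], [0.9, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

full = finalize(partial_attention(q, keys, values))

# Two simulated GPUs each hold half of the keys and values.
chunk_a = partial_attention(q, keys[:2], values[:2])
chunk_b = partial_attention(q, keys[2:], values[2:])
merged = finalize(merge(chunk_a, chunk_b))

print(all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(full, merged)))
```

Only the small per-chunk summaries need to be exchanged between GPUs in this sketch, not the full matrix of attention scores.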

Data Parallelism and ZeRO Optimization

While tensor parallelism is essential for splitting individual layers, it is often combined with Data Parallelism to scale training throughput. Understanding how these two interact is critical for setting up efficient multi-GPU training clusters.

Standard Data Parallelism

Data parallelism is the most straightforward distributed training technique. In a standard data parallel setup, the exact same model weights are duplicated across every GPU in the cluster. Each global batch is split into per-GPU shards, and each GPU processes a different shard independently. After the backward pass, the GPUs synchronize their gradients using an AllReduce operation, update their local weights, and proceed to the next step. However, standard data parallelism fails when the model weights, gradients, and optimizer states exceed the VRAM of a single GPU.
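The gradient synchronization step can be verified on a toy problem. The sketch below assumes a one-parameter linear model with a mean squared error loss and equally sized shards, and uses exact fractions so the comparison is not affected by rounding. Averaging the per-GPU gradients plays the role of the AllReduce. All names and data are illustrative.

```python
from fractions import Fraction

def mse_gradient(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

batch = [
    (Fraction(1), Fraction(2)),
    (Fraction(2), Fraction(3)),
    (Fraction(3), Fraction(7)),
    (Fraction(4), Fraction(8)),
]
w = Fraction(1, 2)

# Each simulated GPU holds the same weight and an equal share of the batch.
shards = [batch[0:2], batch[2:4]]
local_gradients = [mse_gradient(w, shard) for shard in shards]

# "AllReduce" with averaging: every GPU ends up with the same gradient.
synced = sum(local_gradients) / len(local_gradients)

print(synced == mse_gradient(w, batch))  # True
```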

Zero Redundancy Optimizer

The Zero Redundancy Optimizer (ZeRO) addresses the memory limitations of standard data parallelism without the high communication overhead of tensor parallelism. ZeRO partitions the memory footprint of the model across the data parallel ranks instead of duplicating it. ZeRO operates in three cumulative stages. Stage 1 partitions the optimizer states. Stage 2 additionally partitions the gradients. Stage 3 additionally partitions the model parameters themselves.
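The memory effect of each stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision training with Adam, counting 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for optimizer states (a 32-bit weight copy, momentum, and variance). Activations and temporary buffers are not included, and the function name is illustrative.

```python
def zero_model_state_gb(num_params, dp_size, stage):
    """Per-GPU memory for weights, gradients, and optimizer states, in GB."""
    params = 2.0 * num_params   # 16-bit weights
    grads = 2.0 * num_params    # 16-bit gradients
    optim = 12.0 * num_params   # 32-bit weight copy, momentum, variance
    if stage >= 1:
        optim /= dp_size        # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= dp_size        # Stage 2: also shard gradients
    if stage >= 3:
        params /= dp_size       # Stage 3: also shard parameters
    return (params + grads + optim) / 1e9

# Example: a 7.5B parameter model on 64 data parallel GPUs.
for stage in range(4):
    print(stage, round(zero_model_state_gb(7.5e9, dp_size=64, stage=stage), 1))
# 0 120.0
# 1 31.4
# 2 16.6
# 3 1.9
```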

Comparing ZeRO and Tensor Parallelism

ZeRO Stage 3 and tensor parallelism both allow you to train models larger than a single GPU's memory, but they do so differently. Tensor parallelism splits the mathematical operations of a layer, requiring constant, high-bandwidth communication during both the forward and backward passes. ZeRO Stage 3 keeps the layer operations intact but fetches the required weights from other GPUs just in time for the computation, discarding them afterward. Depending on your network topology, combining ZeRO Stage 1 or 2 with tensor parallelism often yields a good balance of memory savings and compute efficiency for massive models.

Managing Pipeline Bubbles with 1F1B Scheduling

When you combine tensor parallelism within a node and pipeline parallelism across nodes, you introduce a new challenge known as the pipeline bubble. Managing this idle time is crucial for maintaining high cluster utilization.

The Problem with Naive Pipelining

In a naive pipeline parallel setup, Node 1 processes a batch of data through its assigned layers and sends the output to Node 2. While Node 2 is processing that data, Node 1 sits completely idle. Node 2 then finishes the forward pass, computes the loss, and starts the backward pass. Node 1 remains idle until Node 2 sends the gradients back. This sequential dependency creates massive gaps in compute utilization, known as pipeline bubbles. If you are paying for expensive hardware, having GPUs sit idle for 50 percent of the training step is unacceptable.

Micro-Batching Strategies

Dividing the global batch into smaller micro-batches mitigates pipeline bubbles. Node 1 processes the first micro-batch and passes it to Node 2. Instead of waiting, Node 1 immediately begins processing the second micro-batch. This overlapping of computation significantly reduces the idle time. However, the standard approach, often called the GPipe schedule, still requires all forward passes to complete before any backward passes can begin, so activations for every micro-batch must be held in memory at once.

The 1F1B Schedule

The solution to this memory and utilization problem is the One Forward One Backward (1F1B) schedule. In the 1F1B schedule, once the pipeline is full, each GPU alternates between processing one forward pass for a new micro-batch and one backward pass for an older micro-batch. In this steady state every GPU stays busy, and the number of micro-batches whose activations must be held in memory is bounded by the number of pipeline stages rather than by the total number of micro-batches. The warm-up and drain phases still leave a bubble, so configuring the correct number of micro-batches relative to your pipeline stages is essential to minimize it and maximize the efficiency of your multi-GPU setup.
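The trade-off between micro-batch count, idle time, and activation memory can be estimated with two small formulas. This sketch assumes every stage takes the same time per micro-batch and ignores communication cost. The function names and the schedule labels `"gpipe"` (all forward passes first) and `"1f1b"` are illustrative.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Fraction of a training step that each pipeline stage spends idle."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

def peak_in_flight_microbatches(num_stages, num_microbatches, schedule):
    """Upper bound on micro-batches whose activations one stage must hold."""
    if schedule == "gpipe":   # all forward passes finish before any backward pass
        return num_microbatches
    if schedule == "1f1b":    # backward passes start as soon as the pipeline is full
        return min(num_stages, num_microbatches)
    raise ValueError(f"unknown schedule: {schedule}")

print(bubble_fraction(2, 1))                        # 0.5 (naive pipelining on two nodes)
print(round(bubble_fraction(2, 8), 3))              # 0.111
print(peak_in_flight_microbatches(4, 32, "gpipe"))  # 32
print(peak_in_flight_microbatches(4, 32, "1f1b"))   # 4
```

Adding micro-batches shrinks the bubble under either schedule, while 1F1B keeps the activation memory from growing along with the micro-batch count.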

Troubleshooting Multi-GPU Communication Bottlenecks

Setting up a multi-GPU tensor parallel environment is complex, and misconfigurations often manifest as severe performance degradation rather than outright crashes. Knowing how to troubleshoot communication bottlenecks is a vital skill for AI infrastructure engineers.

Identifying Topology Mismatches

The most common cause of poor tensor parallel performance is a hardware topology mismatch. Tensor parallelism requires massive bandwidth. If your GPUs are communicating over standard PCIe rather than NVLink, your throughput will plummet. You should always verify your physical interconnects before launching a workload. Tools provided by the hardware vendor can map the exact communication paths between your GPUs. If the output shows that traffic between GPU 0 and GPU 1 is routing through the host CPU rather than a direct high-speed link, you must reconfigure your hardware or adjust your tensor parallel size to group only directly connected devices.
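On NVIDIA systems, `nvidia-smi topo -m` prints a matrix of link types between GPUs, where codes of the form `NV` followed by a number indicate NVLink and codes such as `PIX`, `PXB`, `PHB`, `NODE`, and `SYS` indicate paths over PCIe or the host. The sketch below flags GPU pairs that are not NVLink-connected. The matrix is a hand-written example for a hypothetical four-GPU machine, not real command output, and the helper names are illustrative.

```python
import re

def is_nvlink(code):
    """True for link codes such as NV1, NV4, or NV18."""
    return re.fullmatch(r"NV\d+", code) is not None

def non_nvlink_pairs(topology):
    """Return (i, j) GPU pairs, with i < j, that lack a direct NVLink connection."""
    pairs = []
    for i, row in enumerate(topology):
        for j in range(i + 1, len(row)):
            if not is_nvlink(row[j]):
                pairs.append((i, j))
    return pairs

# Hand-written example matrix for a hypothetical 4-GPU machine ("X" marks self).
topology = [
    ["X",   "NV4", "SYS", "SYS"],
    ["NV4", "X",   "SYS", "SYS"],
    ["SYS", "SYS", "X",   "NV4"],
    ["SYS", "SYS", "NV4", "X"],
]

print(non_nvlink_pairs(topology))  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

In this example, a tensor parallel size of 2 that groups GPUs 0 and 1 together, and GPUs 2 and 3 together, keeps all tensor parallel traffic on NVLink.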

Diagnosing Quantization and Dimension Errors

Software misconfigurations also cause significant issues. As noted in vLLM documentation, tensor parallelism divides the hidden dimensions of your model across the GPUs. If you are using a quantized model, the layer output size must be perfectly divisible by the product of your quantization block size and your tensor parallel size. If you receive division errors during engine startup, you must either change your tensor parallel size, select a model with a different hidden dimension, or use a different quantization format.

Monitoring GPU Utilization

Finally, monitor your GPU utilization metrics closely. In a healthy tensor parallel setup, all participating GPUs should show nearly identical, high utilization rates. If you observe one GPU pegged at maximum capacity while others are frequently dropping to zero, you likely have a severe pipeline bubble, an unbalanced workload distribution, or a stalled AllReduce operation. Investigating these utilization gaps will guide you toward tuning your micro-batch sizes or adjusting your parallelization strategy.

Frequently Asked Questions

What is the difference between tensor and pipeline parallelism?

Tensor parallelism splits the computation inside individual layers across multiple GPUs, requiring constant, high-speed synchronization during computation. This demands high-bandwidth interconnects like NVLink. Pipeline parallelism divides the model layer by layer across GPUs, passing the complete output of one stage to the next. This requires less frequent communication, making it suitable for bridging multiple nodes over standard Ethernet or InfiniBand networks.

How does tensor parallelism affect KV cache?

Splitting the model weights across multiple GPUs frees up a significant amount of VRAM on each device. Because the operating system and inference engine have fixed memory overheads, this freed VRAM allows the inference engine to allocate much more memory to the KV cache. Total KV cache capacity therefore scales super-linearly with GPU count, enabling much larger batch sizes and improving overall token throughput during inference.

Can I run tensor parallelism over PCIe?

Running tensor parallelism over standard PCIe links is highly discouraged and will result in poor performance. The constant AllReduce communication required between GPUs during every layer computation will severely bottleneck the system due to PCIe bandwidth limits. You should always use high-bandwidth interconnects like NVLink to keep the GPUs saturated with compute tasks rather than waiting on network transfers.

How do I configure tensor parallelism in vLLM?

You configure it using the tensor_parallel_size parameter during engine initialization. Set this value to the number of GPUs connected via NVLink within your single node. If your model is too large and you need more GPUs than a single node provides, you must combine this setting with the pipeline_parallel_size parameter to distribute the workload across multiple physical servers.

Why do I get a quantization block size error with tensor parallelism?

Tensor parallelism divides the model's hidden dimensions across the specified number of GPUs. If the layer output size is not perfectly divisible by your quantization block size multiplied by the tensor parallel size, the engine cannot split the weights evenly. The system will throw an error to prevent corrupted outputs, requiring you to adjust your GPU count or quantization scheme.

Related Resources

/magazine/avoid-cuda-oom-large-language-model; /magazine/cuda-out-of-memory-fine-tuning-llama; /magazine/how-to-prevent-oom-errors-pytorch-training