10 min read

ZeRO-3 vs FSDP: A Deep Dive into Memory Efficiency for LLMs

Maximilian Niroomand

February 23, 2026 · CTO & Co-Founder at Lyceum Technologies


Training modern large language models (LLMs) has fundamentally changed the requirements for GPU memory management. Traditional Distributed Data Parallel (DDP) methods replicate the entire model state on every GPU, which quickly leads to Out-of-Memory (OOM) errors as model parameters exceed the capacity of a single card. To solve this, engineers turn to sharding techniques like Microsoft's DeepSpeed ZeRO-3 and PyTorch's Fully Sharded Data Parallel (FSDP). Both frameworks aim to reduce the memory footprint by distributing parameters, gradients, and optimizer states across the cluster. However, choosing between them requires a nuanced understanding of their communication patterns, offloading capabilities, and integration complexity. This article explores the technical trade-offs to help you maximize efficiency on sovereign GPU infrastructure.

The Memory Wall and the Need for Sharding

The primary bottleneck in training large-scale models is the memory footprint of the model states. For a model with 7 billion parameters, the weights alone occupy approximately 14 GB in half-precision (FP16/BF16). However, the total memory required during training is significantly higher. Optimizer states, such as those used in Adam, require an additional 12 bytes per parameter (4 bytes for the master weight copy, 4 bytes for the momentum, and 4 bytes for the variance). Gradients add another 2 to 4 bytes per parameter. Summing these up, the model states alone for a 7B model occupy roughly 112 GB (about 16 bytes per parameter); once activations and temporary buffers are added, the footprint climbs even higher, far exceeding the capacity of common GPUs like the A100 40GB.
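The arithmetic above can be captured in a quick back-of-the-envelope helper. The per-component byte counts follow the mixed-precision Adam layout just described; exact figures vary by framework and precision settings:

```python
def training_memory_gb(n_params: float,
                       weight_bytes: int = 2,   # FP16/BF16 weights
                       grad_bytes: int = 2,     # FP16/BF16 gradients
                       optim_bytes: int = 12) -> dict:
    """Rough per-component model-state memory (GB, decimal) for
    mixed-precision Adam. Excludes activations and temporary buffers,
    which are workload-dependent."""
    gb = 1e9
    return {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        # 4 B master copy + 4 B momentum + 4 B variance per parameter
        "optimizer": n_params * optim_bytes / gb,
        "total": n_params * (weight_bytes + grad_bytes + optim_bytes) / gb,
    }

# A 7B model needs ~112 GB for model states alone, before activations.
print(training_memory_gb(7e9)["total"])  # → 112.0
```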

Why Standard DDP Falls Short

Standard Distributed Data Parallel (DDP) is inefficient because it replicates these states across every GPU in the cluster. If you have 8 GPUs, you are essentially wasting 7/8ths of your total aggregate memory on redundant data. This redundancy prevents researchers from scaling to larger models without resorting to complex model parallelism or pipeline parallelism. ZeRO (Zero Redundancy Optimizer) was introduced to eliminate this redundancy by sharding the states. By distributing the data, the memory requirement per GPU scales as 1/N, where N is the number of GPUs. This allows for training models that are orders of magnitude larger than what a single device could hold. Understanding how ZeRO-3 and FSDP implement this sharding is critical for any team moving beyond hyperscaler credits and into production-grade AI development on platforms like Lyceum, where hardware selection is automated for cost and performance.
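The 1/N scaling can be made concrete per ZeRO stage. This sketch follows the standard byte accounting from the previous section (2 B weights, 2 B gradients, 12 B optimizer states per parameter); real footprints also include activations and buffers:

```python
def per_gpu_state_gb(n_params: float, n_gpus: int, stage: int = 0) -> float:
    """Per-GPU model-state memory (GB) under ZeRO sharding stages.

    Stage 0 is plain DDP (full replication); stage 1 shards optimizer
    states, stage 2 additionally shards gradients, stage 3 (and FSDP's
    full-shard mode) shards everything including the parameters."""
    w, g, o = 2, 2, 12
    if stage == 0:
        per_param = w + g + o                # everything replicated
    elif stage == 1:
        per_param = w + g + o / n_gpus       # optimizer states sharded
    elif stage == 2:
        per_param = w + (g + o) / n_gpus     # + gradients sharded
    else:
        per_param = (w + g + o) / n_gpus     # + parameters sharded
    return n_params * per_param / 1e9

# 7B model on 8 GPUs: 112 GB replicated vs 14 GB fully sharded.
print(per_gpu_state_gb(7e9, 8, stage=0))  # → 112.0
print(per_gpu_state_gb(7e9, 8, stage=3))  # → 14.0
```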

DeepSpeed ZeRO-3: Architecture and Mechanics

DeepSpeed ZeRO-3 is the most advanced stage of the Zero Redundancy Optimizer. It builds upon Stage 1 (sharding optimizer states) and Stage 2 (sharding gradients) by also sharding the model parameters themselves. In a ZeRO-3 configuration, no single GPU holds the full set of weights for any given layer during the entire training step. Instead, parameters are fetched from other GPUs just-in-time for the forward and backward passes and then discarded (released) immediately after use. This process is managed through a series of collective communication calls, specifically all-gather and reduce-scatter operations.
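A toy simulation (not DeepSpeed code) illustrates why this gather-compute-release loop keeps peak memory low: each rank permanently holds only its 1/N shard, plus the fully gathered parameters of whichever layer is currently computing:

```python
def zero3_peak_params(layer_sizes: list[int], n_gpus: int) -> int:
    """Simulate ZeRO-3's just-in-time parameter handling for one
    forward pass. Returns the peak number of parameter elements
    resident on a single rank."""
    shard = sum(layer_sizes) // n_gpus   # permanent 1/N local shard
    peak = shard
    for size in layer_sizes:
        gathered = shard + size          # all-gather this layer's params
        peak = max(peak, gathered)       # layer computes here
        # parameters are released immediately after use,
        # so resident memory drops back to `shard`
    return peak

# Four equal layers on 8 ranks: peak is one shard plus ONE full layer,
# far below holding all four layers at once.
print(zero3_peak_params([1000] * 4, 8))  # → 1500
```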

One of the defining features of ZeRO-3 is its integration with DeepSpeed's ecosystem, which includes advanced features like ZeRO-Infinity. This allows for offloading model states not just to CPU memory, but also to NVMe storage, effectively enabling the training of trillion-parameter models on limited GPU resources. The framework uses a sophisticated prefetching mechanism to hide the latency of these data transfers. By predicting which parameters will be needed next, ZeRO-3 can begin the all-gather operation while the current layer is still computing. This overlapping of communication and computation is vital for maintaining high hardware utilization. For teams operating in EU-sovereign environments like Lyceum's Berlin and Zurich regions, ZeRO-3 provides a robust, battle-tested way to handle massive datasets while ensuring that the orchestration layer handles the underlying hardware complexity seamlessly.
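A configuration sketch shows how these pieces fit together. Key names follow DeepSpeed's documented `zero_optimization` schema, but the paths and bucket sizes here are illustrative placeholders, not tuned values; check the DeepSpeed docs for your version before copying:

```python
import json

# Sketch of a ZeRO-3 config with ZeRO-Infinity NVMe offload and
# prefetching enabled. Values are placeholders for illustration.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,                # overlap all-gather with compute
        "contiguous_gradients": True,
        "stage3_prefetch_bucket_size": 5e8,  # prefetch upcoming parameters
        "stage3_max_live_parameters": 1e9,   # cap gathered params resident at once
        "offload_optimizer": {"device": "nvme",
                              "nvme_path": "/local_nvme",
                              "pin_memory": True},
        "offload_param": {"device": "nvme",
                          "nvme_path": "/local_nvme"},
    },
}

# Serialize for use as a deepspeed_config.json file.
config_json = json.dumps(ds_config, indent=2)
print(ds_config["zero_optimization"]["stage"])  # → 3
```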

PyTorch FSDP: The Native Sharding Alternative

PyTorch FSDP (Fully Sharded Data Parallel) is the framework-native counterpart to ZeRO-3, built directly into `torch.distributed`: it flattens parameters into shards within user-defined wrapping units and all-gathers them on demand during the forward and backward passes, mirroring ZeRO-3's just-in-time approach. When comparing the memory efficiency of ZeRO-3 and FSDP, the theoretical savings are identical, as both follow the same mathematical principles of sharding. The memory footprint for model states on each GPU is reduced from $O(M)$ to $O(M/N)$, where $M$ is the model size and $N$ is the number of GPUs. However, practical differences arise in how each framework handles auxiliary memory, such as buffers, temporary tensors, and fragmentation. DeepSpeed ZeRO-3 often has a slight edge in extremely constrained environments due to its highly optimized memory allocator and its ability to aggressively offload almost everything to the CPU or NVMe.

FSDP, on the other hand, can sometimes suffer from higher memory fragmentation if the `auto_wrap_policy` is not configured correctly. If the wrapped units are too large, the temporary memory required to hold the gathered parameters during computation can cause OOM errors. Conversely, if the units are too small, the communication overhead becomes the dominant factor, leading to poor GPU utilization. Lyceum's platform addresses this by providing precise predictions of memory footprints before jobs run, allowing engineers to fine-tune their FSDP wrapping policies or ZeRO-3 configurations without the usual trial-and-error. In many benchmarks, FSDP shows slightly better performance in terms of TFLOPS per GPU when the model fits comfortably within the aggregate VRAM of the cluster, while ZeRO-3 remains the king of 'over-subscription' where the model size vastly exceeds the total VRAM.
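The granularity trade-off can be quantified with a small helper. This is an illustrative model only: real peaks also include activations, buffers, and allocator fragmentation:

```python
def fsdp_peak_and_calls(unit_sizes: list[int], n_gpus: int,
                        bytes_per_param: int = 2) -> tuple[float, int]:
    """Return (peak gathered-parameter memory in GB, all-gathers per pass).

    Each wrapped unit is fully gathered while it computes, so the
    transient peak scales with the LARGEST unit, while communication-call
    count scales with the NUMBER of units."""
    shard_gb = sum(unit_sizes) * bytes_per_param / n_gpus / 1e9
    peak_gb = shard_gb + max(unit_sizes) * bytes_per_param / 1e9
    return peak_gb, len(unit_sizes)

# One giant 7B wrapped unit vs 32 fine-grained units, on 8 GPUs:
coarse = fsdp_peak_and_calls([7_000_000_000], 8)
fine = fsdp_peak_and_calls([7_000_000_000 // 32] * 32, 8)
print(coarse)  # huge transient peak, but a single all-gather
print(fine)    # small peak, but 32 all-gathers of overhead
```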

Communication Overhead and Scaling Laws

The cost of memory efficiency is increased communication. Both ZeRO-3 and FSDP increase the total volume of data moved across the network by approximately 50% compared to standard DDP. Specifically, they require an all-gather operation to collect parameters before computation and a reduce-scatter operation to synchronize and shard gradients after computation. This makes the performance of these frameworks highly dependent on the interconnect bandwidth of the cluster. In a multi-node setup, the bottleneck is almost always the inter-node network (e.g., InfiniBand or 100GbE+ Ethernet).
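The "approximately 50%" figure follows from counting per-step traffic in multiples of the model size $P$, using the standard bandwidth-term accounting for ring collectives:

```python
def comm_volume_ratio() -> float:
    """Per-step traffic in multiples of model size P (bandwidth terms only).

    DDP:          one all-reduce of gradients       ~ 2P
                  (implemented as reduce-scatter + all-gather)
    ZeRO-3/FSDP:  all-gather params (forward)       ~ 1P
                  all-gather params (backward)      ~ 1P
                  reduce-scatter gradients          ~ 1P
    """
    ddp = 2.0
    sharded = 1.0 + 1.0 + 1.0
    return sharded / ddp

print(comm_volume_ratio())  # → 1.5, i.e. ~50% more traffic than DDP
```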

ZeRO-3 manages this overhead through its 'communication overlap' features, which are highly tunable via the `deepspeed_config.json`. Parameters like `overlap_comm` and `allgather_bucket_size` allow engineers to balance the size of communication chunks. FSDP uses a similar concept with its `limit_all_gathers` and `backward_prefetch` settings. In practice, FSDP's integration with the PyTorch autograd engine can sometimes lead to more efficient overlapping of the backward pass. However, DeepSpeed's long history of optimization for massive clusters often gives it the advantage in complex, multi-node environments. For European scaleups using Lyceum, the absence of egress fees and the use of high-performance local networking in Berlin and Zurich data centers mitigate some of these communication costs, making both frameworks viable for large-scale training.
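FSDP's knobs can be sketched as constructor keyword arguments. They are shown here as a plain dict because instantiating `FullyShardedDataParallel` requires an initialized process group; the key names come from `torch.distributed.fsdp`, and `backward_prefetch` is shown as a string stand-in for the `BackwardPrefetch.BACKWARD_PRE` enum value:

```python
# Sketch of FSDP tuning kwargs (names from torch.distributed.fsdp).
fsdp_kwargs = {
    # Throttle concurrent all-gathers to cap transient gathered memory.
    "limit_all_gathers": True,
    # BACKWARD_PRE prefetches the next unit's parameters while the
    # current unit's gradients are still being computed.
    "backward_prefetch": "BACKWARD_PRE",
    # Prefetch the next unit's all-gather during the forward pass too.
    "forward_prefetch": True,
}

print(fsdp_kwargs["backward_prefetch"])  # → BACKWARD_PRE
```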

CPU Offloading and NVMe Integration

When GPU memory is simply not enough, offloading to system RAM (CPU) or disk (NVMe) becomes necessary. DeepSpeed ZeRO-3 is the pioneer in this space. Its ZeRO-Offload and ZeRO-Infinity modules allow for the optimizer states and parameters to reside in CPU memory, only moving to the GPU for the actual computation. This can increase the effective capacity of a single GPU by 10x or more. The trade-off is a significant hit to training speed, as the PCIe bus becomes the bottleneck. However, for fine-tuning large models on a budget, this is an invaluable feature.
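A rough transfer-time estimate shows why the PCIe bus dominates. The bandwidth figure is an assumption (around 25 GB/s effective for PCIe 4.0 x16, below the ~32 GB/s theoretical peak), and real steps may overlap transfers with compute:

```python
def offload_transfer_seconds(state_gb: float, pcie_gbps: float = 25.0) -> float:
    """Seconds to move `state_gb` of offloaded state over PCIe, one way.

    25 GB/s is an optimistic effective rate for PCIe 4.0 x16; halve it
    for older hosts. Un-overlapped steps pay this cost in both directions."""
    return state_gb / pcie_gbps

# Shuttling a 7B model's ~84 GB of Adam states costs seconds per step
# if not overlapped: this is the speed-for-capacity trade-off.
print(round(offload_transfer_seconds(84.0), 2))  # → 3.36
```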

PyTorch FSDP also supports CPU offloading through the `cpu_offload` parameter in the `FullyShardedDataParallel` constructor. While functional, it is generally less feature-rich than DeepSpeed's implementation. For example, FSDP's offloading is primarily focused on the parameters and gradients, whereas DeepSpeed can offload the entire optimizer logic to the CPU (using highly optimized CPU kernels for Adam). This means that with DeepSpeed, the CPU not only stores the data but also performs the weight updates, freeing up the GPU to focus entirely on the forward and backward passes. For teams using Lyceum's auto hardware selection, the platform can help determine if a high-VRAM GPU (like an H100) is more cost-effective than a lower-spec GPU paired with aggressive DeepSpeed offloading, based on the total cost of compute (TCC).
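The high-VRAM-versus-offload decision reduces to cost per optimizer step. All prices and step times below are hypothetical placeholders, not real benchmarks or Lyceum pricing; they only illustrate the shape of the comparison:

```python
def cost_per_step(gpu_hourly_usd: float, step_seconds: float) -> float:
    """USD per optimizer step: the unit behind a total-cost-of-compute
    (TCC) comparison between hardware configurations."""
    return gpu_hourly_usd * step_seconds / 3600

# Hypothetical numbers: a faster, pricier GPU can still win on cost per
# step if aggressive offloading slows the cheaper GPU down enough.
h100_no_offload = cost_per_step(gpu_hourly_usd=4.00, step_seconds=1.0)
a100_with_offload = cost_per_step(gpu_hourly_usd=1.50, step_seconds=4.5)
print(h100_no_offload < a100_with_offload)  # → True
```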

Related Resources

/magazine/gpu-utilization-too-low-how-to-fix
/magazine/pytorch-memory-profiler-production
/magazine/gradient-checkpointing-memory-savings