FP8 Training on H100: Benchmarks and Memory Savings

How 8-bit floating-point precision halves VRAM consumption and accelerates large language model training.

Caspar Lehmkühler

May 18, 2026 · Head of Product at Lyceum Technology

Large language model training is fundamentally constrained by GPU memory. A single 70B parameter model in 16-bit precision needs roughly 140GB for its weights alone, well beyond the 80GB capacity of an NVIDIA H100, forcing engineering teams into complex tensor parallelism and pipeline sharding strategies. The introduction of 8-bit floating-point precision changes this math. By halving the footprint of the weights and activations that feed the matrix multiplications, FP8 lets infrastructure leads fit more of the model onto each GPU and raise cluster utilization, even though master weights and optimizer states stay in higher precision. Understanding the technical mechanics of FP8 on the Hopper architecture reveals the real-world memory savings and the implementation details required for effective deployment.

The Evolution of Numerical Precision in Deep Learning

Deep learning precision has evolved through several generations, constantly trading a degree of numerical accuracy for computational efficiency. For years, FP32 (32-bit floating point) served as the undisputed standard for training neural networks. It stores values using 8 exponent bits and 23 mantissa bits, providing immense precision but requiring substantial memory and memory bandwidth. Memory bandwidth, rather than raw compute, often dictates the speed of large language model training. When data must be moved from high-bandwidth memory to the compute cores, smaller data types move faster. FP32 requires 4 bytes per parameter, severely limiting how fast data can be fed to the tensor cores.

The Shift to 16-Bit Formats

As models scaled into the billions of parameters, the industry shifted toward 16-bit formats. FP16 and BF16 reduced memory requirements by half. BF16, in particular, became the default for large language models because it retains the 8 exponent bits of FP32, offering the same dynamic range and preventing the catastrophic overflow issues that plagued early FP16 training runs. This transition allowed researchers to train larger models on the same hardware footprint, but the exponential growth of model sizes quickly consumed these gains.

The 8-Bit Breakthrough

FP8 takes this reduction one step further by utilizing only 8 bits total. This format cuts compute and memory requirements roughly in half compared to BF16 while maintaining production-quality accuracy. However, squeezing complex neural network representations into 8 bits requires specialized hardware and intelligent software management to prevent the model from collapsing during training. The transition to FP8 is not merely a software trick. It represents a fundamental co-design of hardware and software, spearheaded by the NVIDIA Hopper architecture. By reducing the bit width, the GPU can store twice as many parameters in the same VRAM space, drastically reducing the need for complex tensor parallelism across multiple nodes. This evolution from 32-bit to 8-bit precision highlights a broader trend in artificial intelligence infrastructure, where efficiency gains at the hardware level unlock new possibilities for model scaling and deployment.

The Mechanics of FP8 on Hopper Architecture

The NVIDIA H100 GPU introduced native hardware support for FP8 through its Hopper architecture. Unlike previous formats, FP8 is not a single specification. NVIDIA implements two distinct FP8 variants, each optimized for different phases of the training loop.

Dual Formats: E4M3 and E5M2

The first format is E4M3. This format allocates 4 bits to the exponent and 3 bits to the mantissa. It provides higher precision but a narrower dynamic range. E4M3 is utilized primarily for the forward pass, storing weights and activations where precision is critical for accurate predictions. The second format is E5M2. This format allocates 5 bits to the exponent and 2 bits to the mantissa. It sacrifices some precision to provide a wider dynamic range. E5M2 is deployed during the backward pass, where gradient values can fluctuate wildly and require a broader range to prevent underflow. This dual-format approach allows the architecture to balance the need for precision with the need for dynamic range, adapting to the specific mathematical requirements of each training phase.
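The trade-off between the two formats can be made concrete by computing their representable limits. The sketch below assumes the commonly published FP8 definitions (exponent bias 7 for E4M3 and 15 for E5M2, with E4M3 reserving only a single NaN pattern in its top exponent while E5M2 follows IEEE-style inf/NaN conventions); the function name `fp8_limits` is hypothetical.

```python
def fp8_limits(exp_bits, man_bits, bias, ieee_special_values):
    """Return (max_normal, min_normal, min_subnormal) for a small float format."""
    max_exp_field = (1 << exp_bits) - 1
    max_man_field = (1 << man_bits) - 1
    if ieee_special_values:
        # IEEE-style (E5M2): the top exponent field is reserved for inf and NaN.
        max_exp_field -= 1
    else:
        # E4M3-style: only the all-ones mantissa in the top exponent encodes NaN.
        max_man_field -= 1
    max_normal = (1 + max_man_field / (1 << man_bits)) * 2.0 ** (max_exp_field - bias)
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = min_normal / (1 << man_bits)
    return max_normal, min_normal, min_subnormal


E4M3 = fp8_limits(exp_bits=4, man_bits=3, bias=7, ieee_special_values=False)
E5M2 = fp8_limits(exp_bits=5, man_bits=2, bias=15, ieee_special_values=True)

print("E4M3 max / min subnormal:", E4M3[0], E4M3[2])  # 448.0 and 2**-9
print("E5M2 max / min subnormal:", E5M2[0], E5M2[2])  # 57344.0 and 2**-16
```

E5M2 reaches values more than a hundred times larger and smaller than E4M3, at the cost of one mantissa bit, which is why it suits gradients while E4M3 suits weights and activations.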

The Role of the Transformer Engine

Managing the transition between these formats manually would be an engineering nightmare. To solve this, the H100 pairs its FP8 Tensor Cores with the Transformer Engine, a combination of hardware support and NVIDIA software that selects the precision for each layer and operation. It rescales values into the representable range before casting them to FP8 to prevent underflow and overflow, executes the matrix multiplication on the high-speed FP8 Tensor Cores, and then applies the inverse scale to the results. This dynamic scaling lets the model closely match the convergence behavior of one trained entirely in higher precision. The Transformer Engine integrates with major deep learning frameworks to abstract away the complexity of mixed-precision management. By automating the casting and scaling processes, it allows machine learning engineers to focus on model architecture and hyperparameter tuning rather than low-level numerical stability issues. This hardware-software synergy is what makes FP8 training viable for massive foundation models.

Quantifying the Memory Savings

The primary advantage of FP8 training is the drastic reduction in VRAM consumption. Storing a model's parameters, gradients, and optimizer states requires massive memory allocation. Moving from 16-bit to 8-bit precision cuts the baseline storage requirement for weights and activations by exactly 50%.

Halving the Parameter Footprint

Internal benchmarks show that FP8 training can reduce overall memory usage significantly compared to legacy FP32 mixed precision setups. A 70-billion parameter model provides a clear example of these savings. In BF16, the weights alone consume roughly 140GB of memory. FP8 reduces this to 70GB. Weights that previously exceeded the capacity of a single 80GB H100 now fit within it, although a full training step still needs room for gradients, optimizer states, and activations, so sharding across GPUs remains necessary at this scale. Even so, the reduction fundamentally alters the infrastructure requirements for large-scale training, and the savings extend beyond the static weights.
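The weight figures above follow directly from bytes per parameter. As a minimal sketch (counting 1GB as 10^9 bytes and considering weights only; the helper name `weight_memory_gb` is hypothetical):

```python
# Storage cost of one parameter in each format, in bytes.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params_70b = 70e9
for dtype in ("FP32", "BF16", "FP8"):
    print(f"{dtype}: {weight_memory_gb(params_70b, dtype):.0f} GB")
```

For the 70B example this gives 280GB, 140GB, and 70GB respectively. Gradients, optimizer states, and activations come on top of these numbers.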

Activation Memory and Batch Size

During the forward pass, the network must also store activations in memory to calculate gradients later. FP8 roughly halves the footprint of the activations it stores in 8-bit form, and activation memory is often the largest consumer of VRAM when training with large batch sizes or long sequence lengths. These memory savings unlock several architectural advantages for machine learning engineers. With less memory consumed by weights and activations, you can increase the batch size, directly improving hardware utilization. Furthermore, models that previously required tensor parallelism across eight GPUs can often fit on four. This reduces the communication overhead between GPUs, which is frequently the primary bottleneck in distributed training. By shrinking the memory footprint, FP8 allows teams to train larger models on fewer GPUs, maximizing the return on investment for high-performance compute clusters. The extra headroom for a larger batch size or a longer sequence length, without triggering out-of-memory errors, boosts overall training efficiency and model capability.

Throughput Gains and Hardware Utilization

Memory savings directly translate to computational speed. Because FP8 values occupy less space in high-bandwidth memory (HBM3) and the L2 cache, the GPU spends less time moving data and more time executing matrix multiplications. The H100 FP8 Tensor Cores are designed to process these smaller data types at unprecedented speeds.

Accelerating Matrix Multiplications

NVIDIA's technical benchmarks demonstrate that FP8 training delivers roughly a 30% to 50% throughput improvement over BF16 baselines. For example, training a Llama 3 8B model yields a 1.30x speedup, while larger models like the 405B variant see up to a 1.53x speedup. The performance gap widens on larger models because they involve more matrix multiplications and data movement, both of which benefit substantially from the reduced memory footprint. When data is smaller, the GPU can load it into the tensor cores much faster, keeping the compute units saturated. This saturation is critical for achieving high teraFLOPS utilization during large-scale training runs.
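To translate a speedup factor into wall-clock time, divide the baseline duration by the factor. The sketch below assumes a 30-day BF16 run purely for illustration and applies the 1.30x and 1.53x figures quoted above plus a midpoint of 1.40x; `accelerated_days` is a hypothetical helper.

```python
def accelerated_days(baseline_days, speedup):
    """Wall-clock days for the same job at a given throughput speedup."""
    return baseline_days / speedup


baseline = 30  # illustrative BF16 training time in days
for speedup in (1.30, 1.40, 1.53):
    print(f"{speedup:.2f}x -> {accelerated_days(baseline, speedup):.1f} days")
```

A 1.40x speedup brings the illustrative 30-day job down to a little over 21 days, which assumes the rest of the pipeline (data loading, communication) scales with it.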

Infrastructure Optimization

Maximizing these throughput gains requires infrastructure that does not bottleneck the GPUs. High-performance clusters with 18-second VM provisioning and per-second billing allow infrastructure leads to spin up H100 nodes, test FP8 scaling recipes, and scale down immediately. Lyceum Technology provides owned GPU infrastructure across European data centers, ensuring that teams have the raw compute power needed for intensive training runs while the Pythia AI Scheduler optimizes workload placement for high GPU utilization. By combining the hardware acceleration of the H100 with highly optimized cloud infrastructure, organizations can drastically reduce the time required to train foundation models. The synergy between FP8 precision and bare-metal performance ensures that every clock cycle is used efficiently, translating technical benchmarks into real-world business value.

Implementation Challenges and Scaling Recipes

Adopting FP8 introduces specific engineering challenges. The reduced dynamic range of 8-bit floats makes models highly susceptible to numerical instability. If gradient values fall below the representable range of E5M2, they underflow to zero, effectively halting the learning process for those parameters.
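A small numerical sketch illustrates the underflow problem. It assumes the smallest positive E5M2 value is 2^-16 and that conversion rounds to nearest, so magnitudes below half of that value become zero; the gradient value and the scale factor of 2^10 are made up for illustration.

```python
E5M2_MIN_SUBNORMAL = 2.0 ** -16  # smallest positive E5M2 value (assumed)


def flushes_to_zero(value, min_subnormal=E5M2_MIN_SUBNORMAL):
    """True if round-to-nearest would turn this value into zero."""
    return abs(value) < min_subnormal / 2


grad = 1e-6          # illustrative tiny gradient
scale = 2.0 ** 10    # illustrative scaling factor applied before the cast

print(flushes_to_zero(grad))          # True: the raw gradient is lost
print(flushes_to_zero(grad * scale))  # False: the scaled gradient survives
```

Multiplying by a scale factor before the cast, and dividing it back out afterwards, is what keeps such gradients alive.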

Managing Numerical Instability

When implementing FP8, engineering teams must avoid common pitfalls. First, you cannot quantize everything. Certain operations, such as LayerNorm and softmax, are highly sensitive to precision loss and must remain in FP32 or BF16. Second, the optimizer must maintain a high-precision FP32 copy of the master weights. Gradient updates during training are often extremely small. If the master weights were stored in FP8, these tiny updates would be lost due to the limited precision. Keeping a master copy in FP32 ensures that small gradient accumulations are accurately recorded.
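The need for a high-precision master copy can be shown with a toy accumulation. This sketch simulates only the 3-bit mantissa of E4M3 (it ignores the exponent range and any scaling) and uses a Python float as a stand-in for the FP32 master weight; `round_to_mantissa` is a hypothetical helper.

```python
import math


def round_to_mantissa(x, man_bits):
    """Round x to `man_bits` explicit mantissa bits (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)         # x = m * 2**e with 0.5 <= |m| < 1
    steps = 1 << (man_bits + 1)  # implicit leading bit plus explicit bits
    return math.ldexp(round(m * steps) / steps, e)


update = 1e-3        # illustrative per-step weight update
fp8_weight = 1.0     # weight stored and updated directly in 8-bit precision
master_weight = 1.0  # high-precision master copy

for _ in range(1000):
    fp8_weight = round_to_mantissa(fp8_weight + update, man_bits=3)
    master_weight += update

# Cast the master copy down only when the low-precision view is needed.
fp8_view = round_to_mantissa(master_weight, man_bits=3)
print(fp8_weight, round(master_weight, 6), fp8_view)
```

The weight updated directly in 8-bit precision never leaves 1.0, because each update is smaller than the spacing between representable values, while the master copy accumulates to about 2.0 and casts down to 2.0.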

The Necessity of Delayed Scaling

To mitigate the risks of underflow and overflow, frameworks like NVIDIA NeMo use advanced scaling recipes. The most common approach is delayed scaling. The system tracks the maximum absolute values of tensors over several iterations and calculates a scaling factor. This factor shifts the tensor values into the safe representable range of the FP8 format before the matrix multiplication occurs. After the operation, the results are scaled back down to their original magnitude. Delayed scaling is computationally efficient because it uses historical data to compute the scaling factor, avoiding the overhead of calculating the maximum value on the fly for every single operation. Implementing these recipes correctly is crucial for maintaining model convergence. Without proper scaling, the network will quickly diverge, rendering the training run useless. Engineering teams must carefully validate their scaling configurations before committing to a large-scale FP8 training job.
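The bookkeeping behind delayed scaling can be sketched in a few lines of plain Python. This is an illustrative model, not the NeMo or Transformer Engine implementation: it assumes an E4M3 maximum of 448, takes the largest value in a short history window, and uses hypothetical names (`DelayedScaler`, `scale_and_clamp`).

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value (assumed)


class DelayedScaler:
    """Derives the scale from amax values recorded in earlier iterations."""

    def __init__(self, history_len=4, fp8_max=E4M3_MAX):
        self.fp8_max = fp8_max
        self.amax_history = deque(maxlen=history_len)

    def scale(self):
        # No history yet (or all zeros): fall back to a neutral scale.
        if not self.amax_history or max(self.amax_history) == 0:
            return 1.0
        return self.fp8_max / max(self.amax_history)

    def record(self, values):
        self.amax_history.append(max(abs(v) for v in values))


def scale_and_clamp(values, scaler):
    """Scale with the historical factor, clamp to the FP8 range, then record amax."""
    s = scaler.scale()
    limit = scaler.fp8_max
    scaled = [max(-limit, min(limit, v * s)) for v in values]
    scaler.record(values)  # this tensor only influences later iterations
    return scaled, s


scaler = DelayedScaler(history_len=2)
_, s1 = scale_and_clamp([0.5, -2.0], scaler)     # empty history -> scale 1.0
out2, s2 = scale_and_clamp([1.0, 4.0], scaler)   # scale from amax 2.0 -> 224.0
_, s3 = scale_and_clamp([0.1, -0.2], scaler)     # scale from amax 4.0 -> 112.0
print(s1, s2, s3, out2)
```

The second call shows the cost of relying on history: the value 4.0 exceeds the previously observed maximum, so after scaling it saturates at 448. This is one reason scaling configurations should be validated before a long run.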

Optimizing the Software Stack for FP8

Hardware capabilities are only half of the equation. To fully leverage FP8 memory savings, your software stack must be explicitly configured to utilize the Transformer Engine. The open-source ecosystem has rapidly adapted to support these lower-precision formats, but configuration requires precision.

Framework Integration

The NVIDIA Transformer Engine provides a PyTorch API. Developers build the model from Transformer Engine modules, such as its drop-in replacement for the linear layer, and wrap the forward pass in the library's FP8 autocast context manager. When enabled, those modules execute their matrix multiplications on FP8 Tensor Cores, while standard PyTorch layers outside the library keep running in their original precision. However, achieving maximum memory reduction requires integrating advanced memory management techniques alongside FP8. The Transformer Engine handles the complex casting between E4M3 and E5M2 seamlessly, but the surrounding software architecture must be optimized to prevent bottlenecks. Data loading pipelines, for example, must be fast enough to keep up with the accelerated matrix multiplications.
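A minimal sketch of this integration is shown below. It assumes the Transformer Engine PyTorch API as commonly documented (`te.Linear`, `te.fp8_autocast`, and the `DelayedScaling` recipe with `Format.HYBRID`, which uses E4M3 in the forward pass and E5M2 in the backward pass); names and defaults may differ between library versions, and the function `fp8_forward_sketch` and its sizes are hypothetical. It needs an FP8-capable GPU and the `transformer_engine` package, so the imports are deferred until the function is called.

```python
def fp8_forward_sketch(hidden=768, batch=16):
    """Run one FP8 forward/backward pass through a Transformer Engine layer."""
    # Imported lazily: requires PyTorch, Transformer Engine, and an FP8-capable GPU.
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # HYBRID = E4M3 for forward tensors, E5M2 for gradients.
    recipe = DelayedScaling(
        fp8_format=Format.HYBRID,
        amax_history_len=16,
        amax_compute_algo="max",
    )

    # A Transformer Engine layer, used in place of torch.nn.Linear.
    layer = te.Linear(hidden, hidden, bias=True)
    x = torch.randn(batch, hidden, device="cuda")

    # Only the forward pass is wrapped; backward runs outside the context.
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = layer(x)
    out.sum().backward()
    return tuple(out.shape)
```

The dimensions are chosen as multiples of 16, which FP8 kernels generally expect. Layers that should stay in higher precision are simply left as regular PyTorch modules.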

Combining FP8 with Distributed Training

Combining FP8 with Fully Sharded Data Parallel (FSDP) or DeepSpeed ZeRO allows teams to distribute the FP32 master weights and optimizer states across multiple GPUs. While FP8 halves the memory footprint of the tensors used in the forward and backward passes, the optimizer states remain a significant memory burden. By sharding these high-precision states, you prevent any single GPU from running out of memory, enabling the training of models exceeding 100 billion parameters on standard H100 clusters. Open-stack transparency is critical here. Building on open frameworks, such as PyTorch FSDP and DeepSpeed for training and vLLM or NVIDIA Dynamo for serving the resulting model, keeps scripts portable and avoids lock-in to black-box proprietary engines. This flexibility allows engineering teams to experiment with different distributed training strategies, finding the optimal balance between memory consumption, communication overhead, and raw throughput for their specific model architecture.
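A back-of-the-envelope calculation shows why sharding the high-precision states matters. The sketch assumes an Adam-style optimizer holding an FP32 master weight plus two FP32 moment estimates, so 12 bytes per parameter, spread evenly across GPUs; the 64-GPU cluster size and the helper name are illustrative assumptions.

```python
def optimizer_state_gb_per_gpu(num_params, num_gpus, bytes_per_param=12):
    """FP32 master weights + two Adam moments, sharded evenly (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / num_gpus / 1e9


params_100b = 100e9
print(optimizer_state_gb_per_gpu(params_100b, num_gpus=1))   # unsharded total
print(optimizer_state_gb_per_gpu(params_100b, num_gpus=64))  # per-GPU share
```

Under these assumptions a 100B-parameter model carries 1,200GB of optimizer state in total, far beyond any single 80GB GPU, but under 19GB per GPU once sharded across 64 devices.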

Transitioning from BF16 to FP8: A Practical Framework

Moving a production training pipeline from BF16 to FP8 requires a systematic approach to ensure model convergence remains stable. Engineering teams should not flip the switch on a massive training run without validating the numerical stability of the new precision format.

Establishing a BF16 Baseline

Start by running a small-scale baseline training job in BF16 and recording the loss curve, gradient norms, and validation metrics. This baseline serves as the ground truth for your model's expected behavior. It is crucial to capture detailed metrics during this phase, as subtle deviations in the loss curve can indicate underlying numerical issues when transitioning to lower precision. Next, enable FP8 using the default delayed scaling recipe provided by the Transformer Engine.

Validating FP8 Convergence

Run the exact same job and overlay the loss curves. In a properly configured setup, the FP8 loss curve should closely track the BF16 baseline, with only small step-to-step differences. If you observe divergence or sudden spikes in the loss, it typically indicates that a sensitive operation, such as a custom activation function or a specific normalization layer, is being incorrectly cast to FP8. After confirming stability on a small scale, gradually increase the batch size. Monitor the VRAM consumption using standard profiling tools like NVIDIA Nsight Systems. You should observe a distinct drop in memory usage, allowing you to push the batch size higher than what was possible in BF16. Finally, evaluate the throughput in tokens per second to quantify the exact speedup on your specific architecture. This methodical validation ensures that you capture the memory savings and throughput gains of FP8 without compromising the integrity of your foundation model. Rushing this process can lead to degraded model quality, negating the computational benefits of the H100 architecture.
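The overlay comparison can be automated with a simple check on the logged losses. The sketch below is one possible criterion, not a standard: it flags a run when the largest relative gap to the baseline exceeds a tolerance. The 2% tolerance, the function names, and the loss values are all made up for illustration.

```python
def max_relative_gap(baseline, candidate):
    """Largest per-step relative difference between two loss curves."""
    if len(baseline) != len(candidate):
        raise ValueError("loss curves must cover the same steps")
    return max(abs(c - b) / abs(b) for b, c in zip(baseline, candidate))


def tracks_baseline(baseline, candidate, tolerance=0.02):
    """True if the candidate stays within `tolerance` of the baseline at every step."""
    return max_relative_gap(baseline, candidate) <= tolerance


# Illustrative loss values logged at the same steps.
bf16_loss = [2.50, 2.10, 1.80, 1.60]
fp8_stable = [2.51, 2.11, 1.81, 1.61]
fp8_spike = [2.51, 2.11, 2.40, 1.61]

print(tracks_baseline(bf16_loss, fp8_stable))  # True
print(tracks_baseline(bf16_loss, fp8_spike))   # False: spike at the third step
```

In practice the tolerance should reflect the run-to-run noise you already see between two BF16 runs with different seeds.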

The Economics of FP8 Training

The combination of a sharply reduced memory footprint and roughly 40% higher throughput fundamentally alters the unit economics of AI training. By fitting larger models on fewer GPUs and completing training runs faster, teams can drastically reduce their compute spend.

Reducing Compute Spend

A training job that previously required 30 days on a massive cluster can now be completed in three weeks on a smaller footprint. This acceleration translates directly into lower infrastructure costs and faster time to market for new AI products. Traditional cloud pricing structures often negate these savings. Hyperscaler GPU pricing remains unsustainable for weeks-long training runs, and mandatory block reservations force startups to pay for idle compute. When locked into long-term contracts, the efficiency gains of FP8 do not always result in lower monthly bills.

Sovereign Infrastructure

Lyceum Technology addresses this challenge with a structural cost advantage. Because it owns its infrastructure, teams access H100 VMs at competitive rates compared to hyperscaler list prices. There are no egress fees, and per-second billing ensures you pay strictly for exact usage. This model complements the speed of FP8 training, allowing you to spin down resources the moment your accelerated training job completes. For European teams, compliance is just as critical as cost. The platform provides EU-native infrastructure with GDPR compliance and data residency. All data stays in European data centers, providing a secure, sovereign environment for training proprietary models on sensitive datasets. By combining the technical efficiency of FP8 with the economic efficiency of owned infrastructure, AI startups can scale their training operations without exhausting their runway. This holistic approach to AI infrastructure ensures that technical breakthroughs at the hardware level deliver tangible business outcomes.

Frequently Asked Questions

How does FP8 affect the KV cache during training?

Strictly speaking, the KV cache is an inference-time structure. During fine-tuning or training with long context windows, the comparable cost is the key and value activations that attention layers save for the backward pass, and these consume significant memory. Storing them in FP8 halves their footprint compared to 16-bit formats like BF16. This reduction gives machine learning engineers room to lengthen sequences or increase the batch size without triggering out-of-memory errors, improving the efficiency of processing long documents.

Do I need to change my hyperparameters for FP8 training?

You typically do not need to alter learning rates, weight decay, or batch sizes when switching from BF16 to FP8. The Transformer Engine handles the precision casting and scaling under the hood. However, you may choose to increase your batch size to take advantage of the freed memory.

Why are master weights still kept in FP32?

Gradient updates during the training process are often extremely small. If the master weights were stored in FP8, these tiny numerical updates would be lost entirely due to the limited precision of the 8-bit format, preventing the model from learning. Keeping a master copy in FP32 ensures that small gradient accumulations are accurately recorded and applied over time.

What software frameworks support FP8 training?

FP8 training is natively supported by the NVIDIA Transformer Engine, which integrates directly with PyTorch via an autocast context manager. Furthermore, major distributed training frameworks like NVIDIA NeMo, Megatron-LM, and DeepSpeed have built-in support for FP8 mixed-precision recipes, allowing teams to deploy delayed scaling and advanced memory management techniques out of the box.

How does Lyceum Technology support FP8 workloads?

Lyceum Technology provides bare-metal access and virtual machines powered by NVIDIA H100 GPUs across European data centers. With 18-second provisioning, per-second billing, and full support for the NVIDIA software stack, teams can deploy FP8 training runs instantly. This ensures high throughput while maintaining strict GDPR compliance and absolute data sovereignty.

Related Resources

/magazine/gpu-utilization-too-low-how-to-fix; /magazine/pytorch-memory-profiler-production; /magazine/gradient-checkpointing-memory-savings