GPU Memory Management VRAM Estimation 15 min read read

LoRA vs Full Fine-Tuning Memory Cost: VRAM Math

A technical breakdown of optimizer states, gradients, and why an 8B model fails on a 24GB GPU.

Justus Amen

Justus Amen

May 20, 2026 · GTM at Lyceum Technology

Most engineers do the same mental math when planning a fine-tuning job. An 8-billion parameter model in FP16 takes about 16GB of memory. A standard RTX 4090 has 24GB of VRAM. You load the model, start the training script, and immediately hit a CUDA Out of Memory error. This logic works perfectly for inference, but it completely collapses during training. Training an LLM requires your GPU to hold much more than the model weights alone. You have to account for gradients, optimizer states, and activations. This guide breaks down the exact memory costs of full fine-tuning versus Parameter-Efficient Fine-Tuning methods like LoRA and QLoRA.

The VRAM Equation for Full Fine-Tuning

To understand why full fine-tuning is so expensive, you need to look at what actually lives in your GPU memory during a training step. According to engineering breakdowns and documentation from Hugging Face, the true bottleneck for massive models lies in the model states and optimizer tracking.

The Hidden Bottleneck of Optimizer States

When you run a full fine-tune on an 8B model using the AdamW optimizer, your GPU must hold four distinct components simultaneously. First, you have the model weights. These are the base parameters of the model. In 16-bit precision, such as FP16 or BF16, 8 billion parameters consume roughly 16GB of VRAM. Second, you have the gradients. Every single parameter needs a gradient calculated during the backward pass of the neural network. In 16-bit precision, this requires another 16GB of memory.

Third, and most importantly, you have the optimizer states. This is the hidden killer that catches many engineers off guard. The AdamW optimizer maintains two states per parameter, specifically momentum and variance. These are typically stored in 32-bit precision, or FP32, to maintain numerical stability during the training process. That translates to 8 bytes per parameter, adding a massive 32GB of VRAM overhead just to track the optimization process.

Breaking Down the 69GB Requirement

Finally, you must account for activations. These are the intermediate tensors flowing through the network during the forward and backward passes. Depending on your batch size and sequence length, activations usually consume between 2GB and 10GB of memory.

Add it all up: 16GB for weights, 16GB for gradients, 32GB for optimizer states, and roughly 5GB for activations. This equals approximately 69GB of VRAM. When the training script initiates, the memory allocator attempts to reserve space for all these components simultaneously. If the total exceeds the physical VRAM, the process crashes immediately. Therefore, understanding this exact mathematical breakdown is the critical first step for any machine learning engineer planning a deployment. This mathematical reality is exactly why full fine-tuning an 8B model fails instantly on a standard 24GB GPU. Updating every single parameter creates a optimization problem in memory. You need at least an 80GB A100 GPU, or a multi-GPU cluster, to train an 8B model using this traditional method.

How LoRA Changes the Memory Math

Low-Rank Adaptation, commonly known as LoRA, solves the massive memory bottleneck by freezing the base model and learning only a small set of low-rank adapter layers. This fundamentally alters the hardware requirements for training large language models.

The Mechanics of Low-Rank Adaptation

Sebastian Raschka notes in his technical breakdowns that full fine-tuning is prohibitively expensive because it updates all of the model's trainable weights. LoRA bypasses this brute-force approach by freezing the original billions of parameters. Instead of changing the base weights, you inject new, trainable rank decomposition matrices into the transformer architecture. You only train these new adapter parameters, which usually total between 10 million and 50 million depending on your specific rank configuration.

Calculating the New VRAM Footprint

The VRAM equation for an 8B model changes significantly when you implement LoRA. First, the base weights still sit in memory. The frozen model consumes 16GB in FP16 precision. Second, the adapter weights are introduced. If you have 20 million parameters in BF16, they take up a mere 40MB of space. Third, you only compute gradients for the adapter layers, which costs another 40MB.

The most dramatic savings come from the optimizer states. Because the AdamW optimizer only needs to track momentum and variance for the 20 million adapter parameters, the optimizer memory footprint plummets from 32GB down to about 160MB.

By switching to LoRA, your total VRAM requirement drops from 69GB to roughly 18GB, factoring in a small amount of memory for activations. You can now successfully fine-tune an 8B model on a single 24GB GPU without triggering an Out of Memory error. This architectural shift lowers the hardware barrier for AI development. By drastically lowering the barrier to entry, smaller research teams and independent developers can experiment with frontier models. They no longer need to secure massive funding just to cover the cloud compute costs associated with basic model adaptation.

QLoRA and Paged Optimizers

Quantized LoRA, or QLoRA, further reduces memory costs by combining the architectural efficiency of LoRA with aggressive data quantization techniques. QLoRA reduces the memory footprint of the base model weights by loading them in a 4-bit quantized format, such as NormalFloat4 (NF4), while continuing to train the LoRA adapters in higher precision.

Pushing Limits with 4-Bit Quantization

For an 8B model, 4-bit quantization shrinks the base weight footprint dramatically. Instead of consuming 16GB of VRAM in 16-bit precision, the base model only requires about 4GB of memory. This massive reduction frees up critical space on the GPU for longer sequence lengths or larger batch sizes. According to the Hugging Face Blog on Parameter-Efficient Fine-Tuning, the adapters themselves remain in 16-bit precision to ensure the model actually learns effectively during the backward pass.

The Role of Paged Optimizers

However, as noted in recent machine learning engineering guides, optimizer states and activation spikes can still present a bottleneck if you scale up to larger models or process massive context windows. This is where paged optimizers come into play. Paged 8-bit AdamW acts like virtual memory for your GPU. It offloads optimizer states to your system's CPU RAM when GPU memory gets exceptionally tight.

By moving these states back and forth between the CPU and GPU seamlessly, paged optimizers prevent Out of Memory errors during memory-intensive steps, such as gradient checkpointing. With the combination of QLoRA and paged optimizers, the total VRAM required to train an 8B model drops to an 6GB. This approach is particularly valuable when working with consumer-grade hardware or older generation enterprise cards. By intelligently managing how and where memory is allocated, developers can bypass physical hardware limitations. The synergy between 4-bit quantization and paged memory management represents one of the most significant leaps in accessible machine learning engineering.

VRAM Benchmarks for Llama 3

To put this memory math into perspective, benchmarks provide perspective on the estimated memory requirements for training the Llama 3 family of models using different techniques. These figures align closely with recent benchmarks and technical documentation published by Hugging Face.

Scaling Up to Llama 3 70B

The 8B model is relatively manageable, but scaling up to the 70B parameter version introduces severe hardware constraints. If you want to fine-tune a 70B model using LoRA, you still need roughly 160GB of VRAM. That requirement mandates a multi-GPU setup, such as two 80GB A100s or an 8x RTX 4090 cluster. Full fine-tuning a 70B model demands a massive 500GB of VRAM, pushing you firmly into enterprise-grade cluster territory where high-bandwidth interconnects become mandatory.

Llama 3 405B Memory Requirements

The Llama 3 405B model represents the extreme end of the spectrum. According to Hugging Face benchmarks, full fine-tuning this massive model requires over 3.25TB of VRAM. This is not a typo; it requires terabytes of memory just to hold the weights, gradients, and optimizer states. Even with the efficiency of LoRA, training the 405B model requires approximately 950GB of VRAM.

Applying QLoRA brings the requirement down to about 250GB, which is still a formidable number that requires at least four 80GB GPUs working in tandem. These benchmarks illustrate why Parameter-Efficient Fine-Tuning is not just a cost-saving measure, but a strict technical necessity for most organizations. When planning a deployment, teams must carefully calculate these requirements against their available budget. Attempting to run a 405B model without the proper multi-node architecture will result in immediate failure. Therefore, leveraging these advanced quantization and adaptation techniques is mandatory for anyone looking to harness the power of the largest open-source models available today.

Infrastructure Strategy for Fine-Tuning

The choice between full fine-tuning and LoRA dictates your entire infrastructure strategy. Understanding the VRAM math is only the first step; applying that math to real-world deployment determines your project's budget and timeline.

Matching Hardware to the Training Method

If you are teaching a model something radically new, like a new language or a complex new reasoning modality, full fine-tuning yields the highest theoretical ceiling. But it requires massive compute. Renting multi-node clusters from hyperscalers gets expensive fast, especially for training runs that last weeks or months. You are paying a premium for the flexibility of renting, which quickly erodes your budget.

Infrastructure Strategy with Lyceum

This is where owned infrastructure provides a structural cost advantage. Lyceum provides raw GPU access via virtual machines provisioned in 18 seconds across European data centers. Because Lyceum operates its own infrastructure rather than renting from hyperscalers, you get significant price leadership. If you are adapting a model to a specific tone, format, or domain, LoRA is the pragmatic choice. You can run LoRA jobs on smaller Lyceum instances and iterate much faster, testing different hyperparameters without burning through a massive budget.

Data Privacy and Compliance

Regardless of the fine-tuning method you choose, data privacy remains a hard requirement for European teams. If you are fine-tuning models on proprietary codebases, medical records, or financial data, non-EU hosting is a strict deal-breaker. Lyceum ensures all data stays in GDPR-compliant European data centers with zero egress fees. This provides a truly sovereign environment to train and serve models, ensuring that your sensitive intellectual property never crosses borders or gets locked into restrictive hyperscaler ecosystems. By combining the memory efficiency of LoRA with the cost-effective, sovereign infrastructure provided by Lyceum, organizations can build highly specialized, production-ready models. This strategic alignment of software techniques and hardware provisioning ensures that artificial intelligence initiatives remain both technically feasible and financially sustainable over the long term.

The Impact of Sequence Length on VRAM Consumption

While model weights and optimizer states represent static memory costs, the memory required for activations scales dynamically based on your training parameters. Specifically, your batch size and sequence length play a massive role in determining whether your fine-tuning job will succeed or crash.

Why Context Windows Consume Memory

When fine-tuning models on long documents, such as legal contracts or extensive coding repositories, the activation memory can easily exceed the memory required for the model weights themselves. Every token in your sequence generates intermediate tensors during the forward pass, and these tensors must be stored in VRAM until the backward pass computes the gradients. If you attempt to fine-tune an 8B model with an 8k or 16k context window, the activation memory will spike dramatically, often causing an Out of Memory error even if you are using LoRA.

Gradient Checkpointing as a Solution

To mitigate this dynamic memory spike, machine learning engineers rely on a technique called gradient checkpointing. Instead of storing all intermediate activations for the backward pass, gradient checkpointing recomputes them on the fly. According to the Hugging Face Blog on Parameter-Efficient Fine-Tuning, this technique trades compute time for significant VRAM savings. Training will typically be 20 to 30 percent slower because the GPU has to perform extra calculations, but the memory footprint for activations is drastically reduced.

This technique is essential when pushing sequence lengths to their limits on constrained hardware. Even with LoRA's minimal optimizer footprint, a massive sequence length requires careful memory management. Understanding how sequence length impacts your hardware requirements is just as important as calculating your base model size. If you ignore the activation memory overhead, your training scripts will inevitably fail. Proper implementation of gradient checkpointing ensures that your infrastructure can handle the complex demands of modern, long-context fine-tuning tasks without needing to provision larger, more expensive GPU clusters.

Task Complexity: When to Choose LoRA vs Full Fine-Tuning

The decision to use LoRA or full fine-tuning should not be based entirely on VRAM availability. While memory constraints often force teams into using Parameter-Efficient Fine-Tuning, the actual learning objective of your project should guide your methodology.

Evaluating the Training Objective

As noted by Sebastian Raschka in his technical comparisons, the choice between these methods hinges on task complexity. If the goal of your project is domain adaptation, style transfer, or teaching the model a specific response format, LoRA is highly effective. For example, if you want an 8B model to output JSON exclusively, or you want it to adopt the tone of your company's customer service guidelines, training a small set of adapter layers is more than sufficient. In these scenarios, LoRA achieves parity with full fine-tuning while saving tens of gigabytes of VRAM.

When Full Fine-Tuning is Necessary

However, if you are teaching the model an entirely new language from scratch, or injecting vast amounts of net-new factual knowledge, full fine-tuning often yields a higher theoretical performance ceiling. Updating all billions of parameters allows the model to deeply internalize new fundamental patterns that adapter layers might struggle to capture.

Teams must weigh the marginal performance gain against the exponential increase in compute costs. Full fine-tuning an 8B model requires the 69GB VRAM infrastructure discussed earlier, meaning higher costs and longer training times. Ultimately, the decision requires a careful analysis of your specific use case. If you are simply refining how a model interacts with users, the massive compute overhead of updating every parameter is entirely unnecessary. By aligning your training methodology with your actual business objectives, you can optimize both your infrastructure spend and your deployment timelines. For the vast majority of enterprise use cases, LoRA or QLoRA provides the optimal balance of cost, speed, and model quality.

How Rank and Alpha Configurations Alter LoRA Memory

When implementing LoRA, the exact VRAM savings depend heavily on the hyperparameters you choose before starting the training script. The two most important variables in this equation are the rank parameter and the alpha parameter.

Understanding the Rank Parameter

The rank, often denoted as "r" in training scripts, determines the inner dimension of the low-rank matrices injected into the transformer layers. A rank of 8 or 16 is considered standard for most fine-tuning tasks. This configuration keeps the trainable parameter count relatively low, usually around 10 to 20 million parameters for an 8B model. However, if a team decides to increase the rank to 64 or 128 to capture more complex linguistic patterns, the number of trainable parameters scales up proportionally.

Calculating the VRAM Shift

When the trainable parameter count increases, the optimizer state memory increases as well. While a rank of 8 might require only 160MB of optimizer memory, a rank of 128 could push that requirement much higher. Additionally, the alpha parameter, which scales the learned weights, must be adjusted in tandem with the rank. While the alpha parameter itself does not directly consume extra VRAM, improper tuning can lead to unstable training.

If the training becomes unstable, engineers are forced to restart jobs, wasting valuable GPU hours and effectively increasing the overall cost of the project. Hugging Face guidelines recommend careful experimentation with these hyperparameters. Starting with a low rank and gradually increasing it allows teams to maintain the delicate balance between model quality and memory efficiency. Mastering these configurations is a core competency for modern machine learning engineers. By understanding exactly how the rank and alpha parameters influence both the learning capacity and the physical memory footprint, teams can execute highly efficient training runs. This level of technical precision prevents resource waste and accelerates the overall development lifecycle.

Frequently Asked Questions

What is the exact memory breakdown for an 8B model during full fine-tuning?

For an 8B model in FP16 precision, the base weights consume 16GB of VRAM. Gradients require another 16GB during the backward pass. The AdamW optimizer states require a massive 32GB because they store 8 bytes per parameter for momentum and variance. Activations consume an additional 2 to 10GB depending on batch size and sequence length. This totals approximately 69GB of VRAM, explaining why standard 24GB GPUs fail instantly.

How does LoRA save so much VRAM?

LoRA freezes the base model weights, meaning you do not need to calculate gradients or maintain optimizer states for the billions of base parameters. Instead, you inject and train small adapter layers, typically containing 10 to 50 million parameters. You only track gradients and optimizer metrics for these adapters, which reduces the optimizer memory footprint from tens of gigabytes down to mere megabytes.

What are paged optimizers?

Paged optimizers act like virtual memory for your GPU during intensive training runs. When GPU memory is exhausted during a training step, paged optimizers automatically offload the optimizer states to your system's CPU RAM. This prevents Out of Memory errors during unexpected memory spikes and is highly effective when combined with QLoRA to train models on constrained hardware.

How much VRAM does Llama 3 70B need for LoRA?

Fine-tuning Llama 3 70B with standard LoRA requires approximately 160GB of VRAM. This necessitates a multi-GPU cluster, such as two 80GB A100 GPUs or an array of smaller cards connected via high-bandwidth links. For comparison, full fine-tuning the same 70B model would require roughly 500GB of VRAM, pushing the workload into enterprise supercomputing territory.

Why should European teams care about infrastructure location for fine-tuning?

Fine-tuning often involves proprietary or highly sensitive datasets, such as medical records, financial data, or internal corporate codebases. To comply with GDPR and the AI Act, this data must remain strictly within the European Union. Lyceum provides owned GPU infrastructure entirely within European data centers to ensure strict data sovereignty, preventing sensitive information from crossing borders.

Is per-second billing available for fine-tuning workloads?

Yes. Lyceum offers precise per-second billing across its entire GPU infrastructure. You only pay for the exact compute time your training job requires, with no minimum commitments or hidden base fees. This is particularly advantageous for iterative LoRA fine-tuning, where developers need to run multiple short experiments to find the optimal hyperparameters without wasting budget.

Further Reading

Related Resources

/magazine/gpu-memory-calculator-deep-learning; /magazine/gpu-memory-estimation-before-training; /magazine/predict-vram-usage-pytorch-model