
Kubernetes GPU Node Setup for ML: Stop Wasting 95% of Your Compute

A practical guide to configuring NVIDIA device plugins, preventing OOM errors, and maximizing cluster utilization for AI workloads.

Maximilian Niroomand

May 26, 2026 · CTO & Co-Founder at Lyceum Technology

Setting up Kubernetes for machine learning requires more than getting pods to run; it also requires strict control over unit economics. The default Kubernetes scheduler treats GPUs as indivisible integers. If an ML engineer requests a GPU for a lightweight inference service, that entire chip is locked, even if the workload only consumes 2GB of VRAM. This structural flaw is a major reason why a Cast AI report found average GPU utilization across clusters to be just 5%. For AI startups and scale-ups, that level of waste cuts directly into margins. This guide covers the practical steps to configure your Kubernetes GPU nodes, manage memory to prevent OOM crashes, and implement scheduling strategies that actually utilize your hardware.

The Core Stack: NVIDIA Device Plugin and Container Toolkit

Before Kubernetes can schedule machine learning workloads on a GPU, the cluster must recognize the underlying hardware. Kubernetes has no built-in knowledge of GPUs; it relies on its device plugin framework to expose specialized hardware to the kubelet. This integration is the foundational step for any AI infrastructure.

The Three Essential Layers of GPU Integration

The setup requires three distinct layers to be configured on every GPU-enabled node within your cluster to ensure seamless operation:

  1. NVIDIA Drivers

    The host operating system must have the appropriate NVIDIA drivers installed. The driver must be new enough to support the CUDA version your machine learning applications are built against. If the driver is too old, pods may still be scheduled, but CUDA initialization typically fails as soon as the container starts.
  2. NVIDIA Container Toolkit

    This toolkit modifies the container runtime, such as containerd or CRI-O, to inject the necessary NVIDIA libraries and device files into the containers at runtime. Without this critical toolkit, the containerized application cannot communicate with the physical GPU hardware.
  3. NVIDIA Device Plugin

    This is a Kubernetes DaemonSet that communicates directly with the kubelet via a gRPC socket. It registers the hardware and exposes it as a schedulable resource under the identifier nvidia.com/gpu, allowing the scheduler to track capacity.
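
Once the three layers above are in place, a workload claims a GPU by putting nvidia.com/gpu in its resource limits. The following is a minimal sketch that builds such a Pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod name and image are hypothetical placeholders, not values from this guide.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs via nvidia.com/gpu."""
    if gpus < 1:
        # Extended resources such as nvidia.com/gpu only accept whole numbers.
        raise ValueError("nvidia.com/gpu must be a positive whole number")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # hypothetical image that ships nvidia-smi
                    "command": ["nvidia-smi"],
                    # For extended resources, setting limits is sufficient;
                    # the request defaults to the same value.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-base:latest")
    print(json.dumps(manifest, indent=2))
```

If the pod stays Pending with an "Insufficient nvidia.com/gpu" event, the device plugin has most likely not registered any GPUs on the node yet.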

Automating with the NVIDIA GPU Operator

Once deployed, the NVIDIA Device Plugin monitors the health of the GPUs it manages and reports their status to the kubelet. If a GPU fails a health check (for example, after a hardware error), the plugin marks that device as unhealthy. The device is then removed from the node's allocatable nvidia.com/gpu count, so the scheduler stops assigning new pods to degraded hardware.

For teams managing their own infrastructure, the modern approach is to deploy the NVIDIA GPU Operator. Instead of manually installing drivers and toolkits on every single node, the Operator automates the lifecycle of all software components required to provision GPUs. It deploys the driver, container toolkit, device plugin, and DCGM exporter for metrics as containerized components. This keeps versions compatible across the stack and drastically reduces the operational burden on infrastructure teams. With the Operator in place, newly added GPU nodes are configured automatically, so they can accept machine learning workloads as soon as the Operator's components finish rolling out.

The 5% Utilization Crisis and the Integer Problem

The technical implementation of the device plugin introduces a severe economic problem for infrastructure teams managing AI workloads. By default, the Kubernetes scheduler treats the nvidia.com/gpu resource as an indivisible integer. You can request one GPU or four GPUs, but you cannot request a fraction of a GPU or a specific amount of VRAM for smaller tasks.

The Root Cause of the 5% Utilization Crisis

According to the Cast AI State of Kubernetes Optimization Report, this integer constraint is the primary reason average GPU utilization sits at a dismal 5 percent across the industry. When organizations pay premium rates for high-end compute, this level of waste is financially devastating and unsustainable. The waste typically falls into two distinct categories that plague modern clusters:

  • Idle Allocation Waste

    A pod requests a full GPU for a low-traffic inference API. The pod passes its health checks and locks the GPU entirely. The GPU utilization remains near zero for the majority of the day, but no other workload can access the hardware because the Kubernetes scheduler considers the resource fully consumed by the inference service.
  • Batch Gap Waste

    During a machine learning training job, the GPU executes forward and backward passes rapidly, but it sits completely idle while waiting for CPU-bound data loading, network transfers, or model checkpointing. The GPU is technically marked as "in use" by the scheduler, but effective utilization drops well below 40 percent during these bottlenecks.
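
Both kinds of waste come down to the same ratio: time the GPU spends computing divided by time it is allocated. The sketch below works through that arithmetic. All numbers (72 busy minutes per day, the step timings, and the hourly rate) are hypothetical and chosen only for illustration.

```python
def effective_utilization(busy_seconds: float, allocated_seconds: float) -> float:
    """Fraction of the allocated time in which the GPU actually computed."""
    if allocated_seconds <= 0:
        raise ValueError("allocated_seconds must be positive")
    return busy_seconds / allocated_seconds


def wasted_spend(hourly_rate: float, hours_allocated: float, utilization: float) -> float:
    """Money paid for allocated hours in which the GPU sat idle."""
    return hourly_rate * hours_allocated * (1.0 - utilization)


if __name__ == "__main__":
    # Idle allocation: an inference API that is busy 72 minutes per day.
    idle = effective_utilization(busy_seconds=72 * 60, allocated_seconds=24 * 3600)
    print(f"inference API utilization: {idle:.1%}")

    # Batch gap: each training step computes for 0.3 s, then waits 0.5 s on data.
    step = effective_utilization(busy_seconds=0.3, allocated_seconds=0.3 + 0.5)
    print(f"training step utilization: {step:.1%}")

    # Hypothetical rate of 2.00 per GPU-hour, allocated for one day.
    print(f"idle spend per day: {wasted_spend(2.0, 24, idle):.2f}")
```

The same two functions can be fed real busy-time figures from your metrics system to put a price on each idle workload.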

The Financial Impact of Integer Constraints

When you are paying premium rates for high-end hardware, idle allocation destroys your unit economics. The inability to bin-pack smaller workloads onto a single powerful GPU means companies are forced to over-provision clusters massively to handle peak loads. To fix this structural flaw, infrastructure teams must implement sharing mechanisms that break the integer barrier. By allowing multiple pods to utilize a single physical GPU, organizations can drastically increase their cluster density, reduce their cloud bills, and ensure that expensive AI hardware is actually performing compute operations rather than waiting for requests.
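
To make the bin-packing argument concrete, here is a small sketch that compares one-GPU-per-workload allocation with a first-fit-decreasing packing by VRAM. The workload sizes and the 80 GB capacity are hypothetical, and the model only considers memory. It ignores compute contention, which a real sharing setup also has to account for.

```python
def first_fit_decreasing(vram_needs_gb: list[float], gpu_capacity_gb: float) -> list[list[float]]:
    """Pack workloads onto GPUs by VRAM using the first-fit-decreasing heuristic."""
    placements: list[list[float]] = []  # one inner list of workloads per GPU
    for need in sorted(vram_needs_gb, reverse=True):
        if need > gpu_capacity_gb:
            raise ValueError(f"workload of {need} GB exceeds a single GPU")
        for gpu in placements:
            # Place the workload on the first GPU that still has room.
            if sum(gpu) + need <= gpu_capacity_gb:
                gpu.append(need)
                break
        else:
            # No existing GPU had room, so open a new one.
            placements.append([need])
    return placements


if __name__ == "__main__":
    workloads_gb = [2, 6, 10, 12, 20, 24, 30, 40]  # hypothetical VRAM needs
    packed = first_fit_decreasing(workloads_gb, gpu_capacity_gb=80)
    print(f"integer allocation: {len(workloads_gb)} GPUs")
    print(f"shared allocation:  {len(packed)} GPUs -> {packed}")
```

First-fit-decreasing is a heuristic, not an optimal solver, but it is usually close enough to estimate how many GPUs a sharing strategy could free up.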

Strategies for GPU Sharing: MIG, MPS, and Time-Slicing

To maximize cluster efficiency and solve the utilization crisis, you must configure your nodes to share GPU resources effectively. The optimal method depends entirely on your specific workload profile, such as whether you are running latency-sensitive inference services or throughput-heavy batch processing jobs that require maximum compute power.

Multi-Instance GPU Partitioning

Available on supported data center GPUs from the Ampere generation onward (for example the A100, A30, and H100), Multi-Instance GPU partitioning allows you to physically divide a single GPU into up to seven isolated instances. Each instance has dedicated compute cores, memory, L2 cache, and memory bandwidth, operating almost like an independent device.

Because the isolation happens at the hardware level, this method provides strict fault isolation and guaranteed Quality of Service. If one partition crashes due to an out-of-memory error, the other partitions remain completely unaffected. This makes hardware partitioning the gold standard for multi-tenant Kubernetes clusters where different teams deploy unpredictable workloads and require strict separation.
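
As an illustration of how a workload maps onto a partition, the sketch below picks the smallest MIG profile that covers a VRAM requirement and derives the resource name a pod would request. The profile table reflects commonly documented A100 40GB profiles and is an assumption here. The nvidia.com/mig-<profile> naming applies when the device plugin runs with the mixed MIG strategy. Check the profiles your own hardware reports before relying on these values.

```python
# (profile name, nominal memory in GB, max instances per GPU)
# Assumed values for an A100 40GB; usable memory is slightly below nominal.
A100_40GB_PROFILES = [
    ("1g.5gb", 5, 7),
    ("2g.10gb", 10, 3),
    ("3g.20gb", 20, 2),
    ("7g.40gb", 40, 1),
]


def smallest_fitting_profile(vram_gb: float, profiles=A100_40GB_PROFILES):
    """Return the name of the smallest profile whose memory covers vram_gb."""
    for name, memory_gb, _max_instances in profiles:
        if vram_gb <= memory_gb:
            return name
    return None  # the workload does not fit on a single GPU of this type


def mig_resource_name(profile: str) -> str:
    """Resource name a pod requests under the 'mixed' MIG strategy."""
    return f"nvidia.com/mig-{profile}"


if __name__ == "__main__":
    for need in (2, 12, 64):
        profile = smallest_fitting_profile(need)
        target = mig_resource_name(profile) if profile else "no single-GPU fit"
        print(f"{need} GB -> {target}")
```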

Multi-Process Service Architecture

The Multi-Process Service is a client-server architecture that allows multiple CUDA processes to share a single GPU context. Instead of partitioning the hardware physically, this service overlaps the compute kernels from different processes, executing them concurrently on the GPU to maximize throughput.

This approach is highly effective for workloads that do not fully saturate the Streaming Multiprocessors, such as small batch inference. However, its isolation is much weaker than hardware partitioning. All clients draw from the same pool of High Bandwidth Memory, so a memory leak in one process can exhaust it and cause allocations in the other processes to fail, and a fatal GPU fault in one client can affect its peers. It requires careful memory management at the application level to prevent cascading failures.

Software Time-Slicing

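Time-slicing is the most basic form of sharing available in Kubernetes. The NVIDIA Device Plugin can be configured to advertise multiple replicas of every physical GPU, so several pods can be scheduled onto the same device. The GPU, under control of the NVIDIA driver rather than the operating system scheduler, then rapidly switches between the execution contexts of the different processes. Memory is not partitioned between the replicas.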

While time-slicing allows for high pod density, it introduces significant latency spikes due to the overhead of context switching. It is best suited for asynchronous batch jobs or CI/CD testing environments where latency is not a primary concern and maximum concurrency is the main objective for the engineering team.
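
Time-slicing is enabled through the device plugin's configuration file. The sketch below renders a configuration of the shape the plugin documents (a sharing.timeSlicing section with a replicas count) and shows how the advertised capacity grows. The replica count of 4 is a hypothetical choice, and how the file reaches the plugin (typically a ConfigMap) depends on your installation.

```python
def time_slicing_config(replicas: int) -> str:
    """Render a device plugin config that advertises each GPU `replicas` times."""
    if replicas < 2:
        raise ValueError("time-slicing needs at least 2 replicas per GPU")
    return (
        "version: v1\n"
        "sharing:\n"
        "  timeSlicing:\n"
        "    resources:\n"
        "      - name: nvidia.com/gpu\n"
        f"        replicas: {replicas}\n"
    )


def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu units the node reports after time-slicing."""
    return physical_gpus * replicas


if __name__ == "__main__":
    print(time_slicing_config(replicas=4))
    print("a node with 8 physical GPUs advertises", advertised_gpus(8, 4))
```

Because the replicas share memory with no enforcement, a pod that requests one replica can still allocate the whole VRAM, so the replica count should be paired with framework-level memory caps.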

Taming CUDA Out-of-Memory (OOM) Errors

When you pack multiple machine learning workloads onto a single node to improve utilization, memory management becomes the critical failure point for your infrastructure. In standard CPU workloads, exceeding a memory limit triggers the Linux OOM killer, which abruptly kills the offending container so that Kubernetes can restart it according to the pod's restart policy. GPU memory has no equivalent enforcement in Kubernetes. Exhausting High Bandwidth Memory causes the machine learning framework to throw a CUDA out of memory exception, which usually crashes the application and, on a shared GPU without memory isolation, can cause neighboring workloads to fail as well.

Understanding High Bandwidth Memory Exhaustion

Memory exhaustion is very common during model training. When PyTorch or TensorFlow runs a forward pass, it must store large amounts of activation data for the backward pass. If the batch size is too large, or if the model weights exceed the physical capacity of the VRAM, the job fails. When the failure happens partway through a run, all compute time since the last checkpoint is lost.

Mitigating these errors requires strict configuration at both the infrastructure and application layers to ensure stability:

  • Framework-Level Limits

    Kubernetes cannot natively enforce hard limits on GPU memory usage. You must configure the machine learning framework to restrict its own allocation. For example, in PyTorch, calling torch.cuda.set_per_process_memory_fraction caps how much of the device's memory the process's allocator may use, which prevents a single process from monopolizing the entire VRAM.
  • Gradient Checkpointing

    This technique trades compute for memory. Instead of storing all activations during the forward pass, the framework discards them and recomputes them during the backward pass. This significantly reduces the memory footprint, allowing larger models to fit into limited memory spaces.
  • Mixed Precision Training

    Using 16-bit data types such as FP16 or BF16 roughly halves the memory required for activations, and for model weights when those are also stored in 16-bit, compared to FP32. In most cases this comes with little or no loss in model accuracy.
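
Two of the mitigations above can be sketched in a few lines: estimating how precision affects the memory taken by model weights, and capping a PyTorch process's share of VRAM. This is a minimal sketch under stated assumptions. The 7-billion-parameter model and the 0.5 fraction are hypothetical, the estimate covers weights only (not activations, gradients, or optimizer state), and the cap is skipped when PyTorch or a CUDA device is not present.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def cap_gpu_memory(fraction: float, device: int = 0) -> bool:
    """Cap this process's share of GPU memory; return True if the cap was applied."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    try:
        import torch  # optional dependency, imported lazily
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Limits PyTorch's caching allocator for this process on the given device.
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return True


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical model size
    for dtype in ("fp32", "bf16"):
        print(f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB of weights")
    print("memory cap applied:", cap_gpu_memory(0.5))
```

The cap only governs memory allocated through PyTorch's allocator in that process, so it is a cooperative limit rather than hardware enforcement.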

Monitoring and Observability

Proper monitoring is essential for catching memory pressure before it turns into a crash. Deploying the DCGM exporter allows you to scrape granular metrics, such as memory utilization, clock speeds, and temperature, directly into Prometheus. This gives your infrastructure team the visibility needed to tune workloads, adjust batch sizes, and optimize memory allocation before an application crashes and disrupts the development pipeline.
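
To show what working with these metrics can look like, the sketch below parses a snippet of Prometheus text output and flags GPUs whose framebuffer is nearly full. The sample values, hostname, and 90 percent threshold are hypothetical. The metric names DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE are standard DCGM fields reported in MiB, but the exact label set depends on how the exporter is configured, and used plus free slightly understates total memory because reserved memory is not counted.

```python
import re

# Hypothetical scrape output, trimmed to two metrics for two GPUs.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 38000
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="node-a"} 2000
DCGM_FI_DEV_FB_USED{gpu="1",Hostname="node-a"} 8000
DCGM_FI_DEV_FB_FREE{gpu="1",Hostname="node-a"} 32000
"""

SAMPLE_LINE = re.compile(r'^(DCGM_\w+)\{([^}]*)\}\s+(\S+)$')
GPU_LABEL = re.compile(r'(?:^|,)gpu="([^"]+)"')


def parse_dcgm(text: str) -> dict:
    """Map (metric name, gpu id) to the sample value; comment lines are skipped."""
    samples = {}
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue
        gpu = GPU_LABEL.search(match.group(2))
        if gpu:
            samples[(match.group(1), gpu.group(1))] = float(match.group(3))
    return samples


def gpus_under_memory_pressure(text: str, threshold: float = 0.9) -> list:
    """Return ids of GPUs where used / (used + free) meets the threshold."""
    samples = parse_dcgm(text)
    flagged = []
    for (metric, gpu), used in sorted(samples.items()):
        if metric != "DCGM_FI_DEV_FB_USED":
            continue
        free = samples.get(("DCGM_FI_DEV_FB_FREE", gpu))
        if free is None or used + free == 0:
            continue
        if used / (used + free) >= threshold:
            flagged.append(gpu)
    return flagged


if __name__ == "__main__":
    print("GPUs above 90% memory:", gpus_under_memory_pressure(SAMPLE_METRICS))
```

In production the same ratio would normally live in a Prometheus alerting rule rather than a script, but the logic is identical.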

Advanced ML Scheduling with Kueue and Volcano

The default Kubernetes scheduler evaluates pods individually. It filters nodes based on available resources and scores them to find the optimal placement. This approach works perfectly for stateless web servers, but it causes significant inefficiencies for distributed machine learning training, which requires a fundamentally different scheduling paradigm.

The Necessity of Gang Scheduling

Distributed training requires gang scheduling, which is an all-or-nothing approach to resource allocation. If a training job requires eight GPUs across four nodes, all eight pods must start simultaneously. If the standard scheduler places six pods but runs out of capacity for the final two, the six running pods will hang indefinitely, waiting for their peers to initialize. They lock the GPUs, consume your budget, and make absolutely zero progress. This deadlock scenario is a primary driver of wasted cloud spend and delayed model deployments.
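
The deadlock described above can be reproduced with a toy model. The sketch below compares pod-by-pod placement with all-or-nothing admission for a job that needs eight single-GPU pods on a cluster with six free GPUs. It is a deliberate simplification that treats the cluster as one pool of GPUs and is not how any real scheduler is implemented.

```python
def schedule_individually(free_gpus: int, pods_needed: int) -> tuple[int, int]:
    """Place pods one at a time; returns (pods started, GPUs still free)."""
    started = min(free_gpus, pods_needed)
    # Started pods hold their GPUs even if the job can never make progress.
    return started, free_gpus - started


def schedule_gang(free_gpus: int, pods_needed: int) -> tuple[int, int]:
    """Admit the job only if every pod fits; returns (pods started, GPUs still free)."""
    if free_gpus >= pods_needed:
        return pods_needed, free_gpus - pods_needed
    return 0, free_gpus  # nothing starts, nothing is locked


def job_can_progress(pods_started: int, pods_needed: int) -> bool:
    """Distributed training only advances once all workers are running."""
    return pods_started == pods_needed


if __name__ == "__main__":
    for name, scheduler in (("individual", schedule_individually), ("gang", schedule_gang)):
        started, free = scheduler(free_gpus=6, pods_needed=8)
        print(f"{name}: started={started}, free GPUs={free}, "
              f"progress={job_can_progress(started, 8)}")
```

With individual placement, six GPUs are locked while the job makes no progress. With gang admission, all six remain available for other work.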

To solve this structural issue, infrastructure teams must deploy specialized batch schedulers designed specifically for AI workloads:

Volcano Batch Scheduler

Volcano is a mature batch scheduling system that introduces the concept of PodGroups to Kubernetes. It ensures that the scheduler only binds pods to nodes if the entire PodGroup can be accommodated. If the cluster lacks the capacity for the full job, all pods remain in a pending state, leaving the GPUs free for smaller, less demanding workloads. This prevents partial deployments from locking up valuable hardware.
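
A PodGroup is an ordinary custom resource, so it can be sketched as a manifest. The example below builds one with a minMember threshold along with the pod metadata that ties a worker to the group. The API version, the group-name annotation, and the scheduler name follow Volcano's documented conventions as I understand them and should be verified against the Volcano version you install. The job and queue names are hypothetical.

```python
import json


def volcano_pod_group(name: str, min_member: int, queue: str = "default") -> dict:
    """PodGroup that Volcano binds only when min_member pods can all be placed."""
    if min_member < 1:
        raise ValueError("min_member must be at least 1")
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "PodGroup",
        "metadata": {"name": name},
        "spec": {"minMember": min_member, "queue": queue},
    }


def worker_pod_fields(pod_name: str, group_name: str) -> dict:
    """Metadata and scheduler settings that attach a worker pod to the group."""
    return {
        "metadata": {
            "name": pod_name,
            # Assumed annotation key used by Volcano to look up the PodGroup.
            "annotations": {"scheduling.k8s.io/group-name": group_name},
        },
        "spec": {"schedulerName": "volcano"},
    }


if __name__ == "__main__":
    group = volcano_pod_group("llm-finetune", min_member=8)
    print(json.dumps(group, indent=2))
    print(json.dumps(worker_pod_fields("llm-finetune-worker-0", "llm-finetune"), indent=2))
```

In many setups you would not write the PodGroup by hand, since Volcano's own Job resource can create it for you. The manifest is shown here to make the all-or-nothing threshold visible.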

Kueue Job Management

Kueue is a Kubernetes-native job queueing system designed specifically for multi-tenant environments. It manages cluster-wide quotas, cohort borrowing, and workload preemption. Instead of overwhelming the scheduler with hundreds of pending pods, Kueue holds jobs in a queue until sufficient GPU resources become available. It enforces strict fair-sharing policies, ensuring that one team cannot monopolize the entire cluster while others wait.
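
From the workload side, handing a job to Kueue mostly means labeling it with a queue and creating it suspended. The sketch below builds such a batch Job manifest. The queue name, job name, and image are hypothetical, and the LocalQueue and ClusterQueue that back the label are assumed to exist already.

```python
import json


def kueue_training_job(name: str, local_queue: str, workers: int, image: str) -> dict:
    """Batch Job that waits in a Kueue LocalQueue until quota is available."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Label that tells Kueue which LocalQueue manages this job.
            "labels": {"kueue.x-k8s.io/queue-name": local_queue},
        },
        "spec": {
            "suspend": True,  # created suspended; Kueue unsuspends it on admission
            "parallelism": workers,
            "completions": workers,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,  # hypothetical training image
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    job = kueue_training_job("llm-finetune", "team-a-queue", 8, "example.com/trainer:latest")
    print(json.dumps(job, indent=2))
```

Because the job carries its full GPU demand in one object, Kueue can check it against the team's quota before any pod is created.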

Implementing Kueue alongside the NVIDIA device plugin provides the admission control necessary to run enterprise-scale AI workloads without deadlocking your infrastructure or wasting expensive compute cycles on stalled training jobs.

The Build vs. Buy Decision for European AI Teams

Building and maintaining a GPU-aware Kubernetes cluster requires deep, specialized expertise that many organizations simply do not possess. Managing the NVIDIA GPU Operator, configuring hardware partitions, tuning Kueue for gang scheduling, and debugging memory errors demands a dedicated infrastructure team. For many AI startups and scale-ups, managing hardware is a painful distraction from their core objective, which is building and deploying sophisticated machine learning models.

The Cost of Hyperscaler Infrastructure

Furthermore, relying on traditional hyperscalers for this infrastructure often leads to unsustainable costs. Public clouds require massive block-reservations for high-end hardware, and their pricing models are prohibitive for weeks-long training runs or sustained inference workloads. The hidden costs of egress fees and mandatory support contracts further erode AI budgets, making it difficult for European companies to compete globally.

The Lyceum Advantage

This is where Lyceum Technology offers a specialized infrastructure model. The platform provides high-end virtual machines at rates that are competitive with traditional hyperscalers, and it charges no egress fees. With per-second billing across the board, teams only pay for the exact compute they consume, eliminating the financial waste associated with idle allocation.

For European enterprises, data residency and compliance are hard requirements. Non-EU hosting is often a deal-breaker due to strict regulatory constraints. The infrastructure is EU-sovereign and GDPR-compliant, ensuring all data and model weights remain strictly within European data centers. This compliance path offers a significant competitive moat for teams navigating the AI Act and ISO 27001 requirements.

Instead of wrestling with Kubernetes device plugins and manual scheduling, teams can provision a virtual machine instantly or deploy models directly via the inference engine. The platform utilizes the Pythia AI Scheduler, which handles VRAM prediction and automatic GPU selection to deliver significant cost savings without manual configuration. Built on a transparent, open stack that uses vLLM and NVIDIA Dynamo, the architecture ensures customer portability by design, avoiding the vendor lock-in associated with proprietary black-box engines.

Overcoming the GPU Utilization Crisis with Kubernetes Optimization

The machine learning industry faces a massive infrastructure challenge. As organizations rush to deploy generative AI and large language models, they are purchasing or renting expensive hardware at unprecedented rates. However, as highlighted by recent industry reports, the actual utilization of these resources remains shockingly low, creating a severe bottleneck for innovation and profitability.

Kubernetes was originally designed for stateless microservices, not stateful, hardware-accelerated machine learning workloads. When AI workloads are forced into standard Kubernetes paradigms without proper optimization, the result is severe resource stranding. A pod might request a powerful accelerator for a lightweight task, effectively locking the entire device and preventing any other workload from utilizing the remaining compute capacity. This fundamental mismatch is the root cause of the utilization crisis.

Solving the Crisis Through Intelligent Scheduling

To solve this utilization crisis and save your AI budget, infrastructure teams must move beyond default configurations. Implementing intelligent scheduling, hardware partitioning, and dynamic resource allocation is no longer optional. By utilizing advanced Kubernetes optimization techniques, organizations can bin-pack multiple inference services onto a single node, dynamically scale resources based on actual demand, and ensure that expensive hardware is fully saturated.

Optimizing Kubernetes for machine learning requires a holistic approach. It involves configuring the container runtime, deploying specialized device plugins, and utilizing batch schedulers that understand the unique requirements of AI workloads. When executed correctly, these optimizations can transform a highly inefficient cluster into a streamlined, cost-effective engine for machine learning innovation.

Best Practices for Setting Up GPU Workloads in Kubernetes

Successfully running machine learning workloads in Kubernetes requires strict adherence to infrastructure best practices. A poorly configured cluster will suffer from frequent crashes, low utilization, and unpredictable performance. To build a resilient environment for AI, engineering teams must standardize their deployment processes and implement robust monitoring solutions.

Standardizing Node Configuration

The first best practice is to standardize node configuration using automated tooling. Manually installing drivers and container toolkits on individual nodes introduces configuration drift and version mismatches. Utilizing tools like the NVIDIA GPU Operator ensures that every node in the cluster runs the exact same software stack. This consistency is critical for preventing scheduling errors and ensuring that machine learning pods can migrate seamlessly between nodes during scaling events. Automation reduces human error and accelerates the provisioning of new hardware.

Implementing Resource Limits and Requests

Another critical practice is the strict enforcement of resource limits and requests within your pod specifications. While Kubernetes struggles with fractional GPU allocation by default, defining clear CPU and system memory limits prevents auxiliary processes from starving the primary machine learning workload. Furthermore, configuring the machine learning framework to respect memory boundaries is essential for preventing out-of-memory crashes that can destabilize the entire node and disrupt other running applications.

Continuous Monitoring and Alerting

Finally, continuous monitoring is the backbone of a healthy machine learning cluster. Infrastructure teams must deploy comprehensive observability stacks that capture hardware-level metrics. Scraping data from the DCGM exporter allows teams to monitor utilization, memory consumption, and thermal performance in real time. Setting up automated alerts for high memory usage or temperature spikes enables proactive intervention before a critical training job fails.

By following these best practices, organizations can build a robust Kubernetes environment that supports the demanding requirements of modern machine learning applications, ensuring high availability and optimal performance for both training and inference workloads.

Frequently Asked Questions

What is the difference between MIG and Time-Slicing?

Multi-Instance GPU (MIG) physically partitions the GPU hardware, providing strict isolation of compute cores, memory, L2 cache, and memory bandwidth. If one MIG partition crashes, the others are unaffected. Time-slicing, on the other hand, relies on the GPU and its driver to rapidly switch execution contexts between processes on the same physical hardware. Time-slicing offers no memory isolation and introduces latency spikes during context switches.

How does gang scheduling work for ML training?

Gang scheduling ensures that a distributed training job is treated as a single, indivisible unit. Schedulers like Volcano use PodGroups to verify that the cluster has enough available GPUs to run every pod in the job simultaneously. If capacity is insufficient, all pods remain pending, preventing a scenario where a partial deployment locks up GPUs indefinitely while waiting for resources.

Why is my GPU utilization so low in Kubernetes?

Low GPU utilization is typically caused by the Kubernetes scheduler's inability to allocate fractional GPUs. If a pod requests a GPU, the entire device is locked to that pod, even if the workload only uses a fraction of the compute capacity. This "idle allocation waste" can be resolved by implementing MIG, MPS, or time-slicing to allow multiple workloads to share the hardware.

How does Lyceum Technology handle GDPR compliance for ML workloads?

Lyceum Technology operates entirely on EU-sovereign infrastructure. Unlike traditional hyperscalers that often route data globally, Lyceum ensures that all model weights, training datasets, and inference logs remain strictly within European data centers. This localized approach provides a clear, provable path to strict GDPR compliance and is designed to align with the requirements of the EU AI Act.
