Migrating GPU Workloads from Slurm to Kubernetes: A Practical Guide

How ML engineering teams can transition from legacy HPC schedulers to containerized infrastructure without sacrificing GPU utilization or gang scheduling.

Justus Amen

May 27, 2026 · GTM at Lyceum Technology

In academic research and high-performance computing (HPC), writing `sbatch` scripts and watching jobs land predictably on GPU nodes is muscle memory. But as AI workloads evolve from pure training to continuous fine-tuning and real-time inference, infrastructure teams are standardizing on Kubernetes. Kubernetes, however, was built for stateless microservices, not tightly coupled distributed training, and moving GPU workloads from Slurm to Kubernetes often leads to fragmented clusters, silent hangs, and plummeting utilization rates. This guide breaks down the architectural differences, the concrete challenges of migrating, and the frameworks you need to make Kubernetes handle heavy AI workloads effectively.

The Architectural Divide: HPC vs. Containerized Infrastructure

Slurm and Kubernetes approach resource management from fundamentally different assumptions. That divide shapes everything from individual deployment scripts to overall infrastructure strategy, so a migration starts with understanding how each system views compute resources and workload lifecycles.

Slurm Assumes Limited Resources and Infinite Workloads

Emerging from the scientific computing world, Slurm was designed for environments where compute is fixed and jobs must queue. Its core strength is gang scheduling: when you request eight GPUs across two nodes, Slurm allocates them together or not at all. This deterministic behavior is exactly what distributed training requires. If a PyTorch dist.barrier() call executes, it needs all ranks present; otherwise, the job blocks indefinitely.
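The all-or-nothing rule described above can be sketched in a few lines. This is a minimal illustration, not Slurm's actual allocator: the function name `gang_allocate`, the node names, and the dictionary layout are assumptions made for the example.

```python
from typing import Dict, Optional


def gang_allocate(free_gpus: Dict[str, int], nodes_needed: int,
                  gpus_per_node: int) -> Optional[Dict[str, int]]:
    """Return an all-or-nothing placement, or None if the whole job cannot fit."""
    # Only nodes that can host a full per-node slice are candidates.
    candidates = [name for name, free in sorted(free_gpus.items())
                  if free >= gpus_per_node]
    if len(candidates) < nodes_needed:
        return None  # Nothing is reserved; the job stays queued as a unit.
    chosen = candidates[:nodes_needed]
    for name in chosen:
        free_gpus[name] -= gpus_per_node  # Commit only once the whole gang fits.
    return {name: gpus_per_node for name in chosen}


# Hypothetical cluster: free GPUs per node.
cluster = {"node-a": 4, "node-b": 4, "node-c": 2}
print(gang_allocate(cluster, nodes_needed=2, gpus_per_node=4))  # eight GPUs across two nodes
print(gang_allocate(cluster, nodes_needed=2, gpus_per_node=2))  # cannot fit, nothing is held
```

The important property is that a failed request leaves `cluster` untouched, so no rank ever starts without its peers.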

Beyond gang scheduling, Slurm excels at fair-share queueing. Its multifactor priority system automatically decays the priority of teams that over-consume GPU resources, redistributing capacity to under-served users without manual intervention. Furthermore, Slurm offers native Message Passing Interface (MPI) integration, binding to srun task slots. This tight integration ensures that high-performance computing tasks run with minimal overhead, maintaining high throughput for tightly coupled workloads.
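To make the priority decay concrete, here is a simplified sketch. It loosely follows the classic fair-share factor documented for Slurm's multifactor priority plugin, F = 2^(-U/S), together with half-life decay of recorded usage. The default Fair Tree algorithm is more involved, and the function names and numbers below are assumptions for illustration.

```python
def decayed_usage(usage: float, elapsed_days: float, half_life_days: float) -> float:
    """Historical usage loses half its weight every half-life."""
    return usage * 0.5 ** (elapsed_days / half_life_days)


def fair_share_factor(normalized_usage: float, normalized_shares: float) -> float:
    """Classic-style factor: 1.0 for no usage, 0.5 when usage equals the share."""
    return 2.0 ** (-normalized_usage / normalized_shares)


# Two hypothetical teams with equal shares of the cluster.
shares = {"team-a": 0.5, "team-b": 0.5}
recent_usage = {"team-a": 0.8, "team-b": 0.2}  # fractions of recent GPU-hours

for team in shares:
    factor = fair_share_factor(recent_usage[team], shares[team])
    print(team, round(factor, 3))

# A week later (7-day half-life), team-a's old usage counts half as much.
print(decayed_usage(0.8, elapsed_days=7, half_life_days=7))
```

The over-consuming team gets the lower factor, so its next jobs queue behind the under-served team until its recorded usage decays.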

Kubernetes Assumes Infinite Resources and Finite Workloads

Built to manage stateless microservices, Kubernetes schedules pods independently using a filter-and-score mechanism. By default, it prefers to spread pods across the least-allocated nodes to ensure high availability for web services. This design is perfect for web servers but disastrous for distributed machine learning.
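A toy version of filter-and-score shows why pods spread out. The scoring below is loosely modeled on the scheduler's least-allocated strategy. The real scheduler combines many plugins, and the `schedule_pod` function and node layout are assumptions for this sketch.

```python
from typing import Dict, Optional


def schedule_pod(nodes: Dict[str, Dict[str, int]], gpus_requested: int) -> Optional[str]:
    """Place one pod: filter out nodes that cannot fit it, then pick the emptiest."""
    # Filter phase: keep only nodes with enough free GPUs.
    feasible = [name for name in sorted(nodes)
                if nodes[name]["capacity"] - nodes[name]["allocated"] >= gpus_requested]
    if not feasible:
        return None  # The pod stays Pending.

    # Score phase: a lower allocated fraction scores higher.
    def score(name: str) -> float:
        return 1.0 - nodes[name]["allocated"] / nodes[name]["capacity"]

    best = max(feasible, key=score)  # ties resolve to the first name in sorted order
    nodes[best]["allocated"] += gpus_requested
    return best


cluster = {name: {"capacity": 8, "allocated": 0}
           for name in ("node-a", "node-b", "node-c")}
placements = [schedule_pod(cluster, gpus_requested=1) for _ in range(3)]
print(placements)  # each pod lands on a different node
```

Each pod is placed with no knowledge of its siblings, which is good for availability but leaves every node partially used.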

If you submit a four-node training job, Kubernetes might start three pods while the fourth remains stuck in a pending state because no remaining node has enough free GPUs. The lack of native gang scheduling causes silent hangs and wasted billing time for ML teams. Kubernetes also requires explicit quota management and third-party operators to achieve the multi-tenant fairness that Slurm provides out of the box. Without these additions, a single user can easily monopolize a Kubernetes cluster, leaving other critical workloads starved for resources.

Why Migrate? The Case for a Unified Stack

If Slurm is so effective for training, why are organizations pushing to migrate their GPU workloads to Kubernetes? The answer lies in the evolution of the AI lifecycle. Historically, HPC teams ran massive simulations on supercomputers using Slurm, while software teams built microservices on the cloud using Kubernetes. Today, Large Language Models (LLMs) and generative AI have forced these two worlds to collide, creating a demand for unified infrastructure.

The Limitations of Slurm for Modern AI Services

Slurm is the undisputed king of batch scheduling, but it struggles with services. It lacks native concepts for Ingress, Load Balancers, or Service Meshes. Running a model serving API, a continuous integration dashboard, or an interactive Jupyter environment on Slurm requires heavy workarounds. As teams move models from training into production, maintaining two separate infrastructure stacks, Slurm for training and Kubernetes for inference, creates massive operational overhead. Any attempt to share capacity, enforce quotas, or standardize access must be implemented twice.

Overcoming Stranded Capacity and Fragmented Visibility

  • Stranded Capacity

    Static allocation of machines to either Slurm or Kubernetes leaves hardware idle. If the Slurm cluster is busy but the Kubernetes cluster is quiet, you cannot shift nodes across boundaries. This hard division of resources leads to significant financial waste, especially given the high cost of modern GPUs.
  • Fragmented Visibility

    Without a single control plane, answering basic questions about cluster health, GPU utilization, and workload mix becomes a manual data-gathering exercise. Platform engineers are forced to stitch together logs and metrics from disparate systems.
  • Ecosystem Tooling

    The broader software ecosystem, from CI/CD pipelines to monitoring stacks like Prometheus and Grafana, integrates natively with Kubernetes. Slurm requires custom exporters and brittle scripts to achieve the same level of observability.

Migrating to Kubernetes offers a unified control plane for data preparation, training, fine-tuning, and inference. However, it requires bridging the gap between service-oriented container management and batch-oriented compute allocation, a challenge that requires careful architectural planning.

Three Strategies for Kubernetes Migration

Organizations migrating GPU workloads generally adopt one of three architectural patterns to bridge the gap between HPC and containerized infrastructure. Choosing the right path depends on your team's Kubernetes expertise and existing investment in legacy scripts.

1. Kubernetes-Native Batch with Volcano or Kueue

This approach replaces Slurm entirely. You run raw Kubernetes and install custom schedulers like Volcano or Kueue to handle batch workloads. Volcano is a mature batch scheduling engine that introduces gang scheduling, job queuing, and fair-share policies to the Kubernetes control plane. Kueue takes a slightly different approach, acting as a job queueing controller that works alongside the default scheduler to manage quotas.

While this provides a pure containerized environment, it requires rewriting all existing sbatch scripts into complex YAML manifests. It also demands deep Kubernetes operational expertise to configure pod groups and priority classes correctly. Teams must be prepared for a steep learning curve and significant refactoring of their deployment pipelines.
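As a rough idea of what such a manifest involves, the sketch below builds a Volcano Job as a Python dictionary and prints it as JSON, which `kubectl` also accepts. The field names follow the Volcano Job API as commonly documented and should be checked against the Volcano version you run. The job name, image, and queue are hypothetical.

```python
import json
from typing import Any, Dict


def build_volcano_job(name: str, workers: int, gpus_per_worker: int,
                      image: str, queue: str = "default") -> Dict[str, Any]:
    """Build a Volcano Job manifest that asks for all-or-nothing placement."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            # minAvailable equal to replicas asks the gang logic to hold the
            # job until every worker can be placed.
            "minAvailable": workers,
            "queue": queue,
            "tasks": [{
                "name": "worker",
                "replicas": workers,
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
                    }],
                }},
            }],
        },
    }


job = build_volcano_job("llm-finetune", workers=4, gpus_per_worker=8,
                        image="registry.example.com/trainer:latest")  # hypothetical image
print(json.dumps(job, indent=2))
```

Compare this with the handful of `#SBATCH` lines it replaces. The extra nesting is where most of the rewrite effort goes.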

2. Slurm on Kubernetes via Slinky

For teams with years of investment in Slurm scripts, running Slurm inside Kubernetes offers a powerful middle ground. Projects like NVIDIA's Slinky represent Slurm daemons as Kubernetes Custom Resource Definitions (CRDs). This allows you to spin up a Slurm login node and worker nodes as Kubernetes pods.

Researchers keep their familiar sbatch and srun commands, while platform engineers manage the underlying infrastructure through Kubernetes. Internal deployments demonstrate that the Slinky operator scales to over 8,000 GPUs and supports non-disruptive rolling updates. This hybrid approach minimizes friction for end-users while modernizing the backend.

3. Managed GPU Infrastructure

For many AI startups and scale-ups, managing either Volcano or Slinky is an unnecessary distraction from core model development. Instead of building the scheduling systems in-house, teams rely on managed platforms that abstract the infrastructure entirely. This allows them to submit jobs or provision virtual machines via API without configuring the underlying schedulers, freeing up engineering cycles for actual machine learning research.

Common Pitfalls and Utilization Drops

Migration failures rarely happen at the deployment stage; they happen at runtime. Watch out for these specific technical hurdles when moving GPU workloads to Kubernetes, as they can severely impact both performance and budget.

The Partial Allocation Deadlock

Failing to implement strict gang scheduling will result in distributed jobs hanging. If you use default Kubernetes scheduling, you will pay for idle GPUs while pods wait for their peers to initialize. You must enforce pod group scheduling to ensure all-or-nothing allocation. Without this, a large distributed training job might secure 90 percent of its required nodes, blocking other jobs from running while it waits indefinitely for the final 10 percent.
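A small simulation makes the deadlock visible. It assumes two hypothetical jobs that each need four nodes on a six-node cluster. The round-robin arrival order and function names are illustrative and do not model any specific scheduler.

```python
from typing import Dict, List


def schedule_independently(free_nodes: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Place pods one at a time, alternating between jobs as pods arrive."""
    held = {job: 0 for job in demands}
    progress = True
    while free_nodes > 0 and progress:
        progress = False
        for job, need in demands.items():
            if free_nodes > 0 and held[job] < need:
                held[job] += 1
                free_nodes -= 1
                progress = True
    return held


def schedule_as_gangs(free_nodes: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Admit a job only if all of its nodes are available at once."""
    held = {job: 0 for job in demands}
    for job, need in demands.items():
        if need <= free_nodes:
            held[job] = need
            free_nodes -= need
    return held


def runnable(held: Dict[str, int], demands: Dict[str, int]) -> List[str]:
    """A distributed job can make progress only when every node is present."""
    return [job for job, need in demands.items() if held[job] == need]


demands = {"job-a": 4, "job-b": 4}
pods = schedule_independently(6, demands)
gangs = schedule_as_gangs(6, demands)
print(pods, runnable(pods, demands))    # all six nodes held, no job can run
print(gangs, runnable(gangs, demands))  # job-a runs, job-b holds nothing
```

With independent placement, all six nodes are billed while no job makes progress. With gang admission, one job runs and the other waits without holding capacity.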

Network Topology Ignorance

High-performance training relies on specific hardware topologies, such as NVLink and InfiniBand. Kubernetes does not natively understand which GPUs share a PCIe switch or how nodes are connected across the spine-leaf network. If your scheduler places communicating pods on nodes with high-latency interconnects, training throughput will collapse. Implementing topology-aware scheduling using tools like the NVIDIA GPU Operator and ComputeDomains is mandatory to maintain the performance levels expected from a Slurm environment.
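The placement constraint can be sketched independently of any particular tool. The example below assumes each node is tagged with the leaf switch it hangs off (in a real cluster this would come from node labels) and only admits a job if all of its nodes share one switch. The names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def pick_nodes_same_switch(node_to_switch: Dict[str, str], free_nodes: List[str],
                           nodes_needed: int) -> Optional[List[str]]:
    """Choose nodes under a single leaf switch, or None if no switch has enough."""
    by_switch: Dict[str, List[str]] = defaultdict(list)
    for node in sorted(free_nodes):
        by_switch[node_to_switch[node]].append(node)
    fitting = [group for group in by_switch.values() if len(group) >= nodes_needed]
    if not fitting:
        return None
    # Prefer the smallest group that fits, keeping larger groups for larger jobs.
    return min(fitting, key=len)[:nodes_needed]


topology = {f"n{i}": "leaf-1" for i in range(1, 5)}
topology.update({f"n{i}": "leaf-2" for i in range(5, 9)})
free = ["n1", "n2", "n5", "n6", "n7"]

print(pick_nodes_same_switch(topology, free, nodes_needed=3))  # fits under leaf-2
print(pick_nodes_same_switch(topology, free, nodes_needed=4))  # no single switch fits
```

A topology-unaware scheduler would happily satisfy the four-node request by mixing both leaf switches, which is exactly the placement that hurts collective-communication throughput.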

Storage and Filesystem Bottlenecks

Slurm environments typically rely on shared parallel filesystems like Lustre or NFS. Kubernetes utilizes Container Storage Interface (CSI) drivers and persistent volumes. If your CSI driver isn't optimized for high-throughput, low-latency reads, your expensive GPUs will sit idle waiting for data. Data loading is frequently the silent killer of GPU utilization in containerized environments.
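A back-of-the-envelope calculation helps size the storage path before migrating. The figures below are illustrative assumptions, not measurements, and the model ignores caching and prefetch overlap.

```python
def required_read_bytes_per_sec(gpus: int, samples_per_sec_per_gpu: float,
                                bytes_per_sample: int) -> float:
    """Aggregate read rate the data loaders must sustain to keep every GPU fed."""
    return gpus * samples_per_sec_per_gpu * bytes_per_sample


def utilization_ceiling(storage_bytes_per_sec: float,
                        required_bytes_per_sec: float) -> float:
    """Upper bound on GPU busy fraction when storage is the only bottleneck."""
    return min(1.0, storage_bytes_per_sec / required_bytes_per_sec)


# Assumed workload: 8 GPUs, 2,000 samples/s per GPU, 150 KB per sample.
needed = required_read_bytes_per_sec(gpus=8, samples_per_sec_per_gpu=2000,
                                     bytes_per_sample=150_000)
print(f"needed: {needed / 1e9:.1f} GB/s")

# Assumed volume delivering 1.2 GB/s of sustained reads.
print(f"ceiling: {utilization_ceiling(1.2e9, needed):.0%}")
```

If the ceiling comes out well below 100 percent, no amount of scheduler tuning will recover the lost utilization; the fix is in the storage layer.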

The 30 Percent Utilization Ceiling

Schedulers fragment clusters over time. As jobs launch, finish, and fail across a fixed pool of compute, idle gaps emerge. Large jobs wait in the queue, while smaller ones fill the cracks. Despite decades of optimization, average GPU utilization rarely surpasses 30 percent in standard configurations. Overcoming this requires dynamic resizing, checkpointing, and intelligent workload profiling to pack jobs tightly without causing out-of-memory errors.
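The effect of placement policy on fragmentation can be shown with a toy cluster. The sketch assumes two hypothetical 8-GPU nodes and four 2-GPU jobs, and compares spreading with packing. It illustrates the idea only and does not model any production scheduler.

```python
from typing import Dict, List


def place(jobs: List[int], capacity: int, nodes: List[str], pack: bool) -> Dict[str, int]:
    """Place single-node jobs and return the free GPUs left on each node."""
    free = {name: capacity for name in nodes}
    for gpus in jobs:
        fitting = [name for name in nodes if free[name] >= gpus]
        if not fitting:
            continue  # The job would stay queued.
        # Packing picks the fullest node that still fits; spreading picks the emptiest.
        chooser = min if pack else max
        target = chooser(fitting, key=lambda name: free[name])
        free[target] -= gpus
    return free


small_jobs = [2, 2, 2, 2]
nodes = ["node-a", "node-b"]

spread = place(small_jobs, capacity=8, nodes=nodes, pack=False)
packed = place(small_jobs, capacity=8, nodes=nodes, pack=True)

# Both layouts have 8 free GPUs in total, but only one can host an 8-GPU job.
print(spread, "largest free block:", max(spread.values()))
print(packed, "largest free block:", max(packed.values()))
```

The total free capacity is identical in both cases. What differs is whether it is usable by the large job waiting in the queue.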

The Lyceum Technology Approach to GPU Infrastructure

Building and maintaining a Kubernetes cluster capable of handling heavy AI workloads requires dedicated platform engineering teams. For European AI startups and scale-ups, this overhead detracts from shipping models and introduces compliance risks when relying on US-based hyperscalers.

Abstracting Infrastructure Complexity

Lyceum Technology provides an alternative: owned GPU infrastructure across European data centers, built specifically for AI workloads. Engineering teams access compute through a platform designed to maximize performance and minimize operational burden.

  • Raw Compute in Seconds

    Provision virtual machines with raw SSH access in 18 seconds. Lyceum standardizes the environment across 40+ supply-side partners, providing unified metrics for GPU and memory utilization without the complexity of managing Kubernetes nodes.
  • Intelligent Scheduling

    The Pythia AI Scheduler handles VRAM prediction, runtime estimation, and automatic GPU selection, which drives significant cost savings per job compared to manual allocation. It directly addresses the utilization ceiling that plagues traditional schedulers, ensuring your hardware works harder.

Sovereignty and Unified Workflows

  • EU Data Sovereignty

    All data stays in European data centers. Lyceum provides full GDPR compliance, offering a clear path for teams navigating the AI Act, C5, and ISO 27001 requirements. Non-EU hosting is often a deal-breaker for regulated industries, making this a critical structural advantage for enterprise deployments.
  • Unified Inference

    Transitioning from training to production is straightforward. The Lyceum Inference Engine allows you to host any LLM on dedicated infrastructure and serve it via an OpenAI-compatible API. A serverless inference option is also in development, enabling scale-to-zero efficiency.

By abstracting the infrastructure complexity, Lyceum allows ML engineers to focus on model architecture rather than cluster fragmentation. With per-second billing and zero egress fees, you maintain structural cost advantages over hyperscalers without the operational burden of managing Kubernetes from scratch.

Phased Migration Strategy

Migrating from Slurm to Kubernetes is not a simple lift-and-shift operation. It requires a fundamental shift in how you think about resource allocation, network topology, and job scheduling. The technical challenges are significant, but the cultural shift within your engineering organization is often the most difficult hurdle to clear.

Managing the Cultural Shift

For researchers accustomed to writing simple shell scripts and submitting them via sbatch, the transition to writing verbose YAML manifests can feel like a massive step backward in productivity. Platform engineering teams must invest heavily in internal developer platforms or abstraction layers to shield data scientists from the underlying Kubernetes complexity. If you force researchers to become Kubernetes administrators, your migration will likely face severe internal resistance.

Implementing a Phased Migration

A successful migration should be executed in phases. Start by moving stateless workloads, such as data preprocessing pipelines and model inference services, to Kubernetes. These workloads naturally align with Kubernetes' core strengths. Once the platform team is comfortable managing GPU nodes and storage integrations, you can begin testing batch schedulers like Volcano or Kueue with smaller, single-node training jobs. Finally, tackle the complex, multi-node distributed training workloads, ensuring that gang scheduling and topology-aware placement are functioning perfectly before deprecating the legacy Slurm cluster.

Unified Infrastructure Strategy

Whether you choose to implement Volcano, deploy Slinky to run Slurm on Kubernetes, or rely on a managed provider like Lyceum Technology, the goal remains the same. You are building a unified, highly utilized AI infrastructure stack that accelerates your path from research to production. By carefully navigating the architectural differences and avoiding common pitfalls like partial allocation deadlocks, your organization can successfully bridge the gap between high-performance computing and modern container orchestration.

Bridging the Gap with Multi-Cluster Orchestration

As organizations scale their AI operations, they often find themselves managing multiple disparate environments. A company might have an on-premise Slurm cluster for sensitive data, a cloud-based Kubernetes cluster for scalable inference, and temporary virtual machines for exploratory research. Managing workloads across these fragmented environments introduces massive friction.

The Challenge of Multi-Environment Deployments

When a machine learning engineer wants to run a training job, they shouldn't have to rewrite their deployment scripts based on where the compute is located. Unfortunately, moving a workload from a Slurm environment to a Kubernetes cluster typically requires translating shell scripts into YAML manifests, adjusting volume mounts, and reconfiguring environment variables. This manual translation is error-prone and slows down the pace of research. Fragmentation is a primary driver of low developer productivity in AI teams.

Abstracting the Infrastructure Layer

To solve this, many teams are turning to higher-level orchestration tools that abstract the underlying infrastructure entirely. Frameworks like SkyPilot allow users to define their resource requirements, such as the number of GPUs and the required memory, in a single configuration file. The orchestrator then translates this generic request into the specific syntax required by the target environment, whether that is a Slurm sbatch submission or a Kubernetes pod manifest.

This abstraction layer provides several critical benefits during a migration. First, it allows researchers to continue working without interruption while platform engineers swap out the underlying infrastructure. Second, it enables seamless bursting to cloud resources when on-premise Slurm clusters are at capacity. By decoupling the workload definition from the execution environment, organizations can achieve a unified workflow across both legacy HPC systems and modern Kubernetes deployments, significantly reducing the pain of migration.
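The decoupling idea can be sketched with a single job description rendered for two targets. This is not SkyPilot's API: the `JobSpec` class, its fields, and the image name are assumptions for illustration. Bare pods are also used for brevity, which on their own provide no gang scheduling.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class JobSpec:
    name: str
    command: str
    nodes: int
    gpus_per_node: int
    image: str  # used by the Kubernetes target only


def to_sbatch(spec: JobSpec) -> str:
    """Render the spec as a Slurm batch script."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --nodes={spec.nodes}",
        f"#SBATCH --gpus-per-node={spec.gpus_per_node}",
        f"srun {spec.command}",
    ])


def to_pod_manifests(spec: JobSpec) -> List[Dict[str, Any]]:
    """Render the spec as one Pod manifest per worker node."""
    return [{
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{spec.name}-{index}"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "worker",
                "image": spec.image,
                "command": ["/bin/sh", "-c", spec.command],
                "resources": {"limits": {"nvidia.com/gpu": spec.gpus_per_node}},
            }],
        },
    } for index in range(spec.nodes)]


spec = JobSpec(name="finetune", command="python train.py", nodes=2,
               gpus_per_node=8, image="registry.example.com/trainer:latest")
print(to_sbatch(spec))
print(len(to_pod_manifests(spec)), "pod manifests")
```

Researchers edit only the `JobSpec`. Which renderer runs is a platform decision that can change mid-migration without touching their workflow.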

Deep Dive: How NVIDIA Slinky Integrates Slurm and Kubernetes

For organizations that cannot afford to abandon their existing Slurm workflows, running Slurm on top of Kubernetes provides a compelling hybrid architecture. NVIDIA's Slinky project (a separate effort from CoreWeave's SUNK, with which it is sometimes confused) is at the forefront of this integration, offering a robust method for marrying HPC batch scheduling with containerized node management.

Architecture of a Containerized Slurm Cluster

Slinky operates by deploying Slurm components as Kubernetes native resources. The Slurm controller (slurmctld) and the database daemon (slurmdbd) are deployed as standard Kubernetes deployments, ensuring high availability and easy scaling. The critical innovation, however, lies in how Slinky handles the worker nodes. Instead of running the Slurm daemon (slurmd) directly on the host operating system, Slinky deploys it as a DaemonSet across the Kubernetes cluster.

This means that every Kubernetes node equipped with a GPU can automatically register itself as a Slurm worker node. When a user submits a job via sbatch, the Slurm controller schedules the job exactly as it would on bare metal. The slurmd pod on the assigned node then executes the task, leveraging Kubernetes' container runtime to isolate the workload.

Scaling and Performance Parity

One of the primary concerns with running a scheduler inside another scheduler is performance overhead. However, NVIDIA's internal testing has shown that this architecture scales exceptionally well. Deployments of Slinky manage clusters with over 8,000 GPUs without significant degradation in scheduling latency.

Furthermore, Slinky supports non-disruptive rolling updates. Platform engineers can update the underlying Kubernetes nodes or the Slurm daemons themselves without killing active training jobs. This level of operational flexibility is difficult to achieve with traditional bare-metal Slurm deployments, making Slinky a powerful tool for modernizing legacy AI infrastructure while maintaining the gang scheduling and fair-share queueing that researchers rely on.

Frequently Asked Questions

What are the main differences between Slurm and Kubernetes for AI?

Slurm is a batch scheduler designed for fixed high-performance computing (HPC) clusters. It excels at queuing, gang scheduling, and maximizing utilization for finite resources. Kubernetes is a container management platform built for microservices, excelling at elastic scaling and high availability but lacking native batch scheduling capabilities. Migrating requires bridging this fundamental architectural divide.

How does the PyTorch dist.barrier() function behave on Kubernetes?

In distributed training, `dist.barrier()` forces all processes to synchronize. If Kubernetes schedules three out of four required pods and leaves the fourth pending due to resource constraints, the three running pods will hit the barrier and block indefinitely. This silent hang wastes expensive GPU hours and requires custom schedulers to prevent.
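The hang can be reproduced without GPUs by using Python's standard-library `threading.Barrier` as a stand-in for the process group. This is an analogy, not PyTorch code: threads play the role of ranks, and a short timeout replaces the much longer process-group timeout a real job would wait for.

```python
import threading
from typing import Dict


def run_ranks(world_size: int, ranks_started: int, timeout_s: float) -> Dict[int, str]:
    """Start some of the expected ranks and report what happens at the barrier."""
    barrier = threading.Barrier(world_size)  # expects every rank to arrive
    results: Dict[int, str] = {}

    def rank(rank_id: int) -> None:
        try:
            barrier.wait(timeout=timeout_s)
            results[rank_id] = "passed"
        except threading.BrokenBarrierError:
            # Raised when the wait times out or another waiter broke the barrier.
            results[rank_id] = "timed out"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(ranks_started)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Three of four "pods" were scheduled: every running rank stalls at the barrier.
print(run_ranks(world_size=4, ranks_started=3, timeout_s=0.2))
# All four were scheduled together: the barrier releases immediately.
print(run_ranks(world_size=4, ranks_started=4, timeout_s=2.0))
```

In a real job the three running pods keep their GPUs for the whole wait, which is why all-or-nothing admission matters more than a shorter timeout.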

What is NVIDIA Slinky?

NVIDIA Slinky is an open-source project that integrates Slurm with Kubernetes. It allows organizations to run full Slurm clusters on Kubernetes infrastructure by managing Slurm daemons as pods. This bridges the gap between legacy HPC workflows and modern container management, scaling to thousands of GPUs efficiently. It is a separate project from CoreWeave's SUNK, which pursues a similar goal.

Should I use Volcano or Kueue for Kubernetes batch scheduling?

Both are strong options depending on your needs. Volcano is a mature batch scheduling engine that provides robust gang scheduling and advanced queuing, ideal for heavy AI workloads. Kueue is a newer, Kubernetes-native job queueing controller that works alongside the default scheduler to manage quotas and fair-share resource allocation seamlessly.

How does Lyceum Technology handle GPU infrastructure?

Lyceum Technology provides owned, GDPR-compliant GPU infrastructure across European data centers. Instead of managing complex Kubernetes or Slurm deployments, teams can provision VMs in 18 seconds or deploy models via an OpenAI-compatible Inference Engine. This approach benefits from per-second billing and intelligent scheduling, removing the operational burden of cluster management entirely.
