Production GPU Infrastructure Container Deployment 15 min read read

How to Run a Production ML Pipeline Without a DevOps Team

Stop managing infrastructure and start shipping models. A technical blueprint for AI startups to scale training and inference with zero operational overhead.

Caspar Lehmkühler

Caspar Lehmkühler

May 26, 2026 · Head of Product at Lyceum Technology

The primary bottleneck for AI startups is not model architecture. It is infrastructure operations. According to Google Cloud's State of AI Infrastructure report, while 98% of organizations are using generative AI, 83% cite cost efficiency and scaling pain as their top pressures [1]. Engineering teams are burning cycles managing Kubernetes clusters, fighting for hyperscaler GPU allocations, and debugging CUDA drivers instead of fine-tuning models. You do not need a dedicated DevOps team to run a production-grade ML pipeline. The shift toward serverless GPU execution and managed inference APIs means you can decouple compute from operations entirely. If you are a team of 15 to 100 engineers, building your own infrastructure abstraction layer is a massive misallocation of resources. This guide breaks down exactly how to architect a zero-DevOps machine learning pipeline, from experimentation to production serving.

The True Cost of DIY GPU Infrastructure

The Hidden Costs of Hardware Maintenance

Managing local GPU servers or raw hyperscaler virtual machines looks cost-effective on a spreadsheet. In reality, it destroys engineering velocity and inflates your operational budget.

Consider the local hardware route. Purchasing a single server equipped with eight NVIDIA H100 GPUs represents a massive capital expenditure. But the hardware cost is only the baseline. The hidden tax comes from maintenance. You have to manage cooling requirements, configure reverse proxies, and build VPN tunnels to access the hardware securely. When you build a custom rig, you are suddenly responsible for PCIe lane configurations, NVLink bridge stability, and power supply constraints. Updating BIOS settings to support Resizable BAR or debugging a faulty PCIe riser takes days of engineering time. When a drive fails or a node crashes, your machine learning engineers are forced to play sysadmin.

Furthermore, your utilization rate will be terrible. Industry data indicates that average GPU utilization rates hover between 15% and 40% in traditional deployments [2]. You are paying for premium silicon that sits idle while your engineers sleep, write code, or analyze results.

The Hyperscaler Capacity Gap

The hyperscaler route is equally problematic. Public clouds require block-reservations for high-end GPUs. If you attempt to use on-demand instances, you will face severe capacity bottlenecks. You request a machine, wait twenty minutes, and receive a capacity error. When you finally secure an instance, you realize the default machine image has the wrong NVIDIA driver version for your PyTorch build, forcing you into hours of debugging CUDA toolkit conflicts.

Many startups fall into the hyperscaler credit trap. They receive a massive grant, build a sloppy architecture that dedicates one GPU per model 24/7, and burn through their allocation in months. When the credits expire, they are left with an unsustainable infrastructure bill and a pipeline that cannot scale.

Standardizing the Training Pipeline with Containers

Container-First Development as the Foundation

To eliminate operational overhead, you must standardize how workloads are packaged and executed. The foundation of a zero-DevOps pipeline is container-first development. Instead of configuring environments on remote virtual machines via SSH, package your training scripts, dependencies, and environment variables into a Docker container. This creates an immutable artifact that runs identically on a local RTX 4090 or a massive H100 cluster. Containerization isolates your code from the underlying host system, eliminating the classic environment mismatch problem.

A robust machine learning Dockerfile starts with an official NVIDIA base image, such as nvidia/cuda:12.1.0-base-ubuntu22.04. From there, you install your specific Python version, copy your dependency files, and install your libraries. By locking down your PyTorch version and CUDA runtime within the container, you guarantee that your training script will execute flawlessly regardless of the host machine's configuration. This level of standardization means that any engineer on your team can pull the image and reproduce a training run without spending hours debugging dependency conflicts.

Leveraging Serverless Training Architectures

Once your workload is containerized, you can leverage serverless execution for training runs. Consider a biotech startup training a protein folding model. This workload requires federated learning, massive parallel compute, and strict FP32 precision. Instead of provisioning a cluster manually, the team submits their Docker container to a serverless execution platform. They define the hardware requirements and pass the necessary execution flags.

The platform automatically provisions the required nodes, pulls the container, executes the training loop, streams the output logs, and tears down the infrastructure the moment the job completes. You pay strictly per second of execution, similar to the pricing models seen on platforms like Replicate [2]. There are no idle costs and no cluster management tasks. If the training run takes three weeks, the infrastructure scales up and down precisely around that workload. By adopting this serverless approach, teams can run complex, multi-node training jobs without ever touching a Kubernetes manifest or configuring a load balancer.

Conquering Inference Complexity

The Mechanics of LLM Inference

Training is a batch process with a defined end state. Inference is a continuous operational commitment. Serving large language models or vision foundation models requires handling concurrent requests, managing KV caches, and scaling replicas based on unpredictable traffic spikes.

The mechanics of LLM inference are notoriously complex. The process is split into a compute-heavy prefill phase and a memory-bandwidth-bound decode phase. To achieve high throughput, you must implement continuous batching, which dynamically groups incoming requests at the iteration level rather than waiting for an entire sequence to finish. Managing the KV cache manually to prevent out-of-memory errors requires deep systems engineering knowledge. This is the exact point where most startups panic and hire an MLOps engineer. Building a custom serving stack means writing FastAPI wrappers, handling request queues, managing GPU memory allocation, and configuring load balancers. If you dedicate a virtual machine to a single model, you pay for 24/7 uptime even if your users only trigger the model a few times a day.

Managed Endpoints and Scale-to-Zero

Managed inference endpoints solve this problem entirely. For European teams needing strict data privacy, Lyceum provides a dedicated inference engine that hosts your model on EU-sovereign infrastructure. You deploy your Hugging Face model or custom Docker image to a dedicated GPU, and you receive an OpenAI-compatible API endpoint. You change the base URL in your SDK, and your application works instantly.

The platform handles round-robin load balancing and scales to zero when idle. This means the machine shuts down during periods of inactivity, and you pay only when serving traffic. Leading serverless platforms charge by the second of compute time used [1], drastically reducing costs for bursty workloads. With rapid virtual machine provisioning times, cold starts are minimized, giving you the control of owned infrastructure with the simplicity of a managed API. You avoid the overhead of maintaining a persistent server while ensuring your application remains highly responsive to user demand.

The European Compliance Imperative

Navigating Data Residency Requirements

If you operate in healthcare, manufacturing, or enterprise software within the European Union, data residency is a hard requirement. You cannot route sensitive data through servers located outside the EU. When pitching a factory anomaly detection system or a medical image segmentation tool to a European enterprise, the first question they will ask concerns data protection.

Enterprise clients operate on zero-trust architectures. They require strict adherence to compliance frameworks like ISO 27001, SOC 2, and the upcoming EU AI Act. Many serverless GPU platforms rely on multi-cloud hyperscaler backends or decentralized marketplaces where you have zero control over where the compute actually happens [1]. This architecture fails GDPR requirements immediately. Furthermore, US-based providers are subject to the CLOUD Act, which creates unacceptable legal exposure for European defense and pharmaceutical companies. Non-EU hosting is a deal-breaker for regulated industries, making infrastructure location just as important as compute performance.

Building a Competitive Moat with Sovereignty

Compliance serves as a powerful competitive moat. By running workloads on owned GPU infrastructure across European data centers, you guarantee provable data residency. The stack relies on open-source standards like vLLM and NVIDIA Dynamo rather than proprietary black-box engines, ensuring complete transparency and customer portability. This owned-infrastructure model also provides a structural cost advantage.

For example, high-performance virtual machines are available at a fraction of the cost of legacy hyperscalers, delivering significant savings without sacrificing performance. When you decouple your operations from a dedicated DevOps team, you must ensure your chosen serverless provider inherently solves these compliance challenges. Providers like Lyceum allow European startups to leverage the speed of serverless execution while maintaining the strict data sovereignty required to close enterprise contracts. This ensures that scaling your machine learning pipeline does not inadvertently introduce regulatory liabilities.

Storage and Data Gravity

The Impact of Data Gravity on Pipeline Velocity

Compute is only half of the machine learning equation. Data gravity dictates that your storage architecture will ultimately define your infrastructure costs and pipeline velocity. Moving terabytes of training data across the public internet is slow and prohibitively expensive. During a training run, your GPU is only as fast as the data pipeline feeding it. If your PyTorch DataLoaders are waiting on slow network attached storage, your expensive H100 GPUs will sit idle, waiting for the next batch of images or text. You need high-throughput storage solutions that can saturate the GPU memory bandwidth.

Hyperscalers are notorious for their egress fees. They make it cheap to upload your datasets but charge exorbitant rates when you need to move that data to a different compute provider or download your model weights. This creates vendor lock-in at the storage layer, forcing you to use their overpriced compute instances simply because your data is trapped there. When evaluating top serverless GPU clouds, you must consider how they handle data ingress and egress [1].

Decoupling Storage from Compute

A modern pipeline requires decoupling storage from compute without incurring financial penalties. You should utilize S3-compatible storage solutions that offer zero egress fees. When you trigger a serverless training job, the compute node should pull the dataset directly from this storage bucket, execute the workload, and write the final model weights back to the bucket.

By eliminating data transfer charges, you maintain the flexibility to route your compute workloads to the most cost-effective hardware available at any given moment. This architecture allows your machine learning engineers to treat compute as a stateless, disposable resource. You can spin up a massive cluster for a few hours, process your data, and shut it down without worrying about where the data will live long-term. This separation of concerns is critical for operating a zero-DevOps pipeline efficiently.

A Blueprint for Zero-DevOps Scaling

Transitioning to a zero-DevOps pipeline requires a disciplined approach to engineering. Follow this technical framework to scale your infrastructure without expanding your headcount.

  1. Develop Locally, Train Serverlessly

    Use short-lived GPU instances for model testing and experimentation. Spin up a machine for a thirty-minute session to validate your code and ensure your gradients are updating correctly. Once the architecture is stable, push the container to a registry and trigger a serverless training run for the full dataset.
  2. Integrate CI/CD for Machine Learning

    Connect your code repositories to your infrastructure. When a developer merges a pull request, use automated actions to build the Docker container, run unit tests on a small CPU instance, and then trigger the serverless GPU training job via an API call. This automates the entire deployment pipeline.
  3. Predict VRAM Requirements

    Out-of-memory errors are the most common cause of failed training runs. Utilize tools like the Pythia AI scheduler to predict VRAM requirements and estimate runtimes before you provision hardware. This prevents wasted compute spend and accelerates your iteration cycles by ensuring you select the correct GPU tier for the job.
  4. Implement Scale-to-Zero Inference

    For workloads that are not consistently saturated, configure your endpoints to scale to zero. If a factory camera only runs inference when an anomaly is detected, do not pay for a persistent GPU. The slight cold-start latency is a worthwhile trade-off for eliminating idle costs.
  5. Demand Open Standards

    Avoid platforms that require proprietary SDKs or custom model packaging formats. If a provider does not support standard Docker containers or an OpenAI-compatible API, you are locking yourself into their ecosystem. Maintain portability by design so you can migrate workloads as hardware pricing fluctuates.

By standardizing on containers, leveraging serverless execution, and utilizing managed inference APIs, your engineering team can focus entirely on model performance and product features. Infrastructure should be an API call, not a full-time job.

Automating Model Deployment with CI/CD

Bridging the Gap Between Code and Compute

Operating without a dedicated DevOps team requires rigorous automation. You cannot rely on manual scripts or SSH sessions to deploy new model versions. Instead, you must integrate Continuous Integration and Continuous Deployment (CI/CD) practices directly into your machine learning workflows. This ensures that every code commit or model weight update is automatically tested, packaged, and deployed to your serverless infrastructure.

The process begins in your version control system. When a machine learning engineer pushes a new model architecture to a repository, a GitHub Action or GitLab CI pipeline should trigger automatically. This pipeline first runs unit tests on a lightweight CPU instance to verify code integrity. Once the tests pass, the pipeline builds a new Docker container containing the updated code and dependencies. This container is then pushed to a secure container registry. By automating the build process, you eliminate the risk of human error and ensure that the exact same container is used for both testing and production.

Triggering Serverless Deployments via API

After the container is securely stored in the registry, the CI/CD pipeline makes an API call to your serverless GPU provider. Platforms that offer pay-per-second billing and API-driven deployments allow you to programmatically update your inference endpoints [2]. The API call instructs the provider to pull the latest container image and route incoming traffic to the new version.

This automated deployment strategy supports advanced release techniques like blue-green deployments or canary releases. You can route a small percentage of user traffic to the new model to monitor performance and latency before fully deprecating the old version. If an issue arises, rolling back is as simple as triggering another API call to revert to the previous container tag. By treating infrastructure as code and relying on managed APIs, your machine learning engineers can push updates to production multiple times a day without ever needing to consult a systems administrator.

Cost Optimization Strategies for Serverless ML

Maximizing the Value of Pay-Per-Second Billing

Transitioning to a zero-DevOps pipeline fundamentally changes your cost structure. Instead of amortizing massive capital expenditures over several years, you shift entirely to an operational expenditure model. Serverless GPU platforms typically charge based on the exact number of seconds your workload runs [2]. While this eliminates the financial drain of idle hardware, it requires a different approach to cost optimization to ensure your cloud bill remains predictable.

The most effective way to control costs is to match your workload to the appropriate hardware tier. Not every training job requires an NVIDIA H100. For fine-tuning smaller models or running inference on tabular data, an RTX 4090 or an A100 might deliver better cost-to-performance ratios. By profiling your workloads locally and understanding your exact VRAM requirements, you can select the most economical instance type for each specific task. Because serverless platforms allow you to define hardware requirements via an API call, you can dynamically switch GPU types based on the needs of the current job.

Monitoring and Enforcing Resource Limits

Another critical optimization strategy is implementing strict resource limits and monitoring. Without a DevOps team to manually audit cloud usage, a runaway training script or an infinite loop can quickly accumulate massive charges. You must utilize the budgeting and alerting tools provided by your serverless platform. Set hard caps on execution time for training jobs to ensure that a stalled process is automatically terminated before it drains your budget.

For inference workloads, take full advantage of scale-to-zero capabilities. Ensure your endpoints are configured to shut down aggressively during periods of low traffic. While you might experience a slight cold-start delay when the next request arrives, the cost savings of turning off a high-end GPU overnight are substantial. By combining containerized workloads with intelligent API usage, your engineering team can maintain tight control over infrastructure spending while retaining the ability to scale instantly when user demand spikes.

Frequently Asked Questions

Why is auto-scaling GPUs on public clouds so difficult?

Legacy hyperscalers were fundamentally designed for CPU workloads, making GPU provisioning incredibly slow. Requesting a high-end GPU instance often takes several minutes, and capacity is frequently unavailable on demand due to massive hardware shortages. This delay makes real-time auto-scaling impossible for latency-sensitive inference workloads. Consequently, machine learning teams are forced to pay for persistent, block-reserved instances just to guarantee availability, resulting in massive idle costs.

How do I avoid vendor lock-in with GPU cloud providers?

To avoid vendor lock-in, you must standardize your workloads using open-source tools. Package your training scripts and dependencies inside Docker containers to ensure portability across different hardware providers. Furthermore, use inference platforms that support open-source orchestration frameworks like vLLM and provide an OpenAI-compatible API. You should strictly avoid proprietary model packaging formats or black-box inference engines that restrict your ability to migrate.

What are the compliance risks of using US-based serverless GPU platforms?

US-based cloud providers are subject to the CLOUD Act, a law that can compel them to hand over customer data to US authorities regardless of where the physical servers are located. For European companies operating in highly regulated sectors like healthcare, defense, or enterprise software, this legal exposure directly violates strict data residency requirements and GDPR mandates. Choosing an EU-sovereign provider mitigates this risk entirely.

How does scale-to-zero pricing work for machine learning models?

Scale-to-zero is a pricing mechanism that automatically shuts down the GPU instance hosting your machine learning model when there are no incoming API requests. This ensures you stop paying for expensive compute during idle periods. When a new request arrives, the serverless platform rapidly provisions the machine, processes the request, and resets the idle timer, optimizing your operational expenditure.

Why should I decouple storage from compute in my ML pipeline?

Tying your data to a specific compute provider exposes you to massive egress fees if you ever want to move your workloads. By using independent, S3-compatible storage with zero egress fees, you can freely route your training jobs to whichever GPU provider offers the best pricing and availability at that moment.

Further Reading

Related Resources

/magazine/deploy-docker-container-gpu-cloud; /magazine/gpu-cloud-api-ci-cd-automation; /magazine/vllm-vs-tgi-vs-triton-inference-server