ML Training Without AWS: A Guide to Sovereign GPU Infrastructure
Why European AI teams are moving beyond hyperscalers for performance and compliance.
Aurelien Bloch
February 23, 2026 · Head of Research at Lyceum Technologies
AWS is the default starting point for many ML engineers due to initial credits and a familiar ecosystem. However, as models scale and datasets grow into the terabyte range, the 'hyperscaler tax' becomes a significant burden. This tax manifests as exorbitant egress fees, complex VPC configurations, and a lack of specialized support for deep learning workloads. Furthermore, European enterprises face increasing pressure to maintain data residency within the EU to satisfy GDPR and sovereignty requirements. Moving ML training away from AWS is no longer just a cost-saving measure: it is a strategic necessity for teams requiring high-performance compute, predictable pricing, and data sovereignty in a competitive AI landscape.
The Hidden Costs of Hyperscaler ML Training
When teams evaluate the cost of ML training on AWS, they often focus solely on the hourly rate of P4d or P5 instances. This narrow focus ignores the substantial 'hidden tax' associated with the broader AWS ecosystem. Data egress fees are perhaps the most punitive of these costs. Moving large datasets from S3 to an external compute provider or even between regions can result in significant unexpected charges. For a team training a large language model on a 10TB dataset, the cost of simply moving that data can rival the cost of the compute itself. This creates a 'vendor lock-in' effect where the cost of leaving the ecosystem becomes a barrier to optimization.
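To make the egress tax concrete, here is a back-of-the-envelope sketch. It assumes AWS's published first-tier internet data-transfer-out rate of roughly $0.09/GB; actual bills vary by region and tier down somewhat at higher volumes.

```python
def egress_cost_usd(dataset_gb: float, rate_per_gb: float = 0.09) -> float:
    """Estimate the cost of moving a dataset out of S3 to the internet.

    0.09 USD/GB is the commonly cited first-tier AWS data-transfer-out
    rate; larger volumes tier down slightly, so treat this as an upper
    bound on the per-transfer cost.
    """
    return dataset_gb * rate_per_gb

# Moving a 10 TB training dataset out of S3 once:
print(f"${egress_cost_usd(10_000):,.0f}")  # roughly $900 per full transfer
```

Repeat that transfer a few times per experiment cycle (syncing datasets, checkpoints, and artifacts) and the egress line item quickly becomes a meaningful fraction of the compute bill for smaller runs.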
Beyond data movement, the complexity of AWS orchestration often requires dedicated DevOps resources. Managing SageMaker, EC2 clusters, and FSx for Lustre involves a steep learning curve and significant maintenance overhead. Many AI startups find that their ML engineers spend more time debugging IAM roles and VPC peering than they do refining their model architectures. This operational friction reduces the velocity of research and development. Furthermore, teams on AWS routinely over-provision their instances. Without precise visibility into memory footprints and utilization, teams default to the largest available instances to avoid Out-of-Memory (OOM) errors, leading to an industry-average GPU utilization of only 40%. This inefficiency means that 60% of the compute budget is effectively wasted on idle silicon.
The Case for EU-Sovereign GPU Infrastructure
For European AI companies, data sovereignty is not an optional feature: it is a fundamental requirement. The US CLOUD Act allows US authorities to compel US-based companies to disclose data they store, regardless of where the physical servers are located. This creates a significant legal risk for enterprises handling sensitive healthcare, financial, or personal data. By moving ML training off AWS and onto a sovereign European provider, companies ensure that their data never leaves EU jurisdiction. Lyceum operates clusters in Berlin and Zurich, providing a legally robust environment that is GDPR compliant by design.
Sovereignty also extends to the physical infrastructure and energy grid. European data centers are often subject to stricter sustainability regulations and have better access to renewable energy sources compared to many US-based regions. For companies with ESG (Environmental, Social, and Governance) targets, training models on a European cloud is a measurable way to reduce the carbon footprint of their AI operations. Additionally, local infrastructure means lower latency for European engineering teams. When your compute is physically closer to your researchers, interactive sessions in JupyterLab or VS Code feel more responsive, improving the developer experience. Choosing a provider with a focus on the DACH region (Germany, Austria, Switzerland) ensures that support and infrastructure are aligned with local business hours and regulatory standards.
Solving the 40% GPU Utilization Problem
One of the most significant challenges in ML infrastructure is the massive gap between theoretical and actual GPU utilization. Most teams struggle with a 40% average utilization rate because they lack the tools to profile their workloads accurately before execution. This leads to a cycle of over-provisioning: engineers select an H100 with 80GB of VRAM for a job that might only require 32GB, simply because they cannot risk a crash mid-training. This 'safety margin' is expensive and unnecessary if you have the right orchestration layer. Specialized orchestration addresses this by providing precise predictions for runtime, memory footprint, and utilization before a job even starts.
By analyzing the model architecture and dataset parameters, the platform can auto-detect potential memory bottlenecks. If a job is likely to bottleneck on I/O rather than compute, the system can suggest a different hardware configuration or optimize the data loading pipeline. This workload-aware approach shifts the focus from 'renting a box' to 'executing a task.' When the infrastructure understands the requirements of a PyTorch or JAX script, it can schedule that script on the most cost-effective hardware that meets the performance constraints. This level of intelligence is missing from generic hyperscalers, which treat a GPU instance as just another virtual machine. Improving utilization from 40% to 80% effectively doubles the compute you get from the same spend.
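As a rough illustration of the kind of prediction involved, a widely used rule of thumb for mixed-precision training with an Adam-style optimizer is about 16 bytes of VRAM per parameter before activations. The sketch below is that back-of-the-envelope arithmetic, not Lyceum's actual predictor:

```python
def training_vram_gb(n_params: float, activation_gb: float = 0.0) -> float:
    """Rule-of-thumb VRAM for mixed-precision Adam training.

    Per parameter: fp16 weights (2 B) + fp16 gradients (2 B)
    + fp32 master weights, first and second Adam moments (3 x 4 B)
    = 16 bytes, plus an activation term that depends on batch size
    and sequence length.
    """
    return n_params * 16 / 1e9 + activation_gb

# A 7B-parameter model needs ~112 GB for weights, gradients, and optimizer
# states alone, so it cannot train on a single 80 GB H100 without sharding,
# offloading, or a memory-efficient optimizer.
print(training_vram_gb(7e9))  # 112.0
```

Even this crude estimate shows why guessing is expensive: the gap between "fits on one 48 GB card" and "needs a sharded 8-GPU node" is a large cost multiplier that a profiler can resolve before any silicon is rented.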
Automated Hardware Selection and Optimization
The GPU market is no longer dominated by a single 'best' chip. While the NVIDIA H100 is the gold standard for large-scale LLM training, other chips like the A100, L40S, or even specialized inference cards might be more cost-effective for specific tasks like fine-tuning, embedding generation, or computer vision. Manually benchmarking every model against every available GPU type is a waste of engineering time. An intelligent orchestration platform should handle this automatically. The platform features an auto hardware selection engine that allows engineers to define their priorities: cost-optimized, performance-optimized, or time-constrained.
For example, if a team needs to run a batch of experiments by Monday morning, the system can select a cluster of L40S GPUs that provides the necessary throughput at a lower price point than H100s. Conversely, for a massive pre-training run where interconnect speed is the primary bottleneck, the system will prioritize H100s with NVLink. This level of granularity ensures that you are always using the right tool for the job. The platform also monitors for OOM errors in real-time and can automatically suggest a hardware upgrade for the next run if a bottleneck is detected. This proactive management eliminates the 'guesswork' that typically defines ML infrastructure management on AWS.
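Conceptually, the selection logic reduces to filtering a hardware catalog by the predicted memory footprint and then ranking by the user's priority. The sketch below is purely illustrative: the catalog entries, prices, and throughput figures are made-up placeholders, not Lyceum's actual offering or pricing.

```python
# Illustrative catalog only: prices and TFLOPS figures are placeholders.
CATALOG = [
    {"gpu": "H100-80GB", "vram_gb": 80, "fp16_tflops": 989, "eur_per_hour": 3.50},
    {"gpu": "A100-80GB", "vram_gb": 80, "fp16_tflops": 312, "eur_per_hour": 2.00},
    {"gpu": "L40S-48GB", "vram_gb": 48, "fp16_tflops": 362, "eur_per_hour": 1.10},
]

def select_gpu(required_vram_gb: float, priority: str = "cost") -> str:
    """Pick the cheapest (or fastest) GPU that fits the predicted footprint."""
    fits = [g for g in CATALOG if g["vram_gb"] >= required_vram_gb]
    if not fits:
        raise ValueError("No single GPU fits; shard the model across devices.")
    if priority == "cost":
        best = min(fits, key=lambda g: g["eur_per_hour"])
    else:  # "performance"
        best = max(fits, key=lambda g: g["fp16_tflops"])
    return best["gpu"]

print(select_gpu(32))                 # L40S-48GB: cheapest card that fits
print(select_gpu(64, "performance"))  # H100-80GB: fastest card with 80 GB
```

The interesting part in a real system is not this ranking but the inputs to it: an accurate memory and runtime prediction turns a three-line sort into a reliable procurement decision.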
Streamlining Deployment with One-Click PyTorch
The developer experience on traditional cloud platforms is often fragmented. Engineers have to manage Docker images, configure SSH keys, set up drivers, and ensure that the CUDA version matches their framework requirements. This setup phase can take hours or even days for complex multi-node clusters. To move ML training off AWS effectively, the replacement must offer a superior developer experience. Lyceum provides a one-click PyTorch deployment workflow that abstracts away the underlying infrastructure. Whether you are using the CLI, a VS Code extension, or a RESTful API, the goal is to get from code to training as quickly as possible.
Consider a typical workflow using the Lyceum CLI: a researcher can submit a job with a single command like lyceum job submit --image pytorch/pytorch:latest --script train.py. The platform handles the provisioning of the GPU, the mounting of the dataset, and the logging of the output. There is no need to manually manage EC2 instances or worry about spot instance interruptions. For teams that prefer an integrated environment, the VS Code extension allows for seamless remote development. You can write code locally and execute it on a high-performance GPU in Berlin with the same ease as running it on your laptop. This tight integration between the development environment and the compute cluster reduces context switching and allows ML engineers to focus on what they do best: building models.
Total Cost of Compute (TCC) vs. Hourly Rates
The traditional model of cloud pricing is based on hourly instance rates, which is a poor metric for ML workloads. What matters to an AI team is not how much an hour of an A100 costs, but how much it costs to complete a specific training run. This is the 'Total Cost of Compute' (TCC). TCC includes the hourly rate, the time spent on setup, the cost of data egress, and the cost of failed runs due to infrastructure issues. When you factor in these variables, AWS is often significantly more expensive than specialized providers. The alternative is workload-aware pricing that aligns the cost with the actual value delivered.
By predicting the runtime and memory requirements of a job, the platform can provide a more accurate estimate of the total cost before the job starts. This transparency allows CTOs and Team Leads to manage their budgets more effectively. Furthermore, the absence of egress fees means that you can move your models and results out of the cloud without being penalized. This 'zero egress' policy is a cornerstone of a sovereign cloud, as it ensures that the user retains full control over their data and intellectual property. When you combine higher utilization rates with the elimination of hidden fees, the TCC on a specialized platform can be 30-50% lower than on a general-purpose hyperscaler.
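The TCC argument is easy to express as arithmetic. The function below is a deliberately simple model, and the example figures are hypothetical, but it shows why comparing hourly rates alone is misleading:

```python
def total_cost_of_compute(hourly_rate: float, runtime_hours: float,
                          setup_hours: float = 0.0, egress_gb: float = 0.0,
                          egress_rate_per_gb: float = 0.0,
                          failure_overhead: float = 0.0) -> float:
    """Total Cost of Compute for one training run.

    failure_overhead is the fraction of compute re-spent on runs lost
    to infrastructure issues (e.g. 0.1 = 10% of hours repeated).
    """
    compute = hourly_rate * (runtime_hours + setup_hours) * (1 + failure_overhead)
    return compute + egress_gb * egress_rate_per_gb

# Hypothetical 100-hour run. Similar class of GPU, very different totals:
hyperscaler = total_cost_of_compute(4.00, 100, setup_hours=8,
                                    egress_gb=10_000, egress_rate_per_gb=0.09,
                                    failure_overhead=0.10)
specialized = total_cost_of_compute(3.50, 100)  # no setup tax, zero egress
print(round(hyperscaler, 2), round(specialized, 2))  # 1375.2 350.0
```

In this toy example the hourly rates differ by about 12%, but the completed-run costs differ by almost 4x, which is the gap TCC is designed to surface.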
Technical Implementation: Migrating Your Workload
Migrating ML training off AWS is straightforward if your code is already containerized or uses standard frameworks like PyTorch, TensorFlow, or JAX. The first step is to decouple your data storage from AWS-specific services like S3. While many providers offer S3-compatible APIs, moving your primary dataset to a local, high-performance storage solution within the same data center as your GPUs will significantly improve I/O performance. The platform supports standard data mounting options that make this transition seamless. Once your data is accessible, the next step is to adapt your training scripts to use the Lyceum orchestration layer.
For most teams, this simply means replacing torchrun or srun commands with the Lyceum CLI. The platform's Slurm integration also makes it easy for teams coming from academic or on-prem environments to adapt their existing scheduling scripts. A key technical advantage during migration is the platform's ability to auto-detect hardware requirements. You don't need to manually specify the exact GPU model if you don't want to: you can simply specify your performance targets, and the orchestrator will handle the rest. This abstraction layer makes your ML pipeline more portable and resilient to hardware shortages. By following a container-first approach, you ensure that your training environment is reproducible across different providers, further reducing vendor lock-in.
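One reason the swap is usually painless is that torchrun and most managed schedulers hand off to the training script through the same torch.distributed environment variables. A script that reads them defensively, as sketched below, runs unchanged under any launcher that follows this convention (under plain srun you may need to map Slurm's variables onto these names); the defaults fall back to single-process local execution:

```python
import os

def dist_config() -> dict:
    """Read the standard torch.distributed launcher environment.

    torchrun and schedulers that follow its convention set RANK,
    WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT; the defaults
    below let the same script run as a plain single-process job.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", 29500)),
    }

cfg = dist_config()
# In a real script you would now call
# torch.distributed.init_process_group("nccl") and
# torch.cuda.set_device(cfg["local_rank"]).
print(cfg["world_size"])
```

Because the script never hard-codes a launcher, the same container runs on a laptop, an on-prem Slurm cluster, or a managed platform, which is exactly the portability that reduces lock-in.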
The Future of Sovereign AI Infrastructure
The shift toward specialized, sovereign GPU clouds is part of a broader trend in the technology industry. As AI becomes a core component of national and corporate infrastructure, the reliance on a handful of US-based hyperscalers is being questioned. The future of AI infrastructure is distributed, efficient, and compliant. European companies are leading this charge by demanding providers that understand their unique regulatory and performance needs. Specialized providers are at the forefront of this movement, building a platform that not only provides the raw compute power needed for modern AI but also the intelligence to use that power efficiently.
In the coming years, we expect to see even tighter integration between hardware and software orchestration. Features like automated model sharding, dynamic resource scaling, and cross-cluster scheduling will become standard. By moving away from the generic 'one-size-fits-all' approach of AWS, AI teams can unlock new levels of performance and innovation. The goal is to make high-performance compute as accessible and easy to use as a local workstation, without the overhead of traditional cloud management. For teams in Berlin, Zurich, and across Europe, the path to sovereign AI starts with choosing infrastructure that is built for the specific demands of machine learning.