Kubernetes Unlocking Fungible GPU Capacity
A Lambda customer at Iambic Therapeutics on GPU infrastructure choices for ML training and inference
If training moves onto a common Kubernetes control layer, a cloud GPU stops looking like a custom environment and starts looking like a schedulable commodity. That matters because today's large training runs still depend on provider-specific cluster setup, storage wiring, and networking choices, which is why teams reserve long blocks of time with a vendor like Lambda instead of treating GPU hours as interchangeable spot capacity. Kubernetes is the path to portability, but the missing pieces are job scheduling, data movement, and reproducible cluster setup.
-
The practical bottleneck is not the chip; it is the surrounding system. Iambic describes training as needing identical GPUs, HGX-style builds, high-quality InfiniBand, custom Kubernetes components, and storage tuned jointly with the provider. Those dependencies make switching clouds slow and expensive even when list prices are close.
-
Inference is already closer to fungible because Kubernetes and cloud primitives are more mature there. Iambic runs on-demand inference on AWS because instances spin up reliably, storage is stable, and the infrastructure can be managed as code. That is much closer to a standard cloud workflow than reserving a multi-month training cluster.
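To make "managed as code" concrete: on any conformant Kubernetes cluster with the NVIDIA device plugin installed, a GPU inference service is just a declarative manifest that requests a GPU like any other resource. This is a minimal sketch; the image name and labels are hypothetical placeholders, not Iambic's actual setup.

```yaml
# Hypothetical inference deployment: portable across any Kubernetes
# cluster exposing GPUs via the standard nvidia.com/gpu resource.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/model-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # one GPU per replica, scheduled by Kubernetes
```

Because the GPU request uses the standard extended-resource convention rather than anything provider-specific, the same manifest applies unchanged on AWS, CoreWeave, or a bare-metal cluster, which is precisely what makes inference capacity easier to shop for than training capacity.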
-
The market is trying to close this gap. CoreWeave positions its cloud as Kubernetes-native and bundles pieces like Slurm on Kubernetes, observability, and managed Ray, while Kubeflow now supports Volcano for gang scheduling, so all pods in a training job start together or not at all. Those are exactly the layers needed to make training workloads easier to move.
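Gang scheduling is the piece that matters most for distributed training, since a run that starts with only some of its workers just burns GPU hours. A rough sketch of what this looks like with Volcano's own Job API (the image name is a placeholder, and the resource counts are illustrative):

```yaml
# Sketch of a gang-scheduled training job using the Volcano scheduler.
# minAvailable tells the scheduler to place all 4 workers atomically:
# either every pod gets a node, or none start.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-train
spec:
  schedulerName: volcano
  minAvailable: 4          # the "gang": do not start a partial job
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/trainer:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8  # e.g. one full HGX node per worker
```

If a manifest like this runs the same way on any provider offering a Volcano-enabled cluster, the job definition, not the provider's bespoke setup, becomes the unit of portability.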
If these orchestration layers keep improving, neocloud pricing should converge faster and reserved training contracts should lose some of their lock-in. The winners would be providers that combine cheap GPU supply with a clean scheduler, portable storage, and one-step job submission, because that turns training infrastructure into something customers can shop for by price and availability instead of rebuilding their stack each time.