CUDA vs ROCm cloud lock-in

Lambda customer at Iambic Therapeutics on GPU infrastructure choices for ML training and inference

Interview: as the customer put it, "you have 'green cloud providers' (NVIDIA) versus 'red cloud providers' (AMD)."

The key implication is that GPU cloud lock-in could shift from contract and data-migration friction to software-stack incompatibility. Today, most training clouds sell the same small set of NVIDIA systems, so a team can compare Lambda, CoreWeave, or AWS mostly on price, interconnect, and support. If AMD becomes a real alternative, the harder move would be from CUDA-based NVIDIA workflows to ROCm-based AMD workflows, which would turn chip choice into a new moat for cloud providers.

  • In practice, portability is high inside the NVIDIA world because customers are usually asking for the same H100-, B200-, or A100-style clusters with the same reference architectures and networking. In the Iambic evaluation, Lambda and CoreWeave were close substitutes, and the final decision mostly came down to per-GPU pricing once the cluster spec was fixed.
  • Switching between NVIDIA and AMD is harder because the stack underneath changes. NVIDIA training commonly assumes CUDA. AMD is building the alternative around ROCm, with its own compatibility layers, PyTorch support, and optimization guides for MI300-class GPUs. That means engineering teams would need to retest frameworks, kernels, libraries, and performance tuning when they move.
  • That split could create two cloud lanes. One lane would optimize around NVIDIA supply, CUDA tooling, and familiar training workflows. The other would optimize around AMD hardware economics. Within each lane, clouds could still look interchangeable. Across lanes, migration would feel more like a platform rewrite, which gives providers more room to hold price and keep customers longer.
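To make the "platform rewrite" concrete: at the source level, much of a CUDA-to-ROCm port is mechanical API renaming, which is what AMD's hipify tools automate. The sketch below is a hypothetical toy (the function `toy_hipify` and its tiny mapping table are illustrative, not AMD's actual tool) showing only the renaming step:

```python
import re

# Toy illustration (not AMD's real hipify-perl/hipify-clang) of the
# mechanical renaming involved in porting CUDA sources to HIP/ROCm.
# The real tools also handle headers, kernel-launch syntax, and whole
# library surfaces; this sketch maps only a few runtime-API names.
CUDA_TO_HIP = {
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpy": "hipMemcpy",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
}

# Longer names come first in the alternation, so cudaMemcpyHostToDevice
# is matched whole rather than as a cudaMemcpy prefix plus a leftover.
_PATTERN = re.compile("|".join(re.escape(k) for k in CUDA_TO_HIP))

def toy_hipify(source: str) -> str:
    """Rewrite known CUDA runtime calls to their HIP equivalents."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)

print(toy_hipify("cudaMalloc(&buf, n); cudaMemcpy(buf, host, n, cudaMemcpyHostToDevice);"))
# prints: hipMalloc(&buf, n); hipMemcpy(buf, host, n, hipMemcpyHostToDevice);
```

Even where this renaming succeeds cleanly, the surrounding kernels, libraries, and performance tuning still need revalidation on the new hardware, which is the migration cost the bullets above describe.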

If AMD keeps improving ROCm and wins real training workloads, GPU cloud competition will stop being only about who can source more NVIDIA boxes. The market will start to resemble two semi-separate ecosystems, with cloud providers bundling hardware, software tuning, and support into a more durable offering. That would make chip allegiance one of the most important strategic choices in AI infrastructure.