Training Needs Owned GPU Clusters
A Lambda customer at Iambic Therapeutics on GPU infrastructure choices for ML training and inference
Owning the hardware matters most when the product is not just GPU time, but a promise that a specific cluster will behave predictably for weeks or months of training. In practice, training teams need identical GPUs, fast InfiniBand links, reserved capacity, and someone who can swap failed machines without breaking a run. That is why Lambda and CoreWeave win training contracts, while inference layers can sit on top of AWS and sell convenience instead of physical control.
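Concretely, that promise of predictability shows up as preflight checks before a job ever lands on a node. Here is a minimal per-node sketch, assuming a standard Linux host with NVIDIA drivers; the checks and sysfs paths are illustrative of the idea, not Lambda's actual tooling:

```python
import subprocess
from pathlib import Path

def gpu_models() -> list[str]:
    """List the model name of every GPU visible to nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def infiniband_port_active() -> bool:
    """Look in sysfs for at least one InfiniBand port in the ACTIVE state."""
    for state in Path("/sys/class/infiniband").glob("*/ports/*/state"):
        if "ACTIVE" in state.read_text():
            return True
    return False

if __name__ == "__main__":
    models = gpu_models()
    assert models, "no GPUs visible to nvidia-smi"
    # Identical GPUs across the node (and, run fleet-wide, across the cluster).
    assert len(set(models)) == 1, f"mixed GPU models on node: {set(models)}"
    assert infiniband_port_active(), "no ACTIVE InfiniBand port found"
    print(f"node OK: {len(models)}x {models[0]}, fabric up")
```

Run fleet-wide, a check like this is what lets an operator swap a failed machine for an identical one instead of quietly degrading a multi-week run.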
At Iambic, training on Lambda uses reserved A100, H100, and B200 clusters with HGX-style reference architectures and high-quality interconnects, while inference runs mostly on AWS, where on-demand reliability and surrounding services like S3 and EKS matter more than who owns the box.
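On the inference side, the common pattern makes the point: a container on EKS pulls weights from S3 at startup and serves them, and nothing in that flow depends on the physical machine. A minimal sketch of the fetch step, assuming boto3 with pod-level IAM credentials; the bucket and key names are hypothetical:

```python
import os
from pathlib import Path

import boto3  # AWS SDK for Python; on EKS, credentials usually come from the pod's IAM role

# Hypothetical bucket and key; real deployments would template these per model.
BUCKET = os.environ.get("WEIGHTS_BUCKET", "my-model-artifacts")
KEY = os.environ.get("WEIGHTS_KEY", "checkpoints/model-latest.safetensors")

def fetch_weights(dest_dir: str = "/models") -> Path:
    """Pull model weights from S3 to local disk at container startup."""
    local = Path(dest_dir) / Path(KEY).name
    local.parent.mkdir(parents=True, exist_ok=True)
    boto3.client("s3").download_file(BUCKET, KEY, str(local))
    return local

if __name__ == "__main__":
    print(f"weights ready at {fetch_weights()}")
```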
This maps cleanly onto the market structure. Lambda is expanding its data center capacity and moving toward owning more of its facilities outright, which lowers long-term unit costs and gives it tighter control over uptime, capacity planning, and cluster quality for training buyers.
By contrast, Baseten, Replicate, and Together are built around abstraction. Baseten routes workloads across more than 10 cloud providers, Replicate turns models into serverless APIs, and Together adds token pricing and developer tooling on top of rented capacity from Lambda and CoreWeave.
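The abstraction is visible in how little these APIs expose. A call through the public replicate Python client, for instance, takes a model reference and an input dict; no GPU type, region, or cluster appears anywhere. The model identifier below is just an example of the "owner/model" reference format:

```python
import replicate  # official client; expects REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "meta/meta-llama-3-8b-instruct",
    input={"prompt": "One sentence on why long training runs want dedicated clusters."},
)
# Text models stream output in chunks; join them into the full completion.
print("".join(output))
```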
The split between owned training clouds and asset-light inference layers should get sharper. Training will keep rewarding providers that control power, networking, and replacement cycles at the rack level. Inference will keep rewarding companies that hide infrastructure complexity, squeeze more utilization out of hyperscalers, and package that into simple APIs and predictable bills.
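The utilization point is easy to make concrete with back-of-envelope arithmetic: a reserved cluster bills by the hour whether or not tokens flow, so its effective per-token cost scales inversely with utilization. A sketch with entirely hypothetical prices and throughputs (none of these are quoted rates):

```python
# Back-of-envelope unit economics; every number is a hypothetical placeholder.
RESERVED_GPU_HOUR = 2.50         # $ per GPU-hour on a reserved contract
GPUS = 8                         # one inference node
TOKENS_PER_SEC_PER_GPU = 2_500   # sustained decode throughput, model-dependent
API_PRICE_PER_MTOK = 0.60        # $ per million tokens from a serverless API

tokens_per_hour = GPUS * TOKENS_PER_SEC_PER_GPU * 3600
owned_cost_per_mtok = (RESERVED_GPU_HOUR * GPUS) / (tokens_per_hour / 1e6)

for util in (0.2, 0.5, 0.9):
    # Idle reserved hours still bill, so low utilization inflates the unit cost.
    effective = owned_cost_per_mtok / util
    winner = "reserved cluster" if effective < API_PRICE_PER_MTOK else "serverless API"
    print(f"utilization {util:.0%}: ${effective:.3f}/Mtok -> {winner} wins")
```

Below the break-even utilization, renting tokens is cheaper; above it, reserved capacity amortizes below the per-token rate. That spread is exactly where the inference layers live, and why they compete on squeezing utilization while the training clouds compete on the contract itself.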