GPU Interconnect Determines Training Economics

Interview with a Lambda customer at Iambic Therapeutics, on GPU infrastructure choices for ML training and inference:

"…the quality of the GPU interconnect between the GPUs—your InfiniBand type of thing—was of critical importance to the success of training our models."

This was really a statement about training economics being set by the network, not just the GPU. When a model is trained across many GPUs, they must exchange gradients and model state at every step, so a weak interconnect turns expensive H100s into idle silicon waiting on data. That is why Iambic cared about HGX-style clusters with strong InfiniBand, and why a cheaper NeoCloud with the right fabric beat a pricier hyperscaler with the wrong one.
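
The "idle silicon" point can be made concrete with back-of-envelope ring all-reduce math. The model size, cluster size, and link speeds below are illustrative assumptions, not Iambic's actual numbers:

```python
# Back-of-envelope: how link bandwidth sets the gradient-sync cost in
# synchronous data-parallel training. All figures are illustrative
# assumptions, not Iambic's actual configuration.

def ring_allreduce_seconds(param_count, num_gpus, link_gbps, bytes_per_param=2):
    """Time for one ring all-reduce of fp16 gradients, ignoring latency."""
    payload_bytes = param_count * bytes_per_param
    # A ring all-reduce moves ~2*(N-1)/N of the payload over each link.
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (link_gbps * 1e9 / 8)

params = 7e9   # assumed 7B-parameter model
gpus = 64      # assumed cluster size

for label, gbps in [("400 Gb/s InfiniBand", 400), ("25 Gb/s Ethernet", 25)]:
    t = ring_allreduce_seconds(params, gpus, gbps)
    print(f"{label}: {t:.2f} s per gradient sync")
```

On these assumptions the sync drops from several seconds per step on commodity Ethernet to roughly half a second on InfiniBand, which is the difference between GPUs computing and GPUs waiting.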

  • For this workload, compute was bought as a reserved multi-GPU cluster for 12 to 18 months, not as ad hoc hourly instances. Iambic wanted identical GPUs, NVIDIA reference architecture, and high-quality InfiniBand so the whole cluster could run one synchronous training job without network bottlenecks.
  • The split with inference is the clearest proof point. Iambic kept training on Lambda at roughly $500K to $1M per month, but ran most inference on AWS at roughly $50K to $100K per month. Training rewarded raw cluster performance; inference rewarded mature tooling, uptime, and on-demand availability.
  • This also explains the market structure among GPU clouds. CoreWeave and Lambda won early by being willing to customize dense GPU clusters for growth-stage AI teams, while hyperscalers optimized for broader cloud consistency. Another customer, Heyday, made the opposite tradeoff for production inference and chose CoreWeave over cheaper Lambda because Kubernetes, autoscaling, and public-facing deployment mattered more there.
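
The economics behind the training-side preference can be sketched directly: on a reserved cluster, every hour a GPU stalls on the network is still billed. The GPU count, hourly rate, and stall fractions below are illustrative assumptions, not quoted vendor pricing:

```python
# Rough monthly cost of network-induced idle time on a reserved cluster.
# Cluster size, rate, and stall fractions are illustrative assumptions.

def wasted_spend_per_month(gpus, gpu_hourly_rate, comm_stall_fraction):
    """Dollars per month paid for GPUs sitting idle waiting on the network."""
    hours_per_month = 24 * 30
    return gpus * gpu_hourly_rate * hours_per_month * comm_stall_fraction

# Assumed 64 H100s at a $2.50/GPU-hr reserved rate (~$115K/mo total).
for label, frac in [("strong fabric, ~5% comm stall", 0.05),
                    ("weak fabric, ~40% comm stall", 0.40)]:
    print(f"{label}: ${wasted_spend_per_month(64, 2.50, frac):,.0f}/mo idle")
```

Under these assumptions the fabric alone swings tens of thousands of dollars of the monthly bill from useful compute to waiting, which is why a buyer paying Iambic-scale training bills shops on interconnect first.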

Going forward, interconnect quality becomes a bigger buying criterion as models spread across more GPUs and customers push for larger reserved clusters. The likely result is a sharper split, with NeoClouds and AI-first infrastructure providers winning training by packaging the best-networked clusters, and hyperscalers staying strongest where software maturity and reliability matter most.