Interconnect Trumps GPUs for Training
Lambda customer at Iambic Therapeutics on GPU infrastructure choices for ML training and inference
Iambic's decision was a buying signal that distributed model training had stopped being just about getting GPUs and had become about getting the right network between them. For Iambic, the bottleneck was not the H100 itself; it was the fabric that lets many GPUs act like one larger machine during training. AWS offered EFA, but at that point the practical cluster design Iambic wanted was not available on the timeline it needed. Lambda and CoreWeave won by being willing to build closer to the HGX and InfiniBand setup that advanced customers were already learning they needed.
-
In large training jobs, GPUs constantly exchange gradients and model state. If that traffic is slow or jittery, expensive GPUs sit idle waiting on each other. That is why Iambic treated interconnect as a hard requirement rather than a nice-to-have feature.
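To make the idle-time argument concrete, here is a minimal back-of-envelope sketch in Python. The model size, per-GPU link speeds, and per-step compute time are all assumed for illustration, not figures from Iambic or any vendor; it simply shows how the share of each training step lost to gradient synchronization scales with fabric bandwidth, ignoring compute/communication overlap.

```python
# Back-of-envelope sketch with illustrative, assumed numbers (not benchmarks):
# estimate how long a ring all-reduce takes to synchronize gradients, and what
# fraction of each step GPUs would spend waiting on the network.

def allreduce_seconds(param_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Ring all-reduce moves roughly 2*(n-1)/n of the gradient bytes per GPU."""
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * param_bytes
    return bytes_on_wire / (link_gbps * 1e9 / 8)  # convert Gbit/s to bytes/s

# Assumed example: 7B-parameter model with fp16 gradients (~14 GB), 64 GPUs,
# and 1 second of pure compute per step. Overlap, compression, and gradient
# accumulation are ignored to keep the arithmetic simple.
grad_bytes = 7e9 * 2
compute_per_step = 1.0

for gbps in (50, 100, 400):  # rough per-GPU link speeds, Ethernet- to InfiniBand-class
    comm = allreduce_seconds(grad_bytes, n_gpus=64, link_gbps=gbps)
    idle = comm / (compute_per_step + comm)
    print(f"{gbps:>4} Gb/s per GPU: {comm:.2f} s comm/step, ~{idle:.0%} of step spent waiting")
```

Under these assumptions, moving from a 50 Gb/s to a 400 Gb/s per-GPU fabric cuts the waiting share of each step from roughly 80% to roughly a third, which is the kind of gap that makes interconnect a hard requirement.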
-
AWS was pushing EFA as its cloud-native answer for low-latency, high-bandwidth cluster networking, but AWS itself later shipped EFA-only interfaces in October 2024 to remove scaling limits and routing problems for AI and ML clusters. That lines up with the idea that the earlier product was not yet good enough for this use case.
-
This also helps explain why specialists pulled share from hyperscalers. CoreWeave focused on enterprise customers reserving thousands of GPUs, while Lambda built a reputation for flexible GPU training infrastructure. In this market, willingness to customize rack design and networking could matter as much as list price.
The next phase of the market pushes every cloud toward tighter GPU clusters and better east-west networking. As model training spreads beyond frontier labs into biotech, robotics, and industrial AI, the winners will be the providers that can deliver turnkey clusters where compute, storage, and interconnect arrive as one working system, not separate parts.