Reserved HGX Clusters for Training
A Lambda customer at Iambic Therapeutics on GPU infrastructure choices for ML training and inference
Reliable on-demand clusters remain rare because training needs a very specific kind of cloud product, not just spare GPUs. Teams like Iambic need identical machines, fast InfiniBand links, NVIDIA reference designs like HGX, and guaranteed access for months at a time so large jobs do not stall halfway through. That pushes serious training buyers toward reserved contracts, while on-demand cloud is still mostly good for single nodes, lighter experiments, or production inference.
-
Iambic separates training from inference because the bottlenecks are different. Training wants fixed clusters reserved 24/7 for 18 months or longer, while inference wants smaller GPUs that can spin up instantly inside mature AWS services like S3, EC2, and EKS, driven by Terraform-based workflows.
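The on-demand inference path above can be sketched in a few lines. This is a hypothetical illustration, not Iambic's actual setup: the function below just builds the keyword arguments for EC2's standard `run_instances` call (via boto3), and the AMI, subnet, and instance type are placeholder assumptions.

```python
# Hypothetical sketch: requesting one small GPU instance for inference
# through EC2's ordinary API, the "spin up instantly" path described above.
# AMI, subnet, and instance type here are placeholder assumptions.

def inference_instance_spec(ami_id: str, subnet_id: str,
                            instance_type: str = "g5.xlarge") -> dict:
    """Build kwargs for ec2.run_instances: a single GPU box for serving."""
    return {
        "ImageId": ami_id,              # AMI with the model server baked in
        "InstanceType": instance_type,  # far smaller than a training HGX node
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "workload", "Value": "inference"}],
        }],
    }

# With AWS credentials configured, the actual request would be:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.run_instances(**inference_instance_spec("ami-...", "subnet-..."))
```

The point of the sketch is the contrast: this single API call (or its Terraform equivalent) is a mature, instant product, while a multi-node InfiniBand HGX cluster is not.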
-
The practical gap is interconnect quality and scheduling, not raw chip access alone. In late 2023, Iambic found AWS and Oracle could not deliver the InfiniBand quality it needed on the right timeline, while Lambda and CoreWeave would customize cluster specs and still came in cheaper per GPU-hour.
-
This creates a split market. NeoClouds like Lambda win reserved training spend with lower prices and more willingness to build bespoke clusters, while hyperscalers and production-focused platforms win inference and customer-facing workloads, because autoscaling, uptime, APIs, security, and operational tooling matter more there than the cheapest H100-hour.
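To make the pricing side of that split concrete, here is a back-of-envelope comparison. The per-GPU-hour rates are illustrative assumptions for the arithmetic, not quoted prices from Lambda, CoreWeave, or any hyperscaler.

```python
# Illustrative only: these rates are assumptions, not real provider pricing.
RESERVED_RATE = 2.00   # $/GPU-hour, hypothetical reserved NeoCloud contract
ON_DEMAND_RATE = 4.00  # $/GPU-hour, hypothetical hyperscaler on-demand rate

def cluster_cost(rate: float, gpus: int, hours: float) -> float:
    """Total cost of running `gpus` GPUs for `hours` at `rate` $/GPU-hour."""
    return rate * gpus * hours

# A 128-GPU cluster reserved 24/7 for 18 months (~12,960 hours at 30-day months):
hours_18mo = 18 * 30 * 24
print(f"reserved:  ${cluster_cost(RESERVED_RATE, 128, hours_18mo):,.0f}")
print(f"on-demand: ${cluster_cost(ON_DEMAND_RATE, 128, hours_18mo):,.0f}")
```

At multi-month, always-on scale, even a modest per-GPU-hour gap compounds into millions of dollars, which is why reserved contracts dominate serious training spend while per-hour price matters far less for bursty inference.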
The next step is turning reserved training clusters into a cleaner self-serve product. If providers like Lambda can make multi-node HGX clusters feel as easy to request as a normal cloud instance, more training workloads move to on-demand. Until then, reserved capacity and long-lead-time procurement remain the default for teams doing serious model development.