Inference Prefers Aggregators Over Owners
A Lambda customer at Iambic Therapeutics on GPU infrastructure choices for ML training and inference
Inference shifts value away from whoever owns the rack and toward whoever makes shared GPU capacity feel fast, reliable, and easy to consume. For training, Iambic wants fixed clusters, top-tier interconnect, and direct operator accountability when hardware fails. For inference, the job is different: matching spiky requests to available capacity, picking the cheapest GPU that fits the model, and wrapping it all in APIs, autoscaling, storage, and uptime tooling.
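The "cheapest GPU that fits" step can be sketched as a simple selection rule. The GPU memory sizes below match the instance types named later in this note, but the hourly prices are assumed placeholder values, not quoted rates:

```python
# Illustrative sketch of "pick the cheapest GPU that fits the model".
# VRAM sizes are real for these GPU types; hourly prices are assumed
# placeholder values for illustration only.
from typing import Optional

GPU_TYPES = [
    # (name, vram_gb, usd_per_hour)
    ("A10G", 24, 1.00),
    ("L40S", 48, 1.80),
    ("H100", 80, 4.50),
]

def cheapest_fit(model_vram_gb: float) -> Optional[str]:
    """Return the cheapest GPU type with enough VRAM, or None if nothing fits."""
    candidates = [g for g in GPU_TYPES if g[1] >= model_vram_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g[2])[0]
```

Under these assumed prices, a 20 GB model lands on an A10G and a 40 GB model on an L40S; a capacity layer runs this kind of decision continuously as prices and availability shift.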
-
Iambic runs training on reserved Lambda clusters with InfiniBand and spends roughly $500K to $1M per month there, but runs most inference on AWS at roughly $50K to $100K per month, because on-demand A10G and L40S instances fit smaller, variable serving workloads better than bespoke clusters.
-
That creates room for an inference layer like Together AI. It buys or reserves GPU supply from clouds such as CoreWeave and Lambda, then sells developers token-based APIs, serverless endpoints, and dedicated reasoning clusters. The customer is paying for routing, batching, latency, and easier deployment, not for direct ownership of the silicon.
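The batching part of that value proposition can be sketched in a few lines. This is a minimal toy, not Together AI's actual scheduler: real serving systems batch asynchronously under latency deadlines, but the core idea of coalescing spiky per-request traffic into fuller GPU calls looks like this:

```python
# Minimal sketch of micro-batching behind a shared inference endpoint.
# Hypothetical illustration: real schedulers also enforce latency
# deadlines and token-length-aware packing.
from collections import deque

class MicroBatcher:
    def __init__(self, max_batch: int):
        self.max_batch = max_batch   # largest batch one GPU call accepts
        self.queue = deque()         # requests arriving at unpredictable times

    def submit(self, request) -> None:
        """Enqueue one incoming request."""
        self.queue.append(request)

    def next_batch(self) -> list:
        """Drain up to max_batch queued requests into a single GPU call."""
        batch = []
        while self.queue and len(batch) < self.max_batch:
            batch.append(self.queue.popleft())
        return batch
```

Batching is what lets an aggregator keep utilization high on capacity it rents from others: many small customers' spiky requests share one GPU call instead of each paying for an idle instance.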
-
The economic logic is also different. Training buyers care about per-GPU-hour price, interconnect quality, and long-term reservations. Inference buyers often care more about whether requests start instantly, stay up, and can be managed through standard cloud software. That is why AWS can win inference despite higher raw infrastructure costs.
As inference matures, more of the market should consolidate around software layers that aggregate commodity GPU supply and sell performance as a service. GPU owners will still matter, but the strongest margins are likely to accrue to companies that turn scattered cloud capacity into predictable token throughput for developers and enterprises.