Baseten Chains per-step scaling
Gimlet Labs
This is Baseten moving from model hosting into workflow infrastructure. In practice, a team can split one AI product into small services, like retrieval on CPU, reranking on a smaller GPU, and generation on a larger GPU, then tune scaling for each service separately. That matters because compound systems fail when one stage becomes the bottleneck or forces the whole stack onto the most expensive hardware.
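A minimal sketch of that split, assuming Baseten's `truss_chains` Python SDK roughly as its docs present it; the chainlet names, GPU types, and resource sizes here are illustrative, not taken from the source:

```python
import truss_chains as chains


class Retrieve(chains.ChainletBase):
    # Retrieval stays on cheap CPU-only replicas.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=2, memory="4Gi")
    )

    def run_remote(self, query: str) -> list[str]:
        return [f"candidate doc for {query}"]  # placeholder retrieval


class Rerank(chains.ChainletBase):
    # Reranking gets a smaller GPU.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=4, memory="16Gi", gpu="T4")
    )

    def run_remote(self, query: str, docs: list[str]) -> list[str]:
        return docs  # placeholder scoring


class Generate(chains.ChainletBase):
    # Only the generation step pays for a large GPU.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=8, memory="32Gi", gpu="A100")
    )

    def run_remote(self, query: str, context: list[str]) -> str:
        return f"answer grounded in {context[0]}"  # placeholder generation
```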
-
Baseten’s docs describe Chains as multi-step inference across independent services, where each step sets its own hardware, dependencies, and scaling rules. That makes Chains more like a production graph orchestrator than a single model endpoint, and it overlaps directly with Gimlet’s multi-stage graph design.
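The graph wiring itself is declared in code. Another hedged sketch, assuming the `chains.depends` and `chains.mark_entrypoint` constructs from Baseten's docs; the step logic is placeholder:

```python
import truss_chains as chains


class Preprocess(chains.ChainletBase):
    # Cheap routing/cleanup step; deploys as its own CPU-only service.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=1, memory="2Gi")
    )

    def run_remote(self, text: str) -> str:
        return text.strip().lower()


@chains.mark_entrypoint
class Pipeline(chains.ChainletBase):
    # Entrypoint step; gets its own GPU and its own scaling rules.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=4, memory="16Gi", gpu="A10G")
    )

    # chains.depends() declares the graph edge: Baseten deploys Preprocess
    # separately and injects a remote handle to it here.
    def __init__(self, preprocess=chains.depends(Preprocess)):
        self._preprocess = preprocess

    def run_remote(self, text: str) -> str:
        cleaned = self._preprocess.run_remote(text)
        return f"generated from: {cleaned}"  # placeholder for the model call
```

Each `depends` edge becomes a cross-service call, so pushing this file with the `truss chains push` CLI publishes each chainlet as its own independently scaled service rather than one container.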
-
The concrete win is cost and latency control. Preprocessing or routing can stay on CPU while only the heavy model step sits on a GPU. Baseten also scales each step to its own replica count, driven by that step’s traffic and concurrency settings, so one hot stage can expand without overprovisioning the rest.
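The concurrency knob surfaces in the same per-step config. A hedged sketch, assuming `Compute`'s `predict_concurrency` parameter from the `truss_chains` docs; replica minimums and maximums are set per chainlet in Baseten's autoscaling settings rather than in this snippet:

```python
import truss_chains as chains


@chains.mark_entrypoint
class HotStage(chains.ChainletBase):
    # A traffic-heavy step: raising predict_concurrency lets each replica
    # serve many requests at once, so autoscaling adds replicas for this
    # step alone when its load grows, without touching the other steps.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(
            cpu_count=4,
            memory="16Gi",
            gpu="A10G",
            predict_concurrency=64,  # concurrent predictions per replica
        )
    )

    def run_remote(self, prompt: str) -> str:
        return prompt  # placeholder inference
```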
-
The key difference versus Gimlet is the hardware assumption underneath the graph. Baseten is built around picking among cloud CPU and GPU resources for each chain step, while Gimlet centers on scheduling across heterogeneous AI hardware with compiler-driven optimization. Fireworks and Modal extend into adjacent territory, but from serving and serverless-code roots rather than heterogeneous scheduling.
-
This category is heading toward full-stack inference operating systems, not just faster endpoints. The winners will be the platforms that can treat an agent or RAG app as a live graph of many steps, place each step on the right compute, and keep latency low as traffic spikes. Baseten is pushing hard in that direction, which makes the overlap with Gimlet increasingly strategic.