From Model APIs to Workflow Runtimes
Gimlet Labs
The shift from model APIs to workflow runtimes moves the control point in inference from the model endpoint to the workflow scheduler. In practice, a production agent is a chain of jobs: retrieval, long-context prefill, token-by-token decoding, tool execution, and response assembly, each with very different latency and hardware needs. That is why the market is widening from single API calls into platforms that decide where each step runs, how it scales, and how it meets latency targets.
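To make that framing concrete, here is a minimal Python sketch of an agent expressed as steps with per-step latency targets and hardware preferences, the kind of metadata a workflow scheduler, rather than a single model endpoint, would use to place and scale each stage. Every name, number, and hardware class below is an illustrative assumption, not any vendor's API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    latency_target_ms: int   # per-step SLO the scheduler must meet
    hardware: str            # preferred accelerator class for this stage

# Hypothetical agent workflow; values are invented for illustration.
AGENT_WORKFLOW = [
    Step("retrieval",         latency_target_ms=50,   hardware="cpu"),
    Step("prefill",           latency_target_ms=800,  hardware="high-bandwidth-gpu"),
    Step("decode",            latency_target_ms=30,   hardware="low-latency-accelerator"),
    Step("tool_execution",    latency_target_ms=2000, hardware="cpu"),
    Step("response_assembly", latency_target_ms=20,   hardware="cpu"),
]

def schedule(workflow: list[Step]) -> None:
    # Toy placement loop: a real scheduler would also weigh queue depth,
    # cost, and autoscaling state before binding a step to a pool.
    for step in workflow:
        print(f"{step.name}: place on {step.hardware}, SLO {step.latency_target_ms} ms")

if __name__ == "__main__":
    schedule(AGENT_WORKFLOW)
```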
Inference clouds are already moving up from simple serving into workflow backends. Baseten lets each step in a chain run on different hardware with its own autoscaling settings, and Fireworks customers increasingly ask for unified APIs, concurrency guarantees, observability, and workflow-aware primitives such as retries, fallbacks, and stateful agents.
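As an illustration of what step-level retry and fallback primitives look like, here is a hedged Python sketch. `call_endpoint`, the endpoint names, and the simulated failure are hypothetical placeholders, not Baseten's or Fireworks' actual interfaces.

```python
import time

FLAKY = {"agent-primary"}  # pretend this deployment is timing out

def call_endpoint(endpoint: str, payload: dict) -> dict:
    # Stand-in for an HTTP call to a model or tool endpoint.
    if endpoint in FLAKY:
        raise TimeoutError(f"{endpoint} timed out")
    return {"endpoint": endpoint, "output": f"handled {payload['task']}"}

def run_step(payload: dict, primary: str, fallback: str,
             retries: int = 2, backoff_s: float = 0.1) -> dict:
    """Try the primary endpoint with retries, then fall back."""
    for attempt in range(retries):
        try:
            return call_endpoint(primary, payload)
        except (TimeoutError, ConnectionError):
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Fallback path: a cheaper or more available deployment of the step.
    return call_endpoint(fallback, payload)

print(run_step({"task": "summarize"}, "agent-primary", "agent-fallback"))
```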
Gimlet pushes that idea further, breaking an agent into a compute graph and routing its fragments across silicon from NVIDIA, AMD, Intel, Cerebras, d-Matrix, and others. Its wedge is that no single chip is best at every stage, so the winner may be the platform that can mix chips inside one workload, not the one with the biggest model catalog.
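The cross-chip argument can be made concrete with a toy router: given per-stage performance profiles on different accelerators, it picks the best chip for each node of the graph independently. The numbers below are invented for illustration and say nothing about the named vendors' real performance; a production router would also weigh transfer cost between stages, availability, and price.

```python
# Hypothetical relative latency per stage per chip (lower is better).
PROFILES = {
    "embed":   {"nvidia-gpu": 1.0, "intel-accelerator": 0.7},
    "prefill": {"nvidia-gpu": 1.0, "amd-gpu": 1.1, "cerebras": 0.6},
    "decode":  {"nvidia-gpu": 1.0, "d-matrix": 0.5, "cerebras": 0.8},
}

def route(graph: list[str]) -> dict[str, str]:
    # Independently pick the lowest-latency chip for each stage.
    return {stage: min(PROFILES[stage], key=PROFILES[stage].get)
            for stage in graph}

print(route(["embed", "prefill", "decode"]))
# With the invented numbers above:
# {'embed': 'intel-accelerator', 'prefill': 'cerebras', 'decode': 'd-matrix'}
```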
The other path is vertical integration. Groq argues that a tightly coupled chip-and-cloud stack can win many agent workloads through deterministic latency, while NVIDIA is productizing orchestration and optimization inside its own stack with Dynamo and TensorRT-LLM. That squeezes standalone inference vendors toward either deeper workflow control or true cross-chip heterogeneity.
The next phase of inference will look less like renting a model and more like operating an AI application runtime. The strongest platforms will own scheduling, routing, observability, and policy across whole workflows, while silicon vendors climb up the stack to bundle those same controls into their own clouds.