From Model APIs to Workflow Runtimes

Gimlet Labs

Company Report
The inference market is shifting from single-model API serving to compound agentic workflows

This shift moves the control point in inference from the model endpoint to the workflow scheduler. In practice, a production agent is a chain of jobs: retrieval, long-context prefill, token-by-token decoding, tool execution, and response assembly, each with very different latency and hardware needs. That is why the market is widening from single API calls into platforms that decide where each step runs, how it scales, and how it meets latency targets.
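To make the shift concrete, here is a minimal sketch of that scheduler-centric view: each stage of a hypothetical agent request carries its own latency target and preferred hardware class, and a placement function (all names here are illustrative, not any vendor's API) chooses a worker per stage rather than sending the whole request to one model endpoint.

```python
from dataclasses import dataclass

# Hypothetical model of a compound agent request: each stage has its own
# latency target and hardware preference, so placement is decided per step
# by a workflow scheduler, not by a single model endpoint.
@dataclass
class Stage:
    name: str
    latency_target_ms: int  # per-stage latency target
    hardware: str           # preferred accelerator class

AGENT_WORKFLOW = [
    Stage("retrieval", 50, "cpu"),
    Stage("prefill", 200, "gpu-large"),    # long-context prefill is compute-bound
    Stage("decode", 30, "gpu-latency"),    # token-by-token decoding is latency-bound
    Stage("tool_execution", 500, "cpu"),
    Stage("response_assembly", 20, "cpu"),
]

def place(stage: Stage, pools: dict) -> str:
    """Pick a worker from the pool matching the stage's hardware class."""
    candidates = pools.get(stage.hardware, [])
    if not candidates:
        raise RuntimeError(f"no capacity for {stage.hardware}")
    return candidates[0]  # a real scheduler would also weigh load and latency

pools = {"cpu": ["cpu-0"], "gpu-large": ["a100-0"], "gpu-latency": ["l40-0"]}
plan = {s.name: place(s, pools) for s in AGENT_WORKFLOW}
```

The point of the sketch is only that placement decisions happen stage by stage: prefill and decode can land on different accelerators inside one request, which is exactly the control point the report argues platforms are competing to own.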

  • Inference clouds are already moving up from simple serving into workflow backends. Baseten lets each step in a chain run on different hardware with its own autoscaling settings. Fireworks customers increasingly want unified APIs, concurrency guarantees, observability, and workflow-aware primitives like retries, fallbacks, and stateful agents.
  • Gimlet is pushing that idea further by breaking an agent into a compute graph and routing fragments across NVIDIA, AMD, Intel, Cerebras, d-Matrix, and others. Its wedge is that no single chip is best for every stage, so the winner may be the platform that can mix chips inside one workload, not the one with the biggest single model catalog.
  • The other path is vertical integration. Groq argues a tightly coupled chip-and-cloud stack can win many agent workloads through deterministic latency, while NVIDIA is productizing orchestration and optimization inside its own stack with Dynamo and TensorRT-LLM. That compresses standalone inference vendors toward either deeper workflow control or true cross-chip heterogeneity.
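The workflow-aware primitives mentioned above (retries, fallbacks) can be sketched in a few lines. This is an illustrative composition pattern under assumed semantics, not any platform's actual API: a step is retried against its primary backend, and only after retries are exhausted is the same step routed to a fallback.

```python
import time

def with_retries(fn, attempts=3, backoff_s=0.0):
    """Retry a step, re-raising the last error if all attempts fail."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as e:
            last_err = e
            time.sleep(backoff_s * (2 ** i))  # exponential backoff between tries
    raise last_err

def with_fallback(primary, fallback):
    """Run the primary step; on failure, route the same step to a fallback."""
    def run():
        try:
            return primary()
        except Exception:
            return fallback()
    return run

# Simulate a primary backend that always times out.
calls = []
def flaky_backend():
    calls.append("primary")
    raise TimeoutError("backend overloaded")

step = with_fallback(lambda: with_retries(flaky_backend, attempts=2),
                     lambda: "fallback-result")
result = step()  # two failed primary attempts, then the fallback answers
```

Composing retry inside fallback (rather than the reverse) is the design choice implied by "retries, fallbacks" as distinct primitives: exhaust the preferred backend first, then switch, which keeps latency-sensitive steps from bouncing between backends on transient errors.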

The next phase of inference will look less like renting a model and more like operating an AI application runtime. The strongest platforms will own scheduling, routing, observability, and policy across whole workflows, while silicon vendors keep climbing upward to bundle those controls into their own clouds.