Per-Workflow Inference Scheduling
How Fireworks AI customer Hebbia serves state-of-the-art models behind unified APIs
This case points to the next battleground in inference: moving from serving models fast on average to giving each workflow its own performance lane. Hebbia was running three very different jobs through one API: live analyst chat that needs sub-second answers, long agent chains where one slow step stalls the whole task, and huge batch document runs where speed matters less than cost and throughput. A shared scheduler works early on, but at scale it forces expensive tradeoffs between responsiveness and GPU utilization.
-
Hebbia used Fireworks because it normalized open models behind OpenAI-style endpoints, handled autoscaling, and exposed token and latency telemetry. That let Hebbia swap in DeepSeek or Llama quickly without rebuilding routing, parsing, or rate limiting for each model.
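A minimal sketch of what that swap-friendly setup can look like: the app routes each workflow through a single table, so changing the model behind an OpenAI-compatible endpoint is a table edit, not a code change. The base URL and model identifiers below are illustrative assumptions, not Hebbia's actual configuration.

```python
# Hypothetical routing table for OpenAI-compatible endpoints.
# Call sites ask route() for an endpoint + model and never hard-code either.
MODEL_ROUTES = {
    "chat": {
        "base_url": "https://api.fireworks.ai/inference/v1",  # assumed endpoint
        "model": "accounts/fireworks/models/llama-v3p1-70b-instruct",
    },
    "batch": {
        "base_url": "https://api.fireworks.ai/inference/v1",
        "model": "accounts/fireworks/models/deepseek-v3",
    },
}

def route(workflow: str) -> dict:
    """Return the endpoint and model for a workflow; callers stay unchanged."""
    return MODEL_ROUTES[workflow]

# Swapping the batch path from DeepSeek to a Llama variant is one line:
MODEL_ROUTES["batch"]["model"] = "accounts/fireworks/models/llama-v3p1-8b-instruct"
```

In practice the returned `base_url` and `model` would be passed to an OpenAI-compatible client; keeping them in one table is what makes the provider interchangeable.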
-
The scheduling gap matters because Hebbia mixed interactive and offline work. A chat request from an analyst competes very differently for GPU time than a batch job over hundreds of thousands of documents. Without workload-level controls, the provider can optimize the fleet overall, but the app cannot explicitly protect the urgent path.
-
This is where managed inference starts to resemble cloud infrastructure. Fireworks' documentation says shared serverless deployments do not provide latency SLAs, while dedicated and on-demand deployments are the path to more predictable performance. AWS Bedrock already exposes a latency setting that routes requests to standard or latency-optimized inference, exactly the kind of knob workload-aware platforms expand over time.
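A sketch of what that knob looks like as a per-call setting, modeled on Bedrock's Converse-style request shape where a `performanceConfig.latency` field selects standard versus latency-optimized inference. Treat the exact field names and model ID here as assumptions for illustration; the point is that the latency tier is declared per request, not per fleet.

```python
def converse_request(model_id: str, prompt: str, realtime: bool) -> dict:
    """Build a Converse-style payload, choosing the latency tier per call.

    The performanceConfig field is modeled on Bedrock's latency-optimized
    inference setting; field names are assumptions for this sketch.
    """
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        # "optimized" asks for latency-optimized capacity; "standard" is the default tier.
        "performanceConfig": {"latency": "optimized" if realtime else "standard"},
    }

# An interactive chat turn declares itself realtime; a batch job does not.
chat_req = converse_request("example-model-id", "Summarize the filing", realtime=True)
batch_req = converse_request("example-model-id", "Summarize the filing", realtime=False)
# bedrock_runtime.converse(**chat_req)  # actual call omitted: needs AWS credentials
```

The useful property is that the urgent path is protected by a declaration in the request itself, rather than by the provider guessing from traffic patterns.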
The direction is toward inference platforms that look less like a single endpoint and more like a traffic control layer for AI applications. The winners will let teams declare which calls are realtime, which are bulk, and which can wait, then automatically place those workloads on the right capacity and pricing tier without forcing customers to run raw GPU clusters themselves.
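The traffic-control idea above can be sketched as a tiny lane-aware dispatcher: callers declare a lane (realtime, bulk, or deferred) and the dispatcher always drains the most urgent lane first, FIFO within a lane. Lane names, priorities, and the `Dispatcher` class are illustrative assumptions, not any provider's API.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical lane priorities: lower number = drained first.
LANE_PRIORITY = {"realtime": 0, "bulk": 1, "deferred": 2}

@dataclass(order=True)
class Call:
    priority: int
    seq: int                          # tie-breaker: FIFO within a lane
    payload: str = field(compare=False)

class Dispatcher:
    def __init__(self) -> None:
        self._heap: list[Call] = []
        self._seq = count()

    def submit(self, lane: str, payload: str) -> None:
        """Enqueue a call under its declared lane."""
        heapq.heappush(self._heap, Call(LANE_PRIORITY[lane], next(self._seq), payload))

    def next_call(self) -> str:
        """Pop the most urgent pending call."""
        return heapq.heappop(self._heap).payload

d = Dispatcher()
d.submit("bulk", "index 500k documents")
d.submit("realtime", "analyst chat turn")
d.submit("deferred", "nightly re-embedding")
order = [d.next_call() for _ in range(3)]
# order == ["analyst chat turn", "index 500k documents", "nightly re-embedding"]
```

A real platform would map these lanes onto capacity and pricing tiers (dedicated GPUs for realtime, spot or queued capacity for deferred work), but the declaration-then-placement shape is the same.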