Cloud Providers Entering Serverless Inference
A RunPod customer at Segmind, on GPU serverless platforms for AI model deployment
This points to serverless inference becoming a default cloud feature rather than a niche startup category. The shift is from renting raw GPUs to buying a ready-made API for a model, with the provider handling cold starts, scaling, and routing. RunPod has already moved in that direction with Public Endpoints, specialist platforms like Baseten, Together AI, and Replicate package open models as instant APIs, and the hyperscalers now offer similar serverless model access inside their own clouds.
-
The product change is concrete. Instead of uploading containers, picking GPU sizes, and tuning autoscaling, developers increasingly just point their code at an OpenAI-compatible endpoint. RunPod Public Endpoints, Baseten Model APIs, Together AI's serverless inference APIs, and Replicate model endpoints all package model serving this way.
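A minimal sketch of that workflow, using the stock openai Python client; the base URL, API key variable, and model name below are placeholders, since each provider documents its own values.

```python
# Sketch: calling a provider's OpenAI-compatible serverless endpoint with the
# standard openai client. Only the base_url, credentials, and model name
# change from provider to provider.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-gpu-cloud.com/v1",  # placeholder provider URL
    api_key=os.environ["PROVIDER_API_KEY"],           # placeholder credential
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever open model the provider lists
    messages=[{"role": "user", "content": "Summarize serverless inference in one sentence."}],
)
print(response.choices[0].message.content)
```

Switching providers is then mostly a matter of changing the base URL and model name, which is what makes these endpoints feel interchangeable.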
-
Specialists proved the demand first. Segmind sells visual model inference by the GPU-second, Baseten has turned deployment into an API workflow with Truss, and Together AI now gets roughly 30 to 40% of its revenue from API usage. That shows real willingness to pay for convenience on top of raw compute.
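For a rough sense of that API workflow, a Truss package is essentially a small Python class plus a config file. The sketch below assumes the usual load/predict structure; the model choice and the "prompt" input key are purely illustrative, not Baseten specifics.

```python
# model/model.py -- sketch of the class shape a Truss package exposes:
# load() runs once per worker (the cold-start cost), predict() per request.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._generator = None

    def load(self):
        # One-time weight loading when the serverless worker spins up.
        self._generator = pipeline("text-generation", model="gpt2")

    def predict(self, model_input: dict) -> dict:
        # Per-request inference; the "prompt" key is illustrative.
        prompt = model_input["prompt"]
        output = self._generator(prompt, max_new_tokens=64)
        return {"completion": output[0]["generated_text"]}
```

The platform wraps a class like this in an HTTP API and bills per GPU-second, which is the convenience customers are paying for.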
-
The big-cloud response is already visible. AWS Bedrock offers managed inference modes, Azure AI Foundry supports serverless API deployments and pay-as-you-go model access, and Microsoft is adding partner-served open-model inference through Foundry. That makes direct entry by most cloud providers an execution path that is already underway.
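The developer experience there looks much like the specialists': a managed runtime client plus a model ID, with no GPU selection at all. A sketch against Bedrock's Converse API, assuming boto3 credentials are already configured and with an illustrative region and model ID:

```python
# Sketch: serverless model access inside a hyperscaler cloud via the Bedrock
# runtime Converse API. Region and model ID are illustrative; Bedrock lists
# the exact on-demand model IDs it currently serves.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example on-demand model
    messages=[
        {"role": "user", "content": [{"text": "One-sentence summary of serverless inference."}]}
    ],
)
print(response["output"]["message"]["content"][0]["text"])
```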
Over the next few years, the market is likely to split in two. Raw GPU clouds will keep serving teams that want deep control, while most broader cloud providers move up into prepackaged inference APIs. The winners will be the platforms that make model access feel instant, cheap, and interchangeable, while still keeping latency low enough for production apps.