RunPod customer at Segmind on GPU serverless platforms for AI model deployment

Background
We spoke with a RunPod customer at Segmind who uses RunPod's serverless GPU infrastructure for running various generative AI models, primarily for inference and occasionally for fine-tuning.
Our conversation explores how RunPod's flexible GPU options, serverless architecture, and community support have helped them manage costs while scaling their API-based platform for developers.
Key points via Sacra AI:
- As AI teams balance inference and training workloads, they're choosing serverless approaches for training to avoid paying for 24/7 fixed GPU allocation, while GPU availability and pricing drive provider selection. "We chose serverless for training because training is a larger job. If you want a fixed GPU, it must run 24/7. But with serverless, when the training starts, it spins up the GPU, and once it's done, it shuts down the worker. Primarily, the decision about which cloud provider to use depends on GPU availability and prices."
- As API-based platforms face variable loads, serverless GPU infrastructure automatically scales resources up and down based on demand, eliminating the cost inefficiency of 24/7 on-demand instances. "We can't run these models on on-demand because the load keeps changing. With on-demand GPUs, scalability is an issue. With serverless, cost automatically scales with the load... Once we get a request, the GPU starts automatically, serves the request, and shuts down once done. This is a perfect fit for our use case."
- While pricing matters, the availability of diverse GPU options (from 16GB to 180GB VRAM) combined with ready-made community templates for experimentation creates meaningful differentiation between GPU cloud providers. "The primary reason we went with RunPod is the availability of different kinds of GPUs... RunPod has a large pool of different kinds of GPUs... These ranges are enough to see where our model can fit easily. Let's say I deploy an open-source model which requires 30GB VRAM—I can deploy it on a 32GB VRAM GPU."
Questions
- Which GPU providers are you primarily working with today?
- At a high level, what kinds of workloads are you typically running on those GPU platforms? And if you're able to share, what does your monthly GPU spend look like?
- When you say primarily inference, what kinds of models are you running in production?
- When you need to do training, what does that look like? Are you fine-tuning existing models or training from scratch? And how does that shape your infrastructure needs compared to inference?
- That's helpful. To make sure I fully understand, are you saying that inference runs persistently on something like an on-demand instance, while episodic fine-tuning jobs run in a serverless environment to manage cost?
- What is it about RunPod's serverless setup that made it a good fit for your needs compared to setting up on-demand or dedicated instances?
- Have you tried or evaluated other serverless GPU providers?
- Which other serverless GPU platforms have you evaluated?
- Is there anything specific that stands out about RunPod's dashboard or UI that makes it especially usable for you and your team?
- You mentioned earlier that using RunPod serverless makes a big difference in managing costs. Can you talk more about how you compare pricing between providers? Is cost per request or per GPU second something you're closely monitoring? And was pricing a major reason you went with RunPod?
- Let's go deeper on that availability point. When you say availability of different kinds of GPUs, can you elaborate on what types matter most for your workloads?
- In your experience, is this cloud GPU space commoditized where most people just shop on price across similar machines? Or do you think there's still room for meaningful differentiation between providers?
- Do those community and pre-built templates actually save your team time when spinning up new endpoints or experiments? Any particular example where it streamlined something that would have been more complex on another platform?
- Could you share how often you actually end up using those training-focused pod templates and what types of tasks or experiments you're doing inside those pods?
- What do you think makes RunPod successful at keeping users? What's their strongest source of lock-in, in your view, not just for you but for developers more broadly?
- Have you found that their community help channels actually help you troubleshoot live production issues, or is it more about learning and experimentation use cases?
- RunPod is sometimes viewed as being more focused on smaller teams, researchers, and long-tail developers, while a company like Lambda might cater more to upmarket enterprise users. Do you think that's a fair distinction?
- Did security or data sensitivity influence your choice at all? For example, were HIPAA compliance or other industry concerns a factor for you?
- Have you ever encountered capacity issues with RunPod, for example, limited availability of certain high-end GPUs during work hours?
- In theory, GPU hours are fungible, but in practice, do you think it's easy or hard to switch from a platform like RunPod to something else? What would make that difficult?
- Has that ever affected you directly, like a time when you considered moving but decided not to because reformatting everything would have been too much work?
- Looking ahead 2-3 years, how do you see this space evolving? What changes or trends do you expect to shape the next generation of cloud GPU platforms?
- What's one misconception you think people have about this space, either cloud GPU platforms in general or RunPod specifically?
Interview
Which GPU providers are you primarily working with today?
I'm working with RunPod, Lambda Labs, and a few local cloud providers.
At a high level, what kinds of workloads are you typically running on those GPU platforms? And if you're able to share, what does your monthly GPU spend look like?
We use GPUs primarily for model inference and secondarily for training. Our GPU spend is around $10,000 to $15,000 per month across all cloud providers.
When you say primarily inference, what kinds of models are you running in production?
All sorts of generative AI models. It can be text-to-image, text-to-video, image-to-video, or speech models.
When you need to do training, what does that look like? Are you fine-tuning existing models or training from scratch? And how does that shape your infrastructure needs compared to inference?
We primarily train LoRA models for image generation. We do image LoRA training for SDXL. We use fine-tuning only for images, and we do it on serverless because the cost of doing it on an on-demand server is huge. When users request fine-tuning, a serverless worker automatically starts, and when the job is done, it automatically shuts off. With on-demand GPUs, they keep running 24/7 whether there's a user request or not.
That's helpful. To make sure I fully understand, are you saying that inference runs persistently on something like an on-demand instance, while episodic fine-tuning jobs run in a serverless environment to manage cost?
No, there's no split like that. For both inference and fine-tuning, we use serverless. We don't use on-demand GPUs. Whatever we do, either inference or fine-tuning, we do completely on RunPod serverless.
What is it about RunPod's serverless setup that made it a good fit for your needs compared to setting up on-demand or dedicated instances?
We are an API-based platform. Most of the models are for inference, and they're specifically focused on developers. We can't run these models on on-demand because the load keeps changing. With on-demand GPUs, scalability is an issue. With serverless, cost automatically scales with the load. It's very easy, and in terms of cost, it's very cheap. Once we get a request, the GPU starts automatically, serves the request, and shuts down once done. This is a perfect fit for our use case.
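For context, this request-driven flow matches RunPod's serverless worker pattern: a handler function is packaged in a container image, and the platform starts a worker when a request arrives and scales back to zero when the queue is empty. A minimal sketch of such a worker is below; the SDXL model, input fields, and base64 output are illustrative assumptions, not Segmind's actual code.

```python
# Minimal sketch of a RunPod serverless worker (assumptions: the runpod Python SDK,
# an SDXL pipeline via diffusers, and base64-encoded PNG output).
import base64
import io

import runpod
import torch
from diffusers import StableDiffusionXLPipeline

# Loaded once per worker, so weights are reused across requests while the worker
# stays warm; the worker itself is started and stopped by RunPod based on demand.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")


def handler(job):
    """Called by RunPod for each queued request; the platform spins workers
    up and down around this function based on the request queue."""
    prompt = job["input"]["prompt"]
    image = pipe(prompt, num_inference_steps=30).images[0]

    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image_base64": base64.b64encode(buf.getvalue()).decode("utf-8")}


# Registers the handler with RunPod's serverless runtime.
runpod.serverless.start({"handler": handler})
```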
Have you tried or evaluated other serverless GPU providers?
Yes.
Which other serverless GPU platforms have you evaluated?
We tested Modal Labs. One difference is that RunPod has a nice user interface where it's really easy to handle all our endpoints. With Modal, it's a little more complex and more developer-focused. RunPod's platform is much simpler to handle—anyone on our team can monitor it. The monitoring and logging are also very clear, so even if there's trouble, any team member can check it, even if they don't have much knowledge of how to deploy endpoints. This is very difficult with Modal Labs.
Is there anything specific that stands out about RunPod's dashboard or UI that makes it especially usable for you and your team?
Every endpoint you deploy shows up as a separate card. If you have 5 different models and you deploy them as 5 different serverless endpoints, your dashboard shows 5 different model cards. In each model card, you can see metrics like how many requests it got, the peak requests, and the RPM. You can see the execution time of each request, the P90, P98, and P99 latencies, and how many cold starts there are. You can even see the logs for each request and the region of your GPUs, and you can modify the GPU configuration. There are many more features like that.
You mentioned earlier that using RunPod serverless makes a big difference in managing costs. Can you talk more about how you compare pricing between providers? Is cost per request or per GPU second something you're closely monitoring? And was pricing a major reason you went with RunPod?
It's not only about pricing. The primary reason we went with RunPod is the availability of different kinds of GPUs. But yes, we do monitor per-second pricing. We compared it with various other platforms, but RunPod seems to be one of the cheapest in terms of pricing, and it has a very large pool of GPU availability in different categories. That's why we chose RunPod.
Let's go deeper on that availability point. When you say availability of different kinds of GPUs, can you elaborate on what types matter most for your workloads?
RunPod has a large pool of different kinds of GPUs. For example, it has GPUs with 16GB, 24GB, 32GB, 48GB, 80GB, 100GB, 140GB, and 180GB of VRAM. These ranges are enough to see where our model can fit easily. Let's say I deploy an open-source model which requires 30GB VRAM—I can deploy it on a 32GB VRAM GPU. For the GPUs above 32GB VRAM, I can choose based on the balance between cost and inference time.
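To make that sizing exercise concrete, a small helper can pick the smallest tier whose VRAM covers a model's requirement. The tier list below mirrors the ranges mentioned above, and the headroom factor is an assumed knob for activations or batching, not anything RunPod provides.

```python
# Sketch: choose the smallest GPU tier that fits a model's VRAM requirement.
VRAM_TIERS_GB = [16, 24, 32, 48, 80, 100, 140, 180]


def pick_gpu_tier(model_vram_gb: float, headroom: float = 1.0) -> int:
    """Return the smallest VRAM tier (in GB) that fits the model's requirement
    times an optional headroom factor; raise if nothing is large enough."""
    required = model_vram_gb * headroom
    for tier in VRAM_TIERS_GB:
        if tier >= required:
            return tier
    raise ValueError(f"no tier fits {required:.0f}GB of VRAM")


# The 30GB example from above lands on the 32GB tier; adding ~20% headroom
# for activations or batching pushes it to the 48GB tier instead.
print(pick_gpu_tier(30))       # 32
print(pick_gpu_tier(30, 1.2))  # 48
```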
In your experience, is this cloud GPU space commoditized where most people just shop on price across similar machines? Or do you think there's still room for meaningful differentiation between providers?
I think there is meaningful differentiation. One more thing RunPod stands out for is that they also provide pods where you can experiment with models or workflows. They have ready-made templates for both serverless and pods. If you want to host a serverless endpoint, RunPod might already have a template for it, making it much easier. They also have community templates. I think most GPU providers don't have this community template feature, which makes RunPod stand out.
Do those community and pre-built templates actually save your team time when spinning up new endpoints or experiments? Any particular example where it streamlined something that would have been more complex on another platform?
We use community templates for pods. We don't use them for serverless because we have very specific requirements, and our code needs specific kinds of monitoring and specific kinds of output in the headers. But we use community pod templates a lot, primarily for ComfyUI. It's a workflow-based tool that requires a specific environment. The pod templates have this environment already set up—we just need to run the template and open ComfyUI. There are also ready-made pod templates for training.
Could you share how often you actually end up using those training-focused pod templates and what types of tasks or experiments you're doing inside those pods?
We do LoRA fine-tuning, depending on client requirements and our own experiments. We mostly do image LoRA fine-tuning. For example, if there's a specific use case where we have a bunch of images and want to generate images in a particular theme, we train a LoRA for that theme. Instead of manually setting up or reinstalling everything, including the training libraries, we just use a pod template, put our data there, and train it. Using these pod templates for training is very straightforward and saves us a lot of time.
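For a sense of what the resulting artifact looks like in use, loading a trained theme LoRA into an SDXL pipeline with diffusers might look roughly like this. The model ID, LoRA path, and prompt are placeholders, and this is a generic diffusers pattern rather than Segmind's own pipeline.

```python
# Sketch: loading a trained theme LoRA into an SDXL pipeline for inference.
# The model ID, LoRA path, and prompt are illustrative placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Attach the LoRA weights produced by the fine-tuning job (e.g. from a pod template run).
pipe.load_lora_weights("/workspace/loras/my-theme-lora")

image = pipe("a product photo in the trained theme", num_inference_steps=30).images[0]
image.save("themed_output.png")
```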
What do you think makes RunPod successful at keeping users? What's their strongest source of lock-in, in your view, not just for you but for developers more broadly?
I really appreciate RunPod's support team. They have a dedicated Discord channel that's very easy to access. They also have a support bot that answers most questions or queries. If the support bot can't answer, you can raise a ticket. The community might help with answers, or someone from the RunPod team will get back to you. For us, this community is a big advantage that no other cloud provider has, whether it's for building templates or getting support.
Have you found that their community help channels actually help you troubleshoot live production issues, or is it more about learning and experimentation use cases?
I think both. If you face any issue in RunPod, most likely some other users have faced it too. So if you post a query, another user might answer immediately. That's really helpful.
RunPod is sometimes viewed as being more focused on smaller teams, researchers, and long-tail developers, while a company like Lambda might cater more to upmarket enterprise users. Do you think that's a fair distinction?
No, it's not a good distinction. I think RunPod is suitable for all kinds of teams and companies. Lambda is suitable for enterprises, but not for smaller teams. That's not the case with RunPod—it's suitable for small teams or large teams. You can use it either way.
Did security or data sensitivity influence your choice at all? For example, were HIPAA compliance or other industry concerns a factor for you?
That was not a factor for us in considering RunPod. As a startup, we wanted to move as fast as possible, so we chose the best one in the market. I think RunPod is the best one, and that's why we went for it. Security compliance was not a primary consideration for us.
Have you ever encountered capacity issues with RunPod, for example, limited availability of certain high-end GPUs during work hours?
RunPod has something called storage volumes. If you're operating on a storage volume, you might face GPU unavailability, because you're limited to the specific region where that volume is located. But if you're using the global region, there won't be any issues with GPU availability.
In theory, GPU hours are fungible, but in practice, do you think it's easy or hard to switch from a platform like RunPod to something else? What would make that difficult?
RunPod has a particular deployment format. If you want to move to another cloud provider, you have to change that format to suit that provider. The same code can't be run on other providers, because RunPod has its own specific format for serverless.
Has that ever affected you directly, like a time when you considered moving but decided not to because reformatting everything would have been too much work?
We've never considered moving away from RunPod. But even if we think about moving away, it's not easy because of how dependent we are on it. It would clearly take us months to completely move from RunPod if we decided to. It wouldn't be a one-day or one-week shift.
Looking ahead 2-3 years, how do you see this space evolving? What changes or trends do you expect to shape the next generation of cloud GPU platforms?
Overall, I think most cloud providers might enter the serverless inference space. RunPod has already started selling serverless endpoints for particular models directly, so developers don't need to use serverless templates or deploy their own models. Instead, they can go directly to the serverless endpoints for those models provided by RunPod. I think most GPU providers might also start providing inference directly instead of requiring all the setup.
What's one misconception you think people have about this space, either cloud GPU platforms in general or RunPod specifically?
I think people might feel these providers are expensive, but they're not. They're actually cheaper than the traditional cloud providers. The big misconception is around pricing: cloud providers like RunPod and Modal are in fact much cheaper than large hyperscale cloud providers like AWS.
Disclaimers
This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.