Fireworks AI customer at Hebbia on serving state-of-the-art models with unified APIs

Jan-Erik Asplund

Background

We spoke with a former product manager at Hebbia who used Fireworks AI to integrate state-of-the-art open models into their enterprise platform. The conversation explores how Fireworks' unified API, rapid model availability, and concurrency guarantees enabled Hebbia to differentiate its offering with model optionality while maintaining consistent performance across varied inference patterns.

Key points via Sacra AI:

  • Fireworks AI gave Hebbia access to state-of-the-art open models with OpenAI-style endpoints, enabling rapid deployment of buzzy models like DeepSeek without special integrations. "We primarily used Fireworks.ai for getting access to recent state-of-the-art models like DeepSeek or Llama models... Fireworks gave us an abstraction where we could get those models online... The most important thing for us was model turnaround time... They would expose any new model to us within the same day, which made it easier for us to capitalize on the buzz for any new model releases."
  • Hebbia's inference patterns included high-concurrency chat products for analysts, token-heavy batch document processing, and model choice flexibility—all requiring different latency profiles that benefited from Fireworks' throughput guarantees and observability tooling. "Our inference patterns were high-frequency chat products, which was the bread and butter for analyst and deal teams... We also had offline batch document evaluations. Users would point us at a data room with hundreds of thousands of documents... We also gave our users the choice for which models they used. Certain models performed better at certain tasks."
  • Model optionality became a key differentiator in Hebbia's enterprise sales, with Fireworks enabling them to offer both closed and open models while addressing security concerns about data retention—positioning them against competitors limited to just OpenAI and Anthropic. "For the enterprise buyer and our champion, who were CIOs, that's where I think the value of Fireworks came in. We'd highlight our breadth of models and also the fact that we could deploy open models securely... Model optionality became part of our pitch. We're not locking you into OpenAI and Anthropic—you'll get access to everything and whatever model best powers your workload. On the go-to-market side, it was quite differentiating."

Questions

  1. At Hebbia, what Fireworks AI features did you actually use? For example, serverless endpoints or fine-tuning? And what did your typical inference patterns look like?
  2. What did your typical inference usage look like in terms of pattern? Were you doing synchronous chat-style interactions, batch doc processing, RAG pipelines, eval loops?
  3. For that first chat-style product with high-frequency analyst interactions, can you give a ballpark on your typical QPS range or concurrency needs? Did you hit any performance ceilings with Fireworks during peak periods?
  4. What did your team really value about Fireworks in terms of developer experience? Were there particular features that stood out or friction points you remember?
  5. Did you evaluate any alternatives, like Together, Replicate, or Bedrock for this role? If so, what tipped the scales in Fireworks' favor?
  6. You mentioned Fireworks got models live fast. Did they reach out proactively when new models dropped? Or was your team monitoring and asking for them? What did that interaction loop look like?
  7. You mentioned Together.ai worked early on but showed higher latency tail. Did you ever use Together just for inference or also for raw compute or cluster level access?
  8. Do you think of inference platforms like Fireworks and raw GPU providers like Lambda as fully separate categories? Or did your team ever evaluate them against each other? In what scenarios would you compare them?
  9. What made Fireworks sticky for your team? Was it the API compatibility, orchestration, observability, something else entirely? And if you had to migrate off, how hard would it actually be?
  10. On pricing, how did Fireworks compare in total cost to someone like OpenAI or Bedrock? And what actually moved your cost? Was it token pricing, concurrency guarantees, dev-ops time, streaming costs?
  11. What's the biggest broad misconception you've seen about inference platforms like Fireworks, especially from people outside of day-to-day AI/ML operations? What's the reality you saw at Hebbia?
  12. If you could shape Fireworks' roadmap, what's one thing that would have made your team's developer velocity or economics even stronger?
  13. Could you walk through a representative project at Hebbia where you used Fireworks, including the requirements, how you evaluated providers, why Fireworks won, and what lessons you took away from that experience?
  14. What would it take for a company like Hebbia to ever graduate off a platform like Fireworks to run inference entirely on their own raw GPUs or inside a hyperscaler? Or do you see managed inference staying in place long term?
  15. What did your team actually emphasize to Hebbia users when it came to the underlying models and hosting platforms? For example, did you highlight that you were using buzzy models like DeepSeek hosted on Fireworks? Or was the abstraction such that end users didn't really care? How did your model infrastructure choices show up in your GTM or product positioning?
  16. Do you think we're headed toward a convergence between GPU providers like Lambda, inference platforms like Fireworks, and even hyperscalers like AWS? Where do you see the space going in the next 2-3 years?
  17. If you were building a new ML platform startup today, where would you focus? Application layer, managed inference, GPU scheduling, or something else altogether? What's the power position over the next 24 months?
  18. Do you see orchestration frameworks like LangChain or custom agent stacks built in-house fitting into this picture long term? Do they remain standalone layers on top of inference platforms like Fireworks? Or do you think model providers absorb more of that logic themselves?
  19. Did Fireworks' catalog play a big role in provider selection? Did you ever fine-tune open models there or stick to inference only?
  20. When would you pick Fireworks over a hyperscaler product like AWS Bedrock and vice versa?
  21. Where did Fireworks outperform or possibly underperform competitors like Together.ai for your workloads?
  22. Do you draw a clear line between inference platforms like Fireworks and GPU providers like Lambda? In which scenarios did you evaluate them against each other? Are these categories converging?

Interview

At Hebbia, what Fireworks AI features did you actually use? For example, serverless endpoints or fine-tuning? And what did your typical inference patterns look like?

We definitely only used Fireworks for serving inference. We primarily used Fireworks.ai for getting access to recent state-of-the-art models like DeepSeek or Llama models. We weren't very interested in fine-tuning at Hebbia, but Fireworks gave us an abstraction where we could get those models online. Our customers were very interested in access to state-of-the-art models and serving them through OpenAI-style endpoints, plugging them into the rest of our suite, which primarily serves API-based models.
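
For illustration, a minimal sketch of what that OpenAI-style integration looks like, using the standard OpenAI Python client pointed at Fireworks' OpenAI-compatible endpoint. The model identifier and API key shown are placeholders, not Hebbia's actual configuration:

```python
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible inference endpoint, so the same client
# abstraction used for closed models can serve open models as well.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",  # placeholder; load from a secret store in practice
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3",  # illustrative model ID; check the catalog
    messages=[{"role": "user", "content": "Extract the key terms from this clause: ..."}],
)
print(response.choices[0].message.content)
```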

What did your typical inference usage look like in terms of pattern? Were you doing synchronous chat-style interactions, batch doc processing, RAG pipelines, eval loops?

We had a mix of all of those, aside from eval loops for the most part. That's why we went with Fireworks AI—all our abstractions for rate limiting based on tokens and serving inference to customers worked out of the box.

Our inference patterns were high-frequency chat products, which was the bread and butter for analyst and deal teams. They basically drag and drop their documents into our UI and ask questions. These are high concurrency because this is our most used product. Latency mattered a lot in this use case.

We also had offline batch document evaluations. Users would point us at a data room with hundreds of thousands of documents, add a few prompts, and then run it and come back later. These didn't need the same latency. They were very token-heavy jobs because we were basically passing in entire document context.
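
As a rough sketch of that offline batch pattern, assuming the same OpenAI-compatible Fireworks endpoint as above, one prompt can be fanned out per document under a concurrency cap; the model ID and cap are illustrative assumptions:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",  # placeholder
)
MAX_CONCURRENCY = 32  # illustrative cap; real limits come from the provider contract

async def run_prompt(doc_text: str, prompt: str) -> str:
    # One token-heavy call per document: the whole document goes in as context.
    resp = await client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # illustrative ID
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": doc_text},
        ],
    )
    return resp.choices[0].message.content

async def process_data_room(documents: list[str], prompt: str) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one_doc(doc: str) -> str:
        async with sem:
            return await run_prompt(doc, prompt)

    return await asyncio.gather(*(one_doc(d) for d in documents))
```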

We also gave our users the choice, within any workbook, of which models they used. Certain models performed better at certain tasks. For example, DeepSeek saw some really interesting results with high-fidelity term extractions in documents. These workloads are quite bursty with spikes, and these three different types of patterns all relied on Fireworks AI's guarantees around throughput and latency profiles for their models.

For that first chat-style product with high-frequency analyst interactions, can you give a ballpark on your typical QPS range or concurrency needs? Did you hit any performance ceilings with Fireworks during peak periods?

We did rate limiting across all of our models, and we did it on model families. We would have a certain token capacity or input-token capacity for OpenAI, and Fireworks gave us that same guarantee.
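
A minimal sketch of that kind of token-based rate limiting, with one token bucket per model family; the per-minute budgets are made-up illustrations, not Hebbia's actual limits:

```python
import threading
import time

class TokenBucket:
    """Simple tokens-per-minute budget that refills continuously."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0  # tokens per second
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_consume(self, n_tokens: int) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.updated
            self.available = min(self.capacity, self.available + elapsed * self.refill_rate)
            self.updated = now
            if self.available >= n_tokens:
                self.available -= n_tokens
                return True
            return False

# One bucket per model family; the numbers here are illustrative only.
BUCKETS = {
    "openai": TokenBucket(tokens_per_minute=2_000_000),
    "fireworks": TokenBucket(tokens_per_minute=1_000_000),
}

def admit(model_family: str, input_tokens: int) -> bool:
    return BUCKETS[model_family].try_consume(input_tokens)
```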

We used Fireworks for hosting these buzzy, trendy models like DeepSeek. AI enthusiasts in our customer base would come in and experiment with them, but it never really drove insanely high throughput through Fireworks. We used Fireworks for out-of-the-box support for getting state-of-the-art open models up on our platform without any developer friction or burden. While we did have high concurrency generally in our platform, the models were more of a novelty. We wouldn't see any insane latency issues.

What did your team really value about Fireworks in terms of developer experience? Were there particular features that stood out or friction points you remember?

Candidly, I'm a product manager, and I worked on this project. Maybe I'm not the expert on developer experience, but I can give you the decision criteria for going with Fireworks over other inference providers.

The first is what I mentioned earlier, which is just a unified API. They gave us basically OpenAI-style interfaces, which allowed us to plug open models into the same abstractions for rate limiting, serving inference, handling parsing of model responses, and using them in agentic workflows just like we did with closed models and API models. This meant we didn't need special-case integrations—we could plug the model in and get it onto our platform as soon as the Twitter buzz started.

The other was latency and observability tooling. They gave us pretty good visibility into token throughput and latency distributions. For our question-answering use cases, these metrics were important. We would know immediately and get alerted if a model lagged on our platform, especially in agentic workflows where dependencies impact the quality of the overall response.
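
A rough sketch of that sort of client-side latency tracking, with a rolling window and an alert threshold; the window size, threshold, and alerting hook are all illustrative assumptions:

```python
import statistics
import time
from collections import deque

LATENCIES = deque(maxlen=1000)  # rolling window of recent call latencies, in seconds
P95_ALERT_SECONDS = 5.0         # illustrative SLO for chat-style calls

def timed_call(call_fn, *args, **kwargs):
    """Wrap any model call, record its latency, and flag p95 regressions."""
    start = time.monotonic()
    response = call_fn(*args, **kwargs)
    LATENCIES.append(time.monotonic() - start)
    if len(LATENCIES) >= 100:
        p95 = statistics.quantiles(LATENCIES, n=20)[-1]  # approx. 95th percentile
        if p95 > P95_ALERT_SECONDS:
            # Stand-in for a real pager/alerting integration.
            print(f"ALERT: p95 latency {p95:.2f}s exceeds {P95_ALERT_SECONDS}s")
    return response
```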

Finally, they abstracted out a lot of GPU allocation work. We could set targets on concurrency and throughput and let Fireworks handle all the auto-scaling.

Did you evaluate any alternatives, like Together, Replicate, or Bedrock for this role? If so, what tipped the scales in Fireworks' favor?

We used Bedrock earlier before we had heard of Fireworks, and we did some side-by-side evaluations. Where Fireworks really shined and outperformed for us is, first, latency. For workflows with high concurrency, which we wanted to support even if we didn't expect it, Together AI inference was fine at a smaller scale, but we saw long-tail latency there.

Also, Fireworks has been around for a while with stronger guarantees around uptime and reliability broadly. We were selling across the world and Fireworks had better multi-region failover.

The most important thing for us was model turnaround time. Compared to Bedrock, which had a much smaller catalog of state-of-the-art open models, Fireworks could get the newest checkpoints into production faster. That was really their bread and butter. They would expose any new model to us within the same day, which made it easier for us to capitalize on the buzz for any new model releases.

You mentioned Fireworks got models live fast. Did they reach out proactively when new models dropped? Or was your team monitoring and asking for them? What did that interaction loop look like?

It was the latter. We would hear from our CEO or marketing team that there was Twitter buzz going on around a certain model like DeepSeek, which is the best example I have. We would literally check Fireworks' catalog—it's entirely self-serve. The process of getting these models hosted is very straightforward.

You mentioned Together.ai worked early on but showed higher latency tail. Did you ever use Together just for inference or also for raw compute or cluster level access?

This is going to be completely non-scientific evaluations based on forums and our inference guys talking to other people they knew who were doing self-hosting. We never actually used Together. We primarily used Bedrock before, and then used Fireworks.ai later on.

Do you think of inference platforms like Fireworks and raw GPU providers like Lambda as fully separate categories? Or did your team ever evaluate them against each other? In what scenarios would you compare them?

Something like Lambda to us was an entirely different category. We didn't really need the flexibility in building out all of those primitives that Lambda gave us. We were totally fine with giving concurrency and token-level targets and serving the models out of the box. We never had any interest in fine-tuning workflows. Lambda was never even an option for us. It really came down to Fireworks versus Bedrock.

What made Fireworks sticky for your team? Was it the API compatibility, orchestration, observability, something else entirely? And if you had to migrate off, how hard would it actually be?

There are two things that made it sticky for us. The first is that our app talks to models through an abstraction layer, which is like a model router that works across things like OpenAI, Anthropic, Gemini, etc. Being able to plug Fireworks with its unified OpenAI API into our infrastructure was a huge win for us.
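
A minimal sketch of what a router like that can look like: every provider sits behind the same OpenAI-style chat interface, so adding a Fireworks-hosted open model is one more registry entry. The provider names, base URL, and model IDs are illustrative assumptions:

```python
from openai import OpenAI

# Each provider is registered behind the same OpenAI-compatible client interface.
PROVIDERS = {
    "openai": OpenAI(api_key="OPENAI_API_KEY"),  # placeholder keys
    "fireworks": OpenAI(
        base_url="https://api.fireworks.ai/inference/v1",
        api_key="FIREWORKS_API_KEY",
    ),
}

# Routing key used by the application -> (provider, provider-specific model ID).
MODEL_REGISTRY = {
    "gpt-4o": ("openai", "gpt-4o"),
    "deepseek-v3": ("fireworks", "accounts/fireworks/models/deepseek-v3"),  # illustrative ID
}

def chat(model_key: str, messages: list[dict]) -> str:
    provider, model_id = MODEL_REGISTRY[model_key]
    resp = PROVIDERS[provider].chat.completions.create(model=model_id, messages=messages)
    return resp.choices[0].message.content
```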

The other thing was just their catalog. They had a vast catalog with every state-of-the-art text model, and it let us experiment very quickly. Upon request during prototypes or data POCs, we could give our users access to the models they'd been hearing about. Technically, it wouldn't be very hard—it was just a couple hours' turnaround for us to get a new model from Fireworks into our platform.

Going over to another provider that gave us all the same primitives at a lower price point would have been difficult. We were generally quite happy with Fireworks. We might have gotten more friction with larger workloads, but we never really reached that point on any of the open models.

On pricing, how did Fireworks compare in total cost to someone like OpenAI or Bedrock? And what actually moved your cost? Was it token pricing, concurrency guarantees, dev-ops time, streaming costs?

With different providers like OpenAI and Anthropic, what we cared about the most were guarantees around throughput. Fireworks, at a very low level, gave us explicit concurrency targets. We weren't subject to the same rate limits as OpenAI, though with OpenAI we had custom contracts and high guarantees.

For the main providers like OpenAI, we cared the most about concurrency guarantees. The streaming cost we cared very little about with Fireworks. The throughput was just so small compared to the 50 to 100 billion tokens we passed through the other model providers.

Fireworks is priced very transparently on token throughput, so it wasn't really an issue for us. We didn't have any concerns about either pricing or latency regarding Fireworks versus Bedrock. I think Bedrock was entirely metered on tokens, but for our key workflows, we looked to spend in batches. Fireworks' concurrency-driven model let us cap usage more freely.

What's the biggest broad misconception you've seen about inference platforms like Fireworks, especially from people outside of day-to-day AI/ML operations? What's the reality you saw at Hebbia?

This is less about Fireworks in general and more about open models. It was quite interesting that non-AI-familiar users were terrified that, even though we were essentially self-hosting the model through Fireworks, there would be some data retention issue with data making its way back to DeepSeek, the original model provider.

The other thing is that everyone is optimizing for raw compute, but raw compute is less important. It's handling concurrency that matters a lot more. We had bursty workloads streaming with sub-second latency—this is an orchestration problem, not a raw machine problem.

If you could shape Fireworks' roadmap, what's one thing that would have made your team's developer velocity or economics even stronger?

It would be workload-specific scheduling. We had a single unified interface, and we have things like batch doc workflows and very latency-sensitive chat and agent-type workflows. If we actually started to drive a lot of throughput through Fireworks, we would need different latency profiles across those workflows, and Fireworks didn't give us the granularity to control these specific scheduling profiles.

Could you walk through a representative project at Hebbia where you used Fireworks, including the requirements, how you evaluated providers, why Fireworks won, and what lessons you took away from that experience?

We used Fireworks for hosting open models. That category of project first came to us with DeepSeek. We knew we were going to have a new requirement: we needed an integration that allowed us to get any model up within some reasonable SLA based on requests from our marketing team or CEO. We needed all the same access to primitives around inference and concurrency, and we needed to ensure that our rate limits supported the kind of workload we'd pass through Fireworks.

We started out with just DeepSeek. Eventually, we added a few of the Llama models. But it became a plug-and-play platform that we could use as part of our go-to-market motion. Our sales team would tell us that some technical user at a firm loved some specific open model, and ask how quickly we could get it on our platform. We could put it behind a feature flag and give it to them as part of the POC. The answer became "very quickly"—within minutes. We'd add a tag to the model dropdown, add Fireworks to our model registry, and start to route compute through it.
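
As a sketch of that feature-flag flow, a new Fireworks-hosted model only shows up in a customer's model dropdown when their account has the corresponding flag enabled; the account names, flags, and model entries are hypothetical:

```python
# Per-account feature flags controlling early access to new models.
FEATURE_FLAGS = {
    "acme-capital": {"deepseek-v3-preview"},  # hypothetical POC account with early access
    "globex-partners": set(),
}

# All models known to the platform; flagged entries are hidden unless enabled.
ALL_MODELS = [
    {"key": "gpt-4o", "label": "GPT-4o"},
    {"key": "deepseek-v3", "label": "DeepSeek V3 (preview)", "flag": "deepseek-v3-preview"},
]

def models_for_account(account: str) -> list[dict]:
    enabled = FEATURE_FLAGS.get(account, set())
    return [m for m in ALL_MODELS if "flag" not in m or m["flag"] in enabled]
```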

What would it take for a company like Hebbia to ever graduate off a platform like Fireworks to run inference entirely on their own raw GPUs or inside a hyperscaler? Or do you see managed inference staying in place long term?

Managed inference starts to lose out as soon as you introduce something that necessitates flexibility, like domain-specific adaptation or any post-training techniques. When you're absolutely hammering GPUs with massive workloads that are extremely bursty, and you need raw control over how they're orchestrated on your cluster—that's when you would move from managed compute to something like Voltage Park.

What did your team actually emphasize to Hebbia users when it came to the underlying models and hosting platforms? For example, did you highlight that you were using buzzy models like DeepSeek hosted on Fireworks? Or was the abstraction such that end users didn't really care? How did your model infrastructure choices show up in your GTM or product positioning?

Fireworks was totally abstracted away, but what we did boast about was our ability to get models onto the platform and a larger breadth of models much faster than our competitors. Our whole value proposition was agnosticism of the models within the workflows that we serviced.

It somewhat depended on who the audience was. For end users like analysts, associates, and deal teams, it was all abstracted away. They cared about the outcomes, and whether it was Fireworks or we just had DeepSeek on our platform didn't really matter to them.

But for the enterprise buyer and our champion, who were CIOs, that's where I think the value of Fireworks came in. We'd highlight our breadth of models and also the fact that we could deploy open models securely. In this environment, we owned all their data, and nothing was being retained by any other entity like DeepSeek.

Model optionality became part of our pitch. We're not locking you into OpenAI and Anthropic—you'll get access to everything and whatever model best powers your workload. On the go-to-market side, it was quite differentiating. Compared to competitors like Frodo and Harvey, who only had OpenAI and Anthropic models, we had both the scaffolding (our workflows) and the fine-grained precision that we extended to users. Users could pick their models.

Do you think we're headed toward a convergence between GPU providers like Lambda, inference platforms like Fireworks, and even hyperscalers like AWS? Where do you see the space going in the next 2-3 years?

I think the ideas are going to blur a little bit over the next 2-3 years. In the enterprise AI space, there was a lot of talk about the application layer or the model layer, with everything abstracted away at the utility layer. Every problem was solved with retrieval, every problem was solved with agentic workflows and decomposing large tasks into smaller tasks. But I think that's going away now. The models are getting smarter, and what's differentiating is domain adaptation and training models in smaller action spaces.

I think the convergence can be explicit for the GPU providers. Currently, they sell raw compute, but I think they're going to be moving up in the stack and adding managed inference layers and serving frameworks, maybe even endpoints for models. Those things give them access to sell higher-margin managed services.

For inference platforms like Fireworks that abstract away GPU headaches, they're going to move down the stack. They're converging with each other to offer more control for workflows, SLAs, more configurability, and all those abstractions become more flexible. They'll position themselves as a new, better option.

The hyperscalers like AWS or Azure will take the full top-to-bottom of the stack and emphasize things like caching and retrieval with their native platforms. For AWS, it might be S3 and their indexes like Kendra. But it's going to be slower for them to have a broader base.

If you were building a new ML platform startup today, where would you focus? Application layer, managed inference, GPU scheduling, or something else altogether? What's the power position over the next 24 months?

I would position somewhere between managed inference and the application layer. There's this new wave of startups investing in the model layer right now, and the application layer is less interesting to them. Fine-grained control over how they ingest and parse data is less interesting to them. What's more interesting is what's differentiating about their model, the modalities, and what inputs it can reason over.

That's where I would focus—on a platform where you can use any of these models, maybe even serving instances of OpenAI and Anthropic models, and host models you've developed yourself. You could push and pin your checkpoints, and it all plugs into a single unified interface along with the data layer and application layer where you can configure agentic workflows over any of these models.

On the data layer side, you just drop your data into it, and it handles retrieval over structured/tabular and unstructured text-based data, as well as other modalities.

Do you see orchestration frameworks like LangChain or custom agent stacks built in-house fitting into this picture long term? Do they remain standalone layers on top of inference platforms like Fireworks? Or do you think model providers absorb more of that logic themselves?

I think orchestration is in the same convergence arc, but the dynamics are a little different. LangChain, Pinecone, and retrieval tools like Elasticsearch are accelerators for developers and abstract away routing, tool calling, and agent logic. They're good for prototyping but less so for production hardening, and the primitives don't give you the flexibility for truly useful agentic workflows.

Teams like ours built leaner, more communicative routers that let us normalize APIs and enforce client-side typing. The inference platforms stay mostly at the single-call inference level, though some of them come with caching, model routing, and prompt templates. Bedrock, for example, is beginning to add that orchestration layer.
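
A minimal sketch of that kind of client-side typing: every provider's raw response is normalized into one typed structure before the application touches it. The field names are illustrative, not Hebbia's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChatResult:
    model_key: str
    text: str
    input_tokens: int
    output_tokens: int

def normalize_openai_style(model_key: str, resp) -> ChatResult:
    # Works for any OpenAI-compatible response object (OpenAI, Fireworks, etc.).
    return ChatResult(
        model_key=model_key,
        text=resp.choices[0].message.content,
        input_tokens=resp.usage.prompt_tokens,
        output_tokens=resp.usage.completion_tokens,
    )
```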

Where it's heading is that model providers are trying to absorb all orchestration. This is true both for the open-model serving space, like Bedrock, and for OpenAI with their framework. They have lightweight commoditized features like computation, caching, retrieval, and deterministic routing, which reduces friction for most developers.

I think the inference platforms will go a little deeper. They'll offer workflow-aware primitives like stateful agents, retries, fallbacks, and observability. The end-state is that startups and lean teams will fully rely on these inference platforms—you don't need to run LangChain for agent workflows when you get it out of the box.

For larger enterprises and teams like Hebbia, orchestration will be a control layer that's mostly custom. You'll want to arbitrarily orchestrate workflows across providers and enforce governance and security at the application layer. I don't know if the LangChains of the world are going to end up here, but the inference providers are like cloud providers in the early days.

Did Fireworks' catalog play a big role in provider selection? Did you ever fine-tune open models there or stick to inference only?

Inference only. We had the philosophy that fine-tuning is a waste of compute. Domain adaptation is less important and can be solved with techniques like prompting and in-context augmentation. More sophisticated reasoning capabilities come from step-function improvements in each new model release from providers.

When would you pick Fireworks over a hyperscaler product like AWS Bedrock and vice versa?

We used Bedrock at one point but chose Fireworks for open models. The key advantage was the speed at which Fireworks made access to open weights models available. Bedrock generally had a smaller catalog. For example, with DeepSeek, there was emphasis on getting that into our platform a day after it launched so we could capitalize on the social media buzz.

The other advantage was observability. We found that Fireworks was best-in-class for token logging and latency metrics, similar to what we got from OpenAI, Anthropic, and Gemini. Regarding integration with the AWS ecosystem, connectivity to services like S3 or Athena wasn't important to us because we had already built abstraction layers in our infrastructure. We didn't need out-of-the-box connectivity to hook models up to fetch papers from S3, for example. That boilerplate code already existed, and our team had already handled it.

Where did Fireworks outperform or possibly underperform competitors like Together.ai for your workloads?

We never used Together.ai. The only other platform we used was AWS Bedrock. We went with Fireworks largely based on latency. For chat workloads with high concurrency where we were slamming requests concurrently, Fireworks gave us much lower latency.

We had also heard about Fireworks' reliability and guarantees around uptime. Finally, it was completely plug-and-play with new models. Fireworks had all checkpoints ready to go, and we could get models into production much faster, all exposed through the same API surface.

Do you draw a clear line between inference platforms like Fireworks and GPU providers like Lambda? In which scenarios did you evaluate them against each other? Are these categories converging?

These are totally separate types of platforms. We never really considered GPU providers like Lambda. While GPU providers give you raw horsepower, our investors always viewed API-based models as state of the art, and we didn't need that maximum level of control over serving frameworks or tuning. Using raw GPU providers means your team has to handle everything like observability and cost optimization.

Inference providers like Fireworks abstracted away those complexities. Fireworks provided the concurrency and throughput profile we needed and handled all the GPU scheduling for us. This meant less time getting a new interesting model like DeepSeek online.

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
