AI engineer at Indeed on TurboPuffer vs. Vespa vs. Elasticsearch at scale

Background

We spoke with a senior AI/ML architect at a large consumer-facing technology company who oversees retrieval infrastructure across RAG pipelines and agentic workflows at scale.

The conversation covers how TurboPuffer, Vespa, and Elasticsearch are used in complementary roles, the tradeoffs between cost, latency, and ranking sophistication, and how the team approaches observability, evaluation, and data governance across millions of daily users.

Key points via Sacra AI:

A company running external products for millions of daily visitors chose TurboPuffer over Pinecone and Vespa for generic agent workloads because its three-tier storage hierarchy automatically optimizes both cost and latency at scale, keeping frequently queried data in fast memory while cold data sits in cheap object storage until needed. "For some of our external-facing products, we get millions of customers every day visiting our website and mobile apps, so we require latency in milliseconds. That's also where TurboPuffer really shines, because of its three-tier storage hierarchy — separating cold, warm, and hot data — which lets you optimize both for cost and for low latency. It's a combination of lots of cold storage data and spiky traffic. If an individual customer is mostly interested in their own data and records, there's no point in paying for retrieval and memory processing of all the other non-relevant data in that session."
When this team benchmarked TurboPuffer against Vespa and Elasticsearch on retrieval quality metrics, they found no meaningful difference, meaning the choice between backends comes down to cost and latency rather than retrieval quality — which removes a key objection to adopting TurboPuffer in production. "We did not see a very noticeable difference on those relevance metrics. The majority of the distinguishing factors came down to latency and cost. Especially at scale, cost efficiency becomes very important. TurboPuffer's options are much cheaper than always-in-memory or always-SSD architectures because less-used data can remain in cold object storage, which is not the case for Vespa or Elasticsearch."
For workloads requiring heavy personalization and custom machine learning ranking, this team uses Vespa instead of TurboPuffer, suggesting TurboPuffer's natural ceiling is generic agent question-answering rather than high-value recommendation and personalization products. "Vespa in particular lets you define custom ranking algorithms and hybrid retrieval algorithms — it has really advanced recommendation logic. For anything requiring very customized machine learning recommendation systems, we prefer Vespa. But TurboPuffer is good for the majority of use cases, because not all use cases require those advanced custom algorithms. If it's a very high-value customer-facing product that requires a heavy degree of personalization, we typically use Vespa. But for most generic use cases where it's just a matter of an AI agent answering customer questions, we typically go with TurboPuffer."

Questions

To start—what retrieval and vector search tools are you actively using or evaluating right now, and has TurboPuffer come up in that context?
Across those RAG and agent workflows, what does the retrieval stack look like end to end—what's doing indexing, search, and evaluation?
For TurboPuffer specifically, what external-facing product workload made serverless vector search attractive—was it cost, scaling behavior, operational simplicity, or something else?
When you say low latency was a requirement, what were the actual latency targets or user-facing thresholds that TurboPuffer had to meet?
When TurboPuffer separates hot, warm, and cold data, where have you actually seen friction—cold-start latency, cache misses, freshness, or something else?
On those uneven traffic cases, what did the failure look like to users—slower responses, older results, lower recall, or incorrect ranking?
When latency suffered from object storage fetches, how did you detect and debug that—TurboPuffer metrics, application traces, or user-facing alerts?
How granular are TurboPuffer's own metrics for diagnosing those cases—can you see cache behavior and per-query causes, or mostly aggregate latency?
When you evaluate retrieval quality in those pipelines, what signals matter most—answer accuracy, recall, latency, cost, or something more specific to your agents?
For hybrid search specifically, where does TurboPuffer stand today versus Elasticsearch or Vespa—is it good enough natively, or do you layer ranking elsewhere?
What makes a use case cross the line into needing Vespa's custom ranking—is it personalization, multi-modal signals, business rules, or relevance quality?
For those generic customer-answering agents on TurboPuffer, how tightly coupled is the retrieval layer to the agent orchestration—could you swap stores without redesigning the workflow?
What abstraction lets you swap stores cleanly—is it a common retrieval API, LangGraph tool wrapper, or something custom your team built?
With that LangGraph abstraction, when retrieval fails in production, what does failure usually look like—wrong documents, stale data, latency spikes, or tool-routing mistakes?
When a tool failure causes an incorrect response, how do you trace whether the root cause was retrieval quality, result formatting, or LangGraph orchestration?
For those traces, can you see why a specific document ranked where it did, or mainly just the returned context and similarity scores?
What about the ranking explanation itself—do Vespa, Elasticsearch, or TurboPuffer give you enough transparency into why documents ranked that way?
For access control and permissions in these retrieval systems, is that mostly handled before indexing, at query time with filters, or somewhere in the agent layer?
When permissions are enforced at the agent layer, what's the biggest risk you watch for—leakage through retrieved context, tool misuse, or inconsistent policy checks?
For TurboPuffer specifically, did its namespace or metadata filtering model create any issues for permissioning, or was that abstracted away enough?
For freshness, how do you manage updates into TurboPuffer—batch reindexing, streaming writes, or some hybrid approach?
For those immediate writes, how do you validate freshness in production—are you measuring index lag directly, or only catching stale results through evals and user reports?
How are you using eval frameworks like LangSmith or Datadog Evaluations—do they actually drive retrieval configuration changes, or mostly validate after launch?
When evals suggest retrieval changes, what's a concrete example you've adjusted—chunking, embedding model, top-k, filters, reranking, or the backend itself?
When you changed chunking or embedding models, did that ever push you toward a different vector backend, or were the stores mostly interchangeable?
On that cost advantage, how do you quantify it internally—storage cost per namespace, query cost, avoided cluster ops, or total cost per product workload?
When you compare that total product cost against Vespa or Pinecone, what are the hidden costs that matter most—engineering ops, overprovisioning, data movement, or migrations?
For TurboPuffer adoption, what was the hardest part of getting it production-ready—migration, security review, performance validation, or developer familiarity?
What did performance validation look like before production—did you run shadow traffic, offline benchmarks, synthetic workloads, or a limited rollout?
During those phased rollouts, what thresholds decided go or no-go—retrieval accuracy, p95 latency, error rate, cost, or user feedback?
For the TurboPuffer rollout specifically, did p95 or p99 latency become a hard gate, or was retrieval quality more important than tail latency?
When retrieval quality was the hard requirement, what was the evaluation set built from—historical queries, labeled relevance judgments, synthetic questions, or production feedback?
For labeled relevance judgments, what did "good" mean in practice—exact document match, sufficient context for the answer, or downstream answer correctness?
For TurboPuffer specifically, did those relevance metrics expose any quality gap versus Vespa or Elasticsearch, or was the difference mainly latency and cost?
For workloads where TurboPuffer won on cost, what corpus shape made that true—lots of namespaces, very large cold data, spiky traffic, or low query frequency?
For that per-customer or per-tenant shape, are you isolating data with namespaces, metadata filters, or separate indexes—and where do limits show up?
Where have metadata filters or namespaces hit practical limits for you—high-cardinality fields, query latency, permission logic, or just operational complexity?
When governance constraints get that complex, do you prefer enforcing them outside TurboPuffer entirely, or do you need the retrieval backend to support more policy-aware filtering?

Interview

To start—what retrieval and vector search tools are you actively using or evaluating right now, and has TurboPuffer come up in that context?

We have evaluated TurboPuffer, Vespa, and Pinecone, and we are actively using TurboPuffer and Vespa as our vector search databases.

Across those RAG and agent workflows, what does the retrieval stack look like end to end—what's doing indexing, search, and evaluation?

TurboPuffer is used for some of our external-facing products, especially because of its unique capabilities around serverless vector search. We also use Elasticsearch for search-related work as part of a broader search and observability stack. Vespa is used for more advanced machine learning and custom ranking features.

For TurboPuffer specifically, what external-facing product workload made serverless vector search attractive—was it cost, scaling behavior, operational simplicity, or something else?

Cost was an important factor, as was scaling and operational simplicity. TurboPuffer offers a serverless vector model, which means we do not have to manage any clusters or nodes. It stores data on object storage and can automatically move active data to faster cached memory so that queries remain very low latency, which was a requirement for us. Cold data can stay cheap. Those were the main factors around TurboPuffer.

When you say low latency was a requirement, what were the actual latency targets or user-facing thresholds that TurboPuffer had to meet?

For some of our external-facing products, we get millions of customers every day visiting our website and mobile apps, so we require latency in milliseconds. Whatever solution we use on the back end, low latency has always been a very critical requirement across all architecture and technical design. That's also where TurboPuffer really shines, because of its three-tier storage hierarchy—separating cold, warm, and hot data—which lets you optimize both for cost and for low latency.
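
As an illustration of the hot/warm/cold idea described above (not part of the interview, and not TurboPuffer's actual implementation), here is a minimal toy sketch. The class name `TieredStore`, the capacities, and the least-recently-used promotion policy are all assumptions chosen for clarity.

```python
from collections import OrderedDict


class TieredStore:
    """Toy three-tier store: hot (RAM), warm (SSD cache), cold (object storage).

    Hypothetical illustration only; not TurboPuffer's actual implementation.
    """

    def __init__(self, hot_capacity, warm_capacity):
        self.hot = OrderedDict()   # smallest, fastest tier (kept in LRU order)
        self.warm = OrderedDict()  # larger, slower tier (kept in LRU order)
        self.cold = {}             # unbounded, cheapest tier; source of truth
        self.hot_capacity = hot_capacity
        self.warm_capacity = warm_capacity

    def put(self, key, value):
        # New writes land in cold storage; reads promote them later.
        self.cold[key] = value

    def get(self, key):
        """Return (value, tier_served_from) and promote the key to hot."""
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key], "hot"
        if key in self.warm:
            value = self.warm.pop(key)
            self._promote(key, value)
            return value, "warm"
        value = self.cold[key]  # raises KeyError if the key was never written
        self._promote(key, value)
        return value, "cold"

    def _promote(self, key, value):
        self.hot[key] = value
        if len(self.hot) > self.hot_capacity:
            # Demote the least recently used hot entry to the warm tier.
            old_key, old_value = self.hot.popitem(last=False)
            self.warm[old_key] = old_value
            if len(self.warm) > self.warm_capacity:
                # Evict from warm; the cold copy is still available.
                self.warm.popitem(last=False)
```

The first read of a key is served from the cold tier and is the slow one; repeated reads of the same key are then served from the hot tier, which is the behavior that makes per-customer access patterns cheap.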

When TurboPuffer separates hot, warm, and cold data, where have you actually seen friction—cold-start latency, cache misses, freshness, or something else?

There can be cases where freshness of the data is compromised, especially with very spiky or uneven traffic. The platform really has to work hard to automatically scale data between hot, warm, and cold tiers. In some cases with very uneven traffic, it can be hit or miss. But overall the performance has been really powerful, which is why we are still using it.

On those uneven traffic cases, what did the failure look like to users—slower responses, older results, lower recall, or incorrect ranking?

It would be a combination of almost all of those. In cases where it has a hard time getting the right results, it actually has to fetch data from storage, and that's when latency really suffers. I have not seen examples of incorrect results, which has been a positive. Even with this weakness, accuracy of the results has not suffered. But in rare cases, latency can suffer when data has to be fetched from object storage in the cold data tier.

When latency suffered from object storage fetches, how did you detect and debug that—TurboPuffer metrics, application traces, or user-facing alerts?

It was mostly a combination of TurboPuffer metrics and application tracing. We use very comprehensive logging and tracking capabilities—dedicated software like Datadog to track and trace every AI agent's interactions, tool calls, and the results being fetched. All the RAG pipelines are very heavily monitored using these observability tools. Typically it would show up in our own internal tooling, and we can also correlate it to TurboPuffer's metrics to validate the results.

How granular are TurboPuffer's own metrics for diagnosing those cases—can you see cache behavior and per-query causes, or mostly aggregate latency?

It's mostly aggregate latency that they provide in their standard reporting. There are some options for getting more granular data through their APIs, but we haven't looked into that much. We still rely more on our own internal tracking, because that's the most reliable and something we can fully customize and control. We have comprehensive dashboards that report on these metrics and send proactive alerts based on any unusual behavior.
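
To illustrate why per-query tracking matters when a vendor reports mostly aggregate latency, here is a minimal sketch of an in-house tracker. It is not the team's tooling and not a Datadog API; `LatencyTracker`, the nearest-rank percentile, and the 50 ms budget in the usage note are assumptions.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class LatencyTracker:
    """Records per-query latency so slow outliers stay visible."""

    def __init__(self, p95_budget_ms):
        self.p95_budget_ms = p95_budget_ms  # hypothetical alert threshold
        self.samples = []  # list of (query_id, latency_ms)

    def timed(self, query_id, fn, *args, **kwargs):
        """Call fn, record how long it took, and return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.record(query_id, (time.perf_counter() - start) * 1000.0)
        return result

    def record(self, query_id, latency_ms):
        self.samples.append((query_id, latency_ms))

    def p95(self):
        return percentile([ms for _, ms in self.samples], 95)

    def should_alert(self):
        return bool(self.samples) and self.p95() > self.p95_budget_ms

    def slowest(self, n=3):
        """Return the n slowest queries, e.g. suspected cold-tier fetches."""
        return sorted(self.samples, key=lambda s: s[1], reverse=True)[:n]
```

With a budget such as `LatencyTracker(p95_budget_ms=50)`, a handful of slow cold-tier fetches is enough to trip `should_alert()`, and `slowest()` points at the specific query IDs to look up in the application traces.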

When you evaluate retrieval quality in those pipelines, what signals matter most—answer accuracy, recall, latency, cost, or something more specific to your agents?

Accuracy, recall, and latency would be the most important factors. Accuracy is the most important—you want very high-quality search results that take into account the semantic meaning of the content, not just keyword search. Platforms that have some kind of hybrid algorithm combining keyword search and semantic reranking are usually preferred. At the same time, maintaining low latency alongside high accuracy is important. A little compromise on latency is acceptable for internal products, but for external customer-facing products, low latency is very important.
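
One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The interview does not say which hybrid method the team uses, so the sketch below is a swapped-in, generic technique; the document IDs are made up and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword retriever and a semantic retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

RRF only needs ranks, not comparable scores, which is why it is often used when the keyword and vector retrievers come from different systems.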

For hybrid search specifically, where does TurboPuffer stand today versus Elasticsearch or Vespa—is it good enough natively, or do you layer ranking elsewhere?

Hybrid search is not TurboPuffer's main specialization. Vespa and Elasticsearch are much more powerful there. Vespa in particular lets you define custom ranking algorithms and hybrid retrieval algorithms—it has really advanced recommendation logic. For anything requiring very customized machine learning recommendation systems, we prefer Vespa. But TurboPuffer is good for the majority of use cases, because not all use cases require those advanced custom algorithms. That's why TurboPuffer is a good alternative for most customer-facing use cases.

What makes a use case cross the line into needing Vespa's custom ranking—is it personalization, multi-modal signals, business rules, or relevance quality?

Personalization is an important criterion. When you're giving recommendations to customers, you don't want to rely on generic keyword search-based recommendations—you want more advanced custom machine learning recommendation models and some kind of hybrid search. If it's a very high-value customer-facing product that requires a heavy degree of personalization, we typically use Vespa. But for most generic use cases where it's just a matter of an AI agent answering customer questions, we typically go with TurboPuffer.

For those generic customer-answering agents on TurboPuffer, how tightly coupled is the retrieval layer to the agent orchestration—could you swap stores without redesigning the workflow?

Swapping is possible. We use different frameworks at the agent layer, often relying on open-source frameworks rather than getting tied to one vendor. We have designed the architecture so that the vector store is one part of the overall workflow and framework, and it can easily be swapped out with another platform as long as you don't compromise on quality, latency, and cost.

What abstraction lets you swap stores cleanly—is it a common retrieval API, LangGraph tool wrapper, or something custom your team built?

It's a combination of LangGraph tool wrappers and some custom functionality we have built. Separating out the abstraction of the agent itself is a good design principle—you keep all the orchestration logic in one layer and handle memory, routing, and orchestration in a single place. LangGraph provides a very strong ecosystem of third-party libraries, API calls, and retrieval libraries. We're using LangGraph very heavily, with LangChain as the underlying framework, and that lets us cleanly separate the agent layer from the storage layer.
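
A minimal sketch of the kind of seam that makes a vector store swappable is shown here. It deliberately avoids any real LangGraph or vendor API: `Retriever`, `Hit`, `InMemoryRetriever`, and `make_retrieval_tool` are hypothetical names, and the keyword-overlap backend is only a stand-in for an adapter around a real client.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Hit:
    doc_id: str
    text: str
    score: float


class Retriever(Protocol):
    """The only contract the agent layer depends on."""

    def search(self, query: str, top_k: int) -> list[Hit]: ...


class InMemoryRetriever:
    """Stand-in backend; a real adapter would wrap a vendor client here."""

    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    def search(self, query: str, top_k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self.docs.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                hits.append(Hit(doc_id, text, float(overlap)))
        hits.sort(key=lambda h: (-h.score, h.doc_id))
        return hits[:top_k]


def make_retrieval_tool(retriever: Retriever, top_k: int = 3):
    """Build the function the agent calls; it never sees the backend type."""

    def retrieve(query: str) -> str:
        hits = retriever.search(query, top_k)
        # One fixed output format, regardless of which store produced the hits.
        return "\n".join(f"[{h.doc_id}] {h.text}" for h in hits)

    return retrieve
```

Swapping stores then means writing one new class with a `search` method; the tool wrapper and the orchestration graph stay unchanged, and the result formatting the model depends on lives in a single place.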

With that LangGraph abstraction, when retrieval fails in production, what does failure usually look like—wrong documents, stale data, latency spikes, or tool-routing mistakes?

It could be any of those. If you're using some of the latest large language models, they have very powerful tool-calling features, but they also require very specific formatting of retrieval results and very specific structuring of the chains and graphs in the orchestration layer. If any of those pieces get broken, you would typically see a tool failure cascade downstream—resulting in either high latency or incorrect responses.

When a tool failure causes an incorrect response, how do you trace whether the root cause was retrieval quality, result formatting, or LangGraph orchestration?

We have very detailed logging and tracing set up in our own internal software. We use Datadog for detailed logging of every single step the agent performs—what the response was after the RAG step, what context was returned, what the similarity score was between the returned context and the user query. Based on those individual traces, we can clearly see which step resulted in the bug or error—whether retrieval quality from the vector database was not good enough, or whether the agent itself made a mistake in using that context.

For those traces, can you see why a specific document ranked where it did, or mainly just the returned context and similarity scores?

In the traces, we can see all the detailed context that was returned. We own the whole agent workflow, so all the actions the agent takes, all the tool calls, any thinking it performs, and all the orchestration and routing steps can be logged. That's why we rely on our own observability stack—so we can have full visibility into end-to-end tracking.

What about the ranking explanation itself—do Vespa, Elasticsearch, or TurboPuffer give you enough transparency into why documents ranked that way?

They give some visibility, but not a lot, since much of it is covered by proprietary algorithms. Out of the three, Elasticsearch is open source, so that's the one we can configure and optimize the most—you can be very flexible in how you design the Elasticsearch stack. Vespa provides the most advanced algorithms, powerful real-time ranking and filters, and machine learning-driven retrieval. They don't give full visibility into what's happening under the hood, but they do provide a good amount of customization and options.

For access control and permissions in these retrieval systems, is that mostly handled before indexing, at query time with filters, or somewhere in the agent layer?

For a lot of our products, we handle it at the agent layer itself. Security and authentication can be applied at any stage, but if you handle it at the vector storage layer, you still have to handle it at other steps as well, so it doesn't fully serve the purpose. Handling it at the agent layer keeps it more systematic and easier to manage.
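
As a rough sketch of what an agent-layer check can look like, the example below filters retrieved documents before they reach the model's context. The fields (`tenant_id`, `contains_pii`) and the two rules are assumptions for illustration; the interview does not describe the team's actual policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    contains_pii: bool
    text: str


@dataclass(frozen=True)
class Caller:
    tenant_id: str
    can_view_pii: bool


def authorize(caller: Caller, docs: list[Doc]) -> list[Doc]:
    """Keep only the retrieved documents this caller is allowed to see."""
    allowed = []
    for doc in docs:
        if doc.tenant_id != caller.tenant_id:
            continue  # never pass another tenant's data into the context
        if doc.contains_pii and not caller.can_view_pii:
            continue  # PII requires an explicit permission
        allowed.append(doc)
    return allowed
```

Because every retrieval result passes through one function, the policy is easy to audit, but it is also the single point of failure the next answer describes.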

When permissions are enforced at the agent layer, what's the biggest risk you watch for—leakage through retrieved context, tool misuse, or inconsistent policy checks?

The biggest risk is a single point of failure. If any policy is not applied consistently or correctly, you can end up with improper permissions or incorrect authentication. If you do checks at multiple places, it reduces that risk because you have multiple layers of security—but it also introduces more complexity and maintenance overhead. It's a balance between scalability and security that really depends on the application.

For TurboPuffer specifically, did its namespace or metadata filtering model create any issues for permissioning, or was that abstracted away enough?

For TurboPuffer, a lot of those things are abstracted away enough, which is good. Downstream consumers don't have to worry about all of that infrastructure management—they can use the product as-is and a lot of those concerns are handled at the product level itself. We just worry about the security we've implemented at the agent level, which is a good balance.

For freshness, how do you manage updates into TurboPuffer—batch reindexing, streaming writes, or some hybrid approach?

We mostly use hybrid approaches. In some cases there are batch writes, but there's also real-time indexing. TurboPuffer vectors are incrementally indexed—with something like an LSM-fresh vector index—which supports filtering plus immediate visibility of writes in search results. It provides both options, and we use a hybrid approach to utilize the strong features of each.

For those immediate writes, how do you validate freshness in production—are you measuring index lag directly, or only catching stale results through evals and user reports?

For now, we are mostly catching stale results through evals and logging. We have a detailed logging mechanism set up that helps catch those cases. Longer term, we might look into utilizing additional features directly from the vendor, but for now we have visibility through our own logging.
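
Measuring index lag directly, the alternative raised in the question, can be sketched as a sentinel probe: write a marker document, then poll until a search can see it. This is a generic pattern, not something the interviewee describes using; `write` and `is_visible` are hypothetical callables that would wrap whatever store client is in use.

```python
import time


def measure_index_lag(write, is_visible, doc_id, timeout_s=5.0, poll_s=0.05,
                      clock=time.monotonic, sleep=time.sleep):
    """Write a sentinel document, then poll until a search can see it.

    Returns the observed lag in seconds, or None if the timeout expires.
    `clock` and `sleep` are injectable so the probe can be tested without
    real waiting.
    """
    start = clock()
    write(doc_id)
    while clock() - start <= timeout_s:
        if is_visible(doc_id):
            return clock() - start
        sleep(poll_s)
    return None
```

Run on a schedule, a probe like this turns staleness into a number that can be charted and alerted on, instead of something discovered later through evals or user reports.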

How are you using eval frameworks like LangSmith or Datadog Evaluations—do they actually drive retrieval configuration changes, or mostly validate after launch?

They do drive configuration changes. LangSmith evaluations are very detailed, and since we're using LangChain and LangGraph as the foundational blocks of our agents, LangSmith is a natural choice for tracing and logging. We have dedicated feedback sessions and regular reporting mechanisms so we can look at the quality of results, and if there are improvements to be made, we take those results into account and go back and modify the configurations.

When evals suggest retrieval changes, what's a concrete example you've adjusted—chunking, embedding model, top-k, filters, reranking, or the backend itself?

Chunking is an important one. Sometimes important context gets missed because of the chunking size or the algorithm being used, so we've gone ahead and modified the chunking strategy. In some cases it's also been the embedding model—sometimes multilingual embedding models are needed, or larger, more powerful embedding models. We've tried a couple of different approaches; it really depends on the individual case.

When you changed chunking or embedding models, did that ever push you toward a different vector backend, or were the stores mostly interchangeable?

The stores are mostly interchangeable for those factors. The reason for continuing to use TurboPuffer is less about chunking and more about cost—the three-tier hierarchy. If you want to design a product at scale, you don't want to be paying uniformly for cold storage. With a large amount of data, you can separate your most in-demand hot data from the less-utilized data that can stay in cold storage. That is the main unique feature of TurboPuffer.

On that cost advantage, how do you quantify it internally—storage cost per namespace, query cost, avoided cluster ops, or total cost per product workload?

Typically total cost for the product, which is the easiest to look at. Most of these are custom pricing packages, so user count alone is not enough to estimate cost savings. Cost is usually built on data stored, data queried, and write volume. Depending on how large the corpus is and how large the query volume is, we look at the overall cost of the product to get some level of justification for using it.
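
The three cost drivers named above (data stored, data queried, write volume) can be rolled into a per-product total along these lines. Every rate and volume below is a made-up placeholder, not any vendor's real pricing, and `UsagePricing` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class UsagePricing:
    # Placeholder rates in dollars; real contracts are custom-priced.
    per_gb_stored: float
    per_gb_queried: float
    per_gb_written: float


def monthly_cost(pricing, gb_stored, gb_queried, gb_written):
    """Total monthly cost for one product workload."""
    return (
        pricing.per_gb_stored * gb_stored
        + pricing.per_gb_queried * gb_queried
        + pricing.per_gb_written * gb_written
    )


# Hypothetical workload: large, mostly cold corpus with modest query volume.
example = monthly_cost(
    UsagePricing(per_gb_stored=0.5, per_gb_queried=0.25, per_gb_written=2.0),
    gb_stored=1000,
    gb_queried=400,
    gb_written=10,
)
```

Comparing this total across backends for the same workload is what makes a corpus shape visible: a large `gb_stored` with a small `gb_queried` favors pricing where storage is the cheap term.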

When you compare that total product cost against Vespa or Pinecone, what are the hidden costs that matter most—engineering ops, overprovisioning, data movement, or migrations?

Engineering ops is an important factor. TurboPuffer's operational simplicity is a strong factor in its favor—there's a much lower operational burden, no cluster sizing, no shard planning, and not much node management. They take care of automatic scaling and multi-tenancy. With Vespa, because you need customized models and customized machine learning algorithms to make it worthwhile, there is obviously more operational complexity.

For TurboPuffer adoption, what was the hardest part of getting it production-ready—migration, security review, performance validation, or developer familiarity?

Developer familiarity is important since the platform will be used across many applications, so ease of development is a factor. Lower operational burden and operational complexity were also important. Most of these platforms come with robust enterprise security and authentication features, so that's typically not a distinguishing factor. The lower operational burden, cost efficiency at scale, and automatic scaling are the strong factors in TurboPuffer's favor.

What did performance validation look like before production—did you run shadow traffic, offline benchmarks, synthetic workloads, or a limited rollout?

We used a lot of synthetic data for benchmarking and also compared against real data. For the rollout, we did limited phased rollouts with periodic testing at each interval to see how results were looking. Based on that, we would go back, make configuration improvements, and do another rollout.

During those phased rollouts, what thresholds decided go or no-go—retrieval accuracy, p95 latency, error rate, cost, or user feedback?

Retrieval quality and error rates would be important factors. User feedback is also important, but it's tricky because it doesn't tell you which step failed or where the issue was—you need to look at the detailed logging and evaluations to figure that out. Depending on the scenario, you could look at a combination of those options to identify any issues.
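
A go/no-go check over the signals mentioned here could be expressed as a small gate function. The metric names and threshold values in this sketch are assumptions, not figures from the interview.

```python
def rollout_gate(metrics, thresholds):
    """Return (go, failures) for one rollout phase.

    `thresholds` maps a metric name to ("min", limit) or ("max", limit).
    A metric that is missing from `metrics` counts as a failure.
    """
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (not failures, failures)


# Hypothetical thresholds for a phased rollout.
THRESHOLDS = {
    "retrieval_accuracy": ("min", 0.9),
    "error_rate": ("max", 0.01),
}
```

Returning the list of failed checks alongside the decision keeps each phase's result explainable when the team goes back to adjust configuration.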

For the TurboPuffer rollout specifically, did p95 or p99 latency become a hard gate, or was retrieval quality more important than tail latency?

Retrieval quality was more important. There cannot be any sacrifice on retrieval quality, especially in a complex system, because any compromise at the retrieval stage can compromise subsequent steps in the agentic workflow.

When retrieval quality was the hard requirement, what was the evaluation set built from—historical queries, labeled relevance judgments, synthetic questions, or production feedback?

We used quite a bit of historical data to build those test metrics. We also utilized some vendors for labeled data, which really helps with red teaming, post-training labeling, annotation, and testing. Depending on the scale of the product, we used a variety of those approaches—in some cases relying on only one or two, in others using a combination of all of them.

For labeled relevance judgments, what did "good" mean in practice—exact document match, sufficient context for the answer, or downstream answer correctness?

Exact document match can be a useful metric, but typically we go with similarity scores between the retrieved context and the user prompt. You can then have a second metric for retrieved context versus the final agent response, and another for the agent response versus the user prompt. All of these give you different views of the whole pipeline and help you narrow down whether the issue is at the retrieval stage, the agent level, or the application level.
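
The three pairwise comparisons described above can be sketched with cosine similarity over embeddings. The sketch assumes the prompt, retrieved context, and agent response have already been embedded by some model; the function names are hypothetical and the interview does not specify the similarity measure used.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pipeline_scores(prompt_vec, context_vec, response_vec):
    """Three views of one interaction, one per stage boundary."""
    return {
        # Low here suggests a retrieval problem.
        "context_vs_prompt": cosine(context_vec, prompt_vec),
        # Low here suggests the agent ignored or misused the context.
        "response_vs_context": cosine(response_vec, context_vec),
        # Low here means the answer drifted from the question overall.
        "response_vs_prompt": cosine(response_vec, prompt_vec),
    }
```

Reading the three numbers together is what localizes a failure: a high `context_vs_prompt` with a low `response_vs_context` points at the agent rather than the vector store.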

For TurboPuffer specifically, did those relevance metrics expose any quality gap versus Vespa or Elasticsearch, or was the difference mainly latency and cost?

We did not see a very noticeable difference on those relevance metrics. The majority of the distinguishing factors came down to latency and cost. Especially at scale, cost efficiency becomes very important. TurboPuffer's options are much cheaper than always-in-memory or always-SSD architectures because less-used data can remain in cold object storage, which is not the case for Vespa or Elasticsearch. That really helps for large-scale external-facing products, and latency is also a plus on top of that.

For workloads where TurboPuffer won on cost, what corpus shape made that true—lots of namespaces, very large cold data, spiky traffic, or low query frequency?

It's a combination of lots of cold storage data and spiky traffic. If an individual customer is mostly interested in their own data and records, there's no point in exposing or paying for retrieval and memory processing of all the other non-relevant data in that session. That's where the TurboPuffer advantage really shows up—you only need the relevant data available in RAM as hot data, and all the irrelevant data can stay in cold storage, automatically promoted between tiers only if needed.

For that per-customer or per-tenant shape, are you isolating data with namespaces, metadata filters, or separate indexes—and where do limits show up?

Metadata filters are an option we have tried, along with different namespaces. Typically there are enough identifying filters and tags to help identify the right data shape, and that really helps improve search result accuracy and relevance.

Where have metadata filters or namespaces hit practical limits for you—high-cardinality fields, query latency, permission logic, or just operational complexity?

Permission logic can be a tricky one. There are sometimes strict requirements around PII and data sharing, and for a global company you have to deal with data governance requirements and privacy laws across different countries and continents. That does introduce operational complexity—you have to design your policies around that. But it's a compromise you have to make between cost, latency, and operational complexity.

When governance constraints get that complex, do you prefer enforcing them outside TurboPuffer entirely, or do you need the retrieval backend to support more policy-aware filtering?

If the back end could support more policy-relevant filtering, that would be helpful, because not everything can be done outside—if it can be handled at the data layer itself, it really helps reduce complexity in other components of the system. But sometimes it's just not possible to put that complex logic at the vector store level, so we handle it at the application layer. It's a balance of the architecture.

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
