Turbopuffer as First-Pass Retriever


AI engineer at Meta on evaluating Turbopuffer vs. Pinecone vs. Weaviate

Interview
We separate retrieval and ranking, use caching pretty heavily on retrieval

This setup turns the vector database into a candidate fetcher, not the place where the final answer is decided. Retrieval is the step that pulls back a broad set of possible documents, and ranking is the step that sorts those candidates down to the few chunks that actually enter model context. In this workflow, heavy caching stabilizes retrieval output and latency across repeated eval runs, while the real relevance logic lives in a separate ranking layer.
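The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Meta team's implementation and not Turbopuffer's API: `CORPUS`, `first_pass_retrieve`, `CachedRetriever`, and `rank` are hypothetical names, the retriever is a toy term-overlap scorer standing in for a vector database query, and the ranker is a toy Jaccard scorer standing in for a real reranking model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    text: str
    retrieval_score: float  # raw first-pass score, kept for the ranker and for debugging


# Hypothetical toy corpus; a real system would query a vector database here.
CORPUS = {
    "d1": "turbopuffer stores vectors on object storage",
    "d2": "rerankers sort candidates before they enter model context",
    "d3": "caching retrieval keeps eval runs stable",
    "d4": "bananas are rich in potassium",
}


def first_pass_retrieve(query: str, top_k: int) -> List[Candidate]:
    """Stand-in retriever: scores by number of shared terms (not a real ANN search)."""
    q_terms = set(query.lower().split())
    scored = [
        Candidate(doc_id, text, float(len(q_terms & set(text.split()))))
        for doc_id, text in CORPUS.items()
    ]
    # Sort by score descending, then by doc_id so ties are deterministic.
    scored.sort(key=lambda c: (-c.retrieval_score, c.doc_id))
    return scored[:top_k]


class CachedRetriever:
    """Caches candidate lists per (query, top_k) so repeated runs see the same input."""

    def __init__(self, retrieve_fn: Callable[[str, int], List[Candidate]]):
        self._retrieve_fn = retrieve_fn
        self._cache: Dict[Tuple[str, int], Tuple[Candidate, ...]] = {}
        self.misses = 0  # counts calls that actually reached the retriever

    def retrieve(self, query: str, top_k: int) -> List[Candidate]:
        key = (query, top_k)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = tuple(self._retrieve_fn(query, top_k))
        return list(self._cache[key])


def rank(query: str, candidates: List[Candidate], keep: int) -> List[Candidate]:
    """Toy ranker: Jaccard overlap with the query, raw retrieval score as tiebreaker."""
    q_terms = set(query.lower().split())

    def relevance(c: Candidate) -> Tuple[float, float]:
        d_terms = set(c.text.split())
        jaccard = len(q_terms & d_terms) / len(q_terms | d_terms)
        return (jaccard, c.retrieval_score)

    return sorted(candidates, key=relevance, reverse=True)[:keep]


retriever = CachedRetriever(first_pass_retrieve)
query = "caching retrieval eval runs"
candidates = retriever.retrieve(query, top_k=3)  # broad candidate set
context = rank(query, candidates, keep=1)        # what the model would actually see
retriever.retrieve(query, top_k=3)               # second call is served from the cache
```

Keeping `retrieval_score` on each candidate is a design choice in this sketch: it lets the ranking layer use the first-pass score as a feature and makes it easier to see which layer is responsible when a bad chunk reaches the context.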

  • In the Meta workflow, Turbopuffer is used primarily for retrieval. The engineer describes separating retrieval from ranking, caching retrieval heavily, and running the same eval many times to measure a distribution, because approximate nearest neighbor (ANN) search and index updates can make repeated retrieval runs vary.
  • That matches Turbopuffer's practical sweet spot. It can return dense, sparse, BM25, filter, and hybrid candidates, but multiple interviews place its strongest value in cheap, large-scale candidate generation, while custom reranking, personalization, and advanced ranking logic are handled elsewhere or pushed toward systems like Vespa and Elasticsearch.
  • The main tradeoff is that once retrieval and ranking are split, teams need good observability and feature handoff between the two layers. Engineers want raw retrieval scores, stable cache behavior, and richer ranking features, because pulling 1,000 candidates is easy, but choosing the best 20 for an agent context window is the hard part.
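The practice of running the same eval many times and reading the result as a distribution can be sketched as follows. This is an illustration under assumptions, not the engineer's actual harness: `noisy_retrieve` is a hypothetical stand-in that simulates run-to-run variation by randomly dropping documents, and the corpus, relevant set, and `drop_prob` value are made up for the example.

```python
import random
import statistics
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def noisy_retrieve(rng: random.Random, doc_ids: List[str], drop_prob: float) -> List[str]:
    """Hypothetical stand-in for approximate search: each doc is missed with drop_prob."""
    return [d for d in doc_ids if rng.random() >= drop_prob]


def run_eval(n_runs: int, seed: int = 0) -> List[float]:
    """Repeat the same retrieval eval n_runs times and return one recall@10 per run."""
    rng = random.Random(seed)  # seeded so the sketch itself is reproducible
    ideal_order = [f"d{i}" for i in range(20)]       # made-up ranked corpus
    relevant = {"d0", "d1", "d2", "d3", "d4"}        # made-up ground truth
    return [
        recall_at_k(noisy_retrieve(rng, ideal_order, drop_prob=0.1), relevant, k=10)
        for _ in range(n_runs)
    ]


scores = run_eval(n_runs=50)
summary = {
    "mean": statistics.mean(scores),
    "stdev": statistics.stdev(scores),
    "min": min(scores),
    "max": max(scores),
}
print(summary)
```

Reporting the spread alongside the mean makes it possible to tell whether a change in a metric between two configurations is larger than the variation retrieval shows on its own.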

Going forward, Turbopuffer is best positioned as the low-cost retrieval tier under a broader search stack. If it expands feature generation, hybrid controls, and debugging around candidate scoring, it can move closer to the ranking boundary. But the clearest path is still owning first-pass retrieval for very large corpora, while separate rankers decide what the model actually sees.