Retrieval Quality as Agent Foundation

Diving deeper into

AI engineer at Indeed on TurboPuffer vs. Vespa vs. Elasticsearch at scale

Interview
There cannot be any sacrifice on retrieval quality, especially in a complex system, because any compromise at the retrieval stage can compromise subsequent steps in the agentic workflow.
Analyzed 2 sources

The key point is that retrieval is the quality floor for the whole agent, so this team treated recall and relevance as a release gate and treated latency as something to optimize after the right documents were consistently coming back. In practice, the evaluation set was built mostly from historical queries, then strengthened with vendor labeled data for annotation, red teaming, and testing, which let the team measure whether retrieved context actually matched the user prompt and supported the final response.

  • The eval stack was multi stage, not just document matching. Exact match could matter, but the main metrics were similarity between query and retrieved context, then between retrieved context and final answer, which helped isolate whether failure came from search or orchestration.
  • That choice fits the broader architecture. The system logs every RAG step, tool call, context return, and similarity score in Datadog and LangSmith, because a weak retrieval result can trigger downstream tool failures, higher latency, or wrong answers later in the graph.
  • A useful comparison is that another large scale team evaluating Turbopuffer emphasized repeated eval runs, retrieval and ranking separation, and distribution based benchmarking because stochastic or weak retrieval makes the whole benchmark noisy. Both cases put retrieval correctness ahead of raw speed alone.

Going forward, agent stacks are likely to standardize on richer retrieval evals that start with real historical traffic, add labeled judgments, and score each handoff in the pipeline. As agents call more tools and branch through longer workflows, teams that can prove retrieval quality step by step will ship faster and break less often in production.