Retrieval Consistency as an Evaluation Problem
Diving deeper into: AI engineer at Meta on evaluating Turbopuffer vs. Pinecone vs. Weaviate
It's less of an issue for user-facing retrieval and more of an issue for evals.
The key implication is that retrieval consistency is mostly a measurement problem, not a product problem. In a live app, slight variation in which near-neighbor documents show up usually gets washed out by reranking and the model’s final answer. In evals, that same variation corrupts the baseline, because repeated runs can make the backend look better or worse without any real system change.
- The engineer describes nondeterminism as coming from ANN search, index update timing, and cache state. That matters less when retrieval is just candidate generation for a user query, but it matters a lot when the goal is to compare versions and isolate whether a retrieval tweak actually improved quality.
- The practical workaround is to split retrieval from ranking, cache retrieval heavily, and run the same eval set many times to measure a distribution instead of one score. That turns noisy retrieval into a stable benchmarking process, even if the index itself is not perfectly repeatable.
- This also explains why backend choices often come down to cost and latency once relevance is in the same range. In another large-scale evaluation, Turbopuffer, Vespa, and Elasticsearch looked similar on retrieval quality, so the deciding factors became serving behavior and economics, not benchmark wins.
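The workaround described above (cache retrieval, rerun the same eval set, report a distribution rather than one score) can be sketched roughly as follows. This is a minimal illustration, not the engineer's actual harness: `cached`, `score_distribution`, and `make_noisy_retriever` are hypothetical names, the "backend" is a toy function that returns random documents to stand in for ANN nondeterminism, and recall@k is assumed as the metric.

```python
import random
import statistics
from typing import Callable, Dict, List, Set, Tuple

# A retriever takes (query, k) and returns a list of document ids.
Retriever = Callable[[str, int], List[str]]


def make_noisy_retriever(corpus: List[str], seed: int) -> Retriever:
    """Toy stand-in for a nondeterministic backend (assumption, not a real client)."""
    rng = random.Random(seed)

    def retrieve(query: str, k: int) -> List[str]:
        # Each call draws a fresh random sample, so repeated calls disagree.
        return rng.sample(corpus, k)

    return retrieve


def cached(retriever: Retriever) -> Retriever:
    """Freeze the first candidate list seen for each (query, k) pair."""
    store: Dict[Tuple[str, int], List[str]] = {}

    def wrapper(query: str, k: int) -> List[str]:
        key = (query, k)
        if key not in store:
            store[key] = retriever(query, k)
        return list(store[key])  # copy so callers cannot mutate the cache

    return wrapper


def recall_at_k(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def run_eval(retriever: Retriever, eval_set: Dict[str, Set[str]], k: int) -> float:
    """Mean recall@k over every query in the eval set (one eval pass)."""
    return statistics.mean(
        recall_at_k(retriever(query, k), relevant)
        for query, relevant in eval_set.items()
    )


def score_distribution(
    retriever: Retriever, eval_set: Dict[str, Set[str]], k: int, runs: int
) -> Dict[str, float]:
    """Repeat the eval pass and summarize the spread of scores."""
    scores = [run_eval(retriever, eval_set, k) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    corpus = [f"doc{i}" for i in range(20)]
    eval_set = {"q1": {"doc1", "doc2"}, "q2": {"doc3"}}

    live = make_noisy_retriever(corpus, seed=0)
    frozen = cached(make_noisy_retriever(corpus, seed=0))

    print("uncached:", score_distribution(live, eval_set, k=5, runs=20))
    print("cached:  ", score_distribution(frozen, eval_set, k=5, runs=20))
```

The two calls serve different purposes. Running the uncached retriever many times estimates how much of a score difference could be retrieval noise alone. Running the cached one pins the candidate set, so its spread is zero by construction and any change in a downstream ranking metric can be attributed to the ranker.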
Going forward, retrieval stacks will be judged less by a single recall number and more by whether they produce stable candidate sets that let teams run clean offline experiments. As agent workflows mature, the winning systems will be the ones that separate cheap candidate generation from controlled ranking and repeatable eval infrastructure.
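One way to make "stable candidate sets" measurable is to issue the same query several times and check how much the returned sets overlap. The sketch below uses mean pairwise Jaccard similarity as the stability score. That choice is an assumption for illustration, not a metric named in the interview, and `candidate_stability` and the `retrieve` callable are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Set overlap between two candidate lists, ignoring order."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty results are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_stability(
    retrieve: Callable[[str, int], List[str]], query: str, k: int, runs: int = 5
) -> float:
    """Mean pairwise Jaccard similarity across repeated retrievals of one query.

    1.0 means every run returned the same set of documents; lower values
    mean the backend returned different candidates on different runs.
    """
    results = [retrieve(query, k) for _ in range(runs)]
    pairs = list(combinations(results, 2))
    if not pairs:
        return 1.0  # fewer than two runs: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Tracking this score per query alongside recall would show whether a change in an eval number sits inside or outside the backend's own run-to-run variation.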