Unified Embeddings and Metadata in ClickHouse
AI program manager at AstraZeneca on running self-hosted ClickHouse
AstraZeneca's deployment shows why real-time RAG in regulated settings is shifting toward operational analytics databases rather than separate vector stores. At AstraZeneca, the retrieval step is not just nearest-neighbor search. It also needs row-level metadata checks, source grounding, and compliance filtering before an answer reaches clinicians. Keeping embeddings and metadata together lets one query do both jobs, which cuts hops, speeds response time, and improves retrieval quality because the filter and the match run on the same data model.
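As a rough illustration of why running the filter and the match on the same data model can improve retrieval quality, here is a minimal in-memory sketch in Python. It is not AstraZeneca's implementation: the `Chunk` fields (`region`, `approved`), the sample rows, and both function names are hypothetical. In ClickHouse terms, the single-pass version would roughly correspond to a `WHERE` clause combined with an `ORDER BY` on a distance function such as `cosineDistance`, though the source does not show the actual query.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    embedding: tuple[float, ...]
    region: str      # hypothetical metadata column
    approved: bool   # hypothetical compliance flag


def cosine_distance(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def unified_search(rows: list[Chunk], query: tuple[float, ...],
                   region: str, k: int) -> list[str]:
    """Filter on metadata, then rank by distance, over one table."""
    eligible = [r for r in rows if r.approved and r.region == region]
    eligible.sort(key=lambda r: cosine_distance(r.embedding, query))
    return [r.doc_id for r in eligible[:k]]


def split_search(rows: list[Chunk], query: tuple[float, ...],
                 region: str, k: int) -> list[str]:
    """Rank first (as a separate vector store would), filter afterwards."""
    nearest = sorted(rows, key=lambda r: cosine_distance(r.embedding, query))[:k]
    return [r.doc_id for r in nearest if r.approved and r.region == region]


# Hypothetical sample data: the two closest vectors fail the metadata checks.
ROWS = [
    Chunk("a", (1.0, 0.0), "EU", False),
    Chunk("b", (0.9, 0.1), "US", True),
    Chunk("c", (0.6, 0.4), "EU", True),
    Chunk("d", (0.0, 1.0), "EU", True),
]
QUERY = (1.0, 0.0)

print(unified_search(ROWS, QUERY, "EU", 2))  # ['c', 'd']
print(split_search(ROWS, QUERY, "EU", 2))    # []
```

With this sample data, the split approach spends its top-2 budget on rows that later fail the compliance and region checks and returns nothing, while the single-pass version returns the two eligible documents. Real systems can mitigate this by over-fetching or pre-filtering, so treat the sketch as an illustration of the failure mode rather than a benchmark.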
- The practical accuracy gain came from better retrieval, not just faster hardware. AstraZeneca said the old setup made it hard to reliably exceed 90% accuracy, while the unified ClickHouse setup improved prompt refinement and retrieval together. That matters in oncology workflows, where the wrong document, or the right document without the right metadata, can fail compliance review.
- This fits ClickHouse's broader wedge versus Snowflake and Databricks. Snowflake remains AstraZeneca's governed system for transformations and reporting, while ClickHouse is the speed layer for sub-second analytics and agentic AI orchestration. In other words, vector search extends ClickHouse's existing low-latency analytics role rather than replacing the warehouse.
- The competitive point is that a standalone vector database adds another retrieval system to secure, synchronize, and filter. For a self-hosted pharma deployment, that means more moving parts around access control, auditability, and data freshness. ClickHouse's appeal is that the same engine already handling large-scale analytical filters can also execute vector search inside that governed workflow.
Going forward, the winning architecture for enterprise RAG will look less like a separate AI sidecar and more like a database that can handle search, filtering, and analytics in one path. That favors ClickHouse in latency-sensitive, compliance-heavy workloads, especially where teams already need a fast operational store alongside a warehouse such as Snowflake.