AI program manager at AstraZeneca on running self-hosted ClickHouse

Jan-Erik Asplund

Background

We spoke with an AI program manager at a major global pharmaceutical company who leads a 400-person data science and AI organization running ClickHouse on self-hosted infrastructure for oncology and drug development workloads. Our conversation covers why her team migrated agentic AI workloads off Databricks to ClickHouse—citing sub-200ms query latency on petabytes of electronic health records versus minutes on Databricks, and a roughly 75% reduction in the engineering headcount required to operate those workloads—how ClickHouse and Snowflake divide responsibilities within a regulated enterprise data stack, and how AstraZeneca is using ClickHouse's vector search capabilities as the retrieval layer for compliance-critical RAG pipelines serving oncologists and healthcare professionals.

Key points via Sacra AI:

  • AstraZeneca moved agentic AI workloads from Databricks to self-hosted ClickHouse because real-time retrieval from petabytes of patient records required sub-200ms latency that Databricks couldn't deliver, reducing the team needed to operate those workloads by roughly 75%. "When we moved into the generative AI and agentic AI space, we began building agent-based models on data coming in from various sources. Databricks became difficult to work with given the cost, query retrieval latency, and other factors... For complex groupings on billions of rows across our several petabytes of data, it takes less than 200 milliseconds at most—subsecond even at massive scale. This is blazing fast... At the program level I lead, if Databricks requires a team of 100 people, ClickHouse achieves the same output with around 25."
  • ClickHouse and Snowflake coexist for clearly different purposes at AstraZeneca—Snowflake as the foundational governance layer for complex transformations and regulatory reporting, ClickHouse as the real-time speed layer for agentic AI orchestration—suggesting ClickHouse's enterprise TAM is additive rather than replacement. "Snowflake is essentially an enterprise data lakehouse—excellent for complex SQL and multi-stage transformations, especially for financial data and commercial analytics for marketing and sales. ClickHouse is our high-performance OLAP engine for real-time analytics—sub-second metrics, logs, customer-facing analytics, and the native agentic AI orchestration layer... Snowflake has been with us longer and operates as our foundational governance layer, while ClickHouse functions as the speed layer within a broader data architecture."
  • ClickHouse's vector search capabilities are being used for compliance-critical RAG workflows in oncology, where the unified approach of storing both metadata and vector embeddings in one system improved accuracy and latency over standalone vector databases while meeting regulatory standards. "We used ClickHouse vector embeddings instead of a standalone vector database, and it significantly helped with response times, latency, and prompt refinement. Previously, we would refine prompts after receiving a response, and it was difficult to consistently hit 90% or above accuracy. We needed high accuracy, low latency, cost efficiency, and fast retrieval. Considering all those factors together, the unified ClickHouse approach was the right choice."

Questions

  1. Can you tell me about your professional background and what kind of work you do?
  2. How does ClickHouse fit into your work—is it something you're actively using, used at a previous job, or evaluated for AI and data science use cases?
  3. Is this a ClickHouse Cloud deployment, or are you self-hosting the open source version within AstraZeneca's infrastructure?
  4. What else is in your data stack alongside ClickHouse—are you also using tools like Snowflake, Databricks, or Datadog?
  5. What specifically lands on ClickHouse versus Databricks or Redshift—is it a matter of query latency, the type of data from electronic health records, or how AI agents access data in real time?
  6. What kind of latency difference did you see for agentic AI queries on patient records when moving from Databricks to ClickHouse?
  7. What specific ClickHouse features drove that performance for your patient records, and what were the real operational trade-offs when setting up a self-hosted cluster at this scale?
  8. How large a team does it take to keep the ClickHouse cluster running, and what does operational overhead look like compared to Databricks?
  9. With a much smaller team managing petabytes of data in ClickHouse, are you building custom tooling for monitoring and ingestion that Databricks provided out of the box, and are you considering moving any Datadog workloads into ClickHouse?
  10. Was Elasticsearch ever in the mix for the conversational search workload, and are you using ClickHouse's vector search capabilities or primarily using it for metadata retrieval for the generative AI agents?
  11. How does ClickHouse's vector search performance compare to dedicated vector databases, and does the unified approach—handling both metadata and vector embeddings in one place—help with compliance checks?
  12. ClickHouse recently launched a native Postgres integration allowing users to query Postgres tables directly from ClickHouse—does that kind of unification of transactional and analytical workloads appeal to you given your patient records and compliance requirements?
  13. ClickHouse ships monthly releases with many new features—does that pace of innovation excite your team, or is it a source of instability for your production cluster given your regulatory framework?
  14. What is the most important thing ClickHouse could build or improve to make it even more valuable for your oncology and AI agent workloads?
  15. How do you think about the division of labor between ClickHouse and Snowflake—is there a clear line between real-time AI workloads and broader warehousing, or could ClickHouse eventually take over more of what Snowflake does?
  16. How does the developer experience compare between ClickHouse and Snowflake, and where did your engineers find the most friction building on ClickHouse?
  17. What were the specific technical gotchas your engineers hit when first modeling patient records or genomic data in ClickHouse—did they struggle with merge tree settings, primary key selection, or join behavior?
  18. When surgeons and scientists are simultaneously querying patient records and genomic data, does ClickHouse handle the concurrency well, and how does its horizontal scaling compare to Snowflake under heavy load?
  19. At what point would a significant ClickHouse pricing increase cause you to seriously evaluate alternatives, or is the performance advantage for agentic AI workloads significant enough that you'd stay regardless?
  20. Is the ClickHouse community a meaningful resource for your team when tuning clusters, or have you had to engage ClickHouse support directly for any reliability issues?

Interview

Can you tell me about your professional background and what kind of work you do?

I'm a strategic and results-oriented senior leader at AstraZeneca. I've been working in data science, artificial intelligence, and transforming data analytics use cases into real-time value for end users.

How does ClickHouse fit into your work—is it something you're actively using, used at a previous job, or evaluated for AI and data science use cases?

We are heavily dependent on various machine learning use cases and generative agentic AI use cases that run on ClickHouse. The reason is that it provides super-fast analytics on large amounts of data—specifically for processing electronic health records from EHR systems, understanding patient journeys and HCP journeys, and helping us diagnose critical cases related to oncology.

Is this a ClickHouse Cloud deployment, or are you self-hosting the open source version within AstraZeneca's infrastructure?

We are using the open source version on our own infrastructure. Because AstraZeneca is a heavily regulated company, we need to follow strict compliance standards, which is why we have not used their cloud offering.

What else is in your data stack alongside ClickHouse—are you also using tools like Snowflake, Databricks, or Datadog?

We use several tools; as a large company, we don't rely on just one. We use Databricks and Amazon Redshift for processing SQL queries related to machine learning models and generative AI use cases. We also use Datadog for observability.

What specifically lands on ClickHouse versus Databricks or Redshift—is it a matter of query latency, the type of data from electronic health records, or how AI agents access data in real time?

A few years ago, we started with traditional machine learning models and used Databricks as our primary partner for processing large amounts of data, and that worked well. When we moved into the generative AI and agentic AI space, we began building agent-based models on data coming in from various sources. Databricks became difficult to work with given the cost, query retrieval latency, and other factors. We evaluated alternatives and moved to ClickHouse for faster query generation, because it is primarily a columnar database rather than a traditional row-oriented database like PostgreSQL.

What kind of latency difference did you see for agentic AI queries on patient records when moving from Databricks to ClickHouse?

The key difference is that Databricks is optimized for machine learning workloads and large-scale data engineering. For ClickHouse, even on generative AI workloads, simple aggregations run in less than 30 to 40 milliseconds. For complex groupings on billions of rows across our several petabytes of data, it takes less than 200 milliseconds at most—subsecond even at massive scale. This is blazing fast, and it comes from ClickHouse's vectorized execution engine, written in C++, and its columnar storage. By comparison, the same real-time dashboard workloads take minutes on Databricks.
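The columnar storage mentioned above is the core reason these scans are cheap. The following is a toy sketch (plain Python, not ClickHouse code, with hypothetical field names) contrasting the same aggregation over a row-oriented layout, where every field of every record is touched, and a columnar layout, where only the two referenced columns are scanned as contiguous arrays:

```python
# Toy illustration of row-oriented vs. columnar layouts for one aggregation.
# A columnar engine like ClickHouse scans only the columns a query
# references; a row store must walk whole records.

rows = [  # row-oriented: one record object per row (hypothetical fields)
    {"region": "EU", "visits": 3},
    {"region": "US", "visits": 5},
    {"region": "EU", "visits": 2},
]

columns = {  # columnar: one contiguous array per field
    "region": ["EU", "US", "EU"],
    "visits": [3, 5, 2],
}

def total_visits_rows(data, region):
    # Row store: visits every record, loading all of its fields.
    return sum(r["visits"] for r in data if r["region"] == region)

def total_visits_columns(cols, region):
    # Column store: scans only the "region" and "visits" arrays.
    return sum(v for reg, v in zip(cols["region"], cols["visits"])
               if reg == region)

print(total_visits_rows(rows, "EU"))       # 5
print(total_visits_columns(columns, "EU")) # 5
```

Both layouts return the same answer; the difference at petabyte scale is how many bytes must move through memory to compute it, which vectorized execution then processes in batches.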

What specific ClickHouse features drove that performance for your patient records, and what were the real operational trade-offs when setting up a self-hosted cluster at this scale?

I can walk through the trade-offs across five major buckets: latency, cost, complexity, flexibility, and ecosystem fit.

  • Latency: ClickHouse delivers ultra-low millisecond latency, while Databricks operates at the second level.
  • Cost: ClickHouse is extremely cost-efficient for analytical queries with no expensive orchestration for the ML layer, whereas Databricks charges for Delta Lake, MLflow, notebooks, and more.
  • Simplicity and flexibility: ClickHouse is SQL-based, whereas Databricks supports multiple languages—SQL, Python, R—making it more complex but more flexible.
  • Real-time analytics: ClickHouse excels at time-series and streaming data for customer-facing dashboards, understanding KOL influence by region, or identifying high-priority HCPs for a specific tumor area—workloads where Databricks is not well suited.
  • Operational overhead: ClickHouse is lightweight and horizontally scalable, while Databricks requires enterprise-grade governance and is a much heavier platform overall.

How large a team does it take to keep the ClickHouse cluster running, and what does operational overhead look like compared to Databricks?

This is one of the most practical differences we've seen. For the data engineering layer alone, where Databricks might require 15 data engineers, ClickHouse needs only 3 to 4. For DBAs and storage engineers, Databricks needs 4 to 5 people, while ClickHouse needs 1 to 2. Looking at the total team including machine learning engineers and data scientists—where Databricks might need 25 to 30 people—ClickHouse needs just 2 to 3. At the program level I lead, if Databricks requires a team of 100 people, ClickHouse achieves the same output with around 25. That is a massive reduction in cost and overhead.

With a much smaller team managing petabytes of data in ClickHouse, are you building custom tooling for monitoring and ingestion that Databricks provided out of the box, and are you considering moving any Datadog workloads into ClickHouse?

We have plans to explore that, but we have not made any moves yet. Right now we're focused on the use cases that need fast retrieval most urgently. One key example is conversational search for HCPs. We had a requirement to move beyond keyword search and provide intent-driven conversational search—understanding not just the keywords in a prompt but the intention behind it and how to retrieve data from multiple sources. That is one reason we started working with ClickHouse. On the observability side, Datadog operates at enterprise level—it provides no-code or low-code automation, auto-remediation, auto-routing, and integrates with upstream and downstream systems like Slack and Jira. ClickHouse, by contrast, provides storage, a query engine, and real-time analytics. Cost-wise, Datadog runs us at least $10 to $12 million per year at our scale, whereas ClickHouse's vectorized execution engine and cheaper storage make it 10 to 20 times less expensive.

Was Elasticsearch ever in the mix for the conversational search workload, and are you using ClickHouse's vector search capabilities or primarily using it for metadata retrieval for the generative AI agents?

We are using ClickHouse's vector search capabilities—that is exactly why I brought up that use case. As a regulated organization, all information passed to scientists, surgeons, or healthcare professionals must be vetted through compliance and meet regulatory standards. We pull from both internal and external data sources, where the LLM crawls those sources, retrieves information, ensures it meets compliance standards using RAG, and grounds that information before delivering a response to HCPs. ClickHouse is what makes that grounding fast and reliable.

How does ClickHouse's vector search performance compare to dedicated vector databases, and does the unified approach—handling both metadata and vector embeddings in one place—help with compliance checks?

We used ClickHouse vector embeddings instead of a standalone vector database, and it significantly helped with response times, latency, and prompt refinement. Previously, we would refine prompts after receiving a response, and it was difficult to consistently hit 90% or above accuracy. We needed high accuracy, low latency, cost efficiency, and fast retrieval. Considering all those factors together, the unified ClickHouse approach was the right choice.
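The unified approach described here can be sketched in miniature: when metadata and the embedding live in the same row, a compliance filter and a nearest-neighbor ranking happen in a single pass, with no second hop to a separate vector store. This is a hedged toy model in plain Python—field names like `approved` are hypothetical, and in ClickHouse the equivalent would be a table with an array embedding column queried alongside ordinary WHERE filters:

```python
import math

# Toy sketch of unified metadata + embedding retrieval: each "row"
# carries both compliance metadata and its vector, so filtering and
# similarity ranking happen in one scan.

documents = [
    {"id": 1, "approved": True,  "embedding": [1.0, 0.0]},
    {"id": 2, "approved": False, "embedding": [0.9, 0.1]},
    {"id": 3, "approved": True,  "embedding": [0.0, 1.0]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, docs, top_k=1):
    # Apply the compliance filter on metadata first, then rank the
    # surviving rows by cosine similarity to the query embedding.
    vetted = [d for d in docs if d["approved"]]
    vetted.sort(key=lambda d: cosine(query_embedding, d["embedding"]),
                reverse=True)
    return [d["id"] for d in vetted[:top_k]]

# Doc 2 matches the query best but is not approved, so doc 1 is returned.
print(retrieve([0.9, 0.1], documents))  # [1]
```

The design point the interviewee is making is visible even in the toy: a standalone vector database would return doc 2 and force a post-hoc metadata check against a second system, whereas the unified layout never surfaces unvetted content in the first place.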

ClickHouse recently launched a native Postgres integration allowing users to query Postgres tables directly from ClickHouse—does that kind of unification of transactional and analytical workloads appeal to you given your patient records and compliance requirements?

That is news to me. But regardless, we would keep those workloads strictly separate. We would never mix transactional data with analytical data for regulatory compliance purposes, so we would want to maintain them in distinct systems.

ClickHouse ships monthly releases with many new features—does that pace of innovation excite your team, or is it a source of instability for your production cluster given your regulatory framework?

We do not adopt updates at speed, because we have dependencies across the systems feeding our data. We follow a particular cadence—typically one stable release per month—testing each release and ensuring it meets our standards before rolling it out. We appreciate that ClickHouse is innovating quickly to stay competitive, but at enterprise grade, we need a predictable patching schedule, a clear upgrade path, and validation that every release meets our cost and performance benchmarks.

What is the most important thing ClickHouse could build or improve to make it even more valuable for your oncology and AI agent workloads?

There are many potential directions for ClickHouse in oncology. ClickHouse has the ability to analyze massive longitudinal, multimodal datasets, and I see a lot of potential for genomic data, imaging metadata, computer vision, electronic health record events, treatment timelines, biomarkers, and outcomes. All of these use cases benefit from fast retrieval of semi-structured clinical data at scale. From a cost standpoint, ClickHouse can compress and store real-world evidence data, claims data, and EHR data at 10 to 20 percent of the cost of competitors like Snowflake or Databricks.

How do you think about the division of labor between ClickHouse and Snowflake—is there a clear line between real-time AI workloads and broader warehousing, or could ClickHouse eventually take over more of what Snowflake does?

Snowflake is essentially an enterprise data lakehouse—excellent for complex SQL and multi-stage transformations, especially for financial data and commercial analytics for marketing and sales. It handles regulatory reporting, clinical operations, and cross-domain data integration. ClickHouse is our high-performance OLAP engine for real-time analytics—sub-second metrics, logs, customer-facing analytics, and the native agentic AI orchestration layer. Snowflake wins for complex transformations, compliance, and cross-domain work. ClickHouse wins for fast analytics at scale. We run both for different purposes. Snowflake has been with us longer and operates as our foundational governance layer, while ClickHouse functions as the speed layer within a broader data architecture.

How does the developer experience compare between ClickHouse and Snowflake, and where did your engineers find the most friction building on ClickHouse?

We collect anonymous real-time feedback from our team—about 400 people in my department—and the picture over the past two months has been fairly consistent.

On ease of onboarding, Snowflake is significantly easier than ClickHouse. Snowflake does not require infrastructure knowledge and the tooling is cloud-native. ClickHouse can also run locally, which is excellent, but it requires engineers to understand cluster management, auto-scaling, and backup strategies—making the operational overhead medium compared to near-zero for Snowflake.

For real-time analytics, ClickHouse is the clear winner. There is no other platform on the market with the same distinction in real-time data ingestion and querying. Snowflake is not ideal for those workloads.

On cost transparency, Snowflake is simpler to understand. ClickHouse requires performance tuning and indexing work to optimize costs and debug issues.

On integrations, ClickHouse is actually the stronger integrator across other platforms, while Snowflake lags slightly.

On SQL, both are excellent, though ClickHouse is more specialized toward OLAP.

To summarize: ClickHouse delivers a SQL-first experience with extreme performance and flexibility, but requires more engineering investment. Snowflake is ideal for enterprise data engineering and business intelligence. ClickHouse is ideal for real-time OLAP and customer-facing analytics.

What were the specific technical gotchas your engineers hit when first modeling patient records or genomic data in ClickHouse—did they struggle with merge tree settings, primary key selection, or join behavior?

The persona split tells the story. Snowflake is ideal for SQL-first BI developers, analytics engineers, and data engineers with a data warehouse background—it has a very low learning curve. ClickHouse requires backend engineers, site reliability engineers, and performance-focused engineers who are comfortable with OLAP systems. The steeper learning curve comes from needing to understand indexing strategies, table engine selection, and performance tuning. The key distinction is that ClickHouse is easier for performance-oriented backend engineers, while Snowflake is easier for analytics-oriented teams. In our organization, different teams use each platform for different purposes.

When surgeons and scientists are simultaneously querying patient records and genomic data, does ClickHouse handle the concurrency well, and how does its horizontal scaling compare to Snowflake under heavy load?

Snowflake and ClickHouse handle concurrency very differently. Snowflake provides invisible scaling through multi-cluster warehouses that scale automatically. ClickHouse is based on a cluster model that requires manual fine-tuning to achieve extreme performance. We tune ClickHouse for maximum performance; Snowflake hides that complexity entirely.

For concurrency spikes, Snowflake absorbs them automatically by adding clusters, which keeps performance stable but increases cost. ClickHouse absorbs concurrency spikes through techniques like materialized views, replication, and sharding—if you engineer it well, you get extreme performance at low cost, but you have to design for it. The best summary: if you have a well-engineered team, use ClickHouse. If your analytics team is not deeply engineering-oriented, Snowflake is the effortless choice.
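The materialized-view technique mentioned above can be sketched as follows: instead of re-scanning raw rows for every concurrent dashboard query, an aggregate is maintained incrementally at insert time, so readers hit a tiny precomputed table. This is a hedged toy model in plain Python, not ClickHouse syntax; in ClickHouse the same pattern would be a materialized view populated on INSERT into the base table:

```python
from collections import defaultdict

# Toy sketch of the materialized-view pattern: inserts maintain a
# precomputed per-region aggregate, so concurrent reads never scan
# the raw table.

class MaterializedCount:
    def __init__(self):
        self.raw = []                       # the base "table"
        self.per_region = defaultdict(int)  # the "materialized view"

    def insert(self, region, value):
        # Each insert updates the aggregate alongside the raw data,
        # analogous to a view populated at INSERT time.
        self.raw.append((region, value))
        self.per_region[region] += value

    def query(self, region):
        # Readers touch only the precomputed value, not self.raw,
        # which is what keeps cost flat under concurrency spikes.
        return self.per_region[region]

mv = MaterializedCount()
for region, v in [("EU", 2), ("US", 7), ("EU", 3)]:
    mv.insert(region, v)
print(mv.query("EU"))  # 5
```

The trade-off matches the interviewee's framing: the read path becomes nearly free, but someone has to design the aggregates up front, which is the engineering investment ClickHouse asks for in exchange for low cost under load.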

At what point would a significant ClickHouse pricing increase cause you to seriously evaluate alternatives, or is the performance advantage for agentic AI workloads significant enough that you'd stay regardless?

Since we use ClickHouse and Snowflake for different purposes, each has its own ROI story. If we see significant return on investment, we would absorb a cost increase. If the ROI is not there, we would evaluate alternatives. It comes down to a holistic view: the ROI, the developer experience, and whether the cost increase is justified by what we are getting in return.

Is the ClickHouse community a meaningful resource for your team when tuning clusters, or have you had to engage ClickHouse support directly for any reliability issues?

Initially, yes—we engaged directly with the ClickHouse team to understand how they could support our engineering team in optimizing our setup. We went through about a two-week learning curve, co-created some of our database structures with them, and worked through how to link those to downstream systems.

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
