Product manager at Firebolt on scaling challenges and ACID compliance in OLAP databases
Jan-Erik Asplund
Background
We spoke with a data infrastructure expert who previously managed a hosted ClickHouse service and now works with a competing analytics database that was originally forked from ClickHouse. Our conversation explores ClickHouse's strengths in observability workloads, its compression advantages over alternatives like OpenSearch, and the architectural trade-offs between open source self-hosting and cloud offerings.
Key points via Sacra AI:
- ClickHouse excels at observability workloads due to superior compression and performance, enabling 3x longer log retention at half the cost of alternatives like OpenSearch. "We were able to reduce the size of the cluster we were using to a third of its OpenSearch size in CPU and memory because of the performance of the system. But then we were able to increase the retention of logs available to our end users because the compression was so much better. We were able to go from offering 7 days of historic logs for analysis to our customers to 30 days. We were still saving—I think we were paying half as much as we had been when we were trying to do it with OpenSearch."
- ClickHouse Cloud's decoupled storage and compute architecture doesn't fully solve the elasticity problem, with customers often finding the cloud offering too expensive compared to self-hosting. "ClickHouse has done a better job and they have built a decoupled storage and compute, but you still struggle with resource allocation. Scaling up and down is not as fluid as some of the other services... Fundamentally, the challenge with the cloud offering for ClickHouse is that it's too expensive. We have customers who have used the open source version, and the barrier to open source ClickHouse is around the complexity of management. But the cloud offering, at the scale they're running at, is just too expensive, so they are looking for alternatives."
- ClickHouse's append-only architecture struggles with transactional workloads that require ACID compliance and upserts, creating an opening for competitors like Firebolt that handle both analytical and transactional patterns. "Upserts are a first-class citizen for us. The design of ClickHouse specifically is append-only and then delete later. So you just add more files, compact them down, and add data. You're not doing inserts and upserts or changes to existing content—you essentially recreate the whole thing and add a new row to add data... We handle that as a primary primitive mechanism, so you can do that on Firebolt and that's why we can handle transactional workflows in a way that ClickHouse can't."
Questions
- Can you summarize your professional background and the types of database and big-data work you do?
- How does ClickHouse fit into your experience—are you using it in production now, did you use it previously, or have you only evaluated it?
- When you ran the hosted ClickHouse service at your previous company, what were the primary use cases—real-time user-facing analytics, observability and logs, or something else?
- Was that previous ClickHouse setup entirely self-hosted on the open-source version, or were you using ClickHouse Cloud or a BYOC model?
- Why did you choose ClickHouse for your observability and logs—was it query speed, cost per GB, better handling of high-cardinality data, or something else compared to Elasticsearch or Datadog?
- After moving logs from OpenSearch to ClickHouse, how did the troubleshooting experience change for engineers—did they miss full-text search or benefit from ClickHouse's analytical speed?
- Regarding the intermittent performance you mentioned during tuning, what specific pain points did you hit—high-frequency small inserts, MergeTree background merges, or did you rely on features like Bloom filters and materialized views?
- How do you see ClickHouse positioned today versus the fork you work on and cloud warehouses like Snowflake and BigQuery—what workloads is ClickHouse clearly best for, and where does it lose its edge?
- Where do teams usually hit a wall with ClickHouse's complexity—moving from single node to distributed clusters, data modeling challenges like joins and late-arriving data, or something else—and can you give a concrete example of a workload that outgrows ClickHouse?
- Does ClickHouse's native Postgres integration actually solve the two-database complexity between transactional and analytical workloads, or is it more of a Band-Aid compared to a unified engine like Firebolt?
- Does ClickHouse Cloud genuinely deliver Snowflake-style decoupled storage and compute elasticity, or are the roots of the tightly coupled open-source architecture still visible?
- What is it about ClickHouse Cloud's pricing that becomes punitive—storage costs, compute units, or simply a disproportionate markup compared to efficient self-hosted open source—and where do customers usually go when they leave?
- When you say scaling in ClickHouse Cloud isn't fluid, what does that look like in practice—slow node additions, coarse autoscaling forcing overprovisioning, and for less technical teams, what's the 'oh no' moment that makes self-hosting harder than expected?
- Do you see ClickHouse as a viable platform for AI workloads like vector search and RAG pipelines, or is it stretching itself too thin by pursuing too many features beyond being a fast OLAP engine?
- Beyond the marketing, what is ClickHouse's durable technical advantage that competitors will struggle to replicate, and what risks—if any—threaten its long-term viability as a standalone company?
- When migrating logs from OpenSearch to ClickHouse, was there a specific data volume or latency threshold where OpenSearch failed, and what, if anything, does Elasticsearch still do better such as dashboarding or handling unstructured text?
- How did you find ClickHouse's documentation and community when tackling tuning issues—was it helpful or did you mostly rely on trial and error and source code dives—and how does the developer experience compare to Hadoop?
- How does ClickHouse handle high concurrency and noisy neighbors—will hundreds or thousands of simultaneous dashboard users cause degradation, and how do you mitigate a single massive poorly optimized query?
- When Postgres and ClickHouse are kept in sync, what's the usual failure mode—CDC pipeline breakage or difficulty joining across systems—and how does your product eliminate ClickHouse's eventual consistency problem architecturally?
- When teams try to force transactional patterns into ClickHouse using approaches like ReplacingMergeTree, where does that pattern break down—do background merges and slow vacuuming cause problems?
- How does Firebolt's pricing model compare to ClickHouse Cloud—is it credit-based or tied to underlying infrastructure, and in bake-offs do you argue lower compute usage, cheaper storage, or both?
- For your 30-day logs in ClickHouse, did you implement tiered storage to move cold data to object storage, and if not, how did you keep query performance acceptable for older logs?
- Where is ClickHouse most vulnerable—competitive pressure from products like Firebolt that handle transactional workloads, or from cloud warehouses like Snowflake and BigQuery improving real-time performance?
- Are ClickHouse and Firebolt ready to act as plug-and-play query engines over Iceberg, and is there a performance penalty compared to using their native managed storage formats?
- What do people consistently misunderstand about ClickHouse—its marketing pitch versus technical reality—that you wish more users appreciated?
- If you were starting a high-concurrency analytics project today, would you choose open-source ClickHouse first, or would managed options like Firebolt or ClickHouse Cloud make you skip self-hosting?
- Why is self-hosting open-source ClickHouse still your first move for new projects—is it purely cost control and flexibility to avoid vendor lock-in, or is there another reason?
- When starting on open-source ClickHouse, does the lack of cloud-only features like S3 tiered storage become a problem quickly, or does ClickHouse's compression let you go far on local or attached disks?
Interview
Can you summarize your professional background and the types of database and big-data work you do?
I've been working in big data for my whole life. I used to work in Hadoop, which is the original big data—CDNs, distributed data systems, big data in finance. More recently, I've been working with graph databases, and then as head of databases covering analytical OLAP databases like ClickHouse and Firebolt, where I work now, as well as relational databases such as Postgres and MySQL, NoSQL databases such as Cassandra, and key-value databases such as Redis or Valkey. Combined with Neo4j, I've covered and managed work in product engineering across all the biggest types of databases.
How does ClickHouse fit into your experience—are you using it in production now, did you use it previously, or have you only evaluated it?
Fundamentally, ClickHouse is a competitor to the product I currently work with. In a previous role, I managed hosted ClickHouse as a service based on the open source product, and we used it internally in my previous organization. Here at my current company, we use our own product, which was a fork of the open source ClickHouse five years ago. The products are fundamentally very similar and we compete with them. I have deep understanding of the differences between the two services, how ClickHouse is positioned, where it works, where it doesn't work, why customers are using it, and why they are moving away from it.
When you ran the hosted ClickHouse service at your previous company, what were the primary use cases—real-time user-facing analytics, observability and logs, or something else?
Internally, it was used for powering internal analytics and also observability, but the majority was observability. We used ClickHouse internally for all our logs for the managed service and our troubleshooting.
Was that previous ClickHouse setup entirely self-hosted on the open-source version, or were you using ClickHouse Cloud or a BYOC model?
We were very much open source focused at my last company, so we only used the open source version, and we used our own hosted open source version of it.
Why did you choose ClickHouse for your observability and logs—was it query speed, cost per GB, better handling of high-cardinality data, or something else compared to Elasticsearch or Datadog?
We wouldn't use Datadog because it's just too expensive. I've worked with and built products at Neo4j where we had to avoid using Datadog in our product design because it made it unfeasible. We were actually moving off OpenSearch, which is a fork of Elasticsearch created when Elastic made its license change. So we moved from essentially Elasticsearch/OpenSearch to ClickHouse.
The advantages it gave us were that performance was better, but the real winner was that it's very good at compression. Observability logs are highly repetitive and compress very well. We were able to reduce the size of the cluster we were using to a third of its OpenSearch size in CPU and memory because of the performance of the system. But then we were able to increase the retention of logs available to our end users because the compression was so much better. We were able to go from offering 7 days of historic logs for analysis to our customers to 30 days. We were still saving—I think we were paying half as much as we had been when we were trying to do it with OpenSearch.
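A quick back-of-envelope check of the anecdote above—the normalized costs are illustrative assumptions, not real billing data, but they show why "half the cost for 4x the retention" is such a large combined win:

```python
# Back-of-envelope check of the retention/cost anecdote.
# All cost figures are illustrative assumptions, not real billing data.

opensearch_retention_days = 7
clickhouse_retention_days = 30

opensearch_monthly_cost = 100.0   # normalize OpenSearch cost to 100 units
clickhouse_monthly_cost = 50.0    # "paying half as much"

# Cost per day of retention served to customers:
os_cost_per_day = opensearch_monthly_cost / opensearch_retention_days
ch_cost_per_day = clickhouse_monthly_cost / clickhouse_retention_days

improvement = os_cost_per_day / ch_cost_per_day
print(f"OpenSearch: {os_cost_per_day:.2f} units per retained day")
print(f"ClickHouse: {ch_cost_per_day:.2f} units per retained day")
print(f"~{improvement:.1f}x better cost per day of retention")
```

Under those assumed numbers, cost per day of retention improves roughly 8.6x, even before counting the 3x smaller cluster.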
After moving logs from OpenSearch to ClickHouse, how did the troubleshooting experience change for engineers—did they miss full-text search or benefit from ClickHouse's analytical speed?
I don't know that they found it any easier. There was an adaptation period, but the people who were doing the troubleshooting were our platform engineers. They were super technical, managing thousands of databases at scale for customers. Managing our infrastructure was not a massive challenge to them. We had very technical people using the system to troubleshoot the system, and they could adapt to using anything really quickly.
The tuning of ClickHouse to get it to work properly took longer than they thought. We gained all the business advantages, and everyone was very happy with that. But initially, it was quite a painful process to get ClickHouse optimized. The initial design was pretty easy, but the performance could be intermittent for queries. We had to make some alterations, but once we got it working, it was good.
I would question whether a company that was so reliant on OpenSearch or Elastic and full text search might face more of a barrier switching to ClickHouse than our team did. ClickHouse now has full text search indexes, so I don't think that'll be a blocker for anyone picking it up now or migrating from OpenSearch.
Full text search indexes, when you're troubleshooting, are a kind of secondary step in the process. You will know that there was an issue around a certain time, and you'll hone in on the time period first, then look at and filter and walk through what was going on or specific errors. We'd set materialized views over specific error messages to make sure we could track them down quickly.
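Conceptually, what those materialized views did can be sketched in a few lines of Python: as raw log rows arrive, a small rollup of error counts per minute is maintained incrementally, so troubleshooting starts from the rollup instead of a scan (the error patterns and field names here are made-up examples, not the actual setup):

```python
from collections import defaultdict
from datetime import datetime

# Minimal sketch of a materialized view over specific error messages:
# maintain a rollup (error pattern x minute -> count) at ingest time,
# so troubleshooting queries hit the small rollup table instead of
# scanning raw logs. Patterns and fields are illustrative.

TRACKED_ERRORS = ("OOMKilled", "connection refused", "disk full")

rollup = defaultdict(int)  # (error, "YYYY-MM-DDTHH:MM") -> count

def ingest(log_line: str, ts: datetime) -> None:
    """Called once per raw log row, like an MV's insert trigger."""
    minute = ts.strftime("%Y-%m-%dT%H:%M")
    for err in TRACKED_ERRORS:
        if err in log_line:
            rollup[(err, minute)] += 1

# Simulated ingest
t = datetime(2024, 1, 1, 12, 0, 30)
ingest("pod web-1 OOMKilled", t)
ingest("upstream connection refused", t)
ingest("pod web-2 OOMKilled", t)

print(rollup[("OOMKilled", "2024-01-01T12:00")])  # 2
```

The point is the workflow described above: hone in on a time window via the cheap rollup first, then drill into the raw logs.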
Regarding the intermittent performance you mentioned during tuning, what specific pain points did you hit—high-frequency small inserts, MergeTree background merges, or did you rely on features like Bloom filters and materialized views?
We always used materialized views. The team was quite technically savvy, so they weren't trying to use it naively. They would leverage all the capabilities of the service from the beginning. We heavily leveraged materialized views to make sure people could see what they needed when they went to look.
There wasn't really a concurrency user issue because the team was quite small, so there were never that many people troubleshooting logs at any one time.
The ingestion rate was a problem, but we would batch it up and ingest it, which left a couple of minutes of latency for the end users. When there was an urgent issue, which was rare, there were some slight performance issues due to that minute or two gap, but the team was generally able to work around it. It didn't happen often enough to need to engineer our way out of it.
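The batching approach described above—trading a minute or two of end-user latency for efficient bulk inserts—can be sketched as a buffer that flushes on size or age. The thresholds here are illustrative, not the values actually used:

```python
import time

class BatchingInserter:
    """Buffer rows and flush them as one large insert when the batch
    is big enough or old enough. This mirrors the common ClickHouse
    guidance of few large inserts over many small ones; the thresholds
    are illustrative assumptions."""

    def __init__(self, flush_rows=10_000, flush_seconds=60.0, sink=None):
        self.flush_rows = flush_rows
        self.flush_seconds = flush_seconds
        self.sink = sink or (lambda rows: None)  # e.g. one bulk INSERT
        self.buffer = []
        self.first_row_at = None

    def add(self, row, now=None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.first_row_at = now
        self.buffer.append(row)
        age = now - self.first_row_at
        if len(self.buffer) >= self.flush_rows or age >= self.flush_seconds:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)   # single bulk insert
            self.buffer = []

batches = []
ins = BatchingInserter(flush_rows=3, flush_seconds=60.0, sink=batches.append)
for i in range(7):
    ins.add({"msg": f"log {i}"}, now=float(i))
ins.flush()  # drain the tail
print([len(b) for b in batches])  # [3, 3, 1]
```

The `flush_seconds` bound is exactly the "couple of minutes of latency" trade-off mentioned above: rows can sit in the buffer for up to that long before end users can query them.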
Specifically, we were running ClickHouse open source on Aiven. Some of the limitations of the Aiven way of hosting were more of a problem than ClickHouse's specific capabilities or nuances. Aiven's database-as-a-service offering was designed around Postgres and how that works, and all the other databases were fitted into that deployment pattern, which wasn't necessarily the most appropriate for all databases. ClickHouse wasn't too bad, but there were some edge cases around backups that required tweaking.
Specifically, I mean the way we took and managed nodes within the cluster, because clustering and scaling databases is the hardest thing to do when doing it yourself. That's where more of our problems came from, not specifically from ClickHouse. We found ClickHouse to be very adaptable to our needs without any negative service impact or business impact. We weren't bound by specific ClickHouse issues; it was more about how we hosted and managed the cluster scale for production.
How do you see ClickHouse positioned today versus the fork you work on and cloud warehouses like Snowflake and BigQuery—what workloads is ClickHouse clearly best for, and where does it lose its edge?
ClickHouse is a great database for large-scale observability, where you need high throughput and ACID compliance is not a hard requirement. ACID compliance is the standard transactional guarantee, essentially what Postgres gives you: whenever a transaction completes, the data is there and available for everyone. ACID compliance is that commitment to something being there when it's committed.
ClickHouse operates on an eventually consistent model. The data might take a while to be present and reliable across all the nodes in a cluster, especially in the cloud offering. That means it's not suitable for some more transactional workloads. Now, it's an OLAP database, so it's not designed to be a transactionally ACID compliant database with high or full consistency. But increasingly, customers need both—they need to be able to run transactional workloads and analytic workloads in the same platform.
At Firebolt, we've built our infrastructure in the cloud with separate storage and compute to gain transactional ACID compliance and high consistency. That can't be achieved in ClickHouse Cloud. We pick up workloads that need high throughput and consistency. They also need to do transactional workloads that they can rely on the responses from. That's where Firebolt plays, and we are winning those deals.
We've also built in more aggressive indexing and optimizations at every step of the query so that your workload is fast out of the box and continues to be fast. Essentially, ClickHouse exposes everything, but you need to get really good at it to tune it to be brilliant. We give you a high level of performance with standard built-in optimizations, and then we also expose the levers for deep control, so you can do the same things you could in ClickHouse if you wanted to.
The differentiation is that Firebolt is ACID compliant and fully consistent. ClickHouse isn't, but it gains higher throughput and concurrency because of it. We find that there are more workflows that fit ours than theirs, and we continue to win when we face them in bake-offs.
The other difference is they have an open source offering, and we don't. The open source offering provides them with a large pool of engineers who are already using it or have picked it up and hit the buffers of complexity that come with managing it. At that point, they can switch them over to the cloud offering with minimal effort. They've got a great business model and mechanism for go-to-market there that works really well, and that's kind of their main superpower at the moment.
When you compare them to things like BigQuery, fundamentally, it's cost. BigQuery and Databricks and others don't do the high concurrency and throughput and super-fast analytics that either Firebolt or ClickHouse do, just by the design of the infrastructure and architecture of the product itself. With BigQuery, when you run a query, it has to scan everything to know where to target the query to find the data. That initial read is very expensive and slow. So you can't run ad hoc exploratory analytics queries on BigQuery without them costing a fortune and taking a long time.
That is not the case with ClickHouse and Firebolt. They have sparse indexing and aggressive indexing so their query only ever reads the amount of data it needs to. It's very targeted to the data it needs to read straight off the bat. That's an efficiency of CPU, memory, and IOPS down to the disk, whereas in BigQuery, you have to read everything and then target down, but you've had two massive reads in there. ClickHouse and Firebolt are winning against BigQuery for cost, speed, and efficiency when it comes to analytics.
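The sparse-indexing behavior described above can be illustrated with a toy model: data sorted by primary key is split into granules, the index keeps only one min/max mark per granule, and a range query touches only the granules whose marks overlap the filter (granule size and data are made-up examples):

```python
# Toy model of sparse (primary-key) indexing: sorted data is split into
# granules, and the index stores only each granule's (min, max) key.
# A range query reads only granules whose ranges overlap the filter,
# instead of scanning the whole table.

GRANULE_SIZE = 4
rows = list(range(100))           # already sorted by the "primary key"

granules = [rows[i:i + GRANULE_SIZE] for i in range(0, len(rows), GRANULE_SIZE)]
marks = [(g[0], g[-1]) for g in granules]   # sparse index: one mark per granule

def range_query(lo, hi):
    """Return matching rows and how many granules were actually read."""
    read = 0
    out = []
    for (g_min, g_max), g in zip(marks, granules):
        if g_max < lo or g_min > hi:
            continue                  # pruned: this granule is never touched
        read += 1
        out.extend(r for r in g if lo <= r <= hi)
    return out, read

result, granules_read = range_query(10, 17)
print(len(result), granules_read)   # 8 rows found, 3 of 25 granules read
```

That pruning is the CPU, memory, and IOPS efficiency being described: the query only ever reads the slice of data it needs, rather than scanning everything first.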
Where do teams usually hit a wall with ClickHouse's complexity—moving from single node to distributed clusters, data modeling challenges like joins and late-arriving data, or something else—and can you give a concrete example of a workload that outgrows ClickHouse?
We have a customer who's a smaller startup where people getting started don't want the complexity of two databases. They don't want to have transactional workloads and analytical work separately. They don't have a volume of analytical workflows that is so vast and critical to their company that a super-fast dedicated ClickHouse engine is necessary. So they might pick it up to start with alongside their Postgres database.
Their relational workloads aren't vast or critical enough for them to need dedicated Postgres and ClickHouse together. We've seen people run those two things and find the complexity of operating them together, and the duplication of data, too much of an issue, so they move to Firebolt.
Now they can run it locally on their own hardware or in the cloud. Initially, they run it locally, it sort of works, grows to scale, and then you hit the normal management overheads of "I don't want to be managing this, I don't want to look at the clusters, I don't want to be validating engine types, I don't want to be turning it off and on all the time."
Features in Firebolt, like engines turning off and turning on when a query arrives, are critical. Sizing them correctly so that they are only using the resources that you need when you need them—those are all critical things.
Regarding where people hit the wall with ClickHouse complexity—it's the really deep and complex queries and joins. ClickHouse isn't designed effectively for that; it has always struggled with complex queries.
Firebolt is the same under the hood at the core, so it would also struggle with that same problem. However, we build in query optimizers at the planning stage so that when things are not looking effective, we rewrite the queries and we cache subparts of the queries so that those things are not as slow or as expensive as they would have been. Where we know there's a better way to write it, we will just execute it that way, ensuring you get the same results.
People fall out with ClickHouse once the initial workload is satisfied and they need to do something else with it. That moving on and doing more complex things with your data and the schema you've designed is where people can trip up.
Does ClickHouse's native Postgres integration actually solve the two-database complexity between transactional and analytical workloads, or is it more of a Band-Aid compared to a unified engine like Firebolt?
I think it's a Band-Aid, and it's clearly following a trend in the market rather than actually solving the customer's needs. It's more for getting your foot through the door than actually solving the problem.
Fundamentally, Postgres still remains the number one database for developers. It has a vast ecosystem of extensions and connectors. Everyone knows how to use it and understands it. At this point, it's ancient in database terms, and for that reason, it's very mature.
When you're attacking a market and you want to win workloads from customers, you need to lower the barrier to adoption by showing that it's compatible and can integrate well with your main database, which will 99 percent of the time be Postgres. So it's more of a business strategy than a technical solution.
I've not seen it work well for them. I've not seen it yet work well for Databricks either. Snowflake bought Postgres companies and is trying to integrate it into their system. So I think that's more of a sign of the competitiveness of the market and trying to make yourself relevant than actually solving the two-in-one issue.
Does ClickHouse Cloud genuinely deliver Snowflake-style decoupled storage and compute elasticity, or are the roots of the tightly coupled open-source architecture still visible?
It's still prevalent. They still haven't got it fully decoupled. Aiven did not have decoupled storage and compute. We were just hosting open source ClickHouse instances as the open source product was intended, with each node fully resourced on a per-node basis.
ClickHouse has done a better job and they have built a decoupled storage and compute, but you still struggle with resource allocation. Scaling up and down is not as fluid as some of the other services.
To be transparent, Snowflake and Databricks do not offer great decoupled storage and compute these days. That design of architecture is 12 years old now. Things have moved on dramatically, and people are still paying too much.
Fundamentally, the challenge with the cloud offering for ClickHouse is that it's too expensive. We have customers who have used the open source version, and the barrier to open source ClickHouse is around the complexity of management. But the cloud offering, at the scale they're running at, is just too expensive, so they are looking for alternatives.
ClickHouse Cloud has improved, but it's not modern enough. We've done a better job, and others are doing better jobs with decoupled storage and compute.
What is it about ClickHouse Cloud's pricing that becomes punitive—storage costs, compute units, or simply a disproportionate markup compared to efficient self-hosted open source—and where do customers usually go when they leave?
They go back to self-hosting, or they come to us—though they often want to host us themselves as well. For those people, in those scenarios, the cloud offering just isn't working.
Often the offering doesn't fit the customer's use case and infrastructure design. The pricing model doesn't work at any scale. The premium on storage is slight, and because ClickHouse is very good at compression, storage is generally not the main driver of pain.
It's the jump from open source costing you relatively little, and the payoff of the management pain—how much value you put on not managing it compared to managing it and paying nothing essentially.
And then it's the inelasticity of the engines and over-provisioning. Even in the cloud, you defer management to them, but you still have to put a lot of effort into managing the cost and scaling up and down. At that point, why bother?
When you say scaling in ClickHouse Cloud isn't fluid, what does that look like in practice—slow node additions, coarse autoscaling forcing overprovisioning, and for less technical teams, what's the 'oh no' moment that makes self-hosting harder than expected?
If they're not a DevOps-type team, it's just basic cluster management, failing over, version upgrades, migration—all that stuff is a massive pain. The first time you need to upgrade because there's either a feature you want in the latest version or the version you're running is being retired, that's when the pain lands because then the migration becomes challenging. It's fundamentally tricky.
ClickHouse Cloud engines are not particularly well suited to most workloads. We have customers that have tried ClickHouse and didn't really get what they needed from the offering as it sits. It's quite inflexible in that regard.
Do you see ClickHouse as a viable platform for AI workloads like vector search and RAG pipelines, or is it stretching itself too thin by pursuing too many features beyond being a fast OLAP engine?
I think it's a fairly standard play that everyone's doing. They're all monetizing AI workloads, so they're good at observability, and why would they not do observability in AI workloads? Vector search uses the vectorized nature of the underlying engine of ClickHouse and Firebolt. Vector indexes are relatively easy to do now, and they're becoming ubiquitous in all databases.
I don't think it's a bad move. I think they need to do it. If you're working with the fundamental architecture and infrastructure of these AI systems such as LLMs, then you're going to lose out if you don't have those capabilities. So I think it's a perfectly good strategic move for them. They'll remain a functional OLAP database in the observability space, regardless of whether they extend into LLM support. Fundamentally, everyone will be well-connected to AI workloads as the market moves in that direction.
Beyond the marketing, what is ClickHouse's durable technical advantage that competitors will struggle to replicate, and what risks—if any—threaten its long-term viability as a standalone company?
I don't really see any risks for them at the moment. The core technology is strong, they have a good brand, they are building the right things. They employ very good engineers. Their "engineers first, marketing second" approach is capturing the market, and they have a good open source funnel. So I don't see any near-term issues or risks for them. I think their growth rate is good. They'll continue to steal workloads from Elasticsearch. I'm not sure that their Postgres effort will pay off, but we'll see.
When migrating logs from OpenSearch to ClickHouse, was there a specific data volume or latency threshold where OpenSearch failed, and what, if anything, does Elasticsearch still do better such as dashboarding or handling unstructured text?
I don't know the performance of the full-text search in ClickHouse well enough to know whether it will achieve the same performance as OpenSearch.
With OpenSearch, the cost was the main issue. OpenSearch instances run on nodes with coupled storage and compute. You are limited by the VM size and the disk footprint you can get on a VM, and that's what drives the cost and scale. It doesn't do compression as well as ClickHouse, so when you need to scale up, you need additional nodes for the additional disk in a way that you don't with ClickHouse. We were limited by the cost of that cluster, which is why we moved to ClickHouse.
How did you find ClickHouse's documentation and community when tackling tuning issues—was it helpful or did you mostly rely on trial and error and source code dives—and how does the developer experience compare to Hadoop?
ClickHouse is built for a power user. But I think that's part of what makes it attractive—it's powerful, but you've got to work to get it to work well. There's a satisfaction that the best engineers get from that. Fundamentally, given its prevalence, if you take time to get good at ClickHouse, there will be jobs for you in companies because lots of people are running the open source version of it.
It's miles ahead of where Hadoop was. I think they are doing great content, they have a good drumbeat, but they employ a lot of people to do it. So I'm not surprised it works well.
How does ClickHouse handle high concurrency and noisy neighbors—will hundreds or thousands of simultaneous dashboard users cause degradation, and how do you mitigate a single massive poorly optimized query?
The noisy neighbor issue is a problem. You don't have isolation like we have in Firebolt, where we isolate engines for specific jobs—so ETL is running on one engine and analytics is running on another. So it is a problem you have to manage, but most companies have some kind of mechanism internally to manage that. You'd rather not have to manage it if you didn't have to, which is why people can get annoyed with it.
Specifically for day-two problems—database upgrades are a pain for everyone. If you're self-hosting, you're going to have a painful couple of weeks, staying up late and working overnight to avoid impact to customers. When you're self-hosting, there'll be a gap in service when you do that upgrade for the majority of things. So you will have downtime, which is suboptimal. Most managed services have ways around that to avoid it.
For concurrency, I don't know the exact queries per second that ClickHouse Cloud can handle, but it's high. I have not heard of people leaving because of concurrency issues.
When Postgres and ClickHouse are kept in sync, what's the usual failure mode—CDC pipeline breakage or difficulty joining across systems—and how does your product eliminate ClickHouse's eventual consistency problem architecturally?
It's a different way of handling the write-ahead log. We have a separated metadata layer that allows us to be 100 percent sure where everything is all the time, and things always write to that. That's where the ACID compliance comes from. We have a very different architectural design.
Upserts are a first-class citizen for us. The design of ClickHouse specifically is append-only and then delete later. So you just add more files, compact them down, and add data. You're not doing inserts and upserts or changes to existing content—you essentially recreate the whole thing and add a new row to add data.
Whereas transactional work generally relies on a customer or request changing something like a date of birth or an order number—going into an existing table and changing a value from one thing to another. That's not how ClickHouse works. We handle that as a primary primitive mechanism, so you can do that on Firebolt and that's why we can handle transactional workflows in a way that ClickHouse can't.
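The append-then-collapse pattern described above can be sketched in a few lines of Python: writes are only ever appends carrying a version number, and the "current" state of a row is reconstructed by keeping the highest version per key—roughly what ClickHouse's ReplacingMergeTree does during background merges. This is a simplified illustration of the idea, not ClickHouse's actual implementation.

```python
def append(log, key, value, version):
    """Append-only write: existing rows are never modified in place."""
    log.append((key, value, version))

def collapse(log):
    """Reconstruct the current table by keeping the latest version per key.
    This is roughly what a ReplacingMergeTree background merge does."""
    latest = {}
    for key, value, version in log:
        if key not in latest or version > latest[key][1]:
            latest[key] = (value, version)
    return {k: v for k, (v, _) in latest.items()}

log = []
append(log, "order-1", {"status": "placed"}, 1)
append(log, "order-1", {"status": "shipped"}, 2)  # an "upsert" is just another append
print(collapse(log))  # {'order-1': {'status': 'shipped'}}
```

Note that until a collapse runs, both versions of the row sit on disk—which is exactly why upserts are cheap to accept but expensive to reconcile in this design.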
When teams try to force transactional patterns into ClickHouse using approaches like ReplacingMergeTree, where does that pattern break down—do background merges and slow vacuuming cause problems?
The background merges take a while. Vacuuming and cleaning up of the deletes, which is how you clear and reclaim space, could then impact you. So it's the management of the disk footprint that starts to become cumbersome.
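The disk-footprint problem can be made concrete: in an append-only store a delete is just another appended row (a tombstone), and the space is reclaimed only when a background merge rewrites the data. A toy sketch, not ClickHouse internals:

```python
def delete(log, key, version):
    # A "delete" is just another appended row, flagged as a tombstone.
    log.append((key, None, version, True))

def footprint(log):
    return len(log)  # every row, live or dead, still occupies disk

def merge(log):
    """Background merge: keep only the latest live row per key."""
    latest = {}
    for key, value, version, dead in log:
        if key not in latest or version > latest[key][1]:
            latest[key] = (value, version, dead)
    return [(k, v, ver, dead) for k, (v, ver, dead) in latest.items() if not dead]

log = [("a", 1, 1, False), ("b", 2, 1, False)]
delete(log, "a", 2)
assert footprint(log) == 3  # the tombstone is still on disk
log = merge(log)
assert footprint(log) == 1  # space reclaimed only after the merge runs
```

Between deletes and merges, the footprint is live data plus dead data—which is the cumbersome gap the answer describes.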
How does Firebolt's pricing model compare to ClickHouse Cloud—is it credit-based or tied to underlying infrastructure, and in bake-offs do you argue lower compute usage, cheaper storage, or both?
Our storage layer is cheaper because we don't add a premium on top of it. We just pass through the cost of the underlying cloud infrastructure, so there's no additional markup on our part there.
It's about efficiency—those optimizations, those fast indexes, those efficient queries reduce CPU usage. The sparse indexing means we target the specific footprint of the underlying data, so you're not reading everything. We reduce at every point the number of resources that you're using—IOPS, CPU, memory.
Our subquery cache also reduces the amount of CPU you're using because if we've already got the result in cache from a previous query or similar query, you won't need to read it from disk. So that offers both performance and cost savings.
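A result cache of the kind described can be sketched as a memoized query executor: equivalent queries hash to the same key, so repeat work is served from memory instead of hitting disk. The names and the trivial normalization here are hypothetical—this is an illustration of the concept, not Firebolt's implementation.

```python
import hashlib

class ResultCache:
    def __init__(self):
        self._cache = {}

    def _key(self, sql):
        # Normalize trivially so equivalent queries share one cache entry.
        normalized = " ".join(sql.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def execute(self, sql, run_query):
        key = self._key(sql)
        if key in self._cache:     # cache hit: no disk read, no CPU spent
            return self._cache[key]
        result = run_query(sql)    # cache miss: do the expensive work once
        self._cache[key] = result
        return result

calls = []
def expensive(sql):
    calls.append(sql)
    return [("2024-01-01", 42)]

cache = ResultCache()
cache.execute("SELECT day, count(*) FROM logs GROUP BY day", expensive)
cache.execute("select day,  count(*) from logs group by day", expensive)
assert len(calls) == 1  # the second query was served from cache
```

A real engine would also have to invalidate entries when the underlying tables change, which is where most of the complexity lives.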
Fundamentally, we are faster and cheaper because we optimize the amount you read, the amount you process, and the amount that the CPU is working on. All of that is minimized to make things faster than anything else.
For your 30-day logs in ClickHouse, did you implement tiered storage to move cold data to object storage, and if not, how did you keep query performance acceptable for older logs?
No, we didn't implement tiered storage. Tiered storage is one of the mechanisms that ClickHouse keeps proprietary; they haven't pushed that down to the open source product. Our product was based on the open source product, so we didn't have access to tiered storage. That is one of their selling points for their cloud offering—you can't get tiered storage if you're running it open source.
But it didn't really have a massive impact on us, because the compression on the SSDs we were running was still good, so we were fine. What we did was attach slower disks to the nodes alongside the SSDs to mimic tiered storage, which let us massively expand the disk footprint. It replicates that behavior, just with two different types of disk storage attached to the nodes in the cloud. We achieved similar results without using object storage.
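The hot/cold split described—fast SSDs for recent partitions, cheaper slow disks for older ones—amounts to an age-based placement policy. A minimal sketch, with hypothetical mount paths and a hypothetical seven-day hot window:

```python
from datetime import datetime, timedelta, timezone

HOT_PATH = "/mnt/ssd"   # fast local SSD for recent log partitions (assumed path)
COLD_PATH = "/mnt/hdd"  # slower, cheaper attached disk for older partitions (assumed path)
HOT_WINDOW = timedelta(days=7)

def placement(partition_date, now=None):
    """Decide which volume a daily log partition should live on, by age."""
    now = now or datetime.now(timezone.utc)
    return HOT_PATH if now - partition_date <= HOT_WINDOW else COLD_PATH

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
print(placement(datetime(2024, 6, 28, tzinfo=timezone.utc), now))  # /mnt/ssd
print(placement(datetime(2024, 6, 1, tzinfo=timezone.utc), now))   # /mnt/hdd
```

Open-source ClickHouse can express this kind of policy with storage volumes and a move rule, which is effectively what attaching two disk types to each node achieves.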
Where is ClickHouse most vulnerable—competitive pressure from products like Firebolt that handle transactional workloads, or from cloud warehouses like Snowflake and BigQuery improving real-time performance?
It's definitely not the big cloud warehouses. There are architectural restrictions in Databricks and Snowflake that mean they'll never achieve the same kind of performance that ClickHouse and Firebolt do.
The real risk for ClickHouse and Firebolt is not meeting enterprise feature requirements and therefore not being able to protect that market. I'm thinking of all the less interesting RBAC and compliance features—if those capabilities don't reach the maturity of the other offerings, then enterprises will stay with the incumbents.
One thing we haven't talked about is Iceberg. As Iceberg emerges and people move their data from proprietary storage within Snowflake and Databricks to Iceberg, it will open up the door for Firebolt and ClickHouse. They can be treated just as query engines then—the data doesn't have to move. If you want to migrate from one system to another, you can try ClickHouse on your current workload without removing any data. Then if the technical outcome is good, you can just switch a single workload from one of those bigger offerings into these engines.
Are ClickHouse and Firebolt ready to act as plug-and-play query engines over Iceberg, and is there a performance penalty compared to using their native managed storage formats?
You lose some element of performance. However, it's more critical that the data is written effectively to Iceberg than the query engine itself. If you write the content to Iceberg in a suboptimal way, it doesn't matter which query engine you bring to it.
There's a segment of optimization for both ClickHouse and Firebolt with their managed storage that you don't get with Iceberg, because Iceberg is just an open table format. But companies are working hard at the moment to bring smarts to Iceberg. I don't think it will be long before it will be effective and high-performing.
What do people consistently misunderstand about ClickHouse—its marketing pitch versus technical reality—that you wish more users appreciated?
I don't have one strong opinion on that, but ClickHouse should be picked up for more workloads. These technologies, both ClickHouse and Firebolt, are super powerful with their vectorized engines, which makes them useful for many different workloads. Even if a system isn't great at everything, it's still powerful and quick enough to handle many more workflows than people currently use it for. Teams should be more open to trying these technologies on different types of workloads rather than limiting them to observability or other specific work.
If you were starting a high-concurrency analytics project today, would you choose open-source ClickHouse first, or would managed options like Firebolt or ClickHouse Cloud make you skip self-hosting?
I'd go with self-hosted first.
Why is self-hosting open-source ClickHouse still your first move for new projects—is it purely cost control and flexibility to avoid vendor lock-in, or is there another reason?
Nothing to do with trust. If you're starting a new business, it's a very volatile market. I don't know if I'm going to be around in a year. I can run ClickHouse on a small server and go from there. If I succeed, I'll move to cloud. But if I don't, I won't have a complex cloud bill or vendor lock-in.
When starting on open-source ClickHouse, does the lack of cloud-only features like S3 tiered storage become a problem quickly, or does ClickHouse's compression let you go far on local or attached disks?
The compression is so good that the majority of use cases will be fine without tiered storage.
Disclaimers
This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.