Growth Rate (y/y)
Note: All information taken from public sources.
Databricks hit $800M in annual recurring revenue at the end of 2021, and Sacra estimates it grew that to $1.24B in 2022. The company makes money by charging customers for use of its platform and for professional services that help them set up Databricks. We estimate that more than 90% of its revenue comes from the platform, with the rest coming from professional services.
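As a quick check, the year-over-year growth implied by the two ARR figures above can be worked out directly. This is a back-of-the-envelope sketch; the variable names are illustrative and the 2022 figure is Sacra's estimate, not a reported number.

```python
# ARR figures quoted in the text, in millions of USD.
arr_2021 = 800.0
arr_2022 = 1240.0  # Sacra estimate for 2022

# Year-over-year growth implied by the two figures.
arr_added = arr_2022 - arr_2021
yoy_growth = arr_2022 / arr_2021 - 1

print(f"ARR added in 2022: ${arr_added:,.0f}M")
print(f"Implied y/y growth: {yoy_growth:.0%}")
```

On these numbers, ARR grew by roughly 55% year over year.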
Databricks has a pay-as-you-go model and bills customers based on their tier, how much processing power they use, and for how long. The more expensive premium and enterprise tiers offer more security, governance, and data processing features, along with higher speeds. Databricks runs on top of Microsoft Azure, Google Cloud, and AWS, each with slightly different charges across tiers and computing power.
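The billing logic described above (tier, amount of compute, and duration, varying by cloud) can be sketched as a simple rate lookup. Everything in this example is an assumption made for illustration: the rate table, the "compute units" measure, and the function and class names are hypothetical and do not reflect Databricks's actual price list or API.

```python
from dataclasses import dataclass

# Hypothetical USD rates per compute unit per hour, keyed by (cloud, tier).
# Placeholder numbers for illustration only, not Databricks list prices.
HYPOTHETICAL_RATES = {
    ("aws", "standard"): 0.10,
    ("aws", "premium"): 0.15,
    ("aws", "enterprise"): 0.20,
    ("azure", "standard"): 0.11,
    ("azure", "premium"): 0.16,
}


@dataclass
class Workload:
    cloud: str            # e.g. "aws" or "azure"
    tier: str             # e.g. "standard", "premium", "enterprise"
    compute_units: float  # hypothetical measure of processing power used per hour
    hours: float          # how long the workload ran


def estimate_bill(workload: Workload) -> float:
    """Pay-as-you-go estimate: rate for (cloud, tier) x compute used x duration."""
    rate = HYPOTHETICAL_RATES[(workload.cloud, workload.tier)]
    return rate * workload.compute_units * workload.hours


# The same job costs more on a higher tier because the per-unit rate is higher.
job_standard = Workload(cloud="aws", tier="standard", compute_units=8, hours=100)
job_premium = Workload(cloud="aws", tier="premium", compute_units=8, hours=100)
print(f"Standard: ${estimate_bill(job_standard):,.2f}")
print(f"Premium:  ${estimate_bill(job_premium):,.2f}")
```

The point of the sketch is that the bill scales linearly with usage and duration, while the tier and cloud only change the per-unit rate.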
Databricks sells primarily to large enterprises, with contracts touching millions of dollars annually. It has more than 7,000 customers and a net retention rate of more than 150%. Some of its large customers include Shell, CVS Health, Regeneron, T-Mobile, HSBC, and Comcast.
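To make the net retention figure concrete, here is a minimal sketch using the standard definition of net revenue retention: ARR from an existing customer cohort today divided by that cohort's ARR a year earlier, excluding new customers. The cohort numbers below are hypothetical and chosen only to land at 150%, the lower bound of the figure quoted above.

```python
# Hypothetical cohort figures in USD millions, chosen only to illustrate the formula.
starting_arr = 100.0  # ARR from a customer cohort one year ago
expansion = 60.0      # added spend from upsells and higher usage
contraction = 4.0     # reduced spend from downgrades
churn = 6.0           # ARR lost from customers who left

# Revenue from customers acquired during the year is deliberately excluded.
ending_arr = starting_arr + expansion - contraction - churn
net_retention = ending_arr / starting_arr

print(f"Net retention rate: {net_retention:.0%}")
```

A rate above 100% means existing customers, taken together, spend more each year even after accounting for churn and downgrades.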
Note: All information from public sources. Size of the bubble indicates valuation. Vertical axis is on log scale for visual clarity.
Databricks has raised $3.5B from investors such as Franklin Templeton, Counterpoint Global, and Andreessen Horowitz. It is valued at $38B, making it one of the most valuable private companies globally.
At its 2021 revenue, its valuation/revenue multiple is 47.5x, comparable to its key competitor Snowflake, a public company with a revenue multiple of 45x.
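The multiple above is simply valuation divided by annual recurring revenue. The sketch below reproduces it from the figures in the text and, as an extra illustration not stated in the report, applies the same valuation to Sacra's 2022 ARR estimate.

```python
# Figures quoted in the text, in billions of USD.
valuation = 38.0
arr_2021 = 0.8
arr_2022_estimate = 1.24  # Sacra estimate

# Valuation/revenue multiple at 2021 ARR, as quoted in the text.
multiple_2021 = valuation / arr_2021

# Illustrative only: the same valuation measured against the 2022 estimate.
multiple_2022 = valuation / arr_2022_estimate

print(f"Multiple on 2021 ARR: {multiple_2021:.1f}x")
print(f"Multiple on 2022 estimated ARR: {multiple_2022:.1f}x")
```

Holding the valuation fixed, revenue growth alone brings the multiple down from 47.5x to roughly 30.6x.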
However, most enterprise data management/AI companies have lower revenue multiples, such as Teradata (1.6x), C3 (6x), Alteryx (8.3x), and DataRobot (4.5x). Enterprise AI platforms lost considerable valuation amid market volatility, with the average revenue multiple falling to 4.9x from its peak of 28.9x in 2020.
The enterprise AI company cohort consists of C3, Palantir, Alteryx, and Veritone.
Databricks builds open source software for data processing and AI applications and then offers a paid version with additional proprietary features that companies cannot easily replicate on their own. While open source software gives companies the flexibility of not getting locked into a proprietary architecture, most companies don’t have the engineering talent to manage its complexity.
This is where Databricks comes in and sells enterprises a fully managed version of its open source software, with additional utilities like SaaS tools to write queries and connectors to connect data sources. This aspect of Databricks is similar to AWS, which also provides managed services for open source software, but Databricks makes all the open source software it manages, giving it an edge over others.
Databricks started with Apache Spark for running queries on large raw datasets in data lakes. It then expanded its revenue by launching products that tapped into adjacent markets such as AI lifecycle management/MLOps (MLflow), data warehousing (Delta Lake), data visualization (Redash), and BI and analytics (Databricks SQL).
Databricks has both bottom-up and enterprise sales go-to-market (GTM) motions. For its bottom-up motion, Databricks offers a free-forever community edition, a small slice of the full software. Half of its leads come from the community edition, either from customers who want to use the full software or from SDRs who notice heavy usage patterns and pass those leads to sales teams. Databricks also offers a Twilio-like self-service model where anyone can swipe their card and start a free trial without talking to a sales rep, and it provides free training and workshops to get such users started.
When Databricks pitches to CIOs through a traditional enterprise sales motion, it is often endorsed by the data scientists and engineers already using it, which shortens the sales cycle. Among software companies with a bottom-up GTM, Databricks has one of the highest sales headcounts to support its enterprise sales motion.
Databricks was created by the same team that made Apache Spark, open-source software for running queries on data lakes, which are used to store large amounts of raw data cheaply. When Spark was launched in 2009, most data lakes were hosted on-premises on Hadoop, the first OS for data centers. However, running large queries on Hadoop was cumbersome and took a lot of time. Spark found its initial product-market fit by making it easier and faster to run queries on top of data lakes. While it was originally designed to run on top of Hadoop, it can now run on any cloud storage, such as AWS, Google Cloud, or Microsoft Azure.
Databricks’s core offering is managed Spark clusters, groups of machines used for running data analysis. It gives data scientists a web-based portal to create these Spark clusters for running their data analysis workloads. The portal also includes a notebook-like workspace where data scientists can collaboratively write queries in SQL, Python, etc., and a scheduler for running data pipelines on a regular schedule, which data engineers can use as a replacement for Airflow or Prefect.
Databricks has launched several new products in the last few years and bundled them into its core offering.
MLflow is a platform for the machine learning lifecycle, where data scientists build ML models, track their experiments, deploy them to production, and monitor their performance.
Delta Lake is a layer that runs on top of data lakes and speeds up data queries, bringing them close to the query speeds of data warehouses. Delta Lake is the key component of Databricks’s push into the BI and data analytics category to compete with data warehouse companies such as Snowflake, Amazon, and others.
Unity Catalog is a place for admins to manage users, set access privileges, define and enforce data access policies, and check the usage levels of different users to prevent costs from snowballing.
Databricks SQL lets users run SQL on top of Delta Lake, create visualizations, and build and share dashboards. It is aimed at data analysts in organizations that are used to running queries in SQL.
As the data management market evolves, Snowflake and Databricks are trying to eat each other’s lunch. Snowflake is traditionally known as a data warehouse for business analysts and data engineers, and Databricks as a way to query data lakes for data scientists and ML engineers. 70% of Databricks customers use Snowflake as their data warehouse for business intelligence and Databricks for running ML workloads. But these lines are beginning to blur.
Snowflake added data science offerings such as Snowpark, support for Python, and Snowflake for Apache Iceberg. On the other hand, Delta Lake, Databricks SQL, and Unity Catalog, wrapped together as the Data Lakehouse, position Databricks in the data warehouse market. Over the next few years, the market is expected to keep growing fast enough for both Databricks and Snowflake to grow without losing market share, as has happened in the cloud market with AWS, Google, and Azure.
Databricks also competes against specialized solutions in the data management and data science spaces that run specific tasks. For instance, Databricks’s scheduler is similar to Airflow, and its MLflow offering competes with DataRobot and Alteryx. Databricks has an advantage in owning the whole pipeline, from data coming in all the way to deploying ML models. But it’s also more expensive than specialized applications.
COVID forced legacy companies and startups to start selling digitally. They are racing to make their personalized consumer experience as good as Amazon’s or TikTok’s by doubling down on AI. This means pulling in data from all sorts of customer touch points, like front-end clickstreams from Segment, payments from Stripe, and conversion data from Google Ads, running them through ML models, and feeding the results back into their websites and apps.
With cheap cloud storage and fast networks, most companies are shifting from analyzing organization data in ERP and customer data in Salesforce to putting all data in a central data store. This helps them better understand what happened (business intelligence) and what will happen (predictive analytics). Data centralization is a tailwind for Databricks as it is built on data lake technology, where you can throw in all data without worrying about the type/source of data.
Taken together, these tailwinds point towards a future where all data processing happens on a single platform. However, currently, most companies use data warehouses for running real-time business intelligence operations and data lakes for ML/data science projects. Databricks is betting that as more things go digital, the explosion of data will make it impossible for companies to run two parallel large-scale data stores, which will converge into one.
The bull case for Databricks is that it will capture the lion’s share of this future, with its Data Lakehouse offering becoming the industry standard for data centralization as Salesforce became the industry standard for customer data.
One of the major risks for Databricks is that as COVID wanes, consumers and enterprises may return to their pre-pandemic behavior, and the shift to digital consumption may not turn out to be as massive as expected. If so, enterprises may slow their spending on multi-million dollar software projects like Databricks.
Stickiness of data warehouse
Databricks is betting on a future where companies will stop using separate software for data warehouses and instead shift to Databricks for all data processing and storage requirements. However, data warehouses are highly sticky products, like ERPs, and not easy for a large enterprise to rip out. This can make Databricks’s sales cycles very long and put a cap on its serviceable market.