Note: All information taken from public sources.
Sacra estimates that Databricks hit about $1.275B in annual recurring revenue (ARR) at the end of 2022 and is on track to hit $1.9B in ARR at the end of 2023.
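As a quick check on what these estimates imply, the sketch below computes the year-over-year ARR growth from the two figures quoted above. The function name `yoy_growth` is illustrative and is not taken from the report.

```python
# Sacra's ARR estimates quoted above, in billions of USD.
arr_end_2022 = 1.275
arr_end_2023 = 1.9


def yoy_growth(previous: float, current: float) -> float:
    """Return year-over-year growth as a fraction (0.5 means 50%)."""
    return (current - previous) / previous


growth = yoy_growth(arr_end_2022, arr_end_2023)
print(f"Implied ARR growth, end of 2022 to end of 2023: {growth:.0%}")
```

Under these estimates, ARR would grow by roughly half over the year.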
Databricks reported crossing the $1B annual recurring revenue (ARR) mark in August 2022 and bringing in $1B in total revenue for 2022.
It makes money by charging customers for use of the platform and for professional services that help them set up Databricks. We estimate that about 80% of its revenue comes from the platform, with the rest coming from professional services. In April 2023, its data warehousing product, Databricks SQL, hit $100M in ARR one year after launching.
Databricks has a pay-as-you-go model and bills customers based on their tier, how much processing power they use, and for how long. The more expensive premium and enterprise tiers offer stronger security and governance, higher speeds, and more data processing features. Databricks runs on top of Microsoft Azure, Google Cloud, and AWS, each with slightly different charges across tiers and computing power.
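The pay-as-you-go structure can be sketched as a rate that depends on cloud and tier, multiplied by the processing power used and the duration. Everything in the example is an assumption for illustration: the rate table `ILLUSTRATIVE_RATES`, the function `estimate_bill`, and all dollar figures are hypothetical placeholders, not Databricks's actual price list.

```python
# Hypothetical hourly rates in USD per unit of processing power, keyed by
# (cloud, tier). These are placeholders, NOT Databricks's real prices.
ILLUSTRATIVE_RATES = {
    ("aws", "standard"): 0.10,
    ("aws", "premium"): 0.15,
    ("aws", "enterprise"): 0.20,
    ("azure", "standard"): 0.11,
    ("azure", "premium"): 0.16,
    ("gcp", "premium"): 0.16,
}


def estimate_bill(cloud: str, tier: str, processing_units: float, hours: float) -> float:
    """Pay-as-you-go estimate: tier rate x processing power x duration."""
    rate = ILLUSTRATIVE_RATES[(cloud, tier)]
    return rate * processing_units * hours


# Same workload (8 units for 100 hours) on two tiers of the same cloud.
standard = estimate_bill("aws", "standard", processing_units=8, hours=100)
enterprise = estimate_bill("aws", "enterprise", processing_units=8, hours=100)
print(f"standard: ${standard:,.2f}, enterprise: ${enterprise:,.2f}")
```

The point of the sketch is only that the bill scales linearly with usage and time, while the tier and the underlying cloud set the unit rate.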
Databricks sells primarily to large enterprises, with contracts touching millions of dollars annually. It has more than 7,000 customers and a net retention rate of more than 150%. Some of its large customers include Shell, CVS Health, Regeneron, T-Mobile, HSBC, and Comcast.
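For readers unfamiliar with the metric, net retention compares the recurring revenue a fixed cohort of customers generates today with what the same cohort generated a year earlier. The sketch below uses the standard definition; the cohort figures are hypothetical and chosen only to reproduce a 150% rate.

```python
def net_revenue_retention(cohort_arr_start: float, cohort_arr_end: float) -> float:
    """ARR from a fixed customer cohort now, divided by its ARR a year ago.

    Expansion raises the ratio; downgrades and churn lower it.
    New customers acquired during the year are excluded.
    """
    return cohort_arr_end / cohort_arr_start


# Hypothetical cohort: $10M of ARR a year ago, $15M from the same customers today.
nrr = net_revenue_retention(cohort_arr_start=10.0, cohort_arr_end=15.0)
print(f"Net retention: {nrr:.0%}")
```

A rate above 100% means existing customers as a group spend more each year even before any new customers are added.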
Databricks has raised $4B from investors such as Franklin Templeton, Counterpoint Global, and Andreessen Horowitz. It was valued at $43B as of its last primary round in September 2023, making it one of the most valuable private companies globally.
Per our estimation of Databricks' last twelve months (LTM) revenue of $1.375B, its valuation/revenue multiple at the time of this round was roughly 31x, comparable to its key competitor Snowflake, a public company with a revenue multiple of about 40x.
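The multiple quoted above follows directly from the two figures in the text. A minimal worked calculation, with variable names chosen for illustration:

```python
# Figures from the text, in billions of USD.
valuation_usd_b = 43.0      # valuation at the September 2023 primary round
ltm_revenue_usd_b = 1.375   # Sacra's estimate of last-twelve-months revenue

multiple = valuation_usd_b / ltm_revenue_usd_b
print(f"Valuation / LTM revenue: {multiple:.1f}x")
```

This gives a little over 31x, consistent with the rounded figure in the paragraph.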
However, most enterprise data management/AI companies have lower revenue multiples, such as Teradata (2.5x), C3 (11x), Alteryx (3.1x), and DataRobot (4.5x). Enterprise AI platforms have lost considerable value amid market volatility, with average revenue multiples falling to 4.9x from a peak of 28.9x in 2020.
The enterprise AI company cohort consists of C3, Palantir, Alteryx, and Veritone.
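To put the cohort figures in proportion, the sketch below computes how far the average multiple has compressed from its peak and how the roughly 31x Databricks multiple compares with the current cohort average. All inputs come from the text; the variable names are illustrative.

```python
# Average revenue multiples for the enterprise AI cohort, from the text.
peak_multiple = 28.9      # 2020 peak
current_multiple = 4.9    # current average

# Databricks's approximate multiple at its September 2023 round, from the text.
databricks_multiple = 31.0

compression = (peak_multiple - current_multiple) / peak_multiple
premium = databricks_multiple / current_multiple

print(f"Cohort multiple compression from peak: {compression:.0%}")
print(f"Databricks multiple vs. cohort average: {premium:.1f}x")
```

On these numbers the cohort average has fallen by more than four-fifths, and Databricks is valued at several times the cohort's current average multiple.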
Databricks builds open source software for data processing and AI applications and then offers a paid version with additional proprietary features that companies cannot easily replicate on their own. While open source software gives companies the flexibility of not being locked into a proprietary architecture, most companies don’t have the engineering talent to manage its complexity.
This is where Databricks comes in: it sells enterprises a fully managed version of its open source software, with additional utilities such as SaaS tools for writing queries and connectors for data sources. In this respect Databricks resembles AWS, which also provides managed services for open source software, but Databricks created the open source software it manages, which gives it an edge over others.
Databricks started with Apache Spark for running queries on large raw datasets in data lakes. It then expanded its revenue by launching products that tapped into adjacent markets such as AI lifecycle management/MLOps (MLflow), data warehouse (Delta Lake), data visualization (Redash), and BI and analytics (Databricks SQL).
It runs both a bottom-up and an enterprise sales go-to-market (GTM) motion. For its bottom-up motion, Databricks offers a free-forever community edition, a small slice of the full software. Half of its leads come from the community edition, either from users who want the full software or from SDRs who notice heavy usage patterns and pass the leads to sales teams. Databricks also offers a Twilio-like self-service model where anyone can swipe a card and start a free trial without talking to a sales rep, and it provides free training and workshops to get such users started.
When Databricks pitches to CIOs through its traditional enterprise sales motion, it is often endorsed by the data scientists and engineers already using it, which shortens the sales cycle. Among software companies with a bottom-up GTM, Databricks has one of the highest sales headcounts to support its enterprise sales motion.
Databricks was created by the same team that made Apache Spark, open-source software for running queries on data lakes, which are used to store large amounts of raw data cheaply. When Spark was launched in 2009, most data lakes were hosted on-premises on Hadoop, the first OS for data centers. However, running large queries on Hadoop was cumbersome and took a lot of time. Spark found its initial product-market fit by making it easier and faster to run queries on top of data lakes. While it was originally designed to run on top of Hadoop, it can now run on any cloud storage such as AWS, Google Cloud, or Microsoft Azure.
Databricks’s core offering is managed Spark clusters, groups of machines used for running data analysis. It gives data scientists a web-based portal for creating these Spark clusters and running their data analysis workloads. The portal also includes a notebook-like workspace where data scientists can collaboratively write queries in SQL, Python, etc., and a scheduler for running data pipelines on a regular schedule, which data engineers can use as a replacement for Airflow or Prefect.
Databricks has launched several new products in the last few years and bundled them into its core offering.
In June 2023, Databricks acquired MosaicML for $1.3 billion, a move aimed at bolstering its capabilities in training large language models (LLMs) and image generation models. MosaicML has developed tools and infrastructure to simplify and reduce the cost of running LLMs from data preparation to training and managing infrastructure.
Training LLMs like GPT-3 involves significant costs, but MosaicML claims to be able to train GPT-3 quality models for its customers for as little as $325K (compared to $368K for Google LaMDA, $1M for Bloom, and $841K for GPT-3).
These prices from MosaicML also include a full suite of MLOps tools, further reducing the additional personnel required to train a model reliably.
MLflow is a platform for the machine learning lifecycle, where data scientists build ML models, track their experiments, deploy them to production, and monitor their performance.
Delta Lake is a layer that runs on top of data lakes and speeds up data queries to levels similar to those of data warehouses. Delta Lake is the key component of Databricks’s push into the BI and data analytics category to compete with data warehouse companies such as Snowflake, Amazon, and others.
Databricks SQL is a data warehouse that lets users run SQL on top of Delta Lake, create visualizations, and build and share dashboards. It is aimed at data analysts in organizations that are used to running queries in SQL.
As the data management market evolves, Snowflake and Databricks are trying to eat each other’s lunch. Snowflake is traditionally known as a data warehouse for business analysts and data engineers, and Databricks as a way to query data lakes for data scientists and ML engineers. 70% of Databricks customers use Snowflake as their data warehouse for business intelligence and Databricks for running ML workloads. But these lines are beginning to blur.
Snowflake added data science offerings such as Snowpark, support for Python, and Snowflake for Apache Iceberg. On the other hand, Delta Lake, Databricks SQL, and Unity Catalog, wrapped inside the Data Lakehouse, position Databricks in the data warehouse market. Over the next few years, the market is expected to grow fast enough for both Databricks and Snowflake to grow without losing market share, something we see in the cloud market with AWS, Google, and Azure.
Databricks also competes against specialized solutions in the data management and data science spaces that run specific tasks. For instance, Databricks’s scheduler is similar to Airflow, and its MLflow offering competes with DataRobot and Alteryx. Databricks has an advantage in owning the whole pipeline, from data coming in all the way to deploying ML models. But it’s also more expensive than specialized applications.
Databricks has a few clear avenues for TAM expansion both through ongoing trends and through new products and services.
Databricks, through its acquisition of MosaicML, steps into a competitive arena against the likes of OpenAI by offering tools and infrastructure to companies for creating their own AI applications from scratch. Unlike OpenAI, which offers proprietary models, Databricks's approach empowers companies to harness their own data for AI model training, thus appealing to organizations keen on retaining control over their data and AI assets.
The revenue model of Databricks relies on subscriptions to its toolset and charges based on usage. The addition of MosaicML's cost-effective model training infrastructure could attract a wider customer base looking to reduce the financial barriers in training large models.
This potentially increases the subscription and usage revenue for Databricks. Moreover, the competitive pricing for training large models could become a distinctive selling point that drives additional revenue.
New partnerships with Nvidia (to optimize its software for Nvidia-powered servers) and Microsoft (to offer a version of Databricks's software through Azure) show promising movement in this direction, as well as suggest a growing ecosystem outside the walled garden of OpenAI and its proprietary models.
Its stance as an alternative to OpenAI places Databricks favorably in a market where companies like SAP are exploring both closed-source and open-source models to avoid dependency on a single entity.
COVID forced legacy companies and startups to start selling digitally. They are racing to make their personalized consumer experience as good as Amazon's or TikTok's by doubling down on AI.
This means pulling in data from all sorts of customer touch points, such as front-end clickstreams from Segment, payments from Stripe, and conversion data from Google Ads, running it through ML models, and feeding the resulting data streams back into their websites and apps.
With cheap cloud storage and fast networks, most companies are shifting from analyzing organization data in ERP and customer data in Salesforce to putting all data in a central data store. This helps them better understand what happened (business intelligence) and what will happen (predictive analytics). Data centralization is a tailwind for Databricks as it is built on data lake technology, where you can throw in all data without worrying about the type/source of data.
Taken together, these tailwinds point towards a future where all data processing happens on a single platform. However, currently, most companies use data warehouses for running real-time business intelligence operations and data lakes for ML/data science projects. Databricks is betting that as more things go digital, the explosion of data will make it impossible for companies to run two parallel large-scale data stores, which will converge into one.
The bull case for Databricks is that it will capture the lion’s share of this future, with its Data Lakehouse offering becoming the industry standard for data centralization, much as Salesforce became the industry standard for customer data.
Stickiness of data warehouse: Databricks is betting on a future where companies will stop using separate software for data warehouses and instead shift to Databricks for all data processing/storage requirements. However, data warehouses are highly sticky products, like ERPs, and not easy for a large enterprise to rip out. This can make its sales cycles very long and put a cap on its serviceable market.