Conor McCarter is co-founder at Prequel. We talked to Conor to learn more about the competing data integration business models of Fivetran and Airbyte, better understand the tailwinds of the "modern data stack", and assess the stability of the Snowflake-dbt-Fivetran triad.
- Can you give us your brief history on the data integration space and the rise of Fivetran?
- Can you talk about a few concrete use cases for data integration?
- To what degree is the lack of a real-time pipeline a limitation for Fivetran? Does the lack of a “real-time data warehouse” pose a headwind here for some applications long-term?
- Can you speak to the technical challenge aspect of building the kinds of connectors between apps and data warehouses? What are the big challenges that make this difficult?
- Can you talk about the Fivetran model for data integration based on proprietary connectors? How does Fivetran maintain the connectors? What are the merits and challenges of this approach?
- Can you talk about the Airbyte model for data integration based on open source? Who are the maintainers and what incentivizes them to maintain the connectors? What are the merits and challenges of this approach?
- Is the problem of data integration heavily concentrated in the top ~5 or so SaaS apps, e.g. Salesforce? Or, is this a problem related to the long tail of SaaS apps that every company uses?
- What do you see as some of the key tailwinds propelling data integration and these companies like Fivetran?
- What makes it a better experience to do this as a built-in feature versus third party? And also what does that unlock for the SaaS companies to be able to do that themselves?
- What do you make of the thesis that top SaaS apps might bypass the data warehouse entirely by instead building their own analytics suites so business users can ask + answer questions without having to move data around?
- When people think of the ‘modern data stack’ they often think of Fivetran, a cloud data warehouse like Snowflake, and dbt for transformation. Can you talk about how that came to be and how stable you see that configuration going forward?
- What are the merits and limitations of centralizing your data integration processes into one tool like Fivetran or Airbyte instead of having them distributed across all of the SaaS vendors that you use?
- Do you see a world where Snowflake/Redshift/etc. build their own version of this to facilitate more tight vertical integration?
- In five years, if everything goes right for Prequel, what does it become? How will the world change as a result?
- Anything else interesting you think is important to talk about?
- I'd also like to get your take on this idea or thesis that every B2B SaaS app is going to be built on the data warehouse directly.
Can you give us your brief history on the data integration space and the rise of Fivetran?
The data integration space as a whole kind of came about because of two concurrent trends: one is SaaS fragmentation and the other is cloud data warehousing.
If you look at the rise of all these SaaS apps over the last several years, what we’ve got now is a lot of different applications solving a lot of specific use cases across the business. That was really great for solving those individual problems, but it’s made it more difficult to gather any cohesive data about everything—instead, your data is kind of spread out across all these different applications.
On the other hand, you’ve had cloud warehouses taking off, really helping companies process it all—but before Fivetran, the way that data got into the cloud data warehouse was via data engineers who would scrape the APIs that SaaS apps expose in an attempt to centralize that data into that warehouse.
What Fivetran realized was, "Hey, if every company is hiring data engineers to write the exact same code, it would make a lot more sense for us to write it once really well and sell it to every company."
Fivetran beat Airbyte to that by a number of years, and Stitch has been around for a while too. But a few years ago Fivetran really took the lead, and they’ve been the leader in the space for a while now.
Can you talk about a few concrete use cases for data integration?
Yeah, so there's a few common use cases that I like to throw out there.
For one, let's say you want to report on all the transactions that have been processed by your payment provider, for example, to report on your monthly revenue. Extracting payment data is a common use case that you'll see if you look at the Fivetran connectors or Airbyte connectors or even the SaaS companies that are offering first-party integrations with warehouses.
In the marketing category, let's say you want to analyze your ads that are performing best so you can make better marketing decisions—or maybe you’re doing analytics for your customer support team and you want to query the complaints and suggestions that are flowing through your help desk.
There are a number of popular verticals where getting data into the data warehouse is a really common problem, but those three are a few of the marquee use cases that come to mind.
Most existing ETL infrastructure—Fivetran, Stitch, Airbyte—is actually all built on top of their sources’ APIs, so the actual data coming out is the same a lot of the time.
I think the use case is really the more important distinction here: whether you're trying to sync that data for analytics or whether it's more of a transactional use case.
If you're a customer of a Stripe or a Modern Treasury, you might use the API to initiate the transfer of money and you might use the GET endpoint on the API to check the status of that particular transfer.
But if you're reporting on the last month of transactions, you'd probably prefer to have a replica of that entire data set in your own environment for analysis. In both those cases, the actual data that is behind the scenes is exactly the same.
Theoretically, you can find the same data via either method, but the way that it's reported and the model by which it's delivered are where the big distinction is.
To summarize, it's really about the use case, whether it's a transactional use case or an analytics use case.
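As a rough illustration of the distinction above, the sketch below contrasts a single-record lookup (the transactional case) with an aggregation over a full local replica (the analytics case). The record fields, IDs, and function names are hypothetical and do not reflect any vendor's actual API or schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical payment records; field names are illustrative only.
TRANSFERS = [
    {"id": "tr_1", "amount_cents": 12000, "status": "succeeded", "created": date(2022, 5, 3)},
    {"id": "tr_2", "amount_cents": 4500, "status": "failed", "created": date(2022, 5, 17)},
    {"id": "tr_3", "amount_cents": 8000, "status": "succeeded", "created": date(2022, 6, 2)},
]


def get_transfer_status(transfer_id):
    """Transactional access: look up one record, as a GET endpoint might."""
    for transfer in TRANSFERS:
        if transfer["id"] == transfer_id:
            return transfer["status"]
    raise KeyError(transfer_id)


def monthly_revenue_cents(replica):
    """Analytics access: scan a full replica and total successful transfers by month."""
    totals = defaultdict(int)
    for transfer in replica:
        if transfer["status"] == "succeeded":
            totals[transfer["created"].strftime("%Y-%m")] += transfer["amount_cents"]
    return dict(totals)
```

Both functions read the same underlying records. What differs is the access pattern: one record at a time versus the whole data set at once.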
To what degree is the lack of a real-time pipeline a limitation for Fivetran? Does the lack of a “real-time data warehouse” pose a headwind here for some applications long-term?
This goes back to what we were saying earlier about the difference between the analytics versus the transactional use cases.
For analytics use cases, real-time doesn't seem to be a huge issue for data teams. If the data's going to end up in a report or a dashboard or a query in some sort of ad hoc analysis, a 15-minute delay is totally fine.
That said, when you get into some transactional use cases that straddle the line between what is transactional and what is an analytics query, I think there are specific use cases that are difficult without the real-time focused data warehouse that you were mentioning.
The Drizly team put out a really interesting article about a use case that they had that could not be met by their existing data warehouse for cart abandonment notifications.
For that, they actually had to find a specific, real-time focused data warehouse for that use case in particular.
There is a smattering of those use cases where you do need something that's a little bit more focused on the real-time problem, but what we're hearing from the market is that for almost all use cases, a delay of a few minutes is totally fine.
Can you speak to the technical challenge aspect of building the kinds of connectors between apps and data warehouses? What are the big challenges that make this difficult?
Of course. If you think about these connectors, I think there's a few different categories of challenges that make this really difficult.
The first one is performance, making sure that the connectors that you have built can scale really well into very high volumes and fast frequencies.
The next one is obviously reliability. As the different systems are changing—whether it's the API schemas or the protocols to access them—these ETL providers have to think a lot and spend a lot of time making sure there won’t be any downtime.
The final one is edge cases.
Every API connector and every data warehouse connector is going to have these weird nuances that you see in 1/10 or 1/1,000 connection attempts, and constantly playing whac-a-mole with those challenges is something that Fivetran does a lot. I'm sure all the ETL providers do, and I know at Prequel, we spend a lot of time dealing with this when it comes to data warehouse connectors.
If I drill down a little bit on the API-specific approach—when Fivetran connects to an API exposed by a SaaS tool, they're heavily reliant on that API without a formal SLA between them and the vendor.
That creates an entirely new level of complexity because that API is not built for Fivetran to extract data into a data warehouse. That API was built over a number of iterations and dev cycles for their customers to do whatever their customers need to do, and rarely is it built for replication at the scale Fivetran is demanding from it.
When Fivetran decides to go ahead and start building this connector, they're at the mercy of whatever that API looks like to try to figure out how to coerce that data out of the API into something that their customers want in the data warehouse. That's really brittle. It takes a lot of time, and there's a ton of challenges there.
That’s one of the reasons that, at Prequel, we've decided to focus on providing SaaS vendors the tools to connect to their customers' data warehouses in a more native way. We can get to that 99th percentile of performance a lot quicker than if we're constantly battling APIs on new sources and playing that whac-a-mole game again.
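To make the API-based extraction described above more concrete, here is a minimal sketch of a connector loop that pages through a vendor API and backs off when it is rate limited. `FakeVendorAPI`, `RateLimited`, and `extract_all` are hypothetical stand-ins written for this illustration. They are not Fivetran's, Airbyte's, or Prequel's code, and real connectors also have to handle schema changes, authentication, and incremental syncs.

```python
import time


class RateLimited(Exception):
    """Raised by the hypothetical API when the caller exceeds its rate limit."""


class FakeVendorAPI:
    """Stand-in for a SaaS vendor's paginated API; not a real client."""

    def __init__(self, records, page_size=2, fail_on_calls=(2,)):
        self.records = records
        self.page_size = page_size
        self.fail_on_calls = set(fail_on_calls)  # call numbers that get rate limited
        self.calls = 0

    def list_page(self, cursor=0):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise RateLimited()
        page = self.records[cursor:cursor + self.page_size]
        next_cursor = cursor + self.page_size
        if next_cursor >= len(self.records):
            next_cursor = None  # no more pages
        return page, next_cursor


def extract_all(api, max_retries=3, base_delay=0.01):
    """Pull every page, retrying rate-limited calls with exponential backoff."""
    rows, cursor = [], 0
    while cursor is not None:
        for attempt in range(max_retries + 1):
            try:
                page, cursor = api.list_page(cursor)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the final retry
                time.sleep(base_delay * 2 ** attempt)
        rows.extend(page)
    return rows
```

Even in this toy version, the extraction logic is shaped by the API's page size and rate limits rather than by what the warehouse needs, which is the brittleness described above.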
Can you talk about the Fivetran model for data integration based on proprietary connectors? How does Fivetran maintain the connectors? What are the merits and challenges of this approach?
What Fivetran's doing is they're listening to their customers, and they're hearing what connectors need to be improved, what connectors need to be built for the first time, and that's how they prioritize their catalog of connectors.
When a critical mass of customers expresses interest in a certain connector, probably with a weight towards the bigger customers, they decide to build that connector. They release it to a select number for a private preview, and ultimately, they'll move that connector to GA for everyone to enroll and start using.
When the connector graduates into GA, I think what they're basically doing is monitoring that connector for breakage around the clock and if something breaks, rushing to get a fix out as soon as possible.
At the same time, they're probably trying to cultivate a relationship with that vendor, hoping that they can get some higher API rate limits or at least a heads up when the API's about to change or when some new feature is being added so they can adapt to those changes quickly.
The merits of this approach are proven by their success: they have a limited number of connectors that companies can really trust and rely on. The marketing or data teams that use them put a lot of trust in those connectors, and Fivetran is able to command a premium for them because their customers feel like they’re the most reliable and accurate connectors that exist on the market.
The challenge is that it’s really expensive to maintain those connectors at Fivetran's level of quality. They can charge more, but I'm sure it also costs them a lot more to keep building and maintaining this long list of connectors around the clock.
Can you talk about the Airbyte model for data integration based on open source? Who are the maintainers and what incentivizes them to maintain the connectors? What are the merits and challenges of this approach?
Airbyte is taking the open source approach to building pretty much the exact same product as Fivetran, down to some of the terminology they use to describe the product.
I think what they're trying to do is make the bet that there are just too many SaaS companies their customers want data from for Fivetran to get to all of them, so they're marketing themselves as the ETL provider for the long tail of connectors.
If I think about the merits of this model, the main one is that Airbyte can support more connectors than the catalog of 150 to 200 that Fivetran has out there.
The challenges are obviously kind of inherent: whereas Fivetran's taking a lot of time to develop their connectors and make sure there's a lot of integrity there, Airbyte's value proposition is the number of connectors—not necessarily the quality of those connectors.
It’s cool, and it’s early, but to date, they just haven't seen the same level of quality out of each of those connectors that Fivetran provides. The challenge everybody's going to have is just maintaining those connectors to the same level of quality using community contributions, to the extent where you can actually rely on them for that long tail.
Based on the conversations I've had with data teams, if a company uses Airbyte and they're noticing that they can't get data out of one of their SaaS vendors, they might have to write the connector themselves. If they build that connector and it works internally, there's a small chance they'll try to open a pull request to publish it to the open source repo.
I'm not sure, however, that they have the incentive to keep that connector updated with the latest improvements to the API if it's not top of mind. Whereas with a company like Fivetran, I do believe that they are constantly thinking, "How do we make these connectors reliable for the customers that are using them, but continue improving them for our new customers who are adopting them?"
Is the problem of data integration heavily concentrated in the top ~5 or so SaaS apps, e.g. Salesforce? Or, is this a problem related to the long tail of SaaS apps that every company uses?
I do believe that the long tail is very important for data integration. Fivetran focuses on their 150-200 connectors, but if you look at their incentives, they obviously want to focus on the products with the biggest customers (with the biggest bank accounts) and the largest customer base.
That said, if you go talk to the Seed or Series A or Series B companies, it becomes pretty clear that their customers do want data warehouse integrations—whether they have customers that just set up their warehouse for the first time or whether they're trying to sell into those enterprise markets and trying to compete with the Salesforces or the other companies with the connectors already built in.
Data warehouse integration is an extremely important problem and increasingly becoming something that companies are concerned about, and I think we'll continue to see companies investing in integration as a capability, whether it's via some sort of outsourced connector via Airbyte or Fivetran or by building the connector themselves.
What do you see as some of the key tailwinds propelling data integration and these companies like Fivetran?
There’s about three or four different things all happening in parallel.
Obviously, there are new SaaS applications every day, and more data that exists in these little pockets of apps all over the business.
Then there’s the Snowflakes, BigQuerys, Redshifts, and Databricks of the world, which are doing a really good job of getting a foothold earlier and earlier in companies.
Thirdly, I think companies are increasingly actually getting value out of their data warehouses. Once they get that bug, hire a data person, and the data team starts actually delivering value, they get hungry for other potential sources of data that they can get value from.
Finally, I think as companies start offering this—whether it's via a third party like Fivetran or as a built-in feature, a la companies like Segment or Heap—there is a sense of, "Hey, this is a really delightful experience, and we’re able to do really great things with that data feed that these companies are giving us."
Other companies either see that and try to build that to compete, or data teams start demanding it because they saw it in one of their products.
What makes it a better experience to do this as a built-in feature versus third party? And also what does that unlock for the SaaS companies to be able to do that themselves?
SaaS vendors know their customers really well. They're building a product to deliver value to their customers, and they are the ones who know exactly what data their customers want and how they want to analyze it, how they want to observe it, and how they want to query it.
If you think about the fact that that data already exists on the SaaS vendor servers, the next step is just to help them deliver that to their customers so that their customers can make better use of the software they have purchased.
If you assume that they can get it to their customer reliably with integrity, just like best-in-class ETL tools can do, it makes so much more sense for the vendor to offer that as a service rather than pass that responsibility off to some third party who maybe doesn't know the tool, the roadmap, the customer, or how the customer uses that SaaS product.
On the incentives side, data warehouse connectors are a massively profitable SKU for the companies that choose to offer them. This data exists already on the servers. It's not like they have to invest a ton of time into figuring out how to generate this data. It's there. All they need to do is figure out a way to deliver that to the customer.
It's a really easy bump to both activation and retention. We think it's a great low-hanging fruit product to offer to customers of these SaaS tools.
What do you make of the thesis that top SaaS apps might bypass the data warehouse entirely by instead building their own analytics suites so business users can ask + answer questions without having to move data around?
To me, the existence of a company like Fivetran implies that it isn't enough just to put these dashboards and analytics in the tool.
I would bet that if you go look at all the websites of the connectors that Fivetran has built, maybe 90%, 95% would have a reporting analytics feature somewhere in their product.
I think the reason for that is there are just use cases where the analytics suites and the in-app dashboards just won't work no matter how good the embedded tool is: for example, any use case where that data needs to be joined across datasets from other tools or anytime the query or the analysis is kind of specific to the company and maybe doesn't generalize well across the entire customer base.
Finally, these data teams have their own tools and their own specific ways of processing data, and a lot of them are just much more comfortable querying and analyzing that data in their own tools versus logging into an in-app dashboard or some sort of embedded BI to do that analysis.
Pretty often, we'll hear stories of companies that expose that in-app analytics feature but still get bombarded with requests for a deeper, more complex analysis. Or they’ll say the in-app tool just isn't working and ask for some way to get the raw data, whether that means dumping it in S3, loading it into the data warehouse for them, or trying to get some third party to build a connector for them.
When people think of the ‘modern data stack’ they often think of Fivetran, a cloud data warehouse like Snowflake, and dbt for transformation. Can you talk about how that came to be and how stable you see that configuration going forward?
The cornerstone of the stack is that data warehouse, and I really don't think the data warehouse is going anywhere anytime soon, especially if you look specifically at Snowflake’s performance as a public company and their reputation as a warehouse that people really like to use.
With dbt, I think the paradigm they've introduced of building a framework based on SQL and soon Python that processes data in place in the warehouse just makes a ton of sense given the architecture of what a data warehouse does and what it enables. I think those two are pretty solidly in place.
I don't see Fivetran going away anytime soon, but I could imagine them getting nervous whenever a SaaS company behind one of their more popular connectors decides to offer a native data warehouse connection and impair that revenue stream.
Fivetran's going to be around, but I do think they're going to begin facing a little bit of pressure on the connector front from these SaaS vendors—who are going to start saying, “It would actually be best for us just to offer this integration directly with the warehouse, to start facilitating those native connectors with Snowflake and other warehouses, and steal the revenue stream while we're at it.”
What are the merits and limitations of centralizing your data integration processes into one tool like Fivetran or Airbyte instead of having them distributed across all of the SaaS vendors that you use?
Yeah, I think there are a couple ways to approach this.
For one, if you survey a hundred companies, my guess is that the number using just Fivetran or just Airbyte to ingest data from all these different sources will be about 5 percent.
Whether it's multiple ETL tools, whether it's native cloud provided connectors or something like Segment, there are really a lot of places from which companies are already ingesting data.
I do think there's some value in the early stages when you first spin up your warehouse to have that single pane where you can monitor all your upstream connectors, but I think the reality is that companies are going to have a number of these different places where they'll monitor and have to keep tabs on all of that syncing.
My perspective here is that, at some point, it's an entire tool’s job to do all this monitoring, not the job of the ETL provider. That’s because it’s probably not just the sources you're monitoring; it's also all the different transformations that are happening, with sources being one part of that big pipeline. So I actually think where the data comes from is less important than having something in place to monitor it, if monitoring really matters to you.
It sure would be nice to be able to just log in and see one page where it says, "Hey, everything's operating as usual” or “Hey, you got to do one thing here," but I think the reality for most teams is that they’re more consumed by annoying stuff like fixing manual API scraping jobs.
The odds that one ETL tool can or will ever provide that single view of all the connectors are very low.
Do you see a world where Snowflake/Redshift/etc. build their own version of this to facilitate more tight vertical integration?
All of the major data warehouses, including Snowflake, have built some version of a data-sharing product where they make it relatively easy to share data to other companies that are using the same data warehouse and are in the same region. And there’s a little bit more work to share data on the same warehouse logo within a different region.
What I can't imagine is any of these warehouses building integrations to their competitors, especially when this market is so competitive in the near term.
I think they see sharing to the same type of warehouse as a great marketing scheme and great customer acquisition scheme, and I don't see them opening this up to other warehouses for a while.
Strategically, at some point, it'll make sense for someone who's not the leader in the space to do it and make that a key differentiator of their warehouse.
Right now, though, I think they're all focused on their core product and the basic sharing features.
In five years, if everything goes right for Prequel, what does it become? How will the world change as a result?
Within five years, I can't imagine a world where every major B2B SaaS app does not offer native integrations to all the major data warehouses. If everything goes right, they'll all be using Prequel to offer those integrations.
I think what the data stack looks like at that point is that you have your major data warehouses, you have your tooling on top of that warehouse to process the data, and then data warehouse integrations are handled by all your SaaS vendors.
Unlocking those data warehouse integrations will probably move down from the enterprise tier to the growth tier, and eventually become table stakes to closing any major contract in B2B SaaS—I think ultimately, it begins to be lumped in with other table stakes features like SSO, and it just becomes a feature that every SaaS app needs to have to really compete and to close the deals that matter to them the most.
Anything else interesting you think is important to talk about?
One important point here is that within data replication and integration, SaaS integration is a subset, whereas data replication from the transactional databases is a huge market as well. A big portion of Fivetran’s business actually comes from just replicating data internally without even thinking about external SaaS vendors.
I'd also like to get your take on this idea or thesis that every B2B SaaS app is going to be built on the data warehouse directly.
Personally, I struggle to reason about how a modern transactional application could be built on top of someone else's database, where you don't have the ultimate authority over what that database does and how it performs. I think that would be a really, really difficult problem to solve.
I think it's an interesting thing to keep an eye out for in the space—I haven't seen anything that has been successful using that architecture, but it’s definitely interesting.
This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.