Leah Weiss, co-founder of Preql, on delivering clean data to LLMs

Jan-Erik Asplund

Background

We've covered the modern data stack through interviews with dbt Labs CEO Tristan Handy, Julia Schottenstein at dbt Labs, and Sean Lynch at Census, tracking the rise and consolidation of the category from its 2021 peak through the ZIRP hangover.

To learn more about how AI is reshaping data infrastructure and what enterprises actually need to make AI work on their data, we reached out to Leah Weiss, co-founder & co-CEO of Preql ($7M raised, Bessemer).

Key points from our conversation via Sacra AI:

  • Modern data stack adoption—turbocharged by the 2020–2021 hype cycle and aggressive vendor marketing that pushed even smaller, non-technical companies to buy modular “best-of-breed” tooling—often failed to produce defensible ROI because organizations (1) lacked the internal expertise to correctly implement, maintain, and govern an increasingly specialized stack over time and (2) overestimated how directly monetizable their data would be while implicitly benchmarking themselves to Facebook/Google-style advantages. "The marketing dollars in that period went very far in convincing even small ecommerce shops and businesses without any technology function that data was the answer to their problems… we were convinced that we're all Facebook and Google, and most companies are not. You're never going to have that sort of internal skill set. You're never going to have data quite so easily monetizable in that way."
  • As enterprises have looked to LLMs and AI enterprise search apps like Glean & Hebbia to create answer-engine experiences & generate insights around company data, the problem of garbage in, garbage out has created the need for AI data engineer agents that clean data and model out the semantic layer of metric definitions at a fraction of the cost of human data engineers, flipping the ROI equation by driving down the cost of implementations, speeding them up from years to months and enabling a new, sticky data consumption surface for AI chat & copilot. "There are many tools at the application layer that are very powerful, but they assume you have already solved your data problems. They need clean data to be successful. We assume data chaos. We know your data isn't centralized, there is no single source of truth, and most of the cleanup happens in Excel. Building a trusted data layer from chaos is our core value proposition. We help you prep your data for AI on timelines that have been unheard of previously... We're talking months instead of multiple years."
  • As enterprises shift from dashboard-centric BI to interactive AI agents, the leap from “answers” to autonomous decisioning and action will require a step-change in deterministic data prep (cleanliness), semantic consistency (shared metric definitions), and governance—creating a foundational infrastructure wedge for Preql, Cube ($48M raised, Decibel), and Atlan ($206M raised, Insight Partners) to become the trusted data layer for the agentic enterprise. “[M]oving beyond data consumption, asking questions and getting answers, and more to autonomous decision making . . . is within reach, and we're actively thinking about how do we work with the ServiceNows of the world or the UiPaths to not just understand and trust what's going on with all the moving pieces in my business model, but to act on them, with people in the loop and with all of the proper controls, but in a more autonomous way. To get out of the cycle of trying to understand what's happened in the past and simply acting on it dynamically as things shift.”

Questions

  1. Can you start by walking me through the origin story of Preql, starting from your time at WeWork, through the consultancy, and into the product you have today?
  2. You lived through the whole modern data stack hype cycle that peaked around 2021. How would you describe that evolution and where we are today? Does it have to do with this gap between technical infrastructure growing out of scale with business needs?
  3. Has there been an evolution in what Preql does from when you started the company to today? Do you think of it as a different company or as an evolution on the same idea?
  4. The metrics layer seemed to be considered this pivotal category to win in the space for years. Why is it important, and why did all the attempts at it somewhat fail? Is it that AI makes it possible to succeed, or more urgent to buy a solution, or both?
  5. What's going on that makes you not get great answers when you ask a question of an LLM sitting on top of all your data? And what is Preql doing—how do your agents work to address that?
  6. Does Preql replace things like dbt in an organization or any ETL or ELT tools, or is it side by side with those kinds of tools?
  7. What are some of the most popular destinations for the data that people are cleaning in Preql? Is it into enterprise ChatGPT or are people building their own agents?
  8. What do you think about tools like Glean and Hebbia? I'm curious how you think about positioning Preql versus something like Glean, which feels like maybe a prepackaged form of some of this stuff.
  9. During the modern data stack era post-ZIRP, there was a challenge around proving ROI. With Preql now and what you're seeing with this new business, is it cost savings that you're selling, or is it more about revenue enablement for businesses that don't have costs to save on data?
  10. What are the 2026, 2027 ambitions that these kinds of companies are thinking about building? Are there clear paradigms or archetypes of things that they're trying to build that they come to Preql for help preparing to build?
  11. If everything goes right for Preql over the next five years, what do you think the company looks like, and how is the world different?

Interview

Can you start by walking me through the origin story of Preql, starting from your time at WeWork, through the consultancy, and into the product you have today?

I was an early employee at WeWork and arrived there before there was any data team or data infrastructure. I was a data team of one doing a lot of executive reporting. My background is liberal arts, so I wasn't there with a computer science and stats background. I was trying to learn and create value as best I could, and ended up building the data infrastructure and data team from first principles in those early days.

I went through the evolution of the company growth and maturity, but also the maturity of what we now call the modern data stack. It was early days of Fivetran, early days of Looker, and we eventually did a Redshift to Snowflake migration. All of these technologies were being implemented for the first time, and there was a lot of excitement around them. I built many versions of the data infrastructure and data team at WeWork.

I always really liked working with the business stakeholders, and there was a really interesting mix of them at a company like WeWork. You had architects and designers and geospatial technology—just a really interesting mix of skill sets. I tried to position myself as the data person who could actually be approachable and help solve business problems. I did a lot of work with my cofounder, Gabi, there to democratize data and apply the data we had to different problems in the business. That's always been my ethos.

When WeWork went the way that it did, Gabi and I had jobs that we loved. We were going around the business, teaching people how to think about data, solving real business problems in domains that were exciting, and we couldn't go back to normal jobs. We had to think about what would be engaging for the next part of our careers.

We ended up building a consultancy in a very organic way, just talking to folks in our network about the types of challenges they were seeing. We built a company called Data Culture, mostly focused on building data infrastructure and upleveling data teams, but always through the same lens that I'll probably harp on a lot, which is bridging the gap between business needs and technical infrastructure. There's a lot of friction and things get lost in translation where you have data people building infrastructure and business people who need answers to questions. These worlds don't often communicate well.

Preql is a product approach to this same problem. What if we abstracted away everything that’s challenging related to data preparation and encoding business logic and gave business users a seamless way of managing their data directly?

You lived through the whole modern data stack hype cycle that peaked around 2021. How would you describe that evolution and where we are today? Does it have to do with this gap between technical infrastructure growing out of scale with business needs?

When you typically hear analysis on this, you think about it in terms of the modularization of these technologies and then the necessary consolidation. That's definitely an aspect of it, but from my perspective as someone who was using these technologies and in the field on the consulting side, implementing them and selling them to our customers, it was not as much about building the stack or having these technologies across a few different vendors. It was the team and the expertise that you needed to implement them and maintain them well and eventually get to that business ROI.

The marketing dollars in that period went very far in convincing even small ecommerce shops and businesses without any technology function that data was the answer to their problems. To some degree, that potential exists, but Benn Stancil, the Mode founder who's also a friend and investor, always says we were convinced that we're all Facebook and Google, and most companies are not. You're never going to have that sort of internal skill set. You're never going to have data quite so easily monetizable in that way. You have to think about the strategy a little bit differently.

Some of what we're seeing is the hangover from that period, where companies that had very few internal data resources were still spending hundreds of thousands of dollars on data infrastructure and then spending more to try to get value out of it. Even when you bring in expert consultants or hire really good data talent, you're not necessarily closing that gap, because the skill sets of excellent data practitioners and business folks who need to drive decision making are fundamentally incompatible.

Even within all of those companies, you saw the creation of new, even more specialized skill sets that then had to come to market. People had to get trained. You spend a lot of time convincing people that they need technology and specialized resources. At some point, the CFO is going to come to you and try to understand the ROI, and data people—and I put myself in this bucket—didn't have a good enough answer when that moment came.

Has there been an evolution in what Preql does from when you started the company to today? Do you think of it as a different company or as an evolution on the same idea?

Fundamentally, we're solving the same problem that I seem to be fixated on, which is bridging the gap between your data capabilities and your business needs. That is fundamental. But a few things have changed since we originally came to market.

In that world of the modern data stack, there was a switch from platform approaches to custom code that could be version controlled—that's the dbt revolution that was quite powerful in our field. Things went in that direction because this code needed to be human readable, and large teams needed to be able to interact with it. We're in a different era now where we're not optimizing for human readability. We're optimizing for AI to be able to make sense of things and deliver answers in a reliable way. That's one major shift in data transformation and data preparation.

On the data consumption side, dashboards have been the default consumption layer in our field for as long as I've been working in it, which is a surprisingly long amount of time. There's potential here to have a consumption layer that's a little bit more in line with how people think. AI as an interface or natural language as an interface reflects more of your internal thought process, where you have a question, maybe you want to refine that question, maybe you want to find backup data to feel good about where that analysis is taking you. It's more of this refinement and Q&A approach. Whereas with dashboards, all of that is implicit in how the dashboard was constructed, and it may or may not match your thought process. You end up with these really bloated BI implementations that have to answer every permutation of the question.

The last new ingredient is just a sense of urgency. A lot of the CEOs and CFOs that we talk to every day have these mandates to adopt AI and show ROI of those initiatives very quickly, but they're not necessarily well positioned to do so because their data is a mess or they don't have a consistent source of truth to point these LLMs at.

All of these forces came together and changed our approach to the same fundamental problem, and that's to move to a more agentic workflow both in the semantic modeling of your data—building that context layer, building your definitions, building all of the bespoke business logic that is unique to your business—and on the data quality side. How do we take some of these manual offline workflows that are happening in spreadsheets and around your organization and bring them into a map of your data assets that can be leveraged for LLMs?

The metrics layer seemed to be considered this pivotal category to win in the space for years. Why is it important, and why did all the attempts at it somewhat fail? Is it that AI makes it possible to succeed, or more urgent to buy a solution, or both?

I would say AI makes it newly relevant. First, maybe I'll give my perspective on what happened to the semantic layer when it first was being popularized. A lot of these tools were developer focused. You're asking analytics engineers, data engineers, data people in some form or another to now write and maintain more code that reflects business logic. This is a tricky proposition for most data teams. They're already managing hundreds of pipelines, hundreds of dbt models. Now you're saying you also have to write more code to take ownership of the business definition, the business context, which you're not well positioned to do given the mechanics of your organization. It's a lot more work and effort where the return isn't obvious.

The reason that the return wasn't obvious at that point is that the ecosystem of data tooling wasn't really set up to integrate with those metrics layers very well. Even in tools that have their own internal semantic layers, they're still really operating in the worlds of columns and rows. These metrics layers are trying to introduce this idea of a metric or a definition as the atomic unit of data, which if you've worked in data, you know is true. It is the experience of a practitioner that nothing is more important than that metric definition and making it repeatable and reusable. But it's not generally how people are used to interacting with data. You had BI tools that were stuck in the old paradigm, which is how business users typically interface with data, and a bunch of work being done by technical people that wasn't necessarily making its way to them.

The moment has changed pretty considerably. I'm always kind of surprised—we stopped talking about semantic layers for a long time, and now finance people and business people are using the term before we do. There's this resurgence of interest because people are aware that when you put an LLM on top of your data as it exists today and you ask it a question, you're not going to get good results. So there's an understanding that some sort of intermediary is required.

There's a growing understanding of why that is. This new technology is nondeterministic by nature, and you can ask it the same question 10 times and get different results. When you talk about critical business reporting, things need to be transparent. They need to be repeatable. There's something that is fundamentally incompatible about using a nondeterministic technology solution to solve deterministic problems. Yet there's a lot of value on the table, and these CFOs and CEOs are under a ton of pressure to implement it for these use cases.

What you need—and I'm not saying that every CFO thinks about it from this perspective, but they understand this intuitively—is you need some technology to bridge the gap between the LLM and your business-critical workflows to make them repeatable, to make them trusted, to make them governed and secure. You need that even for the most basic use cases. Today when people think about what can an AI do for my data, we're limited in our imagination to ask a question, get an answer. But even for that basic use case, if you don't have this intermediary data layer, you will lose trust incredibly quickly. People don't want to be left behind, and they need to show success quickly.

What's going on that makes you not get great answers when you ask a question of an LLM sitting on top of all your data? And what is Preql doing—how do your agents work to address that?

The fundamental problem, if you wanted to even go down this path of let's just put an enterprise license of OpenAI on top of my data and see what happens, is where would you point it? Data is really siloed and distributed, particularly in these enterprises. Often, there is data infrastructure, there are cloud warehouses, but they might not be housing the critical information, particularly for these business use cases and finance use cases in particular. There's no one place to point them to. You would have to do a huge consolidation effort to even get started.

When you start to get into it, you can ask a simple question, even "how's revenue trending?" Anyone who's worked with a data team knows that there are more ways than you can imagine to calculate simple metrics, and most organizations have these definitions coexisting depending on what tool you're getting the information from, what team is asking the question, whether it's in this dashboard or in this Excel file. Even validating the work is tricky because it's so context dependent.

You might say don't even point it at my data because my data is garbage, and that's something that we hear a lot. If I point this technology at my existing data, it will amplify everything that I already know is wrong, and I'm terrified of that prospect.

We think about solving this challenge in a few different ways. We've got two classes of agents. One is simpler, so I'll start there. You can think about this as just data cleaning, data quality. This is like a proactive, really thoughtful data engineer who is looking at anything you bring into the system and saying, okay, does this line up with data that we already have? Are you introducing contradictory information that we'll have to reconcile? Are there inconsistencies in format? All of these really small things. Any data person will tell you that they spend too much of their time looking for leading spaces or trailing spaces or inconsistent formatting of IDs, small things that can really increase the manual work required to do even simple analysis. These agents will be proactive about surfacing those things and building pipelines as necessary to scrub that data and bring it into whatever your data storage solution is.
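The class of checks described above is deterministic by nature, which is what makes it a good fit for automation. As a toy illustration only (not Preql's implementation; the `customer_id` field and `CUST-` ID convention are hypothetical), a scan for stray whitespace and inconsistent ID formats might look like:

```python
import re

# Hypothetical rule: canonical customer IDs look like "CUST-" + 6 digits.
ID_PATTERN = re.compile(r"^CUST-\d{6}$")

def quality_issues(records):
    """Scan rows for the small, deterministic problems a data engineer
    would flag: stray whitespace and inconsistently formatted IDs."""
    issues = []
    for i, row in enumerate(records):
        for field, value in row.items():
            if isinstance(value, str) and value != value.strip():
                issues.append((i, field, "leading/trailing whitespace"))
        cid = row.get("customer_id", "")
        if not ID_PATTERN.match(cid.strip()):
            issues.append((i, "customer_id", "inconsistent ID format"))
    return issues

rows = [
    {"customer_id": "CUST-000123", "name": "Acme"},
    {"customer_id": " cust_124 ", "name": "Globex "},
]
print(quality_issues(rows))
```

A real agent would surface these findings for review and generate cleanup pipelines, but the point is that every rule is repeatable: the same input always yields the same flags.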

Once you feel better about the quality, then we have what we call semantic modeling agents. You can think of these as agents who are, again, proactive employees who are going around your company to try to figure out what do things mean, how are they defined, where do I need more context, what do I do if these two things don't quite line up between two teams. This class of agent is building that semantic model or essentially a map of all of your data assets and definitions and business context. Typically, we'll have an owner of a business area who can be pinged by the agent to add additional context or maybe clarify something if there's a conflict and obviously to approve any changes if they need to be made.

The combination of these two things is pretty powerful because then you have underlying data that you can trust that's clean, and we feel good about the quality. And then you have this holistic model of the business with all the underlying data context. When you go to ask a question, either via an interface that we provide or in your BI tool or via any sort of consumption layer that you can think of, you are going through this Preql intermediary layer that can guarantee the quality and the definition and the governance of that metric. It's a way of taking this AI technology that isn't used to producing the same results over and over again and bringing it into your system in a deterministic way.
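The intermediary idea can be made concrete with a toy sketch, under stated assumptions: this is not Preql's product, and the metric name, owner field, and SQL fragment are invented for illustration. The key property is that a metric name resolves through exactly one governed definition, so the same question always compiles to the same query regardless of how the LLM phrases it:

```python
# Toy semantic layer: each metric has exactly one governed definition.
# Names and SQL here are hypothetical, for illustration only.
SEMANTIC_MODEL = {
    "revenue": {
        "definition": "SUM(orders.amount) WHERE orders.status = 'completed'",
        "owner": "finance",
        "approved": True,
    },
}

def resolve_metric(name):
    """Deterministically resolve a metric name to its single approved
    definition; refuse anything not in the governed model."""
    metric = SEMANTIC_MODEL.get(name.lower())
    if metric is None or not metric["approved"]:
        raise KeyError(f"no governed definition for metric {name!r}")
    return metric["definition"]

# Asking about "Revenue" twice always compiles to the same definition.
assert resolve_metric("Revenue") == resolve_metric("revenue")
```

The nondeterministic model is confined to interpreting the question; the answer itself is computed from a definition that is owned, approved, and identical on every run.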

Does Preql replace things like dbt in an organization or any ETL or ELT tools, or is it side by side with those kinds of tools?

Typically, we can accelerate any existing implementation. A lot of the organizations that we work with are on the larger side. There's typically some sort of central data solution and some team that manages that. As we all know, a data warehouse initiative is always in flight. It's never quite done. You probably have some tables that are ready, some that are coming, some marts that are published, some that aren't quite there yet, and that's the steady state. On the business side, you've got hundreds of different processes that access some data from the warehouse, some data from the systems, and there's all of these siloed systems and a proliferation of manual processes.

When we work with the business, what we can do is take these offline processes and move them into a holistic semantic model of the business. We can do that in whatever set of tools you use. If we need to build these pipelines in dbt, or if you have something else, we are essentially, again, that employee who works on your team who can integrate with whatever toolset you have.

What are some of the most popular destinations for the data that people are cleaning in Preql? Is it into enterprise ChatGPT or are people building their own agents?

The vanilla use case for consumption is an integration with Teams, or whatever technology a company uses to communicate internally; that's what we've seen be really effective. We don't necessarily want to introduce new data consumption interfaces if we don't have to. Teams and BI tool integrations still come up the most. But we also have our own interface where you can interrogate your data through a chat experience.

More and more, we're seeing folks who have on the road map, oh, we want to build an agent studio or we want to build our own internal AI consumption method, and that's easy enough for us to integrate with. We don't necessarily want to introduce new tools. We want to be the glue that makes all of your tools work better.

What do you think about tools like Glean and Hebbia? I'm curious how you think about positioning Preql versus something like Glean, which feels like maybe a prepackaged form of some of this stuff.

I think the key distinction here is there are many tools at the application layer that are very powerful, but they assume you have already solved your data problems. They need clean data to be successful. We assume data chaos. We know your data isn’t centralized, there is no single source of truth, and most of the cleanup happens in Excel. Building a trusted data layer from chaos is our core value proposition. We help you prep your data for AI on timelines that have been unheard of previously, even when organizations were hiring hundreds of data engineers. We’re talking months instead of multiple years.

The other piece goes to something you flagged earlier, which is the distinction between a finance buyer and someone who's representing the broader AI strategy. Our bias here, given the background of our team, is that we come from the data world. We are building for finance audiences because we think they need a seat at the table, and we don't want them left behind in the way that they were in the cloud era of data. They're the right people to own data governance, critical reporting workflows, all of those things.

That said, there's nothing about our technology—other than the fine tuning of our models—that is specific to a finance audience. We think this works really nicely with a broader perspective on data strategy. For us, it's really about getting the right people in the room to be a part of that strategy and to shape it to make it effective. We're a little bit less focused on what are the top two copilot experiences that finance people want to see. We're more about what is the right data strategy that will unlock every AI initiative that is on your road map today.

During the modern data stack era post-ZIRP, there was a challenge around proving ROI. With Preql now and what you're seeing with this new business, is it cost savings that you're selling, or is it more about revenue enablement for businesses that don't have costs to save on data?

The conversation tends to start with cost saving because the folks we're speaking to are suffering under manual processes, overburdened to the point that they can only really think about freeing up their own time. Efficiency is usually a part of the conversation initially. But very quickly, when you zoom out, the question and the value become much clearer when you talk about, okay, what's on your AI road map, and how do you think about the value there? What if, instead of waiting five to ten years until your warehouse is ready, you could start tomorrow?

We're very focused on efficiency and cost saving and time saving from the perspective of alleviating pain. But when you're making the business case, I prefer the value-based approach. I think there's more meat there.

What are the 2026, 2027 ambitions that these kinds of companies are thinking about building? Are there clear paradigms or archetypes of things that they're trying to build that they come to Preql for help preparing to build?

The first use case that people tend to imagine is what if I could get answers on demand, but we try to push them to go beyond that because that's just the only form factor that some people have seen of these technologies. The other thing that comes up quite a bit is, we have all this data. We don't leverage it very well internally, but we know there's a ton of value there that we could productize.

Or at least if we could wrangle our data effectively and understand how demand is shifting, we could make quicker decisions when it comes to new products, offerings, that kind of thing. It's usually on the revenue side. And how do you find ROI in your data? An obvious way to do that is to think about what do you have, what's your unique perspective, and how could we package it if it was consistent and well understood.

If everything goes right for Preql over the next five years, what do you think the company looks like, and how is the world different?

We'd love to get to the other side of the "is my data ready" question and then measure ourselves by what can we do with our data once it's prepared. I would like to see a Preql implementation as synonymous with a company that has the most ambitious AI road map in their field because of what's been enabled and, you know, attracting the best talent for that reason. People are sold on the potential and stuck in yesterday's problems, which makes it harder and harder to imagine what they want to achieve. I would love to help our customers get to the other side of that.

The other part of that is moving beyond data consumption, asking questions and getting answers, and more to autonomous decision making. That is within reach, and we're actively thinking about how do we work with the ServiceNows of the world or the UiPaths to not just understand and trust what's going on with all the moving pieces in my business model, but to act on them, with people in the loop and with all of the proper controls, but in a more autonomous way. To get out of the cycle of trying to understand what's happened in the past and simply acting on it dynamically as things shift.

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
