David Mlcoch, co-founder & CEO of Asteroid, on browser automation and the last mile problem of AI

Jan-Erik Asplund

Background

We've covered the rise of vertical AI from Harvey (legal) to Hebbia (finance) to Raycaster (life sciences). As browser automation becomes critical infrastructure that enables vertical AI to actually complete workflows end-to-end, we reached out to David Mlcoch, co-founder & CEO of browser automation platform Asteroid (YC W25), to learn more.

Key points from our conversation via Sacra AI:

  • While AI agents can search the web, reason & generate data, they bottleneck on last mile data entry for vertical AI in traditional industries running on 90’s-00’s legacy software without APIs, unable to update key systems of record without a human in the loop. "The voice agent is taking my call and says, ‘Okay, I want to have an appointment tomorrow, 4pm. My name is David.’ The receptionist would usually be on the call and then they would just type it in because their laptop's in front of them. . . The system that is in front of this [voice agent] is a legacy system that has no API, never needed to have an API, and the developer who probably wrote the system is not even alive now because it's so outdated.”
  • Browser automation enables AI agents to programmatically and robustly interact with legacy systems by using websites the way a human would, navigating apps and entering data, with MCP enabling AI voice agents & chatbots to chain browser automation together into end-to-end workflows that can replace humans. "There are some UIs—if you would see them, you would not want the human to touch them. They're so legacy and so outdated, but people can't really update them. . . MCP is just a layer on top of API to make it more readable for LLMs. The prerequisite is that the API exists. . . Instead of the API, we have the browser agent, and then there's an MCP on top of the browser agent. So you have this website accessible through an MCP for LLMs.”
  • Emerging first as developer infrastructure for browser use from companies like Browserbase ($67.5M raised, CRV & Kleiner Perkins), browser automation has expanded to non-technical operations teams via products like Asteroid (YC W25), with the opportunity to eat $2B+ of yearly enterprise spend on incumbent automation companies like UiPath (NYSE: PATH, $9B market cap). “It's a very early and growing ecosystem. As with every new tool, the early adopters are developers. So that's where most of the tools are. . . [If you’re non-technical] you can go and pay someone to use UiPath if you have $50,000 in your pocket. Where Asteroid is trying to be different is that even non-technical teams can go and automate and supervise complex browser tasks. . . The domain expert who knows how to do this healthcare process or insurance or supply chain process doesn't need to go and explain to the developer.”

Questions

  1. Can you tell us what Asteroid is and what inspired you to start the company?
  2. Can you give us a brief history of browser automation—from Selenium to Playwright to now AI-driven approaches? Has browser automation become more capable, or just easier to set up, or is it both?
  3. You mentioned voice agents as an important phenomenon. How are voice agents related to browser automation? Why does it make sense to have browser agents as part of voice AI?
  4. Paint us a picture of the future. What does enterprise software actually look like—are we still building web UIs for humans, or has everything shifted to agent-first interfaces?
  5. MCP promises to give agents structured access to tools and data. Does widespread MCP adoption make browser automation less necessary, or does it always remain as the fallback?
  6. What are some of the major use cases so far that you have come across?
  7. Are you building for those vertical-specific workflows at all, or are you building more horizontal?
  8. Have you done any forward-deployed engineering, like working inside an insurance company to understand their workflows better?
  9. Are you guys using off-the-shelf, whatever the best model is at the time, or is there some sort of multiplexing, any open source fine-tuned models that you guys are using?
  10. OpenAI and Anthropic are teaching models to browse the web directly. Can you talk about what web browsing use cases make sense for folks to use in a ChatGPT Agent or Claude vs. an Asteroid?
  11. The Browser Company has Dia, Perplexity has Comet, and OpenAI reportedly has a browser coming. Are these kinds of AI-native browsers complementary or competitive to what you're building?
  12. There's a tension between making the web more accessible to agents versus websites defending against automation. How does this resolve over the next few years?
  13. How do you map out the ecosystem or market between stuff like Asteroid, between Browserbase, Stagehand, some of these AI-native recent companies?
  14. Is Asteroid using Playwright or what is the underlying technology there?
  15. What makes it valuable to build on Playwright? Why is that a smart decision?
  16. When people use no code, it generates Playwright. Is that then exposed to the end user, or do they interface with it through natural language if they need to make edits or through a visual UI?
  17. Looking at the next five years, if everything goes right for Asteroid, what do you see Asteroid becoming and how is the world different?

Interview

Can you tell us what Asteroid is and what inspired you to start the company?

Asteroid is a platform where teams can build and run AI browser agents. These agents automate complex browser workflows that people still do by hand in web portals, such as form filling, data entry, and scheduling. AI is now able to click, type, and navigate apps, so those legacy portals where people spend millions of hours every year can now be handled by AI. That's what we do at Asteroid: we build those agents and enable teams to build them themselves as well.

For your second question, we actually started with a different idea. Our original idea was that as AI agents get more complex and are given more power by their users, there needs to be human supervision. Our first product, even pre-YC, was a human supervision layer that you put on your AI agents; it escalates when it needs help with permissions and credentials. We noticed this is most useful for browser agents, because a browser agent can control anything on your computer. Technically, it has all the privileges that you have as a human. We needed to have this human supervision baked into the browser agents.

There was also a classic alignment of market and technology. Computer use models were only becoming capable around January or February 2025, which is when OpenAI released computer use. Anthropic actually released their first version of Computer Use in October or November 2024, but it wasn't very usable. It was clear that the next wave after voice agents was going to be browser agents, and that now was the time to start playing with the technology and trying to put it into production as much as possible.

Can you give us a brief history of browser automation—from Selenium to Playwright to now AI-driven approaches? Has browser automation become more capable, or just easier to set up, or is it both?

Definitely both. Let me walk you through the history. The early 2000s brought Selenium. That was the first framework: web apps were becoming interactive, so QA teams needed a way to test them automatically. Browser automation started as QA, and Selenium was the de facto standard for QA automation. It was open source, widely adopted, and influenced basically everything that came after it.

In the 2010s, we got more modern testing frameworks like Puppeteer, usually coming from big tech because they had so much to test. Puppeteer came from Google and controlled Chromium specifically. Then we got Playwright, launched by Microsoft, which is the standard framework for browser automation right now. What was interesting about it is that it was cross-browser, so you could drive Chromium, Firefox, and WebKit (Safari's engine). It's now the gold standard among these frameworks.

But the way they work is that you need to define every HTML selector, which is very brittle. Every time the website changes, there is nothing you can really do about it. Companies like UiPath took the same approach, which was just as brittle because you had to hard-code every path. If the website or your process changes, you need to call your team of developers, and they go and write if-else statements to work around it. Everyone who ever wrote anything in Playwright or Selenium will tell you they hate it and want to jump out the window.

The promise of LLMs is that the LLM will navigate the website itself. It can decide on the fly, "Oh, now I want to click on this. Oh, there is a new popup that I've never seen before. Obviously, I should close it." Something as simple as a popup appearing would completely crash previous automations. That's one way LLMs are helping. The second way we're seeing LLMs help is by writing these Playwright or automation scripts on the fly and then reusing them in the next automations. So it's enhancing behavior at runtime and the code-writing capability offline.
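To make the brittleness he describes concrete, here is a minimal sketch of a traditional hand-coded Playwright script; the portal URL and selectors are hypothetical, and any markup change or unexpected popup would break it:

```typescript
// Minimal sketch of a hand-coded Playwright automation (hypothetical portal and selectors).
import { chromium } from "playwright";

async function bookAppointment(name: string, time: string) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://legacy-portal.example.com/booking");

  // Brittle: each selector must match the current HTML exactly.
  await page.fill("#patient-name", name);
  await page.fill("input[name='appointment_time']", time);
  await page.click("button#submit-booking");

  // No recovery path: a new cookie banner or popup makes the steps above
  // fail, and the whole run crashes until a developer updates the script.
  await page.waitForSelector("text=Booking confirmed");
  await browser.close();
}

bookAppointment("David", "16:00").catch(console.error);
```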

You mentioned voice agents as an important phenomenon. How are voice agents related to browser automation? Why does it make sense to have browser agents as part of voice AI?

Voice technology basically started to be useful last year, and since then many industries have been completely disrupted by voice agents. They usually automate the front office—receptionists, people at call centers, people taking inbound calls, outbound calls. But the point of these calls usually is to get some data in. Let's say I want to go and book an appointment with my doctor, and now they put a voice agent in there. The voice agent is taking my call and says, "Okay, I want to have an appointment tomorrow, 4pm. My name is David."

The receptionist would usually be on the call and then they would just type it in because their laptop's in front of them. So great, the voice agent company has now automated inbounds of those calls. They wrote down this transcript. What's next? The system that is in front of them is a legacy system that has no API, never needed to have an API, and the developer who probably wrote the system is not even alive now because it's so outdated. The only way, actually quite often, to just go and input the data in there is to have a browser agent that'll go sign in using the credentials that the developer or someone else would provide and put in the data. To actually deliver on this end-to-end AI agent automation space, usually voice agents and browser agents need to work in tandem.

Paint us a picture of the future. What does enterprise software actually look like—are we still building web UIs for humans, or has everything shifted to agent-first interfaces?

There'll be a transition period, but there are some UIs—if you saw them, you would not want a human to touch them. They're so legacy and so outdated, but people can't really update them. So yes, there'll be some UIs where we keep using this legacy Internet infrastructure; we just put another layer of abstraction on top, which is these browser agents running on it. And slowly we will remove the underlying infrastructure that has been built over time. But a lot of it will just be browser agents, so you won't have to interact with it. You will just tell your AI, "Go and do the thing," and then the browser agent will do it on the old Internet that you no longer have to touch.

MCP promises to give agents structured access to tools and data. Does widespread MCP adoption make browser automation less necessary, or does it always remain as the fallback?

MCP is just a layer on top of API to make it more readable for LLMs. The prerequisite is that the API exists. Usually you need to use browser agents where APIs do not exist. If the APIs don't exist, you build some browser agents, and what we are doing now in Asteroid is that you call our browser agents through an MCP. It is just a new layer of abstraction. Instead of the API, we have the browser agent, and then there's an MCP on top of the browser agent. So you have this website accessible through an MCP for LLMs.

If you can use APIs, use APIs. Browser agents are slow, inefficient, and expensive for now. They will get a bit cheaper, but if there's an API, it's always better to use the API. MCP seems to be the best standard so far for connecting many of those APIs together easily.
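As an illustration of the layering described here (a browser agent standing in for the missing API, with an MCP-style tool on top), the sketch below shows what such a tool surface could look like. It is not the real MCP SDK or Asteroid's API; the tool name, schema, and endpoint are hypothetical.

```typescript
// Illustrative only: a browser-agent capability exposed as an MCP-style tool.
// This does not use the real MCP SDK; names and endpoints are hypothetical.
type ToolDefinition = {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema describing the tool's arguments
};

// What the LLM sees: a single tool that fills a legacy portal form.
const fillLegacyPortalForm: ToolDefinition = {
  name: "fill_legacy_portal_form",
  description:
    "Runs a hosted browser agent that signs into a legacy web portal and enters the given record.",
  inputSchema: {
    type: "object",
    properties: {
      portal: { type: "string" },
      record: { type: "object" },
    },
    required: ["portal", "record"],
  },
};

// When the LLM calls the tool, this layer forwards the arguments to the
// browser-agent service, which does the actual signing in, clicking, and typing.
async function handleToolCall(args: { portal: string; record: Record<string, string> }) {
  const response = await fetch("https://browser-agent.example.com/runs", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ tool: fillLegacyPortalForm.name, ...args }),
  });
  return response.json(); // e.g. { status: "completed", runId: "..." }
}
```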

What are some of the major use cases so far that you have come across?

It's usually centered around end-to-end data extraction in various forms. Industries we see a lot of traction in are, for example, insurance and healthcare, and there's also something around supply chain. It's usually industries where you have a lot of legacy portals and a lot of different systems you don't really have ownership of, and you need to interact with them very often. The work itself is quite repetitive, but you need some type of intelligence to actually do it, which is why it was impossible to automate two years ago.

To give you an example: insurance quoting is the most common one. If you're a broker, you first need to get some information about the customer: their name and, if it's car insurance, things like their driving history and driving license details. Then you need to go and fill in this complex form. But the form has quite a high branching factor. Imagine something like 150 questions, where sometimes you click yes here and something new pops up there. Then you need to rely on your prior knowledge as a domain expert, an insurance broker: "Oh, yeah, if I click yes here, then I should usually put no," but that "no" is not in the answers you got from the customer. There's domain knowledge that needs to be put into the browser agent's definition, into its program, to be able to automate this work end to end.

Actually, deploying those types of agents sometimes takes a bit of effort because there is trial and error: the agent escalates to the broker, the broker goes back and fills it in, and the agent learns, "Okay, this is what I'm supposed to do next time." So there's an iterative deployment.
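One way to picture how that broker knowledge might be captured is as explicit rules attached to the agent's definition, so a lesson from one escalation becomes an instruction for the next run. The shape below is a hypothetical sketch, not Asteroid's actual schema.

```typescript
// Hypothetical sketch of domain rules attached to an agent definition;
// field names and values are illustrative, not Asteroid's actual format.
type DomainRule = {
  whenField: string;  // the question on the carrier's form
  whenValue: string;  // the answer that triggers the rule
  thenField: string;  // the follow-up field that appears
  thenValue: string;  // what the broker says to put there
  note?: string;      // provenance, e.g. which escalation taught us this
};

const quotingAgentRules: DomainRule[] = [
  {
    whenField: "has_prior_claims",
    whenValue: "yes",
    thenField: "claims_under_review",
    thenValue: "no",
    note: "Broker guidance from an earlier escalation: default to 'no' unless the customer says otherwise.",
  },
];
```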

Are you building for those vertical-specific workflows at all, or are you building more horizontal?

We are quite horizontal in terms of the technology we use—it can be applied for any of these industries. But we do go specifically for some verticals, which is definitely insurance and healthcare. Those are the most important ones for now. The moment you're able to automate the insurance quoting in some specific portals, then it's easier to go for a second one and third one.

Have you done any forward-deployed engineering, like working inside an insurance company to understand their workflows better?

We do. We work very closely with them. A lot of what we provide is not just the platform where you build browser agents, but also the services on top of it. Let's say a massive insurer or insurtech comes to us and says, "Please automate these 30 brokers." We need to learn how those workers actually work. Usually the easiest thing they can do is record a video of the process, explaining what they do and how they do it. They also sometimes have SOPs, standard operating procedures. It's usually some type of PDF, sometimes a bit outdated, but it gives you what the LLM needs as context. And then we build the agent.

We have a Slack channel with them, and we work with those agents. We say, "Can you run this? Tell us if it works." And they run it and say, "Oh, yeah, it worked, but it was supposed to input this thing here. It's not critical, but it should put this thing here." And then we go and update the prompt. There are a lot of iterations, which is what people don't really understand about building with LLMs. They work; they just need to be told how they should work. It's a specification problem, which is what forward-deployed engineers are for.

Are you guys using off-the-shelf, whatever the best model is at the time, or is there some sort of multiplexing, any open source fine-tuned models that you guys are using?

We are, for now, committed to closed-source models because they're moving the fastest when it comes to releasing better and better models. We primarily use OpenAI models and are playing a bit with Anthropic as well. When it comes to multiplexing, we do use various models from those providers, mostly testing different reasoning models on specific tasks and their computer use capabilities.

We use text models, think GPT-4o, to understand websites at the DOM level. But we also combine that with computer use in a hybrid approach, where the model looks at screenshots. Some websites are better suited to the HTML-based approach and some are better for computer use, and we have a model for deciding which one should be used.
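A minimal sketch of what such a routing decision could look like is shown below; the heuristics and thresholds are illustrative assumptions, not Asteroid's actual logic.

```typescript
// Illustrative routing between a DOM/text approach and a screenshot-based
// computer-use approach; the heuristics and thresholds are hypothetical.
type Strategy = "dom" | "vision";

interface PageProfile {
  htmlBytes: number;   // size of the serialized DOM
  usesCanvas: boolean; // page renders into <canvas> instead of regular markup
  usesIframes: boolean;
}

function chooseStrategy(page: PageProfile): Strategy {
  // Canvas-heavy or iframe-heavy pages are hard to read from the DOM alone,
  // so fall back to screenshots plus a computer-use model.
  if (page.usesCanvas || page.usesIframes) return "vision";
  // Very large DOMs blow up the text model's context window.
  if (page.htmlBytes > 500_000) return "vision";
  // Otherwise the faster, cheaper DOM/text approach is preferred.
  return "dom";
}
```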

OpenAI and Anthropic are teaching models to browse the web directly. Can you talk about what web browsing use cases make sense for folks to use in a ChatGPT Agent or Claude vs. an Asteroid?

It's semi-complementary, but there is also some competition. But who isn't actually competing with OpenAI now? It was funny. I was fundraising, and Operator came out, and every investor was like, "Oh, Operator is going to kill you." Well, four months later, Operator is a deprecated product. It's been incorporated into the ChatGPT Agent. And the CUA model (Computer-Using Agent) that was powering Operator didn't get any updates, because it turns out computer use is actually very difficult.

I see it as mostly complementary, to be fair. The way it works now is you have the OpenAI agent, and behind the scenes it's choosing between two types of models, or better to say two types of browsers. There's one they call the visual browser, which is where you have computer use. The problems are that it's slow and very expensive. That's why Operator was not a success: you would need to wait 10 seconds for an action. That's why they also incorporated what they call the text browser, where you preprocess the HTML and use faster models, most likely something like GPT-4o in this case, to do all those clicks and automations.

Also, the difference in use cases is that you would probably use ChatGPT Agent for one-off tasks like "Go and find me something" or "Fill out this one form," whereas you would use Asteroid for "I need to fill out this form reliably 1,000 times a day." There's no way of doing that with ChatGPT right now.

The Browser Company has Dia, Perplexity has Comet, and OpenAI reportedly has a browser coming. Are these kinds of AI-native browsers complementary or competitive to what you're building?

I haven't even tested most of them because there are so many now. The way I'm thinking about it is that they will probably be very useful for those one-off tasks. Like, I'm searching for something, I'm on this website already, it already knows my name, and they all have a chatbot in the corner. It'll make my personal process of searching much better. Though I don't see a major reason to use that rather than just talking to ChatGPT and having it do things behind the scenes. Why do I need to see the website? I don't know. So definitely useful, definitely very cool. But again, not the main thing, and they will probably enter the space Asteroid is operating in, so they are technically competitors in a way.

But what we are doing is the hosted browser, right? Say you want to run 1,000 agents in parallel doing the same thing. You can't run 1,000 instances of Comet on your laptop. Well, you can, but no one will. We can scale infinitely because we're running hosted browsers. You have your workforce there, and if you need to, you can surface it and look at the recordings and live view in your own browser. This is how enterprises, big companies, are going to want to run it. They don't want to install Comet on everyone's laptop. They want a hosted browser where they have their insurance quoting workforce: whatever they were outsourcing to the Philippines for 15 years, they just want to bring it in-house and have this hosted browser workforce.

Actually, I got a funny quote from one of our users who was on a call with us a few weeks ago. They were like, "Oh, you're better than this Comet that I've been using for weeks." And I was like, "I never actually tried Comet." And we never even tried to do the things that he wanted to do with Comet. He was using us as you would use Comet—go to this website, download this file, upload this file here, now read me the file. We have the chat, and you can interact with our browser and it will follow instructions. But we never really tried to index on those types of use cases. And he was like, "But it was still better." Comet will obviously get so much better, and I love Perplexity, so I hope they'll do great.

There's a tension between making the web more accessible to agents versus websites defending against automation. How does this resolve over the next few years?

Given what we do, I should obviously be an advocate of a free Internet, where you can access everything in the browser and agents don't need permission from the agent police and centralized authorities such as Cloudflare. But I do understand the sentiment completely, because there are good agents and there are bad agents.

You know, OpenAI scraped the full Internet and then built a massive model on top of it. And now they're selling it for money, and people are angry about it. I get it. Similarly, there are websites that rely on a human looking at the ads in the corners.

But then there are a lot of good agents that are actually trying to do work on behalf of other companies, and they bring in money. In our use cases, like insurance quoting, we're putting in data on behalf of customers. If you make your website easily accessible to agents, more and more upcoming insurtechs will just use you as a carrier because you're easy to integrate with.

Let's say you want to order food. I was meeting with the CEO of DoorDash a few months ago, and he was telling us that they're now working hard on making DoorDash accessible to browser agents. They have a specific place agents can go to place orders, because it brings them money, right? The agent is placing an order for food.

Every company and domain owner will ultimately need to decide for themselves. They can probably do it in their Cloudflare settings. What I didn't like was that Cloudflare was turning these protections on by default, or something like that. But I didn't look too much into it.

For us, a lot of our customers whitelist our IP address, so we are not doing any malicious scraping. But yeah, there are customers who come to products like ours, and they would definitely try to use it in that way. And that’s been happening for the last 30 years of web automation.

How do you map out the ecosystem or market between stuff like Asteroid, between Browserbase, Stagehand, some of these AI-native recent companies?

It's a very early and growing ecosystem. As with every new tool, the early adopters are developers, so that's where most of the tools are. Browserbase does hosting for Chromium instances, and they also have the open-source tool Stagehand, which helps you build those automations. There are frameworks like Browser Use, which are great if you're a developer and you want to get something out there very quickly.

Where we saw a gap in the market, which hasn't really been addressed, is if you want to build reliable browser automation as a non-technical person, there's really nothing you can use.

You can go and pay someone to use UiPath if you have $50,000 in your pocket. Where Asteroid is trying to be different is that even non-technical teams can go and automate and supervise complex browser tasks.

The domain expert who knows how to do this healthcare, insurance, or supply chain process doesn't need to go and explain it to a developer, who then uses Stagehand and spends two months iterating because LLMs and browser agents are actually quite hard to steer, and after three months still doesn't really have great browser automations because of all that friction.

What we're focusing on is: okay, you're a non-technical person. You can use our tool immediately, because we have this graph builder, with a no-code, AI-assisted builder coming soon, which will help you automate it, supervise it, and control it, so you can automate serious work, not just QA testing.

Is Asteroid using Playwright or what is the underlying technology there?

We use Playwright, primarily on the interaction layer.

What makes it valuable to build on Playwright? Why is that a smart decision?

It's probably the easiest and the most well-maintained framework. When you're interacting with browsers, there's a lot of HTML and it's quite messy. So you're looking for some kind of abstraction layer with a lot of developers around it who are going to patch things as there's a new HTML version coming in.

Because it's maintained by someone like Microsoft, you know it's going to be around for years to come. The alternative would probably be something like Puppeteer, which I've heard is also a reasonable framework to use right now.

When people use no code, it generates Playwright. Is that then exposed to the end user, or do they interface with it through natural language if they need to make edits or through a visual UI?

We do both. We just released a feature where your browser agent will do the task. Let's say you want to fill a form. You tell the agent, "Go fill this form. This is the input data." The agent, using AI, will fill the form, and at the end it will also give you the script that it used to fill it out.

It generates that script and parametrizes it in a nice way. The next time, you just call it with the variables to fill inside the script, and instead of taking three minutes, it takes 20 seconds.
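A rough sketch of what such a generated, parametrized script could look like is below. The selectors, URL, and parameters are hypothetical, not Asteroid's actual output; the point is that after the first AI-driven run, subsequent runs replay a deterministic script with only the variables changed.

```typescript
// Hypothetical example of an agent-generated, parametrized Playwright script;
// the portal URL and selectors are illustrative.
import { chromium } from "playwright";

type QuoteInput = { firstName: string; lastName: string; licenseNumber: string };

export async function runQuoteForm(input: QuoteInput) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://carrier-portal.example.com/quote");

  // Replayed deterministically: only the variable values change between runs.
  await page.fill("#first-name", input.firstName);
  await page.fill("#last-name", input.lastName);
  await page.fill("#license-number", input.licenseNumber);
  await page.click("button[type='submit']");

  await page.waitForSelector("text=Quote generated");
  await browser.close();
}
```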

Looking at the next five years, if everything goes right for Asteroid, what do you see Asteroid becoming and how is the world different?

We really want to automate real, hard work and free people from spending their days on it—no one's dream was to be a data entry specialist. There are millions of people like that, and I'm sure there's more creative work they can do. In healthcare, for example, focus on the actual healthcare: nurses should not be spending three hours a day inputting data into terrible legacy systems. All of these people can be freed from that. If Asteroid is successful, in five years people are not doing these terrible jobs, and they can do more creative things.

Also, when you're actually interacting with Asteroid in five years, it's going to be much more than just browser agents. Browser agents are where we're starting, but we're adding non-browser interaction layers as well. Let's say you have your browser worker doing 95% of the task in the browser, but then you need something else—people ask a lot about Excel and Google Sheets, for example. So you want a full team of agents able to do the task.

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
