Hybrid DOM and Screenshot Automation


David Mlcoch, co-founder & CEO of Asteroid, on browser automation and the last mile problem of AI

From the interview: "we also combine it with a hybrid approach of computer use where it's looking at screenshots"

The hybrid screenshot approach is what turns browser automation from brittle scripting into software that can survive messy real websites. HTML and DOM access is faster and cheaper when buttons, forms, and fields are cleanly exposed in the page code. Screenshot-based computer use is the fallback for popups, canvas elements, remote desktops, and odd legacy portals where the page structure is incomplete or misleading. In effect, Asteroid routes each step through the cheapest method that still works reliably.
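The routing idea can be illustrated with a minimal sketch. This is not Asteroid's implementation: the `Strategy` type, the `relative_cost` numbers, and the `dom_click` and `vision_click` stand-ins are all hypothetical, and no real browser is driven. It only shows the "try the cheapest method first, fall back when it fails" pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class StepFailed(Exception):
    """Raised by a strategy that cannot complete the step."""


@dataclass
class Strategy:
    name: str
    relative_cost: float          # hypothetical cost unit; lower is cheaper
    run: Callable[[dict], str]    # takes a step description, returns a result


def route_step(step: dict, strategies: List[Strategy]) -> Tuple[str, str]:
    """Try strategies from cheapest to most expensive.

    Returns (strategy name, result) for the first strategy that succeeds.
    """
    last_error: Optional[Exception] = None
    for strategy in sorted(strategies, key=lambda s: s.relative_cost):
        try:
            return strategy.name, strategy.run(step)
        except StepFailed as exc:
            last_error = exc  # remember why the cheaper path failed
    raise StepFailed(f"no strategy could complete {step!r}") from last_error


# Hypothetical stand-ins for a DOM action and a screenshot-based action.
def dom_click(step: dict) -> str:
    if step.get("selector") is None:
        raise StepFailed("no usable selector in the page code")
    return f"clicked {step['selector']} via DOM"


def vision_click(step: dict) -> str:
    return f"clicked '{step['label']}' via screenshot"


STRATEGIES = [
    Strategy("vision", relative_cost=10.0, run=vision_click),
    Strategy("dom", relative_cost=1.0, run=dom_click),
]
```

With these stand-ins, a step that carries a selector is handled by the `dom` strategy, while a step with `selector=None` (say, a button drawn on a canvas) falls through to `vision`.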

  • Traditional tools like Selenium and Playwright break when a selector changes or a new popup appears. Asteroid still uses Playwright underneath, but adds models that can look at the page like a human and decide what to click when the code-level path is too brittle.
  • This matters most in insurance, healthcare, and supply chain workflows, where workers spend hours inside old portals with branching forms, hidden fields, and systems with no API. In those environments, screenshot understanding is less a nice-to-have and more the difference between partial automation and actually finishing the job.
  • The same split is now showing up across the market. OpenAI and others pair a faster text browser with slower vision-based browsing, while Browserbase focuses on the hosted browser infrastructure layer. Asteroid is packaging that stack for repeated enterprise workflows, especially for non-technical operations teams running large volumes in parallel.

Over time, more browser automation will compile repeated tasks into reusable scripts and reserve screenshot-based reasoning for the weird edge cases. That pushes the market toward a two-layer stack: foundation models provide generic computer use, and workflow companies like Asteroid win by deciding when to use vision, when to use DOM actions, and how to operate thousands of runs reliably inside enterprise processes.
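One way to picture "compiling" a repeated task is a cache of recorded actions that is replayed cheaply and only re-planned with vision when the replay breaks. The sketch below assumes this design for illustration; `WorkflowCache`, `plan_with_vision`, and `execute_action` are hypothetical names, and the page is simulated by a set of valid action strings rather than a real browser.

```python
from typing import Callable, Dict, List


class ReplayFailed(Exception):
    """Raised when a recorded action no longer works on the page."""


class WorkflowCache:
    """Replay a recorded script when possible; fall back to vision otherwise."""

    def __init__(
        self,
        plan_with_vision: Callable[[str], List[str]],  # slow, expensive planner
        execute_action: Callable[[str], None],         # fast scripted action
    ) -> None:
        self._plan = plan_with_vision
        self._execute = execute_action
        self._scripts: Dict[str, List[str]] = {}
        self.vision_calls = 0

    def run(self, task: str) -> str:
        script = self._scripts.get(task)
        if script is not None:
            try:
                for action in script:
                    self._execute(action)
                return "replayed"
            except ReplayFailed:
                del self._scripts[task]  # script is stale, so discard it
        # Slow path: plan from screenshots, execute, then record the script.
        self.vision_calls += 1
        actions = self._plan(task)
        for action in actions:
            self._execute(action)
        self._scripts[task] = actions
        return "vision"


# Simulated page: the set of actions that currently work (an assumption).
valid_actions = {"click #submit"}


def plan(task: str) -> List[str]:
    return sorted(valid_actions)


def execute(action: str) -> None:
    if action not in valid_actions:
        raise ReplayFailed(action)
```

In this sketch the first run of a task pays for vision once, later runs replay the recorded script, and a page change invalidates the script and triggers a single new vision pass. A real system would also have to handle partially executed replays before re-planning, which this sketch ignores.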