Practice Field for Software Agents
Fleet
The hard part is turning an AI agent from a clever demo into something a company will trust to click, type, read, and update real systems without breaking them. In enterprise software, tasks unfold over many steps, the screen and underlying data keep changing, and one wrong edit can corrupt records or trigger the wrong downstream action. That makes realistic environments, repeatable resets, and strict scoring more valuable than better text output alone.
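To make that concrete, here is a minimal sketch of the contract such a testbed has to honor. None of these names come from Fleet; WorkflowEnv, TaskResult, and their fields are hypothetical illustrations of what "repeatable resets and strict scoring" mean in code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    success: bool          # did the final state match the task spec?
    steps_taken: int       # how many click/type/read actions were used
    state_corrupted: bool  # did the run damage records along the way?


class WorkflowEnv:
    """One enterprise task: a fixed starting snapshot and a strict grader."""

    def reset(self) -> dict:
        """Restore app state, databases, and UI to the exact same snapshot,
        so every retry starts from identical conditions."""
        ...

    def step(self, action: dict) -> dict:
        """Apply one agent action (click, type, read, update) and return
        the new observation (screen plus relevant app state)."""
        ...

    def score(self) -> TaskResult:
        """Grade the end state against the task rules, not the transcript:
        a fluent answer sitting on a corrupted record still fails."""
        ...
```

The point of the strict scorer is that "sounded right" is not a pass condition; only the end state of the system is.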
-
Fleet is selling a practice field for software work, not a chatbot. Its environments recreate app state, databases, interfaces, and task rules so agents can retry the same workflow many times, while humans label failure modes and feed that back into training and evaluation.
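A hedged sketch of that retry-and-label loop, reusing the hypothetical WorkflowEnv above. The agent object and the annotation step are stand-ins, not Fleet's actual pipeline.

```python
def collect_attempts(env: WorkflowEnv, agent, n_trials: int = 50) -> list:
    """Run the same workflow n_trials times from identical resets and keep
    every failed trajectory for human labeling."""
    labeled_failures = []
    for trial in range(n_trials):
        obs = env.reset()                # identical starting snapshot each time
        trajectory = []
        while not agent.is_done(obs):    # hypothetical termination check
            action = agent.act(obs)      # hypothetical policy call
            obs = env.step(action)
            trajectory.append((action, obs))
        if not env.score().success:
            labeled_failures.append({
                "trial": trial,
                "trajectory": trajectory,
                "failure_mode": None,    # filled in by a human annotator, then
                                         # fed back into training and eval sets
            })
    return labeled_failures
```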
-
This is why incumbents like ServiceNow built WorkArena and BrowserGym. Their own research frames enterprise tasks as harder than consumer web actions because agents must navigate dense interfaces, forms, databases, and multi-page workflows, and current systems still fall short of full task automation.
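BrowserGym is an open-source ServiceNow package that wraps browser tasks, including WorkArena's ServiceNow tasks, in the standard Gymnasium interface. The loop below follows that pattern; the specific task id and the noop() action string are illustrative, so check the project docs for current task names, action spaces, and instance setup.

```python
import gymnasium as gym
import browsergym.workarena  # noqa: F401  (importing registers the tasks)


def placeholder_agent(obs) -> str:
    """Stand-in policy: a real agent maps the page observation to an action."""
    return "noop()"  # illustrative: BrowserGym actions are code-like strings


env = gym.make("browsergym/workarena.servicenow.order-standard-laptop")
obs, info = env.reset()  # restores the instance to the task's start state
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(placeholder_agent(obs))
    done = terminated or truncated
env.close()
```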
-
Anthropic's computer use stack points to the same bottleneck. The model can control mouse and keyboard, but it still needs a sandboxed environment, an execution loop, and tooling around actions and results. The model is only one layer; reliable workflow execution comes from the system around it.
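The shape of that loop is visible in Anthropic's own API: the model requests actions through a computer tool, and the caller must execute them in a sandbox and send results back. A minimal sketch, using the October 2024 beta tool type (check current docs for newer versions); execute_in_sandbox is a hypothetical placeholder for the VM that actually clicks and types.

```python
import anthropic


def execute_in_sandbox(tool_input: dict) -> list:
    """Hypothetical: perform the requested click/type/screenshot in an
    isolated VM and return result content blocks (text and/or images)."""
    raise NotImplementedError


client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Update the ticket status in the CRM."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[{
            "type": "computer_20241022",   # screenshot / mouse / keyboard tool
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }],
        messages=messages,
        betas=["computer-use-2024-10-22"],
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # no more actions requested; the workflow attempt is over

    # The execution loop: run each requested action, then report results
    # back so the model can see what actually happened on screen.
    tool_results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": execute_in_sandbox(block.input),
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
```

Everything outside the create() call is the buyer's problem, which is exactly the layer Fleet and BrowserGym-style environments target.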
The next phase of the market centers on who owns the testbed for real work. As labs and enterprises push agents into finance, insurance, support, and internal operations, the winning layer is likely to be the one that turns messy software tasks into repeatable training runs, evaluations, and deployment gates across many systems, rather than the one that just generates better answers in a chat window.
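A concrete reading of "deployment gate" in that sense: a hypothetical check that refuses to promote an agent until it clears a strict pass rate over many identical, reset-to-snapshot runs. Names and thresholds are illustrative, reusing the WorkflowEnv sketch above.

```python
def deployment_gate(env: WorkflowEnv, agent, n_trials: int = 100,
                    min_pass_rate: float = 0.95) -> bool:
    """Promote the agent only if it clears a strict bar over many identical
    runs; a single corrupted record fails the gate outright."""
    passes = 0
    for _ in range(n_trials):
        obs = env.reset()
        while not agent.is_done(obs):   # hypothetical, as in the loop above
            obs = env.step(agent.act(obs))
        result = env.score()
        if result.state_corrupted:
            return False                # breaking real data is disqualifying
        passes += result.success
    return passes / n_trials >= min_pass_rate
```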