From Bespoke Environments to Benchmarks
Fleet
Fleet’s first revenue came from selling trust, not scale. Large banks and insurers paid for custom environments because they needed tightly controlled test worlds where an AI agent could handle claims, underwriting, compliance, or back-office work without touching real systems. That work forced Fleet to build the hard parts early, including task workflows, scoring rubrics, and verifiers, which later became reusable building blocks as AI labs began buying environments in volume.
In practice, a bespoke environment is a fake but realistic software stack. An agent logs into a browser or internal-style app, completes multi-step tasks, and gets scored on whether it followed the right process and reached the right answer. That is especially valuable in regulated sectors where mistakes are expensive.
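The dual scoring idea, credit for following the required process as well as for the final answer, can be sketched as a simple rubric verifier. Everything below is a hypothetical illustration, not Fleet's actual API: the class names, step names, and the 50/50 weighting are assumptions.

```python
# Hypothetical sketch of scoring an agent's run in a simulated environment:
# half the credit rewards completing the required process steps, half
# rewards reaching the correct final answer. Names and weights are
# illustrative assumptions, not Fleet's real interface.
from dataclasses import dataclass


@dataclass
class TaskResult:
    steps_taken: list      # actions the agent performed, in order
    final_answer: str      # the agent's submitted answer


@dataclass
class Rubric:
    required_steps: list   # process checklist, e.g. "verify_identity"
    expected_answer: str
    process_weight: float = 0.5

    def score(self, result: TaskResult) -> float:
        # Process score: fraction of required steps the agent completed.
        done = sum(1 for s in self.required_steps if s in result.steps_taken)
        process = done / len(self.required_steps)
        # Outcome score: did the agent reach the right answer?
        outcome = 1.0 if result.final_answer == self.expected_answer else 0.0
        return self.process_weight * process + (1 - self.process_weight) * outcome


rubric = Rubric(
    required_steps=["verify_identity", "check_policy", "log_decision"],
    expected_answer="approve_claim",
)
run = TaskResult(
    steps_taken=["verify_identity", "check_policy"],
    final_answer="approve_claim",
)
print(rubric.score(run))  # ~0.833: full outcome credit, 2/3 process credit
```

A verifier like this is what makes mistakes legible in regulated settings: an agent that reaches the right answer while skipping a mandated check still loses points.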
Those enterprise projects created repeatable templates for banks, asset managers, insurers, and adjacent governance, risk, and compliance (GRC) workflows. What starts as one custom build can turn into a standard benchmark suite for a whole vertical, which is why early services revenue can evolve into product revenue.
This also explains Fleet’s later position with frontier labs. Competitors like Mercor, Surge AI, and Turing came from human labeling and expert networks, while Fleet’s differentiation was environment quality and evaluation infrastructure, later extended by Harbor into broader agent testing and optimization.
The next step is turning those high-touch regulated workflows into packaged benchmarks that enterprises can buy off the shelf and labs can use as standard training targets. If Fleet keeps converting custom bank and insurance work into common evaluation products, it can move from project revenue toward a more durable software and data infrastructure position.