Models Reduce Edge Case Checks 100x
Scale: the $290M/year Mechanical Turk of machine learning
This marks the point where data labeling stopped being a volume business and started becoming a precision business. When a strong pre-trained model already gets most examples right, the job is no longer to have thousands of workers label everything from scratch. The job becomes finding the small set of failures, checking whether the model breaks on a new domain, and adding just enough human judgment to correct it. That is why foundation models shrink the need for broad labeling labor while raising the value of targeted review and evaluation.
-
In older ML workflows, teams often needed around 10,000 labeled examples before a model became useful. With transfer learning and large pre-trained models, teams can often start with closer to 100 examples, then use cross-validation and quick fine-tuning to see whether the model works on their own data. That collapses human review volume by orders of magnitude.
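The "start with ~100 examples and cross-validate" workflow can be sketched in a few lines. This is a minimal illustration, not Scale's pipeline: `train_fn` stands in for whatever quick fine-tune the team runs on top of a frozen pre-trained model, and the function names are hypothetical.

```python
import random

def kfold_accuracy(examples, labels, train_fn, k=5, seed=0):
    """Estimate whether a quick fine-tune works on a small labeled set.

    examples, labels: the ~100 hand-labeled items.
    train_fn: callable taking (train_xs, train_ys) and returning a
              predict(x) -> label function (the "quick fine-tune" stub).
    Returns mean held-out accuracy across k folds.
    """
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        train = [i for i in idx if i not in held_out]
        predict = train_fn([examples[i] for i in train],
                           [labels[i] for i in train])
        correct = sum(predict(examples[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

def majority_stub(xs, ys):
    """Trivial stand-in model: always predict the most common training label."""
    top = max(set(ys), key=ys.count)
    return lambda x: top
```

If the cross-validated score on 100 examples is already high, the team skips bulk labeling and moves straight to targeted review of the failures.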
-
The biggest change is in open-ended domains like driving. A self-driving car can fail on endless combinations of rain, glare, dusk, road markings, and unusual objects. Better pre-training helps the model generalize across many of those combinations, so humans spend less time relabeling routine cases and more time hunting for rare failures that still matter for safety.
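One common way to "hunt for rare failures" is uncertainty triage: route only the examples the model is least confident about to human reviewers. The sketch below assumes a per-example confidence score is available (the model's probability for its own top prediction); the names and data are illustrative, not from any specific system.

```python
import heapq

def triage_for_review(scored_examples, budget):
    """Select the most uncertain examples for human review.

    scored_examples: iterable of (example_id, confidence) pairs, where
        confidence is the model's probability for its top prediction.
    budget: how many items the review team can handle.
    Returns the `budget` lowest-confidence items, most uncertain first.
    """
    return heapq.nsmallest(budget, scored_examples, key=lambda pair: pair[1])

# Hypothetical scores: routine frames score near 1.0, while an unusual
# rainy-dusk frame scores low and jumps to the front of the review queue.
frames = [("clear_day", 0.99), ("rainy_dusk", 0.43),
          ("highway", 0.97), ("faded_markings", 0.51)]
queue = triage_for_review(frames, budget=2)
```

With a fixed review budget, this concentrates human attention on the long tail rather than spreading it evenly over cases the model already handles.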
-
This does not remove humans from the loop; it changes which humans matter. Recent demand has shifted away from generic crowdworkers toward specialists used for red-teaming, safety checks, cultural nuance, and domain-specific evaluation. Scale itself followed that shift, growing from about $215M ARR in 2022 to about $760M in 2023 and $1.5B in 2024 as LLM-related work replaced declining autonomous vehicle labeling.
Going forward, the winners are likely to be the companies that help customers find model failures fast, route them to the right experts, and turn small amounts of feedback into measurable improvement. Human data remains essential, but it is moving upmarket, from bulk annotation toward evaluation, alignment, and trust.