Surge Displacing Scale in RLHF

Surge is reportedly displacing Scale in some high-profile cases, including Anthropic's RLHF work.

This points to a shift in AI data labeling away from raw volume and throughput toward smaller pools of stronger raters under tighter quality control. Scale won the first wave of RLHF by industrializing annotation across huge datasets, but newer work for labs like Anthropic increasingly depends on people who can judge subtle reasoning, safety, and writing quality, so vendor selection now looks more like hiring elite evaluators than booking generic labelers.

  • Scale built its business on a broad labeling platform that could absorb massive volumes across autonomous driving and later LLM work. That model helped it reach $760M ARR in 2023 and $1.5B ARR in 2024, but it was optimized for operational scale before the market shifted toward smaller batches of harder judgment tasks.
  • Surge was built around expert curation from the start. Its workflow lets labs specify the exact rater profile needed, run live chat or transcript evaluations, track agreement scores, and automatically reassign weak work. That setup maps well to RLHF and red teaming, where one bad judgment can poison a training set.
  • The competitive takeaway is that frontier labs now spread work across vendors and increasingly reward neutrality and quality. Recent reporting says Surge won business from OpenAI and Anthropic, while concerns around trust and quality have contributed to customer movement away from Scale in some cases.
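The quality-control loop described above, tracking rater agreement and reassigning weak work, can be sketched in miniature. This is an illustration only, not Surge's actual pipeline: the ratings data, the majority-vote consensus rule, and the `AGREEMENT_THRESHOLD` cutoff are all assumptions for the example.

```python
# Illustrative sketch of rater quality control: score each rater by how often
# they match the majority label, and flag low-agreement raters for reassignment.
from collections import Counter, defaultdict

# Hypothetical ratings: item_id -> {rater_id: label}
ratings = {
    "item1": {"a": "good", "b": "good", "c": "bad"},
    "item2": {"a": "bad",  "b": "bad",  "c": "bad"},
    "item3": {"a": "good", "b": "good", "c": "bad"},
}

AGREEMENT_THRESHOLD = 0.5  # assumed cutoff; real systems tune this per task

def agreement_scores(ratings):
    """Fraction of items on which each rater matched the majority label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for labels in ratings.values():
        majority, _ = Counter(labels.values()).most_common(1)[0]
        for rater, label in labels.items():
            totals[rater] += 1
            hits[rater] += int(label == majority)
    return {r: hits[r] / totals[r] for r in totals}

scores = agreement_scores(ratings)
flagged = [r for r, s in scores.items() if s < AGREEMENT_THRESHOLD]
print(scores)   # rater "c" disagrees with the majority on 2 of 3 items
print(flagged)  # ["c"] -> this rater's work gets reassigned
```

Production systems use more robust statistics (e.g. chance-corrected agreement like Cohen's kappa, or gold-standard audit items), but the shape of the loop is the same: measure agreement continuously, then route work away from raters who fall below the bar.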

The market is heading toward premium human feedback infrastructure, not commodity labeling. The winners will be vendors that can recruit scarce experts, measure rater quality in real time, and package RLHF, evals, and red teaming into one workflow. That favors specialist providers like Surge and pushes broader platforms like Scale and Invisible to move up the stack.