Labs Internalize Data Operations
Datacurve
As frontier labs internalize data operations, coding data vendors get pushed up the value stack, because the most commoditized work is the first thing labs pull in house. A lab that builds its own data ops team can define tasks, recruit raters, check outputs, and feed results straight into training and eval loops without waiting on an outside vendor. That leaves external vendors strongest where they offer something the lab cannot easily replicate: scarce expert supply, unusually hard workflows, or independent validation.
-
Anthropic is clearly building internal control over data operations. Its Data Operations Manager role covers data strategy, vendor management, and ownership of the pipeline from requirements to production, while its Transparency Hub says Claude training uses a mix of third party data, paid contractors, and data generated internally.
-
That changes what outside vendors are hired for. Self serve RLHF tooling from Labelbox, SuperAnnotate, Label Studio, AWS SageMaker Ground Truth, and Vertex AI lets labs run standard annotation inside their own stack. Specialists like Datacurve and Surge matter more when the task needs top coders, custom rubrics, or managed review that general purpose tools do not provide.
-
Open benchmarks also compress the low end of the market. The Stack provides a large permissively licensed code corpus, and LiveCodeBench continuously refreshes coding problems for contamination resistant evaluation. They do not replace bespoke datasets, but they reduce the need to pay for generic coding data and make quality easier to benchmark against a public baseline.
The likely end state is a split market. Big labs will keep internal teams for core training data and use vendors for narrow, high judgment work and outside checks. That favors firms that look less like bulk data suppliers and more like specialized research ops partners embedded in the hardest parts of model training and evaluation.