Synthetic Data Needs Human Evaluation

Jemma White, COO of Prolific, on why humans ensure AI safety

Interview
"Synthetic data alone is not going to be enough, and you will always need that human evaluation."

The strategic point is that synthetic data shifts human labor up the stack rather than removing it. As models generate more of their own training material, the scarce input becomes trusted people who can check whether outputs are actually correct, safe, culturally appropriate, and useful in real workflows. That favors Prolific's vetted participant network, self-serve targeting, and growing evaluation infrastructure over the old world of bulk commodity labeling.

  • Prolific is built for targeted human judgment, not just raw annotation volume. Customers can filter for more than 300 traits and skills, route tasks through external eval tools, and recruit from 200,000 active, ID-verified participants across 80-plus languages, which fits the shift from broad labeling to specialized evaluation.
  • The market is already moving this way. Mercor, Invisible, Handshake, and Office Hours have all grown by supplying vetted experts for reasoning, safety, and domain-specific tasks, while older crowdwork models look weaker wherever customers need credentialing, auditability, and nuanced judgment.
  • Model builders themselves still treat human checks as part of the safety stack. OpenAI publishes human-sourced jailbreak and expert-grading evals, and notes that automated graders are not yet reliable enough to replace expert graders. Anthropic's Constitutional AI reduces some dependence on human feedback, but still combines human and AI feedback for alignment and evaluation, a hybrid pattern sketched below.
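To make that hybrid grading concrete, here is a minimal Python sketch of an escalation policy: an automated grader scores each model output, and anything safety-flagged or in the grader's uncertain middle band is routed to human expert review. Every name and threshold here (`ReviewItem`, `route`, the cutoffs) is a hypothetical illustration, not a real Prolific or OpenAI API.

```python
from dataclasses import dataclass

@dataclass
class ReviewItem:
    output_id: str
    text: str
    auto_score: float      # automated grader's quality estimate, 0..1
    safety_flagged: bool   # tripped a safety classifier

# Hypothetical thresholds; in practice these would be tuned against
# a human-labeled calibration set.
ACCEPT_THRESHOLD = 0.90   # auto-accept above this score
REJECT_THRESHOLD = 0.30   # auto-reject below this score

def route(item: ReviewItem) -> str:
    """Decide whether an output needs human expert review.

    Safety-flagged items and items in the grader's uncertain middle
    band go to humans, reflecting the point that automated graders
    are not yet reliable enough to replace expert graders outright.
    """
    if item.safety_flagged:
        return "human_expert_review"
    if item.auto_score >= ACCEPT_THRESHOLD:
        return "auto_accept"
    if item.auto_score <= REJECT_THRESHOLD:
        return "auto_reject"
    return "human_expert_review"

# Example: a borderline output is escalated rather than trusted.
item = ReviewItem("out-17", "model answer ...", auto_score=0.62, safety_flagged=False)
assert route(item) == "human_expert_review"
```

The design choice worth noticing is that human review is the default for the ambiguous middle, not an exception: automation only handles the cases where it is confidently right or confidently wrong.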

This pushes Prolific toward becoming an orchestration layer for AI evaluation. The next step is not just supplying people for studies, but helping teams decide when to use synthetic data, when to use humans, and how to combine both in one repeatable workflow for training, red teaming, and release decisions.
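One way to picture that orchestration layer is a declarative pipeline in which each stage names its data source and whether vetted human sign-off gates progress toward release. This is a speculative sketch of what such a workflow spec could look like, not a description of any existing Prolific product; every identifier is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    data_source: str    # "synthetic", "human", or "hybrid"
    human_signoff: bool # must a vetted reviewer approve before moving on?

@dataclass
class EvalPipeline:
    stages: list[Stage] = field(default_factory=list)

    def release_blockers(self) -> list[str]:
        """Stages that cannot be skipped before shipping: every
        human-gated stage is a release blocker by construction."""
        return [s.name for s in self.stages if s.human_signoff]

# Hypothetical workflow mixing synthetic generation with human checks
# across training, red teaming, and release decisions.
pipeline = EvalPipeline(stages=[
    Stage("generate_training_pairs", data_source="synthetic", human_signoff=False),
    Stage("spot_check_samples",      data_source="human",     human_signoff=True),
    Stage("red_team_probes",         data_source="hybrid",    human_signoff=True),
    Stage("release_eval",            data_source="human",     human_signoff=True),
])

print(pipeline.release_blockers())
# ['spot_check_samples', 'red_team_probes', 'release_eval']
```

Making the synthetic-versus-human decision an explicit field on each stage is the point: it turns "when do we need people?" from an ad hoc judgment into an auditable part of the workflow itself.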