The Experience Era of AI Evaluation


Jemma White, COO of Prolific, on why humans ensure AI safety

Interview
"We are now in the experience era of AI training and evaluation."

This marks a shift from training models to know facts toward testing whether they behave like trustworthy people in messy, real-world situations. Early AI data work rewarded vendors that could supply huge pools of workers for simple annotation. The newer work is smaller and harder: labs need people who can judge tone, spot unsafe edge cases, understand culture, and react like real users. That favors deeply profiled participant networks over raw labor volume.

  • The market has moved in steps. Scale AI won by organizing large-scale labeling workflows; then Handshake and Mercor grew by supplying credentialed experts for reasoning tasks. Prolific is pushing the next step, where the scarce input is not just expertise but lived experience, behavior, and cultural context.
  • In practice, this means red teaming, safety review, localization, and product testing. A model builder may need bilingual users in a specific country, people with certain temperaments, or participants who can handle harmful content and judge whether an output feels acceptable, not just whether it is technically correct.
  • That is why Prolific emphasizes profiling depth, verification, and participant history. It has about 200,000 active participants across more than 40 countries, speaking over 80 languages, backed by years of behavioral data. The product advantage is that customers can find narrow cohorts fast, often through self-serve workflows and APIs rather than custom recruiting each time.

Going forward, more AI spend should flow into evaluation layers that measure trust, cultural fit, and safety once core model knowledge is already strong. The winners in human data will look less like bulk-annotation factories and more like infrastructure for sourcing the right human judgment on demand, a direction Prolific is well aligned with.