Continuous Human Validation for AI Safety
Jemma White, COO of Prolific, on why humans are essential to AI safety
This points to AI evaluation becoming less like one-time model training and more like ongoing, regulated quality control. As rules harden, labs and AI app companies will need auditable records showing that real people checked outputs, tested edge cases, and verified behavior across languages, cultures, and risk scenarios. That favors vendors like Prolific that can supply fast, well-profiled participant pools, transparent payment flows, and repeatable validation workflows.
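To make the compliance framing concrete, here is a minimal sketch of what one auditable validation record might contain. The `ValidationRecord` schema and its field names are illustrative assumptions, not any vendor's or regulator's actual format:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class ValidationRecord:
    """One auditable piece of evidence that a human checked a model output.

    Hypothetical schema for illustration; real formats will be set by
    regulators and internal risk teams.
    """
    model_id: str            # which model/version was evaluated
    output_id: str           # the specific output that was reviewed
    reviewer_id: str         # pseudonymous ID of the human reviewer
    reviewer_language: str   # language the output was reviewed in
    risk_scenario: str       # e.g. "medical advice", "self-harm"
    verdict: str             # e.g. "pass", "fail", "escalate"
    notes: str = ""
    reviewed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = ValidationRecord(
    model_id="chat-model-v3",
    output_id="out-8841",
    reviewer_id="rev-102",
    reviewer_language="de",
    risk_scenario="medical advice",
    verdict="escalate",
    notes="Confident tone on a dosage question; needs expert review.",
)
print(json.dumps(asdict(record)))  # one JSON line per check, append-only
```

An append-only log of records like this is the kind of recurring evidence a compliance review can point to.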
The practical shift is from building an internal annotator pool to proving independent oversight. Prolific describes labs turning to outside vendors for second opinions, for participant groups they lack in-house, and for broader demographic coverage. EU rules already require human oversight and conformity assessment for high-risk AI systems, which makes external validation easier to document and defend.
The work itself is changing. Early AI labeling was often broad, repetitive annotation. Now the higher-value jobs are red-teaming, safety reviews, cultural-nuance checks, multilingual testing, and preference judgments from specific kinds of people. Prolific is built around matching studies to deeply profiled participants, while Handshake and Surge are pushing harder into expert-heavy evaluation and RLHF workflows.
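A rough sketch of the matching step that deep profiling enables, assuming a simple in-memory participant pool; the `Participant` fields and `match_participants` criteria are invented for illustration and are not Prolific's actual screening system:

```python
from dataclasses import dataclass

@dataclass
class Participant:
    pid: str
    languages: set[str]
    country: str
    expertise: set[str]  # e.g. {"medicine", "law"}

def match_participants(pool, *, language, countries=None, expertise=None):
    """Yield participants whose profile satisfies a study's requirements.

    Illustrative only: real platforms screen on much richer profiles
    (demographics, prior quality scores, attention-check history, etc.).
    """
    for p in pool:
        if language not in p.languages:
            continue
        if countries and p.country not in countries:
            continue
        if expertise and not expertise <= p.expertise:  # subset check
            continue
        yield p

pool = [
    Participant("p1", {"en", "hi"}, "IN", {"medicine"}),
    Participant("p2", {"en"}, "US", set()),
    Participant("p3", {"de", "en"}, "DE", {"law"}),
]

# A German-language safety review needing legal expertise:
print([p.pid for p in match_participants(pool, language="de", expertise={"law"})])
# -> ['p3']
```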
The market consequence is a bigger software layer around human labor. Prolific sells self-serve studies and API access that plug into model workflows. Surge similarly combines annotation labor with dashboards, quality metrics, and red-teaming tools. Once compliance teams need recurring evidence, these platforms can expand from project-based labeling into continuous monitoring and evaluation infrastructure.
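A hedged sketch of what "API access that plugs into model workflows" can look like from the buyer's side. The endpoint URL, payload shape, and `request_human_review` helper below are hypothetical placeholders, not Prolific's or Surge's documented API:

```python
import requests

VALIDATION_API = "https://human-eval.example.com/v1/studies"  # placeholder URL

def request_human_review(model_outputs, *, participant_profile, api_key):
    """Submit a batch of model outputs for human validation.

    Sketch under assumed conventions: a real integration would follow the
    vendor's documented auth, payload schema, and webhook/polling model.
    """
    study = {
        "participants": participant_profile,  # e.g. {"language": "de", "expertise": "law"}
        "tasks": [{"output": text} for text in model_outputs],
        "quality_checks": ["attention", "inter_rater_agreement"],
    }
    resp = requests.post(
        VALIDATION_API,
        json=study,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["study_id"]  # results arrive later via polling or webhook
```

The point of the software layer is exactly this: human review becomes one more callable step in a model pipeline rather than a bespoke procurement exercise.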
Over the next five years, the winners in human data will look less like staffing vendors and more like compliance infrastructure for AI. The durable position will come from owning the workflow where companies source the right humans, run repeatable evaluations, store evidence, and show regulators, customers, and internal risk teams that their models were actually checked by people.
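The "store evidence and show regulators" half of that workflow is the easiest part to sketch. Assuming the record format from the earlier sketch, an append-only JSONL file plus a summary query is a toy version of what an evidence store does; `EVIDENCE_FILE`, `append_evidence`, and `audit_summary` are illustrative names, and real systems would add access control, retention policies, and tamper evidence:

```python
import json
from collections import Counter
from pathlib import Path

EVIDENCE_FILE = Path("validation_evidence.jsonl")  # illustrative append-only store

def append_evidence(record: dict) -> None:
    """Append one human-validation record as a single JSON line."""
    with EVIDENCE_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def audit_summary() -> dict:
    """The kind of roll-up a risk team or regulator asks for:
    how many outputs were human-checked, per risk scenario and verdict."""
    counts = Counter()
    with EVIDENCE_FILE.open() as f:
        for line in f:
            rec = json.loads(line)
            counts[f'{rec["risk_scenario"]}/{rec["verdict"]}'] += 1
    return dict(counts)

append_evidence({"risk_scenario": "medical advice", "verdict": "escalate"})
print(audit_summary())  # e.g. {'medical advice/escalate': 1}
```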