Claude Code and Codex Power Programmatic Evals

Ops lead at Scale AI on using Claude Cowork & Codex for QC automation and multi-tool debugging at scale

Today, at least eighty to ninety percent of programmatic evals run through Claude Code or Codex because they've gotten so good.

This shows that coding agents have crossed from helper tools into production QA infrastructure. At Scale, Claude Code and Codex now handle most of the first pass on programmatic evals for computer-use agent projects. The bottleneck is no longer writing eval scripts by hand. It is designing good multi-layer checks and deciding where humans still need to sample edge cases such as file naming, formatting, and other nitpicky ground-truth details.

In practice, these evals are not vague model judgments. They are code-driven checks over agent outputs and task flows. The ops lead describes four or five layered eval levels, with Claude Code or Codex running the first layer and humans still doing final cross-checks where small formatting errors can break a task.
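
To make "code-driven checks" concrete, here is a minimal Python sketch of two automated layers of the kind the paragraph describes, one for file naming and one for formatting. Everything in it is an assumption for illustration: the `AgentOutput` shape, the snake_case `.csv` naming rule, the check functions, and the stop-at-first-failure policy are hypothetical and are not Scale's actual eval code.

```python
import re
from dataclasses import dataclass

# Hypothetical shape of an agent's output: file name -> file text.
AgentOutput = dict[str, str]


@dataclass
class CheckResult:
    layer: str
    passed: bool
    detail: str = ""


def check_file_naming(output: AgentOutput) -> CheckResult:
    # Assumed convention: lowercase snake_case names ending in .csv.
    pattern = re.compile(r"[a-z0-9_]+\.csv")
    bad = [name for name in output if not pattern.fullmatch(name)]
    return CheckResult("file_naming", not bad, f"bad names: {bad}" if bad else "")


def check_formatting(output: AgentOutput) -> CheckResult:
    # Every non-empty row must have as many commas as the header row.
    bad = []
    for name, text in output.items():
        rows = [row for row in text.splitlines() if row.strip()]
        if not rows or any(r.count(",") != rows[0].count(",") for r in rows):
            bad.append(name)
    return CheckResult("formatting", not bad, f"ragged files: {bad}" if bad else "")


# Automated layers, cheapest first. Human cross-checks sit outside this list.
LAYERS = [check_file_naming, check_formatting]


def run_layers(output: AgentOutput, layers=LAYERS) -> list[CheckResult]:
    results = []
    for layer in layers:
        result = layer(output)
        results.append(result)
        if not result.passed:
            break  # stop at the first failing layer
    return results


if __name__ == "__main__":
    for r in run_layers({"Report Q1.csv": "a,b\n1,2\n"}):
        print(r)
```

In this sketch a misnamed file fails the first layer and never reaches the formatting check. The human cross-check described by the ops lead would run after these layers and is deliberately not modeled here.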

The shift is especially important for computer-use work, where models interact with interfaces instead of clean APIs. Three months earlier, internal evals on these projects were too unreliable to trust. Now the same team routes eighty to ninety percent of programmatic evals through coding agents first, which points to a sharp reliability jump in multi-step debugging and test generation.

This fits a broader market pattern. Scale has expanded from data labeling into RL environments for tool-use and computer-use workflows, while Anthropic and OpenAI are both pushing coding products as major distribution wedges. Better coding agents make eval creation cheaper, which pulls more of the human data business toward software-mediated QA instead of pure labor hours.

The likely next step is that first-pass eval automation becomes standard, and the real product edge moves to proprietary eval environments, richer audit trails, and tighter human escalation loops. As coding agents keep improving, companies like Scale can turn quality control from a labor-heavy review process into a software-defined system, with humans concentrated on the small set of failures that still matter most.
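
One way to picture a human escalation loop with an audit trail is the following Python sketch. It assumes a hypothetical policy: every automated failure goes to a human, a small random sample of passes is also reviewed, and each decision is written as one JSON line. The `AuditRecord` fields, the `route` function, and the 10% default sample rate are illustrative assumptions, not details from the interview.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass
class AuditRecord:
    task_id: str
    automated_pass: bool
    escalated: bool
    reason: str


def route(task_id: str, automated_pass: bool, rng: random.Random,
          sample_rate: float = 0.1) -> AuditRecord:
    # Failures always go to a human reviewer.
    if not automated_pass:
        return AuditRecord(task_id, False, True, "automated_failure")
    # Passes are spot-checked at the assumed sample rate.
    if rng.random() < sample_rate:
        return AuditRecord(task_id, True, True, "random_sample")
    return AuditRecord(task_id, True, False, "auto_accepted")


def to_audit_log(records: list[AuditRecord]) -> str:
    # One JSON object per line so the log can be appended to and grepped.
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a run can be reproduced
    records = [route(f"task-{i}", i % 4 != 0, rng) for i in range(8)]
    print(to_audit_log(records))
```

Keeping the reason for each routing decision in the record is what makes the trail auditable: a reviewer can later separate failures the checks caught from passes that were pulled only for sampling.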
