Rubric-Based QC Drives 85% Accuracy

Interview: Ops lead at Scale AI on using Claude Cowork & Codex for QC automation and multi-tool debugging at scale
The mechanic that unlocked 85% QC accuracy wasn't better prompts alone; it was decomposing a binary pass/fail judgment into 25–30 rubrics.

Breaking QC into 25 to 30 rubrics turns an LLM from a vague judge into a checklist worker. Instead of asking the model to decide whether an entire task was right or wrong, the workflow feeds it the submission, the QC feedback, the instructions, and the spec doc, then asks where the mismatch sits. That makes errors legible enough to train on, route in Slack, and improve over time, which is how the process moved from roughly 40% to about 85% accuracy.
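
The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the interviewee's implementation: the names `Rubric`, `QCInputs`, `RubricResult`, `build_prompt`, `run_rubrics`, and `failed_categories` are all hypothetical, the prompt wording is invented, and `judge` is a placeholder for whatever model call the real workflow uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rubric:
    rubric_id: str
    category: str   # hypothetical category name, e.g. "formatting"
    question: str   # one narrow yes/no check


@dataclass(frozen=True)
class QCInputs:
    # The four documents the article says are fed to the model.
    submission: str
    qc_feedback: str
    instructions: str
    spec_doc: str


@dataclass(frozen=True)
class RubricResult:
    rubric_id: str
    category: str
    passed: bool


def build_prompt(rubric: Rubric, inputs: QCInputs) -> str:
    """Assemble one narrow question plus the four context documents."""
    return "\n\n".join([
        f"Check: {rubric.question}",
        f"Submission:\n{inputs.submission}",
        f"QC feedback:\n{inputs.qc_feedback}",
        f"Instructions:\n{inputs.instructions}",
        f"Spec doc:\n{inputs.spec_doc}",
        "Answer PASS or FAIL for this check only.",
    ])


def run_rubrics(
    rubrics: List[Rubric],
    inputs: QCInputs,
    judge: Callable[[str], bool],  # placeholder for the model call
) -> List[RubricResult]:
    """Ask the judge once per rubric instead of once per task."""
    return [
        RubricResult(r.rubric_id, r.category, judge(build_prompt(r, inputs)))
        for r in rubrics
    ]


def failed_categories(results: List[RubricResult]) -> List[str]:
    """Localize the mismatch: categories that failed, in first-seen order."""
    seen: List[str] = []
    for res in results:
        if not res.passed and res.category not in seen:
            seen.append(res.category)
    return seen
```

The design point is that `judge` only ever answers one small question at a time, so each answer can be logged, compared against human QC, and corrected per rubric rather than per task.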

  • The key gain is localization. A raw pass or fail label only says something went wrong. A rubric says whether the problem was formatting, rule adherence, missing content, or another specific check, which gives the model a smaller and more repeatable judgment to make.
  • The human workflow also gets simpler. Once Cowork flags a rubric-level issue, the system posts the task ID, issue summary, error category, and spec doc comparison to a Slack channel for specialist review. That makes the model a triage layer rather than the final judge.
  • This mirrors a broader shift in AI evaluation infrastructure. Other companies in model training and evals are also building rubric-based verifiers and criterion-based scoring, because complex agent work is easier to audit when quality is split into named dimensions instead of one final score.
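
The triage message described in the bullets above carries four fields: task ID, issue summary, error category, and spec doc comparison. A minimal sketch of assembling that message follows. The `TriageFlag` type, the `to_slack_payload` function, the field labels, and the "Status" line are assumptions made for illustration. The single `text` key follows the basic JSON shape Slack incoming webhooks accept; the actual posting step is left out because the source does not say how the message is sent.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageFlag:
    # The four fields the article says are posted for specialist review.
    task_id: str
    issue_summary: str
    error_category: str
    spec_comparison: str


def to_slack_payload(flag: TriageFlag) -> dict:
    """Build a plain-text message body; labels are illustrative only."""
    lines = [
        f"Task ID: {flag.task_id}",
        f"Issue: {flag.issue_summary}",
        f"Category: {flag.error_category}",
        f"Spec comparison: {flag.spec_comparison}",
        "Status: needs specialist review",  # the model triages, a human decides
    ]
    return {"text": "\n".join(lines)}


if __name__ == "__main__":
    example = TriageFlag(
        task_id="T-1",  # made-up identifier
        issue_summary="Required field omitted from the response",
        error_category="missing_content",
        spec_comparison="Spec lists the field as required; submission lacks it",
    )
    print(json.dumps(to_slack_payload(example), indent=2))
```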

The next step is carrying this rubric logic into longer agent chains. As models take actions across more tools, the winning systems will judge every handoff the same way this QC flow judges every task, with explicit checkpoints, category-level failures, and clean escalation paths instead of one brittle end-to-end verdict.
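
One way to picture per-handoff checking is a chain where each step is followed by its own check, and the first failing check stops the run and escalates with a named category. The sketch below is speculative and only illustrates the idea in the paragraph above; `run_chain`, `Escalation`, and the stage names are hypothetical and do not describe any existing system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A step transforms the shared state; a check returns a failure
# category, or None when the handoff looks fine.
Step = Callable[[dict], dict]
Check = Callable[[dict], Optional[str]]


@dataclass(frozen=True)
class Escalation:
    step_name: str
    category: str


def run_chain(
    state: dict,
    stages: List[Tuple[str, Step, Check]],
) -> Tuple[dict, Optional[Escalation]]:
    """Run stages in order, checking after every handoff.

    Stops at the first failed check and reports where and why,
    rather than issuing a single verdict at the end of the chain.
    """
    for name, step, check in stages:
        state = step(state)
        category = check(state)
        if category is not None:
            return state, Escalation(name, category)
    return state, None
```

A failure surfaced this way names both the stage and the category, which is the same information the rubric-level QC flow hands to a reviewer for a single task.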