Published on 17.02.2026
TLDR: Paul Iusztin lays out an iterative error analysis framework for building AI evaluation datasets: start with 20-50 real production traces, label them with binary pass/fail judgments and written critiques, fix the obvious stuff, build a generic LLM judge using your critiques as few-shot examples, then cluster and prioritize failures to decide where specialized evaluators are actually worth the investment. The secret weapon is your labeled data, not your prompts.
Link: No Evals Dataset? Here's How to Build One from Scratch
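To make the "critiques as few-shot examples" step concrete, here is a minimal sketch (not the article's actual code) of how a handful of labeled traces, each with a binary pass/fail verdict and a written critique, could be assembled into the few-shot block of a generic LLM-judge prompt. The names `LabeledTrace` and `build_judge_prompt`, and the sample traces, are illustrative assumptions, not taken from the article.

```python
# Hypothetical sketch: turn labeled traces (pass/fail + critique) into the
# few-shot section of a generic LLM-judge prompt. Sending the resulting
# prompt to an LLM is left out; only the prompt assembly is shown.
from dataclasses import dataclass

@dataclass
class LabeledTrace:
    user_input: str    # what the user asked in production
    model_output: str  # what the system answered
    passed: bool       # your binary pass/fail judgment
    critique: str      # your written explanation of the judgment

def build_judge_prompt(examples: list[LabeledTrace],
                       candidate_input: str,
                       candidate_output: str) -> str:
    """Assemble a judge prompt whose few-shot examples are your own critiques."""
    shots = "\n\n".join(
        f"Input: {t.user_input}\n"
        f"Output: {t.model_output}\n"
        f"Verdict: {'PASS' if t.passed else 'FAIL'}\n"
        f"Critique: {t.critique}"
        for t in examples
    )
    return (
        "You are an evaluator. Judge whether the output correctly and "
        "helpfully answers the input. Reply with PASS or FAIL and a "
        "one-sentence critique.\n\n"
        f"Labeled examples:\n\n{shots}\n\n"
        "Now judge:\n"
        f"Input: {candidate_input}\n"
        f"Output: {candidate_output}\n"
        "Verdict:"
    )

if __name__ == "__main__":
    # Two made-up labeled traces standing in for the 20-50 real ones.
    labeled = [
        LabeledTrace("Reset my password",
                     "Click 'Forgot password' on the login page.",
                     True, "Accurate and actionable."),
        LabeledTrace("Cancel my order #123",
                     "Sure, your subscription is cancelled.",
                     False, "Confuses an order with a subscription and never confirms the order ID."),
    ]
    print(build_judge_prompt(labeled,
                             "Where can I download my invoice?",
                             "Invoices are under Billing > History."))
```

The point the sketch tries to capture is the article's "secret weapon" claim: the judge prompt itself is generic, and the evaluation signal comes from the labeled critiques you feed into it.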