Published on 24.03.2026
TLDR: A 7-part AI Evals & Observability series from Decoding AI covering: where evals fit in the development lifecycle, building datasets from production traces, generating synthetic datasets, designing LLM judges, evaluating the evaluators, RAG evaluation with exactly 6 metrics, and lessons from 6 months of running evals in production.
Lesson 1: Integrating AI Evals Into Your AI App - Covers the three core scenarios where evals matter: optimization during development, regression testing before merging, and production monitoring on live traffic. Explains the difference between guardrails and evaluators, and why confusing them leaves gaps in your system. The minimum viable tech stack: a custom annotation tool and an LLMOps platform.
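The guardrail/evaluator distinction can be sketched in a few lines. This is an illustrative example, not code from the series: a guardrail runs inline and can block a response before the user sees it, while an evaluator only scores it after the fact for monitoring. All function names and checks here are hypothetical.

```python
def guardrail_no_pii(response: str) -> bool:
    """Inline guardrail: returns False to block the response before delivery."""
    blocked_terms = ["ssn:", "credit card:"]
    return not any(term in response.lower() for term in blocked_terms)

def evaluator_length_score(response: str, max_words: int = 100) -> float:
    """Offline evaluator: a score in [0, 1] that is logged, never blocking."""
    words = len(response.split())
    return 1.0 if words <= max_words else max_words / words

response = "Your SSN: 123-45-6789"
if not guardrail_no_pii(response):
    response = "Sorry, I can't share that."   # guardrail acts immediately

score = evaluator_length_score(response)      # evaluator only observes
```

Mixing these up in either direction causes the gaps the lesson warns about: a slow LLM judge used as a guardrail adds latency, and a guardrail used as your only evaluator hides quality regressions that never trip the block.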
Lesson 2: Build an AI Evals Dataset from Scratch - Teaches the error analysis flywheel: sample traces, label manually, build evaluators iteratively, perform error analysis, and create specialized evaluators. Explains why one "benevolent dictator" should own labeling consistency across your team. How to graduate from generic to specialized evaluators as your understanding deepens.
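The flywheel above can be sketched as a loop: label traces, run a generic evaluator, diagnose where it disagrees with the human labels, then spin off a specialized evaluator for the failure mode you found. All data and function names below are illustrative assumptions, not code from the lesson.

```python
# Hand-labeled traces (in practice, labeled by the "benevolent dictator").
traces = [
    {"output": "Paris is the capital of France.", "label": "pass"},
    {"output": "", "label": "fail"},                          # empty response
    {"output": "As an AI, I cannot help.", "label": "fail"},  # refusal
]

def generic_evaluator(output: str) -> str:
    """Starting point: a generic check that non-empty output passes."""
    return "pass" if output.strip() else "fail"

# Error analysis: find traces where the evaluator disagrees with the human label.
disagreements = [t for t in traces if generic_evaluator(t["output"]) != t["label"]]
# The refusal slipped past the generic check -> graduate to a specialized evaluator.

def refusal_evaluator(output: str) -> str:
    """Specialized evaluator born from error analysis: catch refusals."""
    return "fail" if "i cannot" in output.lower() else "pass"
```

Each turn of the loop shrinks the disagreement set and replaces one generic check with a sharper, failure-mode-specific one.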
Lesson 3: Generate Synthetic Datasets for AI Evals - Covers why you should generate only inputs, not outputs, and let your real app produce the outputs. How to think in dimensions like persona, feature, scenario, and input modality to avoid mode collapse. Tester agents for simulating multi-turn conversations. The reverse workflow for RAG: generate questions from your knowledge base, not the other way around.
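Thinking in dimensions can be made concrete with a cross-product: enumerate every combination of persona, feature, and scenario, and prompt the generator with one combination at a time so the inputs cover the space instead of collapsing to one mode. The dimension values below are hypothetical examples.

```python
from itertools import product

personas = ["new user", "power user"]
features = ["search", "summarize"]
scenarios = ["happy path", "ambiguous request"]

# One generation prompt per combination; outputs come from the real app, not the generator.
prompts = [
    f"Generate a realistic user input: a {p} using the {f} feature in a {s}."
    for p, f, s in product(personas, features, scenarios)
]
```

Two values per dimension across three dimensions yields 2 x 2 x 2 = 8 distinct prompts; add a fourth dimension like input modality and coverage multiplies accordingly.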
Lesson 4: How to Design Evaluators - Teaches the evaluation harness: infrastructure that automates running evaluators across your dataset. When to use fast, deterministic code-based evaluators versus flexible, nuanced LLM judges. Common design mistakes. Advanced designs for multi-turn conversations and agentic workflows.
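A minimal harness can be sketched as a loop that runs every evaluator over every dataset row and collects the scores. The names and toy evaluators here are assumptions for illustration, not the series' implementation; in practice an LLMOps platform typically provides this layer.

```python
def run_harness(dataset, evaluators):
    """Run each named evaluator on each row's output; return one score row per input."""
    results = []
    for row in dataset:
        scores = {name: fn(row["output"]) for name, fn in evaluators.items()}
        results.append({"input": row["input"], **scores})
    return results

dataset = [{"input": "hi", "output": "hello there"}]
evaluators = {
    # Fast, deterministic code-based checks; an LLM judge would slot in the same way.
    "non_empty": lambda out: float(bool(out.strip())),
    "short": lambda out: float(len(out.split()) <= 5),
}
results = run_harness(dataset, evaluators)
```

The payoff is that adding a new evaluator or a new dataset row requires no changes to the loop itself, which is what makes regression testing before a merge cheap.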
Lesson 5: How to Evaluate the Evaluator - Covers the iterative refinement loop: measure alignment, diagnose disagreements, adjust few-shot examples, and re-measure. Dealing with non-determinism: why LLM judges give different answers on the same input, and how to stabilize them.
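Both halves of this lesson reduce to small computations: alignment can be measured as simple agreement between judge labels and human labels, and a non-deterministic judge can be stabilized by majority vote over repeated runs. This is a hedged sketch with hypothetical names; real setups often use chance-corrected agreement (e.g. Cohen's kappa) instead of raw accuracy.

```python
from collections import Counter

def alignment(judge_labels, human_labels):
    """Fraction of examples where the LLM judge agrees with the human label."""
    agree = sum(j == h for j, h in zip(judge_labels, human_labels))
    return agree / len(human_labels)

def majority_vote(samples):
    """Stabilize a non-deterministic judge: run it k times, keep the modal verdict."""
    return Counter(samples).most_common(1)[0][0]
```

The refinement loop is then: measure `alignment`, inspect the disagreements, adjust the judge's few-shot examples, and re-measure until the score stops improving.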
Lesson 6: RAG Evaluation: The Only 6 Metrics You Need - Shows that any RAG system has exactly three variables: Question, Context, and Answer, and exactly six possible directed relationships between them. Every RAG metric maps to one of these six relationships. Tier 1: Retrieval metrics. If retrieval is broken, nothing else matters. Tier 2: The three core RAG metrics you always need. Tier 3: When core metrics cannot explain the failure.
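The "exactly six" claim is a counting argument: three variables yield six ordered pairs, and each ordered pair is one relationship a metric can measure. A minimal sketch (the metric names in the comments are a common illustrative mapping, not necessarily the series' exact labels):

```python
from itertools import permutations

variables = ["Question", "Context", "Answer"]

# All ordered pairs of distinct variables: 3 * 2 = 6 directed relationships.
relationships = list(permutations(variables, 2))
# e.g. ("Context", "Question") -> context relevance (did we retrieve the right thing?)
#      ("Answer", "Context")   -> groundedness (is the answer supported by context?)
#      ("Answer", "Question")  -> answer relevance (does the answer address the question?)
```

Since every metric must relate some pair of these three variables, no seventh fundamentally distinct metric is possible.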
Lesson 7: Lessons from 6 Months of Evals on a Production AI Companion - Guest post by Alejandro Aboy, Senior Data Engineer at Workpath. Covers three observability problems most teams hit: falling for generic metrics, skipping manual annotation, and not treating AI agents as data products. How to use Opik's architecture for production monitoring and evals. How to reverse-engineer evaluation criteria from real traces instead of guessing upfront.
The series is sponsored by Opik, the open-source LLMOps platform used by Uber, Etsy, Netflix, and more. Opik provides custom LLM judges, experiment comparison, and production monitoring with alerts when scores drop below thresholds.