Published on 03.02.2026
AI agents in production face critical observability challenges that simple metrics alone can't solve. This article explores practical frameworks for implementing AI observability, including how to define meaningful evaluation criteria, leverage tools like Opik for tracing and evaluation, and treat AI agents as data products rather than just deployed features. The key insight: observability isn't just about metrics—it's about creating a continuous feedback loop that treats your AI system as a living product that evolves with real-world usage.
When AI agents are given too much power in production, nuanced problems emerge that typical testing doesn't catch.
These problems persist even when metric scores look good, revealing a fundamental gap between synthetic testing and production reality. Without a dedicated observability framework, they are very hard to spot.
Link: Decoding AI - Behind the Scenes of AI Observability in Production
Standard observability tools provide out-of-the-box metrics like Hallucination, AnswerRelevance, ContextPrecision, and ContextRecall. The problem: knowing that your hallucination score is 1 tells you nothing about how to improve it. You need to identify which specific aspects of your own use case are worth evaluating.
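For context, this is roughly what scoring with one of those out-of-the-box metrics looks like. It is a minimal sketch assuming the Opik Python SDK and an LLM provider key for the judge model; class names and score() parameters may differ between SDK versions.

```python
# Minimal sketch: one of Opik's out-of-the-box LLM-as-judge metrics.
# Assumes `pip install opik` and an LLM provider key for the judge model;
# the exact API may differ between SDK versions.
from opik.evaluation.metrics import Hallucination  # AnswerRelevance, ContextPrecision, ContextRecall are used the same way

metric = Hallucination()
result = metric.score(
    input="What is our refund window?",
    output="Refunds are accepted within 90 days.",
    context=["Refunds are accepted within 30 days of purchase."],
)
print(result.value, result.reason)
# A value of 1 flags a hallucination, but it says nothing about which part
# of the pipeline to fix, which is exactly the gap described above.
```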
LLMs produce different outputs on repeated runs, and the same applies to LLM-as-judge evaluations: running the same evaluation multiple times yields different results. The solution is to keep evaluation metrics lean and grounded, favoring checks that are binary and unambiguous so that multiple evaluators can apply them independently and reach the same verdict.
Example: Instead of a fuzzy "hallucination" metric, define a specific, binary check: "Verify if the search_knowledge tool was called and if the URL in the output matches the tool's actual output."
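As a plain-Python sketch of such a check: the trace structure below (a dict with tool-call spans and a final output) is illustrative only and is not Opik's actual trace schema.

```python
# Sketch of a binary, unambiguous check: was the search_knowledge tool called,
# and does every URL in the final answer appear in that tool's actual output?
# The trace dict used here is illustrative, not Opik's real trace schema.
import re

def url_is_grounded(trace: dict) -> float:
    tool_outputs = [
        span["output"]
        for span in trace.get("spans", [])
        if span.get("name") == "search_knowledge"
    ]
    if not tool_outputs:
        return 0.0  # the tool was never called

    answer_urls = re.findall(r"https?://\S+", trace.get("output", ""))
    grounded = all(any(url in out for out in tool_outputs) for url in answer_urls)
    return 1.0 if grounded else 0.0  # binary: grounded or not, no partial credit
```

Two independent reviewers running this check on the same trace will always agree, which is what makes it usable as a stable evaluation signal.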
You cannot automate away all manual review without missing critical insights. LLMs evaluating other LLMs might hide important findings behind isolated prompts or poorly phrased conditions. Manual reviews surface patterns that automated metrics miss—like agents suggesting capabilities they don't have or answering questions outside their scope.
The worst mistake is treating observability as a one-time evaluation exercise with binary good/bad scores. Each conversation creates an invisible roadmap of improvements: bug fixes, new capabilities, refined prompts, or entirely new feature opportunities. Production data reveals improvements that synthetic data never will.
AI observability tools like Opik follow familiar software engineering principles, much like Sentry does for error logging.
Opik is an open-source LLMOps platform used by Uber, Etsy, and Netflix. Its key features include tracing, evaluation (including online LLM-as-judge evaluations), and annotation workflows.
Opik offers both open-source and managed versions, with a generous free tier of 25K spans/month.
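To illustrate the Sentry-like instrumentation model, here is a minimal tracing sketch. It assumes the Opik SDK's track decorator and a workspace configured via opik configure; the agent functions themselves are hypothetical stand-ins.

```python
# Sketch: instrumenting an agent with Opik tracing, similar in spirit to
# wrapping code with Sentry. Assumes `pip install opik` and `opik configure`;
# each decorated call is logged as a span nested under the outer trace.
from opik import track

@track
def search_knowledge(query: str) -> str:
    # ... call your real retrieval backend here (hypothetical stub) ...
    return "Refunds are accepted within 30 days. Source: https://docs.example.com/refunds"

@track
def answer_question(question: str) -> str:
    context = search_knowledge(question)
    # ... call your LLM here; this stub just echoes the retrieved context ...
    return f"According to our docs: {context}"

answer_question("What is the refund window?")
```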
Link: Try Opik for Free
Before configuring evaluations, understand what your agent actually does through production traces. Use a reverse-engineering approach: read real traces first, then derive the evaluation criteria from the behavior you actually observe.
Key principle: Each metric should focus on ONE thing with ONE clear score definition. Avoid subjective outcomes.
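As a sketch of that principle, the URL-grounding check from earlier can be packaged as a custom metric with a single, unambiguous score definition. This assumes Opik's custom-metric interfaces (base_metric.BaseMetric and score_result.ScoreResult); module paths and constructor details may vary between SDK versions.

```python
# Sketch: a custom metric that checks ONE thing and returns ONE binary score.
# Assumes Opik's base_metric.BaseMetric / score_result.ScoreResult interfaces;
# exact signatures may vary between SDK versions.
import re
from opik.evaluation.metrics import base_metric, score_result

class UrlGrounding(base_metric.BaseMetric):
    """1.0 if every URL in the answer appears in the search_knowledge output, else 0.0."""

    def __init__(self, name: str = "url_grounding"):
        self.name = name

    def score(self, output: str, tool_output: str, **ignored_kwargs) -> score_result.ScoreResult:
        urls = re.findall(r"https?://\S+", output)
        grounded = all(url in tool_output for url in urls)
        return score_result.ScoreResult(
            name=self.name,
            value=1.0 if grounded else 0.0,
            reason="All cited URLs come from the tool output"
            if grounded
            else "A cited URL does not appear in the tool output",
        )
```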
Once you understand what to evaluate, configure online evaluations that automatically score incoming production traces with your LLM-as-judge rules.
Example: A "Response Format Compliance" metric evaluates if agents follow prompt formatting standards. After manual review, you might refine the scoring criteria from "match at least 2 of 5 standards" to "match ALL standards."
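A hypothetical sketch of such a compliance check, with the scoring rule tightened after manual review; the individual format standards below are illustrative placeholders, not the author's actual criteria.

```python
# Sketch: a "Response Format Compliance" check whose scoring rule was tightened
# after manual review, from "match at least 2 of 5 standards" to "match ALL".
# The standards themselves are illustrative placeholders.
FORMAT_CHECKS = {
    "no_raw_json": lambda r: not r.lstrip().startswith("{"),
    "uses_bullet_points": lambda r: "\n- " in r,
    "cites_a_source": lambda r: "http" in r,
    "under_1500_chars": lambda r: len(r) <= 1500,
    "ends_with_next_step": lambda r: r.rstrip().endswith("?"),
}

def format_compliance(response: str, require_all: bool = True) -> float:
    passed = sum(check(response) for check in FORMAT_CHECKS.values())
    if require_all:                       # refined criterion: match ALL standards
        return 1.0 if passed == len(FORMAT_CHECKS) else 0.0
    return 1.0 if passed >= 2 else 0.0    # original, looser criterion
```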
Manual annotation is necessary but can be exhausting. Use structured approaches:
Prepare an annotation agenda using tooling like the opik-weekly command to understand what happened in production over the past week before you start reviewing (a sketch of the kind of query such a tool might run is shown below).
Use the annotation-review command to work through the queued conversations and consolidate what you find.
Document changes to evaluation criteria in a versioned location (Google Docs, Notion, Confluence), since Opik doesn't version online evaluations.
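As promised above, here is the kind of query a weekly annotation-agenda helper might run. This is not the author's opik-weekly or annotation-review tooling; it assumes the Opik SDK client exposes a search_traces method and that returned traces carry start_time and feedback_scores attributes, all of which may differ by SDK version.

```python
# Sketch: pulling the last week's traces to build an annotation agenda.
# Assumes Opik.search_traces() exists and that traces expose start_time and
# feedback_scores; this is an illustration, not the author's actual tooling.
import datetime as dt
from opik import Opik

client = Opik()
traces = client.search_traces(project_name="my-agent", max_results=500)

one_week_ago = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=7)
recent = [t for t in traces if t.start_time and t.start_time >= one_week_ago]
flagged = [t for t in recent if any(s.value == 0.0 for s in (t.feedback_scores or []))]

print(f"{len(recent)} traces in the last week, {len(flagged)} scored 0 on at least one metric")
```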
The Opik MCP Server (optional but powerful) connects your Opik data to MCP-compatible assistants.
Available on GitHub: opik-mcp
The author ran AI agents at scale for nearly a year, initially relying on basic PostgreSQL analytics that tracked roughly 100 messages per day. That approach offered no insight into how to improve the agents. After months of manual work (running local LLM-as-judge scripts, manually versioning prompts, guessing what to evaluate without observing real reactions, and burning out on endless annotation), the breakthrough came with the discovery of Online Evaluations and the Opik MCP server.
The key lesson: traditional AI observability tutorials show the 1% happy path, not the iterative, cyclical process of continuous evolution. Proper implementation means treating the agent as a data product and observability as a continuous feedback loop rather than a one-off evaluation.
Link: Decoding AI Magazine - Observability for RAG Agents
Recommended Reading: