Evals - testing AI agents

Published on 7/28/2025

When you go deeper into agent development, you will encounter the problem of evaluating your agents. In short, evals are tests for LLMs and agents. The challenge is that LLMs are non-deterministic, so you can't just write a test that will pass or fail. Tests are binary: they either pass or fail. Evals are more like "fuzzy" tests: they give you a score.

Evals provide a way to measure the accuracy and quality of the output. For instance, you can test for hallucination, consistency, faithfulness, and much more.
Evals are the new metrics for LLMs, and writing and maintaining them is a skill of its own.

Challenges

  • non-deterministic output
  • many different types of metrics
    • response quality
    • task completion verification
    • user satisfaction
    • tool accuracy (generally, the more tools there are to choose from, the worse the LLM's tool-selection decisions)
    • cost
    • security
    • performance
    • compliance
    • fairness
    • transparency
    • robustness
    • explainability
    • interpretability
    • accountability
    • auditability
    • reproducibility
    • generalization
    • transferability

Evaluation types

Use an LLM to evaluate other LLMs (the "LLM as a judge" approach).

So one LLM acts as a "judge" for another. For this we need a prompt or system prompt that defines the rules of the evaluation. For instance, if the base LLM's task was to generate an image description, we can use the following prompt:

You are evaluating an image description generated by an LLM. Your task is to evaluate the quality of the description. You will be given the description and 3 criteria to evaluate it against:

  1. Is the description accurate?
  2. Is the description detailed?
  3. Is the description creative?

Give 1 for yes and 0 for no for each criterion, then give an overall score for the description. The overall score is the sum of the per-criterion scores, so it can be 0, 1, 2, or 3, with 0 being the worst and 3 the best.
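
As a sketch of how such a judge could be called from code, here is a hedged TypeScript example. The callJudgeLLM helper is an assumption standing in for whatever chat-completion client you use, and the prompt is condensed to ask for JSON so the result is easy to parse:

// Hypothetical helper: sends a system prompt plus the text to judge to your LLM of choice
// and returns the raw model reply as a string.
declare function callJudgeLLM(systemPrompt: string, userMessage: string): Promise<string>

const JUDGE_PROMPT = `You are evaluating an image description generated by an LLM.
For each criterion give 1 for yes and 0 for no: accurate, detailed, creative.
Reply only with JSON, e.g. {"accurate": 1, "detailed": 0, "creative": 1}.`

// Returns the overall score (0-3) as the sum of the three criteria, as described above.
export async function judgeImageDescription(description: string): Promise<number> {
  const raw = await callJudgeLLM(JUDGE_PROMPT, description)
  const scores = JSON.parse(raw) as Record<string, number> // will throw if the judge ignores the JSON instruction
  return (scores.accurate ?? 0) + (scores.detailed ?? 0) + (scores.creative ?? 0)
}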

There might be one or many different metrics used in an evaluation. It is good practice to ask one of the currently best models (chats) what is worth testing in your case. For instance, you could ask Claude:

I am building evaluation system for LLMs. I want to evaluate image description generated by LLM. What are the most important metrics I should use to evaluate the quality of the description?

Remember that you cannot improve what you do not measure, so having some kind of evaluation system in place is more than just good practice. I can bet that without one, your AI agent will not be ready for production, and you will lose money, time, and users.

Human feedback

If you are able to get feedback from your agent's users, it is the most valuable data you can have. However, collecting it is hard: it is hard to get users to provide feedback at all, let alone in a structured way and in a timely manner. A good practice is to show the system (your agent) to a limited number of people and collect feedback from them, improve the system based on that feedback, then show it to another (larger) group. Repeat this process until you are satisfied with the results.
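
If you do collect feedback, it helps to store it in a structured shape from day one so it can later feed your evaluation data. A minimal sketch (the field names here are just an assumption, not part of any framework):

// Hypothetical shape for one piece of user feedback on an agent response.
export type UserFeedback = {
  conversationId: string
  rating: 'thumbs_up' | 'thumbs_down'
  comment?: string // optional free-text explanation
  createdAt: string // ISO timestamp
}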

Eval Framework

At the time of writing this article, most evaluation frameworks for LLMs and agents are for Python.
There is also the OpenAI Evals API. Read more about it in the guide.

You could create a generic evaluation framework in TypeScript or other language you prefer. The idea is based on these key components:

Experiments and Result Sets

Experiment: This is the top-level test suite for a specific feature or agent behavior you want to improve. For example, you might have an "ImageGeneration" experiment to test your image-making tool or a "CustomerSupportBot" experiment to test its helpfulness. In the code, this is the Experiment type, which has a name and contains a history of test sets.

export type Experiment = {
  name: string
  history: SetResult[]
}

Set: A Set represents a single, complete execution of your test data at a specific point in time (e.g., before and after a system prompt change). This allows you to track performance over time and see if your changes are improving or degrading the model's quality. The code captures this with a timestamp and the average scores for the set.

export type RunResult = {
  input: string
  output: unknown
  expected: unknown
  reference?: unknown
  scores: Record<string, number>
}

export type SetResult = {
  timestamp: string
  runs: RunResult[]
  averageScores: Record<string, number>
}
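
For example, after two evaluation sets an experiment's history might look like this (the values are purely illustrative, and the individual runs are omitted):

const exampleExperiment: Experiment = {
  name: 'ImageGeneration',
  history: [
    {
      timestamp: '2025-07-01T10:00:00.000Z',
      runs: [], // individual RunResults omitted for brevity
      averageScores: { accuracy: 0.62 },
    },
    {
      timestamp: '2025-07-08T09:30:00.000Z',
      runs: [],
      averageScores: { accuracy: 0.71 }, // the prompt change helped
    },
  ],
}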

The Evaluation Dataset (The "Golden Set")

export type TestCase = {
  input: string
  expected: unknown
  reference?: unknown
}

This is your ground truth. It's a collection of test cases that the framework will run your LLM against. As seen in the data parameter for the runEval function (see below), each test case typically includes:

input: The prompt or query you want to send to your agent.
expected: The ideal or "golden" output you hope to receive. This could be a specific string, a JSON object representing a tool call, or any other structured data.
reference: (Optional) Contextual information that the model's output should be based on. This is critical for testing faithfulness in Retrieval-Augmented Generation (RAG) systems, where the reference would be the retrieved documents.

The Task Under Test

This is the actual function or agent you are evaluating. The runEval function is generic and accepts a task function as an argument. This task takes an input from your dataset and produces an output. This design makes the framework highly flexible, as it can test anything from a simple LLM call to a complex, multi-tool agent.
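
The TaskFunction type is referenced throughout the code but not shown elsewhere in this article; based on how runEval calls it, a minimal version could be:

// Inferred from runEval below: a task receives the test-case input and returns any output, asynchronously.
export type TaskFunction = (input: string) => Promise<unknown>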

Scorers: The Heart of Measurement

This is where the actual evaluation logic lives. Scorers are functions that compare the output of your task to the expected value or reference context and produce a numerical score (typically between 0 and 1). The framework can use multiple scorers for a single run, allowing you to measure different aspects of quality simultaneously.
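
Similarly, the Scorer type imported just below is not spelled out elsewhere; here is a sketch consistent with how the scorers and runEval use it:

// Inferred from usage: a named scoring function that may return its score synchronously or as a Promise.
export type Scorer = {
  name: string
  score: (args: {
    input: string
    output: unknown
    expected: unknown
    reference?: unknown
  }) => number | Promise<number>
}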

import { Scorer } from './types'

export const simpleStringMatchScorer: Scorer = {
  name: 'exactMatch',
  score: ({ output, expected }) => (output === expected ? 1 : 0),
}

export const dummyRelevanceScorer: Scorer = {
  name: 'dummyRelevance',
  score: ({ input, output }) =>
    typeof input === 'string' && typeof output === 'string'
      ? Math.min(input.length / output.length, 1)
      : 0,
}

export const allScorers: Scorer[] = [simpleStringMatchScorer, dummyRelevanceScorer]

Common types of scorers include:

  • Faithfulness/Groundedness: Does the output only contain information present in the reference context? This helps measure and prevent hallucinations.
  • Answer Relevancy: Is the output relevant to the user's input query?
  • Tool Usage: Did the agent call the correct tool with the correct arguments? The expected value would be the desired tool call (see the sketch after this list).
  • Factuality: Does the output align with known facts? (Requires a dataset with factual questions and answers).
  • Toxicity/Safety: Does the output contain harmful or inappropriate content?
  • Cost: How much did the output cost? (For systems that track costs per token or per API call).
  • Security: Does the output contain any security risks or vulnerabilities?
  • Performance: How long did it take to generate the output?
  • Compliance: Does the output comply with any regulations or policies?
  • Fairness: Is the output fair and unbiased?
  • Transparency: Is the output transparent and explainable?
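
As an example of the tool-usage scorer mentioned in the list above, here is a hedged sketch. It assumes expected holds the desired tool call as a name plus arguments; the ToolCall shape is an assumption for illustration, and the argument comparison is naive:

import { Scorer } from './types'

// Hypothetical tool-call shape; adjust to whatever your agent actually emits.
type ToolCall = { name: string; arguments: Record<string, unknown> }

export const toolUsageScorer: Scorer = {
  name: 'toolUsage',
  score: ({ output, expected }) => {
    const actual = output as ToolCall | undefined
    const wanted = expected as ToolCall | undefined
    if (!actual || !wanted) return 0
    const sameTool = actual.name === wanted.name
    // Naive deep-equality via JSON: good enough for a sketch, but sensitive to key order.
    const sameArgs = JSON.stringify(actual.arguments) === JSON.stringify(wanted.arguments)
    return sameTool && sameArgs ? 1 : 0
  },
}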

The runEval function orchestrates the entire process.

import { Experiment, TestCase, TaskFunction, Scorer, SetResult, RunResult } from './types'
import { loadExperiment, saveExperiment } from './storage'

export async function runEval(params: {
  experimentName: string
  data: TestCase[]
  task: TaskFunction
  scorers: Scorer[]
}) {
  const { experimentName, data, task, scorers } = params

  const experiment = loadExperiment(experimentName) ?? {
    name: experimentName,
    history: [],
  }

  const runs: RunResult[] = []

  for (const item of data) {
    const output = await task(item.input)

    const scores: Record<string, number> = {}

    for (const scorer of scorers) {
      scores[scorer.name] = await scorer.score({
        input: item.input,
        output,
        expected: item.expected,
        reference: item.reference,
      })
    }

    runs.push({
      input: item.input,
      output,
      expected: item.expected,
      reference: item.reference,
      scores,
    })
  }

  const averageScores: Record<string, number> = {}
  for (const scorer of scorers) {
    const name = scorer.name
    const total = runs.reduce((sum, run) => sum + run.scores[name], 0)
    averageScores[name] = total / runs.length
  }

  const setResult: SetResult = {
    timestamp: new Date().toISOString(),
    runs,
    averageScores,
  }

  experiment.history.push(setResult)

  saveExperiment(experiment)

  const previous = experiment.history.length > 1 ? experiment.history.at(-2) : undefined

  console.log(`\n📈 Experiment: ${experimentName}`)
  for (const [k, v] of Object.entries(averageScores)) {
    const delta = previous?.averageScores[k] !== undefined ? v - previous.averageScores[k] : 0
    const color = delta > 0 ? '🟢' : delta < 0 ? '🔴' : '⚪'
    console.log(`→ ${k}: ${(v * 100).toFixed(2)}% (${color}${delta.toFixed(2)})`)
  }

  return setResult
}

It generally follows these steps:

  • Initiate: A developer starts an evaluation for a named experiment.
  • Iterate: The framework loops through every item in the provided data set.
  • Execute: For each item, it calls the task with the input to get the actual output.
  • Score: It then passes the input, output, expected, and reference to all the provided scorers. Each scorer returns a named score.
  • Record: The results for this single run (input, output, expected, and all scores) are collected.
  • Aggregate & Compare: After all data points have been processed, the framework calculates an average score for the entire set. It then loads the score from the previous set for that experiment and shows the difference. This immediate feedback (e.g., +0.15 in green or -0.08 in red) is incredibly valuable for understanding the impact of your changes.
  • Persist: The new set of results, with its runs and aggregate scores, is saved to disk (in this case, one JSON file per experiment in the results directory). This builds a historical record of your experiment's performance.

Storage

You need to persist the results of your experiments. You can use any database you want; for simplicity, a JSON file per experiment works fine.

import fs from 'fs'
import path from 'path'
import { Experiment } from './types'

const RESULTS_DIR = './results'

if (!fs.existsSync(RESULTS_DIR)) fs.mkdirSync(RESULTS_DIR)

export function loadExperiment(name: string): Experiment | undefined {
  const filePath = path.join(RESULTS_DIR, `${name}.json`)
  if (!fs.existsSync(filePath)) return undefined
  return JSON.parse(fs.readFileSync(filePath, 'utf-8'))
}

export function saveExperiment(exp: Experiment) {
  const filePath = path.join(RESULTS_DIR, `${exp.name}.json`)
  fs.writeFileSync(filePath, JSON.stringify(exp, null, 2))
}

Entry Point (Example Usage)

// experiments/hello.eval.ts

import { runEval } from '../runEval'
import { allScorers } from '../scorers'
import { TestCase } from '../types'

const mockTask = async (input: string) => {
  return input.toUpperCase() // Replace with LLM or agent
}

const goldenSet: TestCase[] = [
  {
    input: 'hello',
    expected: 'HELLO',
  },
  {
    input: 'world',
    expected: 'WORLD',
  },
]

runEval({
  experimentName: 'UppercaseAgent',
  data: goldenSet,
  task: mockTask,
  scorers: allScorers,
})
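
Running this file should print something like the following for the first set (since there is no previous set yet, the deltas are zero):

📈 Experiment: UppercaseAgent
→ exactMatch: 100.00% (⚪0.00)
→ dummyRelevance: 100.00% (⚪0.00)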

Finally, set up a package.json script:

  "scripts": {
    "eval": "npx tsx evals/run.ts",
  },

Runner script:

import 'dotenv/config'
import { join, dirname } from 'path'
import { fileURLToPath } from 'url'
import { readdir } from 'fs/promises'

const __filename = fileURLToPath(import.meta.url)
const __dirname = dirname(__filename)

const main = async () => {
  let evalPath = ''
  const evalName = process.argv[2]
  const experimentsDir = join(__dirname, 'experiments')

  try {
    const filesToRun = evalName
      ? [`${evalName}.eval.ts`]
      : (await readdir(experimentsDir)).filter((file) => file.endsWith('.eval.ts'))

    for (const file of filesToRun) {
      evalPath = join(experimentsDir, file)
      await import(evalPath)
    }
  } catch (error) {
    console.error(`Failed to run ${evalPath}:`, error)
    process.exit(1)
  }
}

main()
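
With this in place, npm run eval executes every *.eval.ts file in the experiments directory, while npm run eval -- hello (the -- is how npm forwards arguments to the script) runs only experiments/hello.eval.ts.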

Key Features

✅ Historical tracking per experiment
✅ Multiple scorers
✅ Generic task function
✅ Full test data set input
✅ JSON persistence
✅ Diff of previous set for feedback