Published on 28.07.2025
When you go deeper into agent development, you will run into the problem of evaluating your agents. In short, evals are tests for LLMs and agents. The challenge is that LLMs are non-deterministic, so you can't just write a test that will pass or fail. Tests are binary: they PASS or FAIL. Evals are more like "fuzzy" tests: instead of a verdict, they give you a score.
Evals provide a way to measure the accuracy and quality of the output. For instance, you can test for hallucination, consistency, faithfulness, and more. Evals are the new metrics for LLMs, and writing and maintaining them is a skill in its own right.
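To make the contrast concrete, here is a toy sketch (the overlap metric is illustrative, not a real eval):

import assert from 'assert'

// A test is binary: it passes or it throws.
const output = 'HELLO'
assert.strictEqual(output, 'HELLO')

// An eval is fuzzy: it returns a score. Here, a naive 0..1 character overlap.
const overlap = (a: string, b: string) =>
  [...a].filter((ch, i) => b[i] === ch).length / Math.max(a.length, b.length)

console.log(overlap('HELLO', 'HELO')) // 0.6 — a quality signal, not pass/fail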
Challenges
A common approach is to use one LLM as a "judge" for another LLM. For this we need a prompt (or system prompt) that defines the rules for evaluation. For instance, if the base LLM's task was to generate an image description, we can use the following prompt:
You are evaluating an image description generated by an LLM. Your task is to evaluate the quality of the description against 3 criteria:
- Is the description accurate?
- Is the description detailed?
- Is the description creative?
Give 1 for yes and 0 for no on each criterion, then give an overall score for the description. The overall score is the sum of the individual scores, so it can be 0, 1, 2, or 3, where 0 is the worst and 3 is the best.
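Later in this article we build a small TypeScript eval framework; wrapped as one of its scorers, this judge might look like the sketch below. callJudgeLLM is a hypothetical helper that sends a prompt to the judge model and returns its raw text reply.

import { Scorer } from './types'

// Hypothetical helper: sends a prompt to the judge model, returns its text reply.
declare function callJudgeLLM(prompt: string): Promise<string>

// LLM-as-judge scorer: asks the judge for the 0-3 overall score defined above.
export const imageDescriptionJudge: Scorer = {
  name: 'imageDescriptionJudge',
  score: async ({ output }) => {
    const reply = await callJudgeLLM(
      'You are evaluating an image description generated by an LLM. ' +
        'Score accuracy, detail and creativity with 1 (yes) or 0 (no) each, ' +
        'then reply with only the overall score (0-3) as a single number.\n\n' +
        `Description: ${output}`,
    )
    const overall = Number.parseInt(reply.trim(), 10)
    // Normalize the judge's 0-3 scale to the 0-1 range the framework uses.
    return Number.isNaN(overall) ? 0 : overall / 3
  },
}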
You might use one metric or many. It is good practice to ask one of the currently best models (chats) what is worth testing in your case. For instance, you could ask Claude:
I am building an evaluation system for LLMs. I want to evaluate image descriptions generated by an LLM. What are the most important metrics I should use to evaluate the quality of a description?
Remember that you cannot improve what you do not measure. So having an evaluation system in place is more than good practice: I can bet that an AI agent without one is not ready for production. You will lose money, time, and users.
Feedback from your agent's users is the most valuable data you can get. Collecting it is hard, though: users rarely provide feedback at all, let alone in a structured way and in a timely manner. A good practice is to show the system (your agent) to a limited number of people and collect their feedback. Next, improve the system based on that feedback and show it to another (larger) group. Repeat this process until you are satisfied with the results.
At the time of writing this article, most evaluation frameworks for LLMs and agents are for Python. There is also the OpenAI Evals API; read more about it in the guide.
You can create a generic evaluation framework in TypeScript or any other language you prefer. The idea rests on a few key components:
Experiment: This is the top-level test suite for a specific feature or agent behavior you want to improve. For example, you might have an "ImageGeneration" experiment to test your image-making tool or a "CustomerSupportBot" experiment to test its helpfulness. In the code, this is the Experiment type, which has a name and contains a history of test sets.
export type Experiment = {
name: string
history: SetResult[]
}
Set: A Set represents a single, complete execution of your test data at a specific point in time (e.g., before and after a system prompt change). This allows you to track performance over time and see if your changes are improving or degrading the model's quality. The code captures this with a timestamp and an overall score for the set.
export type RunResult = {
input: string
output: unknown
expected: unknown
reference?: unknown
scores: Record<string, number>
}
export type SetResult = {
timestamp: string
runs: RunResult[]
averageScores: Record<string, number>
}
export type TestCase = {
input: string
expected: unknown
reference?: unknown
}
Dataset: This is your ground truth. It's a collection of test cases that the framework will run your LLM against. As seen in the data parameter of the runEval function (see below), each test case typically includes:
- input: The prompt or query you want to send to your agent.
- expected: The ideal or "golden" output you hope to receive. This could be a specific string, a JSON object representing a tool call, or any other structured data.
- reference: (Optional) Contextual information that the model's output should be based on. This is critical for testing faithfulness in Retrieval-Augmented Generation (RAG) systems, where the reference would be the retrieved documents.

The Task Under Test: This is the actual function or agent you are evaluating. The runEval function is generic and accepts a task function as an argument. The task takes an input from your dataset and produces an output. This design makes the framework highly flexible: it can test anything from a simple LLM call to a complex, multi-tool agent.
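As a sketch, the task can be as thin as a single model call; callModel here is a hypothetical client standing in for your provider's SDK:

// Hypothetical model client; swap in your provider's SDK call.
declare function callModel(prompt: string): Promise<string>

// The task under test: takes a dataset input, returns the agent's output.
export const describeImageTask = async (input: string): Promise<string> => {
  return callModel(`Describe the following image: ${input}`)
}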
Scorers: This is where the actual evaluation logic lives. Scorers are functions that compare the output of your task to the expected value or reference context and produce a numerical score (typically between 0 and 1). The framework can use multiple scorers for a single run, so you can measure different aspects of quality simultaneously.
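The snippets below import Scorer and TaskFunction from './types', which weren't shown above. Here is a minimal sketch of those two shapes, inferred from how the scorers and runEval use them:

// Minimal shapes for the remaining types in './types',
// inferred from how the scorers and runEval use them.
export type TaskFunction = (input: string) => Promise<unknown> | unknown

export type Scorer = {
  name: string
  score: (args: {
    input: string
    output: unknown
    expected: unknown
    reference?: unknown
  }) => number | Promise<number>
}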
import { Scorer } from './types'

// Deterministic scorer: 1 if the output matches the expected value exactly.
export const simpleStringMatchScorer: Scorer = {
  name: 'exactMatch',
  score: ({ output, expected }) => (output === expected ? 1 : 0),
}

// Placeholder scorer: compares input/output lengths. Replace with a real
// relevance metric (embeddings, LLM-as-judge) in practice. The length guard
// avoids dividing by zero on empty outputs.
export const dummyRelevanceScorer: Scorer = {
  name: 'dummyRelevance',
  score: ({ input, output }) =>
    typeof input === 'string' && typeof output === 'string' && output.length > 0
      ? Math.min(input.length / output.length, 1)
      : 0,
}

export const allScorers: Scorer[] = [simpleStringMatchScorer, dummyRelevanceScorer]
Common types of scorers include:
- Exact or fuzzy string matching (deterministic, like simpleStringMatchScorer above)
- Semantic similarity (e.g., embedding distance between output and expected)
- LLM-as-judge (a model grades criteria such as accuracy or helpfulness)
- Faithfulness/groundedness (is the output supported by the reference context?)
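As a sketch of the last category, a naive faithfulness check could test whether each sentence of the output literally appears in the reference context; real implementations use embeddings or an LLM judge instead of substring matching.

import { Scorer } from './types'

// Naive faithfulness sketch: the fraction of output sentences that
// literally occur in the reference context.
export const naiveFaithfulnessScorer: Scorer = {
  name: 'naiveFaithfulness',
  score: ({ output, reference }) => {
    if (typeof output !== 'string' || typeof reference !== 'string') return 0
    const sentences = output.split(/[.!?]+/).map((s) => s.trim()).filter(Boolean)
    if (sentences.length === 0) return 0
    const supported = sentences.filter((s) => reference.includes(s)).length
    return supported / sentences.length
  },
}

With scorers in hand, the runEval function below ties everything together: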
import { Experiment, TestCase, TaskFunction, Scorer, SetResult, RunResult } from './types'
import { loadExperiment, saveExperiment } from './storage'

export async function runEval(params: {
  experimentName: string
  data: TestCase[]
  task: TaskFunction
  scorers: Scorer[]
}) {
  const { experimentName, data, task, scorers } = params

  // Load prior results for this experiment, or start a fresh one.
  const experiment = loadExperiment(experimentName) ?? {
    name: experimentName,
    history: [],
  }

  // Run every test case through the task and score each output.
  const runs: RunResult[] = []
  for (const item of data) {
    const output = await task(item.input)
    const scores: Record<string, number> = {}
    for (const scorer of scorers) {
      scores[scorer.name] = await scorer.score({
        input: item.input,
        output,
        expected: item.expected,
        reference: item.reference,
      })
    }
    runs.push({
      input: item.input,
      output,
      expected: item.expected,
      reference: item.reference,
      scores,
    })
  }

  // Aggregate per-scorer averages across all runs in this set.
  const averageScores: Record<string, number> = {}
  for (const scorer of scorers) {
    const name = scorer.name
    const total = runs.reduce((sum, run) => sum + run.scores[name], 0)
    averageScores[name] = total / runs.length
  }

  // Record this set in the experiment's history and persist it.
  const setResult: SetResult = {
    timestamp: new Date().toISOString(),
    runs,
    averageScores,
  }
  experiment.history.push(setResult)
  saveExperiment(experiment)

  // Report each average score with the delta versus the previous set.
  const previous = experiment.history.length > 1 ? experiment.history.at(-2) : undefined
  console.log(`\n📊 Experiment: ${experimentName}`)
  for (const [k, v] of Object.entries(averageScores)) {
    const delta = previous?.averageScores[k] !== undefined ? v - previous.averageScores[k] : 0
    const color = delta > 0 ? '🟢' : delta < 0 ? '🔴' : '⚪'
    console.log(`→ ${k}: ${(v * 100).toFixed(2)}% (${color} ${(delta * 100).toFixed(2)}%)`)
  }

  return setResult
}
The runEval function generally follows these steps:
1. Load the experiment's history (or create a new experiment).
2. Run each test case through the task under test.
3. Apply every scorer to each output.
4. Average the scores per scorer across the whole set.
5. Append the set to the experiment's history and persist it.
6. Print each average with a diff against the previous set.
You need to persist the results of your experiments. You can use any database you want; for simplicity, a JSON file per experiment will do.
import fs from 'fs'
import path from 'path'
import { Experiment } from './types'

const RESULTS_DIR = './results'

// Make sure the results directory exists before any read/write.
if (!fs.existsSync(RESULTS_DIR)) fs.mkdirSync(RESULTS_DIR)

// Each experiment is stored as a single JSON file named after it.
export function loadExperiment(name: string): Experiment | undefined {
  const filePath = path.join(RESULTS_DIR, `${name}.json`)
  if (!fs.existsSync(filePath)) return undefined
  return JSON.parse(fs.readFileSync(filePath, 'utf-8'))
}

export function saveExperiment(exp: Experiment) {
  const filePath = path.join(RESULTS_DIR, `${exp.name}.json`)
  fs.writeFileSync(filePath, JSON.stringify(exp, null, 2))
}
// experiments/hello.eval.ts
import { runEval } from './runEval'
import { allScorers } from './scorers'
import { TestCase } from './types'
const mockTask = async (input: string) => {
return input.toUpperCase() // Replace with LLM or agent
}
const goldenSet: TestCase[] = [
{
input: 'hello',
expected: 'HELLO',
},
{
input: 'world',
expected: 'WORLD',
},
]
runEval({
experimentName: 'UppercaseAgent',
data: goldenSet,
task: mockTask,
scorers: allScorers,
})
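Running this experiment for the first time should print something like the following (there is no previous set yet, so the deltas are zero):

📊 Experiment: UppercaseAgent
→ exactMatch: 100.00% (⚪ 0.00%)
→ dummyRelevance: 100.00% (⚪ 0.00%)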
Finally, set up a package.json script:
"scripts": {
"eval": "npx tsx evals/run.ts",
},
And the runner script (evals/run.ts):
import 'dotenv/config'
import { join, dirname } from 'path'
import { fileURLToPath } from 'url'
import { readdir } from 'fs/promises'

// Recreate __dirname in an ES module context.
const __filename = fileURLToPath(import.meta.url)
const __dirname = dirname(__filename)

const main = async () => {
  let evalPath = ''
  const evalName = process.argv[2]
  const experimentsDir = join(__dirname, 'experiments')
  try {
    // Run a single experiment if a name was given, otherwise all *.eval.ts files.
    const filesToRun = evalName
      ? [`${evalName}.eval.ts`]
      : (await readdir(experimentsDir)).filter((file) => file.endsWith('.eval.ts'))
    for (const file of filesToRun) {
      evalPath = join(experimentsDir, file)
      await import(evalPath)
    }
  } catch (error) {
    console.error(`Failed to run ${evalPath}:`, error)
    process.exit(1)
  }
}

main()
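With this in place, npm run eval executes every .eval.ts file in evals/experiments/, while npm run eval -- hello runs only hello.eval.ts.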
✅ Historical tracking per experiment
✅ Multiple scorers
✅ Generic task function
✅ Full test data set input
✅ JSON persistence
✅ Diff of previous set for feedback
Join my newsletter for deep-dives into AI engineering, agent development, and production-ready LLM practices.