Evaluations Overview

Think of evaluations as unit tests for your AI Agent. Just like software developers write tests to make sure their code works correctly, evaluations let you verify that your AI Agent responds accurately, uses the right tools, and stays aligned with your instructions.


What Are Evaluations?

Evaluations are automated tests you create to measure how well your AI Agent performs. You define a scenario — a user message, expected behavior, and one or more metrics — and the system runs your Agent against that scenario, scores the response, and tells you whether it passed or failed.

This gives you a repeatable, objective way to track your Agent's quality over time.


Key Concepts

Before diving in, here are the core concepts you'll work with (a short code sketch after the list shows how they fit together):

  • Eval Case: A single test scenario. It contains the user messages to send, the expected AI response, and the metrics to evaluate.
  • Collection: A group of related eval cases that you can run together in one batch.
  • Metric: A specific measurement applied to the AI's response. For example, "Was the answer relevant?" or "Did the Agent use the correct tools?"
  • Run: A single execution of an eval case or collection. Each run produces scores and pass/fail results for every metric.
  • Testing LLM Model: The LLM used to judge the AI Agent's response. This is separate from the LLM your Agent uses to generate responses.
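
To make these pieces concrete, here is a minimal sketch of how an eval case and a collection might be modeled. The class and field names are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative data model; names and fields are assumptions, not the platform schema.

@dataclass
class EvalCase:
    name: str
    user_messages: list[str]           # messages sent to the Agent
    expected_response: str | None      # optional reference answer
    metrics: list[str]                 # metric names to apply, e.g. "answer_relevancy"

@dataclass
class Collection:
    name: str
    cases: list[EvalCase] = field(default_factory=list)

# Example: a single eval case checking relevancy and tool usage.
refund_case = EvalCase(
    name="refund-policy-question",
    user_messages=["What is your refund policy?"],
    expected_response="Refunds are available within 30 days of purchase.",
    metrics=["answer_relevancy", "tool_correctness"],
)

support_suite = Collection(name="support-faq", cases=[refund_case])
```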

How It Works

The evaluation process follows five steps (see the sketch after the list):

  1. Create an eval case — Define the scenario you want to test, including the user message and optionally an expected AI response.
  2. Add metrics — Choose which metrics to measure (e.g., Answer Relevancy, Task Completion, Hallucination).
  3. Run the evaluation — The system sends your messages to the AI Agent and captures its response.
  4. The Agent responds — Your AI Agent processes the input just like it would in a real conversation, including using tools and retrieving knowledge base content.
  5. Metrics score the response — Each metric evaluates the Agent's response and produces a score between 0.0 and 1.0.
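
The sketch below walks through these steps end to end. The agent callable, the metric functions, and the 0.5 pass threshold are assumptions made for illustration; the platform's own scoring and pass criteria may differ:

```python
from typing import Callable

# Hypothetical pieces: an agent you can call, and metric functions that
# score a response between 0.0 and 1.0 (in practice, via a separate judge LLM).
AgentFn = Callable[[str], str]
MetricFn = Callable[[str, str], float]   # (user_message, agent_response) -> score

def run_eval_case(agent: AgentFn, user_message: str,
                  metrics: dict[str, MetricFn],
                  pass_threshold: float = 0.5) -> dict[str, dict]:
    # Steps 3-4: send the user message to the Agent and capture its response
    # (the Agent may use tools or retrieve knowledge along the way).
    response = agent(user_message)

    results = {}
    for name, metric in metrics.items():
        # Step 5: each metric scores the response between 0.0 and 1.0.
        score = metric(user_message, response)
        results[name] = {"score": score, "passed": score >= pass_threshold}
    return results

# Toy stand-ins so the sketch runs on its own.
def toy_agent(message: str) -> str:
    return "Refunds are available within 30 days of purchase."

def toy_relevancy(message: str, response: str) -> float:
    return 1.0 if "refund" in response.lower() else 0.0

print(run_eval_case(toy_agent, "What is your refund policy?",
                    {"answer_relevancy": toy_relevancy}))
```

A full run would repeat this loop for every case in a collection and gather the per-metric scores into a single report.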

Two Categories of Metrics

Evaluations support ten metrics organized into two categories:

Generic Metrics

These work with any AI Agent, regardless of whether it uses a knowledge base:

  • Answer Relevancy — Is the response relevant to the user's question?
  • Task Completion — Did the Agent complete a specific task?
  • Tool Correctness — Did the Agent use the expected tools?
  • Prompt Alignment — Does the response follow the system prompt instructions?
  • Pattern Match — Does the response match a specific format or pattern? (See the sketch after this list.)
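
Pattern Match is the most mechanical of these metrics, which makes it a good example. The sketch below treats the pattern as a regular expression and scores 1.0 on a match, 0.0 otherwise; the regex assumption is ours, and the platform may accept other pattern formats:

```python
import re

def pattern_match_score(response: str, pattern: str) -> float:
    """Return 1.0 if the response matches the regex pattern, else 0.0."""
    return 1.0 if re.search(pattern, response) else 0.0

# Example: require the response to contain an order number like "ORD-12345".
print(pattern_match_score("Your order ORD-12345 has shipped.", r"ORD-\d{5}"))  # 1.0
print(pattern_match_score("Your order has shipped.", r"ORD-\d{5}"))            # 0.0
```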

RAG Metrics

These are designed for Agents that use a knowledge base (Retrieval-Augmented Generation). They evaluate how well the Agent retrieves and uses information from your documents:

  • Hallucination — Did the Agent fabricate information not in the knowledge base?
  • Faithfulness — Is the response grounded in the retrieved context?
  • Contextual Precision — Is the retrieved context relevant and free of noise?
  • Contextual Recall — Was all the necessary context retrieved?
  • Contextual Relevancy — How relevant is the retrieved context to the query?
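
In the product, these RAG metrics are scored by the Testing LLM Model acting as a judge. The sketch below substitutes a naive word-overlap check for that judge, purely to show the shape of a Faithfulness-style computation in which response claims are checked against the retrieved context:

```python
def naive_faithfulness(response: str, retrieved_context: list[str]) -> float:
    """Fraction of response sentences with some word overlap in the retrieved context.

    A real Faithfulness metric uses an LLM judge; word overlap is only a
    stand-in to show the shape of the computation.
    """
    context_words = set(" ".join(retrieved_context).lower().split())
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        1 for s in sentences
        if set(s.lower().split()) & context_words
    )
    return supported / len(sentences)

context = ["Refunds are available within 30 days of purchase."]
print(naive_faithfulness("Refunds are available within 30 days.", context))  # 1.0
print(naive_faithfulness("We offer lifetime warranties.", context))          # 0.0 (unsupported claim)
```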

Learn more about each metric in the Metrics Overview.


Next Steps