Eval Cases & Collections
This page covers every configuration option available on an eval case and explains how to use collections to organize and batch-run your evaluations.
Eval Cases in Detail
An eval case defines a single test scenario for your AI Agent. Here's everything you can configure:
Messages
Messages define the conversation that will be sent to your Agent:
- Human messages — The user input your Agent will receive. Every eval case needs at least one human message.
- AI messages — The expected response from your Agent. This is optional, but required by some metrics (like Contextual Recall) for comparison.
You can add multiple messages to simulate multi-turn conversations. Messages are sent in the order you define them.
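To make the message structure concrete, here's a minimal sketch of a multi-turn eval case expressed as plain data. The shape is an assumption for illustration only; the `role`/`content` field names are not the product's actual schema.

```python
# Illustrative only: a multi-turn eval case represented as plain data.
# The field names (role, content) are assumptions, not the product's schema.
eval_case_messages = [
    {"role": "human", "content": "Hi, I'd like to return a pair of shoes."},
    {"role": "ai",    "content": "Sure. Could you tell me your order number?"},
    {"role": "human", "content": "It's 48210. What is your refund policy?"},
    # Expected (reference) answer; optional, but some metrics such as
    # Contextual Recall need it for comparison.
    {"role": "ai",    "content": "Orders can be returned within 30 days for a full refund."},
]

# Messages are sent in the order they are defined,
# and at least one human message is required.
assert any(m["role"] == "human" for m in eval_case_messages)
```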
Configuration Overrides
These optional settings let you customize how your Agent behaves during the evaluation, without modifying the Agent itself (a sketch follows the list):
- System Prompt Override — Use a different system prompt for this eval case. Useful for testing prompt variations.
- LLM Model Override — Run the eval case with a specific LLM model instead of the Agent's default.
- Temperature Override — Set a custom temperature value for this eval case.
- Tools to Disable — Selectively disable specific tools for this test scenario.
- Knowledge Base — Specify which knowledge base to use for RAG evaluations.
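As a rough illustration of how these overrides layer on top of the Agent's defaults, here is a hypothetical sketch. The class and field names are assumptions, not a real API; any field left unset falls back to the Agent's own configuration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shape of per-case overrides. Every field is optional; when a
# field is left as None/empty, the Agent's own configuration is used instead.
@dataclass
class EvalCaseOverrides:
    system_prompt: Optional[str] = None       # System Prompt Override
    llm_model: Optional[str] = None           # LLM Model Override
    temperature: Optional[float] = None       # Temperature Override
    tools_to_disable: list[str] = field(default_factory=list)
    knowledge_base: Optional[str] = None      # Knowledge base for RAG evaluations

# Example: test a prompt variation at low temperature with one tool disabled.
overrides = EvalCaseOverrides(
    system_prompt="You are a terse support agent. Answer in one sentence.",
    temperature=0.2,
    tools_to_disable=["web_search"],
)
```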
Testing LLM Model
This is the LLM that judges your Agent's response — it's separate from the model your Agent uses to generate answers. The testing LLM evaluates the response against your metrics and produces scores.
Metrics
Each eval case can have one or more metrics attached to it. Metrics define what aspects of the response you want to measure. See the Metrics Overview for the full list.
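Pulling the pieces together, an eval case conceptually bundles the conversation, any overrides, the testing LLM, and the attached metrics with their thresholds. The structure below is a hypothetical sketch of that separation, not the product's actual configuration format; the model names are placeholders.

```python
# Hypothetical, illustrative structure of a full eval case. Note that the
# testing LLM (the judge) is separate from the model the Agent answers with.
eval_case = {
    "name": "Returns refund policy when asked",
    "messages": [
        {"role": "human", "content": "What is your refund policy?"},
        {"role": "ai", "content": "Orders can be returned within 30 days."},
    ],
    "llm_model_override": "small-fast-model",  # what the Agent answers with (placeholder name)
    "testing_llm": "large-judge-model",        # what scores the answer (placeholder name)
    "metrics": [
        {"name": "Answer Relevancy", "threshold": 0.8},
        {"name": "Contextual Recall", "threshold": 0.8},  # needs the AI message above
    ],
}
```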
Collections
A collection is a group of related eval cases that you can run together as a single batch.
Why Use Collections?
- Organize by topic — Group eval cases that test the same feature or knowledge area.
- Batch execution — Run all cases in a collection with one click instead of running them individually.
- Track progress — See aggregated pass/fail results across all cases in the collection.
Creating a Collection
1. Navigate to your Agent's Evals tab.
2. Click Create Collection.
3. Give it a name and description.
4. Add eval cases to the collection — you can create new ones or move existing cases into it.
Running a Collection
Click the Run button on the collection to execute all its eval cases at once. Each case runs independently and produces its own results, but the collection provides an overall status based on all its cases.
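Conceptually, a collection run is just each eval case executed on its own, with the individual results collected at the end, roughly like the loop sketched below. The function names and the random score are placeholders, not a real API.

```python
import random

def run_eval_case(case: dict) -> dict:
    """Placeholder: in the product, this sends the case's messages to the
    Agent and scores the response with the testing LLM."""
    score = random.random()  # stand-in for a real metric score
    return {"case": case["name"], "score": score, "passed": score >= case["threshold"]}

def run_collection(cases: list[dict]) -> list[dict]:
    # Each case runs independently and produces its own result.
    return [run_eval_case(case) for case in cases]

results = run_collection([
    {"name": "Returns refund policy when asked", "threshold": 0.8},
    {"name": "Escalates billing disputes", "threshold": 0.8},
])
print(results)
```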
Status Lifecycle
Every eval case run and metric run goes through a status lifecycle:
| Status | Description |
|---|---|
| New | The run has been created but not yet started. |
| In Progress | The run is currently executing. |
| Passed | All metrics met their thresholds. |
| Failed | One or more metrics did not meet their thresholds. |
| Warning | The run completed with mixed results — some metrics passed, some failed. |
| Error | An error occurred during execution (e.g., the Agent failed to respond). |
| Stopped | The run was manually stopped before completion. |
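For intuition, here is a minimal sketch of how an overall result could be derived from per-metric outcomes. The aggregation rule is an assumption (in particular, how the platform distinguishes Failed from Warning may differ); the enum values simply mirror the table above.

```python
from enum import Enum

class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    PASSED = "Passed"
    FAILED = "Failed"
    WARNING = "Warning"
    ERROR = "Error"
    STOPPED = "Stopped"

def overall_status(metric_passed: list[bool]) -> Status:
    """Assumed rule: Passed when every metric meets its threshold, Failed when
    none do, Warning for a mix. The platform's exact rule may differ."""
    if all(metric_passed):
        return Status.PASSED
    if not any(metric_passed):
        return Status.FAILED
    return Status.WARNING

print(overall_status([True, True]))   # Status.PASSED
print(overall_status([True, False]))  # Status.WARNING
```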
Best Practices
- Keep eval cases focused — Each case should test one specific behavior or scenario. Avoid combining unrelated tests.
- Use descriptive names — Name your cases clearly so you can quickly understand what they test (e.g., "Returns refund policy when asked" rather than "Test 1").
- Start with a few metrics — Begin with one or two metrics per case, then add more as you understand what each metric measures.
- Use collections for regression testing — Group your most important test cases into a collection and run it after every significant change.
- Set realistic thresholds — Start with the default threshold (0.8) and adjust based on your results. Setting it too high may cause unnecessary failures.
Next Steps
- AI-Generated Eval Cases — Let AI create eval cases for you automatically.
- Understanding Results — Learn how to interpret scores and take action.