Eval Cases & Collections
This page covers every configuration option available on an eval case and explains how to use collections to organize and batch-run your evaluations.
Eval Cases in Detail
An eval case defines a single test scenario for your AI Agent. Here's everything you can configure:
Messages
Messages define the conversation that will be sent to your Agent:
- Human messages — The user input your Agent will receive. Every eval case needs at least one human message.
- AI messages — The expected response from your Agent. This is optional, but required by some metrics (like Contextual Recall) for comparison.
You can add multiple messages to simulate multi-turn conversations. Messages are sent in the order you define them.
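To make the message structure concrete, here's a minimal sketch of a multi-turn eval case expressed as plain data. The shape is an assumption for illustration only; the `role`/`content` field names are not the product's actual schema.

```python
# Illustrative only: a multi-turn eval case represented as plain data.
# The field names (role, content) are assumptions, not the product's schema.
eval_case_messages = [
    {"role": "human", "content": "Hi, I'd like to return a pair of shoes."},
    {"role": "ai",    "content": "Sure. Could you tell me your order number?"},
    {"role": "human", "content": "It's 48210. What is your refund policy?"},
    # Expected (reference) answer; optional, but some metrics such as
    # Contextual Recall need it for comparison.
    {"role": "ai",    "content": "Orders can be returned within 30 days for a full refund."},
]

# Messages are sent in the order they are defined,
# and at least one human message is required.
assert any(m["role"] == "human" for m in eval_case_messages)
```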
Configuration Overrides
These optional settings let you customize how your Agent behaves during the evaluation, without modifying the Agent itself (a sketch follows the list):
- System Prompt Override — Use a different system prompt for this eval case. Useful for testing prompt variations.
- LLM Model Override — Run the eval case with a specific LLM model instead of the Agent's default.
- Temperature Override — Set a custom temperature value for this eval case.
- Tools to Disable — Selectively disable specific tools for this test scenario.
- Knowledge Base — Specify which knowledge base to use for RAG evaluations.
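As a rough illustration of how these overrides layer on top of the Agent's defaults, here is a hypothetical sketch. The class and field names are assumptions, not a real API; any field left unset falls back to the Agent's own configuration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shape of per-case overrides. Every field is optional; when a
# field is left as None/empty, the Agent's own configuration is used instead.
@dataclass
class EvalCaseOverrides:
    system_prompt: Optional[str] = None       # System Prompt Override
    llm_model: Optional[str] = None           # LLM Model Override
    temperature: Optional[float] = None       # Temperature Override
    tools_to_disable: list[str] = field(default_factory=list)
    knowledge_base: Optional[str] = None      # Knowledge base for RAG evaluations

# Example: test a prompt variation at low temperature with one tool disabled.
overrides = EvalCaseOverrides(
    system_prompt="You are a terse support agent. Answer in one sentence.",
    temperature=0.2,
    tools_to_disable=["web_search"],
)
```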
Testing LLM Model
This is the LLM that judges your Agent's response — it's separate from the model your Agent uses to generate answers. The testing LLM evaluates the response against your metrics and produces scores.
Metrics
Each eval case can have one or more metrics attached to it. Metrics define what aspects of the response you want to measure. See the Metrics Overview for the full list.
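Pulling the pieces together, an eval case conceptually bundles the conversation, any overrides, the testing LLM, and the attached metrics with their thresholds. The structure below is a hypothetical sketch of that separation, not the product's actual configuration format; the model names are placeholders.

```python
# Hypothetical, illustrative structure of a full eval case. Note that the
# testing LLM (the judge) is separate from the model the Agent answers with.
eval_case = {
    "name": "Returns refund policy when asked",
    "messages": [
        {"role": "human", "content": "What is your refund policy?"},
        {"role": "ai", "content": "Orders can be returned within 30 days."},
    ],
    "llm_model_override": "small-fast-model",  # what the Agent answers with (placeholder name)
    "testing_llm": "large-judge-model",        # what scores the answer (placeholder name)
    "metrics": [
        {"name": "Answer Relevancy", "threshold": 0.8},
        {"name": "Contextual Recall", "threshold": 0.8},  # needs the AI message above
    ],
}
```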
Collections
A collection is a group of related eval cases that you can run together as a single batch.
Why Use Collections?
- Organize by topic — Group eval cases that test the same feature or knowledge area.
- Batch execution — Run all cases in a collection with one click instead of running them individually.
- Track progress — See aggregated pass/fail results across all cases in the collection.
Creating a Collection
1. Navigate to your Agent's Evals tab.
2. Click Create Collection.
3. Give it a name and description.
4. Add eval cases to the collection — you can create new ones or move existing cases into it.
Running a Collection
Click the Run button on the collection to execute all its eval cases at once. Each case runs independently and produces its own results, but the collection provides an overall status based on all its cases.
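Conceptually, a collection run is just each eval case executed on its own, with the individual results collected at the end, roughly like the loop sketched below. The function names and the random score are placeholders, not a real API.

```python
import random

def run_eval_case(case: dict) -> dict:
    """Placeholder: in the product, this sends the case's messages to the
    Agent and scores the response with the testing LLM."""
    score = random.random()  # stand-in for a real metric score
    return {"case": case["name"], "score": score, "passed": score >= case["threshold"]}

def run_collection(cases: list[dict]) -> list[dict]:
    # Each case runs independently and produces its own result.
    return [run_eval_case(case) for case in cases]

results = run_collection([
    {"name": "Returns refund policy when asked", "threshold": 0.8},
    {"name": "Escalates billing disputes", "threshold": 0.8},
])
print(results)
```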
Status Lifecycle
Every eval case run and metric run goes through a status lifecycle:
| Status | Description |
|---|---|
| New | The run has been created but not yet started. |
| In Progress | The run is currently executing. |
| Passed | All metrics met their thresholds. |
| Failed | One or more metrics did not meet their thresholds. |
| Warning | The run completed with mixed results — some metrics passed, some failed. |
| Error | An error occurred during execution (e.g., the Agent failed to respond). |
| Stopped | The run was manually stopped before completion. |
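For intuition, here is a minimal sketch of how an overall result could be derived from per-metric outcomes. The aggregation rule is an assumption (in particular, how the platform distinguishes Failed from Warning may differ); the enum values simply mirror the table above.

```python
from enum import Enum

class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    PASSED = "Passed"
    FAILED = "Failed"
    WARNING = "Warning"
    ERROR = "Error"
    STOPPED = "Stopped"

def overall_status(metric_passed: list[bool]) -> Status:
    """Assumed rule: Passed when every metric meets its threshold, Failed when
    none do, Warning for a mix. The platform's exact rule may differ."""
    if all(metric_passed):
        return Status.PASSED
    if not any(metric_passed):
        return Status.FAILED
    return Status.WARNING

print(overall_status([True, True]))   # Status.PASSED
print(overall_status([True, False]))  # Status.WARNING
```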
Best Practices
- Keep eval cases focused — Each case should test one specific behavior or scenario. Avoid combining unrelated tests.
- Use descriptive names — Name your cases clearly so you can quickly understand what they test (e.g., "Returns refund policy when asked" rather than "Test 1").
- Start with a few metrics — Begin with one or two metrics per case, then add more as you understand what each metric measures.
- Use collections for regression testing — Group your most important test cases into a collection and run it after every significant change.
- Set realistic thresholds — Start with the default threshold (0.8) and adjust based on your results. Setting it too high may cause unnecessary failures.
Next Steps
- AI-Generated Eval Cases — Let AI create eval cases for you automatically.
- Understanding Results — Learn how to interpret scores and take action.