AI-Generated Eval Cases
Instead of manually creating every eval case, you can let AI analyze your Agent's configuration and automatically generate a diverse set of test cases. This is a fast way to build initial coverage for your evaluations.
How It Works
The AI generation process works in two steps:
- Scenario generation — The AI analyzes your Agent's system prompt, available tools, and knowledge base configuration to create realistic test scenarios. It generates diverse user messages covering different intents, phrasings, and complexity levels — including edge cases like out-of-scope requests and ambiguous queries.
- Metric assignment — The AI then reviews each generated scenario and assigns appropriate metrics with sensible thresholds. Every case gets at least Answer Relevancy and Prompt Alignment. Cases that involve tools get Tool Correctness, and task-oriented cases get Task Completion.
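If it helps to picture the second step, the sketch below expresses those assignment rules as code. It is a conceptual illustration only; the `assign_metrics` helper and the scenario fields (`uses_tools`, `task_oriented`, `subjective`) are assumptions made for this example, not part of the platform.

```python
# Illustrative sketch of the metric-assignment rules described above. The
# assign_metrics helper and the dict-based scenario shape are assumptions
# made for this example, not the platform's actual API or schema.

def assign_metrics(scenario: dict) -> dict[str, float]:
    """Map a generated scenario to metric names and thresholds."""
    # Subjective or creative tasks get a slightly lower threshold (see
    # "What Gets Generated" below); everything else defaults to 0.8.
    threshold = 0.7 if scenario.get("subjective") else 0.8

    metrics = {
        "Answer Relevancy": threshold,  # every case gets this
        "Prompt Alignment": threshold,  # every case gets this
    }
    if scenario.get("uses_tools"):
        metrics["Tool Correctness"] = threshold
    if scenario.get("task_oriented"):
        metrics["Task Completion"] = threshold
    return metrics


# Example: a tool-using, task-oriented scenario gets all four generic metrics.
print(assign_metrics({"uses_tools": True, "task_oriented": True}))
```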
Using AI Generation
- Navigate to your Agent's Evals tab.
- Click the Generate with AI button.
Wait while the system analyzes your Agent's configuration and generates the eval cases.
- Review the generated cases — you can edit, remove, or adjust any of them before saving.
- Save the cases you want to keep.
What Gets Generated
For each eval case, the AI creates:
- Name — A short descriptive title for the test case.
- Description — What the test verifies and why it matters.
- Human message — A realistic user input, including natural language patterns like informal phrasing or typos where appropriate.
- Expected AI response — The ideal response the Agent should produce, used as a reference for metric evaluation.
- Metrics — A set of metrics with configured thresholds (typically 0.8, lowered to 0.7 for subjective or creative tasks).
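For example, a single generated case might carry content along these lines. The dict layout is purely illustrative; in the product these fields appear in the eval case editor, not as code.

```python
# A hypothetical generated eval case, showing the fields listed above.
# The dict layout is illustrative only.

example_case = {
    "name": "Out-of-scope request is politely declined",
    "description": "Verifies the Agent stays in scope and redirects the user "
                   "instead of attempting an unsupported task.",
    "human_message": "hey can u also book me a flight to paris next tuesday??",
    "expected_ai_response": "I can help with questions about your account and "
                            "orders, but I can't book travel. Is there anything "
                            "account-related I can help with?",
    "metrics": {
        "Answer Relevancy": 0.8,
        "Prompt Alignment": 0.8,
    },
}
```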
Limitations
Generic metrics only — AI generation currently assigns only generic metrics (Answer Relevancy, Task Completion, Tool Correctness, Prompt Alignment, Pattern Match). It does not assign RAG metrics. If your Agent uses a knowledge base, add RAG metrics manually to the generated cases (an illustrative sketch follows this list).
- Always review generated cases — The AI provides a strong starting point, but you should review and adjust the generated cases to match your specific requirements.
- Quality depends on your configuration — The better your system prompt and tool descriptions, the more relevant the generated cases will be.
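As noted in the first limitation, generated cases never include RAG metrics, so for knowledge-base Agents the result you are aiming for after editing looks roughly like the sketch below. The metric names shown are placeholders; use whichever RAG metrics your Metrics Overview lists.

```python
# Hypothetical end state after manually adding RAG metrics to a generated
# case. "Faithfulness" and "Contextual Relevancy" are placeholder names;
# substitute the RAG metrics available on your platform.

generated_case = {
    "name": "Shipping policy question answered from the knowledge base",
    "metrics": {
        "Answer Relevancy": 0.8,   # assigned automatically
        "Prompt Alignment": 0.8,   # assigned automatically
    },
}

# Added by hand after generation:
generated_case["metrics"].update({
    "Faithfulness": 0.8,
    "Contextual Relevancy": 0.8,
})
```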
Next Steps
- Understanding Results — Learn how to interpret the scores from your eval runs.
- Metrics Overview — Explore all available metrics and their configurations.