Understanding Results
After running an evaluation, each metric produces a score, a pass/fail status, and an explanation. This page helps you interpret those results and take action to improve your Agent's performance.
Scores Explained
Every metric produces a score between 0.0 and 1.0:
- 1.0 — Perfect score. The response fully satisfies the metric's criteria.
- 0.8 — Good. The response meets most criteria. This is the default passing threshold: scores at or above 0.8 pass.
- 0.5 — Moderate. The response partially meets criteria but has room for improvement.
- 0.0 — Failing. The response does not meet the metric's criteria at all.
For most metrics, higher is better. The one exception is the Hallucination metric, where lower is better: a lower score means fewer hallucinations.
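As a rough illustration, here is how a score can be compared against the threshold. This is a minimal sketch in Python, assuming the default threshold of 0.8 and assuming the comparison simply flips for the lower-is-better Hallucination metric; the function and hard-coded names are illustrative, not part of the product's API.

```python
# Minimal sketch of score interpretation. The 0.8 default threshold
# comes from this page; everything else (function name, metric names,
# the flipped comparison for Hallucination) is an assumption for
# illustration, not the product's actual API.

def passes(metric_name: str, score: float, threshold: float = 0.8) -> bool:
    """Return True if the score counts as passing for this metric."""
    if metric_name == "Hallucination":
        # Lower is better here: a score at or below the threshold passes.
        return score <= threshold
    # All other metrics are higher-is-better.
    return score >= threshold

print(passes("Answer Relevancy", 0.85))  # True: 0.85 >= 0.8
print(passes("Hallucination", 0.85))     # False: 0.85 signals many hallucinations
```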
Strict Mode
When strict mode is enabled on a metric, the score is rounded to a binary result:
- If the score is at or above the threshold → the final score becomes 1.0.
- If the score is below the threshold → the final score becomes 0.0.
This is useful when you want a clear pass/fail result with no ambiguity.
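The rounding rule can be summarized in a few lines. This is a minimal sketch, assuming scores exactly at the threshold count as passing; the function name is illustrative, not a real API:

```python
def apply_strict_mode(score: float, threshold: float = 0.8) -> float:
    """Collapse a raw score to a binary result under strict mode."""
    # Sketch of the rule above: meet the threshold and the final
    # score snaps to 1.0; otherwise it snaps to 0.0.
    return 1.0 if score >= threshold else 0.0

print(apply_strict_mode(0.85))  # 1.0: at or above the 0.8 threshold
print(apply_strict_mode(0.79))  # 0.0: just below the threshold
```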
Reasoning
Most metrics include a reason — a text explanation generated by the testing LLM that describes why the score was given. This helps you understand exactly what the metric found in the response.
The reason is especially valuable when a metric fails, as it often points directly to the issue you need to fix.
The Pattern Match metric does not use an LLM for evaluation, so its reason simply states whether the regex pattern was found in the response.
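Because Pattern Match is plain regex matching, you can verify a pattern locally with Python's `re` module before adding it to an eval case. The pattern and response below are invented for illustration:

```python
import re

# Hypothetical eval case: the Agent should quote an order number in
# the form "ORD-" followed by six digits.
pattern = re.compile(r"ORD-\d{6}")

response = "Your order ORD-204518 has shipped and should arrive Friday."
match = pattern.search(response)

if match:
    print(f"Pattern found: {match.group()!r} -> pass")
else:
    print("Pattern not found in response -> fail")
```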
Common Scenarios & What to Do
| Scenario | Likely Cause | What to Do |
|---|---|---|
| Low Answer Relevancy score | The response doesn't address the user's question directly | Refine your system prompt to focus on answering questions concisely and directly. |
| Low Task Completion score | The Agent didn't complete the required task or steps | Check that the task description in the metric config is clear. Verify the Agent has access to the necessary tools. |
| Low Tool Correctness score | The Agent used the wrong tools or skipped required ones | Review the expected tools list. Make sure tool descriptions clearly indicate when each tool should be used. |
| Low Prompt Alignment score | The response doesn't follow the system prompt instructions | Rewrite ambiguous instructions in your system prompt. Make guidelines specific and actionable. |
| Pattern Match failure | The response doesn't contain the expected format or keywords | Verify the regex pattern is correct. Check if the Agent's response format matches what you expect. |
| High Hallucination score | The Agent fabricated information not in the knowledge base | Improve your knowledge base content. Add instructions in the system prompt to only use provided context. |
| Low Faithfulness score | The response deviates from the retrieved context | Ensure your knowledge base contains complete information. Instruct the Agent to stay grounded in source material. |
| Low Contextual Precision score | Too many irrelevant chunks are being retrieved | Review your knowledge base chunking strategy. Remove outdated or duplicate content. |
| Low Contextual Recall score | Not all relevant information is being retrieved | Add missing content to your knowledge base. Check that documents are properly chunked and indexed. |
| Low Contextual Relevancy score | Retrieved context doesn't match the query well | Improve document titles and content structure. Consider reorganizing your knowledge base. |
Next Steps
- Metrics Overview — Explore each metric's configuration in detail.
- Eval Cases & Collections — Refine your eval cases based on your results.