Understanding Results
After running an evaluation, each metric produces a score, a pass/fail status, and an explanation. This page helps you interpret those results and take action to improve your Agent's performance.
Scores Explained
Every metric produces a score between 0.0 and 1.0:
- 1.0 — Perfect score. The response fully satisfies the metric's criteria.
- 0.8 — Good. The response meets most criteria. This is the default passing threshold: scores at or above 0.8 pass.
- 0.5 — Moderate. The response partially meets criteria but has room for improvement.
- 0.0 — Failing. The response does not meet the metric's criteria at all.
For most metrics, higher is better. The one exception is the Hallucination metric, where lower is better: a lower score means fewer hallucinations.
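As a rough illustration, here is how a score can be compared against the threshold. This is a minimal sketch in Python, assuming the default threshold of 0.8 and assuming the comparison simply flips for the lower-is-better Hallucination metric; the function and hard-coded names are illustrative, not part of the product's API.

```python
# Minimal sketch of score interpretation. The 0.8 default threshold
# comes from this page; everything else (function name, metric names,
# the flipped comparison for Hallucination) is an assumption for
# illustration, not the product's actual API.

def passes(metric_name: str, score: float, threshold: float = 0.8) -> bool:
    """Return True if the score counts as passing for this metric."""
    if metric_name == "Hallucination":
        # Lower is better here: a score at or below the threshold passes.
        return score <= threshold
    # All other metrics are higher-is-better.
    return score >= threshold

print(passes("Answer Relevancy", 0.85))  # True: 0.85 >= 0.8
print(passes("Hallucination", 0.85))     # False: 0.85 signals many hallucinations
```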
Strict Mode
When strict mode is enabled on a metric, the score is rounded to a binary result:
- If the score is at or above the threshold → the final score becomes 1.0.
- If the score is below the threshold → the final score becomes 0.0.
This is useful when you want a clear pass/fail result with no ambiguity.
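The rounding rule can be summarized in a few lines. This is a minimal sketch, assuming scores exactly at the threshold count as passing; the function name is illustrative, not a real API:

```python
def apply_strict_mode(score: float, threshold: float = 0.8) -> float:
    """Collapse a raw score to a binary result under strict mode."""
    # Sketch of the rule above: meet the threshold and the final
    # score snaps to 1.0; otherwise it snaps to 0.0.
    return 1.0 if score >= threshold else 0.0

print(apply_strict_mode(0.85))  # 1.0: at or above the 0.8 threshold
print(apply_strict_mode(0.79))  # 0.0: just below the threshold
```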
Reasoning
Most metrics include a reason — a text explanation generated by the testing LLM that describes why the score was given. This helps you understand exactly what the metric found in the response.
The reason is especially valuable when a metric fails, as it often points directly to the issue you need to fix.
The Pattern Match metric does not use an LLM for evaluation, so its reason simply states whether the regex pattern was found in the response.
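Because Pattern Match is plain regex matching, you can verify a pattern locally with Python's `re` module before adding it to an eval case. The pattern and response below are invented for illustration:

```python
import re

# Hypothetical eval case: the Agent should quote an order number in
# the form "ORD-" followed by six digits.
pattern = re.compile(r"ORD-\d{6}")

response = "Your order ORD-204518 has shipped and should arrive Friday."
match = pattern.search(response)

if match:
    print(f"Pattern found: {match.group()!r} -> pass")
else:
    print("Pattern not found in response -> fail")
```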
Common Scenarios & What to Do
| Scenario | Likely Cause | What to Do |
|---|---|---|
| Low Answer Relevancy score | The response doesn't address the user's question directly | Refine your system prompt to focus on answering questions concisely and directly. |
| Low Task Completion score | The Agent didn't complete the required task or steps | Check that the task description in the metric config is clear. Verify the Agent has access to the necessary tools. |
| Low Tool Correctness score | The Agent used the wrong tools or skipped required ones | Review the expected tools list. Make sure tool descriptions clearly indicate when each tool should be used. |
| Low Prompt Alignment score | The response doesn't follow the system prompt instructions | Rewrite ambiguous instructions in your system prompt. Make guidelines specific and actionable. |
| Pattern Match failure | The response doesn't contain the expected format or keywords | Verify the regex pattern is correct. Check if the Agent's response format matches what you expect. |
| High Hallucination score | The Agent fabricated information not in the knowledge base | Improve your knowledge base content. Add instructions in the system prompt to only use provided context. |
| Low Faithfulness score | The response deviates from the retrieved context | Ensure your knowledge base contains complete information. Instruct the Agent to stay grounded in source material. |
| Low Contextual Precision score | Too many irrelevant chunks are being retrieved | Review your knowledge base chunking strategy. Remove outdated or duplicate content. |
| Low Contextual Recall score | Not all relevant information is being retrieved | Add missing content to your knowledge base. Check that documents are properly chunked and indexed. |
| Low Contextual Relevancy score | Retrieved context doesn't match the query well | Improve document titles and content structure. Consider reorganizing your knowledge base. |
Next Steps
- Metrics Overview — Explore each metric's configuration in detail.
- Eval Cases & Collections — Refine your eval cases based on your results.