Understanding Results

After running an evaluation, each metric produces a score, a pass/fail status, and an explanation. This page helps you interpret those results and take action to improve your Agent's performance.


Scores Explained

Every metric produces a score between 0.0 and 1.0:

  • 1.0 — Perfect score. The response fully satisfies the metric's criteria.
  • 0.8 — Good. The response meets most criteria (this is the default passing threshold).
  • 0.5 — Moderate. The response partially meets criteria but has room for improvement.
  • 0.0 — Failing. The response does not meet the metric's criteria at all.

For most metrics, higher is better. The one exception is the Hallucination metric, where lower is better: a low score means the response contains few or no hallucinations.
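
To make the pass/fail logic concrete, here is a minimal sketch. The function name, the inclusive threshold comparison, and the special-casing of Hallucination are assumptions for illustration, not the product's actual API:

```python
def passes(metric_name: str, score: float, threshold: float = 0.8) -> bool:
    """Return True if a metric's score counts as passing (assumed semantics)."""
    if metric_name == "Hallucination":
        # Lower is better: pass when the score stays at or below the threshold.
        return score <= threshold
    # All other metrics: pass when the score meets or exceeds the threshold.
    return score >= threshold

print(passes("Answer Relevancy", 0.85))  # True
print(passes("Hallucination", 0.85))     # False -- too many hallucinations
```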


Strict Mode

When strict mode is enabled on a metric, the score is rounded to a binary result:

  • If the score is at or above the threshold → the final score becomes 1.0.
  • If the score is below the threshold → the final score becomes 0.0.

This is useful when you want a clear pass/fail result with no ambiguity.
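
In code, strict mode amounts to a one-line rounding step. A minimal sketch (treating an exact match with the threshold as a pass is an assumption):

```python
def strict_score(score: float, threshold: float = 0.8) -> float:
    # Round to a binary result: 1.0 at or above the threshold, else 0.0.
    return 1.0 if score >= threshold else 0.0

print(strict_score(0.79))  # 0.0
print(strict_score(0.92))  # 1.0
```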


Reasoning

Most metrics include a reason — a text explanation generated by the testing LLM that describes why the score was given. This helps you understand exactly what the metric found in the response.

The reason is especially valuable when a metric fails, as it often points directly to the issue you need to fix.
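
For example, you might surface failing metrics and their explanations like this. The `results` structure and its field names are hypothetical, purely for illustration:

```python
# Hypothetical evaluation output: a list of per-metric results.
results = [
    {"metric": "Answer Relevancy", "score": 0.62, "passed": False,
     "reason": "The response discusses pricing tiers but never answers "
               "the user's question about refund eligibility."},
    {"metric": "Faithfulness", "score": 0.91, "passed": True,
     "reason": "All claims are supported by the retrieved context."},
]

# Print the explanation for every failing metric -- usually the fastest
# way to find what needs fixing.
for r in results:
    if not r["passed"]:
        print(f"{r['metric']} ({r['score']:.2f}): {r['reason']}")
```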

Note: The Pattern Match metric does not use an LLM for evaluation, so its reason simply states whether the regex pattern was found in the response.
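
Because Pattern Match is a plain regex test, you can reproduce its check locally. A rough sketch (the exact matching semantics, e.g. search vs. full match, are an assumption):

```python
import re

pattern = r"ORDER-\d{6}"   # example: expect an order ID in the reply
response = "Your order ORDER-482913 has been shipped."

found = re.search(pattern, response) is not None
print("pass" if found else "fail: pattern not found in response")
```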


Common Scenarios & What to Do

| Scenario | Likely Cause | What to Do |
| --- | --- | --- |
| Low Answer Relevancy score | The response doesn't address the user's question directly | Refine your system prompt to focus on answering questions concisely and directly. |
| Low Task Completion score | The Agent didn't complete the required task or steps | Check that the task description in the metric config is clear. Verify the Agent has access to the necessary tools. |
| Low Tool Correctness score | The Agent used the wrong tools or skipped required ones | Review the expected tools list. Make sure tool descriptions clearly indicate when each tool should be used. |
| Low Prompt Alignment score | The response doesn't follow the system prompt instructions | Rewrite ambiguous instructions in your system prompt. Make guidelines specific and actionable. |
| Pattern Match failure | The response doesn't contain the expected format or keywords | Verify the regex pattern is correct. Check if the Agent's response format matches what you expect. |
| High Hallucination score | The Agent fabricated information not in the knowledge base | Improve your knowledge base content. Add instructions in the system prompt to only use provided context. |
| Low Faithfulness score | The response deviates from the retrieved context | Ensure your knowledge base contains complete information. Instruct the Agent to stay grounded in source material. |
| Low Contextual Precision score | Too many irrelevant chunks are being retrieved | Review your knowledge base chunking strategy. Remove outdated or duplicate content. |
| Low Contextual Recall score | Not all relevant information is being retrieved | Add missing content to your knowledge base. Check that documents are properly chunked and indexed. |
| Low Contextual Relevancy score | Retrieved context doesn't match the query well | Improve document titles and content structure. Consider reorganizing your knowledge base. |
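
If you triage failures programmatically, the table above can be encoded as a simple lookup from metric to remediation hint. A sketch (the metric names are taken from the table; everything else is illustrative):

```python
FIXES = {
    "Answer Relevancy": "Refine the system prompt to answer questions directly.",
    "Task Completion": "Clarify the task description; verify tool access.",
    "Tool Correctness": "Review the expected tools list and tool descriptions.",
    "Prompt Alignment": "Rewrite ambiguous system prompt instructions.",
    "Pattern Match": "Verify the regex and the expected response format.",
    "Hallucination": "Improve the knowledge base; keep the Agent to provided context.",
    "Faithfulness": "Complete the knowledge base; keep the Agent grounded in sources.",
    "Contextual Precision": "Revisit chunking; remove outdated or duplicate content.",
    "Contextual Recall": "Add missing content; check chunking and indexing.",
    "Contextual Relevancy": "Improve document titles and content structure.",
}

def suggest_fix(metric: str) -> str:
    """Return a remediation hint for a failing metric."""
    return FIXES.get(metric, "No suggestion available for this metric.")

print(suggest_fix("Contextual Recall"))
```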

Next Steps