Tool Correctness

Generic Metric

Introduction

The Tool Correctness metric validates that your AI Agent used the correct tools during a conversation. It compares the tools the Agent actually called against a list of expected tools you define, ensuring the Agent selects the right tools for the job.
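At its core, the check is a comparison between two lists of tool names. Here is a minimal sketch in plain Python; it is illustrative only, not the platform's API:

```python
# Illustrative only -- not the platform's API. The metric boils down to
# comparing the tools the Agent actually called against the list you expect.
expected_tools = ["search_knowledge_base", "create_ticket"]
actual_tools = ["search_knowledge_base", "create_ticket"]  # captured from the run

print(set(actual_tools) == set(expected_tools))  # True -> correct tool selection
```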


When to Use This Metric

  • You want to verify that the Agent calls specific tools for specific user requests.
  • You're testing whether prompt or tool description changes affect tool selection behavior.
  • You need to ensure the Agent doesn't call unnecessary or incorrect tools.
  • You're validating tool routing logic for Agents with multiple available tools.
  • You want to catch cases where the Agent tries to answer from memory instead of using a tool.

Configuration

Parameter        Type             Default  Required  Description
threshold        float            0.8      No        Score threshold for passing (0.0–1.0).
strict_mode      boolean          false    No        Rounds score to 1.0 or 0.0 based on threshold.
expected_tools   list of strings  []       No        List of tool names that the Agent should use for this scenario.
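To make the parameters concrete, here is a minimal configuration object. The class name ToolCorrectnessConfig is hypothetical and exists only to mirror the table above; consult your platform's SDK for the real construct:

```python
from dataclasses import dataclass, field

# Hypothetical container mirroring the configuration table above.
# The class name is illustrative, not a documented API; defaults match the table.
@dataclass
class ToolCorrectnessConfig:
    threshold: float = 0.8                 # pass when score >= threshold (0.0-1.0)
    strict_mode: bool = False              # True snaps the score to 1.0 or 0.0
    expected_tools: list[str] = field(default_factory=list)

config = ToolCorrectnessConfig(
    expected_tools=["search_knowledge_base", "create_ticket"],
)
```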

How It Works

  1. The AI Agent receives the input message and processes it, potentially calling one or more tools.
  2. The system captures the list of tools that were actually called during the interaction.
  3. The testing LLM compares the actual tool calls against the expected tools list.
  4. A score is produced reflecting how well the actual tool usage matches the expected tools (sketched below).
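To picture the comparison in steps 3 and 4, here is a self-contained sketch. It is not the platform's implementation, and the exact formula (set overlap, which penalizes both missing and extra tools) is an assumption consistent with the scoring notes below:

```python
def tool_correctness_score(expected: list[str], actual: list[str]) -> float:
    """Assumed scoring rule, not the documented formula: set overlap (Jaccard),
    so missing expected tools and unexpected extra calls both lower the score."""
    expected_set, actual_set = set(expected), set(actual)
    if not expected_set and not actual_set:
        return 1.0
    return len(expected_set & actual_set) / len(expected_set | actual_set)

# Step 2 captures the calls; steps 3-4 compare and score them.
actual_calls = ["search_knowledge_base", "create_ticket", "send_email"]
expected = ["search_knowledge_base", "create_ticket"]
print(tool_correctness_score(expected, actual_calls))  # ~0.67: the extra call costs
```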

Scoring

  • Range: 0.0 to 1.0 (higher is better).
  • High score (close to 1.0): The Agent used the expected tools correctly.
  • Low score (close to 0.0): The Agent used incorrect tools, skipped required tools, or called unexpected tools.
  • Pass condition: The score must be greater than or equal to the configured threshold (sketched below).
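The pass condition and strict_mode interact as follows; this sketch assumes strict_mode snaps the score before the threshold check, as the configuration table describes:

```python
def apply_verdict(score: float, threshold: float = 0.8,
                  strict_mode: bool = False) -> tuple[float, bool]:
    """Assumed behavior: strict_mode binarizes the score first, then the
    pass condition (score >= threshold) is evaluated."""
    if strict_mode:
        score = 1.0 if score >= threshold else 0.0
    return score, score >= threshold

print(apply_verdict(0.85))                    # (0.85, True)
print(apply_verdict(0.85, strict_mode=True))  # (1.0, True)
print(apply_verdict(0.50, strict_mode=True))  # (0.0, False)
```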

Example

Expected tools: ["search_knowledge_base", "create_ticket"]

Input: "I can't log in to my account. Can you create a support ticket?"

AI Response: "I've searched our knowledge base for common login issues and created a support ticket #4521 for you. Our team will follow up within 24 hours."

Tools called: search_knowledge_base, create_ticket

Score: 1.0

Result: Passed (threshold: 0.8)

The Agent called exactly the expected tools in this scenario.
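Expressed as an executable check, with the same caveat that the set-overlap formula is an assumption:

```python
expected_tools = ["search_knowledge_base", "create_ticket"]
tools_called = ["search_knowledge_base", "create_ticket"]  # from the run above

overlap = set(expected_tools) & set(tools_called)
union = set(expected_tools) | set(tools_called)
score = len(overlap) / len(union)

print(score)         # 1.0
print(score >= 0.8)  # True -> Passed, matching the result above
```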


Tips for Improving Scores

  • Make tool descriptions clear and specific so the Agent knows exactly when to use each tool.
  • List all expected tools — if the Agent should call multiple tools for a scenario, include all of them.
  • If the Agent calls extra tools beyond the expected list, review whether those calls are necessary or if tool descriptions need refinement.
  • Test edge cases where the Agent might be tempted to use the wrong tool (e.g., similar-sounding tools with different purposes).
  • Ensure tool names in the expected_tools list match the exact tool names configured in your Agent (see the sketch below).
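The last tip deserves emphasis: under exact-string matching (assumed throughout these sketches), a naming or casing mismatch scores as a completely wrong tool rather than a near miss:

```python
expected = ["create_ticket"]
actual = ["createTicket"]  # the same tool, registered under a different name

# No overlap under exact-string comparison, so the score collapses to 0.0.
print(set(expected) & set(actual))  # set()
```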