Tool Correctness
Generic Metric
Introduction
The Tool Correctness metric validates that your AI Agent used the correct tools during a conversation. It compares the tools the Agent actually called against a list of expected tools you define, ensuring the Agent selects the right tools for the job.
When to Use This Metric
- You want to verify that the Agent calls specific tools for specific user requests.
- You're testing whether prompt or tool description changes affect tool selection behavior.
- You need to ensure the Agent doesn't call unnecessary or incorrect tools.
- You're validating tool routing logic for Agents with multiple available tools.
- You want to catch cases where the Agent tries to answer from memory instead of using a tool.
Configuration
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| threshold | float | 0.8 | No | Score threshold for passing (0.0–1.0). |
| strict_mode | boolean | false | No | When true, the score is rounded to 1.0 if it meets the threshold and to 0.0 otherwise. |
| expected_tools | list of strings | [] | No | Names of the tools the Agent should use for this scenario. |
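For illustration, the parameters above map naturally onto a small configuration object. This is a sketch, not the framework's real API; the class and field names here are hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative only: the real framework's configuration object will differ,
# but the fields map one-to-one onto the table above.
@dataclass
class ToolCorrectnessConfig:
    threshold: float = 0.8            # pass threshold, 0.0-1.0
    strict_mode: bool = False         # round score to 1.0/0.0 around threshold
    expected_tools: list[str] = field(default_factory=list)

config = ToolCorrectnessConfig(
    expected_tools=["search_knowledge_base", "create_ticket"],
)
```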
How It Works
- The AI Agent receives the input message and processes it, potentially calling one or more tools.
- The system captures the list of tools that were actually called during the interaction.
- The testing LLM compares the actual tool calls against the expected tools list.
- A score is produced reflecting how well the actual tool usage matches the expected tools.
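In practice, step 3 is performed by the testing LLM, so there is no single formula to reproduce. As a deterministic approximation, though, the capture-and-compare flow might look like this minimal sketch (the function names and trace format are assumptions, not part of the framework):

```python
# All names and the trace format below are assumptions for illustration;
# the real comparison is performed by the testing LLM.

def capture_tool_calls(agent_trace: list[dict]) -> list[str]:
    """Collect the names of tools the Agent actually invoked from a trace."""
    return [step["tool"] for step in agent_trace if step.get("type") == "tool_call"]

def compare_tools(actual: list[str], expected: list[str]) -> float:
    """Set-based approximation: overlap of actual vs. expected tool names."""
    actual_set, expected_set = set(actual), set(expected)
    if not actual_set and not expected_set:
        return 1.0  # nothing expected, nothing called
    return len(actual_set & expected_set) / len(actual_set | expected_set)
```

A set-based comparison like this penalizes missing tools and unexpected extra calls symmetrically; the LLM-based judgment can be more lenient about ordering and duplicates.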
Scoring
- Range: 0.0 to 1.0 (higher is better).
- High score (close to 1.0): The Agent used the expected tools correctly.
- Low score (close to 0.0): The Agent used incorrect tools, skipped required tools, or called unexpected tools.
- Pass condition: The score must be greater than or equal to the configured threshold.
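A minimal sketch of how the pass condition and strict_mode could interact, assuming a raw score has already been produced (the helper below is hypothetical, not framework code):

```python
def apply_threshold(raw_score: float, threshold: float = 0.8,
                    strict_mode: bool = False) -> tuple[float, bool]:
    """Illustrative pass/fail logic for this metric's scoring rules."""
    # strict_mode collapses the score to 1.0 or 0.0 around the threshold.
    score = (1.0 if raw_score >= threshold else 0.0) if strict_mode else raw_score
    return score, score >= threshold

print(apply_threshold(0.85))                    # (0.85, True)
print(apply_threshold(0.7, strict_mode=True))   # (0.0, False)
```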
Example
Expected tools: ["search_knowledge_base", "create_ticket"]
Input: "I can't log in to my account. Can you create a support ticket?"
AI Response: "I've searched our knowledge base for common login issues and created a support ticket #4521 for you. Our team will follow up within 24 hours."
Tools called: search_knowledge_base, create_ticket
Score: 1.0
Result: Passed (threshold: 0.8)
The Agent called exactly the expected tools in this scenario.
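Using the illustrative set-based comparison from the How It Works sketch, this example scores as follows (again an approximation of the LLM-based judgment, not the actual computation):

```python
# Reproducing this example's score with the set-based approximation:
expected = {"search_knowledge_base", "create_ticket"}
called = {"search_knowledge_base", "create_ticket"}

score = len(called & expected) / len(called | expected)
print(score)          # 1.0
print(score >= 0.8)   # True -> Passed (threshold: 0.8)
```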
Tips for Improving Scores
- Make tool descriptions clear and specific so the Agent knows exactly when to use each tool.
- List all expected tools — if the Agent should call multiple tools for a scenario, include all of them.
- If the Agent calls extra tools beyond the expected list, review whether those calls are necessary or if tool descriptions need refinement.
- Test edge cases where the Agent might be tempted to use the wrong tool (e.g., similar-sounding tools with different purposes).
- Ensure tool names in the expected_tools list exactly match the tool names configured in your Agent.
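The last tip is easy to automate before a test run. A minimal sketch, assuming your framework can list the tool names registered on the Agent (agent_tool_names is a hypothetical placeholder):

```python
# Pre-flight check: every name in expected_tools must exactly match a tool
# configured on the Agent. agent_tool_names is a hypothetical placeholder.
agent_tool_names = {"search_knowledge_base", "create_ticket", "escalate_issue"}
expected_tools = ["search_knowledge_base", "create_tickets"]  # note the typo

unknown = [name for name in expected_tools if name not in agent_tool_names]
if unknown:
    raise ValueError(f"expected_tools not configured on the Agent: {unknown}")
```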