Task Completion
Generic Metric
Introduction
The Task Completion metric evaluates whether your AI Agent successfully completed a specific task you defined. It checks both the response content and the tools the Agent used to determine if the task was fully accomplished.
When to Use This Metric
- You need to verify that your Agent completes specific workflows (e.g., booking a meeting, looking up information).
- You're testing whether the Agent follows multi-step processes correctly.
- You want to validate that the Agent uses tools appropriately to accomplish a task.
- You're testing task-oriented Agents that need to perform actions, not just answer questions.
- You need to ensure the Agent handles the full lifecycle of a request.
Configuration
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| threshold | float | 0.8 | No | Score threshold for passing (0.0–1.0). |
| strict_mode | boolean | false | No | Rounds score to 1.0 or 0.0 based on threshold. |
| task | string | — | Yes | The task that should be completed by the AI Agent. |
The task parameter is required. You must provide a clear description of the task you expect the Agent to complete. Without it, the metric cannot evaluate task completion.
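As an illustration of the parameters above, the following minimal sketch models the configuration as a Python dataclass. The class name `TaskCompletionConfig` and its validation rules are assumptions made for this example, not part of the product's API. Only the parameter names, types, and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCompletionConfig:
    """Hypothetical container mirroring the parameter table (not a real API)."""

    task: str                  # required: the task the Agent should complete
    threshold: float = 0.8     # optional: passing score, between 0.0 and 1.0
    strict_mode: bool = False  # optional: rounds the score to 1.0 or 0.0

    def __post_init__(self) -> None:
        # The metric cannot evaluate anything without a task description.
        if not self.task or not self.task.strip():
            raise ValueError("task is required and must not be empty")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")


config = TaskCompletionConfig(
    task="The Agent should look up the customer's order status "
         "and provide the tracking number."
)
```

Omitting `task` or passing a blank string raises an error in this sketch, which mirrors the requirement stated above.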
How It Works
- The AI Agent receives the input message and generates a response, potentially using tools in the process.
- The testing LLM reviews the Agent's response and the list of tools called during the interaction.
- The testing LLM evaluates whether the described task was fully completed based on the response and tool usage.
- A score is produced reflecting the degree of task completion.
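To make the evaluation inputs concrete, the sketch below assembles the three pieces the testing LLM reviews: the task description, the Agent's response, and the tools called. The function `build_judge_prompt` and the prompt wording are hypothetical. The actual prompt used by the testing LLM is not documented here.

```python
def build_judge_prompt(task: str, agent_response: str, tools_called: list[str]) -> str:
    """Hypothetical helper: combine the evaluation inputs into one prompt."""
    # Show an explicit marker when the Agent called no tools at all.
    tools_text = ", ".join(tools_called) if tools_called else "(none)"
    return (
        "Task the Agent was expected to complete:\n"
        f"{task}\n\n"
        "Agent response:\n"
        f"{agent_response}\n\n"
        f"Tools called: {tools_text}\n\n"
        "Rate from 0.0 to 1.0 how completely the task was accomplished."
    )


prompt = build_judge_prompt(
    task="Look up the customer's order status and provide the tracking number.",
    agent_response="Your order #12345 is in transit. Tracking: TRK-987654321.",
    tools_called=["lookup_order"],
)
```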
Scoring
- Range: 0.0 to 1.0 (higher is better).
- High score (close to 1.0): The Agent fully completed all aspects of the defined task.
- Low score (close to 0.0): The Agent failed to complete the task or only partially addressed it.
- Pass condition: The score must be greater than or equal to the configured threshold.
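The pass condition can be sketched in a few lines of Python. The function name `apply_threshold` is hypothetical. The strict-mode behavior shown here (1.0 when the score meets the threshold, 0.0 otherwise) is one reading of "rounds score to 1.0 or 0.0 based on threshold" and should be treated as an assumption.

```python
def apply_threshold(score: float, threshold: float = 0.8,
                    strict_mode: bool = False) -> tuple[float, bool]:
    """Hypothetical helper: return (reported_score, passed)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Pass condition from the documentation: score >= threshold.
    passed = score >= threshold
    if strict_mode:
        # Assumed interpretation: collapse the score to a binary value.
        score = 1.0 if passed else 0.0
    return score, passed


result = apply_threshold(0.92)                           # (0.92, True)
strict_result = apply_threshold(0.92, strict_mode=True)  # (1.0, True)
```

A score exactly equal to the threshold passes, because the comparison is inclusive.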
Example
Task: "The Agent should look up the customer's order status and provide the tracking number."
Input: "Can you check the status of my order #12345?"
AI Response: "Your order #12345 is currently in transit. The tracking number is TRK-987654321. It's expected to arrive by Thursday."
Tools called: lookup_order
Score: 0.92
Result: Passed (threshold: 0.8)
The Agent completed both parts of the task: looking up the order status and providing the tracking number.
Tips for Improving Scores
- Write clear, specific task descriptions that define exactly what "completion" looks like.
- Break complex tasks into subtasks in the description (e.g., "The Agent should: 1) look up the order, 2) provide the tracking number, 3) give the delivery estimate").
- Ensure the Agent has access to all the tools it needs to complete the task.
- If the Agent partially completes tasks, check whether tool descriptions are clear enough for the Agent to know when and how to use them.
- Review the reason text when scores are low — it often indicates which part of the task was missed.
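The tip about breaking a task into subtasks can be sketched as a small helper that numbers each step. The function `build_task_description` is hypothetical and exists only to show the format. The resulting string is what would be supplied as the task parameter.

```python
def build_task_description(subtasks: list[str]) -> str:
    """Hypothetical helper: turn a list of subtasks into a numbered task string."""
    if not subtasks:
        raise ValueError("at least one subtask is required")
    # Number the steps 1), 2), 3) so each one is explicit in the description.
    steps = ", ".join(f"{i}) {step}" for i, step in enumerate(subtasks, start=1))
    return f"The Agent should: {steps}"


task = build_task_description([
    "look up the order",
    "provide the tracking number",
    "give the delivery estimate",
])
```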