Task Completion

Generic Metric

Introduction

The Task Completion metric evaluates whether your AI Agent successfully completed a specific task you defined. It checks both the response content and the tools the Agent used to determine if the task was fully accomplished.


When to Use This Metric

  • You need to verify that your Agent completes specific workflows (e.g., booking a meeting, looking up information).
  • You're testing whether the Agent follows multi-step processes correctly.
  • You want to validate that the Agent uses tools appropriately to accomplish a task.
  • You're testing task-oriented Agents that need to perform actions, not just answer questions.
  • You need to ensure the Agent handles the full lifecycle of a request.

Configuration

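| Parameter   | Type    | Default | Required | Description                                        |
|-------------|---------|---------|----------|----------------------------------------------------|
| threshold   | float   | 0.8     | No       | Score threshold for passing (0.0–1.0).             |
| strict_mode | boolean | false   | No       | Rounds score to 1.0 or 0.0 based on threshold.     |
| task        | string  |         | Yes      | The task that should be completed by the AI Agent. |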
Warning: The task parameter is required. You must provide a clear description of the task you expect the Agent to complete. Without it, the metric cannot evaluate task completion.
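As a rough illustration of how these parameters fit together, the sketch below models them as a small Python dataclass. The class name `TaskCompletionConfig` and its validation are hypothetical and not part of the product's API; only the parameter names, types, and defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCompletionConfig:
    """Hypothetical container mirroring the documented parameters."""

    task: str                  # required: the task the Agent should complete
    threshold: float = 0.8     # score needed to pass, between 0.0 and 1.0
    strict_mode: bool = False  # when True, the score is rounded to 1.0 or 0.0

    def __post_init__(self) -> None:
        # The metric cannot evaluate anything without a task description.
        if not self.task or not self.task.strip():
            raise ValueError("task is required and must not be empty")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")


# Only the task is mandatory; threshold and strict_mode fall back to defaults.
config = TaskCompletionConfig(
    task="The Agent should look up the customer's order status "
         "and provide the tracking number."
)
```

Rejecting an empty task at construction time mirrors the warning above: a missing task description is a configuration error, not a low score.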


How It Works

  1. The AI Agent receives the input message and generates a response, potentially using tools in the process.
  2. The testing LLM reviews the Agent's response and the list of tools called during the interaction.
  3. The testing LLM evaluates whether the described task was fully completed based on the response and tool usage.
  4. A score is produced reflecting the degree of task completion.
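The steps above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: `Interaction`, `evaluate_task_completion`, and `keyword_judge` are hypothetical names, and the toy judge stands in for the testing LLM, whose actual prompt and scoring logic are not described here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Interaction:
    """What the judge sees: the Agent's reply and the tools it called."""
    response: str
    tools_called: List[str]


# A judge takes (task, interaction) and returns a score in [0.0, 1.0].
# In the real metric this role is played by the testing LLM.
Judge = Callable[[str, Interaction], float]


def evaluate_task_completion(task: str, interaction: Interaction, judge: Judge) -> float:
    """Steps 2-4: hand the response and tool list to the judge, return its score."""
    score = judge(task, interaction)
    # Clamp defensively so the result always stays in the documented range.
    return max(0.0, min(1.0, score))


def keyword_judge(task: str, interaction: Interaction) -> float:
    """Toy stand-in for the testing LLM (assumption): half credit per signal."""
    used_lookup = "lookup_order" in interaction.tools_called
    has_tracking = "tracking number" in interaction.response.lower()
    return (used_lookup + has_tracking) / 2


interaction = Interaction(
    response="Your order #12345 is in transit. The tracking number is TRK-987654321.",
    tools_called=["lookup_order"],
)
score = evaluate_task_completion(
    "Look up the order status and provide the tracking number.",
    interaction,
    keyword_judge,
)
```

Passing the judge in as a callable keeps the flow testable without calling a model; a real judge would weigh the response and tool list against the task description rather than match keywords.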

Scoring

  • Range: 0.0 to 1.0 (higher is better).
  • High score (close to 1.0): The Agent fully completed all aspects of the defined task.
  • Low score (close to 0.0): The Agent failed to complete the task or only partially addressed it.
  • Pass condition: The score must be greater than or equal to the configured threshold.
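The pass condition and strict mode can be expressed in a few lines of Python. This is a sketch under one assumption: in strict mode, a score at or above the threshold is reported as 1.0 and anything below it as 0.0, which is consistent with the pass condition above but is not spelled out in the documentation. The function name `finalize_score` is hypothetical.

```python
from typing import Tuple


def finalize_score(raw_score: float, threshold: float = 0.8,
                   strict_mode: bool = False) -> Tuple[float, bool]:
    """Return (reported_score, passed) for a raw score in [0.0, 1.0]."""
    # Pass condition: the score must be greater than or equal to the threshold.
    passed = raw_score >= threshold
    if strict_mode:
        # Assumption: strict mode reports 1.0 on a pass and 0.0 otherwise.
        return (1.0 if passed else 0.0), passed
    return raw_score, passed
```

Because the comparison is inclusive, a score exactly equal to the threshold passes. Strict mode does not change whether a run passes, only the score that is reported.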

Example

Task: "The Agent should look up the customer's order status and provide the tracking number."

Input: "Can you check the status of my order #12345?"

AI Response: "Your order #12345 is currently in transit. The tracking number is TRK-987654321. It's expected to arrive by Thursday."

Tools called: lookup_order

Score: 0.92

Result: Passed (threshold: 0.8)

The Agent completed both parts of the task: looking up the order status and providing the tracking number.


Tips for Improving Scores

  • Write clear, specific task descriptions that define exactly what "completion" looks like.
  • Break complex tasks into subtasks in the description (e.g., "The Agent should: 1) look up the order, 2) provide the tracking number, 3) give the delivery estimate").
  • Ensure the Agent has access to all the tools it needs to complete the task.
  • If the Agent partially completes tasks, check whether tool descriptions are clear enough for the Agent to know when and how to use them.
  • Review the reason text when scores are low; it often indicates which part of the task was missed.
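If task descriptions are assembled in test code, the subtask tip above can be applied with a small helper like the one below. This is only a convenience sketch; `build_task_description` is a hypothetical function, and the metric itself only needs the resulting string as its task parameter.

```python
from typing import Sequence


def build_task_description(subtasks: Sequence[str]) -> str:
    """Join subtasks into one numbered task description."""
    if not subtasks:
        raise ValueError("at least one subtask is required")
    # Number each step so the judge can tell which part was missed.
    numbered = ", ".join(f"{i}) {step}" for i, step in enumerate(subtasks, start=1))
    return f"The Agent should: {numbered}"


task = build_task_description([
    "look up the order",
    "provide the tracking number",
    "give the delivery estimate",
])
```

Keeping subtasks in a list makes it easy to add or remove a step and rerun the evaluation to see which part of the task lowers the score.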