
Batch Evaluation [preview]

Batch evaluation runs evaluators across all agent sessions in CloudWatch, producing per-session scores and aggregate metrics. Use it to measure agent quality over time, compare before/after prompt changes, or validate ground truth expectations.

Quick Start

# Run a single evaluator across all sessions
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness

# Multiple evaluators
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness Builtin.Helpfulness Builtin.Faithfulness

# JSON output for scripting
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json

Available Evaluators

Built-in evaluators provided by AgentCore:

Evaluator                          What it measures
Builtin.Correctness                Factual accuracy of responses
Builtin.Helpfulness                How well responses address the user's goal
Builtin.Faithfulness               Grounding in tool results / provided context
Builtin.GoalSuccessRate            Whether the agent achieved the user's goal
Builtin.ToolSelectionAccuracy      Correct tool chosen for the task
Builtin.Completeness               Whether all parts of the request were handled
Builtin.TrajectoryExactOrderMatch  Tool call sequence matches expected trajectory

Custom evaluators defined in your project (via agentcore add evaluator) can also be used.
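For example, after defining one you can mix it with the built-ins (a sketch; Custom.ResponseTone is a hypothetical evaluator name, use whatever ID your project defines):

# Scaffold a custom evaluator in your project, then reference it by name in a run
agentcore add evaluator
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness Custom.ResponseTone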

Filtering Sessions

By time window

# Only sessions from the last 3 days
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --lookback-days 3

By session ID

agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness -s <session-id-1> <session-id-2>

Ground Truth

Provide expected responses, assertions, or expected tool trajectories for specific sessions:

agentcore run batch-evaluation \
  -r MyAgent \
  -e Builtin.Correctness Builtin.GoalSuccessRate \
  -s <session-id> \
  --ground-truth ./ground_truth.json

Ground truth file format

[
  {
    "sessionId": "<session-id>",
    "groundTruth": {
      "inline": {
        "assertions": [{ "text": "Agent should use the lookup_order tool" }],
        "expectedTrajectory": {
          "toolNames": ["lookup_order"]
        },
        "turns": [
          {
            "input": "What's the status of order ORD-1001?",
            "expectedResponse": { "text": "Order ORD-1001 has been delivered" }
          }
        ]
      }
    }
  }
]

All fields inside inline are optional; include only what's relevant (a minimal example follows this list):

  • assertions — free-text expectations evaluated by Builtin.GoalSuccessRate
  • expectedTrajectory — tool call sequence evaluated by Builtin.TrajectoryExactOrderMatch
  • turns — input/expected-response pairs evaluated by Builtin.Correctness
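For example, to check only the tool call sequence for a session, a trajectory-only entry is enough (a minimal sketch reusing the format above; send_notification is a hypothetical second tool name):

[
  {
    "sessionId": "<session-id>",
    "groundTruth": {
      "inline": {
        "expectedTrajectory": {
          "toolNames": ["lookup_order", "send_notification"]
        }
      }
    }
  }
]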

Custom Name

agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness -n "weekly_quality_check"

Names must start with a letter and contain only letters, digits, and underscores (max 48 characters).
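If you generate run names in scripts, a pre-check like the following can catch invalid names early (a sketch of the documented rule; the CLI still performs its own validation):

# Starts with a letter, then letters/digits/underscores, 48 chars total max
NAME="weekly_quality_check"
if [[ "$NAME" =~ ^[A-Za-z][A-Za-z0-9_]{0,47}$ ]]; then
  agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness -n "$NAME"
fi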

Stopping a Running Evaluation

agentcore stop batch-evaluation -i <batch-evaluation-id>

Viewing Results

CLI output

After the run completes, the CLI prints scores grouped by evaluator, along with each evaluator's average score.

Local history

Results are saved in .cli/eval-job-results/. View past runs via the TUI:

agentcore
# Navigate to: Evals → Batch Evaluation History

JSON output

agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json

The output includes the batchEvaluationId, evaluationResults (with numberOfSessionsCompleted), and evaluatorSummaries with a per-evaluator averageScore.
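For scripting, the aggregate scores can be pulled out with jq. This sketch assumes each entry in evaluatorSummaries carries an evaluator identifier field named evaluatorId alongside averageScore; check the actual output for the exact field names:

# Print each evaluator's average score (field names inside the summaries are assumptions)
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json \
  | jq '.evaluatorSummaries[] | {evaluatorId, averageScore}'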

TUI Wizard

Run agentcore → Run → Batch Evaluation for a guided flow:

  1. Select agent
  2. Multi-select evaluators
  3. Set lookback days
  4. Optionally select specific sessions
  5. Optionally add ground truth
  6. Name the run (optional)
  7. Confirm and run

The TUI shows real-time progress with elapsed time and step indicators.