Batch evaluation runs evaluators across all agent sessions in CloudWatch, producing per-session scores and aggregate metrics. Use it to measure agent quality over time, compare quality before and after prompt changes, or check agent behavior against ground truth expectations.
```bash
# Run a single evaluator across all sessions
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness

# Multiple evaluators
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness Builtin.Helpfulness Builtin.Faithfulness

# JSON output for scripting
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json
```
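These flags compose with ordinary shell scripting. A small sketch; the agent names below are placeholders for agents in your own project:

```bash
# Run the same evaluators across several agents (agent names are placeholders)
for agent in MyAgent SupportAgent BillingAgent; do
  agentcore run batch-evaluation -r "$agent" -e Builtin.Correctness Builtin.Helpfulness
done
```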
Built-in evaluators provided by AgentCore:

| Evaluator | What it measures |
|---|---|
| Builtin.Correctness | Factual accuracy of responses |
| Builtin.Helpfulness | How well responses address the user's goal |
| Builtin.Faithfulness | Grounding in tool results / provided context |
| Builtin.GoalSuccessRate | Whether the agent achieved the user's goal |
| Builtin.ToolSelectionAccuracy | Correct tool chosen for the task |
| Builtin.Completeness | Whether all parts of the request were handled |
| Builtin.TrajectoryExactOrderMatch | Tool call sequence matches expected trajectory |
Custom evaluators defined in your project (via agentcore add evaluator) can also be used.
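A sketch mixing a custom evaluator with a built-in; MyCustomEval is a hypothetical name standing in for whatever agentcore add evaluator created in your project, and the exact reference syntax may differ:

```bash
# Combine a project-defined evaluator with a built-in (MyCustomEval is hypothetical)
agentcore run batch-evaluation -r MyAgent -e MyCustomEval Builtin.Correctness
```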
```bash
# Only sessions from the last 3 days
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --lookback-days 3

# Only specific sessions
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness -s <session-id-1> <session-id-2>
```

Provide expected responses, assertions, or expected tool trajectories for specific sessions:
```bash
agentcore run batch-evaluation \
  -r MyAgent \
  -e Builtin.Correctness Builtin.GoalSuccessRate \
  -s <session-id> \
  --ground-truth ./ground_truth.json
```

A ground truth file looks like this:

```json
[
  {
    "sessionId": "<session-id>",
    "groundTruth": {
      "inline": {
        "assertions": [{ "text": "Agent should use the lookup_order tool" }],
        "expectedTrajectory": {
          "toolNames": ["lookup_order"]
        },
        "turns": [
          {
            "input": "What's the status of order ORD-1001?",
            "expectedResponse": { "text": "Order ORD-1001 has been delivered" }
          }
        ]
      }
    }
  }
]
```

All fields inside inline are optional; include only what's relevant:
- assertions: free-text expectations evaluated by Builtin.GoalSuccessRate
- expectedTrajectory: tool call sequence evaluated by Builtin.TrajectoryExactOrderMatch
- turns: input/expected-response pairs evaluated by Builtin.Correctness
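For example, to check only the tool call order for a session, the file can carry just expectedTrajectory. A minimal sketch; the session ID and tool name are placeholders:

```bash
# Ground truth with only an expected tool trajectory (values are placeholders)
cat > ./ground_truth.json <<'EOF'
[
  {
    "sessionId": "<session-id>",
    "groundTruth": {
      "inline": {
        "expectedTrajectory": { "toolNames": ["lookup_order"] }
      }
    }
  }
]
EOF

agentcore run batch-evaluation -r MyAgent -e Builtin.TrajectoryExactOrderMatch \
  -s <session-id> --ground-truth ./ground_truth.json
```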
Give a run a name to identify it later:

```bash
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness -n "weekly_quality_check"
```

Names must start with a letter and contain only letters, digits, and underscores (max 48 characters).
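Named runs make the before/after comparisons mentioned earlier easy to track. The names and workflow below are one option, not a prescribed flow:

```bash
# Baseline before a prompt change
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness -n "baseline_v1"

# ...deploy the prompt change, let new sessions accumulate, then re-run
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --lookback-days 3 -n "candidate_v2"
```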
To stop a running batch evaluation:

```bash
agentcore stop batch-evaluation -i <batch-evaluation-id>
```

The CLI shows scores grouped by evaluator, with average scores, after the run completes.
Results are saved in .cli/eval-job-results/. View past runs via the TUI:

```bash
agentcore
# Navigate to: Evals → Batch Evaluation History
```

For scripting, request machine-readable output:

```bash
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json
```

This returns batchEvaluationId, evaluationResults with numberOfSessionsCompleted, and evaluatorSummaries with a per-evaluator averageScore.
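A minimal scripting sketch against that output; the evaluatorId field name and the exact jq paths are assumptions based on the field names above, so adjust them to the real JSON shape:

```bash
# Gate a pipeline on evaluation quality (jq paths are assumptions)
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json > result.json

# Print each evaluator's average score
jq -r '.evaluatorSummaries[] | "\(.evaluatorId): \(.averageScore)"' result.json

# Exit non-zero if any average score falls below 0.8
jq -e 'all(.evaluatorSummaries[]; .averageScore >= 0.8)' result.json > /dev/null \
  || { echo "quality gate failed"; exit 1; }
```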
Run agentcore → Run → Batch Evaluation for a guided flow:
- Select agent
- Multi-select evaluators
- Set lookback days
- Optionally select specific sessions
- Optionally add ground truth
- Name the run (optional)
- Confirm and run
The TUI shows real-time progress with elapsed time and step indicators.