Summary
agentcore run eval consistently fails with "Failed to parse event body for span with ID: <span_id>" when evaluating Strands-based agents. After tracing through the bedrock-agentcore SDK source code, I believe this is a mismatch between what the CLI sends to the evaluate API and what the backend evaluator expects.
Environment
agentcore CLI version: latest (Node.js, @aws/agentcore)
bedrock-agentcore SDK version: 1.7.0
strands-agents version: >=0.1.0 (1.37.0 installed)
- Agent framework: Strands (Python)
- AWS region:
us-east-1
Steps to Reproduce
- Deploy a Strands-based agent to AgentCore Runtime
- Invoke the agent once to produce a session with spans in CloudWatch (
aws/spans)
- Run
agentcore run eval --evaluator "Builtin.Correctness" --trace-id <trace_id>
Observed Behavior
{
"evaluator": "Builtin.Correctness",
"aggregateScore": 0,
"sessionScores": [
{
"sessionId": "a3498176-816d-462b-a038-8ad4e7236a50",
"traceId": "69f15c222374220b7e1f00bc2f1dcaa6",
"value": 0,
"errorMessage": "Failed to parse event body for span with ID: 040303ac6912ea97"
}
]
}
Root Cause Analysis
I traced through the bedrock-agentcore SDK and found two distinct evaluation paths that produce different sessionSpans payloads:
Path 1: Python SDK (StrandsEvalsAgentCoreEvaluator) — works
When using convert_strands_to_adot() on raw in-memory OTel spans, it:
- Calls
ADOTDocumentBuilder.build_span_document() — which omits the events array from the span document
- Generates a separate conversation log record document with
body.input.messages / body.output.messages
- Sends both to the evaluate API in
sessionSpans
The backend reads the log record and evaluates successfully.
Path 2: agentcore run eval CLI — broken
The CLI uses CloudWatchAgentSpanCollector._fetch_spans() which queries aws/spans (and the runtime log group) and sends the raw CloudWatch spans directly to the evaluate API without any transformation.
These CloudWatch spans have the events array embedded but no separate conversation log records. The backend tries to parse the span events and fails.
Relevant Code
CloudWatchAgentSpanCollector._fetch_spans() (agent_span_collector.py):
# Fetches raw ADOT spans from CloudWatch including the events array
aws_spans = self._helper.query_log_group(AWS_SPANS_LOG_GROUP, ...)
event_spans = self._helper.query_log_group(self.log_group_name, ...)
all_data = aws_spans + event_spans
return all_data # Sent as-is to evaluate API — no conversation log records generated
ADOTDocumentBuilder.build_span_document() (adot_models.py):
return {
"resource": ..., "scope": ..., "traceId": ..., "spanId": ...,
"attributes": attributes,
"status": ...,
# NOTE: no "events" key — events are intentionally dropped here
}
Failing Span (from aws/spans)
The span that the backend fails to parse is the invoke_agent Strands Agents span. Its events look like:
{
"traceId": "69f15c222374220b7e1f00bc2f1dcaa6",
"spanId": "040303ac6912ea97",
"name": "invoke_agent Strands Agents",
"events": [
{
"name": "gen_ai.user.message",
"attributes": {
"content": "[{\"text\": \"Is this email phishing? Subject: ...\"}]"
}
},
{
"name": "gen_ai.choice",
"attributes": {
"message": "{\"label\":\"phishing\", ...}\n",
"finish_reason": "end_turn"
}
}
]
}
The content field is a JSON-encoded Anthropic message content array ([{"text": "..."}]), not a plain string. This is what strands-agents emits via serialize(message["content"]).
Proposed Fix
The CLI's evaluation path should apply the same span transformation that the Python SDK does before sending to the evaluate API:
- For each span with events, run it through
StrandsToADOTConverter.convert_span() (or equivalent)
- This drops the
events from the span document and generates a separate conversation log record with body.input.messages / body.output.messages
- Include both the transformed span documents and the conversation log records in
sessionSpans
Alternatively, the backend evaluator could be updated to correctly parse the Strands event format (Anthropic content arrays) directly from span events.
Workaround
Use the Python SDK evaluation path directly instead of the CLI:
from bedrock_agentcore.evaluation.integrations.strands_agents_evals.evaluator import StrandsEvalsAgentCoreEvaluator
# ... invoke agent with in-memory OTel exporter, convert spans, evaluate
This bypasses CloudWatch entirely and works correctly.
Summary
agentcore run evalconsistently fails with"Failed to parse event body for span with ID: <span_id>"when evaluating Strands-based agents. After tracing through the bedrock-agentcore SDK source code, I believe this is a mismatch between what the CLI sends to the evaluate API and what the backend evaluator expects.Environment
agentcoreCLI version: latest (Node.js,@aws/agentcore)bedrock-agentcoreSDK version:1.7.0strands-agentsversion:>=0.1.0(1.37.0 installed)us-east-1Steps to Reproduce
aws/spans)agentcore run eval --evaluator "Builtin.Correctness" --trace-id <trace_id>Observed Behavior
{ "evaluator": "Builtin.Correctness", "aggregateScore": 0, "sessionScores": [ { "sessionId": "a3498176-816d-462b-a038-8ad4e7236a50", "traceId": "69f15c222374220b7e1f00bc2f1dcaa6", "value": 0, "errorMessage": "Failed to parse event body for span with ID: 040303ac6912ea97" } ] }Root Cause Analysis
I traced through the bedrock-agentcore SDK and found two distinct evaluation paths that produce different
sessionSpanspayloads:Path 1: Python SDK (
StrandsEvalsAgentCoreEvaluator) — worksWhen using
convert_strands_to_adot()on raw in-memory OTel spans, it:ADOTDocumentBuilder.build_span_document()— which omits theeventsarray from the span documentbody.input.messages/body.output.messagessessionSpansThe backend reads the log record and evaluates successfully.
Path 2:
agentcore run evalCLI — brokenThe CLI uses
CloudWatchAgentSpanCollector._fetch_spans()which queriesaws/spans(and the runtime log group) and sends the raw CloudWatch spans directly to the evaluate API without any transformation.These CloudWatch spans have the
eventsarray embedded but no separate conversation log records. The backend tries to parse the span events and fails.Relevant Code
CloudWatchAgentSpanCollector._fetch_spans()(agent_span_collector.py):ADOTDocumentBuilder.build_span_document()(adot_models.py):Failing Span (from
aws/spans)The span that the backend fails to parse is the
invoke_agent Strands Agentsspan. Its events look like:{ "traceId": "69f15c222374220b7e1f00bc2f1dcaa6", "spanId": "040303ac6912ea97", "name": "invoke_agent Strands Agents", "events": [ { "name": "gen_ai.user.message", "attributes": { "content": "[{\"text\": \"Is this email phishing? Subject: ...\"}]" } }, { "name": "gen_ai.choice", "attributes": { "message": "{\"label\":\"phishing\", ...}\n", "finish_reason": "end_turn" } } ] }The
contentfield is a JSON-encoded Anthropic message content array ([{"text": "..."}]), not a plain string. This is whatstrands-agentsemits viaserialize(message["content"]).Proposed Fix
The CLI's evaluation path should apply the same span transformation that the Python SDK does before sending to the evaluate API:
StrandsToADOTConverter.convert_span()(or equivalent)eventsfrom the span document and generates a separate conversation log record withbody.input.messages/body.output.messagessessionSpansAlternatively, the backend evaluator could be updated to correctly parse the Strands event format (Anthropic content arrays) directly from span events.
Workaround
Use the Python SDK evaluation path directly instead of the CLI:
This bypasses CloudWatch entirely and works correctly.