Component
CDK / infrastructure
Describe the bug
When a repository is routed to the ECS Fargate backend, the orchestrator's
DynamoDB heartbeat check (pollTaskStatus in
cdk/src/handlers/shared/orchestrator.ts) unconditionally flips
sessionUnhealthy = true after ~6 minutes of RUNNING, because the ECS batch
entrypoint never writes agent_heartbeat_at. finalizeTask then overrides the
real outcome with:
Agent session lost: no recent heartbeat from the runtime (container may have crashed, been OOM-killed, or stopped)
even though the agent completed successfully, pushed the branch, and opened the PR.
Expected behavior
After applying the changes listed under Possible solution, the task status reports COMPLETED:
$cli % node lib/bin/bgagent.js status 01KPXC5AEGM16EFC6SMM182BG1
Task: 01KPXC5AEGM16EFC6SMM182BG1
Status: COMPLETED
Repo: charan-amzn/agent-plugins
Description: Append a line at the end of README.md that says 'ECS smoke test after orchestrator heartbeat fix'
Branch: bgagent/01KPXC5AEGM16EFC6SMM182BG1/append-a-line-at-the-end-of-readme-md-that-says-ec
Session: arn:aws:ecs:us-west-2:123456789122:task/backgroundagent-dev-EcsAgentClusterDA22BECE-Lwm9HnUhAeDh/e61c91c9832249b0be01fea70e5d9e3a
PR: charan-amzn/agent-plugins#2
Created: 2026-04-23T14:33:32.880Z
Started: 2026-04-23T14:33:38.803Z
Completed: 2026-04-23T14:39:25Z
Duration: 197.7s
Cost: $0.1467
Current behavior
Example task status after the PR was created successfully:
$cli % node lib/bin/bgagent.js status 01KPVT9E1TQBH7FHG2RJ0WBZXY 2>&1
Task: 01KPVT9E1TQBH7FHG2RJ0WBZXY
Status: FAILED
Repo: charan-amzn/agent-plugins
Description: Add a line to README.md that says 'Hello from ABCA on ECS Fargate'
Branch: bgagent/01KPVT9E1TQBH7FHG2RJ0WBZXY/add-a-line-to-readme-md-that-says-hello-from-abca-
Session: arn:aws:ecs:us-west-2:123456789122:task/backgroundagent-dev-EcsAgentClusterDA22BECE-Lwm9HnUhAeDh/63194785609643249912bdc14b699190
Error: Agent session lost: no recent heartbeat from the runtime (container may have crashed, been OOM-killed, or stopped)
Created: 2026-04-23T00:01:58.842Z
Started: 2026-04-23T00:02:04.451Z
Completed: 2026-04-23T00:06:37.219Z
Reproduction steps
- Onboard a repo with compute: { type: 'ecs' } via Blueprint.
- Submit a task that takes more than ~2 minutes of wall-clock time (e.g. one that runs mise run build after the edit):
  bgagent submit \
    --repo your-org/your-repo \
    --task "Add a line to README.md that says hello"
- Observe the Fargate task complete, the PR appear in GitHub, and then:
  bgagent status <task-id>
  Status: FAILED
  Error: Agent session lost: no recent heartbeat from the runtime ...
A real recorded run that hit this: task 01KPVT9E1TQBH7FHG2RJ0WBZXY (shown under Current behavior). The PR was created successfully, but the status is still FAILED.
Possible solution
Pass the compute type into pollTaskStatus and skip the heartbeat-staleness logic when the backend is ECS.
Diff from my successful test:
cdk/src/handlers/shared/orchestrator.ts
  export async function pollTaskStatus(
    taskId: string,
    state: PollState,
+   computeType?: string,
  ): Promise<PollState> {
    ...
    let sessionUnhealthy = false;
    if (
+     computeType !== 'ecs'
-     currentStatus === TaskStatus.RUNNING
+     && currentStatus === TaskStatus.RUNNING
      && item?.session_id
      && typeof item.started_at === 'string'
    ) {
cdk/src/handlers/orchestrate-task.ts
- const ddbState = await pollTaskStatus(taskId, state);
+ const ddbState = await pollTaskStatus(taskId, state, blueprintConfig.compute_type);
No behavior change on AgentCore. On ECS, sessionUnhealthy stays false; the existing computeStrategy.pollSession path still detects real failures and real success.
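To make the intended effect of the guard concrete, here is a minimal, self-contained sketch of the gating logic. It is a simplified model, not the real orchestrator code: the TaskItem shape, the isSessionUnhealthy helper, the 6-minute grace constant (taken from the "~6 minutes" observed above), the fallback to started_at when no heartbeat exists, and the 'agentcore' compute type string are all assumptions made for illustration.

```typescript
// Hypothetical, simplified model of the heartbeat gate; not the real orchestrator code.
type TaskItem = {
  status: 'RUNNING' | 'COMPLETED' | 'FAILED';
  session_id?: string;
  started_at?: string;         // ISO timestamp
  agent_heartbeat_at?: string; // never written by the ECS entrypoint, per this report
};

// Assumed grace period, based on the "~6 minutes" observed in this report.
const HEARTBEAT_GRACE_MS = 6 * 60 * 1000;

function isSessionUnhealthy(item: TaskItem, nowMs: number, computeType?: string): boolean {
  // Proposed guard: the ECS backend emits no heartbeats, so skip the staleness check.
  if (computeType === 'ecs') return false;
  if (item.status !== 'RUNNING' || !item.session_id || typeof item.started_at !== 'string') {
    return false;
  }
  // Assumption: with no heartbeat recorded, staleness is measured from started_at.
  const lastSeenMs = Date.parse(item.agent_heartbeat_at ?? item.started_at);
  return nowMs - lastSeenMs > HEARTBEAT_GRACE_MS;
}

// A RUNNING task that has never written a heartbeat, checked 7 minutes after start.
const item: TaskItem = {
  status: 'RUNNING',
  session_id: 'example-session',
  started_at: '2026-04-23T00:02:04.451Z',
};
const sevenMinutesLater = Date.parse(item.started_at!) + 7 * 60 * 1000;

console.log(isSessionUnhealthy(item, sevenMinutesLater));              // true (current behavior)
console.log(isSessionUnhealthy(item, sevenMinutesLater, 'ecs'));       // false (with the guard)
console.log(isSessionUnhealthy(item, sevenMinutesLater, 'agentcore')); // true (unchanged)
```

Keeping computeType optional means existing callers that do not pass it keep today's behavior, which matches the "no behavior change on AgentCore" claim.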
Node version
No response
Python version
No response
mise version
No response
AWS region
No response
Local modifications
No response
Additional context
No response
AWS quotas