ECS Fargate tasks marked FAILED after successful completion #45

@charan-amzn

Component

CDK / infrastructure

Describe the bug

When a repository is routed to the ECS Fargate backend, the orchestrator's
DynamoDB heartbeat check (pollTaskStatus in
cdk/src/handlers/shared/orchestrator.ts) unconditionally flips
sessionUnhealthy = true after ~6 minutes of RUNNING, because the ECS batch
entrypoint never writes agent_heartbeat_at. finalizeTask then overrides the
real outcome with:

Agent session lost: no recent heartbeat from the runtime (container may have crashed, been OOM-killed, or stopped)

even though the agent completed successfully, pushed the branch, and opened the PR.
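To make the failure mode concrete, here is a minimal sketch of how a heartbeat-staleness check of this kind can misfire when the heartbeat field is never written. It is not the actual code in orchestrator.ts. The field names (session_id, started_at, agent_heartbeat_at) come from this report, while the function name, the fallback to started_at, and the 6-minute window are assumptions for illustration.

```typescript
// Hypothetical, simplified model of the heartbeat-staleness check.
// Field names come from this issue; the window, the fallback rule, and the
// function name are assumptions, not the real orchestrator implementation.
interface TaskRecord {
  session_id?: string;
  started_at?: string;         // ISO-8601 timestamp
  agent_heartbeat_at?: string; // ISO-8601 timestamp; never written by the ECS entrypoint
}

const HEARTBEAT_GRACE_MS = 6 * 60 * 1000; // assumed, based on the "~6 minutes" reported above

function heartbeatLooksStale(record: TaskRecord, nowMs: number): boolean {
  if (!record.session_id || typeof record.started_at !== 'string') {
    return false; // not enough information to judge
  }
  // Assumption: with no heartbeat ever written, the start time is the last signal.
  const lastSignal = record.agent_heartbeat_at ?? record.started_at;
  return nowMs - Date.parse(lastSignal) > HEARTBEAT_GRACE_MS;
}

// An ECS task that is still healthy 7 minutes in, but has no heartbeat field.
const ecsRecord: TaskRecord = {
  session_id: 'example-session',
  started_at: '2026-04-23T00:02:04.451Z',
};
const sevenMinutesIn = Date.parse(ecsRecord.started_at!) + 7 * 60 * 1000;

console.log(heartbeatLooksStale(ecsRecord, sevenMinutesIn)); // true
```

Under these assumptions, any ECS task that outlives the window is flagged regardless of what the container is actually doing, which matches the symptom reported here.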

Expected behavior

After applying the changes listed under "Possible solution", the same kind of task reports COMPLETED:

$cli % node lib/bin/bgagent.js status 01KPXC5AEGM16EFC6SMM182BG1
Task: 01KPXC5AEGM16EFC6SMM182BG1
Status: COMPLETED
Repo: charan-amzn/agent-plugins
Description: Append a line at the end of README.md that says 'ECS smoke test after orchestrator heartbeat fix'
Branch: bgagent/01KPXC5AEGM16EFC6SMM182BG1/append-a-line-at-the-end-of-readme-md-that-says-ec
Session: arn:aws:ecs:us-west-2:123456789122:task/backgroundagent-dev-EcsAgentClusterDA22BECE-Lwm9HnUhAeDh/e61c91c9832249b0be01fea70e5d9e3a
PR: charan-amzn/agent-plugins#2
Created: 2026-04-23T14:33:32.880Z
Started: 2026-04-23T14:33:38.803Z
Completed: 2026-04-23T14:39:25Z
Duration: 197.7s
Cost: $0.1467

Current behavior

Example task status after the PR was created successfully:

$cli % node lib/bin/bgagent.js status 01KPVT9E1TQBH7FHG2RJ0WBZXY 2>&1
Task:        01KPVT9E1TQBH7FHG2RJ0WBZXY
Status:      FAILED
Repo:        charan-amzn/agent-plugins
Description: Add a line to README.md that says 'Hello from ABCA on ECS Fargate'
Branch:      bgagent/01KPVT9E1TQBH7FHG2RJ0WBZXY/add-a-line-to-readme-md-that-says-hello-from-abca-
Session:     arn:aws:ecs:us-west-2:123456789122:task/backgroundagent-dev-EcsAgentClusterDA22BECE-Lwm9HnUhAeDh/63194785609643249912bdc14b699190
Error:       Agent session lost: no recent heartbeat from the runtime (container may have crashed, been OOM-killed, or stopped)
Created:     2026-04-23T00:01:58.842Z
Started:     2026-04-23T00:02:04.451Z
Completed:   2026-04-23T00:06:37.219Z

Reproduction steps

  1. Onboard a repo with compute: { type: 'ecs' } via Blueprint.

  2. Submit a task that takes more than ~2 minutes of wall-clock time (e.g. one
    that runs mise run build after the edit):

    bgagent submit \
      --repo your-org/your-repo \
      --task "Add a line to README.md that says hello"

  3. Observe the Fargate task complete, the PR appear in GitHub, and then:

    bgagent status <task-id>
    Status:  FAILED
    Error:   Agent session lost: no recent heartbeat from the runtime ...

A recorded run that hit this: task 01KPVT9E1TQBH7FHG2RJ0WBZXY (shown under "Current behavior"). The PR was created successfully, but the status is still FAILED.

Possible solution

Pass the compute type into pollTaskStatus and skip the heartbeat-staleness logic when the backend is ECS.

Diff from my successful test

cdk/src/handlers/shared/orchestrator.ts

 export async function pollTaskStatus(
   taskId: string,
   state: PollState,
+  computeType?: string,
 ): Promise<PollState> {
   ...
   let sessionUnhealthy = false;
   if (
-    currentStatus === TaskStatus.RUNNING
+    computeType !== 'ecs'
+    && currentStatus === TaskStatus.RUNNING
     && item?.session_id
     && typeof item.started_at === 'string'
   ) {

cdk/src/handlers/orchestrate-task.ts

-      const ddbState = await pollTaskStatus(taskId, state);
+      const ddbState = await pollTaskStatus(taskId, state, blueprintConfig.compute_type);

No behavior change on AgentCore. On ECS, sessionUnhealthy stays false; the existing computeStrategy.pollSession path still detects real failures and real success.
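The guard can be exercised in isolation with a small sketch like the one below. Only the boolean structure mirrors the diff. The status union, the item shape, the helper name shouldCheckHeartbeat, and the 'agentcore' string are hypothetical stand-ins, since the real TaskStatus enum and the actual compute_type values live in the repository.

```typescript
// Hypothetical stand-ins for the repository's own types; only the guard's
// boolean structure is taken from the diff above.
type PolledStatus = 'RUNNING' | 'COMPLETED' | 'FAILED';

interface PolledItem {
  session_id?: string;
  started_at?: unknown;
}

// Returns true when the heartbeat-staleness logic should run for this poll.
function shouldCheckHeartbeat(
  computeType: string | undefined,
  currentStatus: PolledStatus,
  polledItem: PolledItem | undefined,
): boolean {
  return (
    computeType !== 'ecs'
    && currentStatus === 'RUNNING'
    && Boolean(polledItem?.session_id)
    && typeof polledItem?.started_at === 'string'
  );
}

const runningItem: PolledItem = {
  session_id: 'example-session',
  started_at: '2026-04-23T14:33:38.803Z',
};

console.log(shouldCheckHeartbeat('ecs', 'RUNNING', runningItem));       // false
console.log(shouldCheckHeartbeat('agentcore', 'RUNNING', runningItem)); // true ('agentcore' is an assumed value)
console.log(shouldCheckHeartbeat(undefined, 'RUNNING', runningItem));   // true
```

Because the new parameter is optional and the comparison is `!== 'ecs'`, any caller that omits it keeps the existing heartbeat behavior.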

Node version

No response

Python version

No response

mise version

No response

AWS region

No response

Local modifications

No response

Additional context

No response

AWS quotas

  • I have reviewed the relevant service quotas (or N/A)

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
