feat(compute): add EC2 fleet compute strategy#32
feat(compute): add EC2 fleet compute strategy#32MichaelWalker-git wants to merge 11 commits intomainfrom
Conversation
Add a third compute backend (EC2 fleet with SSM Run Command) alongside the existing AgentCore and ECS strategies. This provides maximum flexibility with no image size limits, configurable instance types (including GPU), and full control over the compute environment. New files: - ec2-strategy.ts: ComputeStrategy implementation using EC2 tags for instance tracking and SSM RunShellScript for task dispatch - ec2-agent-fleet.ts: CDK construct with ASG, launch template, security group, S3 payload bucket, and IAM role - ec2-strategy.test.ts and ec2-agent-fleet.test.ts: full test coverage Wiring: - repo-config.ts: add 'ec2' to ComputeType, add instance_type field - compute-strategy.ts: add EC2 SessionHandle variant and resolver case - task-orchestrator.ts: add ec2Config prop with env vars and IAM grants - orchestrate-task.ts: enable compute polling for EC2 - cancel-task.ts: add SSM CancelCommand for EC2 tasks - task-api.ts: add ssm:CancelCommand permission for cancel Lambda - agent.ts: add commented-out EC2 fleet block (same pattern as ECS)
There was a problem hiding this comment.
Pull request overview
This PR adds a third compute backend (“ec2” fleet) alongside AgentCore and ECS, enabling task dispatch via SSM Run Command to tagged EC2 instances and adding the supporting CDK construct and runtime wiring.
Changes:
- Introduces
Ec2ComputeStrategy(S3 payload upload + EC2 instance selection/tagging + SSM dispatch/poll/cancel). - Adds
Ec2AgentFleetCDK construct (ASG + instance role/SG + payload bucket + user data) and optional stack wiring. - Extends config/types and handlers to support
compute_type: 'ec2'and EC2 cancellation/polling.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| yarn.lock | Locks new AWS SDK client dependencies used by EC2/SSM/S3 integration. |
| cdk/package.json | Adds @aws-sdk/client-ec2, client-ssm, client-s3 dependencies. |
| cdk/test/handlers/shared/strategies/ec2-strategy.test.ts | Unit tests for EC2 strategy start/poll/stop behavior. |
| cdk/test/handlers/shared/compute-strategy.test.ts | Ensures compute_type: ec2 resolves to Ec2ComputeStrategy. |
| cdk/test/constructs/ec2-agent-fleet.test.ts | Asserts key resources/permissions created by Ec2AgentFleet. |
| cdk/src/stacks/agent.ts | Adds commented wiring to enable EC2 fleet backend. |
| cdk/src/handlers/shared/strategies/ec2-strategy.ts | Implements EC2 fleet compute strategy via S3 + EC2 tags + SSM. |
| cdk/src/handlers/shared/repo-config.ts | Extends ComputeType to include ec2; adds instance_type. |
| cdk/src/handlers/shared/orchestrator.ts | Propagates instance_type into BlueprintConfig. |
| cdk/src/handlers/shared/compute-strategy.ts | Adds EC2 handle type + resolver case. |
| cdk/src/handlers/orchestrate-task.ts | Stores EC2 compute metadata and enables compute polling for EC2. |
| cdk/src/handlers/cancel-task.ts | Adds SSM CancelCommand support for EC2-backed tasks. |
| cdk/src/constructs/task-orchestrator.ts | Adds EC2 env vars and IAM policies for orchestrator Lambda. |
| cdk/src/constructs/task-api.ts | Adds ssm:CancelCommand permission option for cancel-task Lambda. |
| cdk/src/constructs/ec2-agent-fleet.ts | New ASG-based fleet construct with IAM/SG/S3/log group/user data. |
| cdk/src/constructs/blueprint.ts | Allows blueprint compute.type to be ec2. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Remove unnecessary iam:PassRole from orchestrator (EC2 strategy never passes a role to any API) - Simplify ec2FleetConfig in task-api to empty object (instanceRoleArn was unused) - Use CDK Tags.of() for ASG fleet tag propagation instead of no-op user-data tagging — instances are now tagged at launch - Fix missing AWS_REGION in boot script by deriving from IMDS - Eliminate shell injection risk by reading all task data from S3 payload at runtime instead of interpolating into bash exports - Add cleanup trap in boot script to always retag instance as idle on exit (success, error, or signal) - Add try/catch rollback in startSession to retag instance as idle when SSM dispatch fails - Generalize ECS-specific log messages in poll loop to be compute-backend-agnostic (uses strategy type label)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix malformed sed quoting in AWS_REGION derivation (ec2-strategy.ts) - Remove unused blueprintConfig destructuring (ec2-strategy.ts) - Scope EC2/SSM IAM permissions: condition ec2:CreateTags on fleet tag, scope ssm:SendCommand to fleet-tagged instances and AWS-RunShellScript document, separate DescribeInstances (requires resource '*')
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
1. TOCTOU race in instance selection: after tagging an instance as busy, re-describe to verify our task-id stuck. If another orchestrator won the race, try the next idle candidate instead of double-dispatching. 2. Heartbeat false-positive: EC2/ECS tasks invoke run_task() directly and may not send continuous heartbeats. Suppress sessionUnhealthy checks when compute-level crash detection (pollSession) is active, preventing premature task failure after ~6 minutes. 3. SSM Cancelling status: map to 'running' (transient) instead of 'failed' to avoid premature failure while cancel propagates. 4. Fix babel parse errors in test mocks (remove `: unknown` annotations from jest.mock factory callbacks).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1. Add rollback on verify failure: if DescribeInstances throws during the tag-then-verify claim, roll back the busy/task-id tags so the instance isn't stuck. 2. Use docker container prune instead of docker system prune in cleanup trap to preserve cached images and avoid re-pulling on next task. 3. Add ecr:BatchCheckLayerAvailability to instance role ECR permissions — required for docker pull from ECR. 4. InvocationDoesNotExist now rethrows instead of returning failed, letting the orchestrator's consecutiveComputePollFailures counter handle transient propagation delays (fails after 3 consecutive).
… container Two bugs prevented the EC2 compute strategy from working end-to-end: 1. Python sys.path used /app but the Docker image places modules at /app/src — fixed to sys.path.insert(0, "/app/src"). 2. GITHUB_TOKEN_SECRET_ARN was not passed to the Docker container, causing the agent to fail with "github_token is required" — now exported in the boot script and forwarded via docker run -e. Also enables the EC2 fleet construct in agent.ts with blueprints for krokoko/agent-plugins and aws-samples/sample-autonomous-cloud-coding-agents.
End-to-End EC2 Compute Strategy Test ResultsDeployed the stack with EC2 fleet enabled and ran a live test task against Bugs Found and Fixed
Test Task:
|
|
@theagenticguy could you please have a look, I know you have a good understanding of what the devbox should look like |
|
Output from automated review: Good points: Positives
To address:
File: cdk/src/handlers/shared/strategies/ec2-strategy.ts (boot script construction) The value blueprintConfig.github_token_secret_arn is interpolated directly into single-quoted shell context: Fix: Pass githubTokenSecretArn through the S3 payload JSON instead of interpolating into shell, or validate the value to reject shell-unsafe characters.
File: cdk/src/constructs/ec2-agent-fleet.ts:174-182 The IAM policy grants ec2:CreateTags/ec2:DeleteTags with resources: ['*'] conditioned only on ec2:ResourceTag/bgagent:fleet. A compromised instance could:
Fix: Add aws:TagKeys condition to restrict modifiable tag keys to only bgagent:status and bgagent:task-id, preventing modification of bgagent:fleet itself.
File: cdk/src/handlers/shared/strategies/ec2-strategy.ts (startSession) The tag-then-verify pattern for claiming idle instances has a TOCTOU gap. Two concurrent orchestrators could both tag the same instance within the EC2 API propagation window, both Fix: Consider using a DynamoDB conditional write as an atomic claim mechanism, or document this as a known limitation with mitigations (e.g., boot script checks for existing High Severity Issues
File: cdk/src/handlers/shared/strategies/ec2-strategy.ts (stopSession, final try/catch) } catch { Fix: Log at warn level with the error message.
File: cdk/src/handlers/shared/strategies/ec2-strategy.ts (startSession, two rollback catches) Both SSM dispatch rollback and verify-claim rollback log "Failed to rollback..." but don't include the error object, making root cause diagnosis impossible. Fix: Capture and include error: rollbackErr instanceof Error ? rollbackErr.message : String(rollbackErr) in both log calls.
File: cdk/src/handlers/shared/strategies/ec2-strategy.ts (pollSession default case) If AWS adds a new terminal SSM status, the system reports the command as "running" indefinitely -- the task polls for up to 8.5 hours before timing out with no indication of the Fix: Add a logger.warn in the default case logging the unknown status value.
File: cdk/src/handlers/shared/strategies/ec2-strategy.ts (bootScript) The EC2 boot script sets sys.path.insert(0, "/app/src") but the ECS strategy uses sys.path.insert(0, "/app"). If the Docker image has entrypoint.py at /app/entrypoint.py (matching Fix: Verify the correct path in the agent Dockerfile and make both strategies consistent. Medium Severity Issues
repo-config.ts and orchestrator.ts add instance_type to BlueprintConfig, but Ec2ComputeStrategy.startSession never reads it. The ASG uses a hardcoded m7g.xlarge default. This is dead
The construct accepts memoryId?: string but never references it in the constructor body (unlike EcsAgentCluster which passes it as a container env var).
In task-api.ts, this type carries no data -- it's just a truthy flag. Either use boolean for clarity or pass fleet identifiers for IAM scoping.
In task-api.ts, the cancel Lambda gets blanket SSM permissions with no comment explaining the API limitation (SSM doesn't support resource-level restrictions for CancelCommand). Add a
The AwsSolutions-IAM5 reason string in task-orchestrator.ts still only mentions ECS and iam:PassRole -- doesn't cover the new EC2, SSM, or S3 statements.
If docker pull or ECR login fails, the instance never gets tagged as idle, sits in the ASG as "healthy" but unreachable for tasks. No error reporting mechanism exists. Fix: Add a trap handler that tags the instance as bgagent:status=boot-failed on ERR Test coverage gaps: Test Coverage Gaps ┌──────────┬────────────────────────────────────────────────────────────────────────────────────────┬───────────────────────────────────────────────┐ |
Summary
Ec2ComputeStrategyhandler: finds idle instances via tags, uploads payload to S3, dispatches via SSMAWS-RunShellScript, pollsGetCommandInvocation, cancels withCancelCommandEc2AgentFleetCDK construct: Auto Scaling Group with launch template (AL2023 ARM64), security group (443 egress only), S3 payload bucket, IAM role with scoped permissions, Docker user data for pre-pulling imagescompute_type: 'ec2'instance_typefield toRepoConfigandBlueprintConfigfor future GPU/custom instance type supportTest plan
mise //cdk:compile— no TypeScript errorsmise //cdk:test— 43 suites, 697 tests all passing (including new ec2-strategy and ec2-agent-fleet tests)mise //cdk:synth— synthesizes without errors (EC2 block commented out)mise //cdk:build— full build including lint passescompute_type: 'ec2'