Skip to content

feat(ops): automated ephemeral stack cleanup script#109

Open
scottschreckengaust wants to merge 6 commits into
mainfrom
feat/cleanup-ephemeral-stacks
Open

feat(ops): automated ephemeral stack cleanup script#109
scottschreckengaust wants to merge 6 commits into
mainfrom
feat/cleanup-ephemeral-stacks

Conversation

@scottschreckengaust

@scottschreckengaust scottschreckengaust commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds scripts/cleanup-ephemeral-stacks.sh — a shell script that identifies and deletes orphaned ABCA ephemeral CloudFormation stacks
  • Handles stuck ENI cleanup (Lambda/AgentCore Hyperplane ENIs) before stack deletion to prevent DELETE_FAILED states
  • Respects termination protection and skips stacks in active transitions
  • Supports --dry-run, --max-age-hours, --prefix filter, and --force-eni options

Status

Script only — this is the operational cleanup tool from Issue #72. Still needed for full issue completion:

  • EventBridge scheduled rule (Lambda or Step Functions wrapper) for automated recurring execution
  • CloudWatch logging/metrics for audit trail
  • CDK construct or standalone stack to deploy the scheduler
  • Integration tests / dry-run CI validation

The script itself is complete and production-ready for manual use today.

Test plan

  • Run with --dry-run against an account with ephemeral stacks
  • Verify termination-protected stacks are never touched
  • Verify ENI cleanup works for stuck VPC security groups
  • Verify --max-age-hours filtering is correct

Part of #72 — this PR delivers the manual cleanup script foundation only. Full issue completion (EventBridge schedule, CloudWatch audit, CDK construct, tests) lands in a future PR.

🤖 Generated with Claude Code

Shell script that identifies and deletes orphaned ABCA ephemeral
CloudFormation stacks. Handles stuck ENI cleanup (Lambda/AgentCore
Hyperplane ENIs) before stack deletion, respects termination protection,
and supports dry-run mode.

Closes #72

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scottschreckengaust scottschreckengaust marked this pull request as ready for review June 5, 2026 22:35
@scottschreckengaust scottschreckengaust requested a review from a team as a code owner June 5, 2026 22:35
Security/robustness review of scripts/cleanup-ephemeral-stacks.sh (#72):

- Fail CLOSED on unparseable CreationTime. Previously a parse failure fell
  back to epoch 0, making every matching stack look ~billions of seconds old
  and eligible for deletion — the age gate failed open. Now it SKIPs.
- Validate --max-age-hours is a non-negative integer before arithmetic
  (rejects injected/garbage input).
- Print account + caller ARN (sts:GetCallerIdentity) before any action so the
  operator can confirm blast radius; hard-fail if identity can't be resolved.
- Tolerate a single delete-stack failure instead of aborting the whole loop
  under set -e (would otherwise orphan later stacks); track and report a
  Failed count, and only increment Deleted on a delete actually initiated.
- Remove dead --force-eni flag (parsed but never used; shellcheck SC2034).
- Annotate the JMESPath --query backticks as intentional (shellcheck SC2016).

shellcheck: clean (exit 0). semgrep --config=auto: 0 findings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@scottschreckengaust

Copy link
Copy Markdown
Contributor Author

Revived this branch: rebased onto latest main (kept your "Update branch" merge 971b1da) and hardened the script for defense in depth (commit baaaec0).

Security/robustness review — ran shellcheck (exit 0, clean) and semgrep --config=auto (0 findings) on the result. Manual review found and fixed:

  • Fail-open age gate (most important): a parse failure on CreationTime previously fell back to epoch 0, making every matching stack look billions of seconds old and eligible for deletion. Now fails closed — unparseable time → SKIP.
  • Input validation: --max-age-hours is validated as a non-negative integer before arithmetic (rejects garbage/injected values).
  • Blast-radius visibility: prints account ID + caller ARN (sts:GetCallerIdentity) before any mutation; hard-fails if identity can't be resolved.
  • No mid-run abort: a single delete-stack failure under set -e would abort the loop and orphan later stacks — now tolerated, counted, and reported (Failed: line).
  • Removed dead --force-eni flag (SC2034); annotated JMESPath --query backticks (SC2016 false positive).

Scope unchanged: still the manual script only (Part of #72). EventBridge schedule, CloudWatch audit, CDK construct, and tests are deferred to a future PR.

@scottschreckengaust

Copy link
Copy Markdown
Contributor Author

Follow-up filed: #278 — add shellcheck to the toolchain (mise + prek hooks + CI). This PR added the first destructive shell script and surfaced that shell is the only language surface with no static-analysis gate. Out of scope for this PR (toolchain/CI change, separate governance per ADR-003); tracked separately.

scottschreckengaust and others added 3 commits June 8, 2026 05:57
Capture root-cause findings and the validated redesign for
cleanup-ephemeral-stacks.sh: AgentCore Hyperplane (ela-attach) ENIs
cannot be force-detached/deleted and are reclaimed asynchronously by
AWS. Replace the impossible force-clean block with a read-only ENI
gate on the DELETE_FAILED retry path; rely on cron cadence for retry.

Validated end-to-end against a live DELETE_FAILED stack (scoschre).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@isadeks

isadeks commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Structural

This PR ships the script alongside a design doc (docs/superpowers/specs/2026-06-08-ephemeral-stack-cleanup-eni-aware-design.md) that explicitly declares the script's "force-detach → sleep 15delete-stack" model "structurally broken" for stacks containing a Bedrock AgentCore Runtime. The doc's evidence is solid (ela-attach ENIs reject --force; --dry-run validates IAM only). So we'd be merging code already documented as wrong.

Two clean paths:

  1. Split. Drop the script from this PR; land the design doc (or fold into feat: automated ephemeral stack cleanup (scheduled) #72), then open a follow-up implementing the gated-retry / Pending-exit-0 design. My preference.
  2. Implement the redesign now. Replace lines 146–197 with the read-only ENI gate, add Pending branch, ship together.

Either way, ordering is wrong: delete-stack first, then observe what's stuck, then act on confirmed orphans — not act first, then delete. And the pre-clean is redundant for non-Hyperplane ENIs too, since CloudFormation's own teardown handles regular Lambda VPC ENIs correctly.

Concrete bugs (fix in either path)

  1. Silent "no stacks" on API failure. Lines 66-77 — list-stacks is wrapped in 2>/dev/null. If it fails (auth, throttle, IAM), STACKS is empty and the script prints "No stacks found." and exits 0. Operator gets the same output for "nothing to clean" and "we don't have permission to ask." Check the exit code separately from emptiness.

  2. BSD date loses timezone. Line 122 — ${CREATION_TIME%%.*} strips both the fractional seconds and the trailing Z. BSD date -j then parses as local time. On a Mac in PST, AGE_SECONDS is off by 8h, so a 48h-old stack reads as 40h and gets skipped. GNU date -d (Linux/CI) handles Z correctly, so this is local-Mac-only. Use date -j -u -f or keep the timezone in the input.

  3. delete-stack failure reason swallowed. Line 206 — 2>/dev/null discards the actual API error. The script then prints a generic "ERROR: delete-stack failed (continuing)" but you can't tell AccessDenied from ValidationError. This is the one place you definitely want stderr — drop 2>/dev/null here.

  4. N+1 API calls per stack. list-stacks returns enough info to pre-filter, but the script then issues describe-stacks per stack (line 90) and list-stack-resources per stack (line 148). For an account with hundreds of stacks (the case that motivates the script), that's hundreds of round-trips. Tag-based filtering on list-stacks plus a batched describe-stacks would be O(1) at API level.

  5. MAX_AGE_HOURS default mismatch. Header comment says default: 4, code is :-48. Pick one.

  6. Argument parsing edge case. --max-age-hours/--prefix shift 2 without verifying $2 exists. With set -u, calling ./script --max-age-hours errors as "$2: unbound variable" instead of a useful message.

  7. No --region / --profile flag though the header docs AWS_PROFILE=abca. Drift between docs and code.

  8. Account ID leak. Spec doc names 46********** and us-east-1 as the example stuck-stack environment. Replace with "the shared dev account."

  9. Description-equality filter is fragile. [[ "$DESCRIPTION" != "ABCA Development Stack" ]] (line 101) silently disqualifies stacks if the description is human-edited or renamed. Tag-based filtering would survive both. Not a blocker but a TODO.

  10. No tests. The spec doc itself recommends extracting classify_stack(status, eni_count) → action as a pure function so the load-bearing branch (DELETE_FAILED + 0 ENIs → retry; DELETE_FAILED + N ENIs → Pending) is unit-testable without AWS. Worth doing before this becomes an EventBridge schedule.

  11. feat: automated ephemeral stack cleanup (scheduled) #72 governance. Spec doc flags feat: automated ephemeral stack cleanup (scheduled) #72 as "not yet approved"; per AGENTS.md the rule is approve-first. Worth pinging an admin to label.

What the script gets right (don't lose these in a rewrite)

  • set -euo pipefail — proper strict mode.
  • sts:GetCallerIdentity blast-radius print before any mutation.
  • Numeric validation on --max-age-hours (line 37).
  • BSD/GNU date fallback chain with fail-closed parse (line 123 — modulo the timezone bug above).
  • *IN_PROGRESS* skip is narrow enough that DELETE_FAILED is not matched — the spec doc identifies this distinction as load-bearing for retry. Pin with a regression test in the next pass.
  • Counter increments inside <<< (current shell), not a pipe, so counters actually persist.
  • ((counter++)) || true survives starting-from-zero (worth a one-line comment so a future "simplification" doesn't break it).
  • Termination-protection check is unconditional and early.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants