Use git ls-files in upload-ct-artifacts.sh to avoid phantom deletions #48

Closed
ntarakad-aws wants to merge 1 commit into aws-samples:atx-remote-infra from ntarakad-aws:fix-upload-respect-gitignore

@ntarakad-aws

Problem

The current zip exclusion list:

zip -qry /tmp/code.zip . -x '.env*' -x '*.pem' -x '*.key' -x 'node_modules/*' -x '.aws/*'

filters by pattern regardless of whether files are tracked in git. When tracked files match a pattern, customers extracting code.zip see them as "deleted" in git status:

ntarakad@host extracted % git status
On branch atx-result-staging-20260605_231047_debbf958
Changes not staged for commit:
  deleted:    .envrc
  deleted:    modules/openapi-generator/src/main/resources/rust-server/example-ca.pem
  deleted:    modules/openapi-generator/src/main/resources/rust-server/example-server-chain.pem
  deleted:    samples/server/petstore/rust-server/output/multipart-v3/examples/ca.pem
  ...

Real example: openapi-generator tracks test certificates (.pem files for the rust-server template). After running tech-debt-comprehensive on it, the customer's extracted artifact shows ~50 .pem files as deleted, even though the analysis bot didn't touch them. The exclusion list is removing them from the working tree, but .git/ (correctly preserved for diff review) still references them.
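
One way to predict which tracked files the old exclusion list would drop is to ask git for tracked paths that match the same patterns. The sketch below is an illustration, not part of upload-ct-artifacts.sh: the function name list_phantom_candidates is hypothetical, and it assumes git pathspec globs behave closely enough to zip's -x patterns for this purpose.

```shell
# Hypothetical helper: list tracked files that match the old zip exclusion
# patterns. These are the files that would show up as "deleted" after extract.
list_phantom_candidates() {
  # Pathspecs are resolved relative to the current directory, so run this
  # from the repository root. In a pathspec, '*' also matches '/'.
  git -c core.quotePath=false ls-files -- \
    '.env*' '*.pem' '*.key' 'node_modules/*' '.aws/*'
}
```

An empty result suggests the repo was never affected; any path printed is a candidate phantom deletion under the old upload.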

Fix

Use git ls-files as the source of truth for what to zip:

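# Tracked files (including submodule contents) plus untracked, non-gitignored files.
{ git ls-files --recurse-submodules; \
  git ls-files --others --exclude-standard; } | sort -u > /tmp/code-files.txt
# -y stores symlinks as symlinks, as the old -qry invocation did.
zip -qy /tmp/code.zip -@ < /tmp/code-files.txt
# Add .git/ separately so history stays available for review.
zip -qry /tmp/code.zip .git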

This:

  • Respects the repo's .gitignore: node_modules/, build dirs, .env (if gitignored), etc. are excluded automatically
  • Includes the analysis bot's auto-committed output (e.g., ATXDocumentation/ on the result branch) since git ls-files lists the index, which matches HEAD once the bot has committed
  • Includes tracked files that happen to match the old patterns (test certs, .envrc) without phantom deletes
  • Preserves .git/ for git log / git diff review
  • Falls back to conservative pattern-based exclusion if the repo isn't a git working tree (defensive)
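
The selection logic described in the list above, including the non-git fallback, could look roughly like the following. This is a sketch under assumptions, not the script's actual code: list_upload_files is a hypothetical name, the fallback uses find with -path tests as a stand-in for the old zip -x patterns, and LC_ALL=C is added only to make the sort order predictable.

```shell
# Hypothetical helper: print the paths that should go into code.zip,
# one per line, relative to the current directory.
list_upload_files() {
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    # Tracked files plus untracked files that .gitignore does not exclude.
    { git -c core.quotePath=false ls-files --recurse-submodules
      git -c core.quotePath=false ls-files --others --exclude-standard
    } | LC_ALL=C sort -u
  else
    # Not a git working tree: approximate the old pattern-based exclusions.
    find . -type f \
      ! -path './.env*' ! -path '*.pem' ! -path '*.key' \
      ! -path './node_modules/*' ! -path './.aws/*' \
      | sed 's|^\./||' | LC_ALL=C sort
  fi
}
```

The output can be fed to zip via -@, as in the snippet above. Paths containing newlines are not handled by this line-based approach.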

Behavior comparison

| File | In git? | Old behavior | New behavior |
|---|---|---|---|
| node_modules/ | gitignored | Excluded (pattern) | Excluded (gitignore) |
| .env | gitignored | Excluded (pattern) | Excluded (gitignore) |
| Test .pem certs | tracked | Excluded → phantom delete | Included (no phantom delete) |
| Tracked .envrc | tracked | Excluded → phantom delete | Included (no phantom delete) |
| Real secrets accidentally tracked | tracked | Excluded but .git/ still has them | Included (already in .git/ anyway; exclusion never hid them) |
| Bot output (ATXDocumentation/) | newly committed on result branch | Included | Included |

Note on secret exclusion

The old exclusion list gave a false sense of security: if a real secret was committed and tracked, the file was excluded from the working tree, but .git/objects/ (which is in the zip) still contained it. The exclusion never actually hid secrets; it only created phantom deletes. The new approach is honest: whatever is in git history is in the artifact. Secret hygiene has to happen in the repository itself, by keeping secrets out of commits or by rewriting history (e.g., git filter-branch, BFG), not at upload time.
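
To make the point concrete: once a file has been committed, its content can be read back from .git/ even when the working-tree copy is missing. A minimal sketch, where read_from_history is a hypothetical helper name and the path argument is whatever tracked file you want to inspect:

```shell
# Hypothetical helper: print a tracked file's committed content directly from
# the object store, whether or not the file exists in the working tree.
read_from_history() {
  git show "HEAD:$1"
}
```

Running read_from_history on one of the "deleted" paths inside an artifact produced by the old upload would print the file's content, which is why excluding it from the zip hid nothing.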

Testing

Local test on a repo with tracked .env / .pem files:

  • Before: git status after extract shows tens of "deleted" files
  • After: git status after extract shows clean working tree (or only the bot's actual changes)
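
The before/after check can be reduced to a single number. The following is a sketch, not part of the PR: count_missing_tracked_files is a hypothetical name, and it relies on git ls-files --deleted, which lists tracked files that are absent from the working tree.

```shell
# Hypothetical helper: count tracked files that are missing from the working
# tree. Run inside the extracted directory; 0 means no phantom deletions.
count_missing_tracked_files() {
  git ls-files --deleted | wc -l | tr -d ' '
}
```

A non-zero count on a freshly extracted artifact means the upload dropped tracked files.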

Customer-side workflow now works as expected:

unzip code.zip -d ./extracted/
cd ./extracted
git log atx-result-staging-...    # see what the bot committed
git diff main..HEAD               # review the bot's changes

The previous exclusion list (-x '.env*' -x '*.pem' -x '*.key' -x 'node_modules/*'
-x '.aws/*') filters out files by pattern regardless of whether they are tracked
in git. This creates a confusing experience for customers when those patterns
match TRACKED files: extracting the resulting code.zip and running 'git status'
shows them as deleted (because .git/ is preserved but the working tree files
were excluded by the upload).

Example: openapi-generator tracks test certificates (.pem files for the rust-server
template). After analysis, the customer's extracted artifact shows ~50 .pem files
as 'deleted' in git status, even though the analysis bot didn't touch them.

Fix: use 'git ls-files' (tracked files + non-gitignored new files) as the source
of truth for what to include. This:
  - Respects the repo's .gitignore (node_modules, build dirs, .env if gitignored,
    etc. continue to be excluded automatically)
  - Includes the analysis bot's auto-committed output (e.g., ATXDocumentation/
    on the result branch) since git ls-files reflects HEAD's contents
  - Includes tracked files that happen to match the old patterns (test certs,
    .envrc) without phantom deletes
  - Preserves .git/ for git log / git diff review
  - Falls back to conservative pattern-based exclusion if the repo is not a git
    working tree (defensive — shouldn't occur post-clone, but keeps the script
    safe for edge cases)