[Klaud Cold] Add /nuke command for bumping engine image tags #1625
Adds .claude/commands/nuke.md: a slash command that bumps single-node vLLM or SGLang container image tags across recipes, opening one [Klaud Cold] PR per model+precision+SKU family (mtp folded in), with full-sweep-enabled labels, docker-tag existence verification, and scope confirmation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| url=$(gh pr create --repo SemiAnalysisAI/InferenceX --base main --head "$branch" \ | ||
| --title "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \ | ||
| --body "<BODY>" --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*') | ||
| # patch the changelog pr-link with the real URL, then amend + force-push | ||
| python3 - perf-changelog.yaml "$url" <<'PY' | ||
| import sys; f,u=sys.argv[1],sys.argv[2] | ||
| open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER",u,1)) | ||
| PY | ||
| git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease |
🔴 If gh pr create on line 119 fails (auth, network, label permission, branch already has a PR, etc.) the surrounding pipeline still 'succeeds' because there is no set -o pipefail — only grep's exit status propagates and $url ends up empty. The python heredoc on lines 123–126 then runs replace("PRLINK_PLACEHOLDER", "", 1), the amend+force-push on line 127 publishes a perf-changelog.yaml with a bare pr-link: (no value), and you're left with an orphaned remote branch carrying broken changelog data and no PR. Add set -euo pipefail to the snippet, or guard with [ -n "$url" ] || exit 1 immediately after the assignment.
Extended reasoning...
What the bug is. The pipeline

url=$(gh pr create … --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*')

uses no set -e or set -o pipefail (none are set anywhere in this command document). In a default bash pipeline, the exit status of $(…) is the exit status of the last command (grep), so gh pr create's exit code is discarded. When gh pr create writes its error to stderr and emits no URL on stdout, grep -o matches nothing, exits 1, and the surrounding bash sees $url="" with no error indication. The script continues.

How it manifests. Several realistic ways gh pr create fails in this exact flow:
- a PR already exists for $branch (re-runs / partial prior attempts);
- the full-sweep-enabled label doesn't exist on the repo or the token lacks permission to apply labels;
- transient network / auth error against api.github.com;
- title/body validation errors.

In each case stderr carries the diagnostic, stdout is empty, grep -o produces no match.

What gets committed. Lines 123–126 unconditionally call

open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER", u, 1))

with u="". The placeholder is replaced with the empty string, yielding a YAML entry like:

  pr-link:

That is valid YAML (pr-link is null) but semantically broken: there is no PR to link to. Line 127 then runs git commit --amend --no-edit && git push -q --force-with-lease, force-pushing the corrupted changelog onto the branch that was already pushed by line 118. The remote branch persists with no associated PR and a junk changelog entry.

Step-by-step proof.
1. /nuke vllm v0.22.0 kimik2.5 is invoked on a branch where a previous (failed) iteration already opened a PR for klaud-cold/kimik2.5-int4-b300-vllm-v0.22.0.
2. git push -u origin "$branch" -q --force-with-lease succeeds (line 118); the branch already exists upstream.
3. gh pr create … --head "$branch" (line 119) fails with a pull request for branch "klaud-cold/..." into branch "main" already exists on stderr. stdout is empty. gh's exit code is 1.
4. The pipe feeds nothing to grep -o 'https://github.com/[^ ]*'. grep exits 1. Because $(…)'s status is grep's status, and no set -e/pipefail is in effect, bash simply assigns url="" and continues.
5. The python heredoc (lines 123–126) replaces PRLINK_PLACEHOLDER with "" in perf-changelog.yaml; the entry becomes pr-link: with an empty value.
6. git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease (line 127) silently rewrites the existing branch with the broken changelog. No error surfaces. The for-each-family loop moves on to the next recipe.
7. Step 5 of the procedure (return to clean main + report PR table) will be missing a row for this family, but in a multi-recipe sweep that's easy to overlook, and the orphaned remote branch with junk changelog data has already been published.

Why existing code doesn't prevent it. Nothing in the snippet (or elsewhere in the file) sets pipefail or errexit. There is no [ -n "$url" ] guard before the amend. The python heredoc happily accepts an empty replacement string. git commit --amend will succeed even if the only staged change leaves the YAML semantically meaningless. --force-with-lease only protects against concurrent upstream changes, not against pushing self-inflicted garbage.

Fix. Two equivalent one-liners:

set -euo pipefail # at the top of the per-family block

or, more targeted:

url=$(gh pr create … | grep -o 'https://github.com/[^ ]*') || true
[ -n "$url" ] || { echo "gh pr create failed for $branch" >&2; exit 1; }

Either makes the failure loud and stops the changelog amend from running with an empty URL.
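The guarded capture can be sketched so that it runs without network access. This is only an illustration under stated assumptions: fake_pr_create is a hypothetical stand-in for gh pr create that mimics the "PR already exists" failure, and capture_pr_url is an illustrative helper name that does not appear in nuke.md.

```bash
#!/usr/bin/env bash
set -uo pipefail   # pipefail: a failing producer now fails the whole pipeline

# Hypothetical stand-in for `gh pr create`: error on stderr, nothing on stdout.
fake_pr_create() {
  echo 'a pull request for branch "klaud-cold/example" already exists' >&2
  return 1
}

# Run the given command, extract a GitHub URL from its stdout, and fail
# (printing nothing) if the command failed or no URL was emitted.
capture_pr_url() {
  local url
  url=$("$@" | grep -o 'https://github.com/[^ ]*') || return 1
  [ -n "$url" ] || return 1
  printf '%s\n' "$url"
}

changelog=$(mktemp)
echo 'pr-link: PRLINK_PLACEHOLDER' > "$changelog"

if url=$(capture_pr_url fake_pr_create 2>/dev/null); then
  printf 'pr-link: %s\n' "$url" > "$changelog"   # only patch with a real URL
  status=patched
else
  status=aborted   # placeholder left untouched; nothing to amend or push
fi
echo "$status: $(cat "$changelog")"
```

Because the helper returns non-zero on any failure, the caller decides explicitly whether to stop, which avoids relying on set -e semantics inside command substitutions.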
| ```bash | ||
| git checkout main -q && git reset --hard origin/main -q | ||
| branch="klaud-cold/<basekey>-<TAG>" | ||
| git checkout -b "$branch" -q | ||
| python3 /tmp/edit_image.py <master.yaml> <NEW_IMAGE> <key> [<key>-mtp] | ||
| python3 /tmp/append_changelog.py perf-changelog.yaml "<DESC>" <key> [<key>-mtp] | ||
| git add -A | ||
| git commit -q -m "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \ | ||
| -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>" | ||
| git push -u origin "$branch" -q --force-with-lease |
🔴 Step 4 starts each family iteration with git checkout main -q && git reset --hard origin/main -q (line 110), and Step 5 finalizes with the same reset on main (line 151), both running in the user's primary worktree. Any uncommitted changes, staged work, or unpushed local commits on main are silently destroyed — and the loop multiplies the blast radius across N PRs. The sibling .claude/commands/fix-klaud-cron-prs.md deliberately uses git worktree add for exactly this reason ("Use a worktree so the loop never disturbs the user's working tree"); /nuke should follow the same pattern, or at minimum gate the reset behind a clean-tree check.
Extended reasoning...
What the bug is
The new /nuke slash command directs the executing agent to run destructive git reset --hard against the user's primary worktree in two places:
- Step 4, line 110 (inside the per-family loop):
git checkout main -q && git reset --hard origin/main -q
- Step 5, line 151 (finalization):
Return to a clean main (git checkout main && git reset --hard origin/main).
Neither location checks whether the working tree is clean, neither stashes, and neither uses a temp worktree. Any uncommitted edits, staged changes, untracked-but-tracked-path overwrites, or unpushed local commits on main are wiped silently with no recovery path other than the reflog.
Why existing code doesn't prevent it
There is no clean-tree precondition in the command's body. The Step 1–3 preamble is purely discovery/confirmation about recipes, not about the user's git state. The very first action of the actual mutating phase (Step 4) is the destructive reset. The N-iteration loop then repeats the reset before each new family, so even if the first iteration happened to be safe, intermediate state created mid-sweep is also clobbered.
Why this is inconsistent with the project's own pattern
The sibling command .claude/commands/fix-klaud-cron-prs.md solves the identical "loop that needs a fresh main per iteration" problem with a temp worktree, with an explicit author comment:
Use a worktree so the loop never disturbs the user's working tree.
/nuke knowingly or unknowingly deviates from this established safe pattern in the same directory.
Impact
- Severity: silent destruction of uncommitted/unpushed work, the canonical "destructive git footgun."
- Blast radius: amplified by the loop. A single /nuke vllm v0.22.0 over the v0.22.0 sweep created ~30 PRs (#1595–#1624); 30 iterations means 30 unconditional resets against the user's primary worktree.
- Detectability: none at the time of damage; git reset --hard is silent on success. Recovery requires the user to know to consult git reflog and act before GC.
- This also violates the global tool-use guidance in this very environment: "NEVER run destructive git commands (push --force, reset --hard, …) unless the user explicitly requests these actions." Encoding reset --hard into a slash command effectively bypasses that guardrail by making the destructive action implicit in invoking the command.
Step-by-step proof
1. User has uncommitted edits on main (say, a half-written CLAUDE.md change), and runs /nuke vllm v0.22.0.
2. Agent reaches Step 4, iteration 1, line 110. Executes git checkout main -q && git reset --hard origin/main -q.
3. git checkout main succeeds (uncommitted edits to tracked files are not blocked because checking out the current branch is a no-op). git reset --hard origin/main then unconditionally rewrites the working tree to match the remote tip.
4. The user's uncommitted edits to CLAUDE.md (and any other tracked file) are now gone: no prompt, no warning, no backup. Untracked files survive; everything else is overwritten.
5. The loop continues for the remaining N−1 families. Even if the user had pushed a local-only commit between iterations (e.g. a different branch they happened to be on before invoking), step 4's repeated reset on main clobbers any intermediate main state the agent may have been mid-creating, too.
6. Step 5 fires git reset --hard origin/main once more for good measure.
How to fix it
Two acceptable fixes, in order of preference:
- Use a temp worktree, matching the sibling command. Before Step 4:
wt=$(mktemp -d)
git worktree add "$wt" main -q
cd "$wt"
…and at the end of Step 5, git worktree remove --force "$wt". This is exactly the pattern fix-klaud-cron-prs.md uses (lines 47–54).
- Minimum acceptable mitigation: insert a clean-tree precondition before Step 4 starts, e.g.:
git diff --quiet && git diff --cached --quiet || { echo "working tree dirty: aborting /nuke"; exit 1; }
git fetch origin && [ "$(git rev-parse main)" = "$(git rev-parse origin/main)" ] || { echo "local main diverges from origin/main: aborting"; exit 1; }
This still puts the user on main rather than a worktree, but at least guarantees nothing is silently destroyed.
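The worktree approach can be tried out in a throwaway repository. The sketch below rests on several assumptions: local main stands in for origin/main because the demo has no remote, the branch name klaud-cold/example and the file notes.txt are made up, and --detach is used because git refuses to check out a branch that is already checked out in another worktree.

```bash
#!/usr/bin/env bash
set -uo pipefail

# Throwaway primary repository with one commit on main.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
echo 'v1' > notes.txt
git add notes.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'initial'

# Simulate the user's uncommitted work in the primary worktree.
echo 'half-written edit' >> notes.txt

# Do the destructive work in a separate, detached worktree instead.
wt="$(mktemp -d)/nuke-wt"
git worktree add --detach "$wt" main >/dev/null 2>&1 || exit 1
(
  cd "$wt" || exit 1
  git reset -q --hard main                 # only touches the temp worktree
  git checkout -q -b klaud-cold/example    # per-family branch would go here
)
git worktree remove --force "$wt"

# The primary worktree's uncommitted edit is still present.
survived=$(tail -n 1 notes.txt)
echo "primary worktree last line: $survived"
```

The branch created inside the temporary worktree remains in the repository after the worktree is removed, so it can still be pushed from there before cleanup.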
| ```bash | ||
| git checkout main -q && git reset --hard origin/main -q | ||
| branch="klaud-cold/<basekey>-<TAG>" | ||
| git checkout -b "$branch" -q |
🟡 Step 4 (line 110) runs git reset --hard origin/main -q at the start of each family iteration without a preceding git fetch, so origin/main is whatever was cached at last fetch. On a long-lived shell or after recent unrelated merges, every PR branch gets forked from a stale base, and the perf-changelog.yaml block is appended on top of an outdated snapshot — leading to avoidable rebase/merge churn. One-line fix: prepend git fetch origin main -q before the reset (or fetch once at the top of the command).
Extended reasoning...
What the bug is
Step 4 documents the per-family loop as:
git checkout main -q && git reset --hard origin/main -q
branch="klaud-cold/<basekey>-<TAG>"
git checkout -b "$branch" -q

origin/main here is the local ref refs/remotes/origin/main, which only updates when git fetch is run. Nothing in the command (neither this loop nor any preceding step) runs git fetch origin. So the base every PR branch is created from is whatever the local cache happened to be at the last fetch.
Why the staleness matters
The refutation correctly notes that /nuke itself does not merge PRs (they go out with full-sweep-enabled and need a long CI sweep before landing), so the worst case isn't a self-race within one sweep. But two realistic cold-start scenarios remain:
- Long-running shell / Claude Code session. If the user has had the repo checked out for hours/days without fetching, origin/main lags behind real main. Every family iteration resets to that stale ref.
- Concurrent unrelated merges. Other PRs (manual fixes, sibling Klaud Cold work, etc.) land on main during or just before the sweep. Each new family branch misses them and the changelog append is computed against an obsolete file.
Concrete proof
Assume perf-changelog.yaml on real origin/main has been updated by PR #1626 (lands at T=0) to include a new entry block. The user starts /nuke at T+30s on a shell that last fetched at T-10min, so the local refs/remotes/origin/main still points at the pre-#1626 commit.
- Iteration 1, family kimik2.5-int4-b300-vllm: git reset --hard origin/main checks out the pre-#1626 changelog. append_changelog.py appends the new block at the end of that older file.
- Iteration 2, family kimik2.5-int4-b200-vllm: same thing, with the same stale base and the same append point.
- When these PRs are reviewed/merged, GitHub compares each branch's perf-changelog.yaml against the real current main, which already has #1626's block. The trailing region of the file (where appends happen) is now divergent, producing textual merge conflicts that require manual rebase or GitHub auto-merge intervention.
The vendor image edits don't hit this because they're scoped to unique config keys (kimik2.5-int4-b300-vllm: only appears once), but perf-changelog.yaml is appended sequentially and is the canonical collision point.
Why existing code doesn't prevent it
- --force-with-lease protects the feature branch from being clobbered, but main is never force-pushed; the issue is the parent commit the feature branch is forked from.
- Step 1 discovery reads master YAMLs from the working tree, so it doesn't implicitly fetch.
- Step 5's closing git reset --hard origin/main runs after all PRs are already created, too late to help.
- The sibling command .claude/commands/fix-klaud-cron-prs.md:50 explicitly does git fetch origin "$BRANCH" before branch operations, showing this repo's expected hygiene.
Fix
One line — either prepend git fetch origin main -q immediately before the git reset --hard origin/main -q in Step 4, or fetch once at the top of the command (cheapest, since most iterations would reuse the same fetched ref).
Severity
Nit. The blast radius is bounded to merge-time rebase churn on perf-changelog.yaml, not silent data loss or broken recipes; image edits are per-unique-key. But the fix is trivial and the documentation is the contract an LLM follows verbatim, so worth tightening.
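The stale-ref behaviour is easy to reproduce locally. The following sketch builds a bare "remote" and two clones in a temp directory; all repository names, the file name, and the commit contents are invented for the demo, and it assumes only that git is installed.

```bash
#!/usr/bin/env bash
set -uo pipefail
ident=(-c user.name=demo -c user.email=demo@example.com)
root=$(mktemp -d)

# Seed repository with one commit on main, then publish it as a bare remote.
git init -q "$root/seed"
git -C "$root/seed" symbolic-ref HEAD refs/heads/main
echo 'entry 1' > "$root/seed/perf-changelog.yaml"
git -C "$root/seed" add perf-changelog.yaml
git -C "$root/seed" "${ident[@]}" commit -q -m 'first entry'
git clone -q --bare "$root/seed" "$root/remote.git"

# The operator's clone: origin/main is cached at clone time.
git clone -q "$root/remote.git" "$root/local"

# Someone else lands an unrelated change on main afterwards.
git clone -q "$root/remote.git" "$root/other"
echo 'entry 2' >> "$root/other/perf-changelog.yaml"
git -C "$root/other" "${ident[@]}" commit -q -am 'second entry'
git -C "$root/other" push -q origin main

remote_tip=$(git -C "$root/remote.git" rev-parse main)
stale=$(git -C "$root/local" rev-parse origin/main)    # still the old commit

# Fetching first is what refreshes refs/remotes/origin/main.
git -C "$root/local" fetch -q origin main
fresh=$(git -C "$root/local" rev-parse origin/main)

echo "remote tip:   $remote_tip"
echo "before fetch: $stale"
echo "after fetch:  $fresh"
```

A reset to origin/main before the fetch would have used the older commit as the base, which is the situation the comment describes.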
| url=$(gh pr create --repo SemiAnalysisAI/InferenceX --base main --head "$branch" \ | ||
| --title "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \ | ||
| --body "<BODY>" --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*') | ||
| # patch the changelog pr-link with the real URL, then amend + force-push | ||
| python3 - perf-changelog.yaml "$url" <<'PY' | ||
| import sys; f,u=sys.argv[1],sys.argv[2] | ||
| open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER",u,1)) | ||
| PY | ||
| git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease |
🟡 The flow on lines 119-127 (gh pr create --label full-sweep-enabled → amend → git push --force-with-lease) double-fires the sweep workflow: the --label triggers a labeled event run on the original SHA, then the force-push fires a synchronize event run on the amended SHA. The first run is auto-cancelled by run-sweep.yml's cancel-in-progress concurrency, so GPU impact is bounded to setup-job seconds — but every /nuke invocation still leaves a cancelled sweep run in CI history per PR. Simple fix: drop --label full-sweep-enabled from the gh pr create call and apply the label after the amend+force-push via gh pr edit --add-label full-sweep-enabled.
Extended reasoning...
What the bug is
On lines 119-127 of .claude/commands/nuke.md, the per-family PR sequence is:
1. gh pr create --label full-sweep-enabled: creates the PR with the sweep label attached at creation time.
2. Patch perf-changelog.yaml to substitute PRLINK_PLACEHOLDER with the real PR URL (returned by step 1).
3. git commit --amend --no-edit then git push --force-with-lease.

.github/workflows/run-sweep.yml triggers on pull_request for types [ready_for_review, synchronize, labeled, unlabeled] filtered to perf-changelog.yaml. So step 1 fires a labeled event (sweep #1 on the original SHA), and step 3 fires a synchronize event (sweep #2 on the amended SHA). Both sweeps target a PR whose only diff is perf-changelog.yaml, so the path filter passes for both.
Why the GPU-doubling framing is wrong (refutation acknowledged)
The refuting verifier is correct that this does not double GPU spend. Evaluating run-sweep.yml's concurrency expression for both events:
- labeled with label.name == 'full-sweep-enabled': the inner AND chain in the ternary includes github.event.label.name != 'full-sweep-enabled', which is false, so the whole expression falls back to 'active'. Group = sweep-<PR#>-active.
- synchronize: (action == 'labeled' || action == 'unlabeled') is false, AND-chain short-circuits to false, fallback to 'active'. Group = sweep-<PR#>-active.

Both events land in the same concurrency group with cancel-in-progress: true, so sweep #2 cancels sweep #1 before any GPU job is dispatched (cancellation happens at the setup job stage on ubuntu-latest). The original 'doubles GPU spend' framing overstates the impact.
Why it's still worth flagging
For a one-off PR the cancelled run would be merely cosmetic, but /nuke is explicitly a batch operation that creates one PR per recipe family (the v0.22.0 example referenced in the description spans PRs #1595–#1624, 30 PRs in a single invocation). Each invocation leaves N cancelled sweep entries in CI history and wastes N setup jobs' worth of ubuntu-latest runner time. That accumulates across invocations of a command designed to be run repeatedly.
Step-by-step proof
1. git push -u origin klaud-cold/foo-v0.22.0: branch pushed with commit A (no PR yet, and no workflow trigger from this push since the branch isn't tracked by the push: trigger, which only watches main).
2. gh pr create ... --label full-sweep-enabled: PR opens with the label attached. GitHub fires pull_request with action=labeled, label.name=full-sweep-enabled, head=A. run-sweep.yml evaluates: paths filter matches (perf-changelog.yaml is in the PR diff), setup job's if passes. Sweep run #1 enters concurrency group sweep-<PR#>-active and starts.
3. git commit --amend --no-edit rewrites A → B; perf-changelog.yaml content differs (PR link patched).
4. git push --force-with-lease fires pull_request with action=synchronize, head=B. paths matches (changelog still modified). Sweep run #2 enters concurrency group sweep-<PR#>-active.
5. With cancel-in-progress: true, sweep run #1 is cancelled. Net result: 1 successful sweep run on the amended SHA + 1 cancelled sweep run on the abandoned original SHA, per PR.
Fix
Cleanest fix is a one-line reorder around the gh pr create call: omit --label full-sweep-enabled from create, then add the label after the amend+force-push completes:
url=$(gh pr create --repo ... --title ... --body "<BODY>" | grep -o 'https://github.com/[^ ]*')
# ... patch perf-changelog.yaml ...
git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease
gh pr edit "$url" --add-label full-sweep-enabled

The labeled event now lands on the amended SHA, so only one sweep runs. (The refutation's objection that 'the URL cannot be known until after gh pr create returns' doesn't apply to this variant: we still call gh pr create first, we just defer the label until after the amend.)
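The effect of the reordering can be illustrated with a toy model in plain bash. This is not GitHub's implementation: it assumes a single concurrency group with cancel-in-progress, that a sweep run starts only when the label is present at the time of the event, and that a new run cancels whichever run is still active. The event names and the simulate function are invented for the sketch.

```bash
#!/usr/bin/env bash
set -uo pipefail

# Replay a sequence of PR events and count sweep runs started and cancelled.
simulate() {
  local labeled=no active=no started=0 cancelled=0 event fires
  for event in "$@"; do
    fires=no
    case "$event" in
      create_labeled)   labeled=yes; fires=yes ;;   # `labeled` event at creation
      create_unlabeled) ;;                          # no label yet, no sweep
      force_push)       fires=$labeled ;;           # `synchronize` event
      add_label)        labeled=yes; fires=yes ;;   # `labeled` event after the push
      *) echo "unknown event: $event" >&2; return 2 ;;
    esac
    if [ "$fires" = yes ]; then
      # A still-active run in the same group is cancelled by the new one.
      [ "$active" = yes ] && cancelled=$((cancelled + 1))
      started=$((started + 1))
      active=yes
    fi
  done
  echo "started=$started cancelled=$cancelled"
}

original=$(simulate create_labeled force_push)
reordered=$(simulate create_unlabeled force_push add_label)
echo "original flow:  $original"    # started=2 cancelled=1
echo "reordered flow: $reordered"   # started=1 cancelled=0
```

Under these assumptions the original ordering leaves one cancelled run per PR, while deferring the label produces a single run on the amended commit.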
Summary
Adds .claude/commands/nuke.md: a /nuke slash command that bumps single-node inference-engine container image tags across recipes.
- One [Klaud Cold] PR per model + precision + SKU recipe family; -mtp siblings folded into the same PR; never groups different models / precisions / SKUs together; skips *-agentic unless opted in.
- full-sweep-enabled label so CI kicks off, and a perf-changelog.yaml entry with the real PR link.

Generalizes the v0.22.0 vLLM batch (PRs #1595–#1624) into a reusable command.
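The grouping rules in this summary can be sketched as a small bash function. It is a guess at the shape of the logic, not the code in nuke.md: group_families is a hypothetical name, the -mtp and -agentic suffix handling follows the conventions quoted above, and the -mtp and -agentic keys in the demo are made-up examples.

```bash
#!/usr/bin/env bash
set -uo pipefail

# Reduce a list of recipe keys to one base key per family:
# `<key>-mtp` folds into `<key>`, and `*-agentic` keys are skipped.
group_families() {
  local key base
  for key in "$@"; do
    case "$key" in
      *-agentic) continue ;;           # skipped unless explicitly opted in
      *-mtp)     base=${key%-mtp} ;;   # mtp sibling joins its base family
      *)         base=$key ;;
    esac
    printf '%s\n' "$base"
  done | sort -u
}

families=$(group_families \
  kimik2.5-int4-b300-vllm kimik2.5-int4-b300-vllm-mtp \
  kimik2.5-int4-b200-vllm dsv4-fp4-b300-vllm-agentic)
echo "$families"
```

Each line of the result corresponds to one PR; the b300 family appears once even though both its base and -mtp keys were supplied.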
Test plan
🤖 Generated with Claude Code
Note
Low Risk
Documentation-only slash command; no runtime, config, or CI behavior changes in the diff.
Overview
Adds a new Claude Code slash command at .claude/commands/nuke.md that automates bumping vLLM or SGLang container image tags for single-node benchmark recipes.

The command documents a full workflow: discover keys in nvidia-master.yaml / amd-master.yaml, verify tags on Docker Hub, confirm scope with the user, then open one [Klaud Cold] PR per model + precision + SKU family (optionally folding -mtp siblings), with strict rules against mixing models, precisions, or SKUs. It includes helper scripts for per-key image: edits and perf-changelog.yaml entries, branch/PR conventions, and the full-sweep-enabled label so GPU sweeps run.

This generalizes the prior v0.22.0 vLLM batch workflow into a reusable operator playbook; no master configs, benchmarks, or CI files change in this PR.
Reviewed by Cursor Bugbot for commit 580ae2b.