[Klaud Cold] Add /nuke command for bumping engine image tags by functionstackx · Pull Request #1625 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-30T22:18:27Z

Summary

Adds .claude/commands/nuke.md — a /nuke slash command that bumps single-node
inference-engine container image tags across recipes.

Works for both vLLM and SGLang (NVIDIA + AMD/ROCm image repos).
Grouping rules baked in: one [Klaud Cold] PR per model + precision + SKU
recipe family; -mtp siblings folded into the same PR; never groups different
models / precisions / SKUs together; skips *-agentic unless opted in.
Verifies the target Docker tag exists before bumping.
Confirms scope (vendor / agentic / special-pin overrides) before creating PRs.
Each PR gets the full-sweep-enabled label so CI kicks off, and a
perf-changelog.yaml entry with the real PR link.

Generalizes the v0.22.0 vLLM batch (PRs #1595–#1624) into a reusable command.

Test plan

N/A — docs/command only; no recipe or CI config changed.

🤖 Generated with Claude Code

Note

Low Risk
Documentation-only slash command; no runtime, config, or CI behavior changes in the diff.

Overview
Adds a new Claude Code slash command at .claude/commands/nuke.md that automates bumping vLLM or SGLang container image tags for single-node benchmark recipes.

The command documents a full workflow: discover keys in nvidia-master.yaml / amd-master.yaml, verify tags on Docker Hub, confirm scope with the user, then open one [Klaud Cold] PR per model + precision + SKU family (optionally folding -mtp siblings), with strict rules against mixing models, precisions, or SKUs. It includes helper scripts for per-key image: edits and perf-changelog.yaml entries, branch/PR conventions, and the full-sweep-enabled label so GPU sweeps run.

This generalizes the prior v0.22.0 vLLM batch workflow into a reusable operator playbook; no master configs, benchmarks, or CI files change in this PR.

Reviewed by Cursor Bugbot for commit 580ae2b. Bugbot is set up for automated code reviews on this repo.

Adds .claude/commands/nuke.md: a slash command that bumps single-node vLLM or SGLang container image tags across recipes, opening one [Klaud Cold] PR per model+precision+SKU family (mtp folded in), with full-sweep-enabled labels, docker-tag existence verification, and scope confirmation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-05-30T22:28:40Z

+url=$(gh pr create --repo SemiAnalysisAI/InferenceX --base main --head "$branch" \
+      --title "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \
+      --body "<BODY>" --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*')
+# patch the changelog pr-link with the real URL, then amend + force-push
+python3 - perf-changelog.yaml "$url" <<'PY'
+import sys; f,u=sys.argv[1],sys.argv[2]
+open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER",u,1))
+PY
+git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease


🔴 If gh pr create on line 119 fails (auth, network, label permission, branch already has a PR, etc.) the surrounding pipeline still 'succeeds' because there is no set -o pipefail — only grep's exit status propagates and $url ends up empty. The python heredoc on lines 123–126 then runs replace("PRLINK_PLACEHOLDER", "", 1), the amend+force-push on line 127 publishes a perf-changelog.yaml with a bare pr-link: (no value), and you're left with an orphaned remote branch carrying broken changelog data and no PR. Add set -euo pipefail to the snippet, or guard with [ -n "$url" ] || exit 1 immediately after the assignment.

Extended reasoning...

What the bug is. The pipeline

url=$(gh pr create … --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*')

uses no set -e or set -o pipefail (none are set anywhere in this command document). In a default bash pipeline, the exit status of $(…) is the exit status of the last command, grep, so gh pr create's exit code is discarded. When gh pr create writes its error to stderr and emits no URL on stdout, grep -o matches nothing, exits 1, and the surrounding bash sees $url="" with no error indication. The script continues.

How it manifests. Several realistic ways gh pr create fails in this exact flow:
- a PR already exists for $branch (re-runs / partial prior attempts);
- the full-sweep-enabled label doesn't exist on the repo or the token lacks permission to apply labels;
- transient network / auth error against api.github.com;
- title/body validation errors.

In each case stderr carries the diagnostic, stdout is empty, and grep -o produces no match.

What gets committed. Lines 123–126 unconditionally call

open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER", u, 1))

with u="". The placeholder is replaced with the empty string, yielding a YAML entry like:

  pr-link:

That is valid YAML (pr-link is null) but semantically broken: there is no PR to link to. Line 127 then runs git commit --amend --no-edit && git push -q --force-with-lease, force-pushing the corrupted changelog onto the branch that was already pushed by line 118. The remote branch persists with no associated PR and a junk changelog entry.

Step-by-step proof.
1. /nuke vllm v0.22.0 kimik2.5 is invoked on a branch where a previous (failed) iteration already opened a PR for klaud-cold/kimik2.5-int4-b300-vllm-v0.22.0.
2. git push -u origin "$branch" -q --force-with-lease succeeds (line 118); the branch already exists upstream.
3. gh pr create … --head "$branch" (line 119) fails with a pull request for branch "klaud-cold/..." into branch "main" already exists on stderr. stdout is empty. gh's exit code is 1.
4. The pipe feeds nothing to grep -o 'https://github.com/[^ ]*'. grep exits 1. Because $(…)'s status is grep's status, and no set -e/pipefail is in effect, bash simply assigns url="" and continues.
5. The python heredoc (lines 123–126) replaces PRLINK_PLACEHOLDER with "" in perf-changelog.yaml, so the entry becomes pr-link: with an empty value.
6. git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease (line 127) silently rewrites the existing branch with the broken changelog. No error surfaces. The for-each-family loop moves on to the next recipe.
7. Step 5 of the procedure (return to clean main + report PR table) will be missing a row for this family, but in a multi-recipe sweep that's easy to overlook, and the orphaned remote branch with junk changelog data has already been published.

Why existing code doesn't prevent it. Nothing in the snippet (or elsewhere in the file) sets pipefail or errexit. There is no [ -n "$url" ] guard before the amend. The python heredoc happily accepts an empty replacement string. git commit --amend will succeed even if the only staged change leaves the YAML semantically meaningless. --force-with-lease only protects against concurrent upstream changes, not against pushing self-inflicted garbage.

Fix. Two equivalent one-liners:

set -euo pipefail # at the top of the per-family block

or, more targeted:

url=$(gh pr create … | grep -o 'https://github.com/[^ ]*') || true
[ -n "$url" ] || { echo "gh pr create failed for $branch" >&2; exit 1; }

Either makes the failure loud and stops the changelog amend from running with an empty URL.
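The empty-URL path can be reproduced locally without contacting GitHub. The following is a minimal sketch, not the command's actual code: fake_gh_fail, fake_gh_ok, and capture_pr_url are hypothetical helpers introduced only for illustration, with fake_gh_fail standing in for a gh pr create call that writes to stderr and exits 1.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for `gh pr create`; neither one calls GitHub.
fake_gh_fail() { echo "a pull request for this branch already exists" >&2; return 1; }
fake_gh_ok()   { echo "https://github.com/example/repo/pull/1"; }

# Unguarded form, as in the snippet: the assignment's status is grep's status.
# The original never inspects it, so execution carries on with an empty url.
unguarded_status=0
unguarded_url=$(fake_gh_fail 2>/dev/null | grep -o 'https://github.com/[^ ]*') \
  || unguarded_status=$?
echo "unguarded: status=$unguarded_status url='$unguarded_url'"

# Guarded form: refuse to continue unless a URL was actually captured.
capture_pr_url() {
  local url
  url=$("$@" | grep -o 'https://github.com/[^ ]*') || true
  [ -n "$url" ] || { echo "PR creation produced no URL" >&2; return 1; }
  printf '%s\n' "$url"
}

for creator in fake_gh_ok fake_gh_fail; do
  if url=$(capture_pr_url "$creator" 2>/dev/null); then
    echo "$creator: would patch changelog with $url"
  else
    echo "$creator: skipping changelog patch"
  fi
done
```

Returning non-zero from the helper, rather than calling exit inside it, leaves the caller free to decide whether to abort the whole sweep or only skip the current family.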

claude · 2026-05-30T22:28:40Z

+```bash
+git checkout main -q && git reset --hard origin/main -q
+branch="klaud-cold/<basekey>-<TAG>"
+git checkout -b "$branch" -q
+python3 /tmp/edit_image.py <master.yaml> <NEW_IMAGE> <key> [<key>-mtp]
+python3 /tmp/append_changelog.py perf-changelog.yaml "<DESC>" <key> [<key>-mtp]
+git add -A
+git commit -q -m "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \
+  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+git push -u origin "$branch" -q --force-with-lease


🔴 Step 4 starts each family iteration with git checkout main -q && git reset --hard origin/main -q (line 110), and Step 5 finalizes with the same reset on main (line 151), both running in the user's primary worktree. Any uncommitted changes, staged work, or unpushed local commits on main are silently destroyed — and the loop multiplies the blast radius across N PRs. The sibling .claude/commands/fix-klaud-cron-prs.md deliberately uses git worktree add for exactly this reason ("Use a worktree so the loop never disturbs the user's working tree"); /nuke should follow the same pattern, or at minimum gate the reset behind a clean-tree check.

Extended reasoning...

What the bug is

The new /nuke slash command directs the executing agent to run destructive git reset --hard against the user's primary worktree in two places:

Step 4, line 110 (inside the per-family loop):
git checkout main -q && git reset --hard origin/main -q

Step 5, line 151 (finalization):

Return to a clean main (git checkout main && git reset --hard origin/main).

Neither location checks whether the working tree is clean, neither stashes, and neither uses a temp worktree. Any uncommitted edits, staged changes, untracked-but-tracked-path overwrites, or unpushed local commits on main are wiped silently with no recovery path other than the reflog.

Why existing code doesn't prevent it

There is no clean-tree precondition in the command's body. The Step 1–3 preamble is purely discovery/confirmation about recipes, not about the user's git state. The very first action of the actual mutating phase (Step 4) is the destructive reset. The N-iteration loop then repeats the reset before each new family, so even if the first iteration happened to be safe, intermediate state created mid-sweep is also clobbered.

Why this is inconsistent with the project's own pattern

The sibling command .claude/commands/fix-klaud-cron-prs.md solves the identical "loop that needs a fresh main per iteration" problem with a temp worktree, with an explicit author comment:

Use a worktree so the loop never disturbs the user's working tree.

/nuke knowingly or unknowingly deviates from this established safe pattern in the same directory.

Impact

Severity: silent destruction of uncommitted/unpushed work — the canonical "destructive git footgun."

Blast radius: amplified by the loop. A single /nuke vllm v0.22.0 over the v0.22.0 sweep created ~30 PRs (#1595–#1624); 30 iterations means 30 unconditional resets against the user's primary worktree.

Detectability: none at the time of damage — git reset --hard is silent on success. Recovery requires the user to know to consult git reflog and act before GC.

This also violates the global tool-use guidance in this very environment: "NEVER run destructive git commands (push --force, reset --hard, …) unless the user explicitly requests these actions." Encoding reset --hard into a slash command effectively bypasses that guardrail by making the destructive action implicit in invoking the command.

Step-by-step proof

User has uncommitted edits on main (say, a half-written CLAUDE.md change), and runs /nuke vllm v0.22.0.

Agent reaches Step 4, iteration 1, line 110. Executes git checkout main -q && git reset --hard origin/main -q.

git checkout main succeeds (uncommitted edits to tracked files are not blocked because checking out the current branch is a no-op). git reset --hard origin/main then unconditionally rewrites the working tree to match the remote tip.

The user's uncommitted edits to CLAUDE.md (and any other tracked file) are now gone — no prompt, no warning, no backup. Untracked files survive; everything else is overwritten.

The loop continues for the remaining N−1 families. Even if the user had pushed a local-only commit between iterations (e.g. a different branch they happened to be on before invoking), step 4's repeated reset on main clobbers any intermediate main state the agent may have been mid-creating, too.

Step 5 fires git reset --hard origin/main once more for good measure.

How to fix it

Two acceptable fixes, in order of preference:

Use a temp worktree, matching the sibling command. Before Step 4:
wt=$(mktemp -d)
git worktree add "$wt" main -q
cd "$wt"
…and at the end of Step 5, git worktree remove --force "$wt". This is exactly the pattern fix-klaud-cron-prs.md uses (lines 47–54).

Minimum acceptable mitigation: insert a clean-tree precondition before Step 4 starts, e.g.:
git diff --quiet && git diff --cached --quiet || { echo "working tree dirty, aborting /nuke"; exit 1; }
git fetch origin && [ "$(git rev-parse main)" = "$(git rev-parse origin/main)" ] || { echo "local main diverges from origin/main, aborting"; exit 1; }
This still puts the user on main rather than a worktree, but at least guarantees nothing is silently destroyed.
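To illustrate why the worktree variant is the safer of the two, here is a small sketch that runs entirely inside a throwaway repository created with mktemp. The file notes.txt, the branch klaud-cold/demo, and the demo identity are assumptions made for the demo, not names from nuke.md or fix-klaud-cron-prs.md.

```bash
#!/usr/bin/env bash
# Throwaway repository so the demo never touches a real checkout.
demo=$(mktemp -d)
repo="$demo/repo"
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/main
echo "v1" > "$repo/notes.txt"
git -C "$repo" add notes.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "initial"

# Simulate the user's half-written, uncommitted edit in the primary worktree.
echo "half-written edit" >> "$repo/notes.txt"

# Do the destructive work in a separate worktree on a new branch instead of
# running `git reset --hard` in the primary one.
wt="$demo/wt"
git -C "$repo" worktree add -b klaud-cold/demo "$wt" main >/dev/null 2>&1
git -C "$wt" reset -q --hard main   # resets only the worktree's files

primary_tail=$(tail -n 1 "$repo/notes.txt")
worktree_tail=$(tail -n 1 "$wt/notes.txt")
echo "primary:  $primary_tail"
echo "worktree: $worktree_tail"

# Clean up the worktree, as the end of Step 5 would.
git -C "$repo" worktree remove --force "$wt"
```

The primary checkout still ends with the uncommitted line, while the worktree holds the committed content, so the reset never reaches the user's in-progress edits.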

claude · 2026-05-30T22:28:40Z

+```bash
+git checkout main -q && git reset --hard origin/main -q
+branch="klaud-cold/<basekey>-<TAG>"
+git checkout -b "$branch" -q


🟡 Step 4 (line 110) runs git reset --hard origin/main -q at the start of each family iteration without a preceding git fetch, so origin/main is whatever was cached at last fetch. On a long-lived shell or after recent unrelated merges, every PR branch gets forked from a stale base, and the perf-changelog.yaml block is appended on top of an outdated snapshot — leading to avoidable rebase/merge churn. One-line fix: prepend git fetch origin main -q before the reset (or fetch once at the top of the command).

Extended reasoning...

What the bug is

Step 4 documents the per-family loop as:

git checkout main -q && git reset --hard origin/main -q
branch="klaud-cold/<basekey>-<TAG>"
git checkout -b "$branch" -q

origin/main here is the local ref refs/remotes/origin/main, which only updates when git fetch is run. Nothing in the command — neither this loop nor any preceding step — runs git fetch origin. So the base every PR branch is created from is whatever the local cache happened to be at the last fetch.

Why the staleness matters

The refutation correctly notes that /nuke itself does not merge PRs (they go out with full-sweep-enabled and need a long CI sweep before landing), so the worst case isn't a self-race within one sweep. But two realistic cold-start scenarios remain:

Long-running shell / Claude Code session. If the user has had the repo checked out for hours/days without fetching, origin/main lags behind real main. Every family iteration resets to that stale ref.

Concurrent unrelated merges. Other PRs (manual fixes, sibling Klaud Cold work, etc.) land on main during or just before the sweep. Each new family branch misses them and the changelog append is computed against an obsolete file.

Concrete proof

Assume perf-changelog.yaml on real origin/main has been updated by PR #1626 (lands at T=0) to include a new entry block. The user starts /nuke at T+30s on a shell that last fetched at T-10min, so the local refs/remotes/origin/main still points at the pre-#1626 commit.

Iteration 1, family kimik2.5-int4-b300-vllm: git reset --hard origin/main checks out the pre-#1626 changelog. append_changelog.py appends the new block at the end of that older file.

Iteration 2, family kimik2.5-int4-b200-vllm: same thing, with the same stale base and the same append point.

When these PRs are reviewed/merged, GitHub compares each branch's perf-changelog.yaml against the real current main, which already has #1626's block. The trailing region of the file (where appends happen) is now divergent, producing textual merge conflicts that require manual rebase or GitHub auto-merge intervention.

The vendor image edits don't hit this because they're scoped to unique config keys (kimik2.5-int4-b300-vllm: only appears once), but perf-changelog.yaml is appended sequentially and is the canonical collision point.

Why existing code doesn't prevent it

--force-with-lease protects the feature branch from being clobbered, but main is never force-pushed; the issue is the parent commit the feature branch is forked from.

Step 1 discovery reads master YAMLs from the working tree, so it doesn't implicitly fetch.

Step 5's closing git reset --hard origin/main runs after all PRs are already created, too late to help.

The sibling command .claude/commands/fix-klaud-cron-prs.md:50 explicitly does git fetch origin "$BRANCH" before branch operations, showing this repo's expected hygiene.

Fix

One line — either prepend git fetch origin main -q immediately before the git reset --hard origin/main -q in Step 4, or fetch once at the top of the command (cheapest, since most iterations would reuse the same fetched ref).

Severity

Nit. The blast radius is bounded to merge-time rebase churn on perf-changelog.yaml, not silent data loss or broken recipes; image edits are per-unique-key. But the fix is trivial and the documentation is the contract an LLM follows verbatim, so worth tightening.
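The staleness is easy to observe in isolation. The sketch below is an illustration under stated assumptions: a local bare repository stands in for GitHub, and the directory names upstream and operator plus the helper g are hypothetical. It shows that the cached origin/main lags the real tip until a fetch runs.

```bash
#!/usr/bin/env bash
demo=$(mktemp -d)
# Helper that supplies a throwaway identity for commits (demo-only).
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# A bare repository stands in for the GitHub remote.
git init -q --bare "$demo/origin.git"
git --git-dir="$demo/origin.git" symbolic-ref HEAD refs/heads/main

# "upstream" is where unrelated merges land.
git init -q "$demo/upstream"
git -C "$demo/upstream" symbolic-ref HEAD refs/heads/main
git -C "$demo/upstream" remote add origin "$demo/origin.git"
echo "entry 1" > "$demo/upstream/perf-changelog.yaml"
g -C "$demo/upstream" add perf-changelog.yaml
g -C "$demo/upstream" commit -q -m "first entry"
g -C "$demo/upstream" push -q origin main

# "operator" is the long-lived checkout that will run the sweep.
git clone -q "$demo/origin.git" "$demo/operator"

# Another change lands on main after the operator's last fetch.
echo "entry 2" >> "$demo/upstream/perf-changelog.yaml"
g -C "$demo/upstream" commit -q -am "second entry"
g -C "$demo/upstream" push -q origin main

remote_tip=$(git --git-dir="$demo/origin.git" rev-parse refs/heads/main)
stale_ref=$(git -C "$demo/operator" rev-parse origin/main)
git -C "$demo/operator" fetch -q origin main
fresh_ref=$(git -C "$demo/operator" rev-parse origin/main)

[ "$stale_ref" = "$remote_tip" ] && echo "before fetch: current" || echo "before fetch: stale"
[ "$fresh_ref" = "$remote_tip" ] && echo "after fetch:  current" || echo "after fetch:  stale"
```

Fetching once before the loop is enough for this demo; fetching per iteration only matters if main is expected to move during a long sweep.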

claude · 2026-05-30T22:28:40Z

+url=$(gh pr create --repo SemiAnalysisAI/InferenceX --base main --head "$branch" \
+      --title "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \
+      --body "<BODY>" --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*')
+# patch the changelog pr-link with the real URL, then amend + force-push
+python3 - perf-changelog.yaml "$url" <<'PY'
+import sys; f,u=sys.argv[1],sys.argv[2]
+open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER",u,1))
+PY
+git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease


🟡 The flow on lines 119-127 (gh pr create --label full-sweep-enabled → amend → git push --force-with-lease) double-fires the sweep workflow: the --label triggers a labeled event run on the original SHA, then the force-push fires a synchronize event run on the amended SHA. The first run is auto-cancelled by run-sweep.yml's cancel-in-progress concurrency, so GPU impact is bounded to setup-job seconds — but every /nuke invocation still leaves a cancelled sweep run in CI history per PR. Simple fix: drop --label full-sweep-enabled from the gh pr create call and apply the label after the amend+force-push via gh pr edit --add-label full-sweep-enabled.

Extended reasoning...

What the bug is

On lines 119-127 of .claude/commands/nuke.md, the per-family PR sequence is:

gh pr create --label full-sweep-enabled — creates the PR with the sweep label attached at creation time.

Patch perf-changelog.yaml to substitute PRLINK_PLACEHOLDER with the real PR URL (returned by step 1).

git commit --amend --no-edit then git push --force-with-lease.

.github/workflows/run-sweep.yml triggers on pull_request for types [ready_for_review, synchronize, labeled, unlabeled] filtered to perf-changelog.yaml. So step 1 fires a labeled event (sweep #1 on the original SHA), and step 3 fires a synchronize event (sweep #2 on the amended SHA). Both sweeps target a PR whose only diff is perf-changelog.yaml, so the path filter passes for both.

Why the GPU-doubling framing is wrong (refutation acknowledged)

The refuting verifier is correct that this does not double GPU spend. Evaluating run-sweep.yml's concurrency expression for both events:

labeled with label.name == 'full-sweep-enabled': the inner AND chain in the ternary includes github.event.label.name != 'full-sweep-enabled', which is false, so the whole expression falls back to 'active'. Group = sweep-<PR#>-active.

synchronize: (action == 'labeled' || action == 'unlabeled') is false, AND-chain short-circuits to false, fallback to 'active'. Group = sweep-<PR#>-active.

Both events land in the same concurrency group with cancel-in-progress: true, so sweep #2 cancels sweep #1 before any GPU job is dispatched (cancellation happens at the setup job stage on ubuntu-latest). The original 'doubles GPU spend' framing overstates the impact.
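The two evaluations can be modeled in a few lines. This is only a sketch of the behaviour described above, not the expression from run-sweep.yml: the function sweep_group, the PR number 1234, and the non-active suffix are placeholders, and the real AND chain may include further conditions that the review text does not show.

```bash
#!/usr/bin/env bash
# Models the described fallback: only a labeled/unlabeled event for some label
# other than full-sweep-enabled leaves the 'active' group.
sweep_group() {
  local pr=$1 action=$2 label=${3:-}
  local suffix=active
  if { [ "$action" = labeled ] || [ "$action" = unlabeled ]; } \
     && [ "$label" != full-sweep-enabled ]; then
    suffix="other-label"   # placeholder: the source does not show this value
  fi
  printf 'sweep-%s-%s\n' "$pr" "$suffix"
}

labeled_group=$(sweep_group 1234 labeled full-sweep-enabled)
sync_group=$(sweep_group 1234 synchronize)
echo "labeled:     $labeled_group"
echo "synchronize: $sync_group"
```

Both calls resolve to the same group name, which is why the later run cancels the earlier one rather than running alongside it.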

Why it's still worth flagging

The cancelled-run-per-PR is not just cosmetic for a one-off PR, but /nuke is explicitly a batch operation that creates one PR per recipe family (the v0.22.0 example referenced in the description spans PRs #1595–#1624 — 30 PRs in a single invocation). Each invocation leaves N cancelled sweep entries in CI history, plus N ubuntu-latest setup-job-seconds of wasted minutes. That accumulates across invocations of a command designed to be run repeatedly.

Step-by-step proof

git push -u origin klaud-cold/foo-v0.22.0 — branch pushed with commit A (no PR yet, no workflow trigger from this push since branch isn't tracked by push: trigger which only watches main).

gh pr create ... --label full-sweep-enabled: the PR opens with the label attached. GitHub fires pull_request with action=labeled, label.name=full-sweep-enabled, head=A. run-sweep.yml evaluates: paths filter matches (perf-changelog.yaml is in the PR diff), setup job's if passes. Sweep run #1 enters concurrency group sweep-<PR#>-active and starts.

git commit --amend --no-edit rewrites A → B; perf-changelog.yaml content differs (PR link patched). git push --force-with-lease fires pull_request with action=synchronize, head=B. paths matches (changelog still modified). Sweep run #2 enters concurrency group sweep-<PR#>-active.

With cancel-in-progress: true, sweep #1 is cancelled. Net result: 1 successful sweep run on the amended SHA + 1 cancelled sweep run on the abandoned original SHA, per PR.

Fix

Cleanest fix is a one-line reorder around the gh pr create call: omit --label full-sweep-enabled from create, then add the label after the amend+force-push completes:

url=$(gh pr create --repo ... --title ... --body "<BODY>" | grep -o 'https://github.com/[^ ]*')
# ... patch perf-changelog.yaml ...
git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease
gh pr edit "$url" --add-label full-sweep-enabled

The labeled event now lands on the amended SHA, so only one sweep runs. (The refutation's objection that 'the URL cannot be known until after gh pr create returns' doesn't apply to this variant — we still call gh pr create first, we just defer the label until after the amend.)

functionstackx requested a review from a team May 30, 2026 22:18

github-project-automation Bot added this to InferenceMAX Board May 30, 2026

functionstackx merged commit a73df61 into main May 30, 2026
5 checks passed

functionstackx deleted the klaud-cold/add-nuke-command branch May 30, 2026 22:19

github-project-automation Bot moved this to Done in InferenceMAX Board May 30, 2026

claude Bot reviewed May 30, 2026
