Merged
153 changes: 153 additions & 0 deletions .claude/commands/nuke.md
@@ -0,0 +1,153 @@
---
description: Bump single-node inference-engine image tags (vLLM or SGLang) across recipes, one [Klaud Cold] PR per model+precision+SKU
argument-hint: <vllm|sglang> <target-tag> [model/sku filter]
---

Bump the container image tag for single-node benchmark recipes that use a given
inference engine, opening **one PR per recipe family** with the grouping rules below.

Arguments (`$ARGUMENTS`): `<engine> <target-tag> [filter]`
- `engine` — `vllm` or `sglang`
- `target-tag` — e.g. `v0.22.0` (NVIDIA/CUDA) ; for SGLang the NVIDIA and AMD tag
strings usually differ (CUDA `…-cu130` vs ROCm `…-rocm720-mi35x-…`), so confirm
the exact tag per image repo with the user before editing.
- `filter` (optional) — restrict to a model and/or SKU substring (e.g. `kimik2.5`,
`b300`, `minimaxm2.5 mi355x`). If omitted, all matching recipes are in scope.

## Image repos by engine + vendor

| engine | NVIDIA image | AMD/ROCm image | master config |
|--------|--------------|----------------|---------------|
| vllm | `vllm/vllm-openai` | `vllm/vllm-openai-rocm` | `.github/configs/nvidia-master.yaml` / `amd-master.yaml` |
| sglang | `lmsysorg/sglang` | `lmsysorg/sglang` (rocm-suffixed tag) | same two files |

## Grouping rules (NON-NEGOTIABLE)

1. **One PR per `model + precision + SKU` recipe family.** The config-key shape is
`<model>-<precision>-<sku>-<engine>` (e.g. `kimik2.5-int4-b300-vllm`).
2. **Fold the `-mtp` (and non-mtp) sibling into the SAME PR** as its base recipe.
This is the *only* thing you may combine.
3. **Never** put two different models, two different precisions, or two different
SKUs in the same PR. (fp4 vs fp8 vs int4 are different precisions → separate PRs.)
4. Skip `*-agentic` recipes unless the user explicitly opts in — they are
deliberately diverged/pinned.
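
The grouping rules above can be sketched as a small function. This is only an illustration: the name `group_families` is hypothetical, and it assumes that an mtp sibling is always named `<key>-mtp` and that agentic recipes always end in `-agentic`.

```python
from collections import defaultdict


def group_families(keys, include_agentic=False):
    """Map each base config key to the keys that may share one PR.

    Assumptions (not confirmed by the repo): the mtp sibling is exactly
    "<base>-mtp", and agentic recipes end in "-agentic".
    """
    families = defaultdict(list)
    for key in keys:
        if key.endswith("-agentic") and not include_agentic:
            continue  # rule 4: skip agentic recipes unless the user opts in
        # rule 2: the -mtp sibling folds into its base recipe's family
        base = key[: -len("-mtp")] if key.endswith("-mtp") else key
        families[base].append(key)
    # rules 1 and 3: every distinct base key is its own PR
    return {base: sorted(members) for base, members in sorted(families.items())}
```

Because the family key is the full `<model>-<precision>-<sku>-<engine>` string, two keys that differ in model, precision, or SKU can never land in the same group.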

## Step 1 — discover candidate recipes

Parse both master YAMLs for top-level keys whose `framework:` matches `engine`, and
record each key's current `image:`. Keep only single-node keys (they carry a SKU like
`b200/b300/h100/h200/mi300x/mi325x/mi355x` and map to `benchmarks/single_node/*`); drop
multi-node/disagg keys. Apply the `filter` if given. Then collapse `-mtp` siblings into
their base family.
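
A minimal sketch of this discovery pass, assuming the master YAMLs have the same shape the `edit_image.py` helper relies on: unindented top-level keys, with indented `framework:` and `image:` lines beneath them. The function name `discover` and the SKU-segment check are assumptions; in particular, the SKU check is only a rough proxy and does not by itself identify multi-node or disaggregated keys.

```python
import re

TOP_KEY = re.compile(r"^([A-Za-z0-9._-]+):\s*$")
FIELD = re.compile(r"^\s+(framework|image):\s*(.+?)\s*$")
# SKU names taken from the Step 1 text above.
SKUS = ("b200", "b300", "h100", "h200", "mi300x", "mi325x", "mi355x")


def discover(yaml_text, engine, name_filter=""):
    """Return {config_key: current_image} for keys whose framework is `engine`."""
    entries, current = {}, None
    for line in yaml_text.split("\n"):
        top = TOP_KEY.match(line)
        if top:
            current = entries.setdefault(top.group(1), {})
            continue
        field = FIELD.match(line)
        if field and current is not None:
            # keep the first framework:/image: seen under each key
            current.setdefault(field.group(1), field.group(2))
    wanted = name_filter.split()
    found = {}
    for key, fields in entries.items():
        if fields.get("framework") != engine or "image" not in fields:
            continue
        if not any(sku in key.split("-") for sku in SKUS):
            continue  # no recognised SKU segment in the key
        if all(term in key for term in wanted):
            found[key] = fields["image"]
    return found
```

A filter such as `minimaxm2.5 mi355x` is treated here as two substrings that must both appear in the key; whether the command intends that exact matching rule is not stated in the source.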

## Step 2 — verify the target tag(s) EXIST before bumping

Per standing guidance, never invent a tag. Check each image repo you'll touch:

```bash
for repo in vllm/vllm-openai vllm/vllm-openai-rocm; do  # or lmsysorg/sglang
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://hub.docker.com/v2/repositories/${repo}/tags/<TAG>")
  echo "$repo:<TAG> -> $code"  # want 200
done
```

If a vendor-specific variant 404s (e.g. `…-cu130` for a version that only ships
plain), confirm the correct tag string with the user before proceeding.
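
Since the NVIDIA and AMD tag strings can differ, the same check can be run over explicit `(repo, tag)` pairs. The following Python sketch uses the Docker Hub URL pattern from the shell loop above; the helper names `http_status` and `missing_tags` are hypothetical, and the `status` parameter exists only so the lookup can be replaced in a dry run.

```python
import urllib.error
import urllib.request

# Same endpoint shape as the curl loop above.
HUB = "https://hub.docker.com/v2/repositories/{repo}/tags/{tag}"


def http_status(url, timeout=10):
    """Return the HTTP status code of a GET on `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when the tag does not exist


def missing_tags(pairs, status=http_status):
    """Return the (repo, tag) pairs whose lookup did not answer 200."""
    return [
        (repo, tag)
        for repo, tag in pairs
        if status(HUB.format(repo=repo, tag=tag)) != 200
    ]
```

An empty result means every tag was found; any pair that comes back should be confirmed with the user rather than guessed.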

## Step 3 — confirm scope with the user (AskUserQuestion)

Before creating anything, present the full recipe list (count + current→target per
family) and confirm:
- **Vendor scope**: NVIDIA, AMD, or both.
- **Agentic**: include `*-agentic` siblings? (default: exclude)
- **Special pins**: call out any recipe currently on a nightly/non-stable/special tag
(e.g. `nightly-…`, `…-cu129`, a one-off build) and ask whether to override it.

Each PR triggers a full GPU sweep, so surface the total PR count explicitly.

## Step 4 — create one PR per family

Use these helpers (write them to /tmp) for precise, per-config-key edits — a blind
`sed` is unsafe because the same old tag appears under many keys.

`/tmp/edit_image.py`:
```python
#!/usr/bin/env python3
# Usage: edit_image.py <yaml_file> <new_image> <key1> [key2 ...]
import re, sys
f, new_image, keys = sys.argv[1], sys.argv[2], sys.argv[3:]
lines = open(f).read().split('\n')
for key in keys:
    kre = re.compile(r'^' + re.escape(key) + r':\s*$')
    start = next((i for i,l in enumerate(lines) if kre.match(l)), None)
    if start is None: sys.exit(f"ERROR: key not found: {key}")
    img_i = None
    for j in range(start+1, len(lines)):
        if re.match(r'^[A-Za-z0-9._-]+:\s*$', lines[j]): break  # next top-level key
        m = re.match(r'^(\s+)image:\s*(.+?)\s*$', lines[j])
        if m: img_i, indent, old = j, m.group(1), m.group(2); break
    if img_i is None: sys.exit(f"ERROR: no image: line for key {key}")
    if old != new_image: lines[img_i] = f"{indent}image: {new_image}"; print(f"{key}: {old} -> {new_image}")
    else: print(f"{key}: already {new_image} (no change)")
open(f,'w').write('\n'.join(lines))
```

`/tmp/append_changelog.py`:
```python
#!/usr/bin/env python3
# Usage: append_changelog.py <changelog> <description> <key1> [key2 ...]
import sys
f, desc, keys = sys.argv[1], sys.argv[2], sys.argv[3:]
content = open(f).read().rstrip('\n')
block = ["", "- config-keys:"] + [f"    - {k}" for k in keys]
block += ["  description:", f'    - "{desc}"', "  pr-link: PRLINK_PLACEHOLDER"]
open(f,'w').write(content + '\n' + '\n'.join(block) + '\n')
```

For each family (run strictly sequentially — git checkouts can't be parallel):

```bash
git checkout main -q && git reset --hard origin/main -q
branch="klaud-cold/<basekey>-<TAG>"
git checkout -b "$branch" -q

Check warning on line 112 in .claude/commands/nuke.md

Claude / Claude Code Review

Missing git fetch before reset --hard uses stale origin/main

Comment on lines +109 to +112

🟡 Step 4 (line 110) runs git reset --hard origin/main -q at the start of each family iteration without a preceding git fetch, so origin/main is whatever was cached at last fetch. On a long-lived shell or after recent unrelated merges, every PR branch gets forked from a stale base, and the perf-changelog.yaml block is appended on top of an outdated snapshot — leading to avoidable rebase/merge churn. One-line fix: prepend git fetch origin main -q before the reset (or fetch once at the top of the command).

Extended reasoning...

What the bug is

Step 4 documents the per-family loop as:

git checkout main -q && git reset --hard origin/main -q
branch="klaud-cold/<basekey>-<TAG>"
git checkout -b "$branch" -q

origin/main here is the local ref refs/remotes/origin/main, which only updates when git fetch is run. Nothing in the command — neither this loop nor any preceding step — runs git fetch origin. So the base every PR branch is created from is whatever the local cache happened to be at the last fetch.

Why the staleness matters

The refutation correctly notes that /nuke itself does not merge PRs (they go out with full-sweep-enabled and need a long CI sweep before landing), so the worst case isn't a self-race within one sweep. But two realistic cold-start scenarios remain:

  1. Long-running shell / Claude Code session. If the user has had the repo checked out for hours/days without fetching, origin/main lags behind real main. Every family iteration resets to that stale ref.
  2. Concurrent unrelated merges. Other PRs (manual fixes, sibling Klaud Cold work, etc.) land on main during or just before the sweep. Each new family branch misses them and the changelog append is computed against an obsolete file.

Concrete proof

Assume perf-changelog.yaml on real origin/main has been updated by PR #1626 (lands at T=0) to include a new entry block. The user starts /nuke at T+30s on a shell that last fetched at T-10min, so the local refs/remotes/origin/main still points at the pre-#1626 commit.

  1. Iteration 1, family kimik2.5-int4-b300-vllm: git reset --hard origin/main checks out the pre-#1626 changelog. append_changelog.py appends the new block at the end of that older file.
  2. Iteration 2, family kimik2.5-int4-b200-vllm: same thing, with the same stale base and the same append point.
  3. When these PRs are reviewed/merged, GitHub compares each branch's perf-changelog.yaml against the real current main, which already has #1626's block. The trailing region of the file (where appends happen) is now divergent, producing textual merge conflicts that require manual rebase or GitHub auto-merge intervention.

The vendor image edits don't hit this because they're scoped to unique config keys (kimik2.5-int4-b300-vllm: only appears once), but perf-changelog.yaml is appended sequentially and is the canonical collision point.

Why existing code doesn't prevent it

  • --force-with-lease protects the feature branch from being clobbered, but main is never force-pushed; the issue is the parent commit the feature branch is forked from.
  • Step 1 discovery reads master YAMLs from the working tree, so it doesn't implicitly fetch.
  • Step 5's closing git reset --hard origin/main runs after all PRs are already created, too late to help.
  • The sibling command .claude/commands/fix-klaud-cron-prs.md:50 explicitly does git fetch origin "$BRANCH" before branch operations, showing this repo's expected hygiene.

Fix

One line — either prepend git fetch origin main -q immediately before the git reset --hard origin/main -q in Step 4, or fetch once at the top of the command (cheapest, since most iterations would reuse the same fetched ref).

Severity

Nit. The blast radius is bounded to merge-time rebase churn on perf-changelog.yaml, not silent data loss or broken recipes; image edits are per-unique-key. But the fix is trivial and the documentation is the contract an LLM follows verbatim, so worth tightening.

python3 /tmp/edit_image.py <master.yaml> <NEW_IMAGE> <key> [<key>-mtp]
python3 /tmp/append_changelog.py perf-changelog.yaml "<DESC>" <key> [<key>-mtp]
git add -A
git commit -q -m "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
git push -u origin "$branch" -q --force-with-lease

Check failure on line 118 in .claude/commands/nuke.md

Claude / Claude Code Review

Destructive git reset --hard clobbers user working tree

Comment on lines +109 to +118

🔴 Step 4 starts each family iteration with git checkout main -q && git reset --hard origin/main -q (line 110), and Step 5 finalizes with the same reset on main (line 151), both running in the user's primary worktree. Any uncommitted changes, staged work, or unpushed local commits on main are silently destroyed — and the loop multiplies the blast radius across N PRs. The sibling .claude/commands/fix-klaud-cron-prs.md deliberately uses git worktree add for exactly this reason ("Use a worktree so the loop never disturbs the user's working tree"); /nuke should follow the same pattern, or at minimum gate the reset behind a clean-tree check.

Extended reasoning...

What the bug is

The new /nuke slash command directs the executing agent to run destructive git reset --hard against the user's primary worktree in two places:

  1. Step 4, line 110 (inside the per-family loop):
    git checkout main -q && git reset --hard origin/main -q
  2. Step 5, line 151 (finalization):

    Return to a clean main (git checkout main && git reset --hard origin/main).

Neither location checks whether the working tree is clean, neither stashes, and neither uses a temp worktree. Any uncommitted edits, staged changes, untracked-but-tracked-path overwrites, or unpushed local commits on main are wiped silently with no recovery path other than the reflog.

Why existing code doesn't prevent it

There is no clean-tree precondition in the command's body. The Step 1–3 preamble is purely discovery/confirmation about recipes, not about the user's git state. The very first action of the actual mutating phase (Step 4) is the destructive reset. The N-iteration loop then repeats the reset before each new family, so even if the first iteration happened to be safe, intermediate state created mid-sweep is also clobbered.

Why this is inconsistent with the project's own pattern

The sibling command .claude/commands/fix-klaud-cron-prs.md solves the identical "loop that needs a fresh main per iteration" problem with a temp worktree, with an explicit author comment:

Use a worktree so the loop never disturbs the user's working tree.

/nuke knowingly or unknowingly deviates from this established safe pattern in the same directory.

Impact

  • Severity: silent destruction of uncommitted/unpushed work — the canonical "destructive git footgun."
  • Blast radius: amplified by the loop. A single /nuke vllm v0.22.0 over the v0.22.0 sweep created ~30 PRs (#1595 through #1624); 30 iterations means 30 unconditional resets against the user's primary worktree.
  • Detectability: none at the time of damage — git reset --hard is silent on success. Recovery requires the user to know to consult git reflog and act before GC.
  • This also violates the global tool-use guidance in this very environment: "NEVER run destructive git commands (push --force, reset --hard, …) unless the user explicitly requests these actions." Encoding reset --hard into a slash command effectively bypasses that guardrail by making the destructive action implicit in invoking the command.

Step-by-step proof

  1. User has uncommitted edits on main (say, a half-written CLAUDE.md change), and runs /nuke vllm v0.22.0.
  2. Agent reaches Step 4, iteration 1, line 110. Executes git checkout main -q && git reset --hard origin/main -q.
  3. git checkout main succeeds (uncommitted edits to tracked files are not blocked because checking out the current branch is a no-op). git reset --hard origin/main then unconditionally rewrites the working tree to match the remote tip.
  4. The user's uncommitted edits to CLAUDE.md (and any other tracked file) are now gone — no prompt, no warning, no backup. Untracked files survive; everything else is overwritten.
  5. The loop continues for the remaining N−1 families. Even if the user had pushed a local-only commit between iterations (e.g. a different branch they happened to be on before invoking), step 4's repeated reset on main clobbers any intermediate main state the agent may have been mid-creating, too.
  6. Step 5 fires git reset --hard origin/main once more for good measure.

How to fix it

Two acceptable fixes, in order of preference:

  1. Use a temp worktree, matching the sibling command. Before Step 4:
    wt=/tmp/claude-0/tmp.JptYdkmIlW
    git worktree add "$wt" main -q
    cd "$wt"
    …and at the end of Step 5, git worktree remove --force "$wt". This is exactly the pattern fix-klaud-cron-prs.md uses (lines 47–54).
  2. Minimum acceptable mitigation: insert a clean-tree precondition before Step 4 starts, e.g.:
    git diff --quiet && git diff --cached --quiet || { echo "working tree dirty — aborting /nuke"; exit 1; }
    git fetch origin && [ "$(git rev-parse main)" = "$(git rev-parse origin/main)" ] || { echo "local main diverges from origin/main — aborting"; exit 1; }
    This still puts the user on main rather than a worktree, but at least guarantees nothing is silently destroyed.

url=$(gh pr create --repo SemiAnalysisAI/InferenceX --base main --head "$branch" \
--title "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \
--body "<BODY>" --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*')
# patch the changelog pr-link with the real URL, then amend + force-push
python3 - perf-changelog.yaml "$url" <<'PY'
import sys; f,u=sys.argv[1],sys.argv[2]
open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER",u,1))
PY
git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease

Check failure on line 127 in .claude/commands/nuke.md

Claude / Claude Code Review

gh pr create failure silently writes empty pr-link to changelog


Check warning on line 127 in .claude/commands/nuke.md

Claude / Claude Code Review

Amend + force-push after labeled gh pr create triggers redundant sweep

Comment on lines +119 to +127

🔴 If gh pr create on line 119 fails (auth, network, label permission, branch already has a PR, etc.) the surrounding pipeline still 'succeeds' because there is no set -o pipefail — only grep's exit status propagates and $url ends up empty. The python heredoc on lines 123–126 then runs replace("PRLINK_PLACEHOLDER", "", 1), the amend+force-push on line 127 publishes a perf-changelog.yaml with a bare pr-link: (no value), and you're left with an orphaned remote branch carrying broken changelog data and no PR. Add set -euo pipefail to the snippet, or guard with [ -n "$url" ] || exit 1 immediately after the assignment.

Extended reasoning...

What the bug is

The pipeline

    url=$(gh pr create … --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*')

uses no set -e or set -o pipefail (none are set anywhere in this command document). In a default bash pipeline, the exit status of $(…) is the exit status of the last command, grep, so gh pr create's exit code is discarded. When gh pr create writes its error to stderr and emits no URL on stdout, grep -o matches nothing, exits 1, and the surrounding bash sees $url="" with no error indication. The script continues.

How it manifests

Several realistic ways gh pr create fails in this exact flow:

  • a PR already exists for $branch (re-runs / partial prior attempts);
  • the full-sweep-enabled label doesn't exist on the repo or the token lacks permission to apply labels;
  • transient network / auth error against api.github.com;
  • title/body validation errors.

In each case stderr carries the diagnostic, stdout is empty, and grep -o produces no match.

What gets committed

Lines 123–126 unconditionally call

    open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER", u, 1))

with u="". The placeholder is replaced with the empty string, yielding a YAML entry like:

    pr-link:

That is valid YAML (pr-link is null) but semantically broken: there is no PR to link to. Line 127 then runs git commit --amend --no-edit && git push -q --force-with-lease, force-pushing the corrupted changelog onto the branch that was already pushed by line 118. The remote branch persists with no associated PR and a junk changelog entry.

Step-by-step proof

  1. /nuke vllm v0.22.0 kimik2.5 is invoked on a branch where a previous (failed) iteration already opened a PR for klaud-cold/kimik2.5-int4-b300-vllm-v0.22.0.
  2. git push -u origin "$branch" -q --force-with-lease succeeds (line 118) because the branch already exists upstream.
  3. gh pr create … --head "$branch" (line 119) fails with a pull request for branch "klaud-cold/..." into branch "main" already exists on stderr. stdout is empty. gh's exit code is 1.
  4. The pipe feeds nothing to grep -o 'https://github.com/[^ ]*'. grep exits 1. Because $(…)'s status is grep's status, and no set -e/pipefail is in effect, bash simply assigns url="" and continues.
  5. The python heredoc (lines 123–126) replaces PRLINK_PLACEHOLDER with "" in perf-changelog.yaml, so the entry becomes pr-link: with an empty value.
  6. git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease (line 127) silently rewrites the existing branch with the broken changelog. No error surfaces. The for-each-family loop moves on to the next recipe.
  7. Step 5 of the procedure (return to clean main + report PR table) will be missing a row for this family, but in a multi-recipe sweep that's easy to overlook, and the orphaned remote branch with junk changelog data has already been published.

Why existing code doesn't prevent it

Nothing in the snippet (or elsewhere in the file) sets pipefail or errexit. There is no [ -n "$url" ] guard before the amend. The python heredoc happily accepts an empty replacement string. git commit --amend will succeed even if the only staged change leaves the YAML semantically meaningless. --force-with-lease only protects against concurrent upstream changes, not against pushing self-inflicted garbage.

Fix

Two equivalent one-liners:

    set -euo pipefail # at the top of the per-family block

or, more targeted:

    url=$(gh pr create … | grep -o 'https://github.com/[^ ]*') || true
    [ -n "$url" ] || { echo "gh pr create failed for $branch" >&2; exit 1; }

Either makes the failure loud and stops the changelog amend from running with an empty URL.

Comment on lines +119 to +127

🟡 The flow on lines 119-127 (gh pr create --label full-sweep-enabled → amend → git push --force-with-lease) double-fires the sweep workflow: the --label triggers a labeled event run on the original SHA, then the force-push fires a synchronize event run on the amended SHA. The first run is auto-cancelled by run-sweep.yml's cancel-in-progress concurrency, so GPU impact is bounded to setup-job seconds — but every /nuke invocation still leaves a cancelled sweep run in CI history per PR. Simple fix: drop --label full-sweep-enabled from the gh pr create call and apply the label after the amend+force-push via gh pr edit --add-label full-sweep-enabled.

Extended reasoning...

What the bug is

On lines 119-127 of .claude/commands/nuke.md, the per-family PR sequence is:

  1. gh pr create --label full-sweep-enabled — creates the PR with the sweep label attached at creation time.
  2. Patch perf-changelog.yaml to substitute PRLINK_PLACEHOLDER with the real PR URL (returned by step 1).
  3. git commit --amend --no-edit then git push --force-with-lease.

.github/workflows/run-sweep.yml triggers on pull_request for types [ready_for_review, synchronize, labeled, unlabeled] filtered to perf-changelog.yaml. So step 1 fires a labeled event (sweep #1 on the original SHA), and step 3 fires a synchronize event (sweep #2 on the amended SHA). Both sweeps target a PR whose only diff is perf-changelog.yaml, so the path filter passes for both.

Why the GPU-doubling framing is wrong (refutation acknowledged)

The refuting verifier is correct that this does not double GPU spend. Evaluating run-sweep.yml's concurrency expression for both events:

  • labeled with label.name == 'full-sweep-enabled': the inner AND chain in the ternary includes github.event.label.name != 'full-sweep-enabled', which is false, so the whole expression falls back to 'active'. Group = sweep-<PR#>-active.
  • synchronize: (action == 'labeled' || action == 'unlabeled') is false, AND-chain short-circuits to false, fallback to 'active'. Group = sweep-<PR#>-active.

Both events land in the same concurrency group with cancel-in-progress: true, so sweep #2 cancels sweep #1 before any GPU job is dispatched (cancellation happens at the setup job stage on ubuntu-latest). The original 'doubles GPU spend' framing overstates the impact.

Why it's still worth flagging

A cancelled run per PR would be merely cosmetic for a one-off PR, but /nuke is explicitly a batch operation that creates one PR per recipe family (the v0.22.0 example referenced in the description spans PRs #1595 through #1624, 30 PRs in a single invocation). Each invocation leaves N cancelled sweep entries in CI history, plus N setup jobs' worth of wasted ubuntu-latest runner time. That accumulates across invocations of a command designed to be run repeatedly.

Step-by-step proof

  1. git push -u origin klaud-cold/foo-v0.22.0 — branch pushed with commit A (no PR yet, no workflow trigger from this push since branch isn't tracked by push: trigger which only watches main).
  2. gh pr create ... --label full-sweep-enabled: PR opens with the label attached. GitHub fires pull_request with action=labeled, label.name=full-sweep-enabled, head=A. run-sweep.yml evaluates: paths filter matches (perf-changelog.yaml is in the PR diff), setup job's if passes. Sweep run #1 enters concurrency group sweep-<PR#>-active and starts.
  3. git commit --amend --no-edit rewrites A → B; perf-changelog.yaml content differs (PR link patched). git push --force-with-lease fires pull_request with action=synchronize, head=B. paths matches (changelog still modified). Sweep run #2 enters concurrency group sweep-<PR#>-active.
  4. With cancel-in-progress: true, sweep run #1 is cancelled. Net result: 1 successful sweep run on the amended SHA + 1 cancelled sweep run on the abandoned original SHA, per PR.

Fix

Cleanest fix is a one-line reorder around the gh pr create call: omit --label full-sweep-enabled from create, then add the label after the amend+force-push completes:

url=$(gh pr create --repo ... --title ... --body "<BODY>" | grep -o 'https://github.com/[^ ]*')
# ... patch perf-changelog.yaml ...
git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease
gh pr edit "$url" --add-label full-sweep-enabled

The labeled event now lands on the amended SHA, so only one sweep runs. (The refutation's objection that 'the URL cannot be known until after gh pr create returns' doesn't apply to this variant — we still call gh pr create first, we just defer the label until after the amend.)

```
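
One way to act on the review comment about an empty `$url` is to make the placeholder patch itself refuse bad input. The sketch below is a hypothetical replacement for the inline heredoc, not the author's code: the function name `patch_pr_link` and the `https://github.com/` prefix check are assumptions layered on the `PRLINK_PLACEHOLDER` convention used above.

```python
def patch_pr_link(changelog_text, url, placeholder="PRLINK_PLACEHOLDER"):
    """Return the changelog text with the first placeholder replaced by `url`.

    Raises ValueError instead of writing an empty or malformed pr-link,
    so a failed `gh pr create` cannot silently reach the amend step.
    """
    if not url.startswith("https://github.com/"):
        raise ValueError(f"refusing to patch pr-link with {url!r}")
    if placeholder not in changelog_text:
        raise ValueError(f"{placeholder} not found in changelog")
    return changelog_text.replace(placeholder, url, 1)
```

Raising before anything is written means the amend and force-push never run on a changelog whose `pr-link:` would be blank.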

Conventions:
- `<PHRASE>` = `vLLM image` / `vLLM ROCm image` / `SGLang image` / `SGLang ROCm image`.
- Title gets `(+mtp)` only when the family has an mtp sibling.
- Every PR carries the **`full-sweep-enabled`** label so CI kicks off.
- `<DESC>` = `Update <PHRASE> from <old-tag> to <TAG>` (note both tags when the
base/mtp differ, e.g. base already on target).
- PR body:
```
## Summary
<DESC>

Recipes touched: `key1`, `key2`

## Test plan
- [ ] full-sweep-enabled sweep passes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
```
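
The conventions above can be assembled mechanically. This is a minimal sketch under the assumption that an mtp sibling is recognisable by a `-mtp` suffix; the function name `pr_text` and its parameters are hypothetical, and the case where base and mtp start from different old tags is not handled.

```python
def pr_text(basekey, phrase, old_tag, tag, keys):
    """Build (title, description, body) for one recipe family's PR."""
    # Title gets "(+mtp)" only when the family has an mtp sibling.
    suffix = " (+mtp)" if any(k.endswith("-mtp") for k in keys) else ""
    title = f"[Klaud Cold] Update {basekey}{suffix} {phrase} to {tag}"
    desc = f"Update {phrase} from {old_tag} to {tag}"
    body = "\n".join([
        "## Summary",
        desc,
        "",
        "Recipes touched: " + ", ".join(f"`{k}`" for k in keys),
        "",
        "## Test plan",
        "- [ ] full-sweep-enabled sweep passes.",
        "",
        "🤖 Generated with [Claude Code](https://claude.com/claude-code)",
    ])
    return title, desc, body
```

The same `desc` string is what `append_changelog.py` would receive as `<DESC>`, which keeps the changelog entry and the PR summary in agreement.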

## Step 5 — finish

Return to a clean `main` (`git checkout main && git reset --hard origin/main`).
Report a table of every PR created (number + URL + recipe), flag any special-pin
overrides, and note that each PR's sweep will run via the `full-sweep-enabled` label.