-
Notifications
You must be signed in to change notification settings - Fork 179
[Klaud Cold] Add /nuke command for bumping engine image tags #1625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,153 @@ | ||
| --- | ||
| description: Bump single-node inference-engine image tags (vLLM or SGLang) across recipes, one [Klaud Cold] PR per model+precision+SKU | ||
| argument-hint: <vllm|sglang> <target-tag> [model/sku filter] | ||
| --- | ||
|
|
||
| Bump the container image tag for single-node benchmark recipes that use a given | ||
| inference engine, opening **one PR per recipe family** with the grouping rules below. | ||
|
|
||
| Arguments (`$ARGUMENTS`): `<engine> <target-tag> [filter]` | ||
| - `engine` — `vllm` or `sglang` | ||
| - `target-tag` — e.g. `v0.22.0` (NVIDIA/CUDA) ; for SGLang the NVIDIA and AMD tag | ||
| strings usually differ (CUDA `…-cu130` vs ROCm `…-rocm720-mi35x-…`), so confirm | ||
| the exact tag per image repo with the user before editing. | ||
| - `filter` (optional) — restrict to a model and/or SKU substring (e.g. `kimik2.5`, | ||
| `b300`, `minimaxm2.5 mi355x`). If omitted, all matching recipes are in scope. | ||
|
|
||
| ## Image repos by engine + vendor | ||
|
|
||
| | engine | NVIDIA image | AMD/ROCm image | master config | | ||
| |--------|--------------|----------------|---------------| | ||
| | vllm | `vllm/vllm-openai` | `vllm/vllm-openai-rocm` | `.github/configs/nvidia-master.yaml` / `amd-master.yaml` | | ||
| | sglang | `lmsysorg/sglang` | `lmsysorg/sglang` (rocm-suffixed tag) | same two files | | ||
|
|
||
| ## Grouping rules (NON-NEGOTIABLE) | ||
|
|
||
| 1. **One PR per `model + precision + SKU` recipe family.** The config-key shape is | ||
| `<model>-<precision>-<sku>-<engine>` (e.g. `kimik2.5-int4-b300-vllm`). | ||
| 2. **Fold the `-mtp` (and non-mtp) sibling into the SAME PR** as its base recipe. | ||
| This is the *only* thing you may combine. | ||
| 3. **Never** put two different models, two different precisions, or two different | ||
| SKUs in the same PR. (fp4 vs fp8 vs int4 are different precisions → separate PRs.) | ||
| 4. Skip `*-agentic` recipes unless the user explicitly opts in — they are | ||
| deliberately diverged/pinned. | ||
|
|
||
| ## Step 1 — discover candidate recipes | ||
|
|
||
| Parse both master YAMLs for top-level keys whose `framework:` matches `engine`, and | ||
| record each key's current `image:`. Keep only single-node keys (they carry a SKU like | ||
| `b200/b300/h100/h200/mi300x/mi325x/mi355x` and map to `benchmarks/single_node/*`); drop | ||
| multi-node/disagg keys. Apply the `filter` if given. Then collapse `-mtp` siblings into | ||
| their base family. | ||
|
|
||
| ## Step 2 — verify the target tag(s) EXIST before bumping | ||
|
|
||
| Per standing guidance, never invent a tag. Check each image repo you'll touch: | ||
|
|
||
| ```bash | ||
| for repo in vllm/vllm-openai vllm/vllm-openai-rocm; do # or lmsysorg/sglang | ||
| code=$(curl -s -o /dev/null -w "%{http_code}" "https://hub.docker.com/v2/repositories/${repo}/tags/<TAG>") | ||
| echo "$repo:<TAG> -> $code" # want 200 | ||
| done | ||
| ``` | ||
|
|
||
| If a vendor-specific variant 404s (e.g. `…-cu130` for a version that only ships | ||
| plain), confirm the correct tag string with the user before proceeding. | ||
|
|
||
| ## Step 3 — confirm scope with the user (AskUserQuestion) | ||
|
|
||
| Before creating anything, present the full recipe list (count + current→target per | ||
| family) and confirm: | ||
| - **Vendor scope**: NVIDIA, AMD, or both. | ||
| - **Agentic**: include `*-agentic` siblings? (default: exclude) | ||
| - **Special pins**: call out any recipe currently on a nightly/non-stable/special tag | ||
| (e.g. `nightly-…`, `…-cu129`, a one-off build) and ask whether to override it. | ||
|
|
||
| Each PR triggers a full GPU sweep, so surface the total PR count explicitly. | ||
|
|
||
| ## Step 4 — create one PR per family | ||
|
|
||
| Use these helpers (write them to /tmp) for precise, per-config-key edits — a blind | ||
| `sed` is unsafe because the same old tag appears under many keys. | ||
|
|
||
| `/tmp/edit_image.py`: | ||
| ```python | ||
| #!/usr/bin/env python3 | ||
| # Usage: edit_image.py <yaml_file> <new_image> <key1> [key2 ...] | ||
| import re, sys | ||
| f, new_image, keys = sys.argv[1], sys.argv[2], sys.argv[3:] | ||
| lines = open(f).read().split('\n') | ||
| for key in keys: | ||
| kre = re.compile(r'^' + re.escape(key) + r':\s*$') | ||
| start = next((i for i,l in enumerate(lines) if kre.match(l)), None) | ||
| if start is None: sys.exit(f"ERROR: key not found: {key}") | ||
| img_i = None | ||
| for j in range(start+1, len(lines)): | ||
| if re.match(r'^[A-Za-z0-9._-]+:\s*$', lines[j]): break # next top-level key | ||
| m = re.match(r'^(\s+)image:\s*(.+?)\s*$', lines[j]) | ||
| if m: img_i, indent, old = j, m.group(1), m.group(2); break | ||
| if img_i is None: sys.exit(f"ERROR: no image: line for key {key}") | ||
| if old != new_image: lines[img_i] = f"{indent}image: {new_image}"; print(f"{key}: {old} -> {new_image}") | ||
| else: print(f"{key}: already {new_image} (no change)") | ||
| open(f,'w').write('\n'.join(lines)) | ||
| ``` | ||
|
|
||
| `/tmp/append_changelog.py`: | ||
| ```python | ||
| #!/usr/bin/env python3 | ||
| # Usage: append_changelog.py <changelog> <description> <key1> [key2 ...] | ||
| import sys | ||
| f, desc, keys = sys.argv[1], sys.argv[2], sys.argv[3:] | ||
| content = open(f).read().rstrip('\n') | ||
| block = ["", "- config-keys:"] + [f" - {k}" for k in keys] | ||
| block += [" description:", f' - "{desc}"', " pr-link: PRLINK_PLACEHOLDER"] | ||
| open(f,'w').write(content + '\n' + '\n'.join(block) + '\n') | ||
| ``` | ||
|
|
||
| For each family (run strictly sequentially — git checkouts can't be parallel): | ||
|
|
||
| ```bash | ||
| git checkout main -q && git reset --hard origin/main -q | ||
| branch="klaud-cold/<basekey>-<TAG>" | ||
| git checkout -b "$branch" -q | ||
|
Check warning on line 112 in .claude/commands/nuke.md
|
||
| python3 /tmp/edit_image.py <master.yaml> <NEW_IMAGE> <key> [<key>-mtp] | ||
| python3 /tmp/append_changelog.py perf-changelog.yaml "<DESC>" <key> [<key>-mtp] | ||
| git add -A | ||
| git commit -q -m "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \ | ||
| -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>" | ||
| git push -u origin "$branch" -q --force-with-lease | ||
|
Check failure on line 118 in .claude/commands/nuke.md
|
||
|
Comment on lines
+109
to
+118
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 Step 4 starts each family iteration with Extended reasoning...What the bug isThe new
Neither location checks whether the working tree is clean, neither stashes, and neither uses a temp worktree. Any uncommitted edits, staged changes, untracked-but-tracked-path overwrites, or unpushed local commits on Why existing code doesn't prevent itThere is no clean-tree precondition in the command's body. The Step 1–3 preamble is purely discovery/confirmation about recipes, not about the user's git state. The very first action of the actual mutating phase (Step 4) is the destructive reset. The N-iteration loop then repeats the reset before each new family, so even if the first iteration happened to be safe, intermediate state created mid-sweep is also clobbered. Why this is inconsistent with the project's own patternThe sibling command
Impact
Step-by-step proof
How to fix itTwo acceptable fixes, in order of preference:
|
||
| url=$(gh pr create --repo SemiAnalysisAI/InferenceX --base main --head "$branch" \ | ||
| --title "[Klaud Cold] Update <basekey>[ (+mtp)] <PHRASE> to <TAG>" \ | ||
| --body "<BODY>" --label full-sweep-enabled | grep -o 'https://github.com/[^ ]*') | ||
| # patch the changelog pr-link with the real URL, then amend + force-push | ||
| python3 - perf-changelog.yaml "$url" <<'PY' | ||
| import sys; f,u=sys.argv[1],sys.argv[2] | ||
| open(f,'w').write(open(f).read().replace("PRLINK_PLACEHOLDER",u,1)) | ||
| PY | ||
| git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease | ||
|
Check failure on line 127 in .claude/commands/nuke.md
|
||
|
Comment on lines
+119
to
+127
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 If Extended reasoning...What the bug is. The pipeline\n\n
Comment on lines
+119
to
+127
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🟡 The flow on lines 119-127 ( Extended reasoning...What the bug isOn lines 119-127 of
Why the GPU-doubling framing is wrong (refutation acknowledged)The refuting verifier is correct that this does not double GPU spend. Evaluating
Both events land in the same concurrency group with Why it's still worth flaggingThe cancelled-run-per-PR is not just cosmetic for a one-off PR, but /nuke is explicitly a batch operation that creates one PR per recipe family (the v0.22.0 example referenced in the description spans PRs #1595–#1624 — 30 PRs in a single invocation). Each invocation leaves N cancelled sweep entries in CI history, plus N Step-by-step proof
FixCleanest fix is a one-line reorder around the url=$(gh pr create --repo ... --title ... --body "<BODY>" | grep -o 'https://github.com/[^ ]*')
# ... patch perf-changelog.yaml ...
git add perf-changelog.yaml && git commit -q --amend --no-edit && git push -q --force-with-lease
gh pr edit "$url" --add-label full-sweep-enabledThe |
||
| ``` | ||
|
|
||
| Conventions: | ||
| - `<PHRASE>` = `vLLM image` / `vLLM ROCm image` / `SGLang image` / `SGLang ROCm image`. | ||
| - Title gets `(+mtp)` only when the family has an mtp sibling. | ||
| - Every PR carries the **`full-sweep-enabled`** label so CI kicks off. | ||
| - `<DESC>` = `Update <PHRASE> from <old-tag> to <TAG>` (note both tags when the | ||
| base/mtp differ, e.g. base already on target). | ||
| - PR body: | ||
| ``` | ||
| ## Summary | ||
| <DESC> | ||
|
|
||
| Recipes touched: `key1`, `key2` | ||
|
|
||
| ## Test plan | ||
| - [ ] full-sweep-enabled sweep passes. | ||
|
|
||
| 🤖 Generated with [Claude Code](https://claude.com/claude-code) | ||
| ``` | ||
|
|
||
| ## Step 5 — finish | ||
|
|
||
| Return to a clean `main` (`git checkout main && git reset --hard origin/main`). | ||
| Report a table of every PR created (number + URL + recipe), flag any special-pin | ||
| overrides, and note that each PR's sweep will run via the `full-sweep-enabled` label. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 Step 4 (line 110) runs
git reset --hard origin/main -qat the start of each family iteration without a precedinggit fetch, soorigin/mainis whatever was cached at last fetch. On a long-lived shell or after recent unrelated merges, every PR branch gets forked from a stale base, and theperf-changelog.yamlblock is appended on top of an outdated snapshot — leading to avoidable rebase/merge churn. One-line fix: prependgit fetch origin main -qbefore the reset (or fetch once at the top of the command).Extended reasoning...
What the bug is
Step 4 documents the per-family loop as:
origin/mainhere is the local refrefs/remotes/origin/main, which only updates whengit fetchis run. Nothing in the command — neither this loop nor any preceding step — runsgit fetch origin. So the base every PR branch is created from is whatever the local cache happened to be at the last fetch.Why the staleness matters
The refutation correctly notes that
/nukeitself does not merge PRs (they go out withfull-sweep-enabledand need a long CI sweep before landing), so the worst case isn't a self-race within one sweep. But two realistic cold-start scenarios remain:origin/mainlags behind realmain. Every family iteration resets to that stale ref.Klaud Coldwork, etc.) land onmainduring or just before the sweep. Each new family branch misses them and the changelog append is computed against an obsolete file.Concrete proof
Assume
perf-changelog.yamlon realorigin/mainhas been updated by PR #1626 (lands at T=0) to include a new entry block. The user starts/nukeat T+30s on a shell that last fetched at T-10min, so the localrefs/remotes/origin/mainstill points at the pre-#1626 commit.kimik2.5-int4-b300-vllm:git reset --hard origin/mainchecks out the pre-[AMD] Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark #1626 changelog.append_changelog.pyappends the new block at the end of that older file.kimik2.5-int4-b200-vllm: same thing — same stale base, same append point.perf-changelog.yamlagainst the real currentmain, which already has [AMD] Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark #1626's block. The trailing region of the file (where appends happen) is now divergent, producing textual merge conflicts that require manual rebase or GitHub auto-merge intervention.The vendor image edits don't hit this because they're scoped to unique config keys (
kimik2.5-int4-b300-vllm:only appears once), butperf-changelog.yamlis appended sequentially and is the canonical collision point.Why existing code doesn't prevent it
--force-with-leaseprotects the feature branch from being clobbered, butmainis never force-pushed; the issue is the parent commit the feature branch is forked from.git reset --hard origin/mainruns after all PRs are already created, too late to help..claude/commands/fix-klaud-cron-prs.md:50explicitly doesgit fetch origin "$BRANCH"before branch operations, showing this repo's expected hygiene.Fix
One line — either prepend
git fetch origin main -qimmediately before thegit reset --hard origin/main -qin Step 4, or fetch once at the top of the command (cheapest, since most iterations would reuse the same fetched ref).Severity
Nit. The blast radius is bounded to merge-time rebase churn on
perf-changelog.yaml, not silent data loss or broken recipes; image edits are per-unique-key. But the fix is trivial and the documentation is the contract an LLM follows verbatim, so worth tightening.