
Add DSV4 GB300 wide-EP sweep configs (EP=12/16/24/32/40) #1586

Open
yhyang201 wants to merge 2 commits into main from add-dsv4-gb300-weiliang-wideep-sweep

yhyang201 (Collaborator) commented May 29, 2026

Summary

  • Add 5 new search-space entries and recipe files for the DSV4 GB300 non-MTP wide-EP sweep, matching the srt-slurm PR#173 topology (18 nodes total).
  • EP sizes: 12/16/24/32/40, decode nodes: 3/4/6/8/10, concurrencies: 12000/8192/3000/2500/2048.
  • Uses InferenceX env vars and sglang_config (megamoe, W4A4, nightly-20260520 image), with Weiliang's tuned decode params (swa-full-tokens-ratio=0.20, max-running-requests=18432).

Note

Low Risk
Benchmark and CI config only (YAML recipes and search-space entries); no runtime application or auth changes.

Overview
Extends dsv4-fp4-gb300-dynamo-sglang with five wide expert-parallel (EP) sweep points on a fixed 18-node disaggregated layout (srt-slurm PR#173 topology), each wired to a new CONFIG_FILE recipe under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/.

The sweep varies decode EP/TP (12→40) and prefill worker count while keeping prefill at TP/EP 4 with dp-attn: EP=12 (15P+3D, conc 12000), EP=16 (14P+4D, 8192), EP=24 (12P+6D, 3000), EP=32 (10P+8D, 2500), EP=40 (8P+10D, 2048). New recipes use the nightly-20260520 SGLang image, Dynamo install, 8k/1k sa-bench, and tuned decode settings (swa-full-tokens-ratio=0.20, max-running-requests=18432, moe-dense-tp-size=1); EP=40 adds ep-num-redundant-experts: 16.
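The node accounting for the five sweep points can be sanity-checked with a short script. This is only a sketch: the tuple layout and the `check_topology` helper are hypothetical, and it assumes that each prefill worker (TP/EP 4) occupies one node and that each node has 4 GPUs, which is what the EP sizes and decode node counts listed above imply.

```python
# Sweep points as listed in the overview:
# (decode EP, prefill workers, decode nodes, concurrency).
SWEEP = [
    (12, 15, 3, 12000),
    (16, 14, 4, 8192),
    (24, 12, 6, 3000),
    (32, 10, 8, 2500),
    (40, 8, 10, 2048),
]

TOTAL_NODES = 18    # fixed layout stated in the PR
GPUS_PER_NODE = 4   # assumption: inferred from EP / decode nodes in the summary


def check_topology(sweep=SWEEP, total_nodes=TOTAL_NODES, gpus_per_node=GPUS_PER_NODE):
    """Return a list of human-readable problems; empty if every point is consistent."""
    problems = []
    for ep, prefill, decode, conc in sweep:
        # Assumption: one prefill worker per node, so prefill + decode = total nodes.
        if prefill + decode != total_nodes:
            problems.append(f"EP={ep}: {prefill}P+{decode}D != {total_nodes} nodes")
        # Decode EP should span all GPUs on the decode nodes.
        if decode * gpus_per_node != ep:
            problems.append(f"EP={ep}: {decode} decode nodes x {gpus_per_node} GPUs != {ep}")
    return problems


print(check_topology())
```

An empty list means every sweep point adds up to 18 nodes and its decode EP matches the GPU count of its decode nodes under the stated assumptions.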

perf-changelog.yaml documents the new config keys and parameter alignment.

Reviewed by Cursor Bugbot for commit d5728f5.

Adds 5 new search-space entries and recipe files matching
srt-slurm PR#173 wide-EP sweep topology (18 nodes total).
@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from their respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

ep-num-redundant-experts: 16
enable-dp-attention: true
enable-dp-lm-head: true
max-running-requests: 18432
🟡 Nit: max-running-requests: 18432 is not divisible by data-parallel-size: 40 (18432/40 = 460.8), unlike the four sibling EP=12/16/24/32 files added in this same PR which all divide cleanly. Practical impact here is zero since the benchmark sets concurrencies: 2048 (far below the cap), but for consistency with the sibling files consider 18400 (= 460×40) or 18480 (= 462×40).

Extended reasoning...

Bug

In disagg-gb300-8p1d-dep4-dep40-18-c2048.yaml (decode block, line 148), max-running-requests: 18432 is set alongside data-parallel-size: 40. But 18432 / 40 = 460.8 — not an integer.

Why this looks like a copy/paste oversight

The value 18432 = 192 × 96, where 96 = lcm(12, 16, 24, 32), is the LCM-aligned choice that divides evenly across the other four sweep points added in this same PR:

| File (EP) | dp_size | 18432 / dp_size |
| --- | --- | --- |
| dep12 (c12000) | 12 | 1536 |
| dep16 (c8192) | 16 | 1152 |
| dep24 (c3000) | 24 | 768 |
| dep32 (c2500) | 32 | 576 |
| dep40 (c2048) | 40 | 460.8 ← broken |

Every other GB300 recipe currently in the tree also has max-running-requests exactly divisible by its data-parallel-size. The EP=40 file is the only outlier, and 40 was clearly just missed when 18432 was selected for the sweep.

Runtime behavior

SGLang's DP-attention path floor-divides max_running_requests across DP ranks rather than erroring out, so this won't crash — it just results in 460 per rank × 40 = 18400 effective capacity (32 slots silently dropped, ~0.17%).
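The arithmetic behind that capacity figure can be reproduced as follows. This is a sketch of the floor-division behavior as described in this comment, not SGLang's actual code; `effective_capacity` is a hypothetical helper name.

```python
def effective_capacity(max_running_requests: int, dp_size: int):
    """Model a floor-divided split of the request cap across DP ranks.

    Returns (per-rank slots, effective total, slots lost to rounding).
    """
    per_rank = max_running_requests // dp_size   # floor division per rank
    effective = per_rank * dp_size               # what the ranks can hold together
    dropped = max_running_requests - effective   # remainder that no rank receives
    return per_rank, effective, dropped


per_rank, effective, dropped = effective_capacity(18432, 40)
print(per_rank, effective, dropped)           # 460 18400 32
print(f"{dropped / 18432:.2%} of the cap")    # 0.17% of the cap
```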

Why this is still worth flagging (but only as a nit)

Addressing the refutation: yes, the benchmark sets concurrencies: 2048, which is ~9× below either 18432 or 18400, so the cap is genuinely never reached and no benchmark number will change. There is no functional regression and no test will fail. That is exactly why this is filed as a nit, not as a blocking issue.

The reason it's still worth a one-line fix:

  1. The four sibling files in this same PR observe the divisibility invariant — the EP=40 file is internally inconsistent with its own cohort.
  2. Anyone copying this recipe to a different concurrency (e.g., raising concurrencies to actually exercise the cap, which is a natural follow-up sweep) will inherit the rounding inconsistency.
  3. The fix is a one-line change: 18432 → 18400 (or 18480) on the single line at file:148.

Step-by-step proof

  1. Open benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-8p1d-dep4-dep40-18-c2048.yaml.
  2. In the backend.sglang_config.decode block, observe:
    • data-parallel-size: 40 (line ~144)
    • max-running-requests: 18432 (line 148)
  3. Compute 18432 ÷ 40 = 460.8 → not an integer.
  4. Repeat the same check on the four sibling files added in this PR (dep12/16/24/32) — all evaluate to clean integers (1536 / 1152 / 768 / 576). The invariant holds for every other file in the PR and every pre-existing GB300 recipe in the tree.
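Steps 3 and 4 can be scripted instead of checked by hand. The sketch below hard-codes the cap and the five `data-parallel-size` values quoted in this comment rather than parsing the YAML recipes; `divisibility_report` is a hypothetical helper.

```python
MAX_RUNNING_REQUESTS = 18432
DP_SIZES = [12, 16, 24, 32, 40]   # data-parallel-size per sweep point, as quoted above


def divisibility_report(cap: int, dp_sizes):
    """Map each dp_size to (quotient, remainder) of cap / dp_size."""
    return {dp: divmod(cap, dp) for dp in dp_sizes}


for dp, (quotient, remainder) in divisibility_report(MAX_RUNNING_REQUESTS, DP_SIZES).items():
    status = "ok" if remainder == 0 else f"NOT divisible (remainder {remainder})"
    print(f"dp_size={dp:>2}: {quotient:>4} per rank, {status}")
```

Only `dp_size=40` reports a nonzero remainder (32); the other four divide cleanly.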

Suggested fix

# decode block, line 148
max-running-requests: 18400   # = 460 × 40, matches the per-rank floor SGLang would compute anyway
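For anyone adapting the recipe to another `data-parallel-size`, a replacement value can be derived rather than guessed. The helper below is a hypothetical sketch that returns the closest multiples of the DP size at or below and at or above a given cap.

```python
def nearest_multiples(cap: int, dp_size: int):
    """Return the largest multiple of dp_size <= cap and the smallest >= cap."""
    below = (cap // dp_size) * dp_size
    above = below if below == cap else below + dp_size
    return below, above


print(nearest_multiples(18432, 40))   # (18400, 18440)
```

The lower value, 18400, is the one suggested above. The alternative 18480 is also a multiple of 40 (462 × 40), though it is not the nearest one above 18432.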
