
[https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs #14191

Merged
xwang233 merged 1 commit into NVIDIA:main from xwang233:nvbugs/6133201-fix-max-num-tokens
May 20, 2026

Conversation

@xwang233
Collaborator

@xwang233 xwang233 commented May 15, 2026

Summary

Fix nvbugs/6133201: five GEN-engine YAMLs added by #13343 set max_num_tokens to exactly half of what max_batch_size × (1 + max_draft_len) requires under MTP/Eagle3. At high concurrency with attention-DP routing, a single rank can be assigned more requests than that budget covers, so the per-rank check in _prepare_tp_inputs trips and the worker dies. The rest of the disagg job then blocks forever in NIXL/UCX collectives, and the benchmark client sees a stuck 0/16384 progress bar until the job-step timeout fires.

AssertionError: total_num_tokens (260) should be less than or equal to max_num_tokens (256)

This PR bumps max_num_tokens in the five mismatched files so the invariant max_num_tokens ≥ max_batch_size × (1 + max_draft_len) holds:

| File | max_batch_size (mbs) | max_draft_len (mdl) | old max_num_tokens (mnt) | new max_num_tokens (mnt) |
|---|---|---|---|---|
| perf/disaggregated/gb200_qwen3-235b-fp4_…_dep16_bs128_…_mtp3_con2048_ccb-NIXL.yaml | 128 | 3 | 256 | 512 |
| perf/disaggregated/gb300_qwen3-235b-fp4_…_dep16_bs128_…_mtp3_con2048_ccb-NIXL.yaml | 128 | 3 | 256 | 512 |
| perf/disaggregated/gb200_deepseek-r1-fp4_…_con2048_…_dep16_eplb0_mtp3_ccb-NIXL.yaml | 768 | 3 | 1536 | 3072 |
| perf-sanity/disaggregated/gb200_deepseek-r1-fp4_…_con2048_…_dep16_eplb0_mtp3_ccb-NIXL.yaml | 768 | 3 | 1536 | 3072 |
| perf-sanity/disaggregated/gb200_deepseek-r1-fp4_…_con2048_…_dep16_eplb288_mtp3_ccb-NIXL.yaml | 768 | 3 | 1536 | 3072 |
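
The arithmetic behind the table can be checked with a short sketch. It assumes every scheduled generation request consumes exactly 1 + max_draft_len tokens per step. The row labels are shorthand invented here for the five files, not their real paths, and `required_tokens` is a hypothetical helper rather than anything in TensorRT-LLM.

```python
# Rows transcribed from the table: (label, max_batch_size, max_draft_len, old mnt, new mnt).
# Labels are illustrative shorthand, not the actual YAML paths.
ROWS = [
    ("perf gb200 qwen3-235b", 128, 3, 256, 512),
    ("perf gb300 qwen3-235b", 128, 3, 256, 512),
    ("perf gb200 deepseek-r1 eplb0", 768, 3, 1536, 3072),
    ("perf-sanity gb200 deepseek-r1 eplb0", 768, 3, 1536, 3072),
    ("perf-sanity gb200 deepseek-r1 eplb288", 768, 3, 1536, 3072),
]


def required_tokens(max_batch_size: int, max_draft_len: int) -> int:
    """Token budget needed when every request carries 1 target token plus all drafts."""
    return max_batch_size * (1 + max_draft_len)


for label, mbs, mdl, old, new in ROWS:
    need = required_tokens(mbs, mdl)
    # old_ok is False for all five rows; new_ok is True for all five.
    print(f"{label}: need={need} old_ok={old >= need} new_ok={new >= need}")
```

In every row the old value is half of the required budget and the new value matches it exactly.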

The bug report only covers the first row (disagg-e2e-gb200_qwen3-235b-fp4_…_con2048_…_mtp3_ccb-NIXL), but the GB300 sibling was already confirmed to reproduce in comment 2 of the nvbug, and the three DeepSeek-R1 entries have the same latent arithmetic mismatch. Fixing all five together prevents the same failure from firing on a different test in the next run.

This is a test-config fix only; there are no source-code changes. A separate follow-up should enforce the invariant, either in LlmArgs validation or in the ADP router's per-rank scheduling, so that future YAMLs cannot silently land with this inconsistency.
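
As a rough illustration of what such a follow-up check could look like, the sketch below validates the three values up front and fails with a readable message. `check_gen_token_budget` is a hypothetical standalone helper; it is not the LlmArgs API, and where such a check would actually live is left open by this PR.

```python
def check_gen_token_budget(
    max_batch_size: int, max_draft_len: int, max_num_tokens: int
) -> None:
    """Raise ValueError if the token budget cannot hold a full speculative batch.

    Hypothetical helper for illustration only; not part of TensorRT-LLM.
    """
    required = max_batch_size * (1 + max_draft_len)
    if max_num_tokens < required:
        raise ValueError(
            f"max_num_tokens ({max_num_tokens}) must be >= "
            f"max_batch_size * (1 + max_draft_len) = {required}"
        )


# The fixed Qwen3 config passes; the original value is rejected at load time.
check_gen_token_budget(max_batch_size=128, max_draft_len=3, max_num_tokens=512)
try:
    check_gen_token_budget(max_batch_size=128, max_draft_len=3, max_num_tokens=256)
except ValueError as err:
    print(err)
```

Failing at configuration time would surface the mismatch before any worker starts, instead of as a mid-run assertion followed by a hang.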

Test plan

  • Reproduced and verified on GB200 against the original artifact's source SHA + container image: with max_num_tokens=512 the failing test (perf/test_perf_sanity.py::test_e2e[disagg-e2e-gb200_qwen3-235b-fp4_…_con2048_…_mtp3_ccb-NIXL]) completes in 332 s with non-zero throughput; without the bump the assertion fires within seconds of the first NIXL transfer.

Summary by CodeRabbit

  • Chores
    • Updated five disaggregated perf and perf-sanity benchmark configurations for DeepSeek-R1 FP4 and Qwen3-235B FP4. Raised the generation worker's max_num_tokens so that it covers max_batch_size × (1 + max_draft_len).


@coderabbitai
Contributor

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 11ac7cc8-d4c7-4fb4-9ecb-65beb66a5799

📥 Commits

Reviewing files that changed from the base of the PR and between 55659c5 and 9349aeb.

📒 Files selected for processing (5)
  • tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf/disaggregated/gb200_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml
  • tests/scripts/perf/disaggregated/gb300_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml

📝 Walkthrough


Five performance benchmark YAML configuration files update worker generation token limits: three GB200 DeepSeek-R1 FP4 configurations increase max_num_tokens from 1536 to 3072, and two Qwen3-235b FP4 configurations increase the limit from 256 to 512. All changes modify the same field at line 51 in each file's worker configuration section.

Changes

Performance benchmark worker token limit updates

Layer / File(s) Summary
Worker generation token configuration updates
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml, tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL.yaml, tests/scripts/perf/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml, tests/scripts/perf/disaggregated/gb200_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml, tests/scripts/perf/disaggregated/gb300_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml
Five YAML benchmark configurations update worker_config.gen.max_num_tokens: three DeepSeek-R1 FP4 files increase the limit from 1536 to 3072, and two Qwen3-235b FP4 files increase from 256 to 512. All changes occur at line 51 in each file's worker configuration section.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • kaiyux
  • Tabrizian
  • bo-nv
  • qiaoxj07
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly describes the main change: bumping GEN max_num_tokens in disaggregated performance YAML files to fix nvbugs/6133201. |
| Description check | ✅ Passed | The description provides a clear explanation of the issue, the fix, test coverage validation, and a detailed table of changes. It follows the template structure with Summary, Test Coverage sections and addresses the PR intent comprehensively. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@xwang233
Collaborator Author

/bot run

@xwang233 xwang233 requested review from chenfeiz0326 and fredricz-20070104 and removed request for fredricz-20070104 May 15, 2026 19:13
@tensorrt-cicd
Collaborator

PR_Github #48628 [ run ] triggered by Bot. Commit: 9349aeb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48628 [ run ] completed with state FAILURE. Commit: 9349aeb
/LLM/main/L0_MergeRequest_PR pipeline #38410 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xwang233
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #48665 [ run ] triggered by Bot. Commit: 9349aeb Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48665 [ run ] completed with state FAILURE. Commit: 9349aeb
/LLM/main/L0_MergeRequest_PR pipeline #38445 completed with status: 'FAILURE'


…YAMLs

Under MTP/Eagle3 with max_draft_len=D, each scheduled request consumes
up to (1+D) tokens per forward step, so the GEN engine's per-step token
budget must satisfy max_num_tokens >= max_batch_size * (1 + max_draft_len).
The qwen3-235b and deepseek-r1 disagg-perf YAMLs added by NVIDIA#13343 set this
budget to exactly half of what max_batch_size declares.

When attention-DP routing seats more than (max_num_tokens / (1+D))
requests on a single rank (routine at concurrency=2048), the per-rank
check in _prepare_tp_inputs trips:

    AssertionError: total_num_tokens (260) should be less than or equal
    to max_num_tokens (256)
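
The numbers in this assertion follow from the same per-request cost. A minimal sketch, assuming each request on the rank consumes the full 1 + max_draft_len tokens (the helper names are hypothetical and do not mirror the real _prepare_tp_inputs code):

```python
def max_requests_within_budget(max_num_tokens: int, max_draft_len: int) -> int:
    """Largest per-rank request count that still fits the token budget."""
    return max_num_tokens // (1 + max_draft_len)


def tokens_scheduled(num_requests: int, max_draft_len: int) -> int:
    """Tokens in one forward step if every request uses all its draft slots."""
    return num_requests * (1 + max_draft_len)


limit = max_requests_within_budget(max_num_tokens=256, max_draft_len=3)
print(limit)                            # 64 requests fit exactly
print(tokens_scheduled(limit + 1, 3))   # 260, which exceeds the 256 budget
```

One request beyond the 64 that fit is enough to produce the 260-token total reported in the assertion.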

The worker dies; surviving ranks block forever in NIXL/UCX collectives
waiting for it; the benchmark client streams a "0/16384" progress bar
until SLURM's job-step timeout fires. This is the failure captured by
nvbugs/6133201.

Fix the arithmetic in all five mismatched configs (qwen3-235b: 256->512;
deepseek-r1: 1536->3072). Verified on lyris GB200 against the original
artifact's SHA + image: with the bump the test completes in 332s with
non-zero throughput; without it, the assertion fires within seconds of
the first NIXL transfer.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233 xwang233 force-pushed the nvbugs/6133201-fix-max-num-tokens branch from 9349aeb to 5d9b327 Compare May 19, 2026 22:09
@xwang233
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49269 [ run ] triggered by Bot. Commit: 5d9b327 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49269 [ run ] completed with state FAILURE. Commit: 5d9b327
/LLM/main/L0_MergeRequest_PR pipeline #38935 completed with status: 'FAILURE'


@xwang233
Collaborator Author

/bot skip --comment "config-only YAML bump for post-merge perf-sanity + QA disagg tests; no pre-merge coverage"

@tensorrt-cicd
Collaborator

PR_Github #49457 [ skip ] triggered by Bot. Commit: 5d9b327 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49457 [ skip ] completed with state SUCCESS. Commit: 5d9b327
Skipping testing for commit 5d9b327

Link to invocation

@xwang233 xwang233 merged commit 7e23597 into NVIDIA:main May 20, 2026
7 checks passed
xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026
…YAMLs (NVIDIA#14191)

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
…YAMLs (NVIDIA#14191)

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

3 participants