[https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by xwang233 · Pull Request #14191 · NVIDIA/TensorRT-LLM

xwang233 · 2026-05-15T18:20:30Z

Summary

Fix nvbugs/6133201: five GEN-engine YAMLs added by #13343 set max_num_tokens to exactly half of what max_batch_size × (1 + max_draft_len) requires under MTP/Eagle3. At high concurrency with attention-DP routing, the per-rank check in _prepare_tp_inputs trips, the worker dies, and the rest of the disagg job blocks forever in NIXL/UCX collectives — the benchmark client sees a stuck 0/16384 progress bar until the job-step timeout fires.

AssertionError: total_num_tokens (260) should be less than or equal to max_num_tokens (256)
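The numbers in the assertion are consistent with the per-request token cost under MTP. Below is a minimal sketch of that arithmetic, assuming every generation request on a rank consumes `1 + max_draft_len` tokens per forward step. The helper names are illustrative and are not part of TensorRT-LLM, and the 65-request figure is inferred from the logged value of 260, not taken from the bug report.

```python
def tokens_per_step(num_requests: int, max_draft_len: int) -> int:
    """Tokens one rank processes per step: 1 target token plus
    max_draft_len draft tokens for each scheduled request."""
    return num_requests * (1 + max_draft_len)


def max_requests_within_budget(max_num_tokens: int, max_draft_len: int) -> int:
    """Largest per-rank request count that still fits in max_num_tokens."""
    return max_num_tokens // (1 + max_draft_len)


# Values from the failing Qwen3-235B config: max_draft_len=3, old max_num_tokens=256.
limit = max_requests_within_budget(256, 3)  # 64 requests fit in the old budget
overflow = tokens_per_step(limit + 1, 3)    # one more request needs 260 tokens
print(limit, overflow)  # prints "64 260"
```

With `max_batch_size=128` the scheduler is allowed to seat up to 128 requests on a rank, so any step with more than 64 requests exceeds the old budget.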

This PR bumps max_num_tokens in the five mismatched files so the invariant max_num_tokens ≥ max_batch_size × (1 + max_draft_len) holds:

File	max_batch_size (mbs)	max_draft_len (mdl)	old max_num_tokens (mnt)	new max_num_tokens (mnt)
`perf/disaggregated/gb200_qwen3-235b-fp4_…_dep16_bs128_…_mtp3_con2048_ccb-NIXL.yaml`	128	3	256	512
`perf/disaggregated/gb300_qwen3-235b-fp4_…_dep16_bs128_…_mtp3_con2048_ccb-NIXL.yaml`	128	3	256	512
`perf/disaggregated/gb200_deepseek-r1-fp4_…_con2048_…_dep16_eplb0_mtp3_ccb-NIXL.yaml`	768	3	1536	3072
`perf-sanity/disaggregated/gb200_deepseek-r1-fp4_…_con2048_…_dep16_eplb0_mtp3_ccb-NIXL.yaml`	768	3	1536	3072
`perf-sanity/disaggregated/gb200_deepseek-r1-fp4_…_con2048_…_dep16_eplb288_mtp3_ccb-NIXL.yaml`	768	3	1536	3072
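The table can be checked mechanically. The sketch below recomputes the required budget for each row and confirms that the old value was half of it and the new value satisfies the invariant. The row labels are shortened for readability and are not the real file names.

```python
# (label, max_batch_size, max_draft_len, old max_num_tokens, new max_num_tokens)
ROWS = [
    ("gb200 qwen3-235b (perf)", 128, 3, 256, 512),
    ("gb300 qwen3-235b (perf)", 128, 3, 256, 512),
    ("gb200 deepseek-r1 eplb0 (perf)", 768, 3, 1536, 3072),
    ("gb200 deepseek-r1 eplb0 (perf-sanity)", 768, 3, 1536, 3072),
    ("gb200 deepseek-r1 eplb288 (perf-sanity)", 768, 3, 1536, 3072),
]


def required_tokens(max_batch_size: int, max_draft_len: int) -> int:
    """Minimum max_num_tokens so a full batch fits in one step."""
    return max_batch_size * (1 + max_draft_len)


for label, mbs, mdl, old, new in ROWS:
    need = required_tokens(mbs, mdl)
    # old_ok is False for every row; new_ok is True for every row.
    print(f"{label}: need>={need} old_ok={old >= need} new_ok={new >= need}")
```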

The bug report covers only the first row (disagg-e2e-gb200_qwen3-235b-fp4_…_con2048_…_mtp3_ccb-NIXL), but comment 2 of the nvbug already confirmed that the GB300 sibling reproduces, and the three DeepSeek-R1 entries carry the same latent arithmetic mismatch. Fixing all five together prevents the same failure from firing on a different test in the next run.

This is a test-config fix only — no source-code changes. A separate follow-up should enforce the invariant either in LlmArgs validation or in the ADP router's per-rank scheduling so future YAMLs cannot land with this inconsistency silently.
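One possible shape for such a follow-up check is sketched below. It is an assumption for illustration only: `GenEngineLimits` and `validate_token_budget` are hypothetical names, not the real `LlmArgs` schema or any existing TensorRT-LLM API, and the PR leaves open whether the check belongs in argument validation or in the ADP router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenEngineLimits:
    """Hypothetical container for the three fields involved in the invariant."""
    max_batch_size: int
    max_num_tokens: int
    max_draft_len: int = 0  # 0 means no speculative decoding


def validate_token_budget(limits: GenEngineLimits) -> None:
    """Raise ValueError if a full batch cannot fit in one forward step."""
    required = limits.max_batch_size * (1 + limits.max_draft_len)
    if limits.max_num_tokens < required:
        raise ValueError(
            f"max_num_tokens ({limits.max_num_tokens}) must be >= "
            f"max_batch_size * (1 + max_draft_len) = {required}"
        )


# The fixed Qwen3-235B values pass silently.
validate_token_budget(GenEngineLimits(max_batch_size=128, max_num_tokens=512, max_draft_len=3))

# The old values are rejected at load time instead of at the first overloaded step.
try:
    validate_token_budget(GenEngineLimits(max_batch_size=128, max_num_tokens=256, max_draft_len=3))
except ValueError as err:
    print(err)
```

Failing at configuration time turns a hung multi-node job into an immediate, readable error.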

Test plan

Reproduced and verified on GB200 against the original artifact's source SHA + container image: with max_num_tokens=512 the failing test (perf/test_perf_sanity.py::test_e2e[disagg-e2e-gb200_qwen3-235b-fp4_…_con2048_…_mtp3_ccb-NIXL]) completes in 332 s with non-zero throughput; without the bump the assertion fires within seconds of the first NIXL transfer.

Summary by CodeRabbit

Chores
- Updated five disaggregated performance and sanity benchmark configurations for DeepSeek-R1 FP4 and Qwen3-235B FP4, raising the GEN worker max_num_tokens limit (1536 to 3072 and 256 to 512, respectively).

coderabbitai · 2026-05-15T18:23:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 55659c5 and 9349aeb.

📒 Files selected for processing (5)

tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL.yaml
tests/scripts/perf/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml
tests/scripts/perf/disaggregated/gb200_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml
tests/scripts/perf/disaggregated/gb300_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml

Walkthrough

Five performance benchmark YAML configuration files update worker generation token limits: three GB200 DeepSeek-R1 FP4 configurations increase max_num_tokens from 1536 to 3072, and two Qwen3-235b FP4 configurations increase the limit from 256 to 512. All changes modify the same field at line 51 in each file's worker configuration section.

Changes

Performance benchmark worker token limit updates

Layer / File(s)	Summary
Worker generation token configuration updates `tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml`, `tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL.yaml`, `tests/scripts/perf/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL.yaml`, `tests/scripts/perf/disaggregated/gb200_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml`, `tests/scripts/perf/disaggregated/gb300_qwen3-235b-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb0_mtp3_con2048_ccb-NIXL.yaml`	Five YAML benchmark configurations update `worker_config.gen.max_num_tokens`: three DeepSeek-R1 FP4 files increase the limit from 1536 to 3072, and two Qwen3-235b FP4 files increase from 256 to 512. All changes occur at line 51 in each file's worker configuration section.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

kaiyux
Tabrizian
bo-nv
qiaoxj07

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly describes the main change: bumping GEN max_num_tokens in disaggregated performance YAML files to fix nvbugs/6133201.
Description check	✅ Passed	The description provides a clear explanation of the issue, the fix, test coverage validation, and a detailed table of changes. It follows the template structure with Summary, Test Coverage sections and addresses the PR intent comprehensively.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

xwang233 · 2026-05-15T19:10:50Z

/bot run

tensorrt-cicd · 2026-05-15T19:16:13Z

PR_Github #48628 [ run ] triggered by Bot. Commit: 9349aeb Link to invocation

tensorrt-cicd · 2026-05-15T21:46:18Z

PR_Github #48628 [ run ] completed with state FAILURE. Commit: 9349aeb
/LLM/main/L0_MergeRequest_PR pipeline #38410 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xwang233 · 2026-05-16T02:15:19Z

/bot run

tensorrt-cicd · 2026-05-16T02:21:33Z

PR_Github #48665 [ run ] triggered by Bot. Commit: 9349aeb Link to invocation

tensorrt-cicd · 2026-05-16T03:22:02Z

PR_Github #48665 [ run ] completed with state FAILURE. Commit: 9349aeb
/LLM/main/L0_MergeRequest_PR pipeline #38445 completed with status: 'FAILURE'

…YAMLs

Under MTP/Eagle3 with max_draft_len=D, each scheduled request consumes up to (1+D) tokens per forward step, so the GEN engine's per-step token budget must satisfy max_num_tokens >= max_batch_size * (1 + max_draft_len). The qwen3-235b and deepseek-r1 disagg-perf YAMLs added by NVIDIA#13343 set this budget to exactly half of what max_batch_size declares.

When attention-DP routing seats more than (max_num_tokens / (1+D)) requests on a single rank (routine at concurrency=2048), the per-rank check in _prepare_tp_inputs trips:

AssertionError: total_num_tokens (260) should be less than or equal to max_num_tokens (256)

The worker dies; surviving ranks block forever in NIXL/UCX collectives waiting for it; the benchmark client streams a "0/16384" progress bar until SLURM's job-step timeout fires. This is the failure captured by nvbugs/6133201.

Fix the arithmetic in all five mismatched configs (qwen3-235b: 256->512; deepseek-r1: 1536->3072).

Verified on lyris GB200 against the original artifact's SHA + image: with the bump the test completes in 332s with non-zero throughput; without it, the assertion fires within seconds of the first NIXL transfer.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 · 2026-05-19T22:10:07Z

/bot run

tensorrt-cicd · 2026-05-19T22:17:28Z

PR_Github #49269 [ run ] triggered by Bot. Commit: 5d9b327 Link to invocation

tensorrt-cicd · 2026-05-20T03:12:42Z

PR_Github #49269 [ run ] completed with state FAILURE. Commit: 5d9b327
/LLM/main/L0_MergeRequest_PR pipeline #38935 completed with status: 'FAILURE'

xwang233 · 2026-05-20T16:47:39Z

/bot skip --comment "config-only YAML bump for post-merge perf-sanity + QA disagg tests; no pre-merge coverage"

tensorrt-cicd · 2026-05-20T16:57:56Z

PR_Github #49457 [ skip ] triggered by Bot. Commit: 5d9b327 Link to invocation

tensorrt-cicd · 2026-05-20T17:04:41Z

PR_Github #49457 [ skip ] completed with state SUCCESS. Commit: 5d9b327
Skipping testing for commit 5d9b327

Link to invocation

…YAMLs (NVIDIA#14191) Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

github-actions Bot assigned xwang233 May 15, 2026

xwang233 requested review from chenfeiz0326 and fredricz-20070104 and removed request for fredricz-20070104 May 15, 2026 19:13

xwang233 force-pushed the nvbugs/6133201-fix-max-num-tokens branch from 9349aeb to 5d9b327 Compare May 19, 2026 22:09

chenfeiz0326 approved these changes May 20, 2026

xwang233 merged commit 7e23597 into NVIDIA:main May 20, 2026
7 checks passed

xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026

[https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf …

59da469

…YAMLs (NVIDIA#14191) Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026

[https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf …

d235470

…YAMLs (NVIDIA#14191) Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
