Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions utils/process_result.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,23 @@
with open(f'{result_filename}.json') as f:
bmk_result = json.load(f)

# A failed/degenerate benchmark (server never came up, disagg warmup deadlock,
# MoRI/KV-transport failure, etc.) still writes a result JSON, but with zeroed
# throughput and latency metrics. Detect that here and fail with the real,
# actionable reason. Otherwise the tpot reciprocal below (1000.0 / tpot) raises
# an opaque "ZeroDivisionError: float division by zero" that masks the true
# server-side cause and makes every such failure look identical.
_output_tput = float(bmk_result.get('output_throughput', 0) or 0)
_total_tput = float(bmk_result.get('total_token_throughput', 0) or 0)
if _output_tput <= 0 or _total_tput <= 0:
raise SystemExit(
"FAIL: benchmark produced no decode throughput "
f"(output_throughput={_output_tput}, total_token_throughput={_total_tput}) "
f"in {result_filename}.json — the server almost certainly failed to serve "
"(disagg warmup deadlock / MoRI transport failure / server never reached "
"ready). Check the multinode_server_logs artifact for the real error."
)

Check warning on line 60 in utils/process_result.py

View check run for this annotation

Claude / Claude Code Review

Error message hard-codes disagg/multinode guidance but guard runs unconditionally

The new zero-throughput guard at `utils/process_result.py:53-60` runs unconditionally (before the `is_multinode` branch at line 76), but its message hard-codes guidance specific to multinode/disagg failures — "Check the multinode_server_logs artifact" and causes like "disagg warmup deadlock / MoRI transport failure". For a single-node zero-throughput failure (the workflows invoke this script with `IS_MULTINODE` unset), this points the user at a non-existent artifact and at irrelevant failure mod
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message references nonexistent artifact for single-node runs

Low Severity

The zero-throughput guard fires for all run types (single-node and multinode alike), but the error message unconditionally mentions disagg-specific causes ("disagg warmup deadlock / MoRI transport failure") and directs the user to the multinode_server_logs artifact. That artifact is only produced by multinode workflows. On a single-node run with zero throughput, the message points to a non-existent artifact, potentially sending an engineer on a fruitless search. The disagg variable is already available at this point and could be used to tailor the message.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit bd405c1. Configure here.

Comment on lines +53 to +60
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new zero-throughput guard at utils/process_result.py:53-60 runs unconditionally (before the is_multinode branch at line 76), but its message hard-codes guidance specific to multinode/disagg failures — "Check the multinode_server_logs artifact" and causes like "disagg warmup deadlock / MoRI transport failure". For a single-node zero-throughput failure (the workflows invoke this script with IS_MULTINODE unset), this points the user at a non-existent artifact and at irrelevant failure modes, partially defeating the PR's "actionable message" goal. Easy fix: either scope the guard to disagg/multinode, or make the artifact/cause wording conditional on is_multinode (e.g. fall back to "check the server logs").

Extended reasoning...

What

The guard added in this PR runs before the is_multinode branch:

# lines 51-60 (unconditional)
_output_tput = float(bmk_result.get('output_throughput', 0) or 0)
_total_tput  = float(bmk_result.get('total_token_throughput', 0) or 0)
if _output_tput <= 0 or _total_tput <= 0:
    raise SystemExit(
        "... Check the multinode_server_logs artifact for the real error."
    )

# line 76 — the actual mode split happens AFTER the guard fires
is_multinode = os.environ.get('IS_MULTINODE', 'false').lower() == 'true'

So the message is emitted for every zero-throughput result, including single-node runs.

Why this is misleading for single-node

  • multinode_server_logs.tar.gz is only produced by the multinode launchers (e.g. launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh, launch_b300-nv.sh). A single-node run has no such artifact, so the hint points the user at something that does not exist.
  • The listed causes — "disagg warmup deadlock / MoRI transport failure / server never reached ready" — are disagg-specific. Only the last one is generic; the first two are nonsensical for a single-node agg run.
  • The PR description even frames the motivation as disagg-specific ("Vendor-agnostic; affects both NVIDIA and AMD disagg result processing"), but the guard isn't scoped to disagg.

Step-by-step proof

  1. .github/workflows/benchmark-tmpl.yml invokes process_result.py without setting IS_MULTINODE, i.e. single-node path.
  2. Suppose that benchmark writes result.json with output_throughput: 0 (e.g. server crashed after a partial warmup, or the benchmark client wrote a zero-filled result on early exit).
  3. The script loads the JSON at line 43, then evaluates the guard at line 53.
  4. _output_tput == 0.0, so raise SystemExit(...) fires.
  5. The user sees: "Check the multinode_server_logs artifact" — but this is a single-node workflow run, where that artifact is never produced. They will spend time hunting for a tarball that does not exist, and reading "disagg warmup deadlock / MoRI transport failure" while debugging a non-disagg failure.

Impact

Real but small. The first half of the message (the metric values and the JSON filename) is universally actionable, and zero-throughput single-node failures are likely rarer than the disagg case this PR targets — the typical single-node failure mode short-circuits before any result JSON is written at all. Severity is nit, not normal.

Suggested fix

Either:

  • Generalize the wording — replace multinode_server_logs with the server logs and drop the disagg-specific cause list (or hedge it with "if disagg, ..."). One-line change.
  • Scope the artifact hint to multinode/disagg — read IS_MULTINODE (or disagg) before the guard and pick the message accordingly.

The first is simpler and keeps the guard universal, which is desirable since the underlying ZeroDivisionError it prevents exists on the single-node path too (1000.0 / float(value) at line 132 runs in both modes).


data = {
'hw': hw,
'conc': int(bmk_result['max_concurrency']),
Expand Down