fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError)#1590
fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError)#1590arygupt wants to merge 1 commit into
Conversation
…eroDivisionError A failed/degenerate benchmark (disagg warmup deadlock, MoRI transport failure, server never reaching ready) still writes a result JSON, but with zeroed throughput/latency. The tpot reciprocal `1000.0 / float(value)` then raises an opaque "ZeroDivisionError: float division by zero", masking the real server-side cause and making every such failure look identical. This is the dominant red signature on mi355x-disagg sweeps (e.g. #1584's DEP8/MTP jobs). Guard up front: if output/total throughput is 0, exit with a clear, actionable message pointing at the multinode_server_logs artifact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit bd405c1. Configure here.
| f"in {result_filename}.json — the server almost certainly failed to serve " | ||
| "(disagg warmup deadlock / MoRI transport failure / server never reached " | ||
| "ready). Check the multinode_server_logs artifact for the real error." | ||
| ) |
There was a problem hiding this comment.
Error message references nonexistent artifact for single-node runs
Low Severity
The zero-throughput guard fires for all run types (single-node and multinode alike), but the error message unconditionally mentions disagg-specific causes ("disagg warmup deadlock / MoRI transport failure") and directs the user to the multinode_server_logs artifact. That artifact is only produced by multinode workflows. On a single-node run with zero throughput, the message points to a non-existent artifact, potentially sending an engineer on a fruitless search. The disagg variable is already available at this point and could be used to tailor the message.
Reviewed by Cursor Bugbot for commit bd405c1. Configure here.
There was a problem hiding this comment.
LGTM — small diagnostic improvement that converts an opaque ZeroDivisionError into an actionable SystemExit.
Extended reasoning...
Overview
Single-file change to utils/process_result.py adding a 10-line guard after the result JSON load. If output_throughput or total_token_throughput is zero/missing, the script exits with a descriptive SystemExit rather than falling through to the later 1000.0 / tpot reciprocal which would raise ZeroDivisionError. Strict diagnostic improvement — successful runs (which always have positive throughput) are unaffected.
Security risks
None. The change reads benchmark result JSON that the same script already consumed, performs numeric coercion with safe defaults (float(... or 0)), and exits early on degenerate input. No new I/O surface, no auth/crypto interaction, no shell execution.
Level of scrutiny
Low. This is benchmark/CI tooling, not production-serving code. The behavior swap is from one fatal error (ZeroDivisionError) to another fatal error (SystemExit) on the same degenerate inputs; healthy runs are bit-identical. The script is invoked by GitHub Actions workflows and only affects how a failed benchmark surfaces its error message.
Other factors
The bug hunter flagged a nit: the error message hard-codes multinode/disagg-specific guidance ("multinode_server_logs artifact", "MoRI transport failure") but the guard runs unconditionally, before the is_multinode branch. For a single-node zero-throughput failure the artifact name is wrong and the cause list is misleading. That's a real wording issue but a nit — the script still fails clearly (with the actual metric values and JSON filename), which is the main goal of the PR, and single-node zero-throughput failures are an edge case (most single-node failures short-circuit before any result JSON is written). Author can address in a follow-up if desired.
| if _output_tput <= 0 or _total_tput <= 0: | ||
| raise SystemExit( | ||
| "FAIL: benchmark produced no decode throughput " | ||
| f"(output_throughput={_output_tput}, total_token_throughput={_total_tput}) " | ||
| f"in {result_filename}.json — the server almost certainly failed to serve " | ||
| "(disagg warmup deadlock / MoRI transport failure / server never reached " | ||
| "ready). Check the multinode_server_logs artifact for the real error." | ||
| ) |
There was a problem hiding this comment.
🟡 The new zero-throughput guard at utils/process_result.py:53-60 runs unconditionally (before the is_multinode branch at line 76), but its message hard-codes guidance specific to multinode/disagg failures — "Check the multinode_server_logs artifact" and causes like "disagg warmup deadlock / MoRI transport failure". For a single-node zero-throughput failure (the workflows invoke this script with IS_MULTINODE unset), this points the user at a non-existent artifact and at irrelevant failure modes, partially defeating the PR's "actionable message" goal. Easy fix: either scope the guard to disagg/multinode, or make the artifact/cause wording conditional on is_multinode (e.g. fall back to "check the server logs").
Extended reasoning...
What
The guard added in this PR runs before the is_multinode branch:
# lines 51-60 (unconditional)
_output_tput = float(bmk_result.get('output_throughput', 0) or 0)
_total_tput = float(bmk_result.get('total_token_throughput', 0) or 0)
if _output_tput <= 0 or _total_tput <= 0:
raise SystemExit(
"... Check the multinode_server_logs artifact for the real error."
)
# line 76 — the actual mode split happens AFTER the guard fires
is_multinode = os.environ.get('IS_MULTINODE', 'false').lower() == 'true'So the message is emitted for every zero-throughput result, including single-node runs.
Why this is misleading for single-node
multinode_server_logs.tar.gzis only produced by the multinode launchers (e.g.launch_h100-dgxc-slurm.sh,launch_h200-dgxc-slurm.sh,launch_b300-nv.sh). A single-node run has no such artifact, so the hint points the user at something that does not exist.- The listed causes — "disagg warmup deadlock / MoRI transport failure / server never reached ready" — are disagg-specific. Only the last one is generic; the first two are nonsensical for a single-node agg run.
- The PR description even frames the motivation as disagg-specific ("Vendor-agnostic; affects both NVIDIA and AMD disagg result processing"), but the guard isn't scoped to disagg.
Step-by-step proof
.github/workflows/benchmark-tmpl.ymlinvokesprocess_result.pywithout settingIS_MULTINODE, i.e. single-node path.- Suppose that benchmark writes
result.jsonwithoutput_throughput: 0(e.g. server crashed after a partial warmup, or the benchmark client wrote a zero-filled result on early exit). - The script loads the JSON at line 43, then evaluates the guard at line 53.
_output_tput == 0.0, soraise SystemExit(...)fires.- The user sees: "Check the multinode_server_logs artifact" — but this is a single-node workflow run, where that artifact is never produced. They will spend time hunting for a tarball that does not exist, and reading "disagg warmup deadlock / MoRI transport failure" while debugging a non-disagg failure.
Impact
Real but small. The first half of the message (the metric values and the JSON filename) is universally actionable, and zero-throughput single-node failures are likely rarer than the disagg case this PR targets — the typical single-node failure mode short-circuits before any result JSON is written at all. Severity is nit, not normal.
Suggested fix
Either:
- Generalize the wording — replace
multinode_server_logswiththe server logsand drop the disagg-specific cause list (or hedge it with "if disagg, ..."). One-line change. - Scope the artifact hint to multinode/disagg — read
IS_MULTINODE(ordisagg) before the guard and pick the message accordingly.
The first is simpler and keeps the guard universal, which is desirable since the underlying ZeroDivisionError it prevents exists on the single-node path too (1000.0 / float(value) at line 132 runs in both modes).


What
When a disagg benchmark fails server-side (warmup deadlock, MoRI/KV-transport failure, server never reaching ready), the harness still writes a result JSON — but with zeroed throughput/latency.
utils/process_result.pythen computes the tpot→interactivity reciprocal1000.0 / float(tpot)withtpot == 0, raising:This opaque error is the dominant red signature on
mi355x-disaggsweeps (e.g. #1584's DEP8/MTP jobs). It masks the real cause and makes every server-side failure look identical, so triage always requires hand-pulling slurm logs.Fix
Guard up front, right after the result JSON is loaded: if
output_throughputortotal_token_throughputis 0, exit with a clear, actionable message pointing at themultinode_server_logsartifact — instead of letting the reciprocal raise.A legitimately successful run always has positive throughput, so this never triggers on healthy results.
Why this matters
This doesn't fix the underlying MoRI warmup deadlock (that's upstream / #1585), but it converts "mystery ZeroDivisionError" into the real story on every future disagg run — the prerequisite for actually debugging AMD multinode. It would, for example, have immediately distinguished #1584 (real disagg failure) from #1581 (deterministic model-not-found, fails before any result is written).
Vendor-agnostic; affects both NVIDIA and AMD disagg result processing.
Note
Low Risk
Small guard on result ingestion only; successful runs with positive throughput are unchanged.
Overview
Adds an early check in
utils/process_result.pyright after the benchmark result JSON is loaded: ifoutput_throughputortotal_token_throughputis zero or missing, the scriptSystemExits with an explicit failure message (including the metric values and a pointer tomultinode_server_logs) instead of continuing into downstream math.That prevents the later
1000.0 / tpotinteractivity conversion from raising aZeroDivisionErrorwhen failed/disagg runs still write a result file with zeroed throughput and latency—so server-side failures (warmup deadlock, transport issues, server not ready) surface as one actionable error rather than an identical division-by-zero on every bad run.Reviewed by Cursor Bugbot for commit bd405c1. Bugbot is set up for automated code reviews on this repo. Configure here.