Skip to content

fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError)#1590

Open
arygupt wants to merge 1 commit into
mainfrom
fix/disagg-result-zero-guard
Open

fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError)#1590
arygupt wants to merge 1 commit into
mainfrom
fix/disagg-result-zero-guard

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 29, 2026

What

When a disagg benchmark fails server-side (warmup deadlock, MoRI/KV-transport failure, server never reaching ready), the harness still writes a result JSON — but with zeroed throughput/latency. utils/process_result.py then computes the tpot→interactivity reciprocal 1000.0 / float(tpot) with tpot == 0, raising:

ZeroDivisionError: float division by zero  (utils/process_result.py:132)

This opaque error is the dominant red signature on mi355x-disagg sweeps (e.g. #1584's DEP8/MTP jobs). It masks the real cause and makes every server-side failure look identical, so triage always requires hand-pulling slurm logs.

Fix

Guard up front, right after the result JSON is loaded: if output_throughput or total_token_throughput is 0, exit with a clear, actionable message pointing at the multinode_server_logs artifact — instead of letting the reciprocal raise.

A legitimately successful run always has positive throughput, so this never triggers on healthy results.

Why this matters

This doesn't fix the underlying MoRI warmup deadlock (that's upstream / #1585), but it converts "mystery ZeroDivisionError" into the real story on every future disagg run — the prerequisite for actually debugging AMD multinode. It would, for example, have immediately distinguished #1584 (real disagg failure) from #1581 (deterministic model-not-found, fails before any result is written).

Vendor-agnostic; affects both NVIDIA and AMD disagg result processing.


Note

Low Risk
Small guard on result ingestion only; successful runs with positive throughput are unchanged.

Overview
Adds an early check in utils/process_result.py right after the benchmark result JSON is loaded: if output_throughput or total_token_throughput is zero or missing, the script SystemExits with an explicit failure message (including the metric values and a pointer to multinode_server_logs) instead of continuing into downstream math.

That prevents the later 1000.0 / tpot interactivity conversion from raising a ZeroDivisionError when failed/disagg runs still write a result file with zeroed throughput and latency—so server-side failures (warmup deadlock, transport issues, server not ready) surface as one actionable error rather than an identical division-by-zero on every bad run.

Reviewed by Cursor Bugbot for commit bd405c1. Bugbot is set up for automated code reviews on this repo. Configure here.

…eroDivisionError

A failed/degenerate benchmark (disagg warmup deadlock, MoRI transport
failure, server never reaching ready) still writes a result JSON, but with
zeroed throughput/latency. The tpot reciprocal `1000.0 / float(value)` then
raises an opaque "ZeroDivisionError: float division by zero", masking the
real server-side cause and making every such failure look identical.

This is the dominant red signature on mi355x-disagg sweeps (e.g. #1584's
DEP8/MTP jobs). Guard up front: if output/total throughput is 0, exit with a
clear, actionable message pointing at the multinode_server_logs artifact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team May 29, 2026 22:28
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit bd405c1. Configure here.

Comment thread utils/process_result.py
f"in {result_filename}.json — the server almost certainly failed to serve "
"(disagg warmup deadlock / MoRI transport failure / server never reached "
"ready). Check the multinode_server_logs artifact for the real error."
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message references nonexistent artifact for single-node runs

Low Severity

The zero-throughput guard fires for all run types (single-node and multinode alike), but the error message unconditionally mentions disagg-specific causes ("disagg warmup deadlock / MoRI transport failure") and directs the user to the multinode_server_logs artifact. That artifact is only produced by multinode workflows. On a single-node run with zero throughput, the message points to a non-existent artifact, potentially sending an engineer on a fruitless search. The disagg variable is already available at this point and could be used to tailor the message.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit bd405c1. Configure here.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — small diagnostic improvement that converts an opaque ZeroDivisionError into an actionable SystemExit.

Extended reasoning...

Overview

Single-file change to utils/process_result.py adding a 10-line guard after the result JSON load. If output_throughput or total_token_throughput is zero/missing, the script exits with a descriptive SystemExit rather than falling through to the later 1000.0 / tpot reciprocal which would raise ZeroDivisionError. Strict diagnostic improvement — successful runs (which always have positive throughput) are unaffected.

Security risks

None. The change reads benchmark result JSON that the same script already consumed, performs numeric coercion with safe defaults (float(... or 0)), and exits early on degenerate input. No new I/O surface, no auth/crypto interaction, no shell execution.

Level of scrutiny

Low. This is benchmark/CI tooling, not production-serving code. The behavior swap is from one fatal error (ZeroDivisionError) to another fatal error (SystemExit) on the same degenerate inputs; healthy runs are bit-identical. The script is invoked by GitHub Actions workflows and only affects how a failed benchmark surfaces its error message.

Other factors

The bug hunter flagged a nit: the error message hard-codes multinode/disagg-specific guidance ("multinode_server_logs artifact", "MoRI transport failure") but the guard runs unconditionally, before the is_multinode branch. For a single-node zero-throughput failure the artifact name is wrong and the cause list is misleading. That's a real wording issue but a nit — the script still fails clearly (with the actual metric values and JSON filename), which is the main goal of the PR, and single-node zero-throughput failures are an edge case (most single-node failures short-circuit before any result JSON is written). Author can address in a follow-up if desired.

Comment thread utils/process_result.py
Comment on lines +53 to +60
if _output_tput <= 0 or _total_tput <= 0:
raise SystemExit(
"FAIL: benchmark produced no decode throughput "
f"(output_throughput={_output_tput}, total_token_throughput={_total_tput}) "
f"in {result_filename}.json — the server almost certainly failed to serve "
"(disagg warmup deadlock / MoRI transport failure / server never reached "
"ready). Check the multinode_server_logs artifact for the real error."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new zero-throughput guard at utils/process_result.py:53-60 runs unconditionally (before the is_multinode branch at line 76), but its message hard-codes guidance specific to multinode/disagg failures — "Check the multinode_server_logs artifact" and causes like "disagg warmup deadlock / MoRI transport failure". For a single-node zero-throughput failure (the workflows invoke this script with IS_MULTINODE unset), this points the user at a non-existent artifact and at irrelevant failure modes, partially defeating the PR's "actionable message" goal. Easy fix: either scope the guard to disagg/multinode, or make the artifact/cause wording conditional on is_multinode (e.g. fall back to "check the server logs").

Extended reasoning...

What

The guard added in this PR runs before the is_multinode branch:

# lines 51-60 (unconditional)
_output_tput = float(bmk_result.get('output_throughput', 0) or 0)
_total_tput  = float(bmk_result.get('total_token_throughput', 0) or 0)
if _output_tput <= 0 or _total_tput <= 0:
    raise SystemExit(
        "... Check the multinode_server_logs artifact for the real error."
    )

# line 76 — the actual mode split happens AFTER the guard fires
is_multinode = os.environ.get('IS_MULTINODE', 'false').lower() == 'true'

So the message is emitted for every zero-throughput result, including single-node runs.

Why this is misleading for single-node

  • multinode_server_logs.tar.gz is only produced by the multinode launchers (e.g. launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh, launch_b300-nv.sh). A single-node run has no such artifact, so the hint points the user at something that does not exist.
  • The listed causes — "disagg warmup deadlock / MoRI transport failure / server never reached ready" — are disagg-specific. Only the last one is generic; the first two are nonsensical for a single-node agg run.
  • The PR description even frames the motivation as disagg-specific ("Vendor-agnostic; affects both NVIDIA and AMD disagg result processing"), but the guard isn't scoped to disagg.

Step-by-step proof

  1. .github/workflows/benchmark-tmpl.yml invokes process_result.py without setting IS_MULTINODE, i.e. single-node path.
  2. Suppose that benchmark writes result.json with output_throughput: 0 (e.g. server crashed after a partial warmup, or the benchmark client wrote a zero-filled result on early exit).
  3. The script loads the JSON at line 43, then evaluates the guard at line 53.
  4. _output_tput == 0.0, so raise SystemExit(...) fires.
  5. The user sees: "Check the multinode_server_logs artifact" — but this is a single-node workflow run, where that artifact is never produced. They will spend time hunting for a tarball that does not exist, and reading "disagg warmup deadlock / MoRI transport failure" while debugging a non-disagg failure.

Impact

Real but small. The first half of the message (the metric values and the JSON filename) is universally actionable, and zero-throughput single-node failures are likely rarer than the disagg case this PR targets — the typical single-node failure mode short-circuits before any result JSON is written at all. Severity is nit, not normal.

Suggested fix

Either:

  • Generalize the wording — replace multinode_server_logs with the server logs and drop the disagg-specific cause list (or hedge it with "if disagg, ..."). One-line change.
  • Scope the artifact hint to multinode/disagg — read IS_MULTINODE (or disagg) before the guard and pick the message accordingly.

The first is simpler and keeps the guard universal, which is desirable since the underlying ZeroDivisionError it prevents exists on the single-node path too (1000.0 / float(value) at line 132 runs in both modes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant