-
Notifications
You must be signed in to change notification settings - Fork 179
fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError) #1590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError) #1590
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -42,6 +42,23 @@ | |
| with open(f'{result_filename}.json') as f: | ||
| bmk_result = json.load(f) | ||
|
|
||
| # A failed/degenerate benchmark (server never came up, disagg warmup deadlock, | ||
| # MoRI/KV-transport failure, etc.) still writes a result JSON, but with zeroed | ||
| # throughput and latency metrics. Detect that here and fail with the real, | ||
| # actionable reason. Otherwise the tpot reciprocal below (1000.0 / tpot) raises | ||
| # an opaque "ZeroDivisionError: float division by zero" that masks the true | ||
| # server-side cause and makes every such failure look identical. | ||
| _output_tput = float(bmk_result.get('output_throughput', 0) or 0) | ||
| _total_tput = float(bmk_result.get('total_token_throughput', 0) or 0) | ||
| if _output_tput <= 0 or _total_tput <= 0: | ||
| raise SystemExit( | ||
| "FAIL: benchmark produced no decode throughput " | ||
| f"(output_throughput={_output_tput}, total_token_throughput={_total_tput}) " | ||
| f"in {result_filename}.json — the server almost certainly failed to serve " | ||
| "(disagg warmup deadlock / MoRI transport failure / server never reached " | ||
| "ready). Check the multinode_server_logs artifact for the real error." | ||
| ) | ||
|
Check warning on line 60 in utils/process_result.py
|
||
|
Comment on lines
+53
to
+60
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🟡 The new zero-throughput guard at Extended reasoning...WhatThe guard added in this PR runs before the # lines 51-60 (unconditional)
_output_tput = float(bmk_result.get('output_throughput', 0) or 0)
_total_tput = float(bmk_result.get('total_token_throughput', 0) or 0)
if _output_tput <= 0 or _total_tput <= 0:
raise SystemExit(
"... Check the multinode_server_logs artifact for the real error."
)
# line 76 — the actual mode split happens AFTER the guard fires
is_multinode = os.environ.get('IS_MULTINODE', 'false').lower() == 'true'So the message is emitted for every zero-throughput result, including single-node runs. Why this is misleading for single-node
Step-by-step proof
ImpactReal but small. The first half of the message (the metric values and the JSON filename) is universally actionable, and zero-throughput single-node failures are likely rarer than the disagg case this PR targets — the typical single-node failure mode short-circuits before any result JSON is written at all. Severity is nit, not normal. Suggested fixEither:
The first is simpler and keeps the guard universal, which is desirable since the underlying |
||
|
|
||
| data = { | ||
| 'hw': hw, | ||
| 'conc': int(bmk_result['max_concurrency']), | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error message references nonexistent artifact for single-node runs
Low Severity
The zero-throughput guard fires for all run types (single-node and multinode alike), but the error message unconditionally mentions disagg-specific causes ("disagg warmup deadlock / MoRI transport failure") and directs the user to the
multinode_server_logsartifact. That artifact is only produced by multinode workflows. On a single-node run with zero throughput, the message points to a non-existent artifact, potentially sending an engineer on a fruitless search. Thedisaggvariable is already available at this point and could be used to tailor the message.Reviewed by Cursor Bugbot for commit bd405c1. Configure here.