
ci: surface crashed-test failures and add per-step summaries (cpp, python, thirdparty)#1191

Merged
rapids-bot[bot] merged 3 commits into NVIDIA:main from rgsl888prabhu:ci/cpp-test-crash-visibility
May 12, 2026

Conversation

@rgsl888prabhu (Collaborator) commented May 8, 2026

Summary

When a test runner — gtest binary or pytest — is killed by a signal mid-run, it doesn't finalize its --gtest_output=xml: / --junitxml output. ci/utils/nightly_report.py classifies tests purely from those XML files, so a mid-run crash with no other captured failures was reported as "All tests passed." while the script exited non-zero. The GitHub step was correctly red, but the textual summary lied.

Observed in a recent cpp run: PDLP_TEST reported five [ FAILED ] cases and then segfaulted, but the structured summary said Classification: 459 passed, 0 failed, 0 errors, 0 flaky, 0 skipped and concluded All tests passed.

Changes

Crash-XML synthesis (the actual fix)

  • ci/utils/crash_helpers.sh — add write_pytest_crash_marker helper that writes <junitxml>-crash.xml on pytest signal death, alongside (not over) any partial XML pytest may have emitted.
  • ci/run_ctests.sh — non-nightly crash path now calls write_crash_xml for the gtest binary; tracks failed binaries in a FAILED_BINARIES array and prints them at end of run.
  • ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh — call write_pytest_crash_marker on the non-nightly crash path.
  • ci/thirdparty-testing/run_pulp_tests.sh, run_pyomo_tests.sh, run_cvxpy_tests.sh — capture pytest's exit code and synthesize a crash XML on signal death.
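The helper's shape can be sketched roughly as follows. The function name `write_pytest_crash_marker` and the `-crash.xml` suffix come from this PR's description, but the body here (the filename derivation and the exact XML fields) is an illustrative guess, not the real `ci/utils/crash_helpers.sh`:

```shell
# Hypothetical sketch of the crash-marker helper; the real
# ci/utils/crash_helpers.sh may differ in naming and detail.
write_pytest_crash_marker() {
  local junit_xml="$1" rc="$2"
  # Write to a separate file so any partial XML pytest emitted survives.
  local marker="${junit_xml%.xml}-crash.xml"
  # Emit a one-test JUnit file so the reporter sees a failure even
  # though pytest never finalized its own --junitxml output.
  cat > "${marker}" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="pytest-crash" tests="1" failures="1">
  <testcase classname="crash" name="process_crash">
    <failure message="pytest crashed with exit code ${rc}"/>
  </testcase>
</testsuite>
EOF
}
```

A runner would call it as, e.g., `write_pytest_crash_marker "results/python-tests.xml" "${pytest_rc}"` on the crash path.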

Per-step failure summaries

  • ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh — track each major step in a FAILED_STEPS array via || FAILED_STEPS+=(...) and print a ==== FAILED TEST STEPS ==== block before the final exit ${EXITCODE}.
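A minimal sketch of the `|| FAILED_STEPS+=(...)` pattern described above; the `run_step` wrapper and the step names are invented for illustration and are not taken from the real scripts:

```shell
EXITCODE=0
FAILED_STEPS=()

# Run a step, remembering its name if it fails instead of aborting.
run_step() {
  local name="$1"; shift
  "$@" || { EXITCODE=1; FAILED_STEPS+=("${name}"); }
}

run_step "cpp-gtests" true    # placeholder for a real passing step
run_step "smoke-test" false   # simulated failure

if [ "${#FAILED_STEPS[@]}" -gt 0 ]; then
  echo "==== FAILED TEST STEPS ===="
  printf ' - %s\n' "${FAILED_STEPS[@]}"
fi
# the real scripts end with: exit ${EXITCODE}
```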

Out of scope

test_notebooks.sh and test_doc_examples.sh don't produce JUnit XML and exit non-zero on the first failure (notebooks via exit 1, doc-examples via its own EXIT-trap counter), so they don't have the same false-pass mode.

Why this shape

  • Crash XMLs go to a separate *-crash.xml filename so we never clobber a partial XML emitted by gtest/pytest before the crash.
  • The nightly retry path is untouched — it already calls write_crash_xml per retry attempt and via pytest_crash_isolate. This PR plugs the gap on the non-nightly path (PR runs and any non-nightly RAPIDS_BUILD_TYPE).
  • Step summaries print to stdout only (per maintainer preference). nightly_report.py continues to write the structured markdown / HTML / GITHUB_STEP_SUMMARY.

Test plan

  • On a deliberately-crashing gtest binary, verify <test_name>-crash.xml is written and nightly_report.py lists it as a failure.
  • On a deliberately-crashing pytest invocation, verify <junit-name>-crash.xml is written and the failure surfaces in the report.
  • On a normal (non-crash) test failure, verify behavior is unchanged.
  • On a fully-passing run, verify no FAILED gtest BINARIES / FAILED TEST STEPS block is printed.
  • pre-commit run --files ci/... passes (shellcheck wired in).

When a gtest binary segfaults (or otherwise dies from a signal), gtest
does not write its --gtest_output=xml file because the XML printer only
runs at the end of RUN_ALL_TESTS(). On the non-nightly path of
run_ctests.sh, the function returned without synthesizing any JUnit
record, so nightly_report.py — which classifies tests purely from the
XML files — saw zero failures and printed "All tests passed." even
though run_ctests.sh exited 1 and the GitHub step was red.

Changes:
- Non-nightly path now writes a synthetic *-crash.xml on signal death
  via the existing write_crash_xml helper (separate file so any partial
  XML from gtest is preserved).
- All failure paths (non-nightly crash, non-nightly normal failure,
  nightly listing failure, nightly retries-exhausted) now record the
  binary into a FAILED_BINARIES array with a short reason.
- After the for-loop, a summary block lists each failed binary with
  its failure mode, so the cause is visible in the raw run log even
  before nightly_report.py's structured output.
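The bookkeeping above might look roughly like this. `FAILED_BINARIES` and `record_binary_failure` are named in the commit message; the reason strings and the second binary name are placeholders, and the real run_ctests.sh may record failures differently:

```shell
FAILED_BINARIES=()

# Record a failed gtest binary with a short human-readable reason.
record_binary_failure() {
  # $1 = binary name, $2 = reason (e.g. "crashed with signal 11")
  FAILED_BINARIES+=("$1: $2")
}

record_binary_failure "PDLP_TEST" "crashed with signal 11"
record_binary_failure "SOME_OTHER_TEST" "retries exhausted"

# Summary block printed after the for-loop over binaries.
if [ "${#FAILED_BINARIES[@]}" -gt 0 ]; then
  echo "==== FAILED gtest BINARIES ===="
  printf ' - %s\n' "${FAILED_BINARIES[@]}"
fi
```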

copy-pr-bot Bot commented May 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Apply the same false-pass fix as the prior commit to every test runner
that emits JUnit XML, plus a stdout step-summary in each wrapper.

Same root cause as cpp: pytest does not finalize its --junitxml output
when killed by a signal. nightly_report.py classifies purely from XML
files, so a mid-run crash with no other failures was reported as
"All tests passed" while the script exited non-zero.

Changes:
- ci/utils/crash_helpers.sh: add write_pytest_crash_marker helper that
  writes <junitxml>-crash.xml on signal death, preserving any partial
  XML pytest may have emitted.
- ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh: call the new
  helper on the non-nightly crash path before exit (nightly path is
  unchanged — it already does per-test isolation via pytest_crash_isolate).
- ci/thirdparty-testing/run_pulp_tests.sh, run_pyomo_tests.sh,
  run_cvxpy_tests.sh: capture pytest's exit code and synthesize a
  crash XML on signal death.
- ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh,
  ci/test_wheel_cuopt_server.sh: track each major step in a
  FAILED_STEPS array and print an end-of-run block listing which steps
  failed, so the cause is obvious in the raw run log alongside
  nightly_report.py's structured output.

test_notebooks.sh and test_doc_examples.sh do not produce JUnit XML
and exit non-zero on first failure already, so they don't have the
same false-pass mode.
@rgsl888prabhu rgsl888prabhu changed the title ci(cpp-tests): record binary-level crashes and print failure summary ci: surface crashed-test failures and add per-step summaries (cpp, python, thirdparty) May 8, 2026
@rgsl888prabhu rgsl888prabhu marked this pull request as ready for review May 8, 2026 20:35
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner May 8, 2026 20:35

coderabbitai Bot commented May 8, 2026

Walkthrough

This PR adds crash-marker helpers and changes CI runners to accumulate failures: pytest/gtest runners write synthetic JUnit crash XML on abnormal exits; orchestration scripts record failed steps/binaries and print consolidated summaries before final exit.

Changes

CI Failure Tracking and Crash Reporting

Layer / File(s) Summary
Crash Reporting Helper Foundation
ci/utils/crash_helpers.sh
Added write_pytest_crash_marker and extended signal_name to recognize GNU timeout (124).
GTest Binary Failure Tracking Infrastructure
ci/run_ctests.sh
Introduced FAILED_BINARIES, record_binary_failure, and write_binary_crash_marker.
GTest Failure Recording Paths
ci/run_ctests.sh
Records binary failures on signal death (and writes a crash marker), when the gtest test listing is unavailable, when failing testcases cannot be parsed, and when retries are exhausted; prints a consolidated summary.
Pytest Crash Marker Writing
ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh, ci/thirdparty-testing/run_cvxpy_tests.sh, ci/thirdparty-testing/run_pulp_tests.sh, ci/thirdparty-testing/run_pyomo_tests.sh
Capture pytest exit codes and call write_pytest_crash_marker to synthesize *-crash.xml for abnormal pytest exits (timeouts/signals) on non-nightly runs.
Test Script Failure Aggregation and Reporting
ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh
Orchestration scripts use FAILED_STEPS arrays, ERR traps, and set +e to continue after failures, accumulate step names, and print failure summaries before exiting with appropriate exit code.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: crash-test failure surfacing and per-step summaries across multiple CI scripts.
Description check ✅ Passed The description is directly related to the changeset, explaining the problem (false-pass on crashed tests), the solution (crash XML synthesis and step summaries), and implementation details.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ci/thirdparty-testing/run_cvxpy_tests.sh`:
- Around line 53-55: The crash-marker logic currently only writes a crash marker
when pytest_rc > 128 (signal deaths); update the condition around
write_pytest_crash_marker to also treat the timeout exit code 124 as a crash so
the marker is written when timeout kills pytest. Locate the block using the
variable pytest_rc and the function write_pytest_crash_marker and change the
conditional to trigger when pytest_rc == 124 OR pytest_rc > 128 (apply the same
change to the corresponding checks in the other two test scripts that use
pytest_rc and write_pytest_crash_marker).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a2e955e7-efca-4b98-9846-a83e9cbab627

📥 Commits

Reviewing files that changed from the base of the PR and between b311a99 and 4ba7c78.

📒 Files selected for processing (11)
  • ci/run_ctests.sh
  • ci/run_cuopt_pytests.sh
  • ci/run_cuopt_server_pytests.sh
  • ci/test_cpp.sh
  • ci/test_python.sh
  • ci/test_wheel_cuopt.sh
  • ci/test_wheel_cuopt_server.sh
  • ci/thirdparty-testing/run_cvxpy_tests.sh
  • ci/thirdparty-testing/run_pulp_tests.sh
  • ci/thirdparty-testing/run_pyomo_tests.sh
  • ci/utils/crash_helpers.sh

Comment thread ci/thirdparty-testing/run_cvxpy_tests.sh Outdated
Contributor

@bdice bdice left a comment


Bash code generally looks fine. CodeRabbit raised a good question about handling other signals, maybe look at that before merging.

Comment thread ci/thirdparty-testing/run_cvxpy_tests.sh Outdated
…al deaths

Addresses review feedback on PR NVIDIA#1191. The previous condition
`pytest_rc > 128` only matched signal deaths, but GNU coreutils
'timeout' (which all three thirdparty pytest runners use) exits with
124 when it kills its child — meaning a timeout-killed pytest would
leave no JUnit XML AND no synthetic crash marker, and nightly_report.py
would once again report "All tests passed."

Use `pytest_rc > 5` instead: pytest's normal exit codes are 0-5
(passed / failed / interrupted / internal-error / usage / no-tests).
Anything beyond that means pytest did not finalize its JUnit XML.
This covers timeout (124), unhandled signals (SIGSEGV, SIGABRT, SIGKILL,
SIGTERM), and any other abnormal exit without needing to enumerate them.

Also update signal_name() to render rc=124 as
"timeout (killed by 'timeout' command)" instead of the misleading
"signal -4" (124-128) it would otherwise produce.

The same condition is not changed in run_cuopt_pytests.sh /
run_cuopt_server_pytests.sh because those scripts do not invoke
pytest under an internal `timeout` — their rc=124 case cannot arise
from inside the script.
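The exit-code classification this commit describes can be sketched as follows. `signal_name` here is a simplified stand-in for the helper in `ci/utils/crash_helpers.sh`, and the pytest invocation is replaced by a dummy command so the snippet is self-contained:

```shell
# Simplified stand-in for the signal_name helper described above.
signal_name() {
  local rc="$1"
  if [ "${rc}" -eq 124 ]; then
    echo "timeout (killed by 'timeout' command)"
  elif [ "${rc}" -gt 128 ]; then
    echo "signal $(( rc - 128 ))"   # 128+N means killed by signal N
  else
    echo "exit code ${rc}"
  fi
}

pytest_rc=0
# Stand-in for: timeout 5m pytest ... || pytest_rc=$?
# (guarded with || so 'set -e' cannot skip the capture)
sh -c 'exit 124' || pytest_rc=$?

# pytest's documented exit codes are 0-5; anything larger means
# pytest never finalized its JUnit XML, so a crash marker is needed.
if [ "${pytest_rc}" -gt 5 ]; then
  echo "pytest did not exit cleanly: $(signal_name "${pytest_rc}")"
fi
```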
@rgsl888prabhu rgsl888prabhu added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 11, 2026
@rgsl888prabhu rgsl888prabhu self-assigned this May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ci/thirdparty-testing/run_pulp_tests.sh (1)

43-56: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fallback failure path can skip crash-marker synthesis under set -e.

At line 45, timeout 5m python pulp/tests/run_tests.py is unguarded. The script has set -e on line 5, so if timeout exits non-zero (e.g., 124 on timeout), the shell exits before line 46 (pytest_rc=$?) executes, causing lines 54–56 (crash-marker synthesis) to never run. This can cause nightly_report.py to report "All tests passed" when tests actually timed out or crashed.

Suggested minimal fix
 if [ "$pytest_rc" -eq 5 ]; then
     rapids-logger "No pytest -k cuopt tests found; running PuLP run_tests.py (solver auto-detection, includes cuopt)"
-    timeout 5m python pulp/tests/run_tests.py
-    pytest_rc=$?
+    pytest_rc=0
+    timeout 5m python pulp/tests/run_tests.py || pytest_rc=$?
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci/thirdparty-testing/run_pulp_tests.sh` around lines 43 - 56, The unguarded
`timeout 5m python pulp/tests/run_tests.py` can trigger `set -e` to exit the
script before `pytest_rc=$?` and `write_pytest_crash_marker` run; wrap that call
so its exit code is always captured even under `set -e` (e.g., temporarily
disable `set -e` or run the command in a conditional/subshell), assign its exit
code to `pytest_rc` thereafter, and ensure you still call
`write_pytest_crash_marker` when `pytest_rc` > 5; key symbols: the `timeout 5m
python pulp/tests/run_tests.py` invocation, the `pytest_rc` variable, and the
`write_pytest_crash_marker` call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e25aaf2c-28d4-48d0-be02-52e34ab13fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 4ba7c78 and db5d4d6.

📒 Files selected for processing (4)
  • ci/thirdparty-testing/run_cvxpy_tests.sh
  • ci/thirdparty-testing/run_pulp_tests.sh
  • ci/thirdparty-testing/run_pyomo_tests.sh
  • ci/utils/crash_helpers.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/utils/crash_helpers.sh

@rgsl888prabhu
Collaborator Author

/merge

@rapids-bot rapids-bot Bot merged commit 3df9b05 into NVIDIA:main May 12, 2026
409 of 417 checks passed
rgsl888prabhu added a commit that referenced this pull request May 12, 2026
The assert-failure test already verifies the basic NEW-failure flow.
Add a second test that sends SIGSEGV to pytest mid-run so the
crash-marker path (write_pytest_crash_marker, added in #1191) writes
a PROCESS_CRASH entry that the classifier surfaces under NEW failures
with a SIGSEGV message.

Named test_zz_* so it runs after the assert test in definition order
and the assertion failure still lands in the partial JUnit XML before
pytest is killed.

DO NOT MERGE.  Remove both smoke tests before merging PR #1194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rgsl888prabhu added a commit that referenced this pull request May 12, 2026
Crashes (PROCESS_CRASH entries written by PR #1191's
write_pytest_crash_marker, or any failure whose message matches
"crashed with SIG*") now render in a dedicated GitHub `[!CAUTION]`
alert block at the top of the comment.  The full crash diagnostic
goes in a fenced code block — no more half-sentence truncation when
the message gets cut at 140 chars.

Bumps the table-cell short_msg cap from 140 to 300 chars so ordinary
failure messages also stop ending mid-sentence.

Crashes are split out of the NEW and KNOWN buckets before those
sections render, so the CAUTION block is the single source of truth
for crash entries (no duplication).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>