ci: surface crashed-test failures and add per-step summaries (cpp, python, thirdparty) #1191
Conversation
When a gtest binary segfaults (or otherwise dies from a signal), gtest does not write its `--gtest_output=xml` file, because the XML printer only runs at the end of `RUN_ALL_TESTS()`. On the non-nightly path of run_ctests.sh, the function returned without synthesizing any JUnit record, so nightly_report.py, which classifies tests purely from the XML files, saw zero failures and printed "All tests passed." even though run_ctests.sh exited 1 and the GitHub step was red.

Changes:
- The non-nightly path now writes a synthetic `*-crash.xml` on signal death via the existing `write_crash_xml` helper (a separate file, so any partial XML from gtest is preserved).
- All failure paths (non-nightly crash, non-nightly normal failure, nightly listing failure, nightly retries exhausted) now record the binary into a `FAILED_BINARIES` array with a short reason.
- After the for-loop, a summary block lists each failed binary with its failure mode, so the cause is visible in the raw run log even before nightly_report.py's structured output.
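The non-nightly flow described above can be sketched as a stand-alone script. This is a simplified illustration, not the real run_ctests.sh: `write_crash_xml` here is a stand-in for the helper in ci/utils/crash_helpers.sh, and the exact XML body is an assumption.

```shell
#!/usr/bin/env bash
# Simplified sketch of the non-nightly crash handling described in the PR.
# write_crash_xml stands in for the real helper in ci/utils/crash_helpers.sh.
write_crash_xml() {  # args: <xml-path> <suite-name> <reason>
  cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="$2" tests="1" failures="1">
  <testcase classname="crash" name="$2">
    <failure message="$3"/>
  </testcase>
</testsuite>
EOF
}

FAILED_BINARIES=()

run_gtest_binary() {
  local bin="$1" name rc=0
  name="${bin##*/}"
  "$bin" --gtest_output="xml:${name}.xml" || rc=$?
  if [ "$rc" -gt 128 ]; then
    # Death by signal (rc - 128): gtest never finalized its XML, so write a
    # separate *-crash.xml and keep any partial XML gtest did emit.
    write_crash_xml "${name}-crash.xml" "$name" \
      "crashed with signal $((rc - 128))"
    FAILED_BINARIES+=("${name} (signal $((rc - 128)))")
  elif [ "$rc" -ne 0 ]; then
    FAILED_BINARIES+=("${name} (exit $rc)")
  fi
  return "$rc"
}

print_failure_summary() {
  [ "${#FAILED_BINARIES[@]}" -eq 0 ] && return 0
  echo "==== FAILED gtest BINARIES ===="
  printf ' - %s\n' "${FAILED_BINARIES[@]}"
}
```

A wrapper would call `run_gtest_binary` in a for-loop over its binaries and `print_failure_summary` once at the end, mirroring the summary block the PR adds.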
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Apply the same false-pass fix as the prior commit to every test runner that emits JUnit XML, plus a stdout step summary in each wrapper.

Same root cause as cpp: pytest does not finalize its `--junitxml` output when killed by a signal. nightly_report.py classifies purely from XML files, so a mid-run crash with no other failures was reported as "All tests passed" while the script exited non-zero.

Changes:
- ci/utils/crash_helpers.sh: add a `write_pytest_crash_marker` helper that writes `<junitxml>-crash.xml` on signal death, preserving any partial XML pytest may have emitted.
- ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh: call the new helper on the non-nightly crash path before exit (the nightly path is unchanged; it already does per-test isolation via `pytest_crash_isolate`).
- ci/thirdparty-testing/run_pulp_tests.sh, run_pyomo_tests.sh, run_cvxpy_tests.sh: capture pytest's exit code and synthesize a crash XML on signal death.
- ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh: track each major step in a `FAILED_STEPS` array and print an end-of-run block listing which steps failed, so the cause is obvious in the raw run log alongside nightly_report.py's structured output.

test_notebooks.sh and test_doc_examples.sh do not produce JUnit XML and already exit non-zero on first failure, so they don't have the same false-pass mode.
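A rough sketch of what the helper and its call site might look like. The names come from the PR text, but the XML body, argument order, and surrounding logic are assumptions, not the real crash_helpers.sh.

```shell
#!/usr/bin/env bash
# Hypothetical shape of write_pytest_crash_marker and its call site; the
# real implementation lives in ci/utils/crash_helpers.sh.
write_pytest_crash_marker() {  # args: <junitxml-path> <exit-code>
  local xml="$1" rc="$2"
  # Write alongside (not over) any partial XML pytest may have emitted.
  cat > "${xml%.xml}-crash.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="pytest-crash" tests="1" failures="1">
  <testcase classname="crash" name="process">
    <failure message="pytest crashed with signal $((rc - 128)) (exit ${rc})"/>
  </testcase>
</testsuite>
EOF
}

# Non-nightly crash path in a wrapper: capture the test command's exit code
# without letting 'set -e' abort first, then synthesize the marker on
# signal death (exit codes above 128 mean 128 + signal number).
run_wrapped() {
  local junitxml="$1"; shift
  local rc=0
  "$@" || rc=$?          # '|| rc=$?' keeps set -e from exiting early
  if [ "$rc" -gt 128 ]; then
    write_pytest_crash_marker "$junitxml" "$rc"
  fi
  return "$rc"
}
```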
📝 Walkthrough

This PR adds crash-marker helpers and changes CI runners to accumulate failures: pytest/gtest runners write synthetic JUnit crash XML on abnormal exits; orchestration scripts record failed steps/binaries and print consolidated summaries before the final exit.

Changes: CI Failure Tracking and Crash Reporting
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ci/thirdparty-testing/run_cvxpy_tests.sh`:
- Around line 53-55: The crash-marker logic currently only writes a crash marker
when pytest_rc > 128 (signal deaths); update the condition around
write_pytest_crash_marker to also treat the timeout exit code 124 as a crash so
the marker is written when timeout kills pytest. Locate the block using the
variable pytest_rc and the function write_pytest_crash_marker and change the
conditional to trigger when pytest_rc == 124 OR pytest_rc > 128 (apply the same
change to the corresponding checks in the other two test scripts that use
pytest_rc and write_pytest_crash_marker).
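In isolation, the reviewer's suggested condition would look something like this (`pytest_rc` and `write_pytest_crash_marker` are the names used in the review; the surrounding script is omitted):

```shell
# Reviewer-suggested crash check: treat 'timeout' kills (exit 124) and
# signal deaths (exit > 128) as abnormal exits needing a crash marker.
needs_crash_marker() {
  local pytest_rc="$1"
  [ "$pytest_rc" -eq 124 ] || [ "$pytest_rc" -gt 128 ]
}
```

A caller would then gate the marker on it: `needs_crash_marker "$pytest_rc" && write_pytest_crash_marker "$junitxml" "$pytest_rc"`. A later commit on this PR widens this check further (see below).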
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a2e955e7-efca-4b98-9846-a83e9cbab627
📒 Files selected for processing (11)
ci/run_ctests.sh, ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh, ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh, ci/thirdparty-testing/run_cvxpy_tests.sh, ci/thirdparty-testing/run_pulp_tests.sh, ci/thirdparty-testing/run_pyomo_tests.sh, ci/utils/crash_helpers.sh
bdice left a comment:
Bash code generally looks fine. CodeRabbit raised a good question about handling other signals, maybe look at that before merging.
…al deaths

Addresses review feedback on PR NVIDIA#1191. The previous condition `pytest_rc > 128` only matched signal deaths, but GNU coreutils `timeout` (which all three thirdparty pytest runners use) exits with 124 when it kills its child, meaning a timeout-killed pytest would leave no JUnit XML and no synthetic crash marker, and nightly_report.py would once again report "All tests passed."

Use `pytest_rc > 5` instead: pytest's normal exit codes are 0-5 (passed / failed / interrupted / internal-error / usage / no-tests). Anything beyond that means pytest did not finalize its JUnit XML. This covers timeout (124), unhandled signals (SIGSEGV, SIGABRT, SIGKILL, SIGTERM), and any other abnormal exit without needing to enumerate them.

Also update `signal_name()` to render rc=124 as "timeout (killed by 'timeout' command)" instead of the misleading "signal -4" (124 - 128 = -4) it would otherwise produce.

The same condition is not changed in run_cuopt_pytests.sh / run_cuopt_server_pytests.sh because those scripts do not invoke pytest under an internal `timeout`; their rc=124 case cannot arise from inside the script.
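The widened check and the `signal_name()` tweak from this commit could be sketched like this. These are simplified stand-ins written from the commit message, not the actual helpers:

```shell
# Widened abnormal-exit check: pytest's documented exit codes are 0-5, so
# anything beyond that means the JUnit XML was not finalized
# (timeout = 124, signal deaths = 128 + N, etc.).
is_abnormal_pytest_exit() {
  [ "$1" -gt 5 ]
}

# Render an exit code for the crash message, special-casing coreutils
# 'timeout' (124) so it is not mislabeled as "signal -4".
signal_name() {
  local rc="$1"
  if [ "$rc" -eq 124 ]; then
    echo "timeout (killed by 'timeout' command)"
  elif [ "$rc" -gt 128 ]; then
    echo "signal $((rc - 128))"
  else
    echo "exit $rc"
  fi
}
```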
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ci/thirdparty-testing/run_pulp_tests.sh (1)
43-56: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fallback failure path can skip crash-marker synthesis under `set -e`.

At line 45, `timeout 5m python pulp/tests/run_tests.py` is unguarded. The script has `set -e` on line 5, so if timeout exits non-zero (e.g., 124 on timeout), the shell exits before line 46 (`pytest_rc=$?`) executes, causing lines 54-56 (crash-marker synthesis) to never run. This can cause nightly_report.py to report "All tests passed" when tests actually timed out or crashed.

Suggested minimal fix
```diff
 if [ "$pytest_rc" -eq 5 ]; then
   rapids-logger "No pytest -k cuopt tests found; running PuLP run_tests.py (solver auto-detection, includes cuopt)"
-  timeout 5m python pulp/tests/run_tests.py
-  pytest_rc=$?
+  pytest_rc=0
+  timeout 5m python pulp/tests/run_tests.py || pytest_rc=$?
 fi
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci/thirdparty-testing/run_pulp_tests.sh` around lines 43 - 56, The unguarded `timeout 5m python pulp/tests/run_tests.py` can trigger `set -e` to exit the script before `pytest_rc=$?` and `write_pytest_crash_marker` run; wrap that call so its exit code is always captured even under `set -e` (e.g., temporarily disable `set -e` or run the command in a conditional/subshell), assign its exit code to `pytest_rc` thereafter, and ensure you still call `write_pytest_crash_marker` when `pytest_rc` > 5; key symbols: the `timeout 5m python pulp/tests/run_tests.py` invocation, the `pytest_rc` variable, and the `write_pytest_crash_marker` call.
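The `|| pytest_rc=$?` guard the review suggests can be demonstrated in isolation. Here `bash -c 'exit 124'` stands in for a timed-out test run; the point is only that `set -e` does not abort the script when the failing command sits on the left of `||`:

```shell
#!/usr/bin/env bash
set -e
# Unguarded, a command exiting 124 would abort the script here under set -e.
# With '|| pytest_rc=$?' the exit code is captured and execution continues,
# so the later crash-marker synthesis can still run.
pytest_rc=0
bash -c 'exit 124' || pytest_rc=$?
echo "captured pytest_rc=$pytest_rc"
```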
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@ci/thirdparty-testing/run_pulp_tests.sh`:
- Around line 43-56: The unguarded `timeout 5m python pulp/tests/run_tests.py`
can trigger `set -e` to exit the script before `pytest_rc=$?` and
`write_pytest_crash_marker` run; wrap that call so its exit code is always
captured even under `set -e` (e.g., temporarily disable `set -e` or run the
command in a conditional/subshell), assign its exit code to `pytest_rc`
thereafter, and ensure you still call `write_pytest_crash_marker` when
`pytest_rc` > 5; key symbols: the `timeout 5m python pulp/tests/run_tests.py`
invocation, the `pytest_rc` variable, and the `write_pytest_crash_marker` call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e25aaf2c-28d4-48d0-be02-52e34ab13fdc
📒 Files selected for processing (4)
ci/thirdparty-testing/run_cvxpy_tests.sh, ci/thirdparty-testing/run_pulp_tests.sh, ci/thirdparty-testing/run_pyomo_tests.sh, ci/utils/crash_helpers.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- ci/utils/crash_helpers.sh
/merge
The assert-failure test already verifies the basic NEW-failure flow. Add a second test that sends SIGSEGV to pytest mid-run so the crash-marker path (`write_pytest_crash_marker`, added in #1191) writes a PROCESS_CRASH entry that the classifier surfaces under NEW failures with a SIGSEGV message.

Named `test_zz_*` so it runs after the assert test in definition order and the assertion failure still lands in the partial JUnit XML before pytest is killed.

DO NOT MERGE. Remove both smoke tests before merging PR #1194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Crashes (PROCESS_CRASH entries written by PR #1191's `write_pytest_crash_marker`, or any failure whose message matches "crashed with SIG*") now render in a dedicated GitHub `[!CAUTION]` alert block at the top of the comment. The full crash diagnostic goes in a fenced code block, so there is no more half-sentence truncation when the message gets cut at 140 chars.

Bumps the table-cell short_msg cap from 140 to 300 chars so ordinary failure messages also stop ending mid-sentence.

Crashes are split out of the NEW and KNOWN buckets before those sections render, so the CAUTION block is the single source of truth for crash entries (no duplication).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
When a test runner (gtest binary or pytest) is killed by a signal mid-run, it doesn't finalize its `--gtest_output=xml:` / `--junitxml` output. `ci/utils/nightly_report.py` classifies tests purely from those XML files, so a mid-run crash with no other captured failures was reported as "All tests passed." while the script exited non-zero. The GitHub step was correctly red, but the textual summary lied.

Observed in a recent cpp run:
`PDLP_TEST` reported five `[ FAILED ]` cases and then segfaulted, but the structured summary said `Classification: 459 passed, 0 failed, 0 errors, 0 flaky, 0 skipped` and concluded `All tests passed.`

Changes
Crash-XML synthesis (the actual fix)
- `ci/utils/crash_helpers.sh`: add a `write_pytest_crash_marker` helper that writes `<junitxml>-crash.xml` on pytest signal death, alongside (not over) any partial XML pytest may have emitted.
- `ci/run_ctests.sh`: the non-nightly crash path now calls `write_crash_xml` for the gtest binary; tracks failed binaries in a `FAILED_BINARIES` array and prints them at end of run.
- `ci/run_cuopt_pytests.sh`, `ci/run_cuopt_server_pytests.sh`: call `write_pytest_crash_marker` on the non-nightly crash path.
- `ci/thirdparty-testing/run_pulp_tests.sh`, `run_pyomo_tests.sh`, `run_cvxpy_tests.sh`: capture pytest's exit code and synthesize a crash XML on signal death.

Per-step failure summaries
- `ci/test_cpp.sh`, `ci/test_python.sh`, `ci/test_wheel_cuopt.sh`, `ci/test_wheel_cuopt_server.sh`: track each major step in a `FAILED_STEPS` array via `|| FAILED_STEPS+=(...)` and print a `==== FAILED TEST STEPS ====` block before the final `exit ${EXITCODE}`.

Out of scope
`test_notebooks.sh` and `test_doc_examples.sh` don't produce JUnit XML and exit non-zero on the first failure (notebooks via `exit 1`, doc-examples via its own EXIT-trap counter), so they don't have the same false-pass mode.

Why this shape
- A separate `*-crash.xml` filename, so we never clobber a partial XML emitted by gtest/pytest before the crash.
- The nightly path already synthesizes crash records (`write_crash_xml` per retry attempt, and via `pytest_crash_isolate`); this PR plugs the gap on the non-nightly path (PR runs and any non-nightly `RAPIDS_BUILD_TYPE`).
- `nightly_report.py` continues to write the structured markdown / HTML / GITHUB_STEP_SUMMARY.

Test plan
- `<test_name>-crash.xml` is written and `nightly_report.py` lists it as a failure.
- `<junit-name>-crash.xml` is written and the failure surfaces in the report.
- The `FAILED gtest BINARIES` / `FAILED TEST STEPS` block is printed.
- `pre-commit run --files ci/...` passes (shellcheck wired in).