
ci: surface crashed-test failures and add per-step summaries (cpp, python, thirdparty)#1191

Merged
rapids-bot[bot] merged 3 commits into NVIDIA:main from rgsl888prabhu:ci/cpp-test-crash-visibility
May 12, 2026

Conversation

@rgsl888prabhu (Collaborator) commented May 8, 2026

Summary

When a test runner — gtest binary or pytest — is killed by a signal mid-run, it doesn't finalize its --gtest_output=xml: / --junitxml output. ci/utils/nightly_report.py classifies tests purely from those XML files, so a mid-run crash with no other captured failures was reported as "All tests passed." while the script exited non-zero. The GitHub step was correctly red, but the textual summary lied.

Observed in a recent cpp run: PDLP_TEST reported five [ FAILED ] cases and then segfaulted, but the structured summary said Classification: 459 passed, 0 failed, 0 errors, 0 flaky, 0 skipped and concluded All tests passed.

Changes

Crash-XML synthesis (the actual fix)

  • ci/utils/crash_helpers.sh — add write_pytest_crash_marker helper that writes <junitxml>-crash.xml on pytest signal death, alongside (not over) any partial XML pytest may have emitted.
  • ci/run_ctests.sh — non-nightly crash path now calls write_crash_xml for the gtest binary; tracks failed binaries in a FAILED_BINARIES array and prints them at end of run.
  • ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh — call write_pytest_crash_marker on the non-nightly crash path.
  • ci/thirdparty-testing/run_pulp_tests.sh, run_pyomo_tests.sh, run_cvxpy_tests.sh — capture pytest's exit code and synthesize a crash XML on signal death.
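The helper's shape can be sketched roughly as follows. The function name `write_pytest_crash_marker` and the `-crash.xml` suffix come from this PR's description, but the body here (the filename derivation and the exact XML fields) is an illustrative guess, not the real `ci/utils/crash_helpers.sh`:

```shell
# Hypothetical sketch of the crash-marker helper; the real
# ci/utils/crash_helpers.sh may differ in naming and detail.
write_pytest_crash_marker() {
  local junit_xml="$1" rc="$2"
  # Write to a separate file so any partial XML pytest emitted survives.
  local marker="${junit_xml%.xml}-crash.xml"
  # Emit a one-test JUnit file so the reporter sees a failure even
  # though pytest never finalized its own --junitxml output.
  cat > "${marker}" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="pytest-crash" tests="1" failures="1">
  <testcase classname="crash" name="process_crash">
    <failure message="pytest crashed with exit code ${rc}"/>
  </testcase>
</testsuite>
EOF
}
```

A runner would call it as, e.g., `write_pytest_crash_marker "results/python-tests.xml" "${pytest_rc}"` on the crash path.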

Per-step failure summaries

  • ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh — track each major step in a FAILED_STEPS array via || FAILED_STEPS+=(...) and print a ==== FAILED TEST STEPS ==== block before the final exit ${EXITCODE}.
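A minimal sketch of the `|| FAILED_STEPS+=(...)` pattern described above; the `run_step` wrapper and the step names are invented for illustration and are not taken from the real scripts:

```shell
EXITCODE=0
FAILED_STEPS=()

# Run a step, remembering its name if it fails instead of aborting.
run_step() {
  local name="$1"; shift
  "$@" || { EXITCODE=1; FAILED_STEPS+=("${name}"); }
}

run_step "cpp-gtests" true    # placeholder for a real passing step
run_step "smoke-test" false   # simulated failure

if [ "${#FAILED_STEPS[@]}" -gt 0 ]; then
  echo "==== FAILED TEST STEPS ===="
  printf ' - %s\n' "${FAILED_STEPS[@]}"
fi
# the real scripts end with: exit ${EXITCODE}
```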

Out of scope

test_notebooks.sh and test_doc_examples.sh don't produce JUnit XML and exit non-zero on the first failure (notebooks via exit 1, doc-examples via its own EXIT-trap counter), so they don't have the same false-pass mode.

Why this shape

  • Crash XMLs go to a separate *-crash.xml filename so we never clobber a partial XML emitted by gtest/pytest before the crash.
  • The nightly retry path is untouched — it already calls write_crash_xml per retry attempt and via pytest_crash_isolate. This PR plugs the gap on the non-nightly path (PR runs and any non-nightly RAPIDS_BUILD_TYPE).
  • Step summaries print to stdout only (per maintainer preference). nightly_report.py continues to write the structured markdown / HTML / GITHUB_STEP_SUMMARY.

Test plan

  • On a deliberately-crashing gtest binary, verify <test_name>-crash.xml is written and nightly_report.py lists it as a failure.
  • On a deliberately-crashing pytest invocation, verify <junit-name>-crash.xml is written and the failure surfaces in the report.
  • On a normal (non-crash) test failure, verify behavior is unchanged.
  • On a fully-passing run, verify no FAILED gtest BINARIES / FAILED TEST STEPS block is printed.
  • pre-commit run --files ci/... passes (shellcheck wired in).

When a gtest binary segfaults (or otherwise dies from a signal), gtest
does not write its --gtest_output=xml file because the XML printer only
runs at the end of RUN_ALL_TESTS(). On the non-nightly path of
run_ctests.sh, the function returned without synthesizing any JUnit
record, so nightly_report.py — which classifies tests purely from the
XML files — saw zero failures and printed "All tests passed." even
though run_ctests.sh exited 1 and the GitHub step was red.

Changes:
- Non-nightly path now writes a synthetic *-crash.xml on signal death
  via the existing write_crash_xml helper (separate file so any partial
  XML from gtest is preserved).
- All failure paths (non-nightly crash, non-nightly normal failure,
  nightly listing failure, nightly retries-exhausted) now record the
  binary into a FAILED_BINARIES array with a short reason.
- After the for-loop, a summary block lists each failed binary with
  its failure mode, so the cause is visible in the raw run log even
  before nightly_report.py's structured output.
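The bookkeeping above might look roughly like this. `FAILED_BINARIES` and `record_binary_failure` are named in the commit message; the reason strings and the second binary name are placeholders, and the real run_ctests.sh may record failures differently:

```shell
FAILED_BINARIES=()

# Record a failed gtest binary with a short human-readable reason.
record_binary_failure() {
  # $1 = binary name, $2 = reason (e.g. "crashed with signal 11")
  FAILED_BINARIES+=("$1: $2")
}

record_binary_failure "PDLP_TEST" "crashed with signal 11"
record_binary_failure "SOME_OTHER_TEST" "retries exhausted"

# Summary block printed after the for-loop over binaries.
if [ "${#FAILED_BINARIES[@]}" -gt 0 ]; then
  echo "==== FAILED gtest BINARIES ===="
  printf ' - %s\n' "${FAILED_BINARIES[@]}"
fi
```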

copy-pr-bot Bot commented May 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Apply the same false-pass fix as the prior commit to every test runner
that emits JUnit XML, plus a stdout step-summary in each wrapper.

Same root cause as cpp: pytest does not finalize its --junitxml output
when killed by a signal. nightly_report.py classifies purely from XML
files, so a mid-run crash with no other failures was reported as
"All tests passed" while the script exited non-zero.

Changes:
- ci/utils/crash_helpers.sh: add write_pytest_crash_marker helper that
  writes <junitxml>-crash.xml on signal death, preserving any partial
  XML pytest may have emitted.
- ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh: call the new
  helper on the non-nightly crash path before exit (nightly path is
  unchanged — it already does per-test isolation via pytest_crash_isolate).
- ci/thirdparty-testing/run_pulp_tests.sh, run_pyomo_tests.sh,
  run_cvxpy_tests.sh: capture pytest's exit code and synthesize a
  crash XML on signal death.
- ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh,
  ci/test_wheel_cuopt_server.sh: track each major step in a
  FAILED_STEPS array and print an end-of-run block listing which steps
  failed, so the cause is obvious in the raw run log alongside
  nightly_report.py's structured output.

test_notebooks.sh and test_doc_examples.sh do not produce JUnit XML
and exit non-zero on first failure already, so they don't have the
same false-pass mode.
@rgsl888prabhu rgsl888prabhu changed the title ci(cpp-tests): record binary-level crashes and print failure summary ci: surface crashed-test failures and add per-step summaries (cpp, python, thirdparty) May 8, 2026
@rgsl888prabhu rgsl888prabhu marked this pull request as ready for review May 8, 2026 20:35
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner May 8, 2026 20:35

coderabbitai Bot commented May 8, 2026

Walkthrough

This PR adds crash-marker helpers and changes CI runners to accumulate failures: pytest/gtest runners write synthetic JUnit crash XML on abnormal exits; orchestration scripts record failed steps/binaries and print consolidated summaries before final exit.

Changes

CI Failure Tracking and Crash Reporting

Layer / File(s) Summary
Crash Reporting Helper Foundation
ci/utils/crash_helpers.sh
Added write_pytest_crash_marker and extended signal_name to recognize GNU timeout (124).
GTest Binary Failure Tracking Infrastructure
ci/run_ctests.sh
Introduced FAILED_BINARIES, record_binary_failure, and write_binary_crash_marker.
GTest Failure Recording Paths
ci/run_ctests.sh
Records binary failures on signal death (and writes a crash marker), when the gtest test listing is unavailable, when failing testcases cannot be parsed, and when retries are exhausted; prints a consolidated summary.
Pytest Crash Marker Writing
ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh, ci/thirdparty-testing/run_cvxpy_tests.sh, ci/thirdparty-testing/run_pulp_tests.sh, ci/thirdparty-testing/run_pyomo_tests.sh
Capture pytest exit codes and call write_pytest_crash_marker to synthesize *-crash.xml for abnormal pytest exits (timeouts/signals) on non-nightly runs.
Test Script Failure Aggregation and Reporting
ci/test_cpp.sh, ci/test_python.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh
Orchestration scripts use FAILED_STEPS arrays, ERR traps, and set +e to continue after failures, accumulate step names, and print failure summaries before exiting with appropriate exit code.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: crash-test failure surfacing and per-step summaries across multiple CI scripts.
Description check ✅ Passed The description is directly related to the changeset, explaining the problem (false-pass on crashed tests), the solution (crash XML synthesis and step summaries), and implementation details.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ci/thirdparty-testing/run_cvxpy_tests.sh`:
- Around line 53-55: The crash-marker logic currently only writes a crash marker
when pytest_rc > 128 (signal deaths); update the condition around
write_pytest_crash_marker to also treat the timeout exit code 124 as a crash so
the marker is written when timeout kills pytest. Locate the block using the
variable pytest_rc and the function write_pytest_crash_marker and change the
conditional to trigger when pytest_rc == 124 OR pytest_rc > 128 (apply the same
change to the corresponding checks in the other two test scripts that use
pytest_rc and write_pytest_crash_marker).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a2e955e7-efca-4b98-9846-a83e9cbab627

📥 Commits

Reviewing files that changed from the base of the PR and between b311a99 and 4ba7c78.

📒 Files selected for processing (11)
  • ci/run_ctests.sh
  • ci/run_cuopt_pytests.sh
  • ci/run_cuopt_server_pytests.sh
  • ci/test_cpp.sh
  • ci/test_python.sh
  • ci/test_wheel_cuopt.sh
  • ci/test_wheel_cuopt_server.sh
  • ci/thirdparty-testing/run_cvxpy_tests.sh
  • ci/thirdparty-testing/run_pulp_tests.sh
  • ci/thirdparty-testing/run_pyomo_tests.sh
  • ci/utils/crash_helpers.sh

Comment thread ci/thirdparty-testing/run_cvxpy_tests.sh Outdated
Contributor

@bdice bdice left a comment


Bash code generally looks fine. CodeRabbit raised a good question about handling other signals, maybe look at that before merging.

Comment thread ci/thirdparty-testing/run_cvxpy_tests.sh Outdated
…al deaths

Addresses review feedback on PR NVIDIA#1191. The previous condition
`pytest_rc > 128` only matched signal deaths, but GNU coreutils
'timeout' (which all three thirdparty pytest runners use) exits with
124 when it kills its child — meaning a timeout-killed pytest would
leave no JUnit XML AND no synthetic crash marker, and nightly_report.py
would once again report "All tests passed."

Use `pytest_rc > 5` instead: pytest's normal exit codes are 0-5
(passed / failed / interrupted / internal-error / usage / no-tests).
Anything beyond that means pytest did not finalize its JUnit XML.
This covers timeout (124), unhandled signals (SIGSEGV, SIGABRT, SIGKILL,
SIGTERM), and any other abnormal exit without needing to enumerate them.

Also update signal_name() to render rc=124 as
"timeout (killed by 'timeout' command)" instead of the misleading
"signal -4" (124-128) it would otherwise produce.

The same condition is not changed in run_cuopt_pytests.sh /
run_cuopt_server_pytests.sh because those scripts do not invoke
pytest under an internal `timeout` — their rc=124 case cannot arise
from inside the script.
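The exit-code classification this commit describes can be sketched as follows. `signal_name` here is a simplified stand-in for the helper in `ci/utils/crash_helpers.sh`, and the pytest invocation is replaced by a dummy command so the snippet is self-contained:

```shell
# Simplified stand-in for the signal_name helper described above.
signal_name() {
  local rc="$1"
  if [ "${rc}" -eq 124 ]; then
    echo "timeout (killed by 'timeout' command)"
  elif [ "${rc}" -gt 128 ]; then
    echo "signal $(( rc - 128 ))"   # 128+N means killed by signal N
  else
    echo "exit code ${rc}"
  fi
}

pytest_rc=0
# Stand-in for: timeout 5m pytest ... || pytest_rc=$?
# (guarded with || so 'set -e' cannot skip the capture)
sh -c 'exit 124' || pytest_rc=$?

# pytest's documented exit codes are 0-5; anything larger means
# pytest never finalized its JUnit XML, so a crash marker is needed.
if [ "${pytest_rc}" -gt 5 ]; then
  echo "pytest did not exit cleanly: $(signal_name "${pytest_rc}")"
fi
```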
@rgsl888prabhu rgsl888prabhu added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 11, 2026
@rgsl888prabhu rgsl888prabhu self-assigned this May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ci/thirdparty-testing/run_pulp_tests.sh (1)

43-56: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fallback failure path can skip crash-marker synthesis under set -e.

At line 45, timeout 5m python pulp/tests/run_tests.py is unguarded. The script has set -e on line 5, so if timeout exits non-zero (e.g., 124 on timeout), the shell exits before line 46 (pytest_rc=$?) executes, causing lines 54–56 (crash-marker synthesis) to never run. This can cause nightly_report.py to report "All tests passed" when tests actually timed out or crashed.

Suggested minimal fix
 if [ "$pytest_rc" -eq 5 ]; then
     rapids-logger "No pytest -k cuopt tests found; running PuLP run_tests.py (solver auto-detection, includes cuopt)"
-    timeout 5m python pulp/tests/run_tests.py
-    pytest_rc=$?
+    pytest_rc=0
+    timeout 5m python pulp/tests/run_tests.py || pytest_rc=$?
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci/thirdparty-testing/run_pulp_tests.sh` around lines 43 - 56, The unguarded
`timeout 5m python pulp/tests/run_tests.py` can trigger `set -e` to exit the
script before `pytest_rc=$?` and `write_pytest_crash_marker` run; wrap that call
so its exit code is always captured even under `set -e` (e.g., temporarily
disable `set -e` or run the command in a conditional/subshell), assign its exit
code to `pytest_rc` thereafter, and ensure you still call
`write_pytest_crash_marker` when `pytest_rc` > 5; key symbols: the `timeout 5m
python pulp/tests/run_tests.py` invocation, the `pytest_rc` variable, and the
`write_pytest_crash_marker` call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e25aaf2c-28d4-48d0-be02-52e34ab13fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 4ba7c78 and db5d4d6.

📒 Files selected for processing (4)
  • ci/thirdparty-testing/run_cvxpy_tests.sh
  • ci/thirdparty-testing/run_pulp_tests.sh
  • ci/thirdparty-testing/run_pyomo_tests.sh
  • ci/utils/crash_helpers.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/utils/crash_helpers.sh

@rgsl888prabhu
Collaborator Author

/merge

@rapids-bot rapids-bot Bot merged commit 3df9b05 into NVIDIA:main May 12, 2026
409 of 417 checks passed
rgsl888prabhu added a commit that referenced this pull request May 12, 2026
The assert-failure test already verifies the basic NEW-failure flow.
Add a second test that sends SIGSEGV to pytest mid-run so the
crash-marker path (write_pytest_crash_marker, added in #1191) writes
a PROCESS_CRASH entry that the classifier surfaces under NEW failures
with a SIGSEGV message.

Named test_zz_* so it runs after the assert test in definition order
and the assertion failure still lands in the partial JUnit XML before
pytest is killed.

DO NOT MERGE.  Remove both smoke tests before merging PR #1194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rgsl888prabhu added a commit that referenced this pull request May 12, 2026
Crashes (PROCESS_CRASH entries written by PR #1191's
write_pytest_crash_marker, or any failure whose message matches
"crashed with SIG*") now render in a dedicated GitHub `[!CAUTION]`
alert block at the top of the comment.  The full crash diagnostic
goes in a fenced code block — no more half-sentence truncation when
the message gets cut at 140 chars.

Bumps the table-cell short_msg cap from 140 to 300 chars so ordinary
failure messages also stop ending mid-sentence.

Crashes are split out of the NEW and KNOWN buckets before those
sections render, so the CAUTION block is the single source of truth
for crash entries (no duplication).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>