Summary:
The macOS unittest job has been hitting its timeout for several runs in a
row with the progress bar frozen partway through pytest. Add
`pytest-timeout` so a stuck test fails with a nodeid and per-thread stack
trace, and set `faulthandler_timeout=180` so each xdist worker dumps all
thread stacks once a test has been running for 3 minutes -- this surfaces
the hung test as it develops, not just at termination.
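The watchdog mechanism can be illustrated with the standard-library `faulthandler` module, which pytest's `faulthandler_timeout` option builds on. This is a minimal sketch rather than the project's configuration: `run_with_watchdog` is a hypothetical helper, and the sub-second timeout is chosen only so the demonstration finishes quickly.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(fn, timeout_s):
    """Run fn(); if it is still running after timeout_s, dump all thread stacks.

    Hypothetical helper for illustration. Returns whatever the watchdog wrote.
    """
    # faulthandler writes straight to a file descriptor, so use a real file.
    with tempfile.TemporaryFile(mode="w+") as log:
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            fn()
        finally:
            # Disarm the watchdog once the call returns.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    # A stand-in "test" that outlives the watchdog produces a stack dump.
    print(run_with_watchdog(lambda: time.sleep(0.5), timeout_s=0.1))
```

The dump does not stop the running call; it only records where each thread is, which is why it complements `pytest-timeout` rather than replacing it.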
Diagnostic runs identified the hang fingerprint: tests stuck inside
`torch._inductor.package._package.__call__` (line 736), the AOTI-packaged
`.so` invocation. The same stack was observed for tests in
`examples/models/llama3_2_vision/{preprocess,vision_encoder}` and almost
certainly affects the other AOTI call sites we ship. Local M1 reruns of
the affected tests with the CI-pinned torch wheel all pass quickly, so
the hang is CI-environment-specific (suspect: AOTI dlopen under heavy
xdist + coverage contention on the macos-m1-stable runner).
Skip every AOTI-invoking test on macOS CI only, via
`unittest-macos-cmake.sh` -- linux, windows, and local dev continue to
run them. The skip covers the three vision tests (whole files) plus the
specific `*_aoti` methods in `extension/llm/modules/test/test_*.py`.
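The commit applies the skip through `unittest-macos-cmake.sh`; the selection rule for the `*_aoti` methods can be sketched in Python as below. Everything here is an assumption for illustration: `should_skip_on_macos_ci` is a hypothetical helper, the check for `CI=true` assumes the runner exports that variable, and the example node ID in the usage note is made up. The sketch does not cover the vision tests that are skipped as whole files.

```python
import os
import sys


def should_skip_on_macos_ci(nodeid, platform=None, environ=None):
    """Return True if a pytest node ID names a *_aoti test on macOS CI.

    Hypothetical helper; platform and environ are injectable for testing.
    """
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    # Assumption: the CI runner exports CI=true; local dev does not.
    on_macos_ci = platform == "darwin" and environ.get("CI", "").lower() == "true"
    # A node ID looks like "path/test_file.py::TestClass::test_name[param]".
    test_name = nodeid.rsplit("::", 1)[-1].split("[", 1)[0]
    return on_macos_ci and test_name.endswith("_aoti")
```

For example, `should_skip_on_macos_ci("test_example.py::ExampleTest::test_forward_aoti", platform="darwin", environ={"CI": "true"})` is true, while the same call with `platform="linux"` or an empty `environ` is false, matching the intent that linux, windows, and local dev keep running these tests.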
Job timeout drops back to 30 minutes for fast iteration; pytest
`--timeout=1500` gives any single test 25 minutes before it is treated
as hung.
Co-Authored-By: Claude <noreply@anthropic.com>
Summary:
Test the hypothesis that the macOS AOTI hangs are caused by xdist-worker
contention (parallel clang/ld during AOTI compile, dlopen lock on darwin,
libomp oversubscription) rather than a true deadlock. Switch the macOS
pytest invocation from `-n auto` to `-n 1` and remove the AOTI skips from
the previous commit so the previously-hung tests actually run. Bump the
job timeout to 90 minutes so a serial run has room to finish; keep the
per-test `--timeout=1500`.
If the AOTI tests pass under `-n 1`, the right permanent fix is splitting
macOS into a serial AOTI lane plus a parallel everything-else lane (or
simply lowering the parallelism cap). If they still hang, the issue is
deeper than xdist contention and we keep the skips.
Co-Authored-By: Claude <noreply@anthropic.com>
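The proposed two-lane split could look roughly like the following. This is a sketch under stated assumptions, not something in this PR: `build_macos_lanes` is a hypothetical helper, and selecting AOTI tests with `-k aoti` assumes they can be identified by name. Tests that were skipped as whole files would likely need to be listed by path instead.

```python
import shlex


def build_macos_lanes(per_test_timeout_s=1500):
    """Build two pytest command lines: a serial AOTI lane and a parallel rest lane.

    Hypothetical helper; the -k expressions are an assumed selection rule.
    """
    base = ["python", "-m", "pytest", f"--timeout={per_test_timeout_s}"]
    # Serial lane: a single xdist worker, only tests whose names match "aoti".
    serial_aoti = base + ["-n", "1", "-k", "aoti"]
    # Parallel lane: one worker per CPU, everything that is not an AOTI test.
    parallel_rest = base + ["-n", "auto", "-k", "not aoti"]
    return serial_aoti, parallel_rest


if __name__ == "__main__":
    for lane in build_macos_lanes():
        print(shlex.join(lane))
```

Keeping both lanes on the same per-test timeout means a hang is reported the same way in either lane; only the degree of parallelism differs.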
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19871
Dr. CI status as of commit c43838f with merge base 4de16d0: 4 cancelled jobs, 1 unrelated failure, 6 unclassified failures. Dr. CI could not classify the unclassified failures because the workflow did not run on the merge base; they may be pre-existing on trunk or introduced by this PR. The cancelled jobs need a retry, and the unrelated failure is likely due to flakiness present on trunk.
Stack created with Sapling. Best reviewed with ReviewStack.