🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19875
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 279 Pending
As of commit 28f2173 with merge base 88faab2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary:
The macOS unittest job has been hitting its timeout for several runs in a
row with the progress bar frozen partway through pytest. Add
`pytest-timeout` so a stuck test fails with a nodeid and per-thread stack
trace, and set `faulthandler_timeout=180` so every xdist worker dumps its
threads every 3 minutes while tests are still running -- this surfaces the
hung test as it develops, not just at termination.
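The timed thread dump can be illustrated with the standard library alone. The sketch below is not the PR's configuration; it only exercises `faulthandler.dump_traceback_later`, the stdlib call for timed stack dumps. The helper `run_with_watchdog`, the function `slow_test`, and the short timeout (standing in for the 180 seconds used in CI) are all hypothetical.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(fn, timeout_s, repeat=False):
    """Run fn(); if it is still running after timeout_s seconds, dump every
    thread's stack to a temporary file and return the dump as text."""
    with tempfile.TemporaryFile(mode="w+") as log:
        # Arm a timer that writes all thread stacks to `log` once timeout_s
        # elapses (or every timeout_s seconds when repeat=True).
        faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=log)
        try:
            fn()
        finally:
            # Disarm the timer so a fast call produces no dump.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


def slow_test():
    time.sleep(0.5)  # stands in for a test stuck in a long call


if __name__ == "__main__":
    # The dump names slow_test as the frame that was executing at the timeout.
    print(run_with_watchdog(slow_test, timeout_s=0.1))
```

Because the dump is written while `fn` is still running, the stuck frame is visible without waiting for the job itself to be killed.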
Diagnostic runs identified the hang fingerprint: tests stuck inside
`torch._inductor.package._package.__call__` (line 736), the AOTI-packaged
`.so` invocation. The same stack was observed for tests in
`examples/models/llama3_2_vision/{preprocess,vision_encoder}` and almost
certainly affects the other AOTI call sites we ship. Local M1 reruns of
the affected tests with the CI-pinned torch wheel all pass quickly, so
the hang is CI-environment-specific (suspect: AOTI dlopen under heavy
xdist + coverage contention on the macos-m1-stable runner).
Skip every AOTI-invoking test on macOS CI only, via
`unittest-macos-cmake.sh` -- linux, windows, and local dev continue to
run them. The skip covers the three vision tests (whole files) plus the
specific `*_aoti` methods in `extension/llm/modules/test/test_*.py`.
Job timeout drops back to 30 minutes for fast iteration; pytest
`--timeout=1500` gives any single test 25 minutes before it is treated
as hung.
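As a rough illustration of the skip and timeout settings described above, the sketch below assembles comparable pytest arguments in Python. The PR makes this change in `unittest-macos-cmake.sh`, and the exact flags it uses are not shown here, so the helper `macos_ci_pytest_args` and the example file and test names are hypothetical. Only the option names (`--ignore`, `--deselect`, and pytest-timeout's `--timeout`) are real pytest options.

```python
def macos_ci_pytest_args(ignored_files, deselected_ids, per_test_timeout_s=1500):
    """Return extra pytest arguments for a macOS CI run.

    ignored_files:  test files to skip entirely (pytest --ignore)
    deselected_ids: individual test node ids to skip (pytest --deselect)
    """
    args = [f"--timeout={per_test_timeout_s}"]  # provided by pytest-timeout
    args += [f"--ignore={path}" for path in ignored_files]
    args += [f"--deselect={nodeid}" for nodeid in deselected_ids]
    return args


# Hypothetical paths and node ids, for illustration only.
example_args = macos_ci_pytest_args(
    ignored_files=["path/to/test_vision_example.py"],
    deselected_ids=["path/to/test_module_example.py::ExampleTest::test_forward_aoti"],
)

JOB_TIMEOUT_MIN = 30
# 1500 s is 25 minutes, which leaves 5 minutes of the 30-minute job budget
# for setup and for the remaining tests.
headroom_min = JOB_TIMEOUT_MIN - 1500 / 60
```

Keeping the per-test timeout below the job timeout means a hung test is reported by pytest, with its nodeid, before the runner kills the whole job.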
Co-Authored-By: Claude <noreply@anthropic.com>
Summary:
Previous experiment used `-n 1` which still spawns an xdist worker
process that buffers all output. The CI logs showed 462 tests' worth of
progress dots flushed at a single timestamp, making it impossible to
identify which test was hanging.
Switch to `-p no:xdist` so tests run in the main process with
unbuffered output, and add `-v` so each test name prints as it
starts/completes. Combined with `faulthandler_timeout=180`, this will
show exactly which test is running when the hang occurs.
Faulthandler dumps from two prior CI runs pointed to the same test
position (~26% through the suite) but caught different XNNPACK passes
mid-retrace (FuseBatchNormPass in one, RemoveRedundantCopyPass in the
other). The common pattern is `super().call()` retracing large delegate
subgraphs -- suspect is `test_all_models_with_recipes` running ResNet50 /
ViT / DeepLabV3 through the full XNNPACK pass pipeline.
Co-Authored-By: Claude <noreply@anthropic.com>
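Why per-test, unbuffered output helps can be shown with a toy runner. This is not pytest's implementation; `run_verbose` and the test names are hypothetical, and a raised exception stands in for the hang so that the example terminates. The point is only that when each name is written and flushed before the test body starts, the last line of the log names the test that never finished.

```python
import io


def run_verbose(tests, out):
    """Run (name, fn) pairs, writing and flushing each name before fn() runs."""
    for name, fn in tests:
        out.write(f"{name} ... ")
        out.flush()  # the name reaches the log before the body can stall
        fn()
        out.write("PASSED\n")
        out.flush()


def stalls():
    # Stand-in for a hang: control never returns normally from this test.
    raise TimeoutError("simulated hang")


log = io.StringIO()
try:
    run_verbose([("test_fast", lambda: None), ("test_stuck", stalls)], log)
except TimeoutError:
    pass

# The final log line identifies the test that was running when progress stopped.
last_line = log.getvalue().splitlines()[-1]
```

With buffered progress dots, by contrast, nothing distinguishes the stuck test from the ones that completed before it.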
Stack created with Sapling. Best reviewed with ReviewStack.