CI: experiment — disable xdist, add -v for unbuffered test output by SS-JIA · Pull Request #19875 · pytorch/executorch

SS-JIA · 2026-05-29T15:24:05Z

Summary:
Previous experiment used `-n 1` which still spawns an xdist worker
process that buffers all output. The CI logs showed 462 tests' worth of
progress dots flushed at a single timestamp, making it impossible to
identify which test was hanging.

Switch to `-p no:xdist` so tests run in the main process with
unbuffered output, and add `-v` so each test name prints as it
starts/completes. Combined with `faulthandler_timeout=180`, this will
show exactly which test is running when the hang occurs.
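
For readers unfamiliar with the watchdog that `faulthandler_timeout` relies on, here is a minimal sketch using the standard-library `faulthandler` module directly. It is not the code pytest runs. `slow_step` and `run_with_watchdog` are hypothetical names invented for illustration, and the sub-second timeout is only there to keep the demo fast.

```python
import faulthandler
import tempfile
import time


def slow_step(seconds: float) -> None:
    """Hypothetical stand-in for a test body that is stuck."""
    time.sleep(seconds)


def run_with_watchdog(timeout: float, work_seconds: float) -> str:
    """Run slow_step under a one-shot watchdog and return any stack dump."""
    with tempfile.TemporaryFile(mode="w+") as dump_file:
        # If we are still running after `timeout` seconds, a watchdog thread
        # writes the tracebacks of all threads to dump_file.
        faulthandler.dump_traceback_later(timeout, file=dump_file)
        try:
            slow_step(work_seconds)
        finally:
            # Disarm the watchdog so it cannot fire after the work is done.
            faulthandler.cancel_dump_traceback_later()
        dump_file.seek(0)
        return dump_file.read()


if __name__ == "__main__":
    # The work outlives the timeout, so the dump names the frame it was in.
    print(run_with_watchdog(timeout=0.2, work_seconds=1.0))
```

The dump only says where the process was when the timer fired. That is why it helps to pair it with `-v`, so the log also shows which test had started.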

Faulthandler dumps from two prior CI runs pointed to the same test
position (~26% through the suite) but caught different XNNPACK passes
mid-retrace (`FuseBatchNormPass` in one, `RemoveRedundantCopyPass` in the
other). The common pattern is `super().call()` retracing large delegate
subgraphs; the suspect is `test_all_models_with_recipes` running ResNet50
/ ViT / DeepLabV3 through the full XNNPACK pass pipeline.

Co-Authored-By: Claude <noreply@anthropic.com>

Stack created with Sapling. Best reviewed with ReviewStack.

pytorch-bot · 2026-05-29T15:24:10Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19875

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 279 Pending

As of commit 28f2173 with merge base 88faab2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-05-29T15:24:18Z

✅ login: SS-JIA / name: SS-JIA (28f2173, e80701d)
❌ - login: @claude / name: Claude. The commit (28f2173, e80701d) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

github-actions · 2026-05-29T15:25:02Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Summary: The macOS unittest job has hit its timeout for several consecutive runs, with the progress bar frozen partway through pytest. Add `pytest-timeout` so a stuck test fails with a nodeid and per-thread stack trace, and set `faulthandler_timeout=180` so every xdist worker dumps its thread stacks once a test has been running for 3 minutes; this surfaces the hung test while it is still running, not just at termination.

Diagnostic runs identified the hang fingerprint: tests stuck inside `torch._inductor.package._package.__call__` (line 736), the AOTI-packaged `.so` invocation. The same stack was observed for tests in `examples/models/llama3_2_vision/{preprocess,vision_encoder}` and almost certainly affects the other AOTI call sites we ship. Local M1 reruns of the affected tests with the CI-pinned torch wheel all pass quickly, so the hang is CI-environment-specific (suspect: AOTI dlopen under heavy xdist + coverage contention on the macos-m1-stable runner).

Skip every AOTI-invoking test on macOS CI only, via `unittest-macos-cmake.sh`; Linux, Windows, and local dev continue to run them. The skip covers the three vision tests (whole files) plus the specific `*_aoti` methods in `extension/llm/modules/test/test_*.py`. Job timeout drops back to 30 minutes for fast iteration; pytest `--timeout=1500` gives any single test 25 minutes before it is treated as hung.

Co-Authored-By: Claude <noreply@anthropic.com>
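
As a rough illustration of how the options in this commit message could fit together on a pytest command line, here is a small Python sketch. The commit message does not say how `unittest-macos-cmake.sh` applies the skip, so the use of `--ignore` and `--deselect` is an assumption. `build_pytest_args` and the paths in the usage example are hypothetical placeholders, not names from the repository.

```python
from typing import List, Sequence


def build_pytest_args(
    platform: str,
    ignored_files: Sequence[str] = (),
    deselected_ids: Sequence[str] = (),
    per_test_timeout_s: int = 1500,
    faulthandler_timeout_s: int = 180,
) -> List[str]:
    """Assemble pytest arguments; hypothetical helper for illustration only."""
    args = [
        # Override the ini option so long-running tests get a stack dump.
        "-o", f"faulthandler_timeout={faulthandler_timeout_s}",
        # pytest-timeout: fail any single test that exceeds this many seconds.
        f"--timeout={per_test_timeout_s}",
    ]
    if platform == "darwin":
        # Assumption: whole files are skipped with --ignore and individual
        # test methods with --deselect, on macOS only.
        args += [f"--ignore={path}" for path in ignored_files]
        args += [f"--deselect={nodeid}" for nodeid in deselected_ids]
    return args


if __name__ == "__main__":
    # Placeholder paths; the real file and method names are not given here.
    print(build_pytest_args(
        "darwin",
        ignored_files=["path/to/test_whole_file.py"],
        deselected_ids=["path/to/test_mod.py::SomeTest::test_something_aoti"],
    ))
```

Keeping the skip conditional on the platform matches the stated intent that Linux, Windows, and local development keep running the AOTI tests.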


SS-JIA mentioned this pull request May 29, 2026

CI: identify hanging tests in macOS unittest job #19844

Open

meta-cla Bot added the CLA Signed label May 29, 2026

SS-JIA and others added 2 commits May 29, 2026 13:26

SS-JIA force-pushed the pr19875 branch from 2ad2111 to 28f2173 Compare May 29, 2026 17:36

SS-JIA mentioned this pull request May 29, 2026

CI: re-skip AOTI tests to validate they cause the macOS hang #19880

Open

SS-JIA wants to merge 2 commits into main from pr19875
