From e80701d80dbc8f994b888128241e646cf8a7c8cc Mon Sep 17 00:00:00 2001
From: SS-JIA
Date: Thu, 28 May 2026 11:38:10 -0400
Subject: [PATCH 1/2] CI: identify hanging tests in macOS unittest job

Summary:
The macOS unittest job has been hitting its timeout for several runs in a
row with the progress bar frozen partway through pytest. Add
`pytest-timeout` so a stuck test fails with a nodeid and per-thread stack
trace, and set `faulthandler_timeout=180` so an xdist worker dumps the
tracebacks of all its threads as soon as any single test has been running
for more than 3 minutes -- this surfaces the hung test as it develops,
not just at termination.

Diagnostic runs identified the hang fingerprint: tests stuck inside
`torch._inductor.package._package.__call__` (line 736), the AOTI-packaged
`.so` invocation. The same stack was observed for tests in
`examples/models/llama3_2_vision/{preprocess,vision_encoder}` and almost
certainly affects the other AOTI call sites we ship. Local M1 reruns of
the affected tests with the CI-pinned torch wheel all pass quickly, so the
hang is CI-environment-specific (suspect: AOTI dlopen under heavy xdist +
coverage contention on the macos-m1-stable runner).

Skip every AOTI-invoking test on macOS CI only, via
`unittest-macos-cmake.sh` -- linux, windows, and local dev continue to run
them. The skip covers the three vision tests (whole files) plus the
specific `*_aoti` methods in `extension/llm/modules/test/test_*.py`.

Job timeout drops back to 30 minutes for fast iteration; pytest
`--timeout=1500` gives any single test 25 minutes before it is treated as
hung.

Co-Authored-By: Claude
---
 .ci/docker/requirements-ci.txt      |  1 +
 .ci/scripts/unittest-macos-cmake.sh | 23 +++++++++++++++++++++--
 .github/workflows/_unittest.yml     |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/.ci/docker/requirements-ci.txt b/.ci/docker/requirements-ci.txt
index c82882d56e6..a29232c6386 100644
--- a/.ci/docker/requirements-ci.txt
+++ b/.ci/docker/requirements-ci.txt
@@ -11,6 +11,7 @@ zstd==1.5.5.1
 pandas>=2.2.2; python_version >= '3.10'
 pytest==7.2.0
 pytest-cov==4.1.0
+pytest-timeout==2.2.0
 expecttest==0.1.6
 hypothesis==6.84.2
 parameterized==0.9.0
diff --git a/.ci/scripts/unittest-macos-cmake.sh b/.ci/scripts/unittest-macos-cmake.sh
index 43eb1f21c3c..5ad76452950 100755
--- a/.ci/scripts/unittest-macos-cmake.sh
+++ b/.ci/scripts/unittest-macos-cmake.sh
@@ -12,8 +12,27 @@ set -eux
 export TORCHINDUCTOR_CACHE_DIR="$(mktemp -d "${RUNNER_TEMP:-/tmp}/torchinductor_cache_XXXXXX")"
 trap 'rm -rf "${TORCHINDUCTOR_CACHE_DIR}"' EXIT
 
-# Run pytest with coverage
-${CONDA_RUN} pytest -n auto --cov=./ --cov-report=xml
+# AOTI-packaged .so invocation (torch._inductor.package._package.__call__)
+# hangs on macOS CI runners. Skip every test that loads and calls an
+# AOTI-packaged module on macOS until the hang is root-caused.
+# TODO(SS-JIA): re-enable once AOTI hang is root-caused.
+AOTI_SKIPS=(
+    --ignore=examples/models/llama3_2_vision/preprocess/test_preprocess.py
+    --ignore=examples/models/llama3_2_vision/vision_encoder/test/test_vision_encoder.py
+    --ignore=examples/models/llama3_2_vision/text_decoder/test/test_text_decoder.py
+    --deselect=extension/llm/modules/test/test_position_embeddings.py::TilePositionalEmbeddingTest::test_tile_positional_embedding_aoti
+    --deselect=extension/llm/modules/test/test_position_embeddings.py::TiledTokenPositionalEmbeddingTest::test_tiled_token_positional_embedding_aoti
+    --deselect=extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_aoti
+)
+
+# Run pytest with coverage. --timeout surfaces hung tests with a thread dump
+# and faulthandler_timeout periodically dumps every worker's threads while
+# tests are still running, so we can see which test is dragging before it
+# trips the hard timeout.
+${CONDA_RUN} pytest -n auto --cov=./ --cov-report=xml \
+    --timeout=1500 --timeout-method=thread \
+    -o faulthandler_timeout=180 \
+    "${AOTI_SKIPS[@]}"
 # Run gtest
 LLVM_PROFDATA="xcrun llvm-profdata" LLVM_COV="xcrun llvm-cov" \
 ${CONDA_RUN} test/run_oss_cpp_tests.sh
diff --git a/.github/workflows/_unittest.yml b/.github/workflows/_unittest.yml
index 15c87bd79e4..e00c3812adc 100644
--- a/.github/workflows/_unittest.yml
+++ b/.github/workflows/_unittest.yml
@@ -49,6 +49,7 @@ jobs:
       python-version: '3.11'
       submodules: 'recursive'
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
+      timeout: 30
       script: |
         set -eux
         # This is needed to get the prebuilt PyTorch wheel from S3

From 28f2173d7e93e1eb4e988dad533f07fb0ba74b30 Mon Sep 17 00:00:00 2001
From: SS-JIA
Date: Thu, 28 May 2026 23:58:53 -0400
Subject: [PATCH 2/2] =?UTF-8?q?CI:=20experiment=20=E2=80=94=20disable=20xd?=
 =?UTF-8?q?ist,=20add=20-v=20for=20unbuffered=20test=20output?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Summary:
Previous experiment used `-n 1` which still spawns an xdist worker process
that buffers all output. The CI logs showed 462 tests' worth of progress
dots flushed at a single timestamp, making it impossible to identify which
test was hanging.

Switch to `-p no:xdist` so tests run in the main process with unbuffered
output, and add `-v` so each test name prints as it starts/completes.
Combined with `faulthandler_timeout=180`, this will show exactly which
test is running when the hang occurs.

Faulthandler dumps from two prior CI runs pointed to the same test
position (~26% through the suite) but caught different XNNPACK passes
mid-retrace (FuseBatchNormPass in one, RemoveRedundantCopyPass in the
other). The common pattern is `super().call()` retracing large delegate
subgraphs; the suspect is `test_all_models_with_recipes`, which runs
ResNet50 / ViT / DeepLabV3 through the full XNNPACK pass pipeline.

This patch also removes the `AOTI_SKIPS` list added in the previous patch
and raises the job timeout from 30 to 90 minutes.

Co-Authored-By: Claude
---
 .ci/scripts/unittest-macos-cmake.sh | 26 ++++++--------------------
 .github/workflows/_unittest.yml     |  2 +-
 2 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/.ci/scripts/unittest-macos-cmake.sh b/.ci/scripts/unittest-macos-cmake.sh
index 5ad76452950..e4072c30db1 100755
--- a/.ci/scripts/unittest-macos-cmake.sh
+++ b/.ci/scripts/unittest-macos-cmake.sh
@@ -12,27 +12,13 @@ set -eux
 export TORCHINDUCTOR_CACHE_DIR="$(mktemp -d "${RUNNER_TEMP:-/tmp}/torchinductor_cache_XXXXXX")"
 trap 'rm -rf "${TORCHINDUCTOR_CACHE_DIR}"' EXIT
 
-# AOTI-packaged .so invocation (torch._inductor.package._package.__call__)
-# hangs on macOS CI runners. Skip every test that loads and calls an
-# AOTI-packaged module on macOS until the hang is root-caused.
-# TODO(SS-JIA): re-enable once AOTI hang is root-caused.
-AOTI_SKIPS=(
-    --ignore=examples/models/llama3_2_vision/preprocess/test_preprocess.py
-    --ignore=examples/models/llama3_2_vision/vision_encoder/test/test_vision_encoder.py
-    --ignore=examples/models/llama3_2_vision/text_decoder/test/test_text_decoder.py
-    --deselect=extension/llm/modules/test/test_position_embeddings.py::TilePositionalEmbeddingTest::test_tile_positional_embedding_aoti
-    --deselect=extension/llm/modules/test/test_position_embeddings.py::TiledTokenPositionalEmbeddingTest::test_tiled_token_positional_embedding_aoti
-    --deselect=extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_aoti
-)
-
-# Run pytest with coverage. --timeout surfaces hung tests with a thread dump
-# and faulthandler_timeout periodically dumps every worker's threads while
-# tests are still running, so we can see which test is dragging before it
-# trips the hard timeout.
-${CONDA_RUN} pytest -n auto --cov=./ --cov-report=xml \
+# EXPERIMENT: run without xdist entirely so output is unbuffered and each test
+# name prints immediately (with -n 1, xdist still buffers all output in a
+# worker process, hiding which test is hanging). -v prints test names as they
+# start; faulthandler_timeout dumps threads if a single test stalls.
+${CONDA_RUN} pytest -p no:xdist -v --cov=./ --cov-report=xml \
     --timeout=1500 --timeout-method=thread \
-    -o faulthandler_timeout=180 \
-    "${AOTI_SKIPS[@]}"
+    -o faulthandler_timeout=180
 # Run gtest
 LLVM_PROFDATA="xcrun llvm-profdata" LLVM_COV="xcrun llvm-cov" \
 ${CONDA_RUN} test/run_oss_cpp_tests.sh
diff --git a/.github/workflows/_unittest.yml b/.github/workflows/_unittest.yml
index e00c3812adc..e63a6bc518c 100644
--- a/.github/workflows/_unittest.yml
+++ b/.github/workflows/_unittest.yml
@@ -49,7 +49,7 @@ jobs:
       python-version: '3.11'
       submodules: 'recursive'
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
-      timeout: 30
+      timeout: 90
       script: |
         set -eux
         # This is needed to get the prebuilt PyTorch wheel from S3
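
Reviewer note: with `-p no:xdist -v`, pytest should print each node ID when the test starts and append the outcome once it finishes, so a hung test appears in the log as a node ID with no outcome word after it. A minimal sketch of pulling such entries out of a saved job log follows. The function name `find_unfinished_tests` and the log file are hypothetical and not part of this series, and the sketch assumes log lines begin with the node ID (no timestamp prefix) and that node IDs contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this patch series): list node IDs that
# appear in a `pytest -v` log without a trailing outcome word.
# Assumption: each log line starts with the node ID, with no timestamp prefix.
find_unfinished_tests() {
    local log_file="$1"
    # 1. Keep lines whose first token looks like a node ID (path::name).
    # 2. Drop lines that already carry a pytest outcome word.
    # 3. Print only the node ID itself.
    grep -E '^[^ ]+::[^ ]+' "${log_file}" \
        | grep -Ev ' (PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)( |$)' \
        | awk '{ print $1 }'
}
```

Usage would look like `find_unfinished_tests pytest.log`, where `pytest.log` is a hypothetical saved copy of the job output. Any node ID it prints is a candidate for the hang and can be cross-checked against the faulthandler dump that follows it in the log.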