From 5d73d63d2bc8f89c40a5278ea657ad5d8fa35ae8 Mon Sep 17 00:00:00 2001
From: SS-JIA
Date: Thu, 28 May 2026 11:38:10 -0400
Subject: [PATCH 1/2] CI: identify hanging tests in macOS unittest job

Summary:
The macOS unittest job has hit its timeout for several runs in a row, with
the progress bar frozen partway through pytest. Add `pytest-timeout` so a
stuck test fails with a nodeid and per-thread stack trace, and set
`faulthandler_timeout=180` so any test that runs longer than 3 minutes gets
the thread stacks of its xdist worker dumped while it is still running --
this surfaces the hung test as it develops, not just at termination.

Diagnostic runs identified the hang fingerprint: tests stuck inside
`torch._inductor.package._package.__call__` (line 736), the AOTI-packaged
`.so` invocation. The same stack was observed for tests in
`examples/models/llama3_2_vision/{preprocess,vision_encoder}` and almost
certainly affects the other AOTI call sites we ship. Local M1 reruns of the
affected tests with the CI-pinned torch wheel all pass quickly, so the hang
is CI-environment-specific (suspect: AOTI dlopen under heavy xdist + coverage
contention on the macos-m1-stable runner).

Skip every AOTI-invoking test on macOS CI only, via
`unittest-macos-cmake.sh` -- linux, windows, and local dev continue to run
them. The skip covers three whole test files under
`examples/models/llama3_2_vision` (preprocess, vision_encoder, text_decoder)
plus the specific `*_aoti` methods in `extension/llm/modules/test/test_*.py`.

Job timeout drops back to 30 minutes for fast iteration; pytest
`--timeout=1500` gives any single test 25 minutes before it is treated as
hung.

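For reviewers unfamiliar with the mechanism: as far as I understand it,
pytest's built-in faulthandler support arms a one-shot
`faulthandler.dump_traceback_later` timer around each test and cancels it
when the test finishes. The sketch below reproduces that idea outside pytest.
`slow_test_body` and `dump_if_slow` are hypothetical names used only for
illustration, and the sub-second timeouts stand in for the 180 seconds
configured here.

```python
import faulthandler
import tempfile
import time


def slow_test_body(seconds: float) -> None:
    """Stand-in for a test that is stuck in a long blocking call."""
    time.sleep(seconds)


def dump_if_slow(timeout: float, runtime: float) -> str:
    """Arm a one-shot watchdog, run the body, and return whatever was dumped."""
    # faulthandler writes straight to a file descriptor, so a real file
    # (not io.StringIO) is required.
    with tempfile.TemporaryFile(mode="w+") as log:
        faulthandler.dump_traceback_later(timeout, repeat=False, file=log)
        try:
            slow_test_body(runtime)
        finally:
            # Disarm the watchdog once the body has returned.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    # The body outlives the watchdog, so the stacks of all threads are
    # dumped while it is still sleeping.
    print(dump_if_slow(timeout=0.2, runtime=0.6))
```

A body that finishes before the timeout produces no dump at all, which is
why a healthy run should stay quiet with this option enabled.
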
Co-Authored-By: Claude
---
 .ci/docker/requirements-ci.txt      |  1 +
 .ci/scripts/unittest-macos-cmake.sh | 23 +++++++++++++++++++++--
 .github/workflows/_unittest.yml     |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/.ci/docker/requirements-ci.txt b/.ci/docker/requirements-ci.txt
index c82882d56e6..a29232c6386 100644
--- a/.ci/docker/requirements-ci.txt
+++ b/.ci/docker/requirements-ci.txt
@@ -11,6 +11,7 @@ zstd==1.5.5.1
 pandas>=2.2.2; python_version >= '3.10'
 pytest==7.2.0
 pytest-cov==4.1.0
+pytest-timeout==2.2.0
 expecttest==0.1.6
 hypothesis==6.84.2
 parameterized==0.9.0
diff --git a/.ci/scripts/unittest-macos-cmake.sh b/.ci/scripts/unittest-macos-cmake.sh
index 43eb1f21c3c..5ad76452950 100755
--- a/.ci/scripts/unittest-macos-cmake.sh
+++ b/.ci/scripts/unittest-macos-cmake.sh
@@ -12,8 +12,27 @@ set -eux
 export TORCHINDUCTOR_CACHE_DIR="$(mktemp -d "${RUNNER_TEMP:-/tmp}/torchinductor_cache_XXXXXX")"
 trap 'rm -rf "${TORCHINDUCTOR_CACHE_DIR}"' EXIT
 
-# Run pytest with coverage
-${CONDA_RUN} pytest -n auto --cov=./ --cov-report=xml
+# AOTI-packaged .so invocation (torch._inductor.package._package.__call__)
+# hangs on macOS CI runners. Skip every test that loads and calls an
+# AOTI-packaged module on macOS until the hang is root-caused.
+# TODO(SS-JIA): re-enable once AOTI hang is root-caused.
+AOTI_SKIPS=(
+  --ignore=examples/models/llama3_2_vision/preprocess/test_preprocess.py
+  --ignore=examples/models/llama3_2_vision/vision_encoder/test/test_vision_encoder.py
+  --ignore=examples/models/llama3_2_vision/text_decoder/test/test_text_decoder.py
+  --deselect=extension/llm/modules/test/test_position_embeddings.py::TilePositionalEmbeddingTest::test_tile_positional_embedding_aoti
+  --deselect=extension/llm/modules/test/test_position_embeddings.py::TiledTokenPositionalEmbeddingTest::test_tiled_token_positional_embedding_aoti
+  --deselect=extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_aoti
+)
+
+# Run pytest with coverage. --timeout surfaces hung tests with a thread dump
+# and faulthandler_timeout periodically dumps every worker's threads while
+# tests are still running, so we can see which test is dragging before it
+# trips the hard timeout.
+${CONDA_RUN} pytest -n auto --cov=./ --cov-report=xml \
+  --timeout=1500 --timeout-method=thread \
+  -o faulthandler_timeout=180 \
+  "${AOTI_SKIPS[@]}"
 # Run gtest
 LLVM_PROFDATA="xcrun llvm-profdata" LLVM_COV="xcrun llvm-cov" \
 ${CONDA_RUN} test/run_oss_cpp_tests.sh
diff --git a/.github/workflows/_unittest.yml b/.github/workflows/_unittest.yml
index 15c87bd79e4..e00c3812adc 100644
--- a/.github/workflows/_unittest.yml
+++ b/.github/workflows/_unittest.yml
@@ -49,6 +49,7 @@ jobs:
       python-version: '3.11'
       submodules: 'recursive'
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
+      timeout: 30
       script: |
         set -eux
         # This is needed to get the prebuilt PyTorch wheel from S3

From c43838ff7883f87358134a45201c4cf585075997 Mon Sep 17 00:00:00 2001
From: SS-JIA
Date: Thu, 28 May 2026 23:58:53 -0400
Subject: [PATCH 2/2] =?UTF-8?q?CI:=20experiment=20=E2=80=94=20drop=20macOS?=
 =?UTF-8?q?=20pytest=20to=20-n=201,=20remove=20AOTI=20skips?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Summary:
Test the hypothesis that the macOS AOTI hangs are caused by xdist-worker
contention (parallel clang/ld during AOTI compile, dlopen lock on darwin,
libomp oversubscription) rather than a true deadlock.

Switch the macOS pytest invocation from `-n auto` to `-n 1` and remove the
AOTI skips from the previous commit so the previously-hung tests actually
run. Bump the job timeout to 90 minutes so a serial run has room to finish;
keep the per-test `--timeout=1500`.

If the AOTI tests pass under `-n 1`, the right permanent fix is splitting
macOS into a serial AOTI lane plus a parallel everything-else lane (or
simply lowering the parallelism cap).
If they still hang, the issue is deeper than xdist contention and we keep
the skips.

Co-Authored-By: Claude
---
 .ci/scripts/unittest-macos-cmake.sh | 28 ++++++++--------------------
 .github/workflows/_unittest.yml     |  2 +-
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/.ci/scripts/unittest-macos-cmake.sh b/.ci/scripts/unittest-macos-cmake.sh
index 5ad76452950..8c1d5df0879 100755
--- a/.ci/scripts/unittest-macos-cmake.sh
+++ b/.ci/scripts/unittest-macos-cmake.sh
@@ -12,27 +12,15 @@ set -eux
 export TORCHINDUCTOR_CACHE_DIR="$(mktemp -d "${RUNNER_TEMP:-/tmp}/torchinductor_cache_XXXXXX")"
 trap 'rm -rf "${TORCHINDUCTOR_CACHE_DIR}"' EXIT
 
-# AOTI-packaged .so invocation (torch._inductor.package._package.__call__)
-# hangs on macOS CI runners. Skip every test that loads and calls an
-# AOTI-packaged module on macOS until the hang is root-caused.
-# TODO(SS-JIA): re-enable once AOTI hang is root-caused.
-AOTI_SKIPS=(
-  --ignore=examples/models/llama3_2_vision/preprocess/test_preprocess.py
-  --ignore=examples/models/llama3_2_vision/vision_encoder/test/test_vision_encoder.py
-  --ignore=examples/models/llama3_2_vision/text_decoder/test/test_text_decoder.py
-  --deselect=extension/llm/modules/test/test_position_embeddings.py::TilePositionalEmbeddingTest::test_tile_positional_embedding_aoti
-  --deselect=extension/llm/modules/test/test_position_embeddings.py::TiledTokenPositionalEmbeddingTest::test_tiled_token_positional_embedding_aoti
-  --deselect=extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_aoti
-)
-
-# Run pytest with coverage. --timeout surfaces hung tests with a thread dump
-# and faulthandler_timeout periodically dumps every worker's threads while
-# tests are still running, so we can see which test is dragging before it
-# trips the hard timeout.
-${CONDA_RUN} pytest -n auto --cov=./ --cov-report=xml \
+# EXPERIMENT: drop xdist (`-n 1`) on macOS to test whether AOTI hangs are
+# caused by parallel-worker contention (clang/ld, dlopen lock, libomp
+# oversubscription) rather than a true deadlock. AOTI skips removed so we
+# can observe whether the previously-hung tests now pass serially.
+# --timeout surfaces hung tests with a thread dump and faulthandler_timeout
+# periodically dumps every worker's threads while tests are still running.
+${CONDA_RUN} pytest -n 1 --cov=./ --cov-report=xml \
   --timeout=1500 --timeout-method=thread \
-  -o faulthandler_timeout=180 \
-  "${AOTI_SKIPS[@]}"
+  -o faulthandler_timeout=180
 # Run gtest
 LLVM_PROFDATA="xcrun llvm-profdata" LLVM_COV="xcrun llvm-cov" \
 ${CONDA_RUN} test/run_oss_cpp_tests.sh
diff --git a/.github/workflows/_unittest.yml b/.github/workflows/_unittest.yml
index e00c3812adc..e63a6bc518c 100644
--- a/.github/workflows/_unittest.yml
+++ b/.github/workflows/_unittest.yml
@@ -49,7 +49,7 @@ jobs:
       python-version: '3.11'
       submodules: 'recursive'
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
-      timeout: 30
+      timeout: 90
       script: |
         set -eux
         # This is needed to get the prebuilt PyTorch wheel from S3