build: rebalance unit test shards to reduce CI critical path (~29m --> ~23m) and use fewer workers (16 --> 10) #38287
The fix needs to be made upstream: openedx/openedx-events#559. Waiting for that to be merged and released before coming back to this PR.
Run pytest with extra reporting enabled to generate files with per-test durations. The file is uploaded as a CI artifact so timing data can be downloaded and used to drive optimal shard rebalancing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
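As a rough illustration of how such a timing file might be consumed, the sketch below assumes the artifact is the JSON-lines log written by pytest-reportlog (the plugin named in the next commit), where each test phase is a `TestReport` record carrying `nodeid`, `when`, and `duration`. The helper name `durations_by_test` and the sample records are hypothetical and are not taken from this PR.

```python
import json
from collections import defaultdict


def durations_by_test(lines):
    """Sum the setup, call, and teardown durations for each test node id."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: only TestReport records carry per-phase durations.
        if record.get("$report_type") != "TestReport":
            continue
        totals[record["nodeid"]] += record.get("duration", 0.0)
    return dict(totals)


# Hypothetical sample standing in for a downloaded CI artifact.
sample_log = [
    '{"$report_type": "SessionStart", "pytest_version": "8.0.0"}',
    '{"$report_type": "TestReport", "nodeid": "a/test_x.py::test_one", "when": "setup", "duration": 0.5}',
    '{"$report_type": "TestReport", "nodeid": "a/test_x.py::test_one", "when": "call", "duration": 1.5}',
    '{"$report_type": "TestReport", "nodeid": "a/test_x.py::test_one", "when": "teardown", "duration": 0.25}',
    '{"$report_type": "TestReport", "nodeid": "b/test_y.py::test_two", "when": "call", "duration": 2.0}',
]

per_test = durations_by_test(sample_log)
```

Summing all three phases matters here because setup and teardown (for example, fixture or modulestore work) count toward a shard's wall-clock time just as much as the test body does.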
Redistribute test paths across 9 shards (down from 16) using a greedy bin-packing optimiser driven by real per-test timing data from pytest-reportlog. Predicted critical path: ~18.7m (down from ~29m).

Key changes:
- Rename shard groups to reflect semantic meaning: lms-*, shared-with-lms-*, shared-with-cms-*, cms-* (openedx/common/xmodule paths explicitly separated from lms-only and cms-only paths)
- Split lms/djangoapps/discussion/ into its 4 subdirectories so the heavy rest_api/ shard (15.7m) can be distributed across bins independently
- Remove outdated comment referencing unit-tests-gh-hosted.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
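The optimiser itself is not shown in this conversation. A minimal sketch of one common greedy bin-packing heuristic (longest path first, always onto the currently lightest shard) could look like the following. The function name `pack_shards`, the path names, and the minute values are all hypothetical.

```python
import heapq


def pack_shards(path_minutes, shard_count):
    """Greedy longest-first packing: place each path on the lightest shard so far."""
    # Heap entries are (total_minutes, shard_index); ties go to the lowest index.
    heap = [(0.0, i) for i in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    # Sort by descending duration, then by name so the result is deterministic.
    for path, minutes in sorted(path_minutes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(path)
        heapq.heappush(heap, (total + minutes, idx))
    loads = [0.0] * shard_count
    for total, idx in heap:
        loads[idx] = total
    return shards, loads


# Hypothetical per-path timings in minutes (not measurements from this PR).
timings = {
    "app_a/": 8.0,
    "app_b/": 6.0,
    "app_c/": 5.0,
    "app_d/": 4.0,
    "app_e/": 3.0,
    "app_f/": 2.0,
}

shards, loads = pack_shards(timings, shard_count=3)
critical_path = max(loads)  # the slowest shard bounds the whole CI run
```

This also shows why splitting a heavy directory helps: a single item longer than the target shard time sets a floor on the critical path no matter how the remaining paths are arranged.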
…ubclasses

Three test classes in the certificates app were calling CourseFactory() in setUp() despite extending SharedModuleStoreTestCase. Unlike ModuleStoreTestCase, SharedModuleStoreTestCase shares a single modulestore across all tests in the class and only closes MongoDB connections at tearDownClass. Calling CourseFactory() in setUp() created a new MongoDB course (and opened connections) for every test method without releasing them, causing connection accumulation across the full test run.

Affected classes:
- CertificateFiltersTest (test_filters.py)
- CertificateInvalidationTest (test_models.py)
- CertificateAllowlistTest (test_models.py)

In each case the course is only read by test methods (test data such as users, enrollments and certificates is written via Django ORM and rolled back between tests), so sharing a single course across the class is correct.

See: https://github.com/openedx/openedx-platform/blob/master/xmodule/modulestore/tests/django_utils.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
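The difference between the two patterns can be sketched with plain `unittest` and a counting stand-in. `fake_course_factory` below is a hypothetical substitute for CourseFactory() so the example runs without the platform; it only counts creations and does not model MongoDB connections.

```python
import unittest

created = []  # records every fake "course" the stand-in factory creates


def fake_course_factory():
    """Hypothetical stand-in for CourseFactory(); it only counts creations."""
    created.append(object())
    return created[-1]


class PerTestCourse(unittest.TestCase):
    """Pattern being removed: a new course for every test method."""

    def setUp(self):
        self.course = fake_course_factory()

    def test_a(self):
        self.assertIsNotNone(self.course)

    def test_b(self):
        self.assertIsNotNone(self.course)


class SharedCourse(unittest.TestCase):
    """Shared pattern: one course created once for the whole class."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.course = fake_course_factory()

    def test_a(self):
        self.assertIsNotNone(self.course)

    def test_b(self):
        self.assertIsNotNone(self.course)


def count_creations(case):
    """Run one test case class and return how many courses it created."""
    created.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    assert result.wasSuccessful()
    return len(created)
```

With two test methods the per-test class creates two courses and the shared class creates one; the gap grows with the number of test methods in the class.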
…vents.testing

openedx_events/tests/utils.py was moved to openedx_events/testing.py in openedx/openedx-events#559 so the test utilities are included in the installed package (setup.py excludes the tests/ subpackage from the wheel).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When OpenEdxEventsTestMixin was listed after a TestCase subclass (e.g. Foo(SharedModuleStoreTestCase, OpenEdxEventsTestMixin)), it landed after unittest.case.TestCase in the MRO. Since unittest.case.TestCase.setUpClass and tearDownClass do not call super(), the mixin's lifecycle methods never ran. The workaround was to manually call cls.start_events_isolation() in each class's setUpClass, but there was no corresponding tearDownClass to restore event state, causing events disabled by one test class to leak into subsequent classes in the same process.

Fix by placing OpenEdxEventsTestMixin first in the base class list so it appears before unittest.case.TestCase in the MRO. This lets setUpClass and tearDownClass run automatically through the cooperative super() chain, removing the need for manual start_events_isolation() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
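The MRO effect described above can be reproduced with plain `unittest`. In this sketch `EventsMixin` is a hypothetical stand-in for OpenEdxEventsTestMixin that only records when its class-level hooks run; it does not reproduce the real mixin's behaviour.

```python
import unittest

calls = []  # records which lifecycle hooks of the mixin actually ran


class EventsMixin:
    """Hypothetical stand-in for OpenEdxEventsTestMixin."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        calls.append(("setUpClass", cls.__name__))

    @classmethod
    def tearDownClass(cls):
        calls.append(("tearDownClass", cls.__name__))
        super().tearDownClass()


class MixinLast(unittest.TestCase, EventsMixin):
    """Mixin after TestCase: TestCase.setUpClass wins and never calls super()."""

    def test_nothing(self):
        pass


class MixinFirst(EventsMixin, unittest.TestCase):
    """Mixin before TestCase: its hooks run, then chain on to TestCase."""

    def test_nothing(self):
        pass


def lifecycle_calls(case):
    """Run one test case class and return the mixin hooks that fired."""
    calls.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    suite.run(unittest.TestResult())
    return list(calls)
```

Inspecting `MixinLast.__mro__` shows `unittest.TestCase` ahead of `EventsMixin`, which is why the mixin's hooks are silently skipped in that ordering.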
This mixin is already included via one of the other mixins on this test class so including it again was messing with the MRO for the test classes.
contentstore/ is large enough that the cms-1 runner was being killed mid-run in CI (OOM or runner-level timeout). Splitting it into its own shard keeps each job under the ~20-25 min target. No changes needed to gha_unit_tests_collector.py — it already classifies any shard whose first path starts with "cms/" as a CMS shard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
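The classification rule mentioned above amounts to a prefix check on the shard's first path. The sketch below is not the code from gha_unit_tests_collector.py; the function name `is_cms_shard`, the `paths` key, and the example shard contents are assumptions made for illustration.

```python
def is_cms_shard(shard):
    """Return True if the shard's first test path starts with "cms/"."""
    # Assumption: each shard is a dict holding its test paths under "paths".
    paths = shard.get("paths", [])
    return bool(paths) and paths[0].startswith("cms/")


# Hypothetical shard definitions; the real shard config may be shaped differently.
example_shards = {
    "cms-2": {"paths": ["cms/djangoapps/contentstore/"]},
    "lms-1": {"paths": ["lms/djangoapps/discussion/rest_api/"]},
}

cms_shard_names = [name for name, shard in example_shards.items() if is_cms_shard(shard)]
```

Because only the first path is checked, a new shard made up entirely of cms/ paths is picked up as a CMS shard without any change to the collector.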
| ] | ||
| }, | ||
| "openedx-1-with-cms": { | ||
| "shared-with-cms-1": { |
It was nice for debugging that for each X-with-lms shard there was a corresponding X-with-cms, especially for those tests which would pass in one system and fail in the other. Would you be willing to change it so that we have parallel shared-with-lms-[1,2] and shared-with-cms-[1,2] shards? It would only add one additional shard and I don't think it'd increase the critical test time.
If not, then could you simplify shared-with-cms-1 definition into just the paths xmodule/, common/, and openedx/?
I split them for convenience for now. I didn't want to collapse the tests because that will make it harder to re-balance them in the future since it would require more lookups to do the rebalancing.
Did you catch this feedback?
It was nice for debugging that for each X-with-lms shard there was a corresponding X-with-cms, especially for those tests which would pass in one system and fail in the other. Would you be willing to change it so that we have parallel shared-with-lms-[1,2] and shared-with-cms-[1,2] shards? It would only add one additional shard and I don't think it'd increase the critical test time.
Never mind, just saw your commit.
TIL that apparently we don't run all of the openedx apps under CMS, just some of them, and if we try to run all of them there are issues:
- https://github.com/openedx/openedx-platform/actions/runs/24354051533/job/71116066612?pr=38287
- https://github.com/openedx/openedx-platform/actions/runs/24354051533/job/71116066418?pr=38287
These folders are run under LMS but not CMS:
openedx/core/djangoapps/course_live/
openedx/core/djangoapps/notifications/
openedx/core/djangolib/
openedx/core/tests/
openedx/features/
openedx/testing/
What do you think about landing this as is? I think it could be further improved and there's more to investigate but I don't want this to be blocked on existing issues.
Wow that's crazy
Yeah, in that case, totally OK with there being just one CMS shard
I wrote a followup issue, do you mind linking it here with a TODO comment? #38355
Otherwise, LGTM ✅
Per #38287 (comment): having the shared-with... tests correspond between the LMS and CMS makes it easier to quickly spot tests that are failing in one context but not the other. Since neither of these is the longest run, we pay a bit more in overhead for this, but it's still an improvement over what we had.
@kdmccormick take a look at my latest comment in the conversation above. The extra commit revealed issues. I could still split shared-cms into 2 shards by removing those highlighted test files, but I'm not sure it's valuable. What do you think?
Summary
Rebalances the unit test shard configuration to reduce the CI critical path from ~29 minutes down to ~23 minutes, measured across 3 consistent runs.
What changed
Timing data — 3 runs on the new config (all passed)
Shards: shared-with-cms-1, shared-with-lms-1, shared-with-lms-2, lms-4, lms-1, lms-5, cms-2, lms-2, lms-3, cms-1
For comparison, the old config's critical path (visible in the #38347 run on unmodified master) was ~29m on lms-4.