speed up catalogue matching with a reverse index#132
Open
SaInekK wants to merge 7 commits into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
match_questions_with_catalogue_instrumentspreviously contained two independent triple-nested loops that, for every input question, scanned every catalogue instrument and tested membership in its question-index list — an O(N · M · K) pattern in both passes (N input questions, M catalogue instruments, K questions per instrument).This PR builds a single reverse index
catalogue_question_idx → [instrument_idx, ...]once (O(M · K)) and reuses it for both lookups, bringing the per-pass cost down to O(N · R) where R is the average number of instruments that contain a given question (typically small).Verification of byte-identical output
A characterization test (
tests/test_match_catalogue_instruments.py) was added before any refactor commit and pins the fullmodel_dump()output of all returned instruments on a small fixture with overlapping catalogue instruments. The test passed onmain(verifying it correctly predicted existing behaviour) and continues to pass after the refactor — proving the refactor is byte-identical on representative input.Additional unit tests cover the helper's contract:
np.intpfromnp.argmaxis hash-compatible with Pythonint)A perf smoke test (
test_large_catalogue_runs_in_under_one_second) guards against catastrophic regression at 10 000 instruments × 20 questions × 100 inputs (budget 1 s).Benchmark (median of 5, after 1 warm-up)
Workload: 100 input questions × 10 000 catalogue instruments × 20 questions each, dim=16.
main(before)Follow-up out of scope
A third loop in the same function (computing
instrument_idx_to_total_num_question_items_present) could use the same reverse index for a further small win. Left out of this PR to keep the diff focused.Fixes # (no upstream issue — happy to file one with the benchmark numbers if you'd like a tracker)
Type of change
Testing
pytest tests/→ all catalogue-matcher tests pass (6 unrelated pre-existing CSV failures onmainare addressed by sibling PR for pandas 2.3+)main(before refactor) and this branch (after).harmonydata/harmonyapiwith this branch as submodule) —/text/matchviaTestClienton the GAD-7 EN/PT payload passes all 7 assertions fromtests/local_tests/test_match_local.py. (Catalogue-matching path itself requiresAZURE_STORAGE_URL; covered by characterization + perf tests in this PR.)Test Configuration
Checklist