Skip to content

speed up catalogue matching with a reverse index#132

Open
SaInekK wants to merge 7 commits into
harmonydata:mainfrom
SaInekK:pr/catalogue-matcher-reverse-index
Open

speed up catalogue matching with a reverse index#132
SaInekK wants to merge 7 commits into
harmonydata:mainfrom
SaInekK:pr/catalogue-matcher-reverse-index

Conversation

@SaInekK
Copy link
Copy Markdown

@SaInekK SaInekK commented May 20, 2026

Description

match_questions_with_catalogue_instruments previously contained two independent triple-nested loops that, for every input question, scanned every catalogue instrument and tested membership in its question-index list — an O(N · M · K) pattern in both passes (N input questions, M catalogue instruments, K questions per instrument).

This PR builds a single reverse index catalogue_question_idx → [instrument_idx, ...] once (O(M · K)) and reuses it for both lookups, bringing the per-pass cost down to O(N · R) where R is the average number of instruments that contain a given question (typically small).

Verification of byte-identical output

A characterization test (tests/test_match_catalogue_instruments.py) was added before any refactor commit and pins the full model_dump() output of all returned instruments on a small fixture with overlapping catalogue instruments. The test passed on main (verifying it correctly predicted existing behaviour) and continues to pass after the refactor — proving the refactor is byte-identical on representative input.

Additional unit tests cover the helper's contract:

  • dedup of duplicate question indices within a single instrument
  • NumPy / Python int key interchangeability (np.intp from np.argmax is hash-compatible with Python int)

A perf smoke test (test_large_catalogue_runs_in_under_one_second) guards against catastrophic regression at 10 000 instruments × 20 questions × 100 inputs (budget 1 s).

Benchmark (median of 5, after 1 warm-up)

Workload: 100 input questions × 10 000 catalogue instruments × 20 questions each, dim=16.

Version Median time
main (before) 786.9 ms
this branch 349.2 ms
Speed-up 2.25×

Follow-up out of scope

A third loop in the same function (computing instrument_idx_to_total_num_question_items_present) could use the same reverse index for a further small win. Left out of this PR to keep the diff focused.

Fixes # (no upstream issue — happy to file one with the benchmark numbers if you'd like a tracker)

Type of change

  • Non-breaking refactor (no public-API change). Output is byte-identical, verified by snapshot test.

Testing

  • pytest tests/ → all catalogue-matcher tests pass (6 unrelated pre-existing CSV failures on main are addressed by sibling PR for pandas 2.3+)
  • Characterization snapshot test passes on both main (before refactor) and this branch (after).
  • Harmony API integration smoke test (harmonydata/harmonyapi with this branch as submodule) — /text/match via TestClient on the GAD-7 EN/PT payload passes all 7 assertions from tests/local_tests/test_match_local.py. (Catalogue-matching path itself requires AZURE_STORAGE_URL; covered by characterization + perf tests in this PR.)

Test Configuration

  • Library version: this branch
  • OS: macOS 14
  • Toolchain: Python 3.11, numpy 1.26.4, torch 2.2.2

Checklist

  • My PR is for one issue
  • Self-reviewed
  • Code commented in non-obvious areas (helper docstring covers ordering, dedup, int/np.intp interop)
  • No new warnings
  • Added characterization, unit, and perf tests
  • Existing tests pass locally

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant