speed up catalogue matching with a reverse index by SaInekK · Pull Request #132 · harmonydata/harmony

SaInekK · 2026-05-20T23:16:33Z

Description

match_questions_with_catalogue_instruments previously contained two independent triple-nested loops that, for every input question, scanned every catalogue instrument and tested membership in its question-index list — an O(N · M · K) pattern in both passes (N input questions, M catalogue instruments, K questions per instrument).

This PR builds a single reverse index catalogue_question_idx → [instrument_idx, ...] once (O(M · K)) and reuses it for both lookups, bringing the per-pass cost down to O(N · R) where R is the average number of instruments that contain a given question (typically small).

Verification of byte-identical output

A characterization test (tests/test_match_catalogue_instruments.py) was added before any refactor commit and pins the full model_dump() output of all returned instruments on a small fixture with overlapping catalogue instruments. The test passed on main (verifying it correctly predicted existing behaviour) and continues to pass after the refactor — proving the refactor is byte-identical on representative input.

Additional unit tests cover the helper's contract:

dedup of duplicate question indices within a single instrument
NumPy / Python int key interchangeability (np.intp from np.argmax is hash-compatible with Python int)

A perf smoke test (test_large_catalogue_runs_in_under_one_second) guards against catastrophic regression at 10 000 instruments × 20 questions × 100 inputs (budget 1 s).

Benchmark (median of 5, after 1 warm-up)

Workload: 100 input questions × 10 000 catalogue instruments × 20 questions each, dim=16.

Version	Median time
`main` (before)	786.9 ms
this branch	349.2 ms
Speed-up	2.25×

Follow-up out of scope

A third loop in the same function (computing instrument_idx_to_total_num_question_items_present) could use the same reverse index for a further small win. Left out of this PR to keep the diff focused.

Fixes # (no upstream issue — happy to file one with the benchmark numbers if you'd like a tracker)

Type of change

Non-breaking refactor (no public-API change). Output is byte-identical, verified by snapshot test.

Testing

pytest tests/ → all catalogue-matcher tests pass (6 unrelated pre-existing CSV failures on main are addressed by sibling PR for pandas 2.3+)
Characterization snapshot test passes on both main (before refactor) and this branch (after).
Harmony API integration smoke test (harmonydata/harmonyapi with this branch as submodule) — /text/match via TestClient on the GAD-7 EN/PT payload passes all 7 assertions from tests/local_tests/test_match_local.py. (Catalogue-matching path itself requires AZURE_STORAGE_URL; covered by characterization + perf tests in this PR.)

Test Configuration

Library version: this branch
OS: macOS 14
Toolchain: Python 3.11, numpy 1.26.4, torch 2.2.2

Checklist

My PR is for one issue
Self-reviewed
Code commented in non-obvious areas (helper docstring covers ordering, dedup, int/np.intp interop)
No new warnings
Added characterization, unit, and perf tests
Existing tests pass locally

SaInekK added 7 commits May 20, 2026 14:02

add characterization tests for catalogue matcher

560d24b

add reverse-index helper for catalogue question -> instruments

764291c

use reverse index for input-question to instrument lookup

fd6f132

use reverse index for instrument similarity collection

1565a91

add perf smoke test for catalogue matching

25c1129

drop redundant int() cast on np.intp dict key

0f38fef

hoist imports, narrow set type, relax perf budget

8fadbfd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

speed up catalogue matching with a reverse index#132

speed up catalogue matching with a reverse index#132
SaInekK wants to merge 7 commits into
harmonydata:mainfrom
SaInekK:pr/catalogue-matcher-reverse-index

SaInekK commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SaInekK commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Verification of byte-identical output

Benchmark (median of 5, after 1 warm-up)

Follow-up out of scope

Fixes # (no upstream issue — happy to file one with the benchmark numbers if you'd like a tracker)

Type of change

Testing

Test Configuration

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SaInekK commented May 20, 2026 •

edited

Loading