Skip to content

perf(brute_force): skip csr_to_coo on inner-product filtered search#2128

Open
maxwbuckley wants to merge 1 commit into
rapidsai:mainfrom
maxwbuckley:perf/brute-force-skip-csr-to-coo-on-ip
Open

perf(brute_force): skip csr_to_coo on inner-product filtered search#2128
maxwbuckley wants to merge 1 commit into
rapidsai:mainfrom
maxwbuckley:perf/brute-force-skip-csr-to-coo-on-ip

Conversation

@maxwbuckley
Copy link
Copy Markdown
Contributor

@maxwbuckley maxwbuckley commented May 26, 2026

Summary

Small perf cleanup for the filtered brute-force CSR path in knn_brute_force.cuh.

  • When sparsity ≥ 0.9, filtered brute-force takes the sparse CSR / masked_matmul path. As part of setup it builds a per-nonzero rows array via raft::sparse::convert::csr_to_coo.
  • That rows array is only consumed by cuvs::neighbors::detail::epilogue_on_csr, which only runs for L2Expanded / L2SqrtExpanded / CosineExpanded (it combines masked inner products with precomputed norms).
  • For InnerProduct, the epilogue is skipped — so the allocation and the csr_to_coo kernel launch were dead work on every IP-metric filtered search hitting the CSR path.

This PR hoists the rmm::device_uvector<IdxT> rows(...) allocation and the csr_to_coo call inside the L2/Cosine branch, next to their only consumer.

What changes

  • No behavior change for L2Expanded, L2SqrtExpanded, CosineExpanded — same kernels, same order.
  • For InnerProduct filtered searches that hit the sparse CSR branch: saves one device allocation of nnz * sizeof(IdxT) and one kernel launch + write pass per call.

Test plan

  • Existing filtered brute-force tests pass (BRUTE_FORCE_TEST, prefiltered variants) for InnerProduct, L2Expanded, L2SqrtExpanded, CosineExpanded.
  • No diff in output for L2 / Cosine vs. base.

🤖 Generated with Claude Code

…ered search

The CSR-path of filtered brute-force builds a per-nonzero `rows` array
via `raft::sparse::convert::csr_to_coo`. That array is only consumed by
`epilogue_on_csr`, which itself only runs for L2 / L2Sqrt / Cosine to
combine the masked inner products with the precomputed norms. For
InnerProduct the epilogue is skipped, so the allocation and kernel
launch were dead work on every IP-metric filtered search.

Move the `rmm::device_uvector<IdxT> rows(...)` allocation and the
`csr_to_coo` call inside the L2/Cosine branch alongside their only
consumer. No behavior change for L2/L2Sqrt/Cosine. For InnerProduct
filtered searches that hit the sparse CSR path (sparsity >= 0.9) this
saves one device allocation of `nnz * sizeof(IdxT)` and one kernel
launch + write pass per call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Buckley <maxwbuckley@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aamijar aamijar added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 26, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing May 26, 2026
@maxwbuckley
Copy link
Copy Markdown
Contributor Author

Local test results

Built and ran the brute-force test suites against this PR cherry-picked onto upstream/main (commit 4addf4e). Built with CUDA 13.2, SM 89, RTX 5090.

NEIGHBORS_TEST (contains brute_force_prefiltered.cu — the suite that exercises the modified brute_force_search_filtered code path):

Suite Cases Result
PrefilteredBruteForceOnBitmapTest_float_int64 39 ✅ pass
PrefilteredBruteForceOnBitmapTest_half_int64 39 ✅ pass
PrefilteredBruteForceOnBitsetTest_float_int64 39 ✅ pass
PrefilteredBruteForceOnBitsetTest_half_int64 39 ✅ pass
KNNTest + RandomBruteForceKNNTest + SparseKNNTest + RefineTest + NNTest (rest of the binary) 157 ✅ pass
NEIGHBORS_TEST total 313/313

NEIGHBORS_ANN_BRUTE_FORCE_TEST (dense ANN brute-force smoke):

Suite Cases Result
AnnBruteForceTest_float 39 ✅ pass
AnnBruteForceTest_half_float 39 ✅ pass
total 78/78

The 156 PrefilteredBruteForce* cases cover the full (metric × bitmap-or-bitset × float-or-half × sparsity) matrix, including:

  • All four metrics that this PR's if guard discriminates on: InnerProduct (the path that now skips the csr_to_coo + rows allocation), L2Expanded, L2SqrtExpanded, CosineExpanded (paths that still hit epilogue_on_csr with a now-locally-allocated rows buffer).
  • Sparsities at 0.01, 0.1, 0.2, 0.4, 0.5 — i.e. both the dense tiled_brute_force_knn branch (sparsity < sparse_filter_threshold = 0.9) and the sparse CSR / masked_matmul branch (sparsity ≥ 0.9). The PR only touches the sparse branch.

Total: 391 / 391 passed, 0 failures. ~17 min wall on the prefilter half.

@maxwbuckley maxwbuckley marked this pull request as ready for review May 26, 2026 23:37
@maxwbuckley maxwbuckley requested a review from a team as a code owner May 26, 2026 23:37
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dbb26125-9d4f-426e-b767-cbd477fa9ad7

📥 Commits

Reviewing files that changed from the base of the PR and between 4addf4e and 23059ea.

📒 Files selected for processing (1)
  • cpp/src/neighbors/detail/knn_brute_force.cuh

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Optimized the filtered brute-force search path to defer CSR to COO conversion to the post-processing step, improving efficiency in the epilogue handling for L2 and Cosine distance metrics.

Walkthrough

The PR optimizes the filtered brute-force search path by deferring the construction of the COO rows buffer. The buffer is now built only when needed—just before the CSR epilogue call—rather than unconditionally earlier. dataset_view construction has been reordered to occur before the deferred rows buffer computation.

Changes

Defer COO rows buffer construction in brute-force filtered search

Layer / File(s) Summary
Defer COO rows buffer construction to CSR epilogue
cpp/src/neighbors/detail/knn_brute_force.cuh
The rows buffer construction from the early section is removed, and is now built only at the point just before the CSR epilogue call with a comment indicating it is only needed for L2/Cosine epilogue, while dataset_view construction is moved earlier.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • cjnolet
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and accurately describes the main optimization: skipping unnecessary csr_to_coo conversion for inner-product filtered searches on the sparse CSR path.
Description check ✅ Passed The description is well-related to the changeset, clearly explaining why the change is made, what it affects, and what the expected outcomes are.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants