perf(brute_force): skip csr_to_coo on inner-product filtered search#2128
perf(brute_force): skip csr_to_coo on inner-product filtered search#2128maxwbuckley wants to merge 1 commit into
Conversation
…ered search The CSR-path of filtered brute-force builds a per-nonzero `rows` array via `raft::sparse::convert::csr_to_coo`. That array is only consumed by `epilogue_on_csr`, which itself only runs for L2 / L2Sqrt / Cosine to combine the masked inner products with the precomputed norms. For InnerProduct the epilogue is skipped, so the allocation and kernel launch were dead work on every IP-metric filtered search. Move the `rmm::device_uvector<IdxT> rows(...)` allocation and the `csr_to_coo` call inside the L2/Cosine branch alongside their only consumer. No behavior change for L2/L2Sqrt/Cosine. For InnerProduct filtered searches that hit the sparse CSR path (sparsity >= 0.9) this saves one device allocation of `nnz * sizeof(IdxT)` and one kernel launch + write pass per call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Max Buckley <maxwbuckley@gmail.com>
Local test resultsBuilt and ran the brute-force test suites against this PR cherry-picked onto
The 156
Total: 391 / 391 passed, 0 failures. ~17 min wall on the prefilter half. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe PR optimizes the filtered brute-force search path by deferring the construction of the COO ChangesDefer COO rows buffer construction in brute-force filtered search
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Summary
Small perf cleanup for the filtered brute-force CSR path in
knn_brute_force.cuh.masked_matmulpath. As part of setup it builds a per-nonzerorowsarray viaraft::sparse::convert::csr_to_coo.rowsarray is only consumed bycuvs::neighbors::detail::epilogue_on_csr, which only runs forL2Expanded/L2SqrtExpanded/CosineExpanded(it combines masked inner products with precomputed norms).InnerProduct, the epilogue is skipped — so the allocation and thecsr_to_cookernel launch were dead work on every IP-metric filtered search hitting the CSR path.This PR hoists the
rmm::device_uvector<IdxT> rows(...)allocation and thecsr_to_coocall inside the L2/Cosine branch, next to their only consumer.What changes
L2Expanded,L2SqrtExpanded,CosineExpanded— same kernels, same order.InnerProductfiltered searches that hit the sparse CSR branch: saves one device allocation ofnnz * sizeof(IdxT)and one kernel launch + write pass per call.Test plan
BRUTE_FORCE_TEST, prefiltered variants) forInnerProduct,L2Expanded,L2SqrtExpanded,CosineExpanded.🤖 Generated with Claude Code