fastaggregation: avoid runContainer16.lazyIOR slow path in FastOr#527
Merged
(*Bitmap).lazyOR's same-key branch calls c1.lazyIOR(c2). When c1 is a runContainer16, lazyIOR falls back to ior -> inplaceUnion -> Add -> searchRange, which is O(N · log R) per element merged. This makes FastOr catastrophically slow over inputs containing run-encoded blocks: a synthetic benchmark with 15 bitmaps of ~6000-bit runs takes ~335 ms/op before this change.

Pre-promote runContainer16 slots to bitmapContainer before the inner lazyIOR call. The bitmapContainer's lazy union is O(1024) regardless of cardinality, so the K-way fan-in stays linear in K. Mirrors the explicit toBitmapContainer pre-promotion in parallel.go ParHeapOr and the Java BitmapContainer.lazyor(RunContainer) path.

Issue RoaringBitmap#81 has been tracking the deeper fix (proper runContainer16.lazyIOR/lazyOR implementations) since 2016; this is the surgical workaround at the single FastOr call site.

BenchmarkFastOrRunContainers (added):
  before: 335 ms/op   12 KB/op    447 allocs/op
  after:  637 µs/op   335 KB/op   257 allocs/op   (~526x)

The output container type for slots that started as runContainer16 inputs is now bitmapContainer (or arrayContainer after repairAfterLazy if sparse). repairAfterLazy does not re-encode to runs; callers that want a run-optimised result should call RunOptimize() on the FastOr return value. ParHeapOr already behaves the same way.

The full v2 test suite passes.

Refs: RoaringBitmap#81
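The pre-promotion described above can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the library's code: the types `interval`, `runs`, and `words` and the functions `toWords`, `lazyOr`, `cardinality`, and `fastOr` are hypothetical stand-ins for the unexported container types. It shows only the shape of the idea: promote each run-encoded input to a fixed 1024-word bitmap, union word by word without tracking cardinality, and count bits once at the end.

```go
package main

import (
	"fmt"
	"math/bits"
)

// interval is an inclusive run [start, last] of 16-bit values.
// Hypothetical stand-in for one run of a run-encoded container.
type interval struct{ start, last uint16 }

// runs is a toy run-encoded container (not the library's runContainer16).
type runs []interval

// words is a toy bitmap container: 65536 bits stored in 1024 uint64 words.
type words [1024]uint64

// toWords promotes a run-encoded container to the fixed-size bitmap form.
// The input is left untouched; a new container is returned.
func (r runs) toWords() *words {
	var w words
	for _, iv := range r {
		for v := uint32(iv.start); v <= uint32(iv.last); v++ {
			w[v>>6] |= uint64(1) << (v & 63)
		}
	}
	return &w
}

// lazyOr unions src into dst one word at a time. It always touches
// exactly 1024 words and does not maintain a cardinality.
func (dst *words) lazyOr(src *words) {
	for i := range dst {
		dst[i] |= src[i]
	}
}

// cardinality counts set bits once, after all unions are done.
func (w *words) cardinality() int {
	n := 0
	for _, x := range w {
		n += bits.OnesCount64(x)
	}
	return n
}

// fastOr unions K run-encoded inputs by promoting each one first,
// so the cost per input is a fixed-size word-wise OR.
func fastOr(inputs ...runs) *words {
	acc := new(words)
	for _, in := range inputs {
		acc.lazyOr(in.toWords())
	}
	return acc
}

func main() {
	a := runs{{0, 5999}}
	b := runs{{3000, 8999}}
	c := runs{{20000, 20009}}
	out := fastOr(a, b, c)
	fmt.Println(out.cardinality()) // 9010
}
```

In this model the result is always in bitmap form, which matches the note that the output container type changes for slots that started as run containers.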
tamirms added a commit to stellar/stellar-rpc that referenced this pull request on May 27, 2026
The FastOr/runContainer16 fix (RoaringBitmap/roaring#527) we previously carried as a tamirms/roaring fork has merged upstream and shipped in the official v2.18.2 release (the v2.18.2 tag is the #527 merge commit). Drop the replace directive and require v2.18.2 directly.

v2.18.2 is the minimum: it's the first upstream release with the #527 FastOr fix, so dropping below it returns to the runContainer16.lazyIOR slow path the fork existed to avoid.

The borrow-model invariants the query path relies on (FastAnd/FastOr don't mutate inputs; singleton FastAnd/FastOr Clone) were re-verified against the v2.18.2 source.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
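As a rough illustration of the dependency change described in that commit message, the consumer's go.mod would carry a plain requirement on the upstream release and no replace directive for the fork. This is a sketch that assumes the consumer imports the v2 module path; the actual stellar-rpc go.mod is not shown here and may list the requirement inside a larger require block.

```
require github.com/RoaringBitmap/roaring/v2 v2.18.2
```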
Problem
(*Bitmap).lazyOR's same-key branch calls c1.lazyIOR(c2). When c1 is a runContainer16, lazyIOR falls back to ior → inplaceUnion → Add → searchRange, which is O(N log R) per element merged. This makes FastOr catastrophically slow over inputs containing run-encoded blocks (e.g. the output of AddRange over many overlapping ranges, or anything after RunOptimize).

A synthetic benchmark with 15 bitmaps of ~6000-bit runs takes ~335 ms/op before this change.
Fix
Pre-promote runContainer16 slots to bitmapContainer before the inner lazyIOR call at the single FastOr call site. A bitmapContainer's lazy union is O(1024) regardless of cardinality, so the K-way fan-in stays linear in K.

This mirrors the explicit toBitmapContainer pre-promotion already used in parallel.go's ParHeapOr, and the Java BitmapContainer.lazyor(RunContainer) path. Issue #81 has tracked the deeper fix (proper runContainer16.lazyIOR/lazyOR implementations) since 2016; this is the surgical workaround at the one hot call site.

Notes
The output container type for slots that started as runContainer16 inputs is now bitmapContainer (or arrayContainer after repairAfterLazy if sparse). repairAfterLazy does not re-encode to runs; callers that want a run-optimised result should call RunOptimize() on the FastOr return value. ParHeapOr already behaves the same way.

The full v2 test suite passes.
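To make concrete what re-encoding to runs means after the union, here is a toy sketch of scanning a bitmap-form container back into maximal runs. It is only a conceptual analogue of a run-optimisation pass, not the library's RunOptimize implementation; `interval`, `words`, and `toRuns` are hypothetical names introduced for the example.

```go
package main

import "fmt"

// interval is an inclusive run [start, last] of 16-bit values (hypothetical type).
type interval struct{ start, last uint16 }

// words is a toy bitmap container: 65536 bits stored in 1024 uint64 words.
type words [1024]uint64

// toRuns re-encodes a bitmap-form container as a list of maximal inclusive runs.
func toRuns(w *words) []interval {
	var out []interval
	inRun := false
	var start uint32
	for v := uint32(0); v < 1<<16; v++ {
		set := w[v>>6]>>(v&63)&1 == 1
		switch {
		case set && !inRun:
			// A new run begins at v.
			start, inRun = v, true
		case !set && inRun:
			// The current run ended at v-1.
			out = append(out, interval{uint16(start), uint16(v - 1)})
			inRun = false
		}
	}
	if inRun {
		// A run that reaches the top of the 16-bit range.
		out = append(out, interval{uint16(start), 0xFFFF})
	}
	return out
}

func main() {
	var w words
	for v := 0; v < 9000; v++ {
		w[v>>6] |= uint64(1) << (v & 63)
	}
	fmt.Println(toRuns(&w)) // [{0 8999}]
}
```

With the real library the equivalent step is a single call on the result, for example calling RunOptimize() on the bitmap returned by FastOr, as the note above recommends.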
Refs: #81