BitBuffer binary operations support in place updates to lhs#8162
BitBuffer binary operations support in place updates to lhs#8162robert3005 wants to merge 6 commits into
Conversation
Key optimizations: - `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut which always failed (clone shares Arc, so 2 refs = failure) → 27-58% faster bitwise NOT on references - `iter_bits()`: process u64 chunks instead of byte-by-byte → 13% faster iteration - `PartialEq`: fast path using memcmp for byte-aligned buffers - `append_buffer()`: memcpy fast path when both sides are byte-aligned → 20-52% faster buffer appends - `append_false()`: remove unnecessary branch (new bytes are zero-init) → 65% faster single-bit appends - `from_indices()`: use set_bit_unchecked directly on the byte slice - `FromIterator` tail: batch remaining items in u64 words → 13% faster from_iter - `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path, fix byte range bug in aligned path - `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy - `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead of per-bit get+append - Add #[inline] to hot methods: set, set_to, append, true_count, etc. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
BitBuffers almost always have shared backing storage (from slicing, array construction, etc.), so try_into_mut nearly always fails. The owned `Not for BitBuffer` now uses the same direct-copy path as the reference version, and the dead `bitwise_unary_op` function is removed. For true in-place mutation, use `BitBufferMut` directly. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
Keep the in-place mutation fast path for owned BitBuffer when the backing storage has exclusive access (Arc refcount == 1). When try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy instead of duplicating the copy logic. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously always allocated a new buffer because all binary ops took references. Now owned-left operands try try_into_mut for zero-allocation in-place mutation when the backing storage has exclusive access. Changes: - Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned left, falls back to allocating bitwise_binary_op - Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path - Add BitBuffer::into_bitand_not for owned variant - Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer for in-place binary ops in the scan loop - Update Mask::bitand_not to use owned BitBuffer path - Fix flat/zoned readers to capture density before consuming mask Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
…no-op micro-opts
Rebased onto develop, then verified the optimizations with benchmarks and
hardened the ones that pay off.
Correctness fixes (each with a regression test added first):
- `bitwise_binary_op_lhs_owned`: the in-place path combined operands
byte-for-byte, ignoring bit offsets, so `owned & &rhs` corrupted results when
`left.offset() != right.offset()` (reachable via sliced masks in the scan
loop). Now falls back to the offset-aware `bitwise_binary_op` unless the
offsets match. Added an rstest comparing the owned path against the reference
path across every offset/length/op combination.
- `append_false` dropped its read-modify-write on the assumption that bits past
`len` are zero, but `truncate` left stale bits in the final partial byte —
so `truncate(n)` then `append_false()` read back as true. `truncate` now
clears the vacated bits, restoring the invariant `append_false`/`append_buffer`
rely on (keeps the ~2.3x single-bit append win).
- vortex-cuda's flat reader had the same `mask.density()`-after-move pattern the
flat/zoned readers were fixed for; capture density before consuming the mask.
Simplification / common patterns:
- Replaced three hand-rolled unsafe u64-chunk loops (`bitwise_unary_op_copy`,
`bitwise_unary_op_mut`, owned binary in-place) with safe shared helpers
`read_u64_le` / `map_u64_words_in_place` / `zip_u64_words_in_place` built on
`chunks_exact` + `u64::{from,to}_le_bytes`. No `unsafe`, portable to
big-endian (the raw `read_unaligned::<u64>` was a latent BE bug), and the
owned in-place AND stays 3.7-4.7x faster than allocating at 16K-64K bits.
Dropped optimizations that benchmarks showed don't pay off:
- `FromIterator` 64-bit tail batching: never exercised by exact-size iterators
(the common case), measured ~1.0x, ~60 lines of intricate unsafe. Reverted to
a simple `append` loop now that `append` is fast.
Other notes (no code change here):
- The filter and `from_indices` optimizations from the original branch are
superseded by develop's PEXT filter and u64-word `from_indices`; taken from
develop during the rebase.
- `iter_bits`' u64-chunk rewrite is kept (cleaner/safer than the byte-by-byte
original) but it has no non-test callers and does not feed `iter()`.
Added a `bitand_owned_lhs_vortex_buffer` bench so the owned in-place path has
ongoing coverage.
Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merging this PR will improve performance by 30.89%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_bool_canonical_into[(100, 100)] |
98 µs | 87.6 µs | +11.89% |
| ⚡ | Simulation | chunked_opt_bool_canonical_into[(100, 100)] |
199.6 µs | 180.8 µs | +10.41% |
| ⚡ | Simulation | chunked_opt_bool_canonical_into[(1000, 10)] |
55.1 µs | 35 µs | +57.21% |
| ⚡ | Simulation | chunked_opt_bool_into_canonical[(1000, 10)] |
69 µs | 49.9 µs | +38.06% |
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] |
188.1 µs | 215.2 µs | -12.6% |
| ⚡ | Simulation | chunked_bool_canonical_into[(1000, 10)] |
45 µs | 34.8 µs | +29.26% |
| ❌ | Simulation | varbinview_zip_block_mask |
2.9 ms | 3.4 ms | -14.06% |
| ⚡ | Simulation | fsst_compress_string |
13.8 ms | 12.3 ms | +12.67% |
| ⚡ | Simulation | compress_fsst[(10000, 4, 4)] |
1.5 ms | 1.4 ms | +11.41% |
| ⚡ | Simulation | compress_fsst[(10000, 4, 8)] |
1.6 ms | 1.4 ms | +11.14% |
| ⚡ | Simulation | append_vortex_buffer[1024] |
14.7 µs | 7.5 µs | +95.75% |
| 🆕 | Simulation | bitand_owned_lhs_vortex_buffer[65536] |
N/A | 11.3 µs | N/A |
| ⚡ | Simulation | append_vortex_buffer[128] |
2.1 µs | 1.1 µs | +85.46% |
| 🆕 | Simulation | bitand_owned_lhs_vortex_buffer[1024] |
N/A | 3.5 µs | N/A |
| 🆕 | Simulation | bitand_owned_lhs_vortex_buffer[128] |
N/A | 3.3 µs | N/A |
| 🆕 | Simulation | bitand_owned_lhs_vortex_buffer[16384] |
N/A | 5.3 µs | N/A |
| 🆕 | Simulation | bitand_owned_lhs_vortex_buffer[2048] |
N/A | 3.6 µs | N/A |
| ⚡ | Simulation | append_buffer_vortex_buffer[1024] |
15.6 µs | 12.4 µs | +25.48% |
| ⚡ | Simulation | append_buffer_vortex_buffer[128] |
11.5 µs | 10.1 µs | +13.98% |
| ⚡ | Simulation | append_vortex_buffer[16384] |
231.1 µs | 117.2 µs | +97.19% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing rk/optimise-bit-buffer (c8bc973) with develop (73454db)
Revive #7375
This speeds up And/Or/Xor - 50% to 4x time
Append - 2x (by shifting the work elsewhere in the code)
append_buffer - 10-50%