BitBuffer binary operations support in place updates to lhs by robert3005 · Pull Request #8162 · vortex-data/vortex

robert3005 · 2026-05-29T16:59:34Z

This speeds up And/Or/Xor - 50% to 4x time
Append - 2x (by shifting the work elsewhere in the code)
append_buffer - 10-50%

Key optimizations: - `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut which always failed (clone shares Arc, so 2 refs = failure) → 27-58% faster bitwise NOT on references - `iter_bits()`: process u64 chunks instead of byte-by-byte → 13% faster iteration - `PartialEq`: fast path using memcmp for byte-aligned buffers - `append_buffer()`: memcpy fast path when both sides are byte-aligned → 20-52% faster buffer appends - `append_false()`: remove unnecessary branch (new bytes are zero-init) → 65% faster single-bit appends - `from_indices()`: use set_bit_unchecked directly on the byte slice - `FromIterator` tail: batch remaining items in u64 words → 13% faster from_iter - `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path, fix byte range bug in aligned path - `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy - `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead of per-bit get+append - Add #[inline] to hot methods: set, set_to, append, true_count, etc. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

BitBuffers almost always have shared backing storage (from slicing, array construction, etc.), so try_into_mut nearly always fails. The owned `Not for BitBuffer` now uses the same direct-copy path as the reference version, and the dead `bitwise_unary_op` function is removed. For true in-place mutation, use `BitBufferMut` directly. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

Keep the in-place mutation fast path for owned BitBuffer when the backing storage has exclusive access (Arc refcount == 1). When try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy instead of duplicating the copy logic. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously always allocated a new buffer because all binary ops took references. Now owned-left operands try try_into_mut for zero-allocation in-place mutation when the backing storage has exclusive access. Changes: - Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned left, falls back to allocating bitwise_binary_op - Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path - Add BitBuffer::into_bitand_not for owned variant - Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer for in-place binary ops in the scan loop - Update Mask::bitand_not to use owned BitBuffer path - Fix flat/zoned readers to capture density before consuming mask Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

…no-op micro-opts Rebased onto develop, then verified the optimizations with benchmarks and hardened the ones that pay off. Correctness fixes (each with a regression test added first): - `bitwise_binary_op_lhs_owned`: the in-place path combined operands byte-for-byte, ignoring bit offsets, so `owned & &rhs` corrupted results when `left.offset() != right.offset()` (reachable via sliced masks in the scan loop). Now falls back to the offset-aware `bitwise_binary_op` unless the offsets match. Added an rstest comparing the owned path against the reference path across every offset/length/op combination. - `append_false` dropped its read-modify-write on the assumption that bits past `len` are zero, but `truncate` left stale bits in the final partial byte — so `truncate(n)` then `append_false()` read back as true. `truncate` now clears the vacated bits, restoring the invariant `append_false`/`append_buffer` rely on (keeps the ~2.3x single-bit append win). - vortex-cuda's flat reader had the same `mask.density()`-after-move pattern the flat/zoned readers were fixed for; capture density before consuming the mask. Simplification / common patterns: - Replaced three hand-rolled unsafe u64-chunk loops (`bitwise_unary_op_copy`, `bitwise_unary_op_mut`, owned binary in-place) with safe shared helpers `read_u64_le` / `map_u64_words_in_place` / `zip_u64_words_in_place` built on `chunks_exact` + `u64::{from,to}_le_bytes`. No `unsafe`, portable to big-endian (the raw `read_unaligned::<u64>` was a latent BE bug), and the owned in-place AND stays 3.7-4.7x faster than allocating at 16K-64K bits. Dropped optimizations that benchmarks showed don't pay off: - `FromIterator` 64-bit tail batching: never exercised by exact-size iterators (the common case), measured ~1.0x, ~60 lines of intricate unsafe. Reverted to a simple `append` loop now that `append` is fast. Other notes (no code change here): - The filter and `from_indices` optimizations from the original branch are superseded by develop's PEXT filter and u64-word `from_indices`; taken from develop during the rebase. - `iter_bits`' u64-chunk rewrite is kept (cleaner/safer than the byte-by-byte original) but it has no non-test callers and does not feed `iter()`. Added a `bitand_owned_lhs_vortex_buffer` bench so the owned in-place path has ongoing coverage. Signed-off-by: Robert Kruszewski <dzobert@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Robert Kruszewski <github@robertk.io>

codspeed-hq · 2026-05-29T17:10:46Z

Merging this PR will improve performance by 30.89%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 20 improved benchmarks
❌ 3 regressed benchmarks
✅ 1252 untouched benchmarks
🆕 5 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`chunked_bool_canonical_into[(100, 100)]`	98 µs	87.6 µs	+11.89%
⚡	Simulation	`chunked_opt_bool_canonical_into[(100, 100)]`	199.6 µs	180.8 µs	+10.41%
⚡	Simulation	`chunked_opt_bool_canonical_into[(1000, 10)]`	55.1 µs	35 µs	+57.21%
⚡	Simulation	`chunked_opt_bool_into_canonical[(1000, 10)]`	69 µs	49.9 µs	+38.06%
❌	Simulation	`chunked_varbinview_opt_canonical_into[(1000, 10)]`	188.1 µs	215.2 µs	-12.6%
⚡	Simulation	`chunked_bool_canonical_into[(1000, 10)]`	45 µs	34.8 µs	+29.26%
❌	Simulation	`varbinview_zip_block_mask`	2.9 ms	3.4 ms	-14.06%
⚡	Simulation	`fsst_compress_string`	13.8 ms	12.3 ms	+12.67%
⚡	Simulation	`compress_fsst[(10000, 4, 4)]`	1.5 ms	1.4 ms	+11.41%
⚡	Simulation	`compress_fsst[(10000, 4, 8)]`	1.6 ms	1.4 ms	+11.14%
⚡	Simulation	`append_vortex_buffer[1024]`	14.7 µs	7.5 µs	+95.75%
🆕	Simulation	`bitand_owned_lhs_vortex_buffer[65536]`	N/A	11.3 µs	N/A
⚡	Simulation	`append_vortex_buffer[128]`	2.1 µs	1.1 µs	+85.46%
🆕	Simulation	`bitand_owned_lhs_vortex_buffer[1024]`	N/A	3.5 µs	N/A
🆕	Simulation	`bitand_owned_lhs_vortex_buffer[128]`	N/A	3.3 µs	N/A
🆕	Simulation	`bitand_owned_lhs_vortex_buffer[16384]`	N/A	5.3 µs	N/A
🆕	Simulation	`bitand_owned_lhs_vortex_buffer[2048]`	N/A	3.6 µs	N/A
⚡	Simulation	`append_buffer_vortex_buffer[1024]`	15.6 µs	12.4 µs	+25.48%
⚡	Simulation	`append_buffer_vortex_buffer[128]`	11.5 µs	10.1 µs	+13.98%
⚡	Simulation	`append_vortex_buffer[16384]`	231.1 µs	117.2 µs	+97.19%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing rk/optimise-bit-buffer (c8bc973) with develop (73454db)}

claude and others added 6 commits May 29, 2026 16:33

less

c8bc973

Signed-off-by: Robert Kruszewski <github@robertk.io>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BitBuffer binary operations support in place updates to lhs#8162

BitBuffer binary operations support in place updates to lhs#8162
robert3005 wants to merge 6 commits into
developfrom
rk/optimise-bit-buffer

robert3005 commented May 29, 2026

Uh oh!

codspeed-hq Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

robert3005 commented May 29, 2026

Uh oh!

codspeed-hq Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will improve performance by 30.89%

Performance Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codspeed-hq Bot commented May 29, 2026 •

edited

Loading