perf: optimize varbinview construction#8146
Conversation
build_views is the hot loop that turns a decoded string heap plus per-element lengths into VarBinView views. It is shared by FSST, VarBin, zstd and dict canonicalization. The previous implementation checked for a 2 GiB buffer rollover on every element and called the out-of-line (#[inline(never)]) BinaryView::make_view per element. Add a fast path for the common case where the whole decoded heap fits in a single buffer (bytes.len() <= max_buffer_len, which bounds every offset). In that path there can be no rollover, so the loop drops the per-element rollover branch and builds the reference-view variant inline, avoiding the make_view call for long (>12 byte) values. Output is byte-for-byte identical to the previous code; the multi-buffer path is unchanged. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Polar Signals Profiling ResultsLatest Run
Previous Runs (1)
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 0.999x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (0.999x ➖, 1↑ 0↓)
File Size Changes (545 files changed, -98.9% overall, 0↑ 545↓)
Totals:
|
Merging this PR will improve performance by 11.42%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | fsst_decompress_string |
3.5 ms | 3.1 ms | +11.53% |
| ⚡ | Simulation | chunked_canonicalize_into[(10, 10000, 16, 4)] |
5.9 ms | 5.3 ms | +12.23% |
| ⚡ | Simulation | canonicalize_compare[(10000, 16, 4)] |
704.7 µs | 640.5 µs | +10.03% |
| ⚡ | Simulation | chunked_into_canonical[(10, 10000, 16, 4)] |
5.9 ms | 5.3 ms | +12.13% |
| ⚡ | Simulation | decompress_fsst[(10000, 4, 4)] |
388.3 µs | 352.7 µs | +10.09% |
| ⚡ | Simulation | decompress_fsst[(10000, 16, 4)] |
579.4 µs | 514.8 µs | +12.56% |
| ⚡ | Simulation | decompress_fsst[(10000, 16, 8)] |
631.8 µs | 567.2 µs | +11.39% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/avx512-decode-benchmark-wRhkZ (fc6b839) with develop (73454db)
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.018x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.017x ➖, 0↑ 0↓)
datafusion / parquet (1.021x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.012x ➖, 1↑ 1↓)
duckdb / vortex-compact (1.017x ➖, 0↑ 0↓)
duckdb / parquet (1.030x ➖, 0↑ 1↓)
File Size Changes (542 files changed, -95.1% overall, 0↑ 542↓)
Totals:
Full attributed analysis
|
2042c37 to
50b0e93
Compare
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.010x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.001x ➖, 0↑ 0↓)
datafusion / parquet (1.010x ➖, 1↑ 1↓)
datafusion / arrow (1.004x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (1.008x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.013x ➖, 0↑ 0↓)
duckdb / parquet (1.021x ➖, 0↑ 3↓)
duckdb / duckdb (1.007x ➖, 0↑ 0↓)
File Size Changes (526 files changed, -99.3% overall, 0↑ 526↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.921x ➖, 28↑ 1↓)
datafusion / vortex-compact (0.932x ➖, 17↑ 0↓)
datafusion / parquet (0.939x ➖, 19↑ 1↓)
duckdb / vortex-file-compressed (0.927x ➖, 27↑ 0↓)
duckdb / vortex-compact (0.927x ➖, 32↑ 1↓)
duckdb / parquet (0.942x ➖, 8↑ 0↓)
duckdb / duckdb (0.917x ➖, 22↑ 0↓)
File Size Changes (496 files changed, -99.2% overall, 0↑ 496↓)
Totals:
Full attributed analysis
|
Benchmarks: FineWeb S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.123x ➖, 0↑ 2↓)
datafusion / vortex-compact (0.817x ➖, 2↑ 0↓)
datafusion / parquet (0.926x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.913x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.008x ➖, 0↑ 0↓)
duckdb / parquet (0.953x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (0.978x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.994x ➖, 0↑ 0↓)
duckdb / parquet (0.978x ➖, 0↑ 0↓)
File Size Changes (542 files changed, -95.3% overall, 0↑ 542↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.816x ✅, 22↑ 0↓)
datafusion / vortex-compact (0.825x ✅, 22↑ 0↓)
datafusion / parquet (0.849x ✅, 20↑ 0↓)
datafusion / arrow (0.830x ✅, 22↑ 0↓)
duckdb / vortex-file-compressed (0.861x ✅, 18↑ 0↓)
duckdb / vortex-compact (0.866x ✅, 21↑ 0↓)
duckdb / parquet (0.928x ➖, 4↑ 0↓)
duckdb / duckdb (0.906x ➖, 9↑ 0↓)
File Size Changes (496 files changed, -92.4% overall, 0↑ 496↓)
Totals:
Full attributed analysis
|
Benchmarks: Random AccessVortex (geomean): 1.041x ➖ How to read Verdict and Engines
unknown / unknown (1.069x ➖, 0↑ 2↓)
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.006x ➖, 1↑ 2↓)
datafusion / parquet (1.023x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.014x ➖, 1↑ 6↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
duckdb / duckdb (1.002x ➖, 0↑ 0↓)
File Size Changes (345 files changed, -65.8% overall, 0↑ 345↓)
Totals:
Full attributed analysis
|
Benchmarks: Appian on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.936x ➖, 1↑ 0↓)
datafusion / parquet (0.922x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.959x ➖, 0↑ 0↓)
duckdb / parquet (0.952x ➖, 0↑ 0↓)
duckdb / duckdb (0.966x ➖, 0↑ 0↓)
File Size Changes (527 files changed, -98.7% overall, 0↑ 527↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.809x ➖, 7↑ 1↓)
datafusion / vortex-compact (0.905x ➖, 3↑ 2↓)
datafusion / parquet (0.991x ➖, 1↑ 1↓)
duckdb / vortex-file-compressed (0.859x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.891x ➖, 0↑ 0↓)
duckdb / parquet (0.845x ➖, 0↑ 0↓)
Full attributed analysis
|
|
Looks like small but real wins |
Benchmarks: CompressionVortex (geomean): 1.002x ➖ How to read Verdict and Engines
unknown / unknown (0.998x ➖, 0↑ 0↓)
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.890x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.895x ➖, 2↑ 0↓)
datafusion / parquet (0.824x ➖, 3↑ 0↓)
duckdb / vortex-file-compressed (0.929x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.834x ➖, 1↑ 0↓)
duckdb / parquet (0.918x ➖, 0↑ 0↓)
Full attributed analysis
|
Builds on the single-buffer fast path with two changes to the per-view write, measured at ~12% faster on the long-string canonicalization path (8.13 -> 7.10 ns/view, isolated build_views over 600k reference views, native/AVX-512, pinned core, 6 samples each): 1. Write views into the reserved spare capacity via an iterator instead of `push_unchecked`. `push_unchecked` advances the backing buffer's length on every call, which the optimizer cannot prove is loop-invariant, so it reloads and rewrites the output cursor through the stack each iteration. Writing into the spare slice and calling `set_len` once keeps the cursor in a register. 2. Assemble the reference view as a single `u128` via a new `BinaryView::new_ref` constructor, so the compiler emits one wide store per view rather than separate field stores. The constructor matches the little-endian field order already assumed by `BinaryView` (`le_bytes`, `From<u128>`, `as_u128`). A new `new_ref_matches_make_view` test asserts the `u128` assembly is byte-identical to the value-inspecting `make_view` across several lengths. Existing build_views and vortex-fsst canonicalization tests pass; clippy and rustfmt are clean. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
| use super::*; | ||
|
|
||
| #[test] | ||
| fn new_ref_matches_make_view() { |
There was a problem hiding this comment.
Any more test cases you can come up with? Is the change sufficiently covered?
| prefix.copy_from_slice(&value[..4]); | ||
| BinaryView::new_ref(len.as_(), prefix, start_buf_index, offset.as_()) | ||
| } else { | ||
| BinaryView::make_view(value, start_buf_index, offset.as_()) |
There was a problem hiding this comment.
offset.as_() can implicitly truncate?
Add tests exercising the new single-buffer fast path in build_views: a parameterized roundtrip over mixed inline/reference lengths, the 12/13 byte inline boundary, empty and interleaved-empty values, all-inlined and all-reference sets, and a single large value. Also verify the fast path is taken at the exact bytes.len() == max_buffer_len boundary, that it agrees with the slow rollover path on reconstructed values, handles empty input, and produces views byte-identical to make_view. Allow cast_possible_truncation in the view.rs test module so the new_ref test compiles under clippy. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Add fast_path_large_offsets, which builds a ~9 MiB heap so the running offset climbs past 2^23 while staying far below MAX_BUFFER_LEN. Each value encodes its index, and the test asserts every Ref records the exact cumulative offset and size, guarding against any accidental narrowing of the u32 offset/size fields written via as_ truncation. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Add fsst_canonicalize_offsets_overflow_i32, the decode-side companion to fsst_compress_offsets_overflow_i32. It builds an FSST array whose decompressed heap crosses i32::MAX so canonicalization feeds build_views a heap larger than MAX_BUFFER_LEN, forcing the single-buffer fast path to be declined in favor of buffer rollover. The test asserts the row count is preserved, that more than one data buffer is produced, that no buffer exceeds MAX_BUFFER_LEN, and that rows straddling the rollover boundary reconstruct exactly. Gated with the same #[test_with::env(CI)] / #[test_with::no_env( VORTEX_SKIP_SLOW_TESTS)] attributes as the existing slow test. Highly compressible bodies keep the FSST array tiny so peak memory is dominated by the input and decompressed heaps. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Drop the FSST-based fsst_canonicalize_offsets_overflow_i32 test in favor of a direct build_views test in vortex-array, avoiding the compressor/training and the doubled input+decompressed heaps. build_views_offsets_overflow_i32 builds a ~2.25 GiB heap (just past i32::MAX) and calls build_views with MAX_BUFFER_LEN, asserting the heap rolls over into multiple buffers, that no buffer exceeds MAX_BUFFER_LEN, and that rows straddling the rollover boundary reconstruct exactly. Validated that the test catches the overflow: with the single-buffer fast-path guard intact it passes, and forcing the fast path to swallow the whole heap makes it fail on the multi-buffer assertion. Gated with #[test_with::env(CI)] / #[test_with::no_env(VORTEX_SKIP_SLOW_TESTS)], matching the existing slow-test convention; adds test-with as a vortex-array dev-dependency. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
Opptimizes the hot path for constructing binary views when the entire decoded heap fits within a single buffer.
Changes
Optimized view construction (
vortex-array/src/arrays/varbinview/build_views.rs):BinaryView::make_viewcalls in the hot loop