fix: gc StringView/BinaryView arrays before spilling to prevent write amplification#21325
Open
damahua wants to merge 2 commits intoapache:mainfrom
Open
fix: gc StringView/BinaryView arrays before spilling to prevent write amplification#21325damahua wants to merge 2 commits intoapache:mainfrom
damahua wants to merge 2 commits intoapache:mainfrom
Conversation
… amplification After operations like `take` or `interleave`, view-type arrays retain shared references to large original data buffers. When these batches are written to spill files individually, the IPC writer duplicates all referenced buffers for every batch — measured at 8.5× write amplification with 10 chunks of 1000 rows from a 10,000-row StringView dataset. Changes: - Add `gc_view_arrays()` utility in spill/mod.rs that compacts both StringViewArray and BinaryViewArray in a RecordBatch - Apply gc to hash aggregation spill path (IncrementalSortIterator output chunks share parent batch buffers via take_record_batch) - Apply gc to sort-merge join bitwise_stream spill path (inner_key_buffer contains sliced batches sharing original batch buffers) - Extend sort operator's organize_stringview_arrays to also handle BinaryViewArray (was previously StringView-only) The sort operator already had StringView gc (organize_stringview_arrays), but the hash aggregation and sort-merge join spill paths were missing it. This is the same class of bug fixed by PR apache#19444 for sort spilling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…B test Adds a targeted benchmark that exercises the hash aggregation spill path with StringViewArray columns (Utf8View, non-inline 50+ byte strings). Uses EXPLAIN ANALYZE to capture spill_count and spilled_bytes metrics. A/B results (20 MB pool, 100K rows, 50K groups, N=3): Baseline: 39.50 MB spilled (5× write amplification from shared buffers) Optimized: 7.90 MB spilled (80% reduction) Query time: unchanged (~320 ms) With 8 MB pool: baseline OOMs during sort reservation, optimized succeeds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Related to #19444 (sort spill StringView gc) and #20500 (repartition StringView gc). This PR extends the same fix to hash aggregation and sort-merge join spill paths, and adds BinaryViewArray support to the sort operator.
Rationale
After operations like
takeorslice,StringViewArrayandBinaryViewArrayretain shared references to all original data buffers. When these batches are written to spill files individually, the IPC writer must include every referenced buffer for every batch, causing massive write amplification.The sort operator already had a fix for this (
organize_stringview_arraysinsort.rs), but the hash aggregation and sort-merge join spill paths were missing it. Additionally, the sort operator's fix only handledStringViewArray, notBinaryViewArray.Hash aggregation spill path
In
row_hash.rs,IncrementalSortIteratorproduces output chunks viatake_record_batch. Each chunk shares the same StringView data buffers as the parent emitted batch. Withoutgc(), spilling N chunks writes N copies of all shared buffers.Sort-merge join spill path
In
bitwise_stream.rs,inner_key_buffercontains sliced batches that share StringView data buffers with the original unsliced batches.Changes
gc_view_arrays()utility inspill/mod.rs— compacts bothStringViewArrayandBinaryViewArrayin aRecordBatch, returning the batch unchanged (no allocation) when no view-type columns existrow_hash.rs) — gc eachIncrementalSortIteratoroutput batch before writing to the spill filebitwise_stream.rs) — gc slicedinner_key_bufferbatches before spillingsort.rs) — extended existingorganize_stringview_arraysto also handleBinaryViewArrayA/B Benchmark Results
Workload:
SELECT group_key, COUNT(*), SUM(value) FROM t GROUP BY group_keygroup_key:Utf8View(StringViewArray) with 50+ byte non-inline stringsWith a tighter 8 MB pool: baseline OOMs (
ResourcesExhausted: Failed to allocate additional 10.4 MB for GroupedHashAggregateStream) because inflated StringView buffers cause the sort memory estimate to exceed the pool. Optimized completes successfully (5 spills, 20.9 MB spilled).Tests
datafusion-physical-planlib tests pass (1 pre-existing zstd feature failure)memory_limitintegration tests passsort_merge_jointests passtest_gc_view_arrays_reduces_spill_size— verifies gc compacts taken StringView/BinaryView batchestest_gc_view_arrays_write_amplification— demonstrates 8.5× write amplification without gctest_gc_view_arrays_noop_for_non_view_types— verifies no overhead for non-view typesbench_stringview_aggregate_spill— end-to-end benchmark with EXPLAIN ANALYZE metrics