GH-3489: Fix Binary.copy() to materialize direct ByteBuffer-backed values to heap#3490
Open
arouel wants to merge 1 commit intoapache:masterfrom
Open
GH-3489: Fix Binary.copy() to materialize direct ByteBuffer-backed values to heap#3490arouel wants to merge 1 commit intoapache:masterfrom
arouel wants to merge 1 commit intoapache:masterfrom
Conversation
…ked values to heap Why Binary.copy() is a no-op for ByteBufferBackedBinary instances when isBackingBytesReused is false, returning this instead of a heap-backed copy. When the off-heap decompression path is enabled, column readers produce ByteBufferBackedBinary values that are zero-copy views into Arena-managed direct buffers. Code that calls copy() to take ownership of the data -- such as DictionaryValuesWriter.writeBytes() -- silently receives the original object still pointing into the direct buffer. When the source PageReadStore is closed and the Arena is freed, those Binary values become dangling references, causing IllegalStateException: Already closed on subsequent access. What Override copy() in Binary.ByteBufferBackedBinary to always materialize to a heap-backed Binary via fromConstantByteArray(getBytes()) when the backing ByteBuffer is direct. Heap-backed ByteBuffer values delegate to the existing super.copy() behavior to avoid unnecessary copies. The change is in parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Binary.copy()is expected to return a value that the caller can safely store beyond the lifetime of the source buffer. The base implementation only copies whenisBackingBytesReusedistrue, otherwise it returnsthis. ForByteBufferBackedBinaryinstances backed by a directByteBuffer,isBackingBytesReusedisfalse, socopy()is a no-op. The returned object still points into the original direct buffer. When that buffer's memory is freed (e.g., after closing aPageReadStorethat used the off-heap decompression path), any code that relied oncopy()for ownership, such asDictionaryValuesWriter.writeBytes(), hitsIllegalStateException: Already closed on the next access.What changes are included in this PR?
ByteBufferBackedBinary.copy()is overridden to checkvalue.isDirect(). When the backingByteBufferis direct, it unconditionally materializes to a heap-backedBinaryviaBinary.fromConstantByteArray(getBytes()). When the backing buffer is heap-based, it delegates tosuper.copy()to preserve the existing behavior and avoid unnecessary copies.Are these changes tested?
Tests were added that verify
copy()on a directByteBuffer-backedBinaryreturns an independent heap-backed instance whose data remains accessible after the source buffer's is closed. The existing test suites continue to pass.Are there any user-facing changes?
No API changes.
Binary.copy()now returns a heap-backed copy instead of this when the source is backed by a directByteBuffer. This is a correctness fix, callers already expectedcopy()to return an owned value, and the previous behavior silently violated that contract for direct buffers.Closes #3489