Skip to content

GH-3489: Fix Binary.copy() to materialize direct ByteBuffer-backed values to heap#3490

Open
arouel wants to merge 1 commit intoapache:masterfrom
arouel:binary-byte-buffer-copy
Open

GH-3489: Fix Binary.copy() to materialize direct ByteBuffer-backed values to heap#3490
arouel wants to merge 1 commit intoapache:masterfrom
arouel:binary-byte-buffer-copy

Conversation

@arouel
Copy link
Copy Markdown

@arouel arouel commented Apr 18, 2026

Rationale for this change

Binary.copy() is expected to return a value that the caller can safely store beyond the lifetime of the source buffer. The base implementation only copies when isBackingBytesReused is true, otherwise it returns this. For ByteBufferBackedBinary instances backed by a direct ByteBuffer, isBackingBytesReused is false, so copy() is a no-op. The returned object still points into the original direct buffer. When that buffer's memory is freed (e.g., after closing a PageReadStore that used the off-heap decompression path), any code that relied on copy() for ownership, such as DictionaryValuesWriter.writeBytes(), hits IllegalStateException: Already closed on the next access.

What changes are included in this PR?

ByteBufferBackedBinary.copy() is overridden to check value.isDirect(). When the backing ByteBuffer is direct, it unconditionally materializes to a heap-backed Binary via Binary.fromConstantByteArray(getBytes()). When the backing buffer is heap-based, it delegates to super.copy() to preserve the existing behavior and avoid unnecessary copies.

Are these changes tested?

Tests were added that verify copy() on a direct ByteBuffer-backed Binary returns an independent heap-backed instance whose data remains accessible after the source buffer's is closed. The existing test suites continue to pass.

Are there any user-facing changes?

No API changes. Binary.copy() now returns a heap-backed copy instead of this when the source is backed by a direct ByteBuffer. This is a correctness fix, callers already expected copy() to return an owned value, and the previous behavior silently violated that contract for direct buffers.

Closes #3489

…ked values to heap

Why

Binary.copy() is a no-op for ByteBufferBackedBinary instances when isBackingBytesReused is false, returning this instead of a heap-backed copy. When the off-heap decompression path is enabled, column readers produce ByteBufferBackedBinary values that are zero-copy views into Arena-managed direct buffers. Code that calls copy() to take ownership of the data -- such as DictionaryValuesWriter.writeBytes() -- silently receives the original object still pointing into the direct buffer. When the source PageReadStore is closed and the Arena is freed, those Binary values become dangling references, causing IllegalStateException: Already closed on subsequent access.

What

Override copy() in Binary.ByteBufferBackedBinary to always materialize to a heap-backed Binary via fromConstantByteArray(getBytes()) when the backing ByteBuffer is direct. Heap-backed ByteBuffer values delegate to the existing super.copy() behavior to avoid unnecessary copies. The change is in parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary.copy() does not materialize direct ByteBuffer-backed values to heap, causing use-after-free

1 participant