GH-3484 Eliminate per-page heap allocation for CRC32 checksums when using direct ByteBufferAllocator #3485

Open
arouel wants to merge 1 commit into apache:master from arouel:crc32-use-byte-buffer

@arouel arouel commented Apr 17, 2026

Rationale for this change

ColumnChunkPageWriter.writePage() and writePageV2() call BytesInput.toByteArray() to feed compressed page data into CRC32.update(byte[]). When the writer uses a direct ByteBufferAllocator, this forces a full heap copy of every compressed page solely for checksumming. Since page write checksums are enabled by default (DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED = true), this allocation occurs on every page write and negates part of the benefit of using a direct allocator.

What changes are included in this PR?

Replace crc.update(x.toByteArray()) with crc.update(x.toByteBuffer(releaser)) at four call sites:

  • writePage() (V1): 1 call for compressedBytes
  • writePageV2(): 3 calls for repetitionLevels, definitionLevels, and compressedData

CRC32.update(ByteBuffer) has been available since Java 8 (the Checksum interface gained the same method in Java 9) and reads direct and array-backed buffers in place, without copying. The releaser field already exists on ColumnChunkPageWriter and provides the allocator-aware ByteBuffer lifecycle management needed. When the allocator is heap-based, toByteBuffer(releaser) returns a heap buffer and behavior is functionally equivalent to the previous code.
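To illustrate the equivalence claimed above, here is a minimal standalone sketch that uses only the JDK and no Parquet classes. The class name `Crc32BufferSketch` and its helpers are hypothetical and are not part of this PR; `toDirect` merely stands in for a buffer obtained from a direct allocator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration class; not part of parquet-java.
final class Crc32BufferSketch {

  // Old path: checksum computed over a heap byte[] copy of the page bytes.
  static long crcOfArray(byte[] page) {
    CRC32 crc = new CRC32();
    crc.update(page);
    return crc.getValue();
  }

  // New path: checksum read straight from a ByteBuffer (direct or heap).
  static long crcOfBuffer(ByteBuffer page) {
    CRC32 crc = new CRC32();
    // duplicate() keeps the caller's position and limit untouched
    crc.update(page.duplicate());
    return crc.getValue();
  }

  // Copies bytes into off-heap memory, standing in for a direct allocator.
  static ByteBuffer toDirect(byte[] bytes) {
    ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
    direct.put(bytes);
    direct.flip();
    return direct;
  }

  public static void main(String[] args) {
    byte[] page = "compressed page bytes".getBytes(StandardCharsets.UTF_8);
    long viaArray = crcOfArray(page);
    long viaDirect = crcOfBuffer(toDirect(page));
    long viaHeap = crcOfBuffer(ByteBuffer.wrap(page));
    // All three paths should agree on the checksum value.
    System.out.println(viaArray == viaDirect && viaArray == viaHeap);
  }
}
```

The sketch only demonstrates that the checksum value does not depend on whether the bytes come from a `byte[]`, a heap buffer, or a direct buffer; it makes no claim about how `BytesInput.toByteBuffer(releaser)` is implemented.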

Are these changes tested?

The existing TestColumnChunkPageWriteStore covers both heap and direct allocator paths (test and testWithDirectBuffers) and exercises writePageV2 with checksums enabled by default. TestDataPageChecksums covers V1 and V2 pages with checksums on/off.

Are there any user-facing changes?

No. This is an internal optimization. CRC32 checksums are computed identically; only the intermediate memory representation changes.

Closes #3484

…when using direct `ByteBufferAllocator`

Why this is safe

- CRC32.update(ByteBuffer) has existed since Java 8; it processes the bytes from position to limit and leaves position equal to limit.
- toByteBuffer(releaser) returns either a slice() of the internal buffer (independent position) or a freshly allocated copy. Either way, the original BytesInput is unaffected for the subsequent buf.collect() call, because ByteBufferBytesInput.writeInto() uses buffer.duplicate().
- When the allocator is direct, toByteBuffer(releaser) returns a direct buffer, so no heap copy is made. When the allocator is heap-based, behavior is functionally equivalent to the old toByteArray() path.
- The releaser field already exists on ColumnChunkPageWriter (line 124) and manages buffer lifecycle.
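
The position behavior described in the list above can be checked with a small JDK-only sketch. The class `BufferPositionSketch` and its method names are hypothetical; the sketch shows only the documented `CRC32.update(ByteBuffer)` and `ByteBuffer.slice()` / `duplicate()` semantics, not Parquet's internals.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical illustration class; not part of parquet-java.
final class BufferPositionSketch {

  // Checksums the buffer itself: update() consumes it up to its limit.
  static int positionAfterUpdate(ByteBuffer buf) {
    new CRC32().update(buf);
    return buf.position();
  }

  // Checksums a slice: the slice has its own position, so buf is untouched.
  static int positionAfterSliceUpdate(ByteBuffer buf) {
    new CRC32().update(buf.slice());
    return buf.position();
  }

  // Checksums a duplicate: same content, independent position and limit.
  static int positionAfterDuplicateUpdate(ByteBuffer buf) {
    new CRC32().update(buf.duplicate());
    return buf.position();
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(8); // position 0, limit 8
    System.out.println(positionAfterSliceUpdate(buf));     // buf not advanced
    System.out.println(positionAfterDuplicateUpdate(buf)); // buf not advanced
    System.out.println(positionAfterUpdate(buf));          // buf advanced to its limit
  }
}
```

Because a slice or duplicate shares content but not position, checksumming one leaves the source buffer readable for a later write, which is the property the safety argument relies on.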

Development

Successfully merging this pull request may close these issues.

Avoid heap materialization for CRC32 page checksums in ColumnChunkPageWriteStore
