GH-3487: Fix PageReadStore buffer leak in ParquetFileReader #3488
Open
arouel wants to merge 1 commit into apache:master from
Why

ParquetFileReader never closes the ColumnChunkPageReadStore it returns from readNextRowGroup(). When a subsequent readNextRowGroup() call is made, the previous row group's reference is silently overwritten without releasing its buffers. Likewise, ParquetFileReader.close() does not close the last row group it returned. With the default heap allocator this leak is masked by GC, but with a direct ByteBufferAllocator (e.g., one backed by Arena.ofShared()) the compressed column chunk data and decompressed page buffers become a hard native memory leak that grows with every row group read.

What

Add a closeCurrentRowGroup() helper to ParquetFileReader that null-safely closes and nulls the currentRowGroup field. Call it in readNextRowGroup() and readNextFilteredRowGroup() before assigning the new row group, and include currentRowGroup in the AutoCloseables.uncheckedClose() chain in close(). This matches the lifecycle pattern already implemented manually by InternalParquetRecordReader but bakes it into ParquetFileReader itself so all direct callers benefit.
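The lifecycle fix described above can be sketched with simplified stand-in types. This is not the actual Parquet code: RowGroup and Reader below are hypothetical stand-ins for ColumnChunkPageReadStore and ParquetFileReader, reduced to the ownership pattern the PR introduces.

```java
// Minimal sketch of the lifecycle fix, using hypothetical stand-in types
// (RowGroup stands in for ColumnChunkPageReadStore; Reader for ParquetFileReader).
class RowGroup implements AutoCloseable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true; // stands in for releasing compressed and decompressed buffers
    }
}

class Reader implements AutoCloseable {
    private RowGroup currentRowGroup;

    RowGroup readNextRowGroup() {
        closeCurrentRowGroup(); // release the previous row group before overwriting it
        currentRowGroup = new RowGroup();
        return currentRowGroup;
    }

    // Null-safe helper mirroring the closeCurrentRowGroup() described in the PR.
    private void closeCurrentRowGroup() {
        if (currentRowGroup != null) {
            currentRowGroup.close();
            currentRowGroup = null;
        }
    }

    @Override
    public void close() {
        closeCurrentRowGroup(); // the last row group handed out is released too
    }
}

public class LifecycleSketch {
    public static void main(String[] args) {
        Reader reader = new Reader();
        RowGroup first = reader.readNextRowGroup();
        RowGroup second = reader.readNextRowGroup();
        System.out.println("first closed:  " + first.closed);  // true: released on the next read
        System.out.println("second closed: " + second.closed); // false: still owned by the reader
        reader.close();
        System.out.println("second closed after reader close: " + second.closed); // true
    }
}
```

The key design point is that the reader, not the caller, owns the current row group's buffers; handing out the next row group (or closing the reader) is what ends the previous one's lifetime.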
Rationale for this change
ParquetFileReader never closes the ColumnChunkPageReadStore it returns from readNextRowGroup(). When a subsequent call replaces currentRowGroup, the previous instance's ByteBufferReleaser is abandoned without releasing the compressed I/O buffers and any off-heap decompressed page buffers it holds. With the default HeapByteBufferAllocator this is masked by GC, but with a direct ByteBufferAllocator it becomes a hard native memory leak that grows with every row group read. InternalParquetRecordReader works around this by manually closing the PageReadStore before each read and in its own close(), but any direct caller of ParquetFileReader that does not replicate this pattern will leak buffers.

What changes are included in this PR?
A private closeCurrentRowGroup() method is added to ParquetFileReader that null-safely closes and nulls the currentRowGroup field. It is called in readNextRowGroup() and readNextFilteredRowGroup() before assigning the new row group, and currentRowGroup is included in the AutoCloseables.uncheckedClose() chain in close(). This brings the buffer lifecycle management into ParquetFileReader itself so all callers benefit automatically.

Are these changes tested?
The existing test suites in parquet-hadoop continue to pass. Additional tests were added to verify that PageReadStore buffers are properly released.

Are there any user-facing changes?
No API changes. Callers that already close the PageReadStore themselves (like InternalParquetRecordReader) will see a harmless double-close, since ColumnChunkPageReadStore.close() is idempotent via ByteBufferReleaser. Callers that did not close the PageReadStore will now have their buffers released automatically, reducing memory usage.

Closes #3487
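The harmless double-close noted above rests on the release path being idempotent. A minimal sketch of that guard, assuming a hypothetical Releaser class (not the actual ByteBufferReleaser implementation), which frees its buffers at most once no matter how many times close() is called:

```java
// Hypothetical releaser with an idempotence guard: closing twice is a no-op,
// which is why a caller's close plus the reader's automatic close is harmless.
class Releaser implements AutoCloseable {
    private boolean released = false;
    int releaseCount = 0; // stands in for "how many times buffers were freed"

    @Override
    public void close() {
        if (released) {
            return; // second and later closes do nothing
        }
        released = true;
        releaseCount++; // underlying buffers are freed exactly once
    }
}

public class IdempotentCloseSketch {
    public static void main(String[] args) {
        Releaser releaser = new Releaser();
        releaser.close(); // e.g., the caller closes the PageReadStore itself
        releaser.close(); // e.g., the reader closes it again via closeCurrentRowGroup()
        System.out.println("release count: " + releaser.releaseCount); // prints 1
    }
}
```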