GH-3487: Fix PageReadStore buffer leak in ParquetFileReader #3488
Open
arouel wants to merge 1 commit into apache:master from
Why

ParquetFileReader never closes the ColumnChunkPageReadStore it returns from readNextRowGroup(). When a subsequent readNextRowGroup() call is made, the previous row group's reference is silently overwritten without releasing its buffers. Likewise, ParquetFileReader.close() does not close the last row group it returned. With the default heap allocator this leak is masked by GC, but with a direct ByteBufferAllocator (e.g., one backed by Arena.ofShared()) the compressed column chunk data and decompressed page buffers become a hard native memory leak that grows with every row group read.

What

Add a closeCurrentRowGroup() helper to ParquetFileReader that null-safely closes and nulls the currentRowGroup field. Call it in readNextRowGroup() and readNextFilteredRowGroup() before assigning the new row group, and include currentRowGroup in the AutoCloseables.uncheckedClose() chain in close(). This matches the lifecycle pattern already implemented manually by InternalParquetRecordReader but bakes it into ParquetFileReader itself so all direct callers benefit.
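The lifecycle fix described above can be sketched with simplified stand-in types. This is not the actual Parquet code: RowGroup and Reader below are hypothetical stand-ins for ColumnChunkPageReadStore and ParquetFileReader, reduced to the ownership pattern the PR introduces.

```java
// Minimal sketch of the lifecycle fix, using hypothetical stand-in types
// (RowGroup stands in for ColumnChunkPageReadStore; Reader for ParquetFileReader).
class RowGroup implements AutoCloseable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true; // stands in for releasing compressed and decompressed buffers
    }
}

class Reader implements AutoCloseable {
    private RowGroup currentRowGroup;

    RowGroup readNextRowGroup() {
        closeCurrentRowGroup(); // release the previous row group before overwriting it
        currentRowGroup = new RowGroup();
        return currentRowGroup;
    }

    // Null-safe helper mirroring the closeCurrentRowGroup() described in the PR.
    private void closeCurrentRowGroup() {
        if (currentRowGroup != null) {
            currentRowGroup.close();
            currentRowGroup = null;
        }
    }

    @Override
    public void close() {
        closeCurrentRowGroup(); // the last row group handed out is released too
    }
}

public class LifecycleSketch {
    public static void main(String[] args) {
        Reader reader = new Reader();
        RowGroup first = reader.readNextRowGroup();
        RowGroup second = reader.readNextRowGroup();
        System.out.println("first closed:  " + first.closed);  // true: released on the next read
        System.out.println("second closed: " + second.closed); // false: still owned by the reader
        reader.close();
        System.out.println("second closed after reader close: " + second.closed); // true
    }
}
```

The key design point is that the reader, not the caller, owns the current row group's buffers; handing out the next row group (or closing the reader) is what ends the previous one's lifetime.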
Rationale for this change
ParquetFileReader never closes the ColumnChunkPageReadStore it returns from readNextRowGroup(). When a subsequent call replaces currentRowGroup, the previous instance's ByteBufferReleaser is abandoned without releasing the compressed I/O buffers and any off-heap decompressed page buffers it holds. With the default HeapByteBufferAllocator this is masked by GC, but with a direct ByteBufferAllocator it becomes a hard native memory leak that grows with every row group read. InternalParquetRecordReader works around this by manually closing the PageReadStore before each read and in its own close(), but any direct caller of ParquetFileReader that does not replicate this pattern will leak buffers.

What changes are included in this PR?
A private closeCurrentRowGroup() method is added to ParquetFileReader that null-safely closes and nulls the currentRowGroup field. It is called in readNextRowGroup() and readNextFilteredRowGroup() before assigning the new row group, and currentRowGroup is included in the AutoCloseables.uncheckedClose() chain in close(). This brings the buffer lifecycle management into ParquetFileReader itself so all callers benefit automatically.

Are these changes tested?
The existing test suites in parquet-hadoop continue to pass. Additional tests were added to verify that PageReadStore buffers are properly released.

Are there any user-facing changes?
No API changes. Callers that already close the PageReadStore themselves (like InternalParquetRecordReader) will see a harmless double-close, since ColumnChunkPageReadStore.close() is idempotent via ByteBufferReleaser. Callers that did not close the PageReadStore will now have their buffers released automatically, reducing memory usage.

Closes #3487
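The harmless double-close noted above rests on the release path being idempotent. A minimal sketch of that guard, assuming a hypothetical Releaser class (not the actual ByteBufferReleaser implementation), which frees its buffers at most once no matter how many times close() is called:

```java
// Hypothetical releaser with an idempotence guard: closing twice is a no-op,
// which is why a caller's close plus the reader's automatic close is harmless.
class Releaser implements AutoCloseable {
    private boolean released = false;
    int releaseCount = 0; // stands in for "how many times buffers were freed"

    @Override
    public void close() {
        if (released) {
            return; // second and later closes do nothing
        }
        released = true;
        releaseCount++; // underlying buffers are freed exactly once
    }
}

public class IdempotentCloseSketch {
    public static void main(String[] args) {
        Releaser releaser = new Releaser();
        releaser.close(); // e.g., the caller closes the PageReadStore itself
        releaser.close(); // e.g., the reader closes it again via closeCurrentRowGroup()
        System.out.println("release count: " + releaser.releaseCount); // prints 1
    }
}
```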