
Fix LZ4_RAW heap decompression failure on chunked BytesInput (#3478)#3486

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/3478-lz4-raw-chunked-decompression

Conversation


@yadavay-amzn yadavay-amzn commented Apr 18, 2026

What changes were proposed?

Eagerly materialize the decompressed stream for LZ4_RAW in CodecFactory, matching the existing pattern used for ZSTD. Added a regression test.

Closes #3478

Why are the changes needed?

When reading LZ4_RAW-compressed data through the heap codec path, decompression fails if the decompressed page exceeds ~8KB. The lazy StreamBytesInput.writeInto() path reads via Channels.newChannel() in ~8KB chunks, but LZ4_RAW requires all compressed input in a single buffer for one-shot decompression. This causes MalformedInputException: all input must be consumed in production reads, particularly during dictionary filter evaluation (DictionaryPageReader.reusableCopy()).
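The ~8 KB figure comes from the JDK itself: Channels.newChannel(InputStream) wraps the stream in a channel whose internal transfer buffer is 8192 bytes, so each read() call delivers at most 8 KB even when the destination ByteBuffer is larger. A minimal stand-alone illustration (plain JDK, no Parquet classes involved):

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChunkedReadDemo {
  public static void main(String[] args) throws Exception {
    byte[] data = new byte[16 * 1024];                  // a 16 KB "page"
    ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(data));
    ByteBuffer dst = ByteBuffer.allocate(data.length);  // room for everything at once
    int reads = 0;
    int n;
    // Even with a 16 KB destination buffer, the channel caps each read at 8192 bytes.
    while ((n = ch.read(dst)) > 0) {
      System.out.println("read " + (++reads) + ": " + n + " bytes");
    }
  }
}
```

A consumer that hands each chunk to a one-shot block decompressor therefore feeds it truncated input, which is exactly the failure mode described above.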

The fix extends the existing ZSTD eager-materialization pattern (added for parquet-format#398) to also cover LZ4_RAW:

if (codec instanceof ZstandardCodec || codec instanceof Lz4RawCodec) {
  decompressed = BytesInput.copy(BytesInput.from(is, decompressedSize));
  is.close();
}
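The pattern amounts to "drain the stream into one contiguous buffer, then decompress in one shot". A self-contained sketch of that idea, using the JDK's Deflater/Inflater as stand-ins for the real LZ4_RAW codec (in the actual fix, BytesInput.copy plays the role that readAllBytes plays here):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class EagerMaterializeDemo {
  public static void main(String[] args) throws Exception {
    byte[] original = new byte[16 * 1024];
    Arrays.fill(original, (byte) 'a');        // a highly compressible 16 KB page

    // Compress it (Deflater stands in for the real codec).
    Deflater def = new Deflater();
    def.setInput(original);
    def.finish();
    byte[] tmp = new byte[original.length + 64];
    int clen = 0;
    while (!def.finished()) clen += def.deflate(tmp, clen, tmp.length - clen);

    // Eager materialization: drain the stream into ONE contiguous buffer
    // before handing it to a one-shot decompressor.
    InputStream is = new ByteArrayInputStream(Arrays.copyOf(tmp, clen));
    byte[] allInput = is.readAllBytes();
    is.close();

    Inflater inf = new Inflater();
    inf.setInput(allInput);                   // the whole compressed page at once
    byte[] out = new byte[original.length];
    int dlen = 0;
    while (!inf.finished()) dlen += inf.inflate(out, dlen, out.length - dlen);
    System.out.println("decompressed=" + dlen + " match=" + Arrays.equals(out, original));
  }
}
```

Because the decompressor sees the full compressed input in a single buffer, it never observes a truncated chunk, which is the property the instanceof guard restores for Lz4RawCodec.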

How was this tested?

Added testLz4RawHeapDecompressorCanCopyLargePage to TestCompressionCodec:

  • Compresses a 16KB random page via CodecFactory heap path
  • Decompresses and calls BytesInput.copy() to exercise the chunked materialization path
  • Fails without the fix, passes with it

All existing tests continue to pass (6/6 in TestCompressionCodec, 11/11 across all codec tests).
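For orientation, the shape of that test can be approximated with JDK stand-ins (Deflater/Inflater replace the real Lz4RawCodec, and an explicit channel drain replaces BytesInput.copy(); all class and variable names here are illustrative, not the actual test code):

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ChunkedCopyRoundTrip {
  public static void main(String[] args) throws Exception {
    byte[] page = new byte[16 * 1024];
    new Random(42).nextBytes(page);           // seeded: deterministic "random" page

    // 1. Compress the page (Deflater stands in for LZ4_RAW).
    Deflater def = new Deflater();
    def.setInput(page);
    def.finish();
    byte[] tmp = new byte[page.length + 1024];
    int clen = 0;
    while (!def.finished()) clen += def.deflate(tmp, clen, tmp.length - clen);

    // 2. Decompress it back with the whole compressed buffer available at once.
    Inflater inf = new Inflater();
    inf.setInput(tmp, 0, clen);
    byte[] decompressed = new byte[page.length];
    int dlen = 0;
    while (!inf.finished()) dlen += inf.inflate(decompressed, dlen, decompressed.length - dlen);

    // 3. "Copy" the decompressed bytes the way BytesInput.copy() ultimately does:
    //    through Channels.newChannel, which delivers the stream in ~8 KB chunks.
    ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(decompressed));
    ByteBuffer dst = ByteBuffer.allocate(page.length);
    while (ch.read(dst) > 0) { /* drain to EOF */ }

    System.out.println("roundTripOk=" + Arrays.equals(dst.array(), page));
  }
}
```

The real regression test differs in that its step 3 drains the codec's lazy decompressed stream, which is where the unpatched LZ4_RAW path throws.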

…3478)

Eagerly materialize the decompressed stream for LZ4_RAW in CodecFactory,
matching the existing pattern used for ZSTD. Without this, the lazy
StreamBytesInput.writeInto() path reads via Channels.newChannel() in
~8KB chunks, but LZ4_RAW requires all compressed input in a single
buffer for one-shot decompression.

Added a regression test that compresses and decompresses a 16KB page
through the CodecFactory heap path, then calls BytesInput.copy() to
exercise the chunked materialization code path. The test fails without
the fix and passes with it.
yadavay-amzn force-pushed the fix/3478-lz4-raw-chunked-decompression branch from b1e91f5 to 5998c9d on April 18, 2026, 06:11


Development

Successfully merging this pull request may close these issues.

LZ4_RAW heap decompression fails on chunked BytesInput materialization for large pages
