Description
When writing Parquet files with BYTE_ARRAY string columns (STRING, JSON logical types), the statistics min_value and max_value fields are serialized using plain encoding (a 4-byte little-endian length prefix followed by the raw bytes). According to the Parquet spec, the deprecated min/max fields should use plain encoding, but min_value/max_value should contain only the raw value bytes, without any length prefix.
This causes two problems:
- Written min_value/max_value are non-compliant with the Parquet spec and unreadable by other Parquet implementations.
- StatisticsReader has a shortcut that returns the raw buffer directly when mb_check_encoding($value, 'UTF-8') is true. That is almost always the case for ASCII strings, because the length prefix of a short string consists only of ASCII-range bytes, so the reader returns \x05\x00\x00\x00hello instead of hello.
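To make the two layouts concrete, here is a minimal standalone sketch. The helper plainEncodeByteArray() is hypothetical and only mimics the length-prefixed layout described above; it is not the library's BinaryBufferWriter.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not the library's code): PLAIN encoding of a BYTE_ARRAY
// value, i.e. a 4-byte little-endian length prefix followed by the raw bytes.
function plainEncodeByteArray(string $value): string
{
    return \pack('V', \strlen($value)) . $value;
}

$raw = 'hello';
$plain = plainEncodeByteArray($raw);

// The prefix bytes 0x05 0x00 0x00 0x00 are all in the ASCII range,
// so the prefixed buffer still passes a UTF-8 validity check.
$looksLikeUtf8 = \mb_check_encoding($plain, 'UTF-8');

echo \bin2hex($plain), PHP_EOL; // 0500000068656c6c6f
```

The prefixed buffer is 9 bytes long and still counts as valid UTF-8, which is why a UTF-8 check cannot distinguish a raw string from a length-prefixed one.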
Root cause
In StatisticsCounter::toStatistics() (lines 129–130), both min/max and min_value/max_value are packed using PlainValuesPacker, which calls BinaryBufferWriter::writeStrings():
// BinaryBufferWriter::writeStrings() — always prepends 4-byte length
$this->buffer .= \pack($format, $length); // 4-byte length prefix
$this->buffer .= $string;
Then in StatisticsReader::min() / max() (and their minValue/maxValue variants):
// This shortcut fires for almost all valid UTF-8 strings,
// returning the raw buffer WITH the 4-byte length prefix
if (ColumnPrimitiveType::isString($column) && \mb_check_encoding($this->statistics->min, 'UTF-8')) {
return $this->statistics->min; // BUG: includes \x05\x00\x00\x00 prefix
}
// Only reaches PlainValueUnpacker (which correctly strips the prefix) if mb_check_encoding fails
Expected behavior
- min_value / max_value for BYTE_ARRAY string columns should contain the raw string bytes only (no length prefix), as per the Parquet spec §Statistics.
- StatisticsReader should consistently unpack via PlainValueUnpacker for min/max (which are length-prefixed), and return raw bytes for min_value/max_value (which should not be).
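The expected behavior above could be sketched roughly as follows. All three function names are hypothetical and are not part of the library. The sketch assumes the layout described in this issue: a 4-byte little-endian length prefix for the legacy min/max fields, and raw bytes for min_value/max_value.

```php
<?php
declare(strict_types=1);

// Hypothetical: legacy min/max as described in this issue (length-prefixed).
function encodeLegacyStat(string $value): string
{
    return \pack('V', \strlen($value)) . $value;
}

// Hypothetical: min_value/max_value carry the raw bytes only.
function encodeStatValue(string $value): string
{
    return $value;
}

// Hypothetical: always strip the prefix from legacy fields, with no UTF-8 shortcut.
function decodeLegacyStat(string $buffer): string
{
    if (\strlen($buffer) < 4) {
        throw new \InvalidArgumentException('Buffer too short for a length prefix');
    }

    $length = \unpack('Vlength', $buffer)['length'];

    if ($length !== \strlen($buffer) - 4) {
        throw new \InvalidArgumentException('Length prefix does not match payload size');
    }

    // The cast guards against substr() returning false on PHP versions before 8.0.
    return (string) \substr($buffer, 4);
}
```

The point of the sketch is that the decoding path depends only on which field is being read, never on whether the bytes happen to be valid UTF-8.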
Steps to reproduce
$writer = new Writer(...);
$writer->open($path);
$writer->write(Rows::fromArray([['name' => 'hello'], ['name' => 'world']]));
$writer->close();
$reader = new Reader();
$file = $reader->read($path);
$chunk = $file->metadata()->rowGroups()[0]->columnChunks()[0];
$stats = $chunk->statistics(); // min_value = "\x05\x00\x00\x00hello" instead of "hello"
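When checking a file for this problem, a small heuristic can flag buffers that appear to carry a length prefix. The helper below is hypothetical and not part of the library. It only tests whether the first four bytes, read as a little-endian integer, equal the number of bytes that follow, so a raw value could in rare cases match by coincidence.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: reports whether a statistics buffer begins with a 4-byte
// little-endian length that matches the remaining byte count.
function looksLengthPrefixed(string $buffer): bool
{
    if (\strlen($buffer) < 4) {
        return false;
    }

    $length = \unpack('Vlength', $buffer)['length'];

    return $length === \strlen($buffer) - 4;
}

\var_dump(looksLengthPrefixed("\x05\x00\x00\x00hello")); // bool(true)
\var_dump(looksLengthPrefixed('hello'));                  // bool(false)
```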