StatisticsCounter::toStatistics() writes length-prefixed bytes for min_value/max_value on BYTE_ARRAY string columns, causing corrupted statistics #2247

@mcarbonell-paymefy

Description

When writing Parquet files with BYTE_ARRAY string columns (STRING, JSON logical types), the statistics min_value and max_value fields are serialized using plain encoding (a 4-byte little-endian length prefix followed by the raw bytes). According to the Parquet spec, the deprecated min/max fields should use plain encoding, but min_value/max_value should contain only the raw value bytes, without any length prefix.
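
To make the difference concrete, here is a minimal sketch in plain PHP (no library calls; the variable names are illustrative only) of the two byte layouts for the value "hello":

```php
<?php
// Illustration only: plain PHP, independent of the library's classes.
$value = 'hello';

// Plain encoding of a BYTE_ARRAY: 4-byte little-endian length, then the bytes.
$plain = \pack('V', \strlen($value)) . $value;

// What min_value / max_value are expected to hold: the bytes alone.
$raw = $value;

echo \bin2hex($plain), PHP_EOL; // 0500000068656c6c6f
echo \bin2hex($raw), PHP_EOL;   // 68656c6c6f
```

The first output is what currently ends up in min_value/max_value; the second is what the description above expects there.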

This causes two problems:

  1. Written min_value/max_value are non-compliant with the Parquet spec, so other Parquet implementations read the length prefix as part of the value.
  2. StatisticsReader has a shortcut that returns the raw buffer directly when mb_check_encoding($value, 'UTF-8') is true, which is almost always the case for ASCII strings, so it returns \x05\x00\x00\x00hello instead of hello.
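
The following sketch shows why the UTF-8 check does not catch the prefix for short strings. It assumes nothing about the library: plainEncode() is a hypothetical helper written for this example. The prefix bytes of a short value (\x05\x00\x00\x00) are all valid single-byte UTF-8 code points, so the check passes. It fails only when a prefix byte is 0x80 or higher and does not form a valid sequence, for example for a 200-byte value.

```php
<?php
// Hypothetical helper for illustration; not part of the library.
function plainEncode(string $value): string
{
    return \pack('V', \strlen($value)) . $value;
}

$short = plainEncode('hello');               // prefix bytes: 05 00 00 00
$long  = plainEncode(\str_repeat('a', 200)); // prefix bytes: c8 00 00 00

\var_dump(\mb_check_encoding($short, 'UTF-8')); // bool(true)
\var_dump(\mb_check_encoding($long, 'UTF-8'));  // bool(false)
```

So for typical short ASCII values the shortcut is taken and the prefixed buffer is returned unchanged.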

Root cause

In StatisticsCounter::toStatistics() (lines 129–130), both min/max and min_value/max_value are packed using PlainValuesPacker, which calls BinaryBufferWriter::writeStrings():

// BinaryBufferWriter::writeStrings() — always prepends 4-byte length
$this->buffer .= \pack($format, $length); // 4-byte length prefix
$this->buffer .= $string;

Then in StatisticsReader::min() / max() (and their minValue/maxValue variants):

// This shortcut fires for almost all valid UTF-8 strings,
// returning the raw buffer WITH the 4-byte length prefix
if (ColumnPrimitiveType::isString($column) && \mb_check_encoding($this->statistics->min, 'UTF-8')) {
    return $this->statistics->min; // BUG: includes \x05\x00\x00\x00 prefix
}
// Only reaches PlainValueUnpacker (which correctly strips the prefix) if mb_check_encoding fails

Expected behavior

  • min_value / max_value for BYTE_ARRAY string columns should contain the raw string bytes only (no length prefix), as per the Parquet spec §Statistics.
  • StatisticsReader should consistently unpack via PlainValueUnpacker for min/max (which are length-prefixed), and return raw bytes for min_value/max_value (which should not be).
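
A rough sketch of the behavior described in the two points above, written as standalone functions. The names packLegacyStat(), packStatValue() and unpackLegacyStat() are hypothetical and exist only for this example; they are not the library's API and this is not a proposed patch.

```php
<?php
// All three helpers are hypothetical, for illustration only.

/** Bytes for the deprecated min/max fields: length-prefixed, as this issue assumes. */
function packLegacyStat(string $value): string
{
    return \pack('V', \strlen($value)) . $value;
}

/** Bytes for min_value / max_value: the raw value, no prefix. */
function packStatValue(string $value): string
{
    return $value;
}

/** Always strip the 4-byte prefix from a min/max buffer, with no UTF-8 shortcut. */
function unpackLegacyStat(string $buffer): string
{
    if (\strlen($buffer) < 4) {
        throw new \InvalidArgumentException('Buffer too short for a length prefix');
    }

    $length = \unpack('V', $buffer)[1];

    if (\strlen($buffer) - 4 < $length) {
        throw new \InvalidArgumentException('Buffer shorter than its declared length');
    }

    return \substr($buffer, 4, $length);
}

echo unpackLegacyStat(packLegacyStat('hello')), PHP_EOL; // hello
echo packStatValue('hello'), PHP_EOL;                    // hello
```

The point of the sketch is that the reader decides how to decode based on which field it is reading, not on whether the bytes happen to be valid UTF-8.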

Steps to reproduce

$writer = new Writer(...);
$writer->open($path);
$writer->write(Rows::fromArray([['name' => 'hello'], ['name' => 'world']]));
$writer->close();

$reader = new Reader();
$file = $reader->read($path);
$chunk = $file->metadata()->rowGroups()[0]->columnChunks()[0];
$stats = $chunk->statistics(); // min_value = "\x05\x00\x00\x00hello" instead of "hello"
