Lazy register parquet file metrics #22353
run benchmarks

🤖 Benchmarks completed (GKE) for tpch, tpcds, and clickbench_partitioned, each comparing xudong963/issue-22189 (eb2ee9b) to c8b784a (merge-base).
Which issue does this PR close?
Rationale for this change
Parquet scans currently register a full set of per-file metrics when each
`ParquetFileMetrics` is created, even when most of those metrics remain zero.
This is costly for scans over many parquet files because metric registration
clones labels such as `filename` and appends many unused metrics to the plan
metrics set.
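The idea can be illustrated with a toy model. This is only a sketch under stated assumptions: `MetricsSet`, `FileMetrics`, and `registered_after_scan` are hypothetical stand-ins, not DataFusion's types or this PR's implementation. Counters are plain shared handles while the file is scanned, and only the non-zero ones are appended to the shared set (cloning the filename label) when the per-file metrics are dropped.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Toy stand-in for a plan-level metrics set (hypothetical type).
#[derive(Default)]
struct MetricsSet {
    /// (metric name, filename label, value)
    registered: Mutex<Vec<(String, String, usize)>>,
}

/// Toy per-file metrics: counters stay unregistered until drop.
struct FileMetrics {
    filename: String,
    bytes_scanned: Arc<AtomicUsize>,
    row_groups_pruned: Arc<AtomicUsize>,
    set: Arc<MetricsSet>,
}

impl FileMetrics {
    fn new(filename: &str, set: Arc<MetricsSet>) -> Self {
        // Nothing is registered here, so no label is cloned per metric up front.
        Self {
            filename: filename.to_string(),
            bytes_scanned: Arc::default(),
            row_groups_pruned: Arc::default(),
            set,
        }
    }
}

impl Drop for FileMetrics {
    fn drop(&mut self) {
        let mut registered = self.set.registered.lock().unwrap();
        let counters = [
            ("bytes_scanned", &self.bytes_scanned),
            ("row_groups_pruned", &self.row_groups_pruned),
        ];
        for (name, counter) in counters {
            let value = counter.load(Ordering::Relaxed);
            // Only metrics that ended up non-zero pay the registration cost.
            if value != 0 {
                registered.push((name.to_string(), self.filename.clone(), value));
            }
        }
    }
}

/// Simulates scanning one file and returns what ended up registered.
fn registered_after_scan(bytes: usize) -> Vec<(String, String, usize)> {
    let set = Arc::new(MetricsSet::default());
    {
        let metrics = FileMetrics::new("a.parquet", Arc::clone(&set));
        metrics.bytes_scanned.fetch_add(bytes, Ordering::Relaxed);
    } // `metrics` is dropped here, which triggers registration.
    let out = set.registered.lock().unwrap().clone();
    out
}
```

In this model a file whose counters all stay at zero registers nothing, while a file that scans some bytes registers a single `bytes_scanned` entry.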
What changes are included in this PR?
metric.
metrics when the file metrics are dropped.
handles can be updated throughout streaming before registration occurs.
page_index_pages_skipped_by_fully_matched.
requiring metrics that are expected to have non-zero values.
Count, Gauge, Time, PruningMetrics, and RatioMetrics.
Are these changes tested?
Yes.
I also ran a local release microbenchmark for repeated `ParquetFileMetrics`
creation against `origin/main`; for 50k file metrics, zero-metric registration
improved from about 57.9ms to 25.2ms, and the `bytes_scanned` case improved
from about 57.7ms to 28.4ms.
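For a rough sense of scale, the quoted timings can be turned into a speedup factor and a per-file cost. The helper names below are hypothetical, and the figures are derived only from the numbers above, taking "50k" as 50,000 file metrics.

```rust
/// Ratio of the old total time to the new total time.
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Average cost per file metrics instance, in microseconds.
fn per_file_us(total_ms: f64, files: f64) -> f64 {
    total_ms * 1000.0 / files
}

/// Returns (speedup, per-file cost before, per-file cost after) for one case.
fn summarize(before_ms: f64, after_ms: f64) -> (f64, f64, f64) {
    let files = 50_000.0;
    (
        speedup(before_ms, after_ms),
        per_file_us(before_ms, files),
        per_file_us(after_ms, files),
    )
}
```

`summarize(57.9, 25.2)` gives a speedup of roughly 2.3x (about 1.16 µs down to about 0.50 µs per file) for the zero-metric case, and `summarize(57.7, 28.4)` gives roughly 2.0x (about 1.15 µs down to about 0.57 µs per file) for the `bytes_scanned` case.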
Are there any user-facing changes?
There are no SQL result or schema changes.
The observable metrics output changes: parquet file metrics that remain zero
may now be absent from the metrics set instead of being present with value `0`.
Low-level integrations should treat missing parquet file metrics as zero when
appropriate.
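A minimal sketch of what "treat missing as zero" can look like on the consumer side. It assumes the integration has already collected metric values into a plain map keyed by metric name; the map shape and the `metric_or_zero` helper are hypothetical and not part of DataFusion's API.

```rust
use std::collections::HashMap;

/// Looks up a metric by name, treating an absent entry as zero.
fn metric_or_zero(metrics: &HashMap<String, usize>, name: &str) -> usize {
    metrics.get(name).copied().unwrap_or(0)
}

/// Builds a sample map in which only one metric was registered.
fn sample_metrics() -> HashMap<String, usize> {
    let mut metrics = HashMap::new();
    metrics.insert("bytes_scanned".to_string(), 4096);
    metrics
}
```

With `sample_metrics()`, looking up `bytes_scanned` yields 4096, while a metric that was never registered, such as `page_index_pages_skipped_by_fully_matched`, reads as 0 instead of causing a failed lookup.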
`ParquetFileMetrics` is a low-level public struct that is documented as subject
to change. This PR adds an internal registration guard field, so external Rust
code should construct it via `ParquetFileMetrics::new`.
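The effect of a private field on construction can be shown with a toy struct. The module, the `FileMetrics` type, its `_guard` field, and the argument-free `new` below are all hypothetical and do not mirror the real `ParquetFileMetrics::new` signature; they only illustrate why a struct literal stops working outside the defining module once a non-public field exists.

```rust
mod file_metrics {
    pub struct FileMetrics {
        pub bytes_scanned: usize,
        // Private field: code outside this module cannot name it,
        // so it cannot build the struct with literal syntax.
        _guard: (),
    }

    impl FileMetrics {
        /// The constructor is the only way to build the struct from outside.
        pub fn new() -> Self {
            Self { bytes_scanned: 0, _guard: () }
        }
    }
}

fn build() -> file_metrics::FileMetrics {
    // `file_metrics::FileMetrics { bytes_scanned: 0 }` would not compile here,
    // because the private `_guard` field cannot be supplied.
    file_metrics::FileMetrics::new()
}
```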