refactor: Split Parquet BloomFilter CPU and IO into separate states#21285
refactor: Split Parquet BloomFilter CPU and IO into separate states#21285alamb merged 1 commit intoapache:mainfrom
Conversation
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing alamb/split_bloom (ac46396) to 5ff80e4 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing alamb/split_bloom (ac46396) to 5ff80e4 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing alamb/split_bloom (ac46396) to 5ff80e4 (merge-base) diff using: tpcds File an issue against this benchmark runner |
ac46396 to
bf0d992
Compare
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing alamb/split_bloom (bf0d992) to 6412c3a (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing alamb/split_bloom (bf0d992) to 6412c3a (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing alamb/split_bloom (bf0d992) to 6412c3a (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
adriangb
left a comment
There was a problem hiding this comment.
I like the direction these refactors are going in @alamb , this is the right way to organize opener.rs (I know at some point we spoke about splitting it up into more modules, this is either a good first step towards that or a better solution).
| .into_iter() | ||
| .filter_map(|column_name| { | ||
| let parquet_schema = builder.parquet_schema(); | ||
| let (column_idx, _) = parquet_column( |
There was a problem hiding this comment.
It'd be nice if we could get this to also support struct columns / struct filters cc @friendlymatthew
|
Thanks again @Dandandan and @adriangb I am now working on the next steps |
…pache#21285) ## Which issue does this PR close? - part of apache#20529 - Broken out of apache#20820 ## Rationale for this change We are try to be explicit about CPU and IO in the Parquet reader, however, the code for applying bloom filters still did both CPU and IO in the same future ## What changes are included in this PR? 1. Split the states for IO and CPU work when applying bloom filters ## Are these changes tested? Functionally by existing tests I also ran performance tests and they didn't show any substantial performance change ## Are there any user-facing changes? No, this is internal code reorganization
Which issue does this PR close?
Rationale for this change
We are try to be explicit about CPU and IO in the Parquet reader, however, the code for applying bloom filters still did both CPU and IO in the same future
What changes are included in this PR?
Are these changes tested?
Functionally by existing tests
I also ran performance tests and they didn't show any substantial performance change
Are there any user-facing changes?
No, this is internal code reorganization