
POC: Window function intra-operator parallelism and vectorization (up to 50× faster) #22356

Draft
2010YOUY01 wants to merge 1 commit into apache:main from 2010YOUY01:parallel-window

@2010YOUY01 2010YOUY01 commented May 19, 2026

Which issue does this PR close?

POC for #22355

Rationale for this change

This optimization targets only one specific class of window function, one that has both:

  • Fixed ROWS window frame
  • Regular aggregate function

See issue #22355 for the implementation idea.
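To illustrate why this case lends itself to intra-operator parallelism, here is a minimal Python sketch. It is not the implementation in this PR (see the linked issue for that); it only shows one way a fixed ROWS frame over a regular aggregate can be split into independent chunks. Each chunk reads a small "halo" of neighbouring rows and uses prefix sums so that every output row costs O(1). All names (`window_avg_serial`, `window_avg_parallel`, `chunk_size`, `workers`) are hypothetical, and Python threads are used only to show that the chunks are independent, not to demonstrate a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def window_avg_serial(values, preceding, following):
    """Reference: avg over ROWS BETWEEN `preceding` PRECEDING AND `following` FOLLOWING."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - preceding)        # frame is clamped at the partition start
        hi = min(n, i + following + 1)    # and at the partition end (exclusive bound)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def _chunk_avg(values, start, end, preceding, following):
    """Compute output rows [start, end) from the chunk plus its halo rows only."""
    n = len(values)
    halo_lo = max(0, start - preceding)   # first input row this chunk may read
    halo_hi = min(n, end + following)     # one past the last input row it may read
    prefix = [0]                          # prefix[k] = sum of the first k halo rows
    for v in values[halo_lo:halo_hi]:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(start, end):
        lo = max(0, i - preceding)
        hi = min(n, i + following + 1)
        total = prefix[hi - halo_lo] - prefix[lo - halo_lo]
        out.append(total / (hi - lo))
    return out


def window_avg_parallel(values, preceding, following, chunk_size=4, workers=4):
    """Split one partition into chunks and evaluate each chunk independently."""
    n = len(values)
    bounds = [(s, min(n, s + chunk_size)) for s in range(0, n, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda b: _chunk_avg(values, b[0], b[1], preceding, following), bounds
        )
        # pool.map yields results in input order, so concatenation keeps row order.
        return [x for part in parts for x in part]


if __name__ == "__main__":
    data = list(range(1, 21))
    assert window_avg_parallel(data, 3, 3) == window_avg_serial(data, 3, 3)
```

Because the frame bounds depend only on the row position, a chunk never needs state from the chunk before it. That is what distinguishes this case from window functions whose result depends on a running value.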

Benchmark

In datafusion-cli:

create external table lineitem
stored as parquet
location '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10/lineitem'
with order (l_orderkey asc);

-- PR only config
set datafusion.optimizer.prefer_window_agg_exec=true;

-- Q1: Single global partition, no sort, low scanning overhead
SELECT
  v1,
  avg(v1) OVER (
    ORDER BY v1
    ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
  ) AS moving_avg
FROM generate_series(1000000000) AS t1(v1);

-- Q2: Single global partition, window expr matches parquet existing order
SELECT
  l_orderkey,
  avg(l_partkey) OVER (
    ORDER BY l_orderkey
    ROWS BETWEEN 1000 PRECEDING AND 1000 FOLLOWING
  ) AS moving_avg
FROM lineitem;

-- Q3: 3 partitions
SELECT
  l_orderkey,
  avg(l_partkey) OVER (
    PARTITION BY l_returnflag
    ORDER BY l_orderkey
    ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
  ) AS moving_avg
FROM lineitem;

-- Q4: 100k partitions
SELECT
  l_orderkey,
  avg(l_partkey) OVER (
    PARTITION BY l_suppkey
    ORDER BY l_orderkey
    ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
  ) AS moving_avg
FROM lineitem;

Result

| Query | PR (s) | Main (s) | Speedup |
|-------|--------|----------|---------|
| Q1    | 1.40 | 79   | 56.43x  |
| Q2    | 0.99 | 5.44 | 5.49x   |
| Q3    | 2.66 | 5.15 | 1.94x   |
| Q4    | 1.08 | 1.55 | 1.44x   |

(Q3 and Q4 each include ~1s of sorting time, so the speedup of the window execution itself is likely much larger than these end-to-end numbers show.)
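As a rough back-of-the-envelope check of that remark, the sketch below subtracts the sort time from both columns and recomputes the ratio. It rests on assumptions that are not in the PR: the timings are in seconds, the sort costs exactly 1 s, and it costs the same on both branches. The helper name `speedup_excluding_sort` is hypothetical.

```python
SORT_SECONDS = 1.0  # assumption: the "~1s" sort estimate, taken as exactly 1 s

# (PR, main) end-to-end timings copied from the result table above.
timings = {"Q3": (2.66, 5.15), "Q4": (1.08, 1.55)}


def speedup_excluding_sort(pr, main, sort=SORT_SECONDS):
    """Speedup of the remaining (non-sort) work, assuming equal sort cost on both sides."""
    return (main - sort) / (pr - sort)


adjusted = {q: speedup_excluding_sort(pr, main) for q, (pr, main) in timings.items()}

for query, ratio in adjusted.items():
    print(f"{query}: about {ratio:.1f}x excluding sort")
```

Under these assumptions the non-sort portion is roughly 2.5x faster for Q3 and close to 7x faster for Q4. The Q4 figure is very sensitive to the sort estimate, because only about 0.08 s of non-sort time remains on the PR side, so treat it as indicative only.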

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions (bot) added the following labels on May 19, 2026: documentation, logical-expr, physical-expr, optimizer, core, sqllogictest, common, execution, functions, physical-plan
@2010YOUY01 2010YOUY01 marked this pull request as draft May 19, 2026 08:59
@2010YOUY01
Contributor Author

This PoC only targets a specific window expression case. I am still prototyping and thinking through a design that is easy to extend to other possible optimizations for different window function types. After I figure that out, I will split the work into smaller PRs.
