feat: accelerate Iceberg RewriteDataFiles reads via Comet native scan #4251

Open
jordepic wants to merge 1 commit into apache:main from jordepic:main

jordepic commented May 6, 2026

Which issue does this PR close?

Closes #4250.

Rationale for this change

Across the industry, a large share of query resources goes to rewriting data files for Iceberg tables with Spark procedures. Using native code here where possible can significantly speed up this process.

What changes are included in this PR?

Detect the Spark scans (SparkStagedScan) that are created during the RewriteDataFilesSparkAction and replace them with Comet scans. Extract the tasks associated with each scan and pass them to the Comet scan with no filter, since the staged scan does not apply one (see SparkStagedScan line 50 in the Apache Iceberg project).
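
As a rough illustration of the replacement described above, here is a minimal toy model in Python. It is not the code in this PR and uses no Spark, Iceberg, or Comet API: `PlanNode`, `replace_staged_scans`, and the `NativeScan` operator name are hypothetical stand-ins, and the task strings are made up. It only shows the shape of the rewrite: walk the plan, find nodes backed by a SparkStagedScan, and swap them for a native scan that keeps the same tasks and carries no filter.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

STAGED_SCAN = "SparkStagedScan"


@dataclass(frozen=True)
class PlanNode:
    """Toy stand-in for a physical plan operator (hypothetical, not a Spark class)."""
    name: str
    scan_class: Optional[str] = None       # class of the underlying scan, if any
    tasks: Tuple[str, ...] = ()            # file scan tasks attached to the scan
    filters: Tuple[str, ...] = ()          # pushed-down filters, if any
    children: Tuple["PlanNode", ...] = ()


def replace_staged_scans(node: PlanNode) -> PlanNode:
    """Return a copy of the plan with every staged scan replaced by a native scan."""
    children = tuple(replace_staged_scans(child) for child in node.children)
    if node.scan_class == STAGED_SCAN:
        # Keep the tasks extracted from the staged scan; pass no filter along.
        return PlanNode(
            name="NativeScan",             # hypothetical operator name
            scan_class=STAGED_SCAN,
            tasks=node.tasks,
            filters=(),
            children=children,
        )
    return replace(node, children=children)


# Made-up plan: a writer on top of a batch scan backed by a staged scan.
plan = PlanNode(
    "WriteFiles",
    children=(
        PlanNode(
            "BatchScan",
            scan_class=STAGED_SCAN,
            tasks=("file-a.parquet", "file-b.parquet"),
        ),
    ),
)
rewritten = replace_staged_scans(plan)
```

Nodes that are not backed by a staged scan are copied unchanged, so the rewrite in this sketch leaves every other kind of scan alone.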

How are these changes tested?

Two tests inspect the Spark plan produced when rewriting data files and verify that the scan operators are replaced. Before this change is merged, I can also run it locally and collect benchmarks for table compactions (on tables with only data files, and on tables with associated delete files).
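
For readers who want to do a similar check by hand, the sketch below shows one way to verify operator replacement from the text of a plan. It is an assumption-laden illustration, not the tests in this PR: the helper names are hypothetical, the operator names `CometIcebergNativeScan` and `BatchScan` are assumed defaults that may not match the real plan output, and both sample plans are made up.

```python
import re
from typing import List


def operator_names(plan_text: str) -> List[str]:
    """Extract the first identifier on each plan line, skipping tree markers like '+- '."""
    names = []
    for line in plan_text.splitlines():
        match = re.search(r"[A-Za-z_]\w*", line)
        if match:
            names.append(match.group(0))
    return names


def scan_replaced(
    plan_text: str,
    native_op: str = "CometIcebergNativeScan",  # assumed operator name
    spark_op: str = "BatchScan",                # assumed operator name
) -> bool:
    """True when the native operator is present and the Spark scan operator is gone."""
    ops = operator_names(plan_text)
    return native_op in ops and spark_op not in ops


# Made-up plan text, for illustration only.
before = "AppendData\n+- BatchScan local.db.tbl\n"
after = "AppendData\n+- CometIcebergNativeScan local.db.tbl\n"

print(scan_replaced(before), scan_replaced(after))
```

Comparing whole operator names rather than substrings avoids a false match when one operator name contains another.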

jordepic force-pushed the main branch 2 times, most recently from 5b382c0 to 352861a, on May 7, 2026 00:30
Development

Successfully merging this pull request may close these issues.

Support Iceberg "Rewrite Data Files Procedure"