perf: [iceberg] Single-pass FileScanTask validation#3443

Merged
mbutrovich merged 39 commits into apache:main from mbutrovich:faster_filescantask
Feb 10, 2026

@mbutrovich
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

We currently iterate through an Iceberg scan's FileScanTasks three times to validate them. This adds overhead for large Iceberg table scans.

What changes are included in this PR?

Collapse the three loops down to one.
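
As a rough illustration of the idea (not the actual CometScanRule code), the sketch below contrasts three separate traversals with one fused loop. The `Task` class, its fields, and the three checks are hypothetical stand-ins chosen for the example; the real validation conditions are not spelled out in this description.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a FileScanTask; fields are illustrative only."""
    path: str
    file_format: str
    has_deletes: bool


# Assumed set of supported URI schemes, purely for the example.
SUPPORTED_SCHEMES = ("s3://", "file://")


def validate_three_pass(tasks: List[Task]) -> bool:
    # Three traversals of the task list, one per check.
    all_parquet = all(t.file_format == "parquet" for t in tasks)
    all_supported = all(t.path.startswith(SUPPORTED_SCHEMES) for t in tasks)
    no_deletes = all(not t.has_deletes for t in tasks)
    return all_parquet and all_supported and no_deletes


def validate_single_pass(tasks: Iterable[Task]) -> bool:
    # One traversal: every check runs per task, and the loop
    # exits as soon as any check fails.
    for t in tasks:
        if t.file_format != "parquet":
            return False
        if not t.path.startswith(SUPPORTED_SCHEMES):
            return False
        if t.has_deletes:
            return False
    return True
```

Both functions return the same verdict for the same input; the fused version touches each task once and can stop at the first failing task, which is where the saving comes from on scans with many tasks.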

How are these changes tested?

Existing tests, plus a new benchmark that runs CometScanRule.

…oned columns) and run a representative test.
1. findAllIcebergSplitData() collected perPartitionByLocation (all partitions' data)
2. This map was captured in the createCometExecIter closure
3. ZippedPartitionsRDD serialized that closure to every task
4. Each task received ALL partitions' data (925 bytes to both tasks)

Instead we now use CometIcebergSplitRDD which puts per-partition data in Partition objects.
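
The difference can be sketched outside Spark. The snippet below is a toy model under stated assumptions, not the CometIcebergSplitRDD implementation: `pickle` stands in for Spark's closure and partition serialization, and `split_data` with its byte sizes is made-up example data.

```python
import pickle
from typing import Dict, List


def task_payloads_shared(split_data: Dict[int, bytes]) -> List[bytes]:
    """Model a captured closure: every task is sent the whole map."""
    return [pickle.dumps((idx, split_data)) for idx in sorted(split_data)]


def task_payloads_per_partition(split_data: Dict[int, bytes]) -> List[bytes]:
    """Model data carried on the partition object: each task gets only its entry."""
    return [pickle.dumps((idx, split_data[idx])) for idx in sorted(split_data)]


# Hypothetical split data for two partitions.
split_data = {0: b"a" * 400, 1: b"b" * 500}

shared = task_payloads_shared(split_data)
own = task_payloads_per_partition(split_data)

for idx, (s, o) in enumerate(zip(shared, own)):
    print(f"partition {idx}: shared={len(s)} bytes, per-partition={len(o)} bytes")
```

In the shared variant each task's payload grows with the total number of partitions, while in the per-partition variant it depends only on that partition's own data.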
… columns), fixes TestRuntimeFiltering Iceberg Java tests with column renames.

CometIcebergSplitRDD registers subqueries so native code can look them up, fixes TestViews Iceberg Java tests with rewritten filter.
… assertion at index lookup, and defensive fallback if future Spark behavior changes.
# Conflicts:
#	.github/workflows/iceberg_spark_test.yml
# Conflicts:
#	spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala
@mbutrovich mbutrovich marked this pull request as ready for review February 8, 2026 15:51
@mbutrovich mbutrovich added this to the 0.14.0 milestone Feb 8, 2026
Member

@andygrove andygrove left a comment

LGTM. Thanks @mbutrovich

@mbutrovich mbutrovich merged commit 1ccfa14 into apache:main Feb 10, 2026
158 checks passed
@mbutrovich mbutrovich deleted the faster_filescantask branch February 10, 2026 21:14