Speed up logical overwrites by pruning manifests in Snapshot._manifests #3039

@gabeiglio

Feature Request / Improvement

We’ve observed a large performance gap between the Python and Java implementations for logical (metadata-only) overwrites. Profiling shows that most of the time is spent in _manifests in snapshot.py, which does not prune manifests when computing _existing_manifests and _deleted_entries, so every manifest is processed even when it cannot contain files affected by the overwrite.
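To illustrate the idea, here is a minimal sketch of manifest pruning. It is not the PyIceberg code or the change in the PR: `ManifestSummary`, `may_contain`, and `split_manifests` are hypothetical names, and the sketch assumes a single integer partition field whose lower and upper bounds are known per manifest. Manifests whose bounds cannot include the overwritten partition are carried over without being opened, and only the rest need their entries read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical stand-in for a manifest's partition summary (not a PyIceberg class)."""

    path: str
    lower: int  # smallest partition value of any data file tracked by the manifest
    upper: int  # largest partition value of any data file tracked by the manifest


def may_contain(manifest: ManifestSummary, partition_value: int) -> bool:
    """Return True if the manifest could track files in the given partition."""
    return manifest.lower <= partition_value <= manifest.upper


def split_manifests(
    manifests: List[ManifestSummary], partition_value: int
) -> Tuple[List[ManifestSummary], List[ManifestSummary]]:
    """Split manifests into those that must be scanned and those kept untouched."""
    to_scan = [m for m in manifests if may_contain(m, partition_value)]
    carried_over = [m for m in manifests if not may_contain(m, partition_value)]
    return to_scan, carried_over


manifests = [
    ManifestSummary("m0.avro", 0, 9),
    ManifestSummary("m1.avro", 10, 19),
    ManifestSummary("m2.avro", 20, 29),
]

# Overwrite partition 12: only m1.avro can contain affected files.
to_scan, carried_over = split_manifests(manifests, 12)
print([m.path for m in to_scan])
print([m.path for m in carried_over])
```

In this toy model the cost of an overwrite scales with the number of manifests that overlap the overwritten partition rather than with the total number of manifests in the snapshot.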

After adding manifest pruning, we see the following benchmark results (100 overwrite iterations):

| Scenario | Avg (s) | Min (s) | Max (s) |
| --- | --- | --- | --- |
| Current PyIceberg – same partition | 1.15 | 0.78 | 1.51 |
| Current PyIceberg – random partitions | 0.96 | 0.77 | 1.26 |
| Pruning PyIceberg – same partition | 0.50 | 0.28 | 0.78 |
| Pruning PyIceberg – random partitions | 0.38 | 0.27 | 0.49 |
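For reference, the average speedup implied by these numbers can be computed directly. The snippet below uses only the average timings reported above; the dictionary layout and names are just for illustration.

```python
# Average seconds per overwrite, taken from the benchmark results above:
# (current PyIceberg, PyIceberg with manifest pruning)
results = {
    "same partition": (1.15, 0.50),
    "random partitions": (0.96, 0.38),
}

# Speedup = time without pruning / time with pruning.
speedups = {name: current / pruned for name, (current, pruned) in results.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x faster on average")
```

This works out to roughly 2.3x for the same-partition case and roughly 2.5x for random partitions.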

Benchmark script: https://gist.github.com/gabeiglio/0092970c144228ef6d333a873dc1d316

Here is the PR for the optimization
