Feature Request / Improvement
We’ve observed a large performance gap between the Python and Java implementations for logical (metadata-only) overwrites. Profiling shows that most of the time is spent in snapshot.py (_manifests), which does not prune manifests when computing _existing_manifests and _deleted_entries, so every manifest is read even when it cannot contain files from the partition being overwritten.
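To make the idea concrete, here is a minimal sketch of partition-based manifest pruning. It is not PyIceberg's code: `ManifestSummary`, `may_contain`, and `plan_overwrite` are hypothetical names, and a single integer partition value stands in for the per-field lower/upper bounds that a manifest list stores for each manifest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical stand-in for a manifest plus its partition bounds."""
    path: str
    lower_bound: int  # smallest partition value among the manifest's entries
    upper_bound: int  # largest partition value among the manifest's entries
    entries: Tuple[Tuple[str, int], ...]  # (data file path, partition value)


def may_contain(manifest: ManifestSummary, partition_value: int) -> bool:
    """Return False only when the bounds prove the partition is absent."""
    return manifest.lower_bound <= partition_value <= manifest.upper_bound


def plan_overwrite(
    manifests: List[ManifestSummary], partition_value: int
) -> Tuple[List[str], List[str], int]:
    """Return (untouched manifest paths, deleted data files, manifests opened)."""
    untouched: List[str] = []
    deleted: List[str] = []
    opened = 0
    for manifest in manifests:
        if not may_contain(manifest, partition_value):
            # Pruned: carried over as-is without reading its entries.
            untouched.append(manifest.path)
            continue
        opened += 1  # only candidate manifests are actually scanned
        for data_file, value in manifest.entries:
            if value == partition_value:
                deleted.append(data_file)
    return untouched, deleted, opened


manifests = [
    ManifestSummary("m0.avro", 0, 3, (("a0.parquet", 0), ("a1.parquet", 3))),
    ManifestSummary("m1.avro", 4, 6, (("b0.parquet", 4), ("b1.parquet", 5))),
    ManifestSummary("m2.avro", 7, 9, (("c0.parquet", 7), ("c1.parquet", 9))),
]
untouched, deleted, opened = plan_overwrite(manifests, partition_value=5)
```

In this toy setup only one of the three manifests is opened. Without the `may_contain` check, all three would be scanned to reach the same result, which is the cost the profile points at.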
After adding manifest pruning, we see the following benchmark results (100 overwrite iterations):
| Scenario | Avg (s) | Min (s) | Max (s) |
| --- | --- | --- | --- |
| Current PyIceberg – same partition | 1.15 | 0.78 | 1.51 |
| Current PyIceberg – random partitions | 0.96 | 0.77 | 1.26 |
| Pruning PyIceberg – same partition | 0.50 | 0.28 | 0.78 |
| Pruning PyIceberg – random partitions | 0.38 | 0.27 | 0.49 |
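As a quick reading aid, the average-time speedup can be derived from the numbers above. The snippet below only restates the reported averages; the dictionary layout and variable names are illustrative and are not taken from the benchmark script.

```python
# Average seconds per overwrite, copied from the benchmark results above.
avg_seconds = {
    "same partition": {"current": 1.15, "pruning": 0.50},
    "random partitions": {"current": 0.96, "pruning": 0.38},
}

# Speedup = time without pruning / time with pruning.
speedups = {
    scenario: times["current"] / times["pruning"]
    for scenario, times in avg_seconds.items()
}

for scenario, factor in speedups.items():
    print(f"{scenario}: {factor:.2f}x faster on average")
```

On these figures, pruning is roughly 2.3x faster for the same-partition case and roughly 2.5x faster for random partitions.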
Benchmark script: https://gist.github.com/gabeiglio/0092970c144228ef6d333a873dc1d316
Here is the PR for the optimization