Apache Iceberg version
0.11.0 (latest release)
Please describe the bug 🐞
Description
The snapshot_id property setter on ManifestEntry writes to self._data[0] (the status field) instead of self._data[1] (the snapshot_id field). The getter correctly reads from self._data[1].
# pyiceberg/manifest.py
@property
def snapshot_id(self) -> int | None:
    return self._data[1]  # reads index 1 (correct)

@snapshot_id.setter
def snapshot_id(self, value: int) -> None:
    self._data[0] = value  # writes index 0 (bug: should be index 1)
When entry.snapshot_id = some_value is called, the status field is overwritten with the snapshot ID value, and snapshot_id remains unchanged.
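The mismatch can be illustrated without PyIceberg. The following is a minimal sketch, assuming a record that stores status at index 0 and snapshot_id at index 1; ToyEntry and FixedToyEntry are hypothetical stand-ins, not PyIceberg classes.

```python
class ToyEntry:
    """Hypothetical positional record; not PyIceberg's ManifestEntry."""

    def __init__(self, status, snapshot_id):
        # Index 0 holds status, index 1 holds snapshot_id.
        self._data = [status, snapshot_id]

    @property
    def status(self):
        return self._data[0]

    @property
    def snapshot_id(self):
        return self._data[1]

    @snapshot_id.setter
    def snapshot_id(self, value):
        # Same index mix-up as in the report: writes the status slot.
        self._data[0] = value


class FixedToyEntry(ToyEntry):
    """Same record, with the setter writing the slot the getter reads."""

    @property
    def snapshot_id(self):
        return self._data[1]

    @snapshot_id.setter
    def snapshot_id(self, value):
        self._data[1] = value


buggy = ToyEntry(status=1, snapshot_id=None)
buggy.snapshot_id = 42  # lands in the status slot

fixed = FixedToyEntry(status=1, snapshot_id=None)
fixed.snapshot_id = 42  # lands in the snapshot_id slot
```

With the buggy setter, status becomes 42 and snapshot_id stays None; with the corrected setter, status is untouched and snapshot_id is 42.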
Impact
This setter is called in _inherit_from_manifest() when a manifest entry has a null snapshot_id:
# pyiceberg/manifest.py, _inherit_from_manifest()
if entry.snapshot_id is None and manifest.added_snapshot_id is not None:
    entry.snapshot_id = manifest.added_snapshot_id  # triggers the buggy setter
Per the Iceberg v2 spec, ADDED entries may have a null snapshot_id that is inherited from the manifest's added_snapshot_id at read time. If a manifest is written with null snapshot_id for ADDED entries (which the spec allows), reading it through PyIceberg would:
- Corrupt the entry's status field (overwritten with the snapshot ID integer)
- Leave snapshot_id as None (not inherited)
In practice, Iceberg Java (Spark) writes snapshot_id explicitly for ADDED entries, so this code path is not triggered when reading Spark-produced tables. However, the spec permits a null snapshot_id for ADDED entries in v2, and other implementations could write null values.
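For reference, the intended read-time inheritance can be sketched in isolation. This is an illustration under stated assumptions, not PyIceberg's implementation: SketchEntry, SketchManifest, and inherit_snapshot_id are hypothetical names, and ADDED = 1 follows the status codes in the Iceberg spec.

```python
from dataclasses import dataclass, replace
from typing import Optional

ADDED = 1  # manifest entry status code for ADDED in the Iceberg spec


@dataclass(frozen=True)
class SketchManifest:
    """Hypothetical manifest carrying only the field needed here."""
    added_snapshot_id: Optional[int]


@dataclass(frozen=True)
class SketchEntry:
    """Hypothetical manifest entry with only status and snapshot_id."""
    status: int
    snapshot_id: Optional[int]


def inherit_snapshot_id(entry: SketchEntry, manifest: SketchManifest) -> SketchEntry:
    # A null snapshot_id is filled from the manifest; status is left untouched.
    if entry.snapshot_id is None and manifest.added_snapshot_id is not None:
        return replace(entry, snapshot_id=manifest.added_snapshot_id)
    return entry


result = inherit_snapshot_id(
    SketchEntry(status=ADDED, snapshot_id=None),
    SketchManifest(added_snapshot_id=3051729675574597004),
)
```

After inheritance, the entry should keep its ADDED status and carry the manifest's added_snapshot_id. The reproduction below shows PyIceberg doing the opposite.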
Reproduction
from pyiceberg.manifest import (
    ManifestEntry, ManifestEntryStatus, ManifestFile,
    ManifestContent, DataFile, DataFileContent, _inherit_from_manifest,
)

# ADDED entry with a null snapshot_id, as the v2 spec permits
entry = ManifestEntry.from_args(
    status=ManifestEntryStatus.ADDED,
    snapshot_id=None,
    sequence_number=None,
    file_sequence_number=None,
    data_file=DataFile.from_args(
        content=DataFileContent.DATA,
        file_path="s3://bucket/data/file.parquet",
        file_format="PARQUET",
        partition={},
        record_count=100,
        file_size_in_bytes=1024,
    ),
)

# Manifest whose added_snapshot_id should be inherited by the entry
manifest = ManifestFile.from_args(
    manifest_path="s3://bucket/metadata/manifest.avro",
    manifest_length=1000,
    partition_spec_id=0,
    content=ManifestContent.DATA,
    sequence_number=1,
    min_sequence_number=1,
    added_snapshot_id=3051729675574597004,
    added_files_count=1,
    existing_files_count=0,
    deleted_files_count=0,
    added_rows_count=100,
    existing_rows_count=0,
    deleted_rows_count=0,
)

result = _inherit_from_manifest(entry, manifest)
print(result.status)       # 3051729675574597004 (expected: ManifestEntryStatus.ADDED)
print(result.snapshot_id)  # None (expected: 3051729675574597004)
Willingness to contribute