ManifestEntry.snapshot_id setter writes to wrong index, corrupting status field #3256

@lawofcycles

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Description

The snapshot_id property setter on ManifestEntry writes to self._data[0] (the status field) instead of self._data[1] (the snapshot_id field). The getter correctly reads from self._data[1].

# pyiceberg/manifest.py

@property
def snapshot_id(self) -> int | None:
    return self._data[1]          # reads index 1 (correct)

@snapshot_id.setter
def snapshot_id(self, value: int) -> None:
    self._data[0] = value         # writes index 0 (bug: should be index 1)

When entry.snapshot_id = some_value is executed, the status field is overwritten with the snapshot ID value, and snapshot_id itself remains unchanged.
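
To make the effect concrete without depending on PyIceberg, here is a minimal sketch using a hypothetical stand-in class. ToyEntry, BuggyEntry and FixedEntry are illustrative names, not PyIceberg classes; the only assumption carried over from the real code is that fields live in a positional list with status at index 0 and snapshot_id at index 1.

```python
from __future__ import annotations


class ToyEntry:
    """Hypothetical stand-in for ManifestEntry: fields stored positionally."""

    def __init__(self, status: int, snapshot_id: int | None) -> None:
        # index 0 = status, index 1 = snapshot_id (mirrors the layout in the report)
        self._data: list = [status, snapshot_id]

    @property
    def status(self) -> int:
        return self._data[0]


class BuggyEntry(ToyEntry):
    @property
    def snapshot_id(self) -> int | None:
        return self._data[1]

    @snapshot_id.setter
    def snapshot_id(self, value: int) -> None:
        self._data[0] = value  # wrong slot: clobbers status


class FixedEntry(ToyEntry):
    @property
    def snapshot_id(self) -> int | None:
        return self._data[1]

    @snapshot_id.setter
    def snapshot_id(self, value: int) -> None:
        self._data[1] = value  # same slot the getter reads


buggy = BuggyEntry(status=1, snapshot_id=None)
buggy.snapshot_id = 42
print(buggy.status, buggy.snapshot_id)

fixed = FixedEntry(status=1, snapshot_id=None)
fixed.snapshot_id = 42
print(fixed.status, fixed.snapshot_id)
```

In this sketch the fix is the one-character index change already suggested above: the setter writes to the slot the getter reads.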

Impact

This setter is called in _inherit_from_manifest() when a manifest entry has a null snapshot_id:

# pyiceberg/manifest.py, _inherit_from_manifest()
if entry.snapshot_id is None and manifest.added_snapshot_id is not None:
    entry.snapshot_id = manifest.added_snapshot_id  # triggers the buggy setter

Per the Iceberg v2 spec, ADDED entries may have a null snapshot_id that is inherited from the manifest's added_snapshot_id at read time. If a manifest is written with null snapshot_id for ADDED entries (which the spec allows), reading it through PyIceberg would:

  1. Corrupt the entry's status field (overwritten with the snapshot ID integer)
  2. Leave snapshot_id as None (not inherited)

In practice, Iceberg Java (Spark) writes snapshot_id explicitly for ADDED entries, so this code path is not triggered when reading Spark-produced tables. However, the v2 spec permits a null snapshot_id for ADDED entries, and other implementations could write null values.
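
For reference, the intended read-time behavior can be sketched as a pure function. This is a simplified illustration, not PyIceberg's implementation: EntryView and inherit_snapshot_id are hypothetical names, and only the snapshot_id inheritance rule is modeled (sequence number inheritance is left out).

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EntryView:
    """Hypothetical, simplified view of a manifest entry."""

    status: int               # spec codes: 0 = EXISTING, 1 = ADDED, 2 = DELETED
    snapshot_id: int | None   # may be null on disk and inherited at read time


def inherit_snapshot_id(entry: EntryView, added_snapshot_id: int | None) -> EntryView:
    """Fill a null snapshot_id from the manifest; leave all other fields untouched."""
    if entry.snapshot_id is None and added_snapshot_id is not None:
        return replace(entry, snapshot_id=added_snapshot_id)
    return entry


inherited = inherit_snapshot_id(EntryView(status=1, snapshot_id=None), 3051729675574597004)
print(inherited)
```

The point of the sketch is the invariant: after inheritance, status is unchanged and snapshot_id equals the manifest's added_snapshot_id. The buggy setter violates both halves.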

Reproduction

from pyiceberg.manifest import (
    ManifestEntry, ManifestEntryStatus, ManifestFile,
    ManifestContent, DataFile, DataFileContent, _inherit_from_manifest,
)

entry = ManifestEntry.from_args(
    status=ManifestEntryStatus.ADDED,
    snapshot_id=None,
    sequence_number=None,
    file_sequence_number=None,
    data_file=DataFile.from_args(
        content=DataFileContent.DATA,
        file_path="s3://bucket/data/file.parquet",
        file_format="PARQUET",
        partition={},
        record_count=100,
        file_size_in_bytes=1024,
    ),
)

manifest = ManifestFile.from_args(
    manifest_path="s3://bucket/metadata/manifest.avro",
    manifest_length=1000,
    partition_spec_id=0,
    content=ManifestContent.DATA,
    sequence_number=1,
    min_sequence_number=1,
    added_snapshot_id=3051729675574597004,
    added_files_count=1,
    existing_files_count=0,
    deleted_files_count=0,
    added_rows_count=100,
    existing_rows_count=0,
    deleted_rows_count=0,
)

result = _inherit_from_manifest(entry, manifest)

print(result.status)       # 3051729675574597004  (expected: ManifestEntryStatus.ADDED)
print(result.snapshot_id)  # None                  (expected: 3051729675574597004)
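
A regression check along the following lines might catch this class of mistake for any positional record. It is only a sketch: check_setter_touches_one_slot and ToyRecord are hypothetical names, and the helper assumes the object keeps its fields in a list-like _data attribute, as the snippets above suggest for ManifestEntry.

```python
from typing import Any


def check_setter_touches_one_slot(obj: Any, name: str, value: Any) -> None:
    """Set obj.<name>, then verify the getter round-trips and no other slot moved."""
    before = list(obj._data)  # assumption: positional storage lives in _data
    setattr(obj, name, value)
    after = list(obj._data)
    changed = [i for i, (old, new) in enumerate(zip(before, after)) if old != new]
    assert getattr(obj, name) == value, f"{name} getter did not return the value just set"
    assert len(changed) <= 1, f"{name} setter changed slots {changed}"


class ToyRecord:
    """Hypothetical record with a correct setter, used only to exercise the helper."""

    def __init__(self) -> None:
        self._data = [1, None]  # index 0 = status, index 1 = snapshot_id

    @property
    def snapshot_id(self):
        return self._data[1]

    @snapshot_id.setter
    def snapshot_id(self, value):
        self._data[1] = value


check_setter_touches_one_slot(ToyRecord(), "snapshot_id", 3051729675574597004)
```

Run against a setter that writes to index 0, the first assertion fails because the getter still returns None.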

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
