
How to add_files for BucketTransform partitioned data? #3032

@dhallam


Question

Versions:

pyiceberg == 0.11.0
Glue 5.1
Spark 3.5.6
Python 3.13
Amazon S3 (not S3Tables)

Question:

I have an Iceberg table that is partitioned as

partitions = [
    Partition("meta_received", DayTransform(), "received_day"),
    Partition("meta_id", BucketTransform(5), "id_bucket"),
    Partition("meta_digest", BucketTransform(5), "digest_bucket"),
]
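
In native PyIceberg terms, the equivalent spec might look like the sketch below; the source_id/field_id values are assumptions for illustration, since in a real table they come from the schema.

from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.transforms import BucketTransform, DayTransform

# Hypothetical IDs; real values come from the table schema.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="received_day"),
    PartitionField(source_id=2, field_id=1001, transform=BucketTransform(num_buckets=5), name="id_bucket"),
    PartitionField(source_id=3, field_id=1002, transform=BucketTransform(num_buckets=5), name="digest_bucket"),
)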

When a write to Iceberg fails to commit, multiple Parquet files are left behind in S3 under prefixes like

my_table/data/received_day=2026-01-22/id_bucket=2/digest_bucket=2/00000-19-9e0ba586-49c3-4f1e-b273-3db0c7fd0bda-0-00002.parquet

I want to add these files to the table to incorporate the data. I'd rather add the files in place than have to move them, load them, and insert them "manually" into the table.

Using

with table.transaction() as tx:
    tx.add_files(
        [
            "s3://my_bucket/my_table/data/received_day=2026-01-22/id_bucket=2/digest_bucket=2/00000-19-9e0ba586-49c3-4f1e-b273-3db0c7fd0bda-0-00002.parquet",
            ...
        ],
        check_duplicate_files=True,
    )

I get

Cannot infer partition value from parquet metadata for a non-linear Partition Field: id_bucket with transform bucket[5]

because BucketTransform's preserves_order is False: add_files infers partition values from each file's Parquet column statistics, which only works for order-preserving transforms.
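
This is easy to confirm from the transform classes themselves; a minimal sketch:

from pyiceberg.transforms import BucketTransform, DayTransform

# Order-preserving transforms let partition values be inferred from Parquet
# min/max statistics; bucket hashing does not preserve order.
print(DayTransform().preserves_order)                  # True
print(BucketTransform(num_buckets=5).preserves_order)  # False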

The partition values are, however, present in the file path prefix.
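
For example, a hypothetical helper (not a PyIceberg API) could recover them from the Hive-style path segments:

import re

def partition_values_from_path(path: str) -> dict[str, str]:
    # Hive-style "key=value" path segments carry the partition values.
    return dict(re.findall(r"/([^/=]+)=([^/]+)", path))

print(partition_values_from_path(
    "s3://my_bucket/my_table/data/received_day=2026-01-22/id_bucket=2/digest_bucket=2/"
    "00000-19-9e0ba586-49c3-4f1e-b273-3db0c7fd0bda-0-00002.parquet"
))
# {'received_day': '2026-01-22', 'id_bucket': '2', 'digest_bucket': '2'}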

Is there a way (or what is the best way) to commit these Parquet files into the table?

Many thanks.
