Hi everyone,

This is something I've been mulling over for a while, and I thought this would be the right forum to discuss it as a follow-up to an earlier discussion thread on using Python bindings from iceberg-rust to support PyIceberg.
As soon as we released 0.7.0, which supports writes into tables with TimeTransform partitions <https://github.com/apache/iceberg-python/pull/784/files>, our prospective users started asking about support for BucketTransform partitions. Iceberg uses custom logic for bucket partitions (thanks for the link, Fokko <https://iceberg.apache.org/spec/#bucket-transform-details>). I took a look at the Java code <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>, and the transform is essentially:

    (murmur3_x86_32_hash(val) & Integer.MAX_VALUE) % num_buckets

with type-specific logic so that each field type is serialized and hashed appropriately.

Unfortunately, there is no existing PyArrow compute function that does this, so I'd like to propose that we write the function in iceberg-rust: it would take an Arrow Array reference and the number of buckets as input, and return a new Arrow Array reference with the evaluated bucket values in the same order as the input array.

When iceberg-rust becomes more mature, I believe the same underlying transform function can be reused for bucket partitions within this repository. In the interim, we could support writes into bucket-partitioned tables in PyIceberg by exposing this function as a Python binding that we import into PyIceberg.

I'd love to hear how folks feel about this idea!

Cross-posted discussion on iceberg-rust: #514 <https://github.com/apache/iceberg-rust/discussions/514>

Sung
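For anyone who wants to see the transform concretely, here is a minimal pure-Python sketch of the scalar version, based on my reading of the linked spec section: values are hashed with Murmur3 x86 32-bit (ints/longs serialized as 8-byte little-endian, strings as UTF-8), masked with Integer.MAX_VALUE, then taken mod the bucket count. The function names `bucket_int` and `bucket_str` are illustrative only, not the proposed API; the real proposal is the vectorized Arrow-array version in iceberg-rust.

```python
import struct

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Murmur3 x86 32-bit hash, returning a signed 32-bit int (like Java)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Process the input four bytes at a time.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Mix in the remaining 1-3 tail bytes, if any.
    tail = data[n - n % 4:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h

def bucket_int(value: int, num_buckets: int) -> int:
    # Per the spec, int and long values are hashed as 8-byte little-endian longs.
    h = murmur3_x86_32(struct.pack("<q", value))
    return (h & 0x7FFFFFFF) % num_buckets

def bucket_str(value: str, num_buckets: int) -> int:
    # Strings are hashed as their UTF-8 bytes.
    h = murmur3_x86_32(value.encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets
```

This reproduces the test vectors in the spec appendix (e.g. the hash of int 34 is 2017239379 and of the string "iceberg" is 1210000089), and the proposed iceberg-rust function would just apply the same per-type logic element-wise over an Arrow array.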