Hi everyone,

This is something I've been mulling over for a while, and I thought this
would be the right forum to discuss it, as a follow-up to an earlier
thread on using Python bindings from iceberg-rust to support PyIceberg.

As soon as we released 0.7.0, which supports writes into tables with
TimeTransform partitions
<https://github.com/apache/iceberg-python/pull/784/files>, our prospective
users started asking about support for Bucket Transform partitions.

Iceberg has custom logic for Bucket partitions (thanks for the link
<https://iceberg.apache.org/spec/#bucket-transform-details> Fokko). I took
a look at the Java code
<https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
and I think it looks somewhat like:

* (murmur3_hash(val) & Integer.MAX_VALUE) mod num_buckets

It also has field-type-specific logic so that each type is serialized and
hashed appropriately (for example, per the spec, int and long values are
both hashed as a little-endian 64-bit long).
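To make the hashing concrete, here is a minimal pure-Python sketch of the bucket transform for int/long values, following the spec: MurmurHash3 (x86, 32-bit) of the value serialized as a little-endian 64-bit long, masked to a non-negative int, then mod the number of buckets. The `murmur3_x86_32` helper below is a self-contained reimplementation for illustration only, not the actual iceberg-rust or PyIceberg code:

```python
import struct


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n_blocks = len(data) // 4
    # Body: process each 4-byte little-endian block.
    for i in range(n_blocks):
        (k,) = struct.unpack_from("<I", data, i * 4)
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # ROTL32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # ROTL32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: 0-3 remaining bytes.
    tail = data[n_blocks * 4:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket_int(value: int, num_buckets: int) -> int:
    """Bucket transform for int/long: hash the value serialized as a
    little-endian 64-bit long, mask to a non-negative int, mod N."""
    h = murmur3_x86_32(struct.pack("<q", value))
    return (h & 0x7FFFFFFF) % num_buckets


# Spec test vector: the 32-bit hash of int 34 is 2017239379,
# so bucket(34) with 16 buckets is 2017239379 % 16 = 3.
print(bucket_int(34, 16))  # → 3
```

In the proposal below, this per-value logic would instead run vectorized in Rust over a whole Arrow array, with Python only calling across the boundary once per array.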

Unfortunately, there is no existing PyArrow compute function that does this,
so I'd like to propose that we write a function in iceberg-rust that
takes an Arrow Array reference and the number of buckets as input, and
returns a new Arrow Array reference with the evaluated bucket values in the
same order as the input array.

When iceberg-rust becomes more mature, I believe the same underlying
transform function can be reused for bucket partitions within this
repository; in the interim, we could support writes into Bucket-partitioned
tables in PyIceberg by exposing this function as a Python binding that we
import into PyIceberg.

I'd love to hear how folks feel about this idea!


Cross posted Discussion on iceberg-rust: #514
<https://github.com/apache/iceberg-rust/discussions/514>


Sung
