Hi, I've filed https://github.com/apache/iceberg/issues/2837 for this as well.
Best
PF

On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

> Hi,
>
> It was discovered by @Mateusz Gajewski
> <mateusz.gajew...@starburstdata.com> that the Iceberg bucketing
> transformation for strings isn't a regular Murmur3 32-bit hash.
>
> Upon closer investigation we found out that the code:
>
> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
>
> is affected by Guava issue https://github.com/google/guava/issues/5648
> that causes wrong results for input containing surrogate pairs (Unicode
> code points outside the Basic Multilingual Plane).
>
> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
> the proposed fix), this can cause incorrect query results, since the
> bucketing function definition will effectively change.
>
> This is mostly FYI, unless we can do something more about it.
>
> Best
> PF
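
For anyone who wants to check the behaviour locally, here is a minimal sketch of a reference value to compare against: a plain Murmur3 32-bit (x86 variant, seed 0) computed over the UTF-8 bytes of a string. The class and method names (Murmur3Utf8Sketch, murmur3_32, hashString) are my own and are not taken from Iceberg or Guava. The sample input U+1F600 is just one example of a code point outside the Basic Multilingual Plane, and I have not confirmed that it triggers the Guava issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Iceberg or Guava.
public class Murmur3Utf8Sketch {

    // Plain Murmur3 x86 32-bit over a byte array.
    public static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int nblocks = data.length / 4;

        // Body: process 4-byte little-endian blocks.
        for (int i = 0; i < nblocks; i++) {
            int o = i * 4;
            int k1 = (data[o] & 0xff)
                    | ((data[o + 1] & 0xff) << 8)
                    | ((data[o + 2] & 0xff) << 16)
                    | ((data[o + 3] & 0xff) << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: remaining 1 to 3 bytes, if any.
        int tail = nblocks * 4;
        int k = 0;
        int rem = data.length & 3;
        if (rem == 3) {
            k ^= (data[tail + 2] & 0xff) << 16;
        }
        if (rem >= 2) {
            k ^= (data[tail + 1] & 0xff) << 8;
        }
        if (rem >= 1) {
            k ^= (data[tail] & 0xff);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h1 ^= k;
        }

        // Finalization mix.
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Hash the UTF-8 encoding of a string with seed 0.
    public static int hashString(String s) {
        return murmur3_32(s.getBytes(StandardCharsets.UTF_8), 0);
    }

    public static void main(String[] args) {
        // U+1F600 is one surrogate pair in UTF-16 and four bytes in UTF-8.
        String s = "\uD83D\uDE00";
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("Code points: " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("Reference hash: " + hashString(s));
    }
}
```

Comparing hashString(s) with whatever the Guava string-hashing call used by Bucket.java returns for the same input should show whether a given string is affected. Strings made only of BMP characters are not the case described in the Guava issue.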