Hi,

I've filed https://github.com/apache/iceberg/issues/2837 for this as well.
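
For anyone who wants to check their own values independently of Guava, here is a minimal sketch of a reference hash. It assumes the intended definition is Murmur3 32-bit (x86 variant, seed 0) computed over the UTF-8 bytes of the string. The class and method names (Murmur3Utf8Sketch, hashBytes, hashUtf8) are made up for illustration and are not part of Iceberg or Guava.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reference sketch, not Iceberg or Guava code.
class Murmur3Utf8Sketch {

    // Murmur3 x86 32-bit over a byte array.
    static int hashBytes(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int len = data.length;
        int blocks = len / 4;

        // Body: consume full 4-byte blocks, little-endian.
        for (int i = 0; i < blocks; i++) {
            int o = i * 4;
            int k1 = (data[o] & 0xff)
                    | ((data[o + 1] & 0xff) << 8)
                    | ((data[o + 2] & 0xff) << 16)
                    | ((data[o + 3] & 0xff) << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: mix in the remaining 1 to 3 bytes, if any.
        int tail = blocks * 4;
        int rem = len & 3;
        if (rem > 0) {
            int k1 = 0;
            if (rem >= 3) k1 ^= (data[tail + 2] & 0xff) << 16;
            if (rem >= 2) k1 ^= (data[tail + 1] & 0xff) << 8;
            k1 ^= (data[tail] & 0xff);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
        }

        // Finalization: mix in the length and avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Assumption: strings are hashed as their UTF-8 bytes with seed 0.
    static int hashUtf8(String s) {
        return hashBytes(s.getBytes(StandardCharsets.UTF_8), 0);
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which Java stores as a surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println("UTF-16 chars: " + s.length());
        System.out.println("code points:  " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes:  " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("hash:         " + hashUtf8(s));
    }
}
```

Comparing hashUtf8(s) with what the Guava call returns for the same string should show whether a given value is affected. I'd expect them to differ only for strings containing surrogate pairs.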

Best
PF



On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> It was discovered by @Mateusz Gajewski
> <mateusz.gajew...@starburstdata.com> that Iceberg's bucketing
> transformation for strings isn't a regular Murmur3 32-bit hash.
>
> Upon closer investigation we found out that the code:
>
>
> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
>
> is affected by Guava issue https://github.com/google/guava/issues/5648,
> which causes wrong results for input containing surrogate pairs (Unicode
> code points outside the Basic Multilingual Plane).
>
> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
> the proposed fix), this can cause incorrect query results, since the
> bucketing function's definition will effectively change.
>
> This is mostly FYI, unless we can do something more about it.
>
> Best
> PF
>
>
