Re: Data skew after keyBy (even with a good number of key groups)

Zakelly Lan Wed, 12 Mar 2025 22:48:23 -0700

Hi Vadim,

Could you please check whether there are records with identical keys, or
keys with identical hash codes? The keyBy redistribution relies on an even
distribution of hash codes. If many records share the same hash code, there
will probably be data skew.
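
To illustrate the point, here is a minimal sketch of how a key's hash code
ends up on a subtask. It assumes the usual two-step mapping (hash code to
key group, then key group to subtask index). The mix() function is a
stand-in scrambler, not Flink's actual murmur hash, and all function names
are made up for this example, so the exact counts will not match a real job.

```python
from collections import Counter


def mix(h: int) -> int:
    """Stand-in 32-bit bit mixer (an assumption, NOT Flink's murmur hash)."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def key_group(key_hash: int, max_parallelism: int) -> int:
    """Map a key's hash code to a key group in [0, max_parallelism)."""
    return mix(key_hash) % max_parallelism


def subtask_index(key_group_id: int, max_parallelism: int, parallelism: int) -> int:
    """Map a key group to a subtask; each subtask owns a contiguous range."""
    return key_group_id * parallelism // max_parallelism


def records_per_subtask(key_hashes, max_parallelism=1500, parallelism=50):
    """Count how many records each subtask would receive."""
    counts = Counter(
        subtask_index(key_group(h, max_parallelism), max_parallelism, parallelism)
        for h in key_hashes
    )
    return [counts.get(i, 0) for i in range(parallelism)]


if __name__ == "__main__":
    # Many distinct hash codes: records spread over all subtasks.
    spread = records_per_subtask(range(10_000))
    print("distinct hash codes: min", min(spread), "max", max(spread))

    # Every record shares one hash code: a single subtask receives everything.
    skewed = records_per_subtask([42] * 10_000)
    print("identical hash codes: min", min(skewed), "max", max(skewed))
```

With 1500 key groups and parallelism 50, every subtask owns exactly 30 key
groups, but that only balances the key groups themselves. How many records
land in each key group still depends on how the hash codes of the actual
keys are distributed.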



Best,
Zakelly

On Tue, Mar 11, 2025 at 8:11 PM Vararu, Vadim via user <
[email protected]> wrote:

> Hello,
>
>
>
> I’ve got two tasks:
>
>    - one reading from the source (parallelism 1)
>    - second, a keyed function (parallelism 50)
>
>
>
> With the max parallelism set to 1500 and a parallelism of 50, I expect
> the incoming data of the second task to be spread evenly when the keys are
> distributed to the key groups (1500 key groups / 50 subtasks = 30 key
> groups per subtask).
>
>
>
> However, looking at the UI stats I see a big data skew (it varies between
> 75 and 250 records received per TM).
>
>
>
> What could cause skew after keyBy, even if the maxParallelism /
> parallelism gives an even number?
>
>
>
> Thanks,
>
> Vadim.
>
>
>
