Re: Custom partitioning of keys with keyBy

2021-11-04 Thread naitong Xiao
The keyBy argument function is deterministic for a given max parallelism, which makes sure the key group is always the same for the same key. My goal here is to distribute keys evenly among operators even with different parallelism. I implement the mapping in a different way, by exhausting
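To see the determinism concretely, here is a minimal sketch (not from the thread; the key value and the parallelism settings are made up) using Flink's KeyGroupRangeAssignment utilities: the key group depends only on the key and the max parallelism, while the subtask that owns it additionally depends on the actual parallelism:

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment._

    object KeyGroupDemo extends App {
      val maxParallelism = 128
      val key = "device-42" // made-up example key

      // Same key + same max parallelism => always the same key group.
      val keyGroup = assignToKeyGroup(key, maxParallelism)

      // Which subtask owns that key group changes with the parallelism.
      for (parallelism <- Seq(3, 4, 8)) {
        val subtask = computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup)
        println(s"parallelism=$parallelism keyGroup=$keyGroup subtask=$subtask")
      }
    }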

RE: Custom partitioning of keys with keyBy

2021-11-04 Thread Schwalbe Matthias

Re: Custom partitioning of keys with keyBy

2021-11-04 Thread Yuval Itzchakov
Thank you Schwalbe, David and Naitong for your answers! *David*: This is what we're currently doing ATM, and I wanted to know if there's any simplified approach to this. This is what we have so far: https://gist.github.com/YuvalItzchakov/9441a4a0e80609e534e69804e94cb57b *Naitong*: The keyBy intern

Re: Custom partitioning of keys with keyBy

2021-11-03 Thread naitong Xiao
I think I had a similar scenario several months ago, here is my related code:

    val MAX_PARALLELISM = 16
    val KEY_RAND_SALT = "73b46"

    logSource.keyBy { value =>
      val keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(value.deviceUdid, MAX_PARALLELISM)
      s"$KEY_RAND_SALT$keyGroup"
    }

The keyGroup is
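For readers who want to run the snippet above, here is a self-contained sketch of the same idea; the imports, the LogEvent type and the bounded source are assumptions filled in for illustration, since the message only shows the keyBy call itself:

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment
    import org.apache.flink.streaming.api.scala._

    // Hypothetical event type; the thread only shows a deviceUdid field.
    case class LogEvent(deviceUdid: String)

    object SaltedKeyByJob extends App {
      val MAX_PARALLELISM = 16
      val KEY_RAND_SALT = "73b46"

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.setMaxParallelism(MAX_PARALLELISM)

      val logSource: DataStream[LogEvent] =
        env.fromElements(LogEvent("udid-1"), LogEvent("udid-2"), LogEvent("udid-3"))

      // Re-key on the (salted) key-group id: every device id that hashes into
      // the same key group shares one derived key, so there are at most
      // MAX_PARALLELISM distinct keys after the mapping.
      logSource
        .keyBy { value =>
          val keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(value.deviceUdid, MAX_PARALLELISM)
          s"$KEY_RAND_SALT$keyGroup"
        }
        .print()

      env.execute("salted keyBy sketch")
    }

The derived keys are themselves hashed into key groups again, which is presumably why the salt matters: a salt can be chosen (for example by exhausting candidates, as the later message in this thread hints) so that the derived keys spread evenly over the subtasks.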

Re: Custom partitioning of keys with keyBy

2021-11-03 Thread David Anderson
Another possibility, if you know in advance the values of the keys, is to find a mapping that transforms the original keys into new keys that will, in fact, end up in disjoint key groups that will, in turn, be assigned to different slots (given a specific parallelism). This is ugly, but feasible.
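A hedged sketch of that search, assuming the keys are known up front and using Flink's KeyGroupRangeAssignment utilities (the object and helper names here are invented for illustration): for each original key, trial suffixes are appended until the remapped key lands on its intended subtask:

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment._

    object KeyRemapSearch extends App {
      val maxParallelism = 128
      val parallelism = 4
      val originalKeys = Seq("kindle", "fire-tv", "echo", "tablet") // made-up keys

      // Subtask a key is routed to at the given parallelism.
      def subtaskOf(key: String): Int =
        computeOperatorIndexForKeyGroup(maxParallelism, parallelism,
          assignToKeyGroup(key, maxParallelism))

      // Append trial suffixes until each key reaches its round-robin target.
      // Murmur hash spreads well, so a suitable suffix is found quickly.
      val remapped = originalKeys.zipWithIndex.map { case (key, idx) =>
        val target = idx % parallelism
        key -> Iterator.from(0).map(i => s"$key#$i").find(k => subtaskOf(k) == target).get
      }

      remapped.foreach { case (orig, salted) =>
        println(s"$orig -> $salted (subtask ${subtaskOf(salted)})")
      }
    }

The job would then apply the resulting original-key to salted-key map inside its keyBy selector; as David notes, this only works when the key set and the parallelism are known in advance.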

RE: Custom partitioning of keys with keyBy

2021-11-03 Thread Schwalbe Matthias
Hi Yuval, just a couple of comments:
* Assuming that all your 4 different keys are evenly distributed and you send them to (only) 3 buckets, you would expect at least one bucket to cover 2 of your keys, hence the 50% (see the sketch below)
* With low-entropy keys, avoiding data skew is quite difficult
*
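To make the pigeonhole argument in the first bullet concrete, here is a small sketch (key values are made up) that prints which subtask Flink would assign each of 4 keys to at parallelism 3; at least one subtask must own 2 of the 4 keys, i.e. roughly 50% of the load if all keys are equally hot:

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment._

    object SkewDemo extends App {
      val maxParallelism = 128
      val parallelism = 3
      val keys = Seq("key-a", "key-b", "key-c", "key-d") // made-up keys

      // Group the keys by the subtask that would own them.
      val bySubtask = keys.groupBy { key =>
        computeOperatorIndexForKeyGroup(maxParallelism, parallelism,
          assignToKeyGroup(key, maxParallelism))
      }

      bySubtask.foreach { case (subtask, ks) =>
        println(s"subtask $subtask owns ${ks.size}/4 keys: ${ks.mkString(", ")}")
      }
    }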