Hi all,

I am currently facing a problem with a pipeline (DataStream API) where I need to split a GenericRecord into its fields and then aggregate all values of a particular field into 30-minute windows. If I key only by the field name, all values of a field are sent to the same parallel instance of the cluster, which is bad because the stream is quite large (15k events/second). What I want to achieve is a more even distribution of events across the taskmanagers.
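For concreteness, the aggregation step of the straightforward (skewed) version looks roughly like this; it is only a sketch, and splitIntoFields(), SumAggregate and the tuple layout are placeholders for my actual code:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// (fieldName, value) pairs obtained by splitting each GenericRecord;
// splitIntoFields() stands in for my actual splitting logic
DataStream<Tuple2<String, Double>> fields = splitIntoFields(input);

fields
    .keyBy(t -> t.f0)  // key = field name only, so one subtask receives ALL values of a field
    .window(TumblingEventTimeWindows.of(Time.minutes(30)))
    .aggregate(new SumAggregate());  // placeholder AggregateFunction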
Currently, I assign a "rolling" number (between 0 and the maximum parallelism) to the field name as a secondary key component and keyBy this combination. This spreads the events of one field across several partitions, which I then have to recombine in a second step keyed by the field name alone (a rough sketch is appended at the end of this mail).

I tested this approach and it works, but when looking at the Flink WebUI I see that some taskmanagers handle substantially more load than others (a difference of a few million records). I also had a look at partitionCustom(), but that doesn't work on KeyedStreams, right?

Has someone else faced a related issue? Any suggestions on how I can distribute events with the same key more evenly?

Kind Regards
Dominik
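P.S. For reference, here is roughly what my two-step approach looks like. Again only a sketch: AddSalt, PartialSumAggregate, env and the tuple layout are illustrative, not my exact code.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Attaches a rolling salt in [0, maxSalt) to each (fieldName, value) pair.
static class AddSalt extends RichMapFunction<Tuple2<String, Double>, Tuple3<String, Integer, Double>> {
    private final int maxSalt;
    private transient int counter;

    AddSalt(int maxSalt) { this.maxSalt = maxSalt; }

    @Override
    public Tuple3<String, Integer, Double> map(Tuple2<String, Double> in) {
        counter = (counter + 1) % maxSalt;      // rolling number 0..maxSalt-1
        return Tuple3.of(in.f0, counter, in.f1);
    }
}

// fields = DataStream<Tuple2<fieldName, value>> as before; env is the
// StreamExecutionEnvironment.

// Step 1: pre-aggregate per (fieldName, salt), spreading one field's values
// over maxSalt keys instead of a single hot key.
DataStream<Tuple2<String, Double>> partials = fields
    .map(new AddSalt(env.getParallelism()))
    .keyBy(t -> t.f0 + "#" + t.f1)              // composite key: fieldName + salt
    .window(TumblingEventTimeWindows.of(Time.minutes(30)))
    .aggregate(new PartialSumAggregate());      // emits (fieldName, partialSum)

// Step 2: recombine the partial results per field name. This stream is small
// (at most maxSalt partials per field per window), so keying by field name
// alone is no longer a bottleneck.
DataStream<Tuple2<String, Double>> totals = partials
    .keyBy(t -> t.f0)
    .window(TumblingEventTimeWindows.of(Time.minutes(30)))
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));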