Re: Flink performance with multiple operators reshuffling data

2021-08-31 Thread JING ZHANG
Hi Jason, > In our case, the input/output ratios of these Flink operators are all 1 to 1, so I guess it doesn't matter that much. Yes. > But I think the keys we are using in general are pretty uniform. Cool. You could run the job for a period of time to see whether there is data skew. If there is indeed a data sk
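
One way to check for skew after the job has run for a while is to compare the busiest subtask with the average. The sketch below assumes you have copied per-subtask record counts (for example the numRecordsIn values shown per subtask in the Flink web UI) into a list by hand; the function name `skew_ratio` and the sample numbers are hypothetical.

```python
def skew_ratio(records_per_subtask):
    """Return max load divided by mean load; 1.0 means perfectly even."""
    total = sum(records_per_subtask)
    if total == 0:
        # No records seen yet, so there is nothing to call skewed.
        return 1.0
    mean = total / len(records_per_subtask)
    return max(records_per_subtask) / mean


# Hypothetical counts for a keyed operator running with parallelism 4.
even = [250_000, 251_000, 249_500, 249_500]
skewed = [700_000, 100_000, 100_000, 100_000]
print(skew_ratio(even))    # close to 1.0
print(skew_ratio(skewed))  # well above 1.0, one subtask is the bottleneck
```

A ratio near 1.0 suggests the keys are spread evenly; a ratio approaching the parallelism means a single subtask is doing most of the work, and adding parallelism alone is unlikely to help.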

Re: Flink performance with multiple operators reshuffling data

2021-08-31 Thread Jason Liu
Thanks for the help, guys! Yeah, we can potentially append random strings to the keys and duplicate data across them to avoid skew, if necessary. But I think the keys we are using in general are pretty uniform. The idea of putting the lowest-selectivity function up front is really interesting, though. In our cas
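
The key-salting idea mentioned here can be sketched outside Flink as a two-stage count: first aggregate per salted key (so one hot key is spread over several partitions), then strip the salt and merge. This is only an illustration under assumptions: the `#` separator, the helper names, and the number of salts are all hypothetical, and a real job would do the two stages as two keyed operators.

```python
import random
from collections import Counter


def salted_key(key, num_salts, rng):
    # Append a random salt so one logical key maps to up to num_salts keys.
    return f"{key}#{rng.randrange(num_salts)}"


def two_stage_count(events, num_salts, seed=0):
    rng = random.Random(seed)
    # Stage 1: partial counts per salted key (this is where the load spreads).
    partial = Counter(salted_key(k, num_salts, rng) for k in events)
    # Stage 2: strip the salt and merge the partial counts per original key.
    final = Counter()
    for salted, count in partial.items():
        original, _, _ = salted.rpartition("#")
        final[original] += count
    return partial, final


events = ["hot"] * 1000 + ["cold"] * 10
partial, final = two_stage_count(events, num_salts=4)
print(dict(partial))
print(dict(final))
```

The trade-off is an extra shuffle and extra state for the second stage, so it is probably only worth it if skew is actually observed.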

Re: Flink performance with multiple operators reshuffling data

2021-08-30 Thread JING ZHANG
Hi Jason, A job that reshuffles data multiple times can still scale under normal circumstances, but we should take care to avoid data skew: if the input stream is skewed, adding more resources will not help. Besides that, if we can adjust the order of the functions, we could put the keyed process f

Re: Flink performance with multiple operators reshuffling data

2021-08-30 Thread Caizhi Weng
Hi! Key-by operations scale with parallelism. Flink shuffles each record to a subtask according to the hash value of its key modulo the parallelism, so the more parallelism you have, the faster Flink can process data, unless there is data skew. Jason Liu wrote on Tue, Aug 31, 2021, AM
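
As a rough illustration of the routing described above, the sketch below assigns keys to subtasks and counts the load on each. It is a simplification and not Flink code: as far as I know, Flink first maps the key's hash to a key group (bounded by the max parallelism) and then maps key groups to subtasks, and it uses a murmur hash rather than the CRC32 stand-in used here. The function names and the default of 128 key groups are assumptions for illustration.

```python
import zlib
from collections import Counter


def key_group(key, max_parallelism):
    # CRC32 is a stand-in hash chosen because it is stable across runs;
    # Flink's actual hash function is different.
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def subtask_index(key, parallelism, max_parallelism=128):
    # Each subtask owns a contiguous range of key groups.
    return key_group(key, max_parallelism) * parallelism // max_parallelism


def load_per_subtask(keys, parallelism, max_parallelism=128):
    counts = Counter(subtask_index(k, parallelism, max_parallelism) for k in keys)
    return [counts.get(i, 0) for i in range(parallelism)]


# Many distinct keys spread over all subtasks.
uniform_keys = [f"user-{i}" for i in range(10_000)]
print(load_per_subtask(uniform_keys, parallelism=4))

# A single hot key always lands on one subtask, whatever the parallelism.
print(load_per_subtask(["hot"] * 1_000, parallelism=8))
```

The second call shows why skew defeats scaling: every record with the same key goes to the same subtask, so raising the parallelism does not spread that key's load.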

Flink performance with multiple operators reshuffling data

2021-08-30 Thread Jason Liu
Hi there, We have a use case where we need multiple keyBy operators, each with its own MapState and all with different keys, in a single Flink app. This obviously means we'll be reshuffling our data a lot. Our TPS is around 1-2k, with ~2kb per event and we use Kinesis Data Analytics as
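
For a sense of scale, here is a back-of-the-envelope estimate of how much data the shuffles move. It assumes "2kb" means roughly 2,000 bytes per event, ignores serialization overhead, and uses a hypothetical count of three keyBy steps, since the number of keyed operators is not stated.

```python
def shuffle_bytes_per_second(events_per_second, bytes_per_event, num_shuffles):
    # Every keyBy re-sends each event once across the network.
    return events_per_second * bytes_per_event * num_shuffles


# Upper end of the stated load: 2k events/s at ~2,000 bytes each.
# num_shuffles=3 is an assumed value for illustration only.
total = shuffle_bytes_per_second(2_000, 2_000, num_shuffles=3)
print(f"{total / 1_000_000:.1f} MB/s shuffled in total")
```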