Re: Flink performance with multiple operators reshuffling data

2021-08-31 Thread JING ZHANG
Hi Jason, > In our case, our input/output ratio of these Flink operators are all 1 to 1, so I guess it doesn't matter that much. Yes > But I think the keys we are using in general are pretty uniform. Cool. You could run the job for a period of time to see if there is data skew. If there is indeed a data sk
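One way to make "see if there is data skew" concrete is to compare the busiest subtask against the average. The sketch below assumes you have already collected per-subtask record counts (for example, by reading them off the Flink web UI); the function name `skew_ratio` and the idea of using max-over-mean are illustrative assumptions, not something stated in the thread.

```python
def skew_ratio(counts):
    """Return max/mean of per-subtask record counts (hypothetical helper).

    A value close to 1.0 suggests an even spread; a value much larger
    than 1.0 suggests that one subtask receives most of the records.
    """
    if not counts or sum(counts) == 0:
        return 1.0  # no data observed, so nothing to call skewed
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Example counts are made up for illustration only.
even = [100, 100, 100, 100]
skewed = [700, 100, 100, 100]
print(skew_ratio(even), skew_ratio(skewed))
```

What counts as "too skewed" depends on the job; any threshold you pick is a judgment call.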

Re: Flink performance with multiple operators reshuffling data

2021-08-31 Thread Jason Liu
Thanks for the help guys! Yea we can potentially append random strings to the keys and duplicate data across them to avoid skew, if necessary. But I think the keys we are using in general are pretty uniform. The idea of putting the lowest-selectivity operator up front is really interesting though. In our cas
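The "append random strings to the keys" idea is often called key salting. A minimal sketch in plain Python, assuming a simple count aggregation and a two-stage merge; the names `salt_key`, `unsalt_key`, `two_stage_count`, and the `#` separator are hypothetical and chosen only for illustration:

```python
import random
from collections import Counter


def salt_key(key, num_salts, rng):
    # Append a bounded random suffix so one hot key becomes up to num_salts keys.
    return f"{key}#{rng.randrange(num_salts)}"


def unsalt_key(salted):
    # Strip the suffix again; split from the right so keys containing '#' survive.
    return salted.rsplit("#", 1)[0]


def two_stage_count(keys, num_salts, seed=0):
    rng = random.Random(seed)
    # Stage 1: count per salted key (this is the work that gets spread out).
    partial = Counter(salt_key(k, num_salts, rng) for k in keys)
    # Stage 2: merge the partial counts back under the original key.
    final = Counter()
    for salted, n in partial.items():
        final[unsalt_key(salted)] += n
    return partial, final
```

The second stage sees at most `num_salts` partial results per original key, so it stays cheap even when the first stage handled a hot key. This only works directly for aggregations that can be merged from partial results, such as counts and sums.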

Re: Flink performance with multiple operators reshuffling data

2021-08-30 Thread JING ZHANG
Hi Jason, A job that reshuffles data multiple times can still scale under normal circumstances. But we should carefully avoid data skew, because if the input stream is skewed, adding more resources will not help. Besides that, if we could adjust the order of the functions, we could put the keyed process f
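The ordering point can be illustrated with simple arithmetic. The sketch below assumes every operator is preceded by a shuffle and that "selectivity" means output records per input record; both are assumptions made for illustration, and the helper names are hypothetical.

```python
from itertools import permutations


def shuffled_volume(input_records, selectivities):
    """Total records crossing the network for a chain of shuffled operators."""
    total = 0.0
    current = float(input_records)
    for s in selectivities:
        total += current  # every record entering this operator is shuffled
        current *= s      # the operator emits s records per input record
    return total


def best_order(input_records, selectivities):
    # Brute force over orderings; fine for the handful of operators in a job.
    return min(
        permutations(selectivities),
        key=lambda order: shuffled_volume(input_records, order),
    )


print(shuffled_volume(1000, [0.1, 1.0]))  # selective operator first
print(shuffled_volume(1000, [1.0, 0.1]))  # selective operator last
```

When every operator is 1 to 1, as in Jason's job, all orderings shuffle the same volume, which matches his remark that the ordering should not matter much there.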

Re: Flink performance with multiple operators reshuffling data

2021-08-30 Thread Caizhi Weng
Hi! Key-by operations scale with parallelism. Flink shuffles each record to a sub-task according to the hash value of its key modulo the parallelism, so the more parallelism you have, the faster Flink can process data, unless there is data skew. Jason Liu wrote on Tue, Aug 31, 2021, AM
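The hash-modulo routing described above can be sketched in a few lines. This is a simplified model, not Flink's implementation: Flink's real assignment goes through key groups and its own hash function, and the CRC32 hash and the names `subtask_for` and `load_per_subtask` are stand-ins chosen for illustration.

```python
import zlib
from collections import Counter


def subtask_for(key, parallelism):
    # Deterministic hash of the key, reduced modulo the number of sub-tasks.
    return zlib.crc32(key.encode("utf-8")) % parallelism


def load_per_subtask(keys, parallelism):
    # Count how many records each sub-task would receive.
    counts = Counter(subtask_for(k, parallelism) for k in keys)
    return [counts.get(i, 0) for i in range(parallelism)]


# Many distinct keys spread across sub-tasks.
uniform = [f"user-{i}" for i in range(1000)]
print(load_per_subtask(uniform, 4))

# One hot key always lands on a single sub-task, however many there are.
skewed = ["hot"] * 900 + uniform[:100]
print(load_per_subtask(skewed, 4))
print(load_per_subtask(skewed, 16))
```

Because all records with the same key map to the same sub-task, raising the parallelism from 4 to 16 does not reduce the load on the sub-task that owns the hot key. That is why adding resources does not help a skewed stream.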