Hello,

We are planning to use Apache Flink as a core component of our new streaming system for internal processes (finance, banking business), built on Apache Kafka. We have started some research with Apache Flink, and one question that arose during that work is how Flink handles data locality. Let me explain: suppose we have a Kafka topic with some kind of events, and these events are grouped by topic partition, so that a handler (or job worker) consuming messages from one partition has all the information necessary for further processing.

As an example, say we have clients' payment transactions in a Kafka topic, partitioned by clientId (transactions with the same clientId go to the same Kafka topic partition), and the task is to find the max transaction per client over a sliding window. In map/reduce terms there is no need to shuffle data between the topic consumers; at most it might be worth doing within each consumer, to gain some speedup by increasing the number of executors working on each partition's data.

My question is how Flink will behave in this case. Will it shuffle all the data, or does it have some setting to avoid these extra, unnecessary shuffle/sort operations?

Thanks in advance!
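To make the scenario concrete, the job we have in mind looks roughly like the sketch below (Flink DataStream Java API; the topic name, the Transaction type with its clientId/amount fields, the deserialization schema, and the window sizes are all placeholders, not our real code):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Source: the Kafka topic is already partitioned by clientId upstream.
DataStream<Transaction> transactions = env.addSource(
        new FlinkKafkaConsumer<>("payments", new TransactionSchema(), kafkaProps));

// keyBy(clientId) normally implies a network shuffle so that all records
// with the same key land on the same parallel task -- but in our case the
// data already arrives pre-partitioned by clientId from Kafka, and this is
// exactly the shuffle we would like to avoid.
DataStream<Transaction> maxPerClient = transactions
        .keyBy(tx -> tx.clientId)
        .window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(1)))
        .maxBy("amount");
```

So essentially: is there a way to tell Flink that the keyBy here is redundant, given the pre-partitioned input?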
With best regards, Sergey Smirnov