Hello,

We are planning to use Apache Flink as a core component of our new streaming system for internal processes (finance, banking business), built on Apache Kafka. We have started some research with Apache Flink, and one question that arose during that work is how Flink handles data locality. Let me explain: suppose we have a Kafka topic with some kind of events, and these events are grouped by topic partition, so that a handler (or job worker) consuming messages from one partition has all the information necessary for further processing.

As an example, say we have clients' payment transactions in a Kafka topic, partitioned by clientId (transactions with the same clientId go to the same Kafka topic partition), and the task is to find the max transaction per client over a sliding window. In map/reduce terms there is no need to shuffle data between the topic consumers; at most it might be worth doing within each consumer, to gain some speedup by increasing the number of executors working on each partition's data.

My question is how Flink will behave in this case. Will it shuffle all the data, or does it have some setting to avoid these extra, unnecessary shuffle/sort operations?

Thanks in advance!
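To make the scenario concrete, the job we have in mind looks roughly like the sketch below (Flink DataStream Java API; the topic name, the Transaction type with its clientId/amount fields, the deserialization schema, and the window sizes are all placeholders, not our real code):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Source: the Kafka topic is already partitioned by clientId upstream.
DataStream<Transaction> transactions = env.addSource(
        new FlinkKafkaConsumer<>("payments", new TransactionSchema(), kafkaProps));

// keyBy(clientId) normally implies a network shuffle so that all records
// with the same key land on the same parallel task -- but in our case the
// data already arrives pre-partitioned by clientId from Kafka, and this is
// exactly the shuffle we would like to avoid.
DataStream<Transaction> maxPerClient = transactions
        .keyBy(tx -> tx.clientId)
        .window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(1)))
        .maxBy("amount");
```

So essentially: is there a way to tell Flink that the keyBy here is redundant, given the pre-partitioned input?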
With best regards, Sergey Smirnov