Thanks Timo/LiYue, your responses were helpful.

I was worried about the network shuffle with the second keyBy. The first
keyBy is indeed spreading the load evenly across the nodes. As I mentioned,
my concern was about the amount of state per key. Maybe I am trying to
optimize prematurely here.
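
On Timo's point about serialization overhead: I will try enabling object
reuse as a first step. A minimal sketch of the setup I have in mind (just
the standard API calls, nothing specific to my job):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // Object reuse is disabled by default; enabling it avoids some
    // serialization between chained operators, as Timo mentioned.
    env.getConfig().enableObjectReuse();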

My follow-up question is: how much state per key is considered large enough
to cause performance overhead? If I am within that limit after the first
keyBy, I wouldn't need the second keyBy and could avoid the extra network
shuffle.
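
For concreteness, here is a rough sketch of the single-keyBy variant I am
weighing, where all per-granular-key data is kept in a MapState under the
coarse key. The Event/Result types and key fields are placeholders I made
up for illustration, not our actual job:

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    // Hypothetical: class Event { String coarseKey; String granularKey; }
    // Hypothetical: class Result { ... }

    DataStream<Result> results = source
        .keyBy(e -> e.coarseKey)        // the only shuffle in this variant
        .process(new ProcessFunction<Event, Result>() {
            // All granular keys belonging to one coarse key share this
            // map state; its size is the "state per key" I am asking about.
            private transient MapState<String, Long> perGranularKey;

            @Override
            public void open(Configuration parameters) {
                perGranularKey = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>(
                        "perGranularKey", String.class, Long.class));
            }

            @Override
            public void processElement(Event e, Context ctx,
                    Collector<Result> out) throws Exception {
                Long seen = perGranularKey.get(e.granularKey);
                perGranularKey.put(e.granularKey, seen == null ? 1L : seen + 1);
                // ... the heavier stateful computation would go here ...
            }
        });

If the map per coarse key stays small, I would drop the second keyBy and
avoid the shuffle, which should also keep the synchronous state-saving
cost LiYue mentioned low.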

Thanks,
Arun


Timo Walther wrote
> Hi Andre,
> 
> every keyBy is a shuffle over the network and thus introduces some 
> overhead, especially the serialization of records between operators 
> (object reuse is disabled by default). If not all slots (and thus 
> not all nodes) are occupied evenly by the first keyBy operation 
> (e.g. if your key space has just 2 values), then it makes sense to 
> add a second keyBy and do the heavy computation on the more granular 
> key, to get as much parallelism as possible. It really depends on 
> your job.
> 
> I hope this helps.
> 
> Regards,
> Timo
> 
> 
> On 03.04.18 at 03:22, 李玥 wrote:
>> Hello,
>> In my opinion, it would be meaningful only in this situation:
>> 1. The total size of all your state is huge, e.g. 1 GB+.
>> 2. Splitting your job into multiple keyBy steps would reduce the 
>> size of your state.
>>
>> This matters because the operation of saving state is synchronous, 
>> and all working threads are blocked until it finishes.
>> Our team is trying to make the process of saving state async; please 
>> refer to: 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Slow-flink-checkpoint-td18946.html
>>
>> LiYue
>> http://tig.jd.com
>> liyue2008@
>>
>> On 03.04.18 at 08:30, au.fp2018 <au.fp2018@> wrote:
>>>
>>> Hello Flink Community,
>>>
>>> I am relatively new to Flink. In the project I am currently working 
>>> on, I have a dataflow with a keyBy() operator, which I want to 
>>> convert into a dataflow with multiple keyBy() operators, like this:
>>>
>>>
>>>  Source -->
>>>  KeyBy() -->
>>>  Stateful process() function that generates a more granular key -->
>>>  KeyBy(<id generated in the previous step>) -->
>>>  More stateful computation(s) -->
>>>  Sink
>>>
>>> Are there any downsides to this approach?
>>> My reasoning behind the second keyBy() is to reduce the amount of 
>>> state and
>>> hence improve the processing speed.
>>>
>>> Thanks,
>>> Andre
>>>
>>>
>>>
>>>