Re: Multiple (non-consecutive) keyBy operators in a dataflow

2018-04-03 Thread Stefan Richter
I don’t think there are any particular implications. I would suggest going with a simple keyBy and thinking about optimization only if an actual problem arises. Best, Stefan > On 03.04.2018 at 17:08, Timo Walther wrote: > > @Richter: Are you aware of any per-key state size performan

Re: Multiple (non-consecutive) keyBy operators in a dataflow

2018-04-03 Thread Timo Walther
@Richter: Are you aware of any per-key state size performance implications? On 03.04.18 at 16:56, au.fp2018 wrote: Thanks Timo/LiYue, your responses were helpful. I was worried about the network shuffle with the second keyBy. The first keyBy is indeed evenly spreading the load across the node

Re: Multiple (non-consecutive) keyBy operators in a dataflow

2018-04-03 Thread au.fp2018
Thanks Timo/LiYue, your responses were helpful. I was worried about the network shuffle with the second keyBy. The first keyBy is indeed evenly spreading the load across the nodes. As I mentioned, my concern was about the amount of state in each key. Maybe I am trying to optimize prematurely here

Re: Multiple (non-consecutive) keyBy operators in a dataflow

2018-04-03 Thread Timo Walther
Hi Andre, every keyBy is a shuffle over the network and thus introduces some overhead, especially the serialization of records between operators, since object reuse is disabled by default. If you think that the slots (and thus the nodes) are not evenly occupied in the first keyBy operation (e.g.
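
To make the shuffle cost of a second keyBy concrete, here is a rough toy model in plain Python rather than Flink code. It assumes a simplified key-to-subtask assignment; `subtask_for`, `count_moves`, and the `user`/`device` field names are hypothetical, and Flink's real assignment (key groups plus a hash of the key) differs in detail. The sketch only counts how many records would land on a different parallel subtask when a stream keyed by one field is re-keyed by another.

```python
from collections import Counter


def subtask_for(key, parallelism):
    """Hypothetical stand-in for key-to-subtask assignment.

    Not Flink's actual scheme; the sum of code points is used only because it
    is deterministic across runs (unlike the built-in hash() on strings).
    """
    return sum(ord(c) for c in str(key)) % parallelism


def count_moves(records, first_key, second_key, parallelism):
    """Count records whose subtask changes between the two keyings."""
    moved = 0
    for record in records:
        before = subtask_for(record[first_key], parallelism)
        after = subtask_for(record[second_key], parallelism)
        if before != after:
            moved += 1
    return moved


# Synthetic records with two candidate key fields (assumed for illustration).
records = [{"user": f"user{i % 7}", "device": f"dev{i % 5}"} for i in range(100)]

parallelism = 4
moved = count_moves(records, "user", "device", parallelism)
load_after_first_keyby = Counter(subtask_for(r["user"], parallelism) for r in records)

print(f"records changing subtask at the second keying: {moved} of {len(records)}")
print(f"records per subtask after the first keying: {dict(load_after_first_keyby)}")
```

In this model a record that keeps its subtask index counts as "not moved"; in a real job every record still passes through the keyed exchange and is serialized, so the count is at best a lower bound on the work the second keyBy adds.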

Re: Multiple (non-consecutive) keyBy operators in a dataflow

2018-04-02 Thread 李玥
Hello, in my opinion, it would be meaningful only in this situation: 1. The total size of all your state is large, e.g. 1 GB+. 2. Splitting your job into multiple keyBy steps would reduce the size of your state. Because the operation of saving state is synchronized and all working threa