Thanks Timo/LiYue, your responses were helpful. I was worried about the network shuffle introduced by the second keyBy. The first keyBy is indeed spreading the load evenly across the nodes; as I mentioned, my concern was the amount of state per key. Maybe I am optimizing prematurely here.
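For concreteness, here is a minimal sketch of the two-stage keyBy topology from the original post, written against the DataStream API. The Event/KeyedEvent types, the field names, and the modulo key derivation are placeholders for illustration, not our actual job:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoStageKeyBySketch {

    // Placeholder records; the real job uses different types.
    public static class Event {
        public String coarseKey;
        public long value;
        public Event() {}
        public Event(String k, long v) { coarseKey = k; value = v; }
    }

    public static class KeyedEvent {
        public String granularKey;
        public long value;
        public KeyedEvent() {}
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Timo's point: with object reuse disabled (the default), records are
        // copied between chained operators. Enabling reuse avoids those copies,
        // but is only safe if operators neither mutate nor retain their inputs.
        env.getConfig().enableObjectReuse();

        DataStream<Event> source = env.fromElements(
                new Event("a", 1), new Event("a", 2), new Event("b", 3));

        // First shuffle: the coarse key (spreads the load evenly in our case).
        DataStream<KeyedEvent> granular = source
                .keyBy("coarseKey")
                .map(new RichMapFunction<Event, KeyedEvent>() {
                    @Override
                    public KeyedEvent map(Event e) {
                        // The stateful key derivation would live here (e.g. a
                        // ValueState obtained in open()); the modulo below is
                        // placeholder logic only.
                        KeyedEvent out = new KeyedEvent();
                        out.granularKey = e.coarseKey + "#" + (e.value % 16);
                        out.value = e.value;
                        return out;
                    }
                });

        // Second shuffle: the granular key, so the state per key stays small.
        granular.keyBy("granularKey")
                .sum("value")   // stand-in for the heavier stateful computation
                .print();       // placeholder sink

        env.execute("two-stage keyBy sketch");
    }
}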
My follow-up question is: how much state per key is considered large enough to cause performance overhead? If I am within that limit after the first keyBy, I wouldn't need the second keyBy and could avoid the network shuffle.

Thanks,
Arun


Timo Walther wrote:
> Hi Andre,
>
> every keyBy is a shuffle over the network and thus introduces some
> overhead, especially the serialization of records between operators
> when object reuse is disabled (which it is by default). If not all
> slots (and thus all nodes) are occupied evenly by the first keyBy
> operation (e.g. if your key space has just 2 values), then it makes
> sense to have a second keyBy so that the heavy computation runs on
> the more granular key with as much parallelism as possible. It
> really depends on your job.
>
> I hope this helps.
>
> Regards,
> Timo
>
>
> On 03.04.18 at 03:22, 李玥 wrote:
>> Hello,
>> In my opinion, it would be meaningful only in this situation:
>> 1. The total size of all your state is huge, e.g. 1 GB+.
>> 2. Splitting your job into multiple keyBy steps would reduce the
>> size of your state.
>>
>> Because the operation of saving state is synchronous, all working
>> threads are blocked until the save finishes. Our team is trying to
>> make the state-saving process asynchronous; please refer to:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Slow-flink-checkpoint-td18946.html
>>
>> LiYue
>> http://tig.jd.com
>> liyue2008@
>>
>>
>>> On Apr 3, 2018, at 8:30 AM, au.fp2018 <au.fp2018@> wrote:
>>>
>>> Hello Flink Community,
>>>
>>> I am relatively new to Flink. In the project I am currently working
>>> on, I have a dataflow with a keyBy() operator, which I want to
>>> convert to a dataflow with multiple keyBy() operators like this:
>>>
>>> Source -->
>>> KeyBy() -->
>>> Stateful process() function that generates a more granular key -->
>>> KeyBy(<id generated in the previous step>) -->
>>> More stateful computation(s) -->
>>> Sink
>>>
>>> Are there any downsides to this approach?
>>> My reasoning behind the second keyBy() is to reduce the amount of
>>> state per key and hence improve the processing speed.
>>>
>>> Thanks,
>>> Andre
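P.S. Regarding LiYue's note about the synchronous snapshot blocking the working threads: an interim option I am looking at is a state backend that already snapshots keyed state asynchronously. A minimal sketch, assuming the FsStateBackend; the checkpoint URI and interval are placeholders:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AsyncSnapshotConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The second constructor argument requests asynchronous snapshots, so
        // the operator threads keep processing while state is written out.
        // "hdfs:///flink/checkpoints" is a placeholder URI.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60000); // checkpoint every 60 seconds

        // ... job definition and env.execute() would follow here
    }
}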