Hi Vino,

Thanks for the proposal. I think it is a very good feature!
One thing I want to make sure of is the semantics of `localKeyBy`. According to the document, the `localKeyBy` API returns an instance of `KeyedStream`, which can also perform sum(). In that case, what are the semantics of `localKeyBy()`? For example, will the following three programs produce the same result, and what are the differences between them?

1. input.keyBy(0).sum(1)
2. input.localKeyBy(0).sum(1).keyBy(0).sum(1)
3. input.localKeyBy(0).countWindow(5).sum(1).keyBy(0).sum(1)

It would also be great if we could add this to the document. Thank you very much.

Best,
Hequn

On Fri, Jun 14, 2019 at 11:34 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Aljoscha,
>
> I have looked at the "*Process*" section of the FLIP wiki page. [1] This mail
> thread indicates that it has proceeded to the third step.
>
> When I looked at the fourth step (the vote step), I didn't find the
> prerequisites for starting the voting process.
>
> The discussion of this feature has already taken place in the old
> thread. [2] So can you tell me when I should start voting? Can I start now?
>
> Best,
> Vino
>
> [1]:
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals#FlinkImprovementProposals-FLIPround-up
> [2]:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Local-Aggregation-in-Flink-td29307.html#a29308
>
> leesf <leesf0...@gmail.com> wrote on Thu, Jun 13, 2019 at 9:19 AM:
>
> > +1 for the FLIP, thanks Vino for your efforts.
> >
> > Best,
> > Leesf
> >
> > vino yang <yanghua1...@gmail.com> wrote on Wed, Jun 12, 2019 at 5:46 PM:
> >
> > > Hi folks,
> > >
> > > I would like to start the FLIP discussion thread about supporting local
> > > aggregation in Flink.
> > >
> > > In short, this feature can effectively alleviate data skew.
> > > This is the FLIP:
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-44%3A+Support+Local+Aggregation+in+Flink
> > >
> > > *Motivation* (copied from FLIP)
> > >
> > > Currently, keyed streams are widely used to perform aggregating
> > > operations (e.g., reduce, sum and window) on the elements that have the
> > > same key. When executed at runtime, the elements with the same key will
> > > be sent to and aggregated by the same task.
> > >
> > > The performance of these aggregating operations is very sensitive to the
> > > distribution of keys. In cases where the distribution of keys follows a
> > > power law, the performance will be significantly degraded. Worse,
> > > increasing the degree of parallelism does not help when a task is
> > > overloaded by a single key.
> > >
> > > Local aggregation is a widely adopted method to mitigate the performance
> > > degradation caused by data skew. We can decompose the aggregating
> > > operations into two phases. In the first phase, we aggregate the elements
> > > of the same key at the sender side to obtain partial results. Then, in
> > > the second phase, these partial results are sent to receivers according
> > > to their keys and are combined to obtain the final result. Since the
> > > number of partial results received by each receiver is limited by the
> > > number of senders, the imbalance among receivers can be reduced. Besides,
> > > reducing the amount of transferred data can further improve performance.
> > >
> > > *More details*:
> > >
> > > Design documentation:
> > > https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing
> > >
> > > Old discussion thread:
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Local-Aggregation-in-Flink-td29307.html#a29308
> > >
> > > JIRA: FLINK-12786 <https://issues.apache.org/jira/browse/FLINK-12786>
> > >
> > > We are looking forward to your feedback!
> > >
> > > Best,
> > > Vino
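
For readers following the thread, the two-phase scheme described in the quoted Motivation section can be sketched in plain Java, without any Flink dependency. This is only an illustration under stated assumptions: the class `LocalAggregationSketch`, the `Rec` type, and the methods `localSum`, `globalSum`, and `directSum` are hypothetical names made up for this example, and each sender's input is treated as a finite batch. It does not use or define the proposed `localKeyBy` API, and it does not answer the streaming-semantics question raised above (when and how often the local phase emits partial results).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalAggregationSketch {

    /** A (key, value) record; a hypothetical stand-in for the tuples in the thread's examples. */
    static final class Rec {
        final String key;
        final long value;

        Rec(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Phase 1 (sender side): sum the values per key within one sender's batch. */
    static Map<String, Long> localSum(List<Rec> senderInput) {
        Map<String, Long> partial = new HashMap<>();
        for (Rec r : senderInput) {
            partial.merge(r.key, r.value, Long::sum);
        }
        return partial;
    }

    /** Phase 2 (receiver side): combine the partial sums from all senders per key. */
    static Map<String, Long> globalSum(List<Map<String, Long>> partials) {
        Map<String, Long> result = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((k, v) -> result.merge(k, v, Long::sum));
        }
        return result;
    }

    /** Baseline: a single-phase sum over every raw record from every sender. */
    static Map<String, Long> directSum(List<List<Rec>> senders) {
        Map<String, Long> result = new HashMap<>();
        for (List<Rec> sender : senders) {
            for (Rec r : sender) {
                result.merge(r.key, r.value, Long::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two senders; key "a" is the skewed (hot) key.
        List<List<Rec>> senders = Arrays.asList(
            Arrays.asList(new Rec("a", 1), new Rec("a", 2), new Rec("a", 3), new Rec("b", 10)),
            Arrays.asList(new Rec("a", 4), new Rec("a", 5), new Rec("b", 20)));

        List<Map<String, Long>> partials = new ArrayList<>();
        int rawRecords = 0;      // records shuffled without local aggregation
        int partialRecords = 0;  // records shuffled with local aggregation
        for (List<Rec> sender : senders) {
            Map<String, Long> partial = localSum(sender);
            partials.add(partial);
            rawRecords += sender.size();
            partialRecords += partial.size();
        }

        System.out.println("two-phase equals direct: "
            + globalSum(partials).equals(directSum(senders)));
        System.out.println("shuffled without local aggregation: " + rawRecords);
        System.out.println("shuffled with local aggregation: " + partialRecords);
    }
}
```

In this toy input, seven raw records shrink to four partial results, and the receiver for the hot key "a" sees one partial result per sender (two) instead of five raw records. That matches the claim in the Motivation text that the number of partial results per receiver is bounded by the number of senders.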