Excited and Big +1 for this feature. SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月3日周一 下午3:37写道:
> Nice feature. > Looking forward to having it in Flink. > > Regards, > Xiaogang > > vino yang <yanghua1...@gmail.com> 于2019年6月3日周一 下午3:31写道: > > > Hi all, > > > > As we mentioned in some conference, such as Flink Forward SF 2019 and > QCon > > Beijing 2019, our team has implemented "Local aggregation" in our inner > > Flink fork. This feature can effectively alleviate data skew. > > > > Currently, keyed streams are widely used to perform aggregating > operations > > (e.g., reduce, sum and window) on the elements that having the same key. > > When executed at runtime, the elements with the same key will be sent to > > and aggregated by the same task. > > > > The performance of these aggregating operations is very sensitive to the > > distribution of keys. In the cases where the distribution of keys > follows a > > powerful law, the performance will be significantly downgraded. More > > unluckily, increasing the degree of parallelism does not help when a task > > is overloaded by a single key. > > > > Local aggregation is a widely-adopted method to reduce the performance > > degraded by data skew. We can decompose the aggregating operations into > two > > phases. In the first phase, we aggregate the elements of the same key at > > the sender side to obtain partial results. Then at the second phase, > these > > partial results are sent to receivers according to their keys and are > > combined to obtain the final result. Since the number of partial results > > received by each receiver is limited by the number of senders, the > > imbalance among receivers can be reduced. Besides, by reducing the amount > > of transferred data the performance can be further improved. > > > > The design documentation is here: > > > > > https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing > > > > Any comment and feedback are welcome and appreciated. > > > > Best, > > Vino > > >