Hi Vino,

Thanks for the proposal, I think it is a very good feature!

One thing I want to make sure is the semantics for the `localKeyBy`. From
the document, the `localKeyBy` API returns an instance of `KeyedStream`
which can also perform sum(), so in this case, what's the semantics for
`localKeyBy()`. For example, will the following code share the same result?
and what're the differences between them?

1. input.keyBy(0).sum(1)
2. input.localKeyBy(0).sum(1).keyBy(0).sum(1)
3. input.localKeyBy(0).countWindow(5).sum(1).keyBy(0).sum(1)

Would also be great if we can add this into the document. Thank you very
much.

Best, Hequn


On Fri, Jun 14, 2019 at 11:34 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Aljoscha,
>
> I have looked at the "*Process*" section of FLIP wiki page.[1] This mail
> thread indicates that it has proceeded to the third step.
>
> When I looked at the fourth step(vote step), I didn't find the
> prerequisites for starting the voting process.
>
> Considering that the discussion of this feature has been done in the old
> thread. [2] So can you tell me when should I start voting? Can I start now?
>
> Best,
> Vino
>
> [1]:
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals#FlinkImprovementProposals-FLIPround-up
> [2]:
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Local-Aggregation-in-Flink-td29307.html#a29308
>
> leesf <leesf0...@gmail.com> 于2019年6月13日周四 上午9:19写道:
>
> > +1 for the FLIP, thank vino for your efforts.
> >
> > Best,
> > Leesf
> >
> > vino yang <yanghua1...@gmail.com> 于2019年6月12日周三 下午5:46写道:
> >
> > > Hi folks,
> > >
> > > I would like to start the FLIP discussion thread about supporting local
> > > aggregation in Flink.
> > >
> > > In short, this feature can effectively alleviate data skew. This is the
> > > FLIP:
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-44%3A+Support+Local+Aggregation+in+Flink
> > >
> > >
> > > *Motivation* (copied from FLIP)
> > >
> > > Currently, keyed streams are widely used to perform aggregating
> > operations
> > > (e.g., reduce, sum and window) on the elements that have the same key.
> > When
> > > executed at runtime, the elements with the same key will be sent to and
> > > aggregated by the same task.
> > >
> > > The performance of these aggregating operations is very sensitive to
> the
> > > distribution of keys. In the cases where the distribution of keys
> > follows a
> > > powerful law, the performance will be significantly downgraded. More
> > > unluckily, increasing the degree of parallelism does not help when a
> task
> > > is overloaded by a single key.
> > >
> > > Local aggregation is a widely-adopted method to reduce the performance
> > > degraded by data skew. We can decompose the aggregating operations into
> > two
> > > phases. In the first phase, we aggregate the elements of the same key
> at
> > > the sender side to obtain partial results. Then at the second phase,
> > these
> > > partial results are sent to receivers according to their keys and are
> > > combined to obtain the final result. Since the number of partial
> results
> > > received by each receiver is limited by the number of senders, the
> > > imbalance among receivers can be reduced. Besides, by reducing the
> amount
> > > of transferred data the performance can be further improved.
> > >
> > > *More details*:
> > >
> > > Design documentation:
> > >
> > >
> >
> https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing
> > >
> > > Old discussion thread:
> > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Local-Aggregation-in-Flink-td29307.html#a29308
> > >
> > > JIRA: FLINK-12786 <https://issues.apache.org/jira/browse/FLINK-12786>
> > >
> > > We are looking forwards to your feedback!
> > >
> > > Best,
> > > Vino
> > >
> >
>

Reply via email to