I think this is independent of streaming. If you want to compute the aggregate over all keys and all data, you need to do this in a single task, e.g. use a (flat)map with parallelism 1, do the aggregation there, and then broadcast the result to the downstream operators. Does this make sense, or am I overlooking something?
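To make the data flow concrete, here is a rough sketch in plain Java with no Flink dependency. It only simulates the three steps (local aggregate per task, combine in one task, broadcast to all tasks). The class name, the method names, and the choice of a sum as the aggregate are my own assumptions for illustration, not Flink API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the pattern; nothing here is Flink API.
class GlobalAggregateSketch {

    // Local step: each parallel task aggregates only its own partition.
    static long localSum(List<Long> partition) {
        long sum = 0;
        for (long v : partition) {
            sum += v;
        }
        return sum;
    }

    // Global step: stands in for the operator with parallelism 1,
    // which sees every local aggregate and merges them.
    static long combine(List<Long> localSums) {
        long total = 0;
        for (long s : localSums) {
            total += s;
        }
        return total;
    }

    // Broadcast step: every downstream task receives its own copy
    // of the global aggregate.
    static List<Long> broadcast(long global, int downstreamTasks) {
        List<Long> copies = new ArrayList<>();
        for (int i = 0; i < downstreamTasks; i++) {
            copies.add(global);
        }
        return copies;
    }

    public static void main(String[] args) {
        // Three partitions stand in for three parallel tasks.
        List<List<Long>> partitions = List.of(
                List.of(1L, 2L, 3L),
                List.of(10L, 20L),
                List.of(100L));

        List<Long> locals = new ArrayList<>();
        for (List<Long> p : partitions) {
            locals.add(localSum(p));
        }

        long global = combine(locals);
        System.out.println(broadcast(global, partitions.size()));
    }
}
```

In a real job the combine step would be the operator running with parallelism 1, and on an unbounded stream it would have to emit the global aggregate per window or as a running value rather than once at the end.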

On 12 November 2016 at 12:18:04, Felix Neutatz (neut...@googlemail.com) wrote:
>
> want to calculate a local aggregation for each task and then
> combine all these local aggregates to one global aggregate and
> push this global aggregate to all nodes and continue processing
> the data stream. If you don't understand my description, I also
> made some drawings of what I mean:
> https://docs.google.com/presentation/d/13ei6pzhwNKqNShhdNWXqJaYCG1z0Hsrxfy5sRnqun5M/edit?usp=sharing