Hi Urs,
ad 1) Yes, my motivation for the bound was to prevent OOMEs. If you have
enough memory to hold an AggregateT for each key, you should be fine
without a bound. If the size of an AggregateT depends on the number of
aggregated elements, though, you might run into skew issues (see the
sketch below this message).
ad 2) AFA
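
To make the skew risk concrete, here is a hypothetical AggregateT (the name
and shape are assumed for illustration, not taken from the thread) whose
footprint grows with the number of aggregated elements, e.g. one that tracks
distinct values per key. A fixed-size aggregate such as a running sum stays
O(1) per key; with this one, a single hot key can blow up its state:

import java.util.HashSet;
import java.util.Set;

// Hypothetical AggregateT whose memory footprint grows with the number
// of aggregated elements: it keeps every distinct value seen for its key.
// A running-sum aggregate stays O(1) per key; here a single hot key can
// inflate its per-key state arbitrarily -- the skew issue above.
public class DistinctValuesAggregate {
    private final Set<String> distinctValues = new HashSet<>();

    public void add(String value) {
        distinctValues.add(value); // grows with distinct inputs for this key
    }

    public long distinctCount() {
        return distinctValues.size();
    }
}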
Hi Fabian,
thanks, that is very helpful indeed - I now understand why the DataSet
drivers insist on sorting the buffers and then processing them instead of
keeping state.
In our case, the state should easily fit into the heap of the cluster,
though. In a quick-and-dirty example I tried just now, the MapPartition
Hi Urs,
on the DataSet API, the only memory-safe way to do it is a
GroupReduceFunction.
As you observed, this requires a full sort of the dataset, which can be
quite expensive, but after the sort the computation is streamed.
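
For reference, a minimal self-contained sketch of that sort-based route
(the element type Tuple2<String, Long> and the summing logic are assumptions
for illustration):

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class SortBasedAggregation {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Long>> input = env.fromElements(
            Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));

        // groupBy triggers a sort by key; reduce() then sees one group at a
        // time as a streamed Iterable, so memory use does not depend on the
        // number of keys.
        input.groupBy(0)
             .reduceGroup(new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                 @Override
                 public void reduce(Iterable<Tuple2<String, Long>> values,
                                    Collector<Tuple2<String, Long>> out) {
                     String key = null;
                     long sum = 0L;
                     for (Tuple2<String, Long> v : values) {
                         key = v.f0;
                         sum += v.f1;
                     }
                     out.collect(Tuple2.of(key, sum));
                 }
             })
             .print();
    }
}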
You could also try to manually implement a hash-based combiner using a
MapPartitionFunction.
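
A minimal sketch of such a hash-based combiner (the class name HashCombiner,
the MAX_KEYS bound, and the Tuple2<String, Long> element type are assumptions
for illustration): it builds per-key partial sums in a HashMap and flushes
whenever the map exceeds the bound - exactly the OOME guard from point 1
above. A downstream groupBy then merges the partial results exactly.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Hash-based combiner: pre-aggregates per-key partial sums in a HashMap
// and flushes the map whenever it grows past MAX_KEYS, bounding heap use.
// A key may be emitted more than once (once per flush); the downstream
// groupBy/sum merges those partials into the exact result.
public class HashCombiner
        implements MapPartitionFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final int MAX_KEYS = 100_000; // assumed bound, tune to your heap

    @Override
    public void mapPartition(Iterable<Tuple2<String, Long>> values,
                             Collector<Tuple2<String, Long>> out) {
        Map<String, Long> partials = new HashMap<>();
        for (Tuple2<String, Long> v : values) {
            partials.merge(v.f0, v.f1, Long::sum);
            if (partials.size() >= MAX_KEYS) {
                flush(partials, out); // emit partial aggregates, free memory
            }
        }
        flush(partials, out);
    }

    private static void flush(Map<String, Long> partials,
                              Collector<Tuple2<String, Long>> out) {
        for (Map.Entry<String, Long> e : partials.entrySet()) {
            out.collect(Tuple2.of(e.getKey(), e.getValue()));
        }
        partials.clear();
    }
}

You would then wire it up as, e.g.,
input.mapPartition(new HashCombiner()).groupBy(0).sum(1). The combiner only
shrinks the data volume before the shuffle; the final groupBy still produces
the exact aggregates.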