Re: DataSet: combineGroup/reduceGroup with large number of groups

2017-06-19 Thread Fabian Hueske
Hi Urs, ad 1) Yes, my motivation for the bound was to prevent OOMEs. If you have enough memory to hold an AggregateT for each key, you should be fine without a bound. If the size of AggregateT depends on the number of aggregated elements, you might run into skew issues, though. ad 2) AFA
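Fabian's point about the bound can be illustrated with a minimal sketch (not from the thread; the class, key type, and the MAX_KEYS value are all illustrative): a hash aggregation that caps the number of in-flight keys and flushes partial results whenever the cap is exceeded, trading one exact aggregate per key for bounded memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a bounded hash aggregation: partial sums are flushed
// whenever the map holds more than MAX_KEYS entries, so memory use
// stays bounded even for a huge key space (no OOME). A key may then
// appear more than once in the output, so a downstream reduce must
// merge the partials.
public class BoundedHashAggregator {
    static final int MAX_KEYS = 2; // illustrative bound, tiny for demonstration

    public static List<Map.Entry<String, Long>> aggregate(
            List<Map.Entry<String, Long>> input) {
        Map<String, Long> partials = new HashMap<>();
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : input) {
            partials.merge(e.getKey(), e.getValue(), Long::sum);
            if (partials.size() > MAX_KEYS) {    // bound exceeded:
                out.addAll(partials.entrySet()); // emit partial aggregates
                partials = new HashMap<>();      // and release the memory
            }
        }
        out.addAll(partials.entrySet());         // final flush
        return out;
    }
}
```

This also shows the skew caveat: if one key's AggregateT grows with the number of aggregated elements, the bound on the key count alone no longer bounds memory.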

Re: DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Urs Schoenenberger
Hi Fabian, thanks, that is very helpful indeed - I now understand why the DataSet drivers insist on sorting the buffers and then processing instead of keeping state. In our case, the state should easily fit into the heap of the cluster, though. In a quick&dirty example I tried just now, the MapPa

Re: DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Fabian Hueske
Hi Urs, on the DataSet API, the only memory-safe way to do it is a GroupReduceFunction. As you observed, this requires a full sort of the dataset, which can be quite expensive, but after the sort the computation is streamed. You could also try to manually implement a hash-based combiner using a MapPart
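The hash-based combiner idea mentioned here can be sketched without Flink dependencies (class and method names are illustrative; in actual Flink this logic would live inside a `MapPartitionFunction` and be followed by a `groupBy(...).reduce(...)` that merges the per-partition partials):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of a hash-based combiner in the style of a per-partition
// pre-aggregation: all records of one partition are folded into a
// hash map (no sort needed), then one partial aggregate per key is
// emitted. Partials from different partitions still need a final
// keyed reduce downstream.
public class HashCombiner {
    public static void combinePartition(Iterator<Map.Entry<String, Long>> records,
                                        BiConsumer<String, Long> collector) {
        Map<String, Long> partials = new HashMap<>();
        while (records.hasNext()) {
            Map.Entry<String, Long> r = records.next();
            partials.merge(r.getKey(), r.getValue(), Long::sum);
        }
        partials.forEach(collector); // emit one partial aggregate per key
    }
}
```

Unlike the sort-based GroupReduce path, this keeps all per-key state on the heap at once, which is exactly why it is not memory-safe for an unbounded number of groups.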