Hi Urs,
ad 1) Yes, my motivation for the bound was to prevent OOMEs. If you have
enough memory to hold an AggregateT for each key, you should be fine
without a bound. If the size of an AggregateT depends on the number of
aggregated elements, though, you might run into skew issues (see the
sketch below this message).
ad 2) AFA
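
To make the skew risk concrete, here is a hypothetical AggregateT (the name
and shape are assumed for illustration, not taken from the thread) whose
footprint grows with the number of aggregated elements, e.g. one that tracks
distinct values per key. A fixed-size aggregate such as a running sum stays
O(1) per key; with this one, a single hot key can blow up its state:

import java.util.HashSet;
import java.util.Set;

// Hypothetical AggregateT whose memory footprint grows with the number
// of aggregated elements: it keeps every distinct value seen for its key.
// A running-sum aggregate stays O(1) per key; here a single hot key can
// inflate its per-key state arbitrarily -- the skew issue above.
public class DistinctValuesAggregate {
    private final Set<String> distinctValues = new HashSet<>();

    public void add(String value) {
        distinctValues.add(value); // grows with distinct inputs for this key
    }

    public long distinctCount() {
        return distinctValues.size();
    }
}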
Hi Fabian,
thanks, that is very helpful indeed - I now understand why the DataSet
drivers insist on sorting the buffers and then processing them instead of
keeping state.
In our case, the state should easily fit into the heap of the cluster,
though. In a quick-and-dirty example I tried just now, the MapPartition
Hi Urs,
on the DataSet API, the only memory-safe way to do it is a
GroupReduceFunction.
As you observed, this requires a full sort of the dataset, which can be
quite expensive, but after the sort the computation is streamed.
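
For reference, a minimal self-contained sketch of that sort-based route
(the element type Tuple2<String, Long> and the summing logic are assumptions
for illustration):

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class SortBasedAggregation {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Long>> input = env.fromElements(
            Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));

        // groupBy triggers a sort by key; reduce() then sees one group at a
        // time as a streamed Iterable, so memory use does not depend on the
        // number of keys.
        input.groupBy(0)
             .reduceGroup(new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                 @Override
                 public void reduce(Iterable<Tuple2<String, Long>> values,
                                    Collector<Tuple2<String, Long>> out) {
                     String key = null;
                     long sum = 0L;
                     for (Tuple2<String, Long> v : values) {
                         key = v.f0;
                         sum += v.f1;
                     }
                     out.collect(Tuple2.of(key, sum));
                 }
             })
             .print();
    }
}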
You could also try to manually implement a hash-based combiner using a
MapPartitionFunction.
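
A minimal sketch of such a hash-based combiner (the class name HashCombiner,
the MAX_KEYS bound, and the Tuple2<String, Long> element type are assumptions
for illustration): it builds per-key partial sums in a HashMap and flushes
whenever the map exceeds the bound - exactly the OOME guard from point 1
above. A downstream groupBy then merges the partial results exactly.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Hash-based combiner: pre-aggregates per-key partial sums in a HashMap
// and flushes the map whenever it grows past MAX_KEYS, bounding heap use.
// A key may be emitted more than once (once per flush); the downstream
// groupBy/sum merges those partials into the exact result.
public class HashCombiner
        implements MapPartitionFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final int MAX_KEYS = 100_000; // assumed bound, tune to your heap

    @Override
    public void mapPartition(Iterable<Tuple2<String, Long>> values,
                             Collector<Tuple2<String, Long>> out) {
        Map<String, Long> partials = new HashMap<>();
        for (Tuple2<String, Long> v : values) {
            partials.merge(v.f0, v.f1, Long::sum);
            if (partials.size() >= MAX_KEYS) {
                flush(partials, out); // emit partial aggregates, free memory
            }
        }
        flush(partials, out);
    }

    private static void flush(Map<String, Long> partials,
                              Collector<Tuple2<String, Long>> out) {
        for (Map.Entry<String, Long> e : partials.entrySet()) {
            out.collect(Tuple2.of(e.getKey(), e.getValue()));
        }
        partials.clear();
    }
}

You would then wire it up as, e.g.,
input.mapPartition(new HashCombiner()).groupBy(0).sum(1). The combiner only
shrinks the data volume before the shuffle; the final groupBy still produces
the exact aggregates.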