Hi Urs,

on the DataSet API, the only memory-safe way to do this is a
GroupReduceFunction.
As you observed, this requires a full sort of the dataset, which can be quite
expensive, but after the sort the computation is streamed.
You could also try to manually implement a hash-based combiner using a
MapPartitionFunction. The function would hold a HashMap of partial
aggregates, keyed by the grouping key, with a fixed maximum size that needs
to be tuned manually.
When a new key has to be inserted but the map has already reached its
maximum size, you have to evict an entry first and emit its partial
aggregate. Since all of this happens on the heap, it won't be memory-safe
and might fail with an OutOfMemoryError.
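
A minimal sketch of that idea, assuming a String grouping key, a
hypothetical getKey() accessor, and the addToAggrate() helper from your
mail (also assumed to accept a null aggregate for a new key):

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HashCombiner
        implements MapPartitionFunction<RecordT, Tuple2<String, AggregateT>> {

    // Maximum number of keys kept on the heap; must be tuned by hand.
    private static final int MAX_SIZE = 1_000_000;

    @Override
    public void mapPartition(Iterable<RecordT> records,
                             Collector<Tuple2<String, AggregateT>> out) {
        Map<String, AggregateT> partials = new HashMap<>(MAX_SIZE);
        for (RecordT record : records) {
            String key = getKey(record); // hypothetical key accessor
            AggregateT agg = partials.get(key);
            if (agg == null && partials.size() >= MAX_SIZE) {
                // Map is full: evict an arbitrary entry and emit its partial aggregate.
                Iterator<Map.Entry<String, AggregateT>> it = partials.entrySet().iterator();
                Map.Entry<String, AggregateT> evicted = it.next();
                it.remove();
                out.collect(Tuple2.of(evicted.getKey(), evicted.getValue()));
            }
            partials.put(key, addToAggrate(agg, record)); // your fold helper
        }
        // Flush the remaining partial aggregates at the end of the partition.
        for (Map.Entry<String, AggregateT> e : partials.entrySet()) {
            out.collect(Tuple2.of(e.getKey(), e.getValue()));
        }
    }
}

The emitted tuples are only partial aggregates, so you'd still need a
groupBy(0) with a reduce step that merges AggregateTs afterwards. How much
this shrinks the sorted data depends on how often keys repeat within a
partition.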

On the DataStream API you can use a ProcessFunction with keyed ValueState
that holds the current AggregateT of each key. For each record you fetch
the aggregate from the state, update it, and write it back.
Since the final aggregates only live in the local state and would otherwise
never be emitted, you'll also need to register a timer that fires at the
end of the stream and emits them.
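
A minimal sketch, again assuming your RecordT/AggregateT types and the
addToAggrate() helper (handling a null aggregate for the first record of a
key). The Long.MAX_VALUE timer relies on the final watermark that Flink
emits when a finite stream ends; registering the same timestamp repeatedly
for a key is deduplicated:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class FoldProcessFunction extends ProcessFunction<RecordT, AggregateT> {

    // Keyed state holding the current aggregate of each key.
    private transient ValueState<AggregateT> aggState;

    @Override
    public void open(Configuration parameters) {
        aggState = getRuntimeContext().getState(
                new ValueStateDescriptor<AggregateT>("agg", AggregateT.class));
    }

    @Override
    public void processElement(RecordT record, Context ctx, Collector<AggregateT> out)
            throws Exception {
        // Fetch, update, and write back the aggregate of the current key.
        aggState.update(addToAggrate(aggState.value(), record));
        // Timer at the maximum timestamp: the final watermark of a finite
        // stream will fire it once per key when the input is exhausted.
        ctx.timerService().registerEventTimeTimer(Long.MAX_VALUE);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<AggregateT> out)
            throws Exception {
        // Emit the final aggregate of this key.
        out.collect(aggState.value());
    }
}

You'd apply it as stream.keyBy(...).process(new FoldProcessFunction()).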
Another thing to consider is the state backend. You'll probably have to use
the RocksDBStateBackend to be able to spill state to disk.
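
Setting it up is a one-liner on the environment (sketch only; it needs the
flink-statebackend-rocksdb dependency, and the checkpoint URI below is just
a placeholder):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Keyed state is held in local RocksDB instances (which spill to disk) and
// checkpointed to the given URI.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));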

Hope this helps,
Fabian


2017-06-16 17:00 GMT+02:00 Urs Schoenenberger <
urs.schoenenber...@tngtech.com>:

> Hi,
>
> I'm working on a batch job (roughly 10 billion records of input, 10
> million groups) that is essentially a 'fold' over each group, that is, I
> have a function
>
> AggregateT addToAggrate(AggregateT agg, RecordT record) {...}
>
> and want to fold this over each group in my DataSet.
>
> My understanding is that I cannot use .groupBy(0).reduce(...) since the
> ReduceFunction only supports the case where AggregateT is the same as
> RecordT.
>
> A simple solution using .reduceGroup(...) works, but spills all input
> data in the reduce step, which produces a lot of slow & expensive Disk IO.
>
> Therefore, we tried using .combineGroup(...).reduceGroup(...), but
> experienced a similar amount of spilling. Checking the source of the
> *Combine drivers, it seems that they accumulate events in a buffer, sort
> the buffer by key, and combine adjacent records in the same group. This
> does not work in my case due to the large number of groups - the records
> in the buffer are most likely to all belong to different groups. The
> "combine" phase therefore becomes a noop turning a single RecordT into
> an AggregateT, and the reduce phase has 10 billion AggregateTs to combine.
>
> Is there a way of modelling this computation efficiently with the
> DataSet API? Alternatively, can I turn this into a DataStream job? (The
> implementation there would simply be a MapFunction on a KeyedStream with
> the AggregateT residing in keyed state, although I don't know how I
> would emit this state at the end of the data stream only.)
>
> Thanks,
> Urs
>
> --
> Urs Schönenberger
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
