I think that roughly, an approach like the compacting hash table is the right one. Go ahead and take a stab at it, if you want, ping us if you run into obstacles.
Here are a few thoughts on the hash-aggregator from discussions between Fabian and me: 1) It may be worth to have a specialized implementations for aggregates of constant-length values. Such as counts, or sum/max/min of types like int/long/double. The value can be updated in place, there is no need ever for compaction logic. The compacting logic is in the end more tricky than it seems at the first glance, it took quite a few cycles to eliminate most bugs. 2) I would try to tailor the hash aggregator to a "combiner" functionality initially. That means no spilling on memory shortage, but eviction of pre-aggregated results into the output stream. That is probably easier to do and the most powerful improvement over the current capabilities. Happy coding! Greetings, Stephan On Thu, Oct 1, 2015 at 8:33 PM, Gábor Gévay <gga...@gmail.com> wrote: > Hello, > > I would really like to see FLINK-2237 solved. > I would implement this feature over the weekend, if the > CompactingHashTable can be used to solve it (see my comment there). > Could you please give me some advice on whether is this a viable > approach, or you perhaps see some difficulties that I'm not aware of? > > Best, > Gabor >