It seems like a reasonable approach. I think my initial gut feeling would be that initializing and finalizing state for each change of key might be a bit heavyweight in cases where there are only a few values per key. I think these cases are fairly common as a data simplification / cleaning pass. Probably only benchmarks will really say.
On Tue, Sep 6, 2022 at 10:26 AM Yaron Gvili <rt...@hotmail.com> wrote: > > Hi All, > > I'm working on a design for ordered aggregations in Arrow C++ and would like > to get some opinions about it. Ordered aggregation is similar to grouped > aggregation except that one column in the grouping key is (known to be) > ordered. The result of both types of aggregations is the same but the > existence of an ordered column enables optimizing. In particular, the > grouping keys and their aggregation output can be flushed whenever all rows > having the same ordered value have been processed. > > Main points of the design: > > 1. Create a class OrderedGroupAggregator derived from GroupedAggregator. > This class will wrap a factory for an inner OrderedAggregator, passed in via > options. The factory allows the inner OrderedAggregator to be recreated for > the next ordered value after it is finalized for the current ordered value. > 2. Add a callback API that accepts finalized aggregation output. This > callback will be implemented by the aggregation node and will stream-forward > finalized aggregation output. > 3. Add an interface for selecting the next range of rows from the batch to > be processed. The range may be open-ended, meaning that more rows in the next > batch may be added to the range. In the grouped aggregation case this range > would be the entire batch and open-ended, whereas in the ordered aggregation > case this range would end at a row where the ordered value changes. After a > range has been processed, finalized aggregation output will be flushed. > > Cheers, > Yaron.