It seems like a reasonable approach.  I think my initial gut feeling
would be that initializing and finalizing state for each change of key
might be a bit heavyweight in cases where there are only a few values
per key.  I think these cases are fairly common as a data
simplification / cleaning pass.  Probably only benchmarks will really
say.

On Tue, Sep 6, 2022 at 10:26 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi All,
>
> I'm working on a design for ordered aggregations in Arrow C++ and would like 
> to get some opinions about it. Ordered aggregation is similar to grouped 
> aggregation except that one column in the grouping key is (known to be) 
> ordered. The result of both types of aggregations is the same but the 
> existence of an ordered column enables optimizing. In particular, the 
> grouping keys and their aggregation output can be flushed whenever all rows 
> having the same ordered value have been processed.
>
> Main points of the design:
>
>   1.  Create a class OrderedGroupAggregator derived from GroupedAggregator. 
> This class will wrap a factory for an inner OrderedAggregator, passed in via 
> options. The factory allows the inner OrderedAggregator to be recreated for 
> the next ordered value after it is finalized for the current ordered value.
>   2.  Add a callback API that accepts finalized aggregation output. This 
> callback will be implemented by the aggregation node and will stream-forward 
> finalized aggregation output.
>   3.  Add an interface for selecting the next range of rows from the batch to 
> be processed. The range may be open-ended, meaning that more rows in the next 
> batch may be added to the range. In the grouped aggregation case this range 
> would be the entire batch and open-ended, whereas in the ordered aggregation 
> case this range would end at a row where the ordered value changes. After a 
> range has been processed, finalized aggregation output will be flushed.
>
> Cheers,
> Yaron.

Reply via email to