Thanks Jorge, Wes,
Will study the links and try to propose improvements of c++ aggregate function.

On 9/16/20 11:17 PM, Wes McKinney wrote:
Perhaps it would be helpful to look at how Clickhouse's aggregate
functions are implemented?

https://github.com/ClickHouse/ClickHouse/tree/master/src/AggregateFunctions

You're welcome to propose improvements to the kernel interface to
accommodate more complex aggregate functions

On Wed, Sep 16, 2020 at 5:27 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:

Hi Yibo,

That is correct. The simplest example is an average of 3 elements
{x1,x2,x3}, in two chunks: {x1} and {x2,x3}. The average of the average is
not equal to the average:

avg({avg({x1}), avg({x2,x3})}) = ((x1) + (x2 + x3)/2)/2 != (x1 + x2 + x3) /
3 = avg({x1,x2,x3})

We are solving this in DataFusion (Rust):

Issue: https://issues.apache.org/jira/browse/ARROW-9937
Proposal:
https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit
PR https://github.com/apache/arrow/pull/8172,

Essentially, what you said, there needs to be two operations, `update` and
`merge`, which are not the same.
There are many other examples, and that is also why e.g. spark's UDAF's
interface requires `update` and `merge`




On Wed, Sep 16, 2020 at 12:12 PM Yibo Cai <yibo....@arm.com> wrote:

Hi,

I have a question about aggregate kernel implementation. Any help is
appreciated.

Aggregate kernel implements "consume" and "merge" interfaces. For a
chunked array, "consume" is called for each array to get a temporary
aggregated result, then "merge" it with previously consumed result. For
associative operations like min/max/sum, this pattern is convenient. We can
easily "merge" min/max/sum of two arrays, e.g, sum([array_a, array_b]) =
sum(array_a) + sum(array_b).

But I wonder what's the best approach to deal with operations like
stdev/percentile. Results of these operations cannot be easily "merged". We
have to walk through all the chunks to get the result. For these
operations, looks "consume" must copy the input array and do all
calculation once at "finalize" time. Or we don't expect it to support
chunked array for them.

Yibo

Reply via email to