Re: [C++][Compute] question about aggregate kernels

2020-09-20 Thread Yibo Cai
Appreciate the helps from everyone. Looks arrow c++ aggregate kernel already addressed my problem of combining states from batches. Thanks. On 9/18/20 12:08 PM, Micah Kornfield wrote: Interestingly, spark uses count / N

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Micah Kornfield
> > Interestingly, spark uses count / N > > to > compute the average, not an online algorithm. Yes, it looks like the actual Spark

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Jorge Cardoso Leitão
Hi, > I think what everyone else was potentially stating implicitly is that for combining details about arrays, for std. dev. and average there needs to be more state kept that is different from the elements that one is actually dealing with. For std. dev. you need to keep two numbers (same wit

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Micah Kornfield
> > stdev = sqrt(sum_squares/N - (sum/N)^2) >From [1] this is considered the naive algorithm: Because SumSq and (Sum×Sum)/*n* can be very similar numbers, cancellation > can lead to the > precision

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Andrew Wieteska
Dear all I'm not sure I'm thinking about this right, but if we're looking to leverage vectorization for standard deviation/variance would it make sense to compute the sum, the sum of squares, and the total number of data (N) over all chunks and compute the actual function, stdev = sqrt(sum_square

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Micah Kornfield
> > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1))) This is not numerically stable. Please do not use it. Please see [1] for some algorithms that might be better. The equation you provided is great in practice to calculate stdev for one > array. It doesn't address the issu

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Yibo Cai
Thanks Jorge, Wes, Will study the links and try to propose improvements of c++ aggregate function. On 9/16/20 11:17 PM, Wes McKinney wrote: Perhaps it would be helpful to look at how Clickhouse's aggregate functions are implemented? https://github.com/ClickHouse/ClickHouse/tree/master/src/Aggre

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Yibo Cai
Thanks Andrew. The link gives a cool method to calculate variance incrementally. I think the problem is that it's computationally too expensive (cannot leverage vectorization, three divisions for a single data point). The equation you provided is great in practice to calculate stdev for one arr

Re: [C++][Compute] question about aggregate kernels

2020-09-16 Thread Wes McKinney
Perhaps it would be helpful to look at how Clickhouse's aggregate functions are implemented? https://github.com/ClickHouse/ClickHouse/tree/master/src/AggregateFunctions You're welcome to propose improvements to the kernel interface to accommodate more complex aggregate functions On Wed, Sep 16,

Re: [C++][Compute] question about aggregate kernels

2020-09-16 Thread Jorge Cardoso Leitão
Hi Yibo, That is correct. The simplest example is an average of 3 elements {x1,x2,x3}, in two chunks: {x1} and {x2,x3}. The average of the average is not equal to the average: avg({avg({x1}), avg({x2,x3})}) = ((x1) + (x2 + x3)/2)/2 != (x1 + x2 + x3) / 3 = avg({x1,x2,x3}) We are solving this in D

Re: [C++][Compute] question about aggregate kernels

2020-09-16 Thread Andrew Lamb
Perhaps you can rewrite the functions in terms of other kernels that can be merged -- for example something like the following stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1))) (loosely translated from https://math.stackexchange.com/questions/102978/incremental-computation-of-

[C++][Compute] question about aggregate kernels

2020-09-16 Thread Yibo Cai
Hi, I have a question about aggregate kernel implementation. Any help is appreciated. Aggregate kernel implements "consume" and "merge" interfaces. For a chunked array, "consume" is called for each array to get a temporary aggregated result, then "merge" it with previously consumed result. For