Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Micah Kornfield
> Interestingly, spark uses count / N to compute the average, not an online algorithm.
Yes, it looks like the actual Spark

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Jorge Cardoso Leitão
Hi,
> I think what everyone else was potentially stating implicitly is that for combining details about arrays, for std. dev. and average there needs to be more state kept that is different from the elements that one is actually dealing with. For std. dev. you need to keep two numbers (same wit
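A minimal sketch of the kind of mergeable per-chunk state discussed above, written in Python for brevity rather than Arrow's C++. It assumes the pairwise combination formula usually attributed to Chan et al. and keeps (count, mean, M2) per chunk, which may differ from the state the truncated message goes on to describe. The names `VarState`, `from_chunk`, `merge`, and `stddev` are hypothetical and are not Arrow APIs.

```python
import math
from dataclasses import dataclass


@dataclass
class VarState:
    """Partial aggregation state for one chunk (hypothetical helper)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the chunk mean


def from_chunk(values):
    """Build the state for a single chunk with a two-pass computation."""
    n = len(values)
    if n == 0:
        return VarState()
    mean = math.fsum(values) / n
    m2 = math.fsum((v - mean) ** 2 for v in values)
    return VarState(n, mean, m2)


def merge(a, b):
    """Combine two partial states without revisiting the elements."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return VarState(n, mean, m2)


def stddev(state, ddof=0):
    """Standard deviation from a merged state (ddof=1 for the sample estimate)."""
    return math.sqrt(state.m2 / (state.count - ddof))


# Usage: aggregate each chunk independently, then merge the states.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
merged = VarState()
for chunk in chunks:
    merged = merge(merged, from_chunk(chunk))
```

The per-chunk step stays a plain loop over contiguous values, and only the small merge step runs once per chunk.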

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Micah Kornfield
> stdev = sqrt(sum_squares/N - (sum/N)^2)
From [1] this is considered the naive algorithm:
> Because SumSq and (Sum×Sum)/n can be very similar numbers, cancellation can lead to the precision
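The cancellation problem quoted above can be reproduced with a small Python sketch. The data set (small spread around a large offset of 1e9) and the function names `naive_variance` and `two_pass_variance` are illustrative assumptions, not taken from the thread. The exact population variance of this data is 22.5.

```python
def naive_variance(values):
    """Population variance via sum_squares/N - (sum/N)^2 (the naive formula)."""
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for v in values:
        total += v
        total_sq += v * v
    mean = total / n
    return total_sq / n - mean * mean


def two_pass_variance(values):
    """Population variance computed from deviations around the mean."""
    n = len(values)
    total = 0.0
    for v in values:
        total += v
    mean = total / n
    m2 = 0.0
    for v in values:
        m2 += (v - mean) * (v - mean)
    return m2 / n


# Values with a large common offset: the squares are near 1e18, where
# adjacent doubles are far apart compared with the true variance.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
naive = naive_variance(data)
stable = two_pass_variance(data)
print("naive:", naive, "two-pass:", stable)
```

On IEEE double precision the naive result is dominated by rounding error and can even come out negative, in which case taking the square root fails, while the two-pass result stays at the true value.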

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Andrew Wieteska
Dear all, I'm not sure I'm thinking about this right, but if we're looking to leverage vectorization for standard deviation/variance, would it make sense to compute the sum, the sum of squares, and the total number of data points (N) over all chunks and compute the actual function, stdev = sqrt(sum_square

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Micah Kornfield
> stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1))
This is not numerically stable. Please do not use it. Please see [1] for some algorithms that might be better.
> The equation you provided is great in practice to calculate stdev for one array. It doesn't address the issu

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Yibo Cai
Thanks Jorge, Wes. Will study the links and try to propose improvements to the C++ aggregate functions.
On 9/16/20 11:17 PM, Wes McKinney wrote:
> Perhaps it would be helpful to look at how Clickhouse's aggregate functions are implemented? https://github.com/ClickHouse/ClickHouse/tree/master/src/Aggre

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Yibo Cai
Thanks Andrew. The link gives a cool method to calculate variance incrementally. I think the problem is that it's computationally too expensive (cannot leverage vectorization, three divisions for a single data point). The equation you provided is great in practice to calculate stdev for one arr
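For reference, one widely used incremental method is Welford's algorithm, sketched below in Python. Whether this is the exact method behind the link mentioned above is an assumption, and the class name `RunningVariance` is hypothetical. The sketch shows why such updates are hard to vectorize: each element depends on the running mean left by the previous one, and each update needs a division.

```python
import math


class RunningVariance:
    """Welford-style incremental mean/variance (illustrative sketch)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each step reads the mean produced by the previous step, so the
        # loop carries a serial dependency from one element to the next.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self, ddof=0):
        return self.m2 / (self.count - ddof)

    def stddev(self, ddof=0):
        return math.sqrt(self.variance(ddof))


# Usage: feed values one at a time, for example from a stream.
acc = RunningVariance()
for value in [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]:
    acc.update(value)
```

A common compromise is to run a vector-friendly two-pass computation inside each chunk and use an incremental or pairwise update only when combining chunks.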