Dear all,

I'm not sure I'm thinking about this right, but if we're looking to leverage vectorization for standard deviation/variance, would it make sense to compute the sum, the sum of squares, and the total number of data points (N) over all chunks, and then compute the actual function,
stdev = sqrt(sum_squares/N - (sum/N)^2)

only once at the end? This is one of the approaches in [1].

Best wishes
Andrew

On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x)) / (count(x) - 1))
>
> This is not numerically stable. Please do not use it. Please see [1] for
> some algorithms that might be better.
>
> > The equation you provided is great in practice to calculate stdev for one
> > array. It doesn't address the issue of combining stdev from multiple
> > arrays.
>
> I think what everyone else was potentially stating implicitly is that for
> combining details about arrays, for std. dev. and average there needs to
> be more state kept that is different from the elements that one is
> actually dealing with. For std. dev. you need to keep two numbers (same
> with average).
>
> For percentiles, I think calculating exactly will require quite a large
> state (for integers a histogram approach could be used to compress this).
> There are, however, some good approximation algorithms that can be used
> if exact values are not necessary (for example t-digest [2]). At some
> point Arrow should probably have both.
>
> [1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
> [2] https://github.com/tdunning/t-digest
>
> On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
>
> > Thanks Andrew. The link gives a cool method to calculate variance
> > incrementally. I think the problem is that it's computationally too
> > expensive (it cannot leverage vectorization, and needs three divisions
> > for a single data point).
> >
> > The equation you provided is great in practice to calculate stdev for
> > one array. It doesn't address the issue of combining stdev from
> > multiple arrays.
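[Editor's note: the sum / sum-of-squares idea at the top of the thread can be sketched as follows. This is an illustrative Python sketch, not Arrow's actual kernel API; the names `consume`, `merge`, and `finalize_stdev` are hypothetical. Per the "Computing shifted data" section of [1], shifting every value by a constant close to the mean keeps the per-chunk state mergeable by plain addition while avoiding the worst catastrophic cancellation of the raw textbook formula.]

```python
import math

def consume(chunk, shift):
    """One vectorizable pass over a chunk: accumulate mergeable sums
    of the values shifted by a constant (e.g. the first data point)."""
    n = len(chunk)
    s = sum(x - shift for x in chunk)
    ss = sum((x - shift) ** 2 for x in chunk)
    return n, s, ss

def merge(a, b):
    """States computed with the same shift merge by elementwise addition."""
    return tuple(x + y for x, y in zip(a, b))

def finalize_stdev(state, ddof=1):
    """Compute stdev once at the end; the shift cancels out of the
    variance, so it never needs to be undone."""
    n, s, ss = state
    return math.sqrt((ss - s * s / n) / (n - ddof))
```

Usage: pick the shift once (say, the first element of the first chunk), `consume` every chunk with that same shift, `merge` the per-chunk triples, and call `finalize_stdev` on the combined state.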
> >
> > On 9/16/20 6:25 PM, Andrew Lamb wrote:
> > > Perhaps you can rewrite the functions in terms of other kernels that
> > > can be merged -- for example something like the following
> > >
> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x)) / (count(x) - 1))
> > >
> > > (loosely translated from
> > > https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
> > > )
> > >
> > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I have a question about aggregate kernel implementation. Any help is
> > >> appreciated.
> > >>
> > >> Aggregate kernels implement "consume" and "merge" interfaces. For a
> > >> chunked array, "consume" is called on each array to get a temporary
> > >> aggregated result, which is then "merged" with the previously
> > >> consumed result. For associative operations like min/max/sum, this
> > >> pattern is convenient. We can easily "merge" min/max/sum of two
> > >> arrays, e.g., sum([array_a, array_b]) = sum(array_a) + sum(array_b).
> > >>
> > >> But I wonder what's the best approach to deal with operations like
> > >> stdev/percentile. Results of these operations cannot be easily
> > >> "merged". We have to walk through all the chunks to get the result.
> > >> For these operations, it looks like "consume" must copy the input
> > >> array and do all the calculation at "finalize" time. Or we don't
> > >> expect them to support chunked arrays.
> > >>
> > >> Yibo
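[Editor's note: the numerically stable "keep more state" route Micah describes fits Yibo's consume/merge/finalize pattern directly. Below is an illustrative Python sketch, not Arrow's actual implementation: each chunk is reduced to (count, mean, M2) with Welford's online update, and two such states combine with the pairwise formula from the Wikipedia page cited as [1] in the thread (Chan et al.), so no chunk has to be revisited. The function names are hypothetical.]

```python
import math

def consume(chunk):
    """One pass over a chunk -> (count, mean, M2) via Welford's update,
    where M2 is the sum of squared deviations from the running mean."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2

def merge(a, b):
    """Combine two (count, mean, M2) states without touching the data
    again (parallel variance formula, Chan et al.)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def finalize_stdev(state, ddof=1):
    """Turn the merged state into a sample standard deviation."""
    count, _, m2 = state
    return math.sqrt(m2 / (count - ddof))
```

The same shape also answers the average case Micah mentions: there the state is just (count, sum), and "merge" is addition. Percentiles remain the hard case, since their exact state grows with the data, which is where sketches like t-digest [2] come in.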