> > stdev = sqrt(sum_squares/N - (sum/N)^2)
From [1] this is considered the naive algorithm:

    Because SumSq and (Sum×Sum)/n can be very similar numbers, cancellation
    (https://en.wikipedia.org/wiki/Loss_of_significance) can lead to the
    precision (https://en.wikipedia.org/wiki/Precision_(arithmetic)) of the
    result to be much less than the inherent precision of the floating-point
    arithmetic (https://en.wikipedia.org/wiki/Floating-point_arithmetic) used
    to perform the computation. Thus this algorithm should not be used in
    practice, and several alternate, numerically stable, algorithms have been
    proposed.
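To make the cancellation concrete, here is a small standalone illustration
(plain C++, arbitrary values, nothing Arrow-specific). When the data sit far
from zero, sum_squares/N and (sum/N)^2 agree in almost all of their leading
digits, so the subtraction discards exactly the digits that carry the
variance:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Four values with a large common offset; the true population variance is 22.5.
  std::vector<double> data = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};

  double sum = 0.0, sum_sq = 0.0;
  for (double x : data) {
    sum += x;
    sum_sq += x * x;
  }
  double n = static_cast<double>(data.size());

  // Naive single-pass formula: two nearly equal ~1e18 terms are subtracted,
  // so most of the significant digits cancel.
  double naive_var = sum_sq / n - (sum / n) * (sum / n);

  // Two-pass ("shifted") computation for comparison: subtract the mean first.
  double mean = sum / n;
  double m2 = 0.0;
  for (double x : data) m2 += (x - mean) * (x - mean);
  double two_pass_var = m2 / n;

  // The naive result is typically far from 22.5 (it can even come out negative).
  std::printf("naive:    %f\n", naive_var);
  std::printf("two-pass: %f\n", two_pass_var);
  return 0;
}

The two-pass version is stable but needs the mean up front; the algorithms in
[1] are essentially ways of getting the same effect in a single, mergeable
pass.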
On Thu, Sep 17, 2020 at 8:41 PM Andrew Wieteska <andrew.r.wiete...@gmail.com>
wrote:

> Dear all
>
> I'm not sure I'm thinking about this right, but if we're looking to
> leverage vectorization for standard deviation/variance, would it make sense
> to compute the sum, the sum of squares, and the total number of data points
> (N) over all chunks and compute the actual function,
>
> stdev = sqrt(sum_squares/N - (sum/N)^2)
>
> only once at the end? This is one of the approaches in [1].
>
> Best wishes
> Andrew
>
> On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1))
>>
>> This is not numerically stable. Please do not use it. Please see [1] for
>> some algorithms that might be better.
>>
>> > The equation you provided is great in practice to calculate stdev for
>> > one array. It doesn't address the issue of combining stdev from
>> > multiple arrays.
>>
>> I think what everyone else was potentially stating implicitly is that for
>> combining details about arrays, for std. dev. and average there needs to
>> be more state kept that is different from the elements one is actually
>> dealing with. For std. dev. you need to keep two numbers (same with
>> average).
>>
>> For percentiles, I think calculating exactly will require quite a large
>> state (for integers a histogram approach could be used to compress this).
>> There are, however, some good approximation algorithms that can be used
>> if exact values are not necessary (for example t-digest [2]). At some
>> point Arrow should probably have both.
>>
>> [1]
>> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
>> [2] https://github.com/tdunning/t-digest
>>
>> On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
>>
>> > Thanks Andrew. The link gives a cool method to calculate variance
>> > incrementally. I think the problem is that it's computationally too
>> > expensive (it cannot leverage vectorization, and needs three divisions
>> > for a single data point).
>> > The equation you provided is great in practice to calculate stdev for
>> > one array. It doesn't address the issue of combining stdev from
>> > multiple arrays.
>> >
>> > On 9/16/20 6:25 PM, Andrew Lamb wrote:
>> > > Perhaps you can rewrite the functions in terms of other kernels that
>> > > can be merged -- for example something like the following
>> > >
>> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1))
>> > >
>> > > (loosely translated from
>> > > https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
>> > > )
>> > >
>> > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> I have a question about aggregate kernel implementation. Any help is
>> > >> appreciated.
>> > >>
>> > >> Aggregate kernels implement "consume" and "merge" interfaces. For a
>> > >> chunked array, "consume" is called for each array to get a temporary
>> > >> aggregated result, then "merge" it with the previously consumed
>> > >> result. For associative operations like min/max/sum, this pattern is
>> > >> convenient. We can easily "merge" the min/max/sum of two arrays,
>> > >> e.g., sum([array_a, array_b]) = sum(array_a) + sum(array_b).
>> > >>
>> > >> But I wonder what's the best approach to deal with operations like
>> > >> stdev/percentile. Results of these operations cannot be easily
>> > >> "merged". We have to walk through all the chunks to get the result.
>> > >> For these operations, it looks like "consume" must copy the input
>> > >> array and do all the calculation once at "finalize" time. Or we
>> > >> don't expect chunked arrays to be supported for them.
>> > >>
>> > >> Yibo
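Coming back to the original consume/merge question: below is a rough sketch
(plain C++; the names and the three-function shape are illustrative, not the
actual Arrow kernel interface) of the kind of extra state discussed above.
"Consume" reduces each chunk to (count, mean, M2) using Welford's update,
"merge" combines two such states with the pairwise formula from [1], and
"finalize" only looks at the merged state, so no chunk has to be copied or
revisited.

#include <cmath>
#include <vector>

// Partial state for variance/stddev: enough to merge chunks exactly,
// without keeping the raw values around.
struct VarState {
  double count = 0.0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the current mean
};

// "Consume": reduce one chunk to a VarState (Welford's online update).
VarState Consume(const std::vector<double>& chunk) {
  VarState s;
  for (double x : chunk) {
    s.count += 1.0;
    double delta = x - s.mean;
    s.mean += delta / s.count;
    s.m2 += delta * (x - s.mean);
  }
  return s;
}

// "Merge": combine two partial states (pairwise formula from [1]).
VarState Merge(const VarState& a, const VarState& b) {
  VarState out;
  out.count = a.count + b.count;
  if (out.count == 0.0) return out;
  double delta = b.mean - a.mean;
  out.mean = a.mean + delta * (b.count / out.count);
  out.m2 = a.m2 + b.m2 + delta * delta * (a.count * b.count / out.count);
  return out;
}

// "Finalize": population standard deviation (use count - 1 for the sample form).
double FinalizeStddev(const VarState& s) {
  return s.count > 0.0 ? std::sqrt(s.m2 / s.count) : 0.0;
}

int main() {
  // Hypothetical chunked array: two chunks of the same logical column.
  VarState a = Consume({1.0, 2.0, 3.0, 4.0});
  VarState b = Consume({5.0, 6.0});
  double stddev = FinalizeStddev(Merge(a, b));  // same result as one pass over {1,...,6}
  return stddev > 0.0 ? 0 : 1;
}

Merging stays O(1) per chunk, like sum/min/max, and nothing has to be
buffered for "finalize"; the cost is that the per-chunk loop is harder to
vectorize than a plain sum/sum-of-squares pass, which is the trade-off Yibo
raised.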