>
> stdev = sqrt(sum_squares/N - (sum/N)^2)

From [1], this is considered the naive algorithm:

> Because SumSq and (Sum×Sum)/n can be very similar numbers, cancellation
> <https://en.wikipedia.org/wiki/Loss_of_significance> can cause the
> precision of the result to be much less than the inherent precision of the
> floating-point arithmetic used to perform the computation. Thus this
> algorithm should not be used in practice, and several alternate,
> numerically stable algorithms have been proposed.
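
For comparison, here is a minimal C++ sketch (my own, not Arrow code) of
Welford's online algorithm together with the pairwise merge step, both
described in [1]; consume and merge are numerically stable and fit the
consume/merge pattern discussed below:

    #include <cmath>
    #include <cstdint>

    // Running variance state: count, mean, and M2 (the sum of squared
    // deviations from the current mean).
    struct VarianceState {
      int64_t n = 0;
      double mean = 0.0;
      double m2 = 0.0;

      // Welford's update for a single value.
      void Consume(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
      }

      // Pairwise combination of two partial states (Chan et al.).
      void Merge(const VarianceState& other) {
        if (other.n == 0) return;
        int64_t total = n + other.n;
        double delta = other.mean - mean;
        mean += delta * other.n / total;
        m2 += other.m2 +
              delta * delta * (static_cast<double>(n) * other.n / total);
        n = total;
      }

      // Population standard deviation; use (n - 1) for the sample version.
      double Stdev() const { return n > 0 ? std::sqrt(m2 / n) : 0.0; }
    };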


On Thu, Sep 17, 2020 at 8:41 PM Andrew Wieteska <andrew.r.wiete...@gmail.com>
wrote:

> Dear all
>
> I'm not sure I'm thinking about this right, but if we're looking to
> leverage vectorization for standard deviation/variance would it make sense
> to compute the sum, the sum of squares, and the total number of data points (N)
> over all chunks and compute the actual function,
>
> stdev = sqrt(sum_squares/N - (sum/N)^2)
>
> only once at the end? This is one of the approaches in [1].
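>
> A rough sketch of what I have in mind (hypothetical names, not the actual
> Arrow kernel API):
>
>   #include <cmath>
>   #include <cstdint>
>
>   struct SumsState {
>     double sum = 0.0;
>     double sum_sq = 0.0;
>     int64_t n = 0;
>     // Per-chunk pass: a plain loop compilers can auto-vectorize under
>     // relaxed floating-point settings (e.g. -ffast-math).
>     void Consume(const double* values, int64_t len) {
>       for (int64_t i = 0; i < len; ++i) {
>         sum += values[i];
>         sum_sq += values[i] * values[i];
>       }
>       n += len;
>     }
>     // Merging per-chunk states is just adding the three accumulators.
>     void Merge(const SumsState& o) {
>       sum += o.sum;
>       sum_sq += o.sum_sq;
>       n += o.n;
>     }
>     // Population stdev, computed once at the end. Note: subject to the
>     // cancellation issue discussed elsewhere in this thread.
>     double Finalize() const {
>       double mean = sum / n;
>       return std::sqrt(sum_sq / n - mean * mean);
>     }
>   };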
>
> Best wishes
> Andrew
>
> On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> >
>> > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
>>
>>
>> This is not numerically stable. Please do not use it.  Please see [1] for
>> some algorithms that might be better.
>>
>> > The equation you provided works well in practice for calculating stdev
>> > over one array. It doesn't address the issue of combining stdev from
>> > multiple arrays.
>>
>>
>> I think what everyone else was implicitly stating is that combining
>> partial results for std. dev. and average requires keeping more state than
>> the final value itself. For the average you need to keep two numbers (sum
>> and count); for std. dev. you need a third (e.g., a sum of squares or a
>> running M2 term).
>>
>> For percentiles, I think calculating them exactly will require quite a lot
>> of state (for integers, a histogram approach could be used to compress
>> this; see the sketch below). There are, however, some good approximation
>> algorithms that can be used if exact values are not necessary (for example
>> t-digest [2]). At some point Arrow should probably have both.
>>
>> [1]
>>
>> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
>> [2] https://github.com/tdunning/t-digest
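>>
>> As a rough illustration of the histogram idea for bounded integer data
>> (a sketch with hypothetical names, not a proposal for the actual API):
>>
>>   #include <cstdint>
>>   #include <vector>
>>
>>   // Exact percentile over values in [0, domain) via a counting histogram.
>>   struct IntPercentileState {
>>     std::vector<int64_t> counts;
>>     explicit IntPercentileState(int domain) : counts(domain, 0) {}
>>     void Consume(const int32_t* values, int64_t len) {
>>       for (int64_t i = 0; i < len; ++i) counts[values[i]]++;
>>     }
>>     // Histograms from different chunks merge by element-wise addition.
>>     void Merge(const IntPercentileState& o) {
>>       for (size_t i = 0; i < counts.size(); ++i) counts[i] += o.counts[i];
>>     }
>>     // Nearest-rank style: smallest value whose cumulative count exceeds
>>     // q * total, for q in [0, 1).
>>     int32_t Finalize(double q) const {
>>       int64_t total = 0;
>>       for (int64_t c : counts) total += c;
>>       int64_t target = static_cast<int64_t>(q * total);
>>       int64_t seen = 0;
>>       for (size_t v = 0; v < counts.size(); ++v) {
>>         seen += counts[v];
>>         if (seen > target) return static_cast<int32_t>(v);
>>       }
>>       return static_cast<int32_t>(counts.size() - 1);
>>     }
>>   };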
>>
>> On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
>>
>> > Thanks Andrew. The link gives a cool method to calculate variance
>> > incrementally. I think the problem is that it's computationally too
>> > expensive (it cannot leverage vectorization and needs three divisions
>> > per data point).
>> > The equation you provided works well in practice for calculating stdev
>> > over one array. It doesn't address the issue of combining stdev from
>> > multiple arrays.
>> >
>> > On 9/16/20 6:25 PM, Andrew Lamb wrote:
>> > > Perhaps you can rewrite the functions in terms of other kernels that
>> > > can be merged -- for example something like the following
>> > >
>> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
>> > >
>> > > (loosely translated from
>> > > https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
>> > > )
>> > >
>> > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> I have a question about aggregate kernel implementation. Any help is
>> > >> appreciated.
>> > >>
>> > >> Aggregate kernels implement "consume" and "merge" interfaces. For a
>> > >> chunked array, "consume" is called for each array to get a temporary
>> > >> aggregated result, which is then "merged" with the previously consumed
>> > >> result. For associative operations like min/max/sum, this pattern is
>> > >> convenient. We can easily "merge" min/max/sum of two arrays, e.g.,
>> > >> sum([array_a, array_b]) = sum(array_a) + sum(array_b).
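>> > >>
>> > >> In pseudo-C++ the pattern looks roughly like this (hypothetical
>> > >> names, not the actual kernel interface):
>> > >>
>> > >>   #include <cstdint>
>> > >>
>> > >>   struct SumState {
>> > >>     double sum = 0.0;
>> > >>     // Called once per chunk of the chunked array.
>> > >>     void Consume(const double* values, int64_t len) {
>> > >>       for (int64_t i = 0; i < len; ++i) sum += values[i];
>> > >>     }
>> > >>     // Sum is associative, so merging partial results is trivial.
>> > >>     void Merge(const SumState& other) { sum += other.sum; }
>> > >>   };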
>> > >>
>> > >> But I wonder what's the best approach for operations like
>> > >> stdev/percentile. Results of these operations cannot be easily
>> > >> "merged"; we have to walk through all the chunks to get the result.
>> > >> For these operations, it looks like "consume" must copy the input
>> > >> array and do all the calculation at "finalize" time. Or perhaps we
>> > >> shouldn't expect them to support chunked arrays at all.
>> > >>
>> > >> Yibo
>> > >>
>> > >
>> >
>>
>
