Dear all

I'm not sure I'm thinking about this right, but if we're looking to
leverage vectorization for standard deviation/variance, would it make sense
to compute the sum, the sum of squares, and the total number of data points
(N) over all chunks, and compute the actual function,

stdev = sqrt(sum_squares/N - (sum/N)^2)

only once at the end? This is one of the approaches in [1].
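A minimal sketch of that idea in plain Python (for illustration only; the function names consume/merge/finalize just mirror the aggregate-kernel interfaces discussed below, not Arrow's actual API). Each chunk contributes only (sum, sum_of_squares, count), merging is plain addition, and the sqrt happens once at the end. Note that this shifted formula can lose precision when the mean is large relative to the variance:

```python
import math

def consume(chunk):
    """One pass over a chunk: return (sum, sum_of_squares, count)."""
    s = sum(chunk)
    ss = sum(x * x for x in chunk)
    return s, ss, len(chunk)

def merge(a, b):
    """Merging partial states is just element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(state):
    """Population stdev: sqrt(sum_squares/N - (sum/N)^2), computed once."""
    s, ss, n = state
    return math.sqrt(ss / n - (s / n) ** 2)

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0]]
state = (0.0, 0.0, 0)
for c in chunks:
    state = merge(state, consume(c))

# Same result as computing the stdev over the concatenated data [1..5]:
print(finalize(state))  # ≈ 1.4142135623730951 (sqrt(2))
```

The per-chunk sums are trivially vectorizable, and the merge step is associative, which is what makes this fit the consume/merge pattern.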

Best wishes
Andrew

On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> >
> > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1)))
>
>
> This is not numerically stable. Please do not use it.  Please see [1] for
> some algorithms that might be better.
>
> The equation you provided is great in practice to calculate stdev for one
> > array. It doesn't address the issue of combining stdev from multiple
> arrays.
>
>
> I think what everyone else was implicitly stating is that combining
> results across arrays requires keeping more state than the statistic one
> is actually computing.  For std. dev. you need to keep two numbers plus
> the count (similarly for average).
>
> For percentiles, I think calculating exactly will require quite a large
> state (for integers a histogram approach could be used to compress this).
> There are however some good approximation algorithms that can be used if
> exact values are not necessary (for example t-digest [2]).  At some point
> Arrow should probably have both.
>
> [1]
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
> [2] https://github.com/tdunning/t-digest
>
> On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
>
> > Thanks Andrew. The link gives a cool method to calculate variance
> > incrementally. I think the problem is that it's computationally too
> > expensive (cannot leverage vectorization, three divisions for a single
> data
> > point).
> > The equation you provided is great in practice to calculate stdev for one
> > array. It doesn't address the issue of combining stdev from multiple
> arrays.
> >
> > On 9/16/20 6:25 PM, Andrew Lamb wrote:
> > > Perhaps you can rewrite the functions in terms of other kernels that
> can
> > be
> > > merged -- for example something like the following
> > >
> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x))/(count(x)-1)))
> > >
> > > (loosely translated from
> > >
> >
> https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
> > > )
> > >
> > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I have a question about aggregate kernel implementation. Any help is
> > >> appreciated.
> > >>
> > >> Aggregate kernel implements "consume" and "merge" interfaces. For a
> > >> chunked array, "consume" is called for each array to get a temporary
> > >> aggregated result, then "merge" it with previously consumed result.
> For
> > >> associative operations like min/max/sum, this pattern is convenient.
> We
> > can
> > >> easily "merge" min/max/sum of two arrays, e.g, sum([array_a,
> array_b]) =
> > >> sum(array_a) + sum(array_b).
> > >>
> > >> But I wonder what's the best approach to deal with operations like
> > >> stdev/percentile. Results of these operations cannot be easily
> > "merged". We
> > >> have to walk through all the chunks to get the result. For these
> > >> operations, it looks like "consume" must copy the input array and do
> > >> all calculation once at "finalize" time. Or we don't expect them to
> > >> support chunked arrays.
> > >>
> > >> Yibo
> > >>
> > >
> >
>