Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Julian Hyde Tue, 13 Oct 2020 22:20:01 -0700

I suspect that any duplicate-insensitive function is very easily
splittable. Especially ones that are their own rollup (which is true
of min, max, any_value, single_value).
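
To make the "own rollup" point concrete, here is a minimal sketch in plain
Java. It is not Calcite code: the class RollupSketch and its helper methods
are hypothetical names used only for illustration. It reuses the sample data
from the original message and checks that MIN of the partial MINs equals MIN
of the whole input, and that MIN, unlike SUM, survives deduplication.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration class; nothing here is part of Calcite.
public class RollupSketch {
    /** MIN over a non-empty list. */
    public static int min(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).min().getAsInt();
    }

    /** SUM over a list. */
    public static int sum(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    /** Removes duplicates, keeping first occurrences in order. */
    public static List<Integer> dedup(List<Integer> xs) {
        return new ArrayList<>(new LinkedHashSet<>(xs));
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 1, 2, 2, 3, 5, 5);
        List<Integer> left = data.subList(0, 3);   // {1, 1, 2}
        List<Integer> right = data.subList(3, 7);  // {2, 3, 5, 5}

        // MIN is its own rollup: MIN of the partial MINs is MIN of everything.
        int rolledUpMin = min(Arrays.asList(min(left), min(right)));
        System.out.println(rolledUpMin == min(data));        // true

        // MIN is duplicate-insensitive.
        System.out.println(min(dedup(data)) == min(data));   // true

        // SUM is not duplicate-insensitive: 11 after dedup versus 19 before.
        System.out.println(sum(dedup(data)) == sum(data));   // false
    }
}
```

Note that SUM still rolls up (via SUM of partial SUMs) even though it is not
duplicate-insensitive, so the two properties are related but distinct.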


If getDistinctOptionality() returns IGNORED, does the author of the
function need to write any further code to ensure that calls can be
split?

Lastly, I'll note that other systems (e.g. Algebird [1]) allow you to
assign aggregate functions to algebraic structures - for example, sum
and hyperLogLog are both monoids [2] and can therefore be 'rolled up'.
We could do the same in Calcite. We could perhaps allow aggregate
functions to declare themselves as belonging to a particular algebraic
structure, and we could exploit the properties of those structures to
perform optimizations.
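
As a rough sketch of what such a declaration might look like, the Java below
defines a hypothetical Monoid interface (an identity element plus an
associative combine) with SUM and MIN instances. None of these names exist in
Calcite or Algebird; they are assumptions made for illustration. The rollUp
method relies only on the monoid laws, which is the kind of property an
optimizer could exploit.

```java
import java.util.List;

// Hypothetical illustration; not an existing Calcite or Algebird API.
public class MonoidSketch {
    /** An associative combine operation with an identity element. */
    public interface Monoid<T> {
        T identity();
        T combine(T a, T b);
    }

    /** SUM as a monoid: identity 0, combine is addition. */
    public static final Monoid<Long> SUM = new Monoid<Long>() {
        public Long identity() { return 0L; }
        public Long combine(Long a, Long b) { return a + b; }
    };

    /** MIN as a monoid: identity is the largest long, combine is Math.min. */
    public static final Monoid<Long> MIN = new Monoid<Long>() {
        public Long identity() { return Long.MAX_VALUE; }
        public Long combine(Long a, Long b) { return Math.min(a, b); }
    };

    /** Aggregates one list of values. */
    public static <T> T fold(Monoid<T> m, List<T> xs) {
        T acc = m.identity();
        for (T x : xs) {
            acc = m.combine(acc, x);
        }
        return acc;
    }

    /** Aggregates each partition separately, then combines the partial results. */
    public static <T> T rollUp(Monoid<T> m, List<List<T>> partitions) {
        T acc = m.identity();
        for (List<T> p : partitions) {
            acc = m.combine(acc, fold(m, p));
        }
        return acc;
    }
}
```

In this sketch MIN's combine is also idempotent (combine(a, a) returns a),
while SUM's is not, which lines up with MIN being duplicate-insensitive and
SUM not.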

Julian

[1] https://twitter.github.io/algebird/

[2] https://en.wikipedia.org/wiki/Monoid

On Tue, Oct 13, 2020 at 7:26 PM Fan Liya <[email protected]> wrote:
>
> Hi Julian,
>
> Thanks again for your feedback.
>
> Since they are duplicate-insensitive, they should also be splittable
> (SqlSplittableAggFunction), just like min/max, etc.
> What do you think?
>
> I want to file a JIRA issue accordingly, so that more optimizations can
> be applied.
> Any feedback is appreciated.
>
> Best,
> Liya Fan
>
>
>
> On Wed, Oct 14, 2020 at 2:59 AM Julian Hyde <[email protected]> wrote:
>
> > I agree. ANY_VALUE and SINGLE_VALUE are duplicate-insensitive.
> >
> > > On Oct 13, 2020, at 2:17 AM, Fan Liya <[email protected]> wrote:
> > >
> > > Hi Julian,
> > >
> > > Thanks a lot for your feedback.
> > > I think SqlAggFunction.getDistinctOptionality() is exactly what I
> > > am looking for.
> > >
> > > BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category of
> > > duplicate insensitive functions.
> > > What do you think?
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > >
> > > > On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <[email protected]> wrote:
> > >
> > >> We already have this concept. See
> > >> SqlAggFunction.getDistinctOptionality(), added in
> > >> https://issues.apache.org/jira/browse/CALCITE-3159.
> > >>
> > >> Julian
> > >>
> > >>
> > >>> On Oct 13, 2020, at 12:54 AM, Fan Liya <[email protected]> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I would like to introduce the idea of duplicate insensitive aggregate
> > >>> functions.
> > >>>
> > >>> For such functions, the aggregation results remain the same even after
> > >>> deduplication.
> > >>>
> > >>> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> > >>> aggregation results of MIN are the same regardless of whether we
> > >>> perform data deduplication first. That is,
> > >>>
> > >>> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> > >>>
> > >>> So MIN is a *duplicate insensitive function*.
> > >>>
> > >>> On the other hand, function SUM is not duplicate insensitive, because
> > >>>
> > >>> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> > >>>
> > >>> The concept of duplicate insensitivity can help us in many
> > >>> optimization scenarios.
> > >>>
> > >>> For example, the current implementation of AggregateMergeRule rules
> > >>> out any aggregate calls for which the isDistinct() method returns
> > >>> true. However, for duplicate insensitive functions, the rule should
> > >>> be applicable.
> > >>>
> > >>> Could you please give your valuable feedback?
> > >>>
> > >>> Best,
> > >>> Liya Fan
> > >>
> > >>
> >
> >
