I suspect that any duplicate-insensitive function is very easily splittable, especially one that is its own rollup (as min, max, any_value and single_value all are).
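
To make those two properties concrete, here is a minimal sketch in plain Java. It does not use Calcite's API, and the class and method names are made up for illustration. It reuses the data from the example earlier in the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical demo class; none of these names exist in Calcite.
public class DuplicateInsensitivityDemo {
  /** MIN over a non-empty list. */
  public static int min(List<Integer> values) {
    return Collections.min(values);
  }

  /** SUM over a list. */
  public static int sum(List<Integer> values) {
    int total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  /** Removes duplicates, keeping first-occurrence order. */
  public static List<Integer> distinct(List<Integer> values) {
    return new ArrayList<>(new LinkedHashSet<>(values));
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(1, 1, 2, 2, 3, 5, 5);
    // Two partitions, as if the aggregate had been split across inputs.
    List<Integer> left = data.subList(0, 3);   // {1, 1, 2}
    List<Integer> right = data.subList(3, 7);  // {2, 3, 5, 5}

    // Duplicate-insensitive: same result with or without deduplication.
    System.out.println(min(data) == min(distinct(data)));  // true
    System.out.println(sum(data) == sum(distinct(data)));  // false (19 vs 11)

    // Own rollup: applying the function to the partial results
    // gives the same answer as applying it to all the data.
    System.out.println(
        min(Arrays.asList(min(left), min(right))) == min(data));  // true
    System.out.println(
        sum(Arrays.asList(sum(left), sum(right))) == sum(data));  // true
  }
}
```

The last line shows that the two properties are independent: SUM is its own rollup but is not duplicate-insensitive, whereas MIN is both.
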
If getDistinctOptionality() returns IGNORED, does the author of the function need to write any further code to ensure that calls can be split?

Lastly, I'll note that other systems (e.g. Algebird [1]) allow you to assign aggregate functions to algebraic structures - for example, sum and hyperLogLog are both monoids [2] and can therefore be 'rolled up'. We could do the same in Calcite. We could perhaps allow aggregate functions to declare themselves as belonging to a particular algebraic structure, and we could exploit the properties of those structures to perform optimizations.

Julian

[1] https://twitter.github.io/algebird/
[2] https://en.wikipedia.org/wiki/Monoid

On Tue, Oct 13, 2020 at 7:26 PM Fan Liya <[email protected]> wrote:
>
> Hi Julian,
>
> Thanks again for your feedback.
>
> Since they are duplicate-insensitive, they should also be splittable
> (SqlSplittableAggFunction), just like min/max, etc.
> What do you think?
>
> I want to fire a JIRA accordingly, so that more optimizations can be
> applied.
> Any feedback is appreciated.
>
> Best,
> Liya Fan
>
>
>
> On Wed, Oct 14, 2020 at 2:59 AM Julian Hyde <[email protected]> wrote:
>
> > I agree. ANY_VALUE and SINGLE_VALUE are duplicate-insensitive.
> >
> > > On Oct 13, 2020, at 2:17 AM, Fan Liya <[email protected]> wrote:
> > >
> > > Hi Julian,
> > >
> > > Thanks a lot for your feedback.
> > > I think SqlAggFunction.getDistinctOptionality() is exactly what I
> > > am looking for.
> > >
> > > BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category of
> > > duplicate insensitive functions.
> > > What do you think?
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > >
> > > On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <[email protected]> wrote:
> > >
> > >> We already have this concept. See SqlAggFunction.getDistinctOptionality(),
> > >> added in https://issues.apache.org/jira/browse/CALCITE-3159.
> > >>
> > >> Julian
> > >>
> > >>
> > >>> On Oct 13, 2020, at 12:54 AM, Fan Liya <[email protected]> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I would like to introduce the idea of duplicate insensitive aggregate
> > >>> functions.
> > >>>
> > >>> For such functions, the aggregation results remain the same even after
> > >>> deduplication.
> > >>>
> > >>> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> > >>> aggregation results of MIN are the same regardless of whether we perform
> > >>> data deduplication first. That is,
> > >>>
> > >>> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> > >>>
> > >>> So MIN is a *deduplicate insensitive function*.
> > >>>
> > >>> On the other hand, function SUM is not duplicate insensitive, because
> > >>>
> > >>> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> > >>>
> > >>> The concept of deduplicate insensitiveness can help us in many optimization
> > >>> scenarios.
> > >>>
> > >>> For example, the current implementation of AggregateMergeRule rules out any
> > >>> aggregate calls for which the isDistinct() method returns true. However, for
> > >>> duplicate insensitive functions, the rule should be applicable.
> > >>>
> > >>> Could you please give your valuable feedback?
> > >>>
> > >>> Best,
> > >>> Liya Fan
> > >>
> > >>
> > >
> >
