Re: Adding a new rule to do a micro sort for group by on multiple columns

Ian Maxon Thu, 17 Jul 2025 09:16:54 -0700

Pratyoy and I met with Ali to get his help figuring out what is going
wrong with this rewrite. We looked at the rewrites that run before the
Micro Sort on b is introduced. The main issue is that the
implementation of Preclustered Group By expects the individual records
of each group it emits to arrive totally sorted, not just sorted on a.
Removing the sorting attribute can't change this, since it is part of
what the operator's implementation expects. Introducing the Micro Sort
could fix it, but in the current rewrite the sort comes after, not
before, the part of Preclustered Group By that expects these sorting
properties. The idea that came up was either to make a modified
version of Preclustered Group By that can sort the partially sorted
incoming records itself, or to introduce a micro or streaming sort
ahead of it so that it can work properly.
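
To make the ordering problem concrete, here is a toy Python sketch. It
is not AsterixDB code: the function names and the sample rows are made
up for illustration, and the "preclustered" grouping is simply modeled
as emitting a group whenever the key changes between adjacent rows.
Under that assumption, input sorted only on a yields fragmented (a, b)
groups, while sorting b within each run of a before grouping does not.

```python
from itertools import groupby
from operator import itemgetter


def preclustered_group_sum(rows, key_cols):
    """Emit (key, sum(c)) each time the grouping key changes.

    Toy model of a streaming group-by: it is only correct when rows
    with equal keys are adjacent in the input.
    """
    key = itemgetter(*key_cols)
    return [(k, sum(r[2] for r in grp)) for k, grp in groupby(rows, key=key)]


def micro_sort_within_runs(rows, run_col, sort_col):
    """Sort on sort_col inside each run of equal run_col values."""
    out = []
    for _, run in groupby(rows, key=itemgetter(run_col)):
        out.extend(sorted(run, key=itemgetter(sort_col)))
    return out


# Rows are (a, b, c): sorted on a only, b arrives in arbitrary order.
rows = [("A", "y", 1), ("A", "x", 2), ("A", "y", 3), ("B", "x", 4), ("B", "x", 5)]

# Grouping on (a, b) directly splits ("A", "y") into two partial groups.
fragmented = preclustered_group_sum(rows, (0, 1))

# Sorting b within each run of a first gives one group per (a, b).
fixed = preclustered_group_sum(micro_sort_within_runs(rows, 0, 1), (0, 1))

print(fragmented)
print(fixed)
```

The point of the sketch is only the placement: the sort has to sit
below (before) the operator that assumes the ordering, not above it.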


On Fri, Jul 11, 2025 at 12:35 AM Pratyoy Das <praty...@uci.edu> wrote:
>
> Hi Ali,
> Thanks for your response. Indeed there is a mismatch, i.e. I want to declare 
> the group by on just the first attribute, but aggregate on all the (group by) 
> attributes. I was hoping that there is some way I can do it.
> The query for which I am testing is TPC-H Q1 which is something like
> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
> from lineitem
> where l_shipdate <= date '1998-12-01' - interval '90' day
> group by l_returnflag,  l_linestatus
> Here, let's say l_returnflag is pre-sorted so rather than sorting the 
> combination of (l_returnflag, l_linestatus), I want to just sort l_linestatus 
> for each individual l_returnflag value. So I feel like I need to declare the 
> pre clustered group by on just l_returnflag so that we wait for all tuples 
> with the same l_returnflag value to arrive, but then once they have arrived, 
> we aggregate on the combination of (l_returnflag, l_linestatus). I know it 
> looks weird but I hope you see what I am trying to do. I am not sure how much 
> help it will be, but I am attaching the rule that I had come up with.
> Thanks!
>
>
>
>
> On Thu, Jul 10, 2025 at 7:25 PM Ali Alsuliman <ali.al.solai...@gmail.com> wrote:
>>
>> Hi Pratyoy,
>> We would need more information and context to help, such as the query you
>> are running and the code you have (possibly put it up in Gerrit).
>> However, from just the information you shared, it feels like you have a
>> problem in the GROUP-BY operator output, e.g. the declared output type of
>> the GROUP-BY and the actual output type produced or something like that.
>> For example, you mentioned that you are aggregating over a, b ("then
>> aggregate tuples on(a,b)"), the original SORT_GROUP_BY has both fields:
>> SORT_GROUP_BY[$$139, $$138] which is part of the group-by operator
>> output: group by ([$$163 := $$139; $$164 := $$138]).
>> In your example, I only see one field: PRE_CLUSTERED_GROUP_BY[$$139] and
>> group by ([$$163 := $$139])
>>
>> On Tue, Jul 8, 2025 at 8:24 PM Pratyoy Das <praty...@uci.edu> wrote:
>>
>> > Hi all,
>> > I am trying to add a new rule in AsterixDB and would really appreciate
>> > some help/advice in constructing the rule.
>> > Let's say that there is a query: select a, b, sum(c) from T group by a, b.
>> > The default way to answer this query is by using a sort group by, and the
>> > physical plan (for the local or global agg) would look something like the
>> > following:
>> >
>> >
>> >               group by ([$$163 := $$139; $$164 := $$138]) decor ([]) {
>> >                         aggregate [$$155, $$156, $$157, $$158, $$159,
>> > $$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95),
>> > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))),
>> > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95,
>> > numeric-subtract(1, $$152)), numeric-add(1, $$154))),
>> > agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152),
>> > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
>> >                         -- AGGREGATE  |LOCAL|
>> >                           nested tuple source [cardinality: 0.0, op-cost:
>> > 0.0, total-cost: 0.0]
>> >                           -- NESTED_TUPLE_SOURCE  |LOCAL|
>> >                      }
>> >               -- SORT_GROUP_BY[$$139, $$138]  |PARTITIONED|
>> >
>> > Now let's say that we can guarantee a sorted order on attribute a. So, for
>> > the local agg, I want to do a pre-clustered group by on "a", then do a
>> > micro sort on b, and then aggregate tuples on(a,b). I was inspired by how
>> > AsterixDB deals with queries that have DISTINCT and GROUP BY, e.g. select
>> > a, count(distinct b) from T group by a. The group by looks something like:
>> > group by ([$$o_orderpriority := $$75]) decor ([]) {
>> >                     aggregate [$$81] <- [agg-sql-count($$76)] [cardinality:
>> > 0.0, op-cost: 0.0, total-cost: 0.0]
>> >                     -- AGGREGATE  |LOCAL|
>> >                       distinct ([$$76]) [cardinality: 0.0, op-cost: 0.0,
>> > total-cost: 0.0]
>> >                       -- MICRO_PRE_SORTED_DISTINCT_BY  |LOCAL|
>> >                         order (ASC, $$76) [cardinality: 0.0, op-cost: 0.0,
>> > total-cost: 0.0]
>> >                         -- MICRO_STABLE_SORT [$$76(ASC)]  |LOCAL|
>> >                           nested tuple source [cardinality: 0.0, op-cost:
>> > 0.0, total-cost: 0.0]
>> >                           -- NESTED_TUPLE_SOURCE  |LOCAL|
>> >                  }
>> >           -- PRE_CLUSTERED_GROUP_BY[$$75]  |PARTITIONED|
>> >
>> > I created a rule and currently it fails during runtime (not compilation)
>> > with an error "Cannot invoke
>> > "org.apache.asterix.om.types.ATypeTag.ordinal()" because "sourceTag" is
>> > null". On further investigation, I found out that the tuple rising up from
>> > this group by is malformed. The rewritten group by looks like this:
>> >             group by ([$$163 := $$139]) decor ([]) {
>> >                       aggregate [$$155, $$156, $$157, $$158, $$159,
>> > $$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95),
>> > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))),
>> > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95,
>> > numeric-subtract(1, $$152)), numeric-add(1, $$154))),
>> > agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152),
>> > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
>> >                       -- AGGREGATE  |LOCAL|
>> >                         order (ASC, $$138) [cardinality: 0.0, op-cost: 0.0,
>> > total-cost: 0.0]
>> >                         -- MICRO_STABLE_SORT [$$138(ASC)]  |LOCAL|
>> >                           nested tuple source [cardinality: 0.0, op-cost:
>> > 0.0, total-cost: 0.0]
>> >                           -- NESTED_TUPLE_SOURCE  |LOCAL|
>> >                    }
>> >             -- PRE_CLUSTERED_GROUP_BY[$$139]  |PARTITIONED|
>> > In particular, I need advice on three things:
>> > 1. Any idea as to what is going wrong?
>> > 2. Where should I place this rule? Right now I am placing it towards the
>> > end, just before execution mode and memory requirements are set. Should
>> > it perhaps be introduced much earlier?
>> > 3. How do I use functions like
>> > computeAndSetTypeEnvironmentForOperator() and
>> > computeDeliveredPhysicalProperties()? I believe these are key functions to
>> > call when adding new operators late in the compile stage, and I couldn't
>> > find good examples showing the order in which they should be invoked.
>> > I am happy to provide more details (including the new rule) if necessary.
>> > Thanks a lot for your time!
>> > --
>> > Best Regards,
>> > Pratyoy
>> >
>>
>>
>> --
>> Regards,
>
>
>
> --
> Best Regards,
> Pratyoy
