Adding a new rule to do a micro sort for group by on multiple columns

Pratyoy Das Tue, 08 Jul 2025 20:24:58 -0700

Hi all,
I am trying to add a new rule in AsterixDB and would really appreciate
some help/advice in constructing the rule.
Let's say that there is a query: select a, b, sum(c) from T group by a, b.
The default way to answer this query is by using a sort group by, and the
physical plan (for the local or global agg) would look something like the
following:



              group by ([$$163 := $$139; $$164 := $$138]) decor ([]) {
                        aggregate [$$155, $$156, $$157, $$158, $$159,
$$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95),
agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))),
agg-local-sql-sum(numeric-multiply(numeric-multiply($$95,
numeric-subtract(1, $$152)), numeric-add(1, $$154))),
agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152),
agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                        -- AGGREGATE  |LOCAL|
                          nested tuple source [cardinality: 0.0, op-cost:
0.0, total-cost: 0.0]
                          -- NESTED_TUPLE_SOURCE  |LOCAL|
                     }
              -- SORT_GROUP_BY[$$139, $$138]  |PARTITIONED|

Now let's say that we can guarantee a sorted order on attribute a. So, for
the local agg, I want to do a pre-clustered group by on "a", then do a
micro sort on b within each group, and then aggregate tuples on (a, b). I was
inspired by how AsterixDB deals with queries that combine DISTINCT and
GROUP BY, e.g. select a, count(distinct b) from T group by a. The group by
looks something like:
group by ([$$o_orderpriority := $$75]) decor ([]) {
                    aggregate [$$81] <- [agg-sql-count($$76)] [cardinality:
0.0, op-cost: 0.0, total-cost: 0.0]
                    -- AGGREGATE  |LOCAL|
                      distinct ([$$76]) [cardinality: 0.0, op-cost: 0.0,
total-cost: 0.0]
                      -- MICRO_PRE_SORTED_DISTINCT_BY  |LOCAL|
                        order (ASC, $$76) [cardinality: 0.0, op-cost: 0.0,
total-cost: 0.0]
                        -- MICRO_STABLE_SORT [$$76(ASC)]  |LOCAL|
                          nested tuple source [cardinality: 0.0, op-cost:
0.0, total-cost: 0.0]
                          -- NESTED_TUPLE_SOURCE  |LOCAL|
                 }
          -- PRE_CLUSTERED_GROUP_BY[$$75]  |PARTITIONED|
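
To make the intended local aggregation concrete, here is a minimal
standalone sketch in plain Java of the behavior I am after. It is not
Algebricks code and does not use any AsterixDB API; the class name
MicroSortGroupBy, the method groupBySum, and the {a, b, c} row layout are
hypothetical and chosen only for illustration. It assumes the input is
already sorted (or at least clustered) on a.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MicroSortGroupBy {

    // Each row is {a, b, c}. Rows are assumed to arrive clustered on a.
    // Returns one {a, b, sum(c)} row per distinct (a, b) pair.
    static List<long[]> groupBySum(List<long[]> rows) {
        List<long[]> out = new ArrayList<>();
        int start = 0;
        while (start < rows.size()) {
            // Pre-clustered step: find the run of rows sharing the same a.
            long a = rows.get(start)[0];
            int end = start;
            while (end < rows.size() && rows.get(end)[0] == a) {
                end++;
            }
            // Micro sort step: sort only this run on b.
            List<long[]> cluster = new ArrayList<>(rows.subList(start, end));
            cluster.sort(Comparator.comparingLong((long[] r) -> r[1]));
            // Aggregate step: sum c over each run of equal b in the sorted run.
            int i = 0;
            while (i < cluster.size()) {
                long b = cluster.get(i)[1];
                long sum = 0;
                while (i < cluster.size() && cluster.get(i)[1] == b) {
                    sum += cluster.get(i)[2];
                    i++;
                }
                out.add(new long[] {a, b, sum});
            }
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> rows = List.of(
                new long[] {1, 2, 10},
                new long[] {1, 1, 5},
                new long[] {1, 2, 7},
                new long[] {2, 3, 1},
                new long[] {2, 3, 4});
        for (long[] g : groupBySum(rows)) {
            System.out.println(g[0] + ", " + g[1] + ", " + g[2]);
        }
    }
}
```

Note that the output carries both a and b for every group, so whatever the
rewritten operator produces has to expose both grouping keys to the
operators above it.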

I created a rule, and currently it fails at runtime (not at compilation)
with the error "Cannot invoke
"org.apache.asterix.om.types.ATypeTag.ordinal()" because "sourceTag" is
null". On further investigation, I found that the tuple coming out of
this group by is malformed. The rewritten group by looks like this:
              group by ([$$163 := $$139]) decor ([]) {
                        aggregate [$$155, $$156, $$157, $$158, $$159,
$$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95),
agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))),
agg-local-sql-sum(numeric-multiply(numeric-multiply($$95,
numeric-subtract(1, $$152)), numeric-add(1, $$154))),
agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152),
agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                        -- AGGREGATE  |LOCAL|
                          order (ASC, $$138) [cardinality: 0.0, op-cost: 0.0,
total-cost: 0.0]
                          -- MICRO_STABLE_SORT [$$138(ASC)]  |LOCAL|
                            nested tuple source [cardinality: 0.0, op-cost:
0.0, total-cost: 0.0]
                            -- NESTED_TUPLE_SOURCE  |LOCAL|
                     }
              -- PRE_CLUSTERED_GROUP_BY[$$139]  |PARTITIONED|

In particular, I would appreciate advice on three points:
1. Any idea what is going wrong?
2. Where should I place this rule? Right now I am placing it towards the
end, just before the execution mode and memory requirements are set. Perhaps
it should be introduced much earlier?
3. How do I invoke functions
like computeAndSetTypeEnvironmentForOperator() and
computeDeliveredPhysicalProperties()? I believe these are key functions to
call when adding new operators late in the compile stage, and I couldn't find
good examples showing the order in which they should be invoked.
I am happy to provide more details (including the new rule) if necessary.
Thanks a lot for your time!
-- 
Best Regards,
Pratyoy
