Hi all,

I am trying to add a new rule in AsterixDB and would really appreciate some help/advice in constructing it.

Let's say that there is a query:

    select a, b, sum(c) from T group by a, b

The default way to answer this query is with a sort group by, and the physical plan (for the local or global agg) looks something like the following:

    group by ([$$163 := $$139; $$164 := $$138]) decor ([]) {
        aggregate [$$155, $$156, $$157, $$158, $$159, $$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95), agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))), agg-local-sql-sum(numeric-multiply(numeric-multiply($$95, numeric-subtract(1, $$152)), numeric-add(1, $$154))), agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152), agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
        -- AGGREGATE |LOCAL|
          nested tuple source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
          -- NESTED_TUPLE_SOURCE |LOCAL|
    }
    -- SORT_GROUP_BY[$$139, $$138] |PARTITIONED|

Now let's say that we can guarantee a sorted order on attribute a. For the local agg, I then want to do a pre-clustered group by on "a", do a micro sort on b within each group, and then aggregate tuples on (a, b).

I was inspired by how AsterixDB handles queries that combine DISTINCT and GROUP BY, e.g.

    select a, count(distinct b) from T group by a

The group by for such a query looks something like:

    group by ([$$o_orderpriority := $$75]) decor ([]) {
        aggregate [$$81] <- [agg-sql-count($$76)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
        -- AGGREGATE |LOCAL|
          distinct ([$$76]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
          -- MICRO_PRE_SORTED_DISTINCT_BY |LOCAL|
            order (ASC, $$76) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
            -- MICRO_STABLE_SORT [$$76(ASC)] |LOCAL|
              nested tuple source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
              -- NESTED_TUPLE_SOURCE |LOCAL|
    }
    -- PRE_CLUSTERED_GROUP_BY[$$75] |PARTITIONED|

I created a rule, and currently it fails at runtime (not at compilation) with the error:

    Cannot invoke "org.apache.asterix.om.types.ATypeTag.ordinal()" because "sourceTag" is null

On further investigation, I found that the tuple coming out of this group by is malformed.
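
To make the intended local aggregation concrete, here is a minimal sketch in plain Python (not AsterixDB code). It assumes the input rows are (a, b, c) tuples that already arrive clustered on a. The function name local_aggregate and the sample rows are made up for illustration, and a plain sum stands in for the local aggregate functions.

```python
from itertools import groupby
from operator import itemgetter


def local_aggregate(rows):
    """Aggregate (a, b, c) rows on (a, b), assuming rows are clustered on a."""
    out = []
    # Outer step: walk the input one a-cluster at a time (pre-clustered group by on a).
    for a, cluster in groupby(rows, key=itemgetter(0)):
        # Inner step: sort only the current cluster on b (the "micro sort").
        sorted_cluster = sorted(cluster, key=itemgetter(1))
        # After the micro sort, equal b values are adjacent, so each run is one (a, b) group.
        for b, run in groupby(sorted_cluster, key=itemgetter(1)):
            out.append((a, b, sum(row[2] for row in run)))
    return out


if __name__ == "__main__":
    # Hypothetical sample data, already ordered on a but not on b.
    sample = [(1, "y", 10), (1, "x", 5), (1, "y", 1), (2, "x", 7)]
    for result in local_aggregate(sample):
        print(result)
```

The point of the sketch is only that each output row carries both grouping keys, a and b, together with the aggregate for that pair.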

The rewritten group by looks like this:

    group by ([$$163 := $$139]) decor ([]) {
        aggregate [$$155, $$156, $$157, $$158, $$159, $$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95), agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))), agg-local-sql-sum(numeric-multiply(numeric-multiply($$95, numeric-subtract(1, $$152)), numeric-add(1, $$154))), agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152), agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
        -- AGGREGATE |LOCAL|
          order (ASC, $$138) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
          -- MICRO_STABLE_SORT [$$138(ASC)] |LOCAL|
            nested tuple source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
            -- NESTED_TUPLE_SOURCE |LOCAL|
    }
    -- PRE_CLUSTERED_GROUP_BY[$$139] |PARTITIONED|

In particular, I think I need advice on three points:

1. Any idea what is going wrong?
2. Where should I place this rule? Right now I place it towards the end, just before the execution mode and memory requirements are set. Perhaps it should be introduced much earlier?
3. How do I introduce functions like computeAndSetTypeEnvironmentForOperator() and computeDeliveredPhysicalProperties()? I believe these are key functions to call when adding new operators late in the compile stage, and I couldn't find good examples showing the order in which they should be invoked.

I am happy to provide more details (including the new rule) if necessary.

Thanks a lot for your time!

--
Best Regards,
Pratyoy