Hi Pratyoy, We would need more information and context to help like the query you are running and sharing the code you have (possibly put it up in Gerrit). However, from just the information you shared, it feels like you have a problem in the GROUP-BY operator output, e.g. the declared output type of the GROUP-BY and the actual output type produced or something like that. For example, you mentioned that you are aggregating over a, b ("then aggregate tuples on(a,b)"), the original SORT_GROUP_BY has both fields: SORT_GROUP_BY[$$139, $$138] which is part of the group-by operator output: group by ([$$163 := $$139; $$164 := $$138]). In your example, I only see one field: PRE_CLUSTERED_GROUP_BY[$$139] and group by ([$$163 := $$139])
On Tue, Jul 8, 2025 at 8:24 PM Pratyoy Das <praty...@uci.edu> wrote: > Hi all, > I am trying to add a new rule in Asterix DB and would really appreciate > some help/advice in constructing the rule. > Let's say that there is a query: select a,b sum(c) from T group by a,b. > The default way to answer this query is by using a sort group by, and the > physical plan (for the local or global agg) would look something like the > following :- > > > group by ([$$163 := $$139; $$164 := $$138]) decor ([]) { > aggregate [$$155, $$156, $$157, $$158, $$159, > $$160, $$161, $$162] <- [agg-local-sql-sum($$90), agg-local-sql-sum($$95), > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))), > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95, > numeric-subtract(1, $$152)), numeric-add(1, $$154))), > agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152), > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] > -- AGGREGATE |LOCAL| > nested tuple source [cardinality: 0.0, op-cost: > 0.0, total-cost: 0.0] > -- NESTED_TUPLE_SOURCE |LOCAL| > } > -- SORT_GROUP_BY[$$139, $$138] |PARTITIONED| > > Now let's say that we can guarantee a sorted order on attribute a. So, for > the local agg, I want to do a pre-clustered group by on "a", and then do a > micro sort on b and then aggregate tuples on(a,b). I got inspired from how > Asterix DB deals with queries that have DISTINCT and GROUP BY - eg select > a, count(distinct b) from T group by a. The group by looks something like: > group by ([$$o_orderpriority := $$75]) decor ([]) { > aggregate [$$81] <- [agg-sql-count($$76)] [cardinality: > 0.0, op-cost: 0.0, total-cost: 0.0] > -- AGGREGATE |LOCAL| > distinct ([$$76]) [cardinality: 0.0, op-cost: 0.0, > total-cost: 0.0] > -- MICRO_PRE_SORTED_DISTINCT_BY |LOCAL| > order (ASC, $$76) [cardinality: 0.0, op-cost: 0.0, > total-cost: 0.0] > -- MICRO_STABLE_SORT [$$76(ASC)] |LOCAL| > nested tuple source [cardinality: 0.0, op-cost: > 0.0, total-cost: 0.0] > -- NESTED_TUPLE_SOURCE |LOCAL| > } > -- PRE_CLUSTERED_GROUP_BY[$$75] |PARTITIONED| > > I created a rule and currently it fails during runtime (not compilation) > with an error "Cannot invoke > "org.apache.asterix.om.types.ATypeTag.ordinal()" because "sourceTag" is > null". On further investigation, I found out that the tuple rising up from > this group by is malformed. The rewritten group by looks like this: > group by ([$$163 := $$139]) decor ([]) { aggregate [$$155, > $$156, $$157, $$158, $$159, $$160, $$161, $$162] <- > [agg-local-sql-sum($$90), agg-local-sql-sum($$95), > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))), > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95, > numeric-subtract(1, $$152)), numeric-add(1, $$154))), > agg-local-sql-avg($$90), agg-local-sql-avg($$95), agg-local-sql-avg($$152), > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] -- > AGGREGATE |LOCAL| order (ASC, $$138) [cardinality: 0.0, op-cost: 0.0, > total-cost: 0.0] -- MICRO_STABLE_SORT [$$138(ASC)] |LOCAL| nested tuple > source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] -- > NESTED_TUPLE_SOURCE |LOCAL| } -- PRE_CLUSTERED_GROUP_BY[$$139] > |PARTITIONED| > In particular, I think I need three major advices:- > 1. Any idea as to what is going wrong. > 2. Where should I place this rule? Right now I am placing it towards the > end just before execution mode and memory requirements are set and perhaps > it should be introduced much earlier? > 3. How do I introduce functions > like computeAndSetTypeEnvironmentForOperator() and > computeDeliveredPhysicalProperties(). I believe these are key functions to > set for adding new operators later in the compile stage and I couldn't find > good examples to check the order in which we should invoke these functions. > I am happy to provide more details (including the new rule) if necessary. > Thanks a lot for your time! > -- > Best Regards, > Pratyoy > -- Regards,