Hi Ali, Thanks for your response. Indeed there is a mismatch, i.e. I want to declare the group by on just the first attribute, but aggregate on all the (group by) attributes. I was hoping that there is some way I can do it. The query for which I am testing is TPC-H Q1 which is something like select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '90' day group by l_returnflag, l_linestatus Here, let's say l_returnflag is pre-sorted so rather than sorting the combination of (l_returnflag, l_linestatus), I want to just sort l_linestatus for each individual l_returnflag value. So I feel like I need to declare the pre clustered group by on just l_returnflag so that we wait for all tuples with the same l_returnflag value to arrive, but then once they have arrived, we aggregate on the combination of (l_returnflag, l_linestatus). I know it looks weird but I hope you see what I am trying to do. I am not sure how much help it will be, but I am attaching the rule that I had come up with. Thanks!
On Thu, Jul 10, 2025 at 7:25 PM Ali Alsuliman <ali.al.solai...@gmail.com> wrote: > Hi Pratyoy, > We would need more information and context to help like the query you are > running and sharing the code you have (possibly put it up in Gerrit). > However, from just the information you shared, it feels like you have a > problem in the GROUP-BY operator output, e.g. the declared output type of > the GROUP-BY and the actual output type produced or something like that. > For example, you mentioned that you are aggregating over a, b ("then > aggregate tuples on(a,b)"), the original SORT_GROUP_BY has both fields: > SORT_GROUP_BY[$$139, $$138] which is part of the group-by operator > output: group by ([$$163 := $$139; $$164 := $$138]). > In your example, I only see one field: PRE_CLUSTERED_GROUP_BY[$$139] and > group > by ([$$163 := $$139]) > > On Tue, Jul 8, 2025 at 8:24 PM Pratyoy Das <praty...@uci.edu> wrote: > > > Hi all, > > I am trying to add a new rule in Asterix DB and would really appreciate > > some help/advice in constructing the rule. > > Let's say that there is a query: select a,b sum(c) from T group by a,b. > > The default way to answer this query is by using a sort group by, and the > > physical plan (for the local or global agg) would look something like the > > following :- > > > > > > group by ([$$163 := $$139; $$164 := $$138]) decor ([]) { > > aggregate [$$155, $$156, $$157, $$158, $$159, > > $$160, $$161, $$162] <- [agg-local-sql-sum($$90), > agg-local-sql-sum($$95), > > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))), > > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95, > > numeric-subtract(1, $$152)), numeric-add(1, $$154))), > > agg-local-sql-avg($$90), agg-local-sql-avg($$95), > agg-local-sql-avg($$152), > > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] > > -- AGGREGATE |LOCAL| > > nested tuple source [cardinality: 0.0, op-cost: > > 0.0, total-cost: 0.0] > > -- NESTED_TUPLE_SOURCE |LOCAL| > > } > > -- SORT_GROUP_BY[$$139, $$138] |PARTITIONED| > > > > Now let's say that we can guarantee a sorted order on attribute a. So, > for > > the local agg, I want to do a pre-clustered group by on "a", and then > do a > > micro sort on b and then aggregate tuples on(a,b). I got inspired from > how > > Asterix DB deals with queries that have DISTINCT and GROUP BY - eg select > > a, count(distinct b) from T group by a. The group by looks something > like: > > group by ([$$o_orderpriority := $$75]) decor ([]) { > > aggregate [$$81] <- [agg-sql-count($$76)] > [cardinality: > > 0.0, op-cost: 0.0, total-cost: 0.0] > > -- AGGREGATE |LOCAL| > > distinct ([$$76]) [cardinality: 0.0, op-cost: 0.0, > > total-cost: 0.0] > > -- MICRO_PRE_SORTED_DISTINCT_BY |LOCAL| > > order (ASC, $$76) [cardinality: 0.0, op-cost: > 0.0, > > total-cost: 0.0] > > -- MICRO_STABLE_SORT [$$76(ASC)] |LOCAL| > > nested tuple source [cardinality: 0.0, op-cost: > > 0.0, total-cost: 0.0] > > -- NESTED_TUPLE_SOURCE |LOCAL| > > } > > -- PRE_CLUSTERED_GROUP_BY[$$75] |PARTITIONED| > > > > I created a rule and currently it fails during runtime (not compilation) > > with an error "Cannot invoke > > "org.apache.asterix.om.types.ATypeTag.ordinal()" because "sourceTag" is > > null". On further investigation, I found out that the tuple rising up > from > > this group by is malformed. The rewritten group by looks like this: > > group by ([$$163 := $$139]) decor ([]) { aggregate [$$155, > > $$156, $$157, $$158, $$159, $$160, $$161, $$162] <- > > [agg-local-sql-sum($$90), agg-local-sql-sum($$95), > > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))), > > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95, > > numeric-subtract(1, $$152)), numeric-add(1, $$154))), > > agg-local-sql-avg($$90), agg-local-sql-avg($$95), > agg-local-sql-avg($$152), > > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] -- > > AGGREGATE |LOCAL| order (ASC, $$138) [cardinality: 0.0, op-cost: 0.0, > > total-cost: 0.0] -- MICRO_STABLE_SORT [$$138(ASC)] |LOCAL| nested tuple > > source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] -- > > NESTED_TUPLE_SOURCE |LOCAL| } -- PRE_CLUSTERED_GROUP_BY[$$139] > > |PARTITIONED| > > In particular, I think I need three major advices:- > > 1. Any idea as to what is going wrong. > > 2. Where should I place this rule? Right now I am placing it towards the > > end just before execution mode and memory requirements are set and > perhaps > > it should be introduced much earlier? > > 3. How do I introduce functions > > like computeAndSetTypeEnvironmentForOperator() and > > computeDeliveredPhysicalProperties(). I believe these are key functions > to > > set for adding new operators later in the compile stage and I couldn't > find > > good examples to check the order in which we should invoke these > functions. > > I am happy to provide more details (including the new rule) if necessary. > > Thanks a lot for your time! > > -- > > Best Regards, > > Pratyoy > > > > > -- > Regards, > -- Best Regards, Pratyoy