Re: Adding a new rule to do a micro sort for group by on multiple columns

Pratyoy Das Fri, 11 Jul 2025 00:35:15 -0700

Hi Ali,
Thanks for your response. Indeed there is a mismatch, i.e. I want to
declare the group by on just the first attribute, but aggregate on all the
(group by) attributes. I was hoping that there is some way I can do it.
The query for which I am testing is TPC-H Q1 which is something like
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
from lineitem
where l_shipdate <= date '1998-12-01' - interval '90' day
group by l_returnflag,  l_linestatus
Here, let's say l_returnflag is pre-sorted so rather than sorting the
combination of (l_returnflag, l_linestatus), I want to just sort
l_linestatus for each individual l_returnflag value. So I feel like I need
to declare the pre clustered group by on just l_returnflag so that we wait
for all tuples with the same l_returnflag value to arrive, but then once
they have arrived, we aggregate on the combination of (l_returnflag,
l_linestatus). I know it looks weird but I hope you see what I am trying to
do. I am not sure how much help it will be, but I am attaching the rule
that I had come up with.
Thanks!





On Thu, Jul 10, 2025 at 7:25 PM Ali Alsuliman <ali.al.solai...@gmail.com>
wrote:

> Hi Pratyoy,
> We would need more information and context to help like the query you are
> running and sharing the code you have (possibly put it up in Gerrit).
> However, from just the information you shared, it feels like you have a
> problem in the GROUP-BY operator output, e.g. the declared output type of
> the GROUP-BY and the actual output type produced or something like that.
> For example, you mentioned that you are aggregating over a, b ("then
> aggregate tuples on(a,b)"), the original SORT_GROUP_BY has both fields:
> SORT_GROUP_BY[$$139, $$138] which is part of the group-by operator
> output: group by ([$$163 := $$139; $$164 := $$138]).
> In your example, I only see one field: PRE_CLUSTERED_GROUP_BY[$$139] and
> group
> by ([$$163 := $$139])
>
> On Tue, Jul 8, 2025 at 8:24 PM Pratyoy Das <praty...@uci.edu> wrote:
>
> > Hi all,
> > I am trying to add a new rule in Asterix DB and would really appreciate
> > some help/advice in constructing the rule.
> > Let's say that there is a query: select a,b sum(c) from T group by a,b.
> > The default way to answer this query is by using a sort group by, and the
> > physical plan (for the local or global agg) would look something like the
> > following :-
> >
> >
> >               group by ([$$163 := $$139; $$164 := $$138]) decor ([]) {
> >                         aggregate [$$155, $$156, $$157, $$158, $$159,
> > $$160, $$161, $$162] <- [agg-local-sql-sum($$90),
> agg-local-sql-sum($$95),
> > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))),
> > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95,
> > numeric-subtract(1, $$152)), numeric-add(1, $$154))),
> > agg-local-sql-avg($$90), agg-local-sql-avg($$95),
> agg-local-sql-avg($$152),
> > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
> >                         -- AGGREGATE  |LOCAL|
> >                           nested tuple source [cardinality: 0.0, op-cost:
> > 0.0, total-cost: 0.0]
> >                           -- NESTED_TUPLE_SOURCE  |LOCAL|
> >                      }
> >               -- SORT_GROUP_BY[$$139, $$138]  |PARTITIONED|
> >
> > Now let's say that we can guarantee a sorted order on attribute a. So,
> for
> > the local agg,  I want to do a pre-clustered group by on "a", and then
> do a
> > micro sort on b and then aggregate tuples on(a,b). I got inspired from
> how
> > Asterix DB deals with queries that have DISTINCT and GROUP BY - eg select
> > a, count(distinct b) from T group by a. The group by looks something
> like:
> > group by ([$$o_orderpriority := $$75]) decor ([]) {
> >                     aggregate [$$81] <- [agg-sql-count($$76)]
> [cardinality:
> > 0.0, op-cost: 0.0, total-cost: 0.0]
> >                     -- AGGREGATE  |LOCAL|
> >                       distinct ([$$76]) [cardinality: 0.0, op-cost: 0.0,
> > total-cost: 0.0]
> >                       -- MICRO_PRE_SORTED_DISTINCT_BY  |LOCAL|
> >                         order (ASC, $$76) [cardinality: 0.0, op-cost:
> 0.0,
> > total-cost: 0.0]
> >                         -- MICRO_STABLE_SORT [$$76(ASC)]  |LOCAL|
> >                           nested tuple source [cardinality: 0.0, op-cost:
> > 0.0, total-cost: 0.0]
> >                           -- NESTED_TUPLE_SOURCE  |LOCAL|
> >                  }
> >           -- PRE_CLUSTERED_GROUP_BY[$$75]  |PARTITIONED|
> >
> > I created a rule and currently it fails during runtime (not compilation)
> > with an error "Cannot invoke
> > "org.apache.asterix.om.types.ATypeTag.ordinal()" because "sourceTag" is
> > null". On further investigation, I found out that the tuple rising up
> from
> > this group by is malformed. The rewritten group by looks like this:
> >             group by ([$$163 := $$139]) decor ([]) { aggregate [$$155,
> > $$156, $$157, $$158, $$159, $$160, $$161, $$162] <-
> > [agg-local-sql-sum($$90), agg-local-sql-sum($$95),
> > agg-local-sql-sum(numeric-multiply($$95, numeric-subtract(1, $$152))),
> > agg-local-sql-sum(numeric-multiply(numeric-multiply($$95,
> > numeric-subtract(1, $$152)), numeric-add(1, $$154))),
> > agg-local-sql-avg($$90), agg-local-sql-avg($$95),
> agg-local-sql-avg($$152),
> > agg-sql-count(1)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] --
> > AGGREGATE |LOCAL| order (ASC, $$138) [cardinality: 0.0, op-cost: 0.0,
> > total-cost: 0.0] -- MICRO_STABLE_SORT [$$138(ASC)] |LOCAL| nested tuple
> > source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0] --
> > NESTED_TUPLE_SOURCE |LOCAL| } -- PRE_CLUSTERED_GROUP_BY[$$139]
> > |PARTITIONED|
> > In particular, I think I need three major advices:-
> > 1. Any idea as to what is going wrong.
> > 2. Where should I place this rule? Right now I am placing it towards the
> > end just before execution mode and memory requirements are set and
> perhaps
> > it should be introduced much earlier?
> > 3. How do I introduce functions
> > like computeAndSetTypeEnvironmentForOperator() and
> > computeDeliveredPhysicalProperties(). I believe these are key functions
> to
> > set for adding new operators later in the compile stage and I couldn't
> find
> > good examples to check the order in which we should invoke these
> functions.
> > I am happy to provide more details (including the new rule) if necessary.
> > Thanks a lot for your time!
> > --
> > Best Regards,
> > Pratyoy
> >
>
>
> --
> Regards,
>


-- 
Best Regards,
Pratyoy

Re: Adding a new rule to do a micro sort for group by on multiple columns

Reply via email to