Re: Why are nested aggregations illegal? Best alternatives?

Stamatis Zampetakis Mon, 21 Feb 2022 05:07:31 -0800

Hi Gavin,

A few more comments in case they help to get you a bit further on your work.


The need to return the result as a single object is a common problem in
object relational mapping (ORM) frameworks/APIS (JPA, Datanucleus,
Hibernate, etc.). Apart from the suggestions so far maybe you could look
into these frameworks as well for more inspiration.

Moreover your approach of decomposing the query into individual parts is
commonly known as the N+1 problem [1].

Lastly, keep in mind that you can introduce custom UDF, UDAF functions if
you need more flexibility on reconstructing the final result.

Best,
Stamatis

[1]
https://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem-in-orm-object-relational-mapping

On Sun, Feb 13, 2022 at 3:59 AM Gavin Ray <ray.gavi...@gmail.com> wrote:

> Ah wait nevermind, got excited and spoke too soon. Looking at it more
> closely, that data isn't correct.
> At least it's in somewhat the right shape, ha!
>
> On Sat, Feb 12, 2022 at 9:57 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
>
> > After ~5 hours, I think I may have made some progress =)
> >
> > I have this, which currently works. The problem is that the nested
> columns
> > don't have names on them.
> > Since I need to return a nested "Map<String, Object>", I have to figure
> > out how to convert this query into a form that gives column names.
> >
> > But this is still great progress I think!
> >
> > SELECT
> >     "todos".*,
> >     ARRAY(
> >         SELECT
> >             "users".*,
> >             ARRAY(
> >                 SELECT
> >                     "todos".*
> >                 FROM
> >                     "todos"
> >             ) AS "todos"
> >         FROM
> >             "users"
> >     ) AS "users"
> > FROM
> >     "todos"
> > WHERE
> >     "user_id" IN (
> >         SELECT
> >             "user_id"
> >         FROM
> >             "users"
> >         WHERE
> >             "house_id" IN (
> >                 SELECT
> >                     "id"
> >                 FROM
> >                     "houses"
> >             )
> >     );
> >
> >
> >
> >
> +----+---------+------------------------+------------------------------------------------------------------------------+
> > | id | user_id |      description       |
> >                                             |
> >
> >
> +----+---------+------------------------+------------------------------------------------------------------------------+
> > | 1  | 1       | Take out the trash     | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 2  | 1       | Watch my favorite show | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 3  | 1       | Charge my phone        | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 4  | 2       | Cook dinner            | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 5  | 2       | Read a book            | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 6  | 2       | Organize office        | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 7  | 3       | Walk the dog           | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> > | 8  | 3       | Feed the cat           | [{1, John, 1, [{1, 1, Take out
> > the trash}, {2, 1, Watch my favorite show}, { |
> >
> >
> +----+---------+------------------------+------------------------------------------------------------------------------+
> >
> > On Sat, Feb 12, 2022 at 4:13 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
> >
> >> Nevermind, this is a standard term not something Calcite-specific it
> >> seems!
> >>
> >> https://en.wikipedia.org/wiki/Correlated_subquery
> >>
> >> On Sat, Feb 12, 2022 at 3:46 PM Gavin Ray <ray.gavi...@gmail.com>
> wrote:
> >>
> >>> Forgive my ignorance/lack of experience
> >>>
> >>> I am somewhat familiar with the ARRAY() function, but not sure I know
> >>> the term "correlated"
> >>> Searching the Calcite codebase for uses of "correlated" + "query", I
> >>> found:
> >>>
> >>>
> >>>
> https://github.com/apache/calcite/blob/1d4f1b394bfdba03c5538017e12ab2431b137ca9/core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java#L1603-L1612
> >>>
> >>>   @Test void testCorrelatedSubQueryInJoin() {
> >>>     final String sql = "select *\n"
> >>>         + "from emp as e\n"
> >>>         + "join dept as d using (deptno)\n"
> >>>         + "where d.name = (\n"
> >>>         + "  select max(name)\n"
> >>>         + "  from dept as d2\n"
> >>>         + "  where d2.deptno = d.deptno)";
> >>>     sql(sql).withExpand(false).ok();
> >>>   }
> >>>
> >>> But I also see this, which says it is "uncorrelated" but seems very
> >>> similar?
> >>>
> >>>   @Test void testInUncorrelatedSubQuery() {
> >>>     final String sql = "select empno from emp where deptno in"
> >>>         + " (select deptno from dept)";
> >>>     sql(sql).ok();
> >>>   }
> >>>
> >>> I wouldn't blame you for not answering such a basic question -- but
> what
> >>> exactly does "correlation" mean here?
> >>>
> >>> Thanks, as usual Julian
> >>>
> >>>
> >>> On Sat, Feb 12, 2022 at 3:08 PM Julian Hyde <jhyde.apa...@gmail.com>
> >>> wrote:
> >>>
> >>>> Correlated ARRAY sub-query?
> >>>>
> >>>> > On Feb 12, 2022, at 10:40 AM, Gavin Ray <ray.gavi...@gmail.com>
> >>>> wrote:
> >>>> >
> >>>> > Apologies for the delay in replying
> >>>> >
> >>>> > This makes things clear and seems obvious now that you point it out.
> >>>> > Thank you, Justin and Julian =)
> >>>> >
> >>>> > Let me ask another question (if I may) that I am struggling to
> phrase
> >>>> > easily.
> >>>> >
> >>>> > So with GraphQL, you might have a query like:
> >>>> > - "Get houses"
> >>>> > - "For each house get the user that lives in the house
> >>>> > - "And for each user get their list of todos"
> >>>> >
> >>>> > The result has to come back such that it's a single object for each
> >>>> row
> >>>> > ==============
> >>>> > {
> >>>> >  houses: [{
> >>>> >    address: "123 Main Street",
> >>>> >    users: [{
> >>>> >      name: "Joe",
> >>>> >      todos: [{
> >>>> >        description: "Take out trash"
> >>>> >      }]
> >>>> >    }]
> >>>> > }
> >>>> >
> >>>> > From a SQL perspective, the logical equivalent would be something
> >>>> like:
> >>>> > ==============
> >>>> > SELECT
> >>>> >    house.address,
> >>>> >    (somehow nest users + double-nest todos under user)
> >>>> > FROM
> >>>> >    house
> >>>> > JOIN
> >>>> >    user ON user.house_id = house.id
> >>>> >    todos ON todos.user_id = user.id
> >>>> > WHERE
> >>>> >    house.id = 1
> >>>> >
> >>>> > I'm not familiar enough with SQL to have figured out a way to make
> >>>> this
> >>>> > kind of
> >>>> > query using operators that are supported across most of the DB's
> >>>> Calcite has
> >>>> > adapters for.
> >>>> >
> >>>> > Currently what I have done instead, on a tip from Gopalakrishna
> Holla
> >>>> from
> >>>> > LinkedIn Coral team who has built GraphQL-on-Calcite, was to break
> up
> >>>> the
> >>>> > query
> >>>> > into individual parts and then do the join in-memory:
> >>>> >
> >>>> > SELECT ... FROM users;
> >>>> > SELECT ... FROM todos WHERE todos.user_id IN (ResultSet from prev
> >>>> response);
> >>>> >
> >>>> > However, the way I am doing this seems like it's probably very
> >>>> inefficient.
> >>>> > Because I do a series of nested loops to add the key to each object
> >>>> in the
> >>>> > parent ResultSet row:
> >>>> >
> >>>> >
> >>>>
> https://github.com/GavinRay97/GraphQLCalcite/blob/648e0ac4f6810a3c360d13a03e6597c33406de4b/src/main/kotlin/TableDataFetcher.kt#L135-L153
> >>>> >
> >>>> > Is there some better way of doing this?
> >>>> > I would be eternally grateful for any advice.
> >>>> >
> >>>> >
> >>>> > On Thu, Feb 10, 2022 at 3:27 PM Julian Hyde <jhyde.apa...@gmail.com
> >
> >>>> wrote:
> >>>> >
> >>>> >> Yes, if you want to do multiple layers of aggregation, use CTEs
> >>>> (WITH) or
> >>>> >> nested sub-queries. For example, the following is I believe valid
> >>>> standard
> >>>> >> SQL, and actually computes something useful:
> >>>> >>
> >>>> >>  WITH q1 AS
> >>>> >>   (SELECT deptno, job, AVG(sal) AS avg_sal
> >>>> >>    FROM emp
> >>>> >>    GROUP BY deptno, job)
> >>>> >>  WITH q2 AS
> >>>> >>   (SELECT deptno, AVG(avg_sal) AS avg_avg_sal
> >>>> >>    FROM q1
> >>>> >>    GROUP BY deptno)
> >>>> >>  SELECT AVG(avg_avg_sal)
> >>>> >>  FROM q2
> >>>> >>  GROUP BY ()
> >>>> >>
> >>>> >> (You can omit the “GROUP BY ()” line, but I think it makes things
> >>>> clearer.)
> >>>> >>
> >>>> >> Julian
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>> On Feb 10, 2022, at 12:17 PM, Justin Swanhart <
> greenl...@gmail.com>
> >>>> >> wrote:
> >>>> >>>
> >>>> >>> I wish you could unsend emails :)  Answering my own question, no,
> >>>> because
> >>>> >>> that would return three rows with the average :D
> >>>> >>>
> >>>> >>> On Thu, Feb 10, 2022 at 3:16 PM Justin Swanhart <
> >>>> greenl...@gmail.com>
> >>>> >> wrote:
> >>>> >>>
> >>>> >>>> Just out of curiosity, is the second level aggregation using AVG
> >>>> in a
> >>>> >>>> window context?  It the frame is the whole table and it
> aggregates
> >>>> over
> >>>> >> it?
> >>>> >>>>
> >>>> >>>> On Thu, Feb 10, 2022 at 3:12 PM Justin Swanhart <
> >>>> greenl...@gmail.com>
> >>>> >>>> wrote:
> >>>> >>>>
> >>>> >>>>> That is really neat about Oracle.
> >>>> >>>>>
> >>>> >>>>> The alternative in general is to use a subquery:
> >>>> >>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
> >>>> >>>>> becomes
> >>>> >>>>> select avg(the_avg)
> >>>> >>>>> from (select avg(sal) from emp group b deptno) an_alias;
> >>>> >>>>>
> >>>> >>>>> or
> >>>> >>>>>
> >>>> >>>>> with the_cte as (select avg(sal) x from emp group by deptno)
> >>>> >>>>> select avg(x) from the_cte;
> >>>> >>>>>
> >>>> >>>>> On Thu, Feb 10, 2022 at 3:03 PM Julian Hyde <
> >>>> jhyde.apa...@gmail.com>
> >>>> >>>>> wrote:
> >>>> >>>>>
> >>>> >>>>>> Some databases, e.g. Oracle, allow TWO levels of nesting:
> >>>> >>>>>>
> >>>> >>>>>> SELECT avg(sal) FROM emp GROUP BY deptno;
> >>>> >>>>>>
> >>>> >>>>>> AVG(SAL)
> >>>> >>>>>> ========
> >>>> >>>>>> 1,566.67
> >>>> >>>>>> 2,175.00
> >>>> >>>>>> 2,916.65
> >>>> >>>>>>
> >>>> >>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
> >>>> >>>>>>
> >>>> >>>>>> AVG(SUM(SAL))
> >>>> >>>>>> =============
> >>>> >>>>>>       9,675
> >>>> >>>>>>
> >>>> >>>>>> The first level aggregates by department (returning 3 records),
> >>>> and
> >>>> >> the
> >>>> >>>>>> second level computes the grand total (returning 1 record). But
> >>>> that
> >>>> >> is an
> >>>> >>>>>> exceptional case.
> >>>> >>>>>>
> >>>> >>>>>> Generally, any expression in the SELECT or HAVING clause of an
> >>>> >> aggregate
> >>>> >>>>>> query is either ‘before’ or ‘after’ aggregation. Consider
> >>>> >>>>>>
> >>>> >>>>>> SELECT t.x + 1 AS a, 2 + SUM(t.y + 3) AS b
> >>>> >>>>>> FROM t
> >>>> >>>>>> GROUP BY t.x
> >>>> >>>>>>
> >>>> >>>>>> The expressions “t.y” and “t.y + 3” occur before aggregation;
> >>>> “t.x”,
> >>>> >>>>>> “t.x + 1”, “SUM(t.y + 3)” and “2 + SUM(t.y + 3)” occur after
> >>>> >> aggregation.
> >>>> >>>>>> SQL semantics rely heavily on this stratification. Allowing an
> >>>> extra
> >>>> >> level
> >>>> >>>>>> of aggregation would mess it all up.
> >>>> >>>>>>
> >>>> >>>>>> Julian
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>> On Feb 10, 2022, at 9:45 AM, Justin Swanhart <
> >>>> greenl...@gmail.com>
> >>>> >>>>>> wrote:
> >>>> >>>>>>>
> >>>> >>>>>>> This is a SQL limitation.
> >>>> >>>>>>>
> >>>> >>>>>>> mysql> select sum(1);
> >>>> >>>>>>> +--------+
> >>>> >>>>>>> | sum(1) |
> >>>> >>>>>>> +--------+
> >>>> >>>>>>> |      1 |
> >>>> >>>>>>> +--------+
> >>>> >>>>>>> 1 row in set (0.00 sec)
> >>>> >>>>>>>
> >>>> >>>>>>> mysql> select sum(sum(1));
> >>>> >>>>>>> ERROR 1111 (HY000): Invalid use of group function
> >>>> >>>>>>>
> >>>> >>>>>>> On Thu, Feb 10, 2022 at 12:39 PM Gavin Ray <
> >>>> ray.gavi...@gmail.com>
> >>>> >>>>>> wrote:
> >>>> >>>>>>>
> >>>> >>>>>>>> Went to test this query out and found that it can't be
> >>>> performed:
> >>>> >>>>>>>>
> >>>> >>>>>>>> SELECT
> >>>> >>>>>>>>  JSON_OBJECT(
> >>>> >>>>>>>>      KEY 'users'
> >>>> >>>>>>>>      VALUE JSON_ARRAYAGG(
> >>>> >>>>>>>>          JSON_OBJECT(
> >>>> >>>>>>>>              KEY 'name' VALUE "users"."name",
> >>>> >>>>>>>>              KEY 'todos' VALUE JSON_ARRAYAGG(
> >>>> >>>>>>>>                  JSON_OBJECT(
> >>>> >>>>>>>>                      KEY 'description' VALUE
> >>>> "todos"."description"
> >>>> >>>>>>>>                  )
> >>>> >>>>>>>>              )
> >>>> >>>>>>>>          )
> >>>> >>>>>>>>      )
> >>>> >>>>>>>>  )
> >>>> >>>>>>>> FROM
> >>>> >>>>>>>>  "users"
> >>>> >>>>>>>> LEFT OUTER JOIN
> >>>> >>>>>>>>  "todos" ON "users"."id" = "todos"."user_id";
> >>>> >>>>>>>>
> >>>> >>>>>>>> Checking the source, seems this is a blanket policy, not a
> >>>> >>>>>>>> datasource-specific thing.
> >>>> >>>>>>>> From a functional perspective, it doesn't feel like it's much
> >>>> >>>>>> different
> >>>> >>>>>>>> from JOINs
> >>>> >>>>>>>> But I don't understand relational theory or DB functionality
> >>>> in the
> >>>> >>>>>> least,
> >>>> >>>>>>>> so I'm not fit to judge.
> >>>> >>>>>>>>
> >>>> >>>>>>>> Just curious why Calcite doesn't allow this
> >>>> >>>>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>
> >>>> >>
> >>>>
> >>>>
>

Re: Why are nested aggregations illegal? Best alternatives?

Reply via email to