Re: POC: GROUP BY optimization

Alexander Korotkov Mon, 27 May 2024 05:41:55 -0700

Hi!

On Tue, Apr 23, 2024 at 1:19 AM Alexander Korotkov <[email protected]> wrote:
> On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov <[email protected]> 
> wrote:
> > Thank you for the fixes you've proposed.  I didn't look much into
> > details yet, but I think the main concern Tom expressed in [1] is
> > whether the feature is reasonable at all.  I think at this stage the
> > most important thing is to come up with convincing examples showing
> > how huge performance benefits it could cause.  I will return to this
> > later today and will try to provide some convincing examples.
>
> I took a fresh look at 0452b461b, and have the following thoughts:
> 1) Previously, preprocess_groupclause() tries to reorder GROUP BY
> clauses to match the required ORDER BY order.  It only reorders if
> GROUP BY pathkeys are the prefix of ORDER BY pathkeys or vice versa.
> So, both of them need to be satisfied by one ordering.  0452b461b also
> tries to match GROUP BY clauses to ORDER BY clauses, but takes into
> account an incremental sort.  Actually, instead of removing
> preprocess_groupclause(), we could just teach it to take incremental
> sort into account.
> 2) The get_useful_group_keys_orderings() function takes into account 3
> orderings of pathkeys and clauses: original order as written in GROUP
> BY, matching ORDER BY clauses as much as possible, and matching the
> input path as much as possible.  Given that even before 0452b461b,
> preprocess_groupclause() could change the original order of GROUP BY
> clauses, so we don't need to distinguish the first two.  We could just
> consider output of new preprocess_groupclause() taking into account an
> incremental sort and the ordering matching the input path as much as
> possible.  This seems like significant simplification.
>
> Let me also describe my thoughts about the justification of the
> feature itself.  As Tom pointed upthread, Sort + Grouping is generally
> unlikely faster than Hash Aggregate.  The main point of this feature
> is being able to match the order of GROUP BY clauses to the order of
> the input path.  That input path could be Index Scan or Subquery.
> Let's concentrate on Index Scan.  Index Scan can give us the required
> order, so we can do grouping without Sort or with significantly
> cheaper Incremental Sort.  That could be much faster than Hash
> Aggregate.  But if we scan the full table (or its significant
> fraction), Index Scan is typically much more expensive than Sequential
> Scan because of random page access.  However, there are cases when
> Index Scan could be cheaper.
> 1) If the heap row is wide and the index contains all the required
> columns, Index Only Scan can be cheaper than Sequential Scan because
> of lesser volume.
> 2) If the query predicate is selective enough and matches the index,
> Index Scan might be significantly cheaper.  One may say that if the
> query predicate is selective enough then there are not that many
> matching rows, so aggregating them either way isn't a big deal anyway.
> However, given that data volumes are growing tremendously, it's not
> hard to imagine that the index selected a small fraction of table
> rows, but they are still enough to be costly for aggregating.
> Therefore, I think there are significant cases where matching GROUP BY
> clauses to the order of the input path could give a substantial
> improvement over Hash Aggregate.
>
> While there are some particular use-cases by Jian He, I hope that
> above could give some rationale.


I've assembled patches in this thread into one patchset.
0001 The patch fixing asymmetry in setting EquivalenceClass.ec_sortref
by Andrei [1].  I've revised comments and wrote the commit message.
0002 The patch for handling duplicates of SortGroupClause.  I didn't
get the sense of Andrei implementation.  It seems to care about
duplicate pointers in group clauses list.  But the question is the
equal SortGroupClause's comprising different pointers.  I think we
should group duplicate SortGroupClause's together as
preprocess_groupclause() used to do.  Reimplemented patch to do so.
0003 Rename PathKeyInfo to GroupByOrdering by Andres [3].  I only
revised comments and wrote the commit message.
0004 Turn back preprocess_groupclause() for the reason I described upthread [4].

Any thoughts?

Links.
1. 
https://www.postgresql.org/message-id/17037754-f187-4138-8285-0e2bfebd0dea%40postgrespro.ru
2. 
https://www.postgresql.org/message-id/a663f0f6-cbf6-49aa-af2e-234dc6768a07%40postgrespro.ru
3. 
https://www.postgresql.org/message-id/db0fc3a4-966c-4cec-a136-94024d39212d%40postgrespro.ru
4. 
https://www.postgresql.org/message-id/CAPpHfdvyWLMGwvxaf%3D7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

v2-0003-Rename-PathKeyInfo-to-GroupByOrdering.patch
Description: Binary data

v2-0004-Restore-preprocess_groupclause.patch
Description: Binary data

v2-0002-Teach-group_keys_reorder_by_pathkeys-about-redund.patch
Description: Binary data

v2-0001-Fix-asymmetry-in-setting-EquivalenceClass.ec_sort.patch
Description: Binary data

Re: POC: GROUP BY optimization

Reply via email to