On Mon, Jun 22, 2020 at 11:50:49AM -0400, Robert Haas wrote:
On Sat, May 16, 2020 at 10:56 AM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
The local effects are trivial - it's for example reordering the
pathkeys to make the explicit sort as cheap as possible. This thread
already discussed a number of things to base this on - ndistinct for
columns, cost of comparison function, ... In any case, this is
something we can decide locally, when building the grouping paths.
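As a sketch of that local decision (hypothetical table t, where "many"
has ~1M distinct values and "few" only ~10):

    SELECT few, many, count(*) FROM t GROUP BY few, many;

    -- Both sort orders produce the same groups, so either one can
    -- feed the GroupAggregate:
    --
    --   Sort key: few, many  -- ties on "few" are common, so the
    --                           comparator almost always has to look
    --                           at "many" as well
    --   Sort key: many, few  -- ties on "many" are rare, so most rows
    --                           are resolved by the first comparison
    --
    -- With ndistinct estimates and per-type comparison costs we could
    -- actually cost these orderings, instead of just taking the
    -- GROUP BY clause order as written.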
The global effects are much harder to tackle, because the decision
can't be made locally when building the grouping paths. It requires
changes both below and above the point where we build grouping paths.
An example of a decision we need to make before we even get to
building a grouping path is which index paths to build. Currently we
only build index paths with "useful" pathkeys, and without tweaking
that we'll never even see the index in add_paths_to_grouping_rel().
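For example (hypothetical index, just to illustrate the point):

    CREATE INDEX t_b_a_idx ON t (b, a);

    SELECT a, b, count(*) FROM t GROUP BY a, b;

    -- The index returns rows ordered by (b, a), which is a perfectly
    -- good input ordering for GROUP BY a, b - grouping doesn't care
    -- about the order of the columns. But unless the query also needs
    -- ORDER BY b, a, the (b, a) pathkeys don't look "useful" at scan
    -- time, the index path is never built, and the grouping step
    -- never gets a chance to consider it.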
But there are also decisions that can be made only after we build the
grouping paths. For example, we may have both GROUP BY and ORDER BY,
and there is no "always correct" way to combine those. In some cases
it may be correct to use the same pathkeys, in other cases it's
better to use different ones (which will require an extra Sort, with
additional cost).
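A simple illustration (hypothetical query, no particular dataset):

    SELECT a, b, count(*) FROM t GROUP BY a, b ORDER BY b;

    -- (a) group using a sort by (a, b), then add a Sort by b on top
    --     of the aggregate - two sorts, but the second one is on the
    --     (much smaller) aggregated data
    -- (b) group using a sort by (b, a) - the output is already
    --     ordered by b, so no extra Sort is needed
    --
    -- Neither choice always wins: (a) can be cheaper if an index
    -- provides the (a, b) ordering, (b) saves the extra sort
    -- otherwise.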
So I don't think there will be a single set of "interesting" grouping
pathkeys (i.e. root->group_pathkeys), but a collection of pathkeys.
And we'll need to build grouping paths for all of those, and leave
the planner to eventually pick the one giving us the cheapest plan
...
I agree with all of this and I think it's really good analysis. Part
of the reason why the planner isn't that sophisticated in this area is
that, for a long time, we didn't use paths at this level, and so it
was much harder to write any kind of viable patch to consider
alternatives. With Tom's planner path-ification work there should be
a lot more potential for optimization here, but, as you say, we need
to do that by leveraging the existing costing machinery, not just via
simple heuristics.
Agreed.
It also strikes me that one of the problems in this area is that the
statistics we currently gather don't seem to be entirely useful or
reliable for aggregate planning.
I don't think that's an immediate issue. I don't think anyone started
implementing these optimizations and then abandoned them because of a
lack of statistics.
The reason we don't have those optimizations is more that we decided
not to even consider them, possibly because of planner limitations. Of
course, the optimizations may require additional stats, but that's
only step #2.
I wonder if there are extensions to the extended statistics mechanism,
or even just more things we should gather during a routine ANALYZE,
that would enable us to estimate things better here. The most obvious
thing is that n_distinct is often wildly inaccurate, but it's deeper
than that. For instance, we need some way to estimate how many groups
you're going to get when you filter on a and then group by b that
doesn't assume a uniform distribution. And we also need something
that doesn't just output a number of groups, but gives you some inkling
of the sizes of those groups: 1 giant group and a bunch of little ones
isn't the same as a bunch of equally sized groups. I don't know that
these are things we really have much chance of figuring out with the
currently-available information, though.
Not sure. I think the extended stats we have now (ndistinct
coefficients, multi-column MCV lists) should be a good basis for these
decisions, even with skewed data. Of course, we may need to add
something else, but I'm not sure what that would be.
The main limitation of extended stats is that they are per-table.
Once you do a join, it's all over - we don't have that capability :-(
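For example, with the stats we can already build today (hypothetical
table t again):

    CREATE STATISTICS t_a_b (ndistinct, mcv) ON a, b FROM t;
    ANALYZE t;

    -- ndistinct gives us the number of (a, b) combinations directly,
    -- instead of multiplying the per-column estimates, and the
    -- multi-column MCV list records the frequencies of the most
    -- common combinations - which is pretty much the "one giant group
    -- vs. many small groups" information. But all of that is
    -- per-table, so it stops helping as soon as the grouping happens
    -- above a join.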
Sorry if this is hijacking the thread a bit; I don't mean to
discourage work on this specific patch. I'm just wondering if we need
to think a little bigger to see our way to a good solution.
I think it's fine continuing with this optimization. Either we'll be
able to get it done with existing stats, or we'll find out what other
stats we need to build sufficiently good heuristics.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services