Re: [HACKERS] upper planner path-ification

Kouhei Kaigai Tue, 23 Jun 2015 02:07:05 -0700

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Tom Lane
> Sent: Monday, May 18, 2015 1:12 AM
> To: Robert Haas
> Cc: [email protected]
> Subject: Re: [HACKERS] upper planner path-ification
> 
> Robert Haas <[email protected]> writes:
> > So, getting back to this part, what's the value of returning a list of
> > Paths rather than a list of Plans?
> 
> (1) less work, since we don't have to fill in details not needed for
>     costing purposes;
> (2) paths carry info that the planner wants but the executor doesn't,
>     notably sort-order annotations.
> 
> > target lists are normally computed when paths are converted to plans,
> > but for the higher-level plan nodes added by grouping_planner, the
> > path list is typically just passed in.  So would the new path types be
> > expected to carry target lists of their own, or would they need to
> > figure out the target list on the fly at plan generation time?
> 
> Yeah, that is something I've been struggling with while thinking about
> this.  I don't really want to add tlists as such to Paths, but it's
> not very clear how else to annotate a Path as to what it computes,
> and that seems like an annotation we have to have in some form in order
> to convert these planning steps into a Path universe.
> 
> There are other cases where it would be useful to have some notion of
> this kind.  An example is that right now, if you have an expression index
> on an expensive function and a query that wants the value of that function,
> the planner is able to extract the value from the index --- but there is
> nothing that gives any cost benefit to doing so, so it's just as likely
> to choose some other index and eat the cost of recalculating the function.
> It seems like the only way to fix that in a principled fashion is to have
> some concept that the index-scan Path can produce the function value,
> and then when we come to some appropriate costing step, penalize the other
> paths for having to compute the value that's available for free from this
> one.
> 
> Rather than adding tlists per se to Paths, I've been vaguely toying with
> a notion of identifying all the "interesting" subexpressions in a query
> (expensive functions, aggregates, etc), giving them indexes 1..n, and then
> marking Paths with bitmapsets showing which interesting subexpressions
> they can produce values for.  This would make things like "does this Path
> compute all the needed aggregates" much cheaper to deal with than a raw
> tlist representation would do.  But maybe that's still not the best way.
>
Hmm... that seems to me a little more complicated than a flat expression node.
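
To make the comparison concrete, here is a minimal sketch of the bitmapset
idea quoted above, written in Python rather than planner C code. Everything
in it is hypothetical and for illustration only: the INTERESTING list, the
Path class, and the helper names are assumptions, not existing PostgreSQL
structures or functions.

```python
from dataclasses import dataclass

# Assumption: the planner has already collected the "interesting"
# subexpressions of the query and numbered them 1..n.
INTERESTING = ["expensive_func(t.a)", "sum(t.b)", "count(t.c)"]
INDEX_OF = {expr: i for i, expr in enumerate(INTERESTING, start=1)}


def make_bitmap(exprs):
    """Return an integer bitmap with one bit set per listed subexpression."""
    bits = 0
    for expr in exprs:
        bits |= 1 << INDEX_OF[expr]
    return bits


@dataclass
class Path:
    """Hypothetical stand-in for a planner Path annotated with a bitmap."""
    name: str
    produces: int  # bitmap of interesting subexpressions this path can emit


def computes_all(path, needed):
    """True if the path produces every subexpression in the needed bitmap."""
    return (path.produces & needed) == needed


index_path = Path("IndexScan on expr index", make_bitmap(["expensive_func(t.a)"]))
agg_path = Path("Agg", make_bitmap(["sum(t.b)", "count(t.c)"]))
all_aggs = make_bitmap(["sum(t.b)", "count(t.c)"])
```

With this representation, "does this Path compute all the needed aggregates"
is a single mask comparison, for example computes_all(agg_path, all_aggs),
whereas a flat tlist would need an expression-by-expression match.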


> Another point is that a Path that computes aggregates is fundamentally
> different from a Path that doesn't, because it doesn't even produce the
> same number of rows.  So I'm not at all sure how to visualize the idea
> of a Path that computes only some aggregates, or whether it's even a
> sensible thing to worry about supporting.
>
I expect partial aggregation to be done per input stream, not per
aggregate function. In other words, once the planner determines that
running partial aggregation on a particular relation scan/join is
advantageous, all the aggregate functions that consume rows produced by
that scan/join have to be split into partial aggregate / final aggregate
form, don't they?
If so, the number of rows to be returned is associated with a Path.

For example, when we break down the query below using 2-phase aggregation,

  SELECT t1.cat, avg(t2.x) FROM t1 JOIN t2 ON t1.id_1 = t2.id_2 GROUP BY t1.cat;

the expected plan is as shown below, isn't it?

  FinalAgg (nrows=100)
      tlist: t1.cat, avg(nrows, sum(t2.x))
      grouping key: t1.cat
   -> HashJoin (nrows=1000)
      tlist: t1.cat, count(t2.x) nrows, sum(t2.x)
       -> PartialAgg (nrows=1000)
          tlist: count(t2.x) nrows, sum(t2.x), t2.id_2
          grouping key: t2.id_2
           -> SeqScan on t2 (nrows=100000000)
       -> Hash
           -> SeqScan on t1 (nrows=100)

It is clear that PartialAgg consumes 100000000 rows, then outputs 1000
rows because of the partial reduction. All the partial aggregates on this
node work in a coordinated manner.
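
As a sanity check of the plan shape above, the following is a small Python
sketch (not executor code) of the same three steps: partial aggregation of
t2 by id_2, a hash join against t1, and a final aggregation by cat that
combines the (count, sum) states into avg. The function names and the
in-memory row format are assumptions made for illustration, and x is
assumed to be non-NULL.

```python
from collections import defaultdict


def partial_agg(t2_rows):
    """PartialAgg: group (id_2, x) rows by id_2 into (count(x), sum(x))."""
    state = defaultdict(lambda: [0, 0.0])
    for id_2, x in t2_rows:
        state[id_2][0] += 1
        state[id_2][1] += x
    return {key: (n, s) for key, (n, s) in state.items()}


def hash_join(t1_rows, partial):
    """HashJoin: match t1 (id_1, cat) rows to partial states on id_1 = id_2."""
    for id_1, cat in t1_rows:
        if id_1 in partial:
            n, s = partial[id_1]
            yield cat, n, s


def final_agg(joined):
    """FinalAgg: combine partial states per cat, then compute avg = sum / count."""
    totals = defaultdict(lambda: [0, 0.0])
    for cat, n, s in joined:
        totals[cat][0] += n
        totals[cat][1] += s
    return {cat: s / n for cat, (n, s) in totals.items()}


def two_phase_avg(t1_rows, t2_rows):
    """Run the whole PartialAgg -> HashJoin -> FinalAgg pipeline."""
    return final_agg(hash_join(t1_rows, partial_agg(t2_rows)))


t1 = [(1, "a"), (2, "a"), (3, "b")]
t2 = [(1, 10.0), (1, 20.0), (2, 30.0), (3, 5.0), (4, 99.0)]
result = two_phase_avg(t1, t2)
```

Every aggregate state in a PartialAgg row belongs to the same group of input
rows, so the row count is a property of the node as a whole and not of any
single aggregate function.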

Do you have another vision for the partial aggregation behavior?


> > One thing that seems like it might complicate things here is that a
> > lot of planner functions take PlannerInfo *root as an argument.  But
> > if we generate only paths in grouping_planner() and path-ify them
> > later, the subquery's root will not be available when we're trying to
> > do the Path -> Plan transformation.
> 
> Ah, you're wrong there, because we hang onto the subquery's root already
> (I forget why exactly, but see PlannerGlobal.subroots for SubPlans, and
> RelOptInfo.subroot for subquery-in-FROM).  So it would not be a
> fundamental problem to postpone create_plan() for a subquery.
> 
> > I think grouping_planner() is badly in need of some refactoring just
> > to make it shorter.  It's over 1000 lines of code, which IMHO is a
> > fairly ridiculous length for a single function.
> 
> Amen to that.  But as I said to Andrew, I think this will be a side-effect
> of path-ification in this area, and is probably not something to set out
> to do first.
>

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <[email protected]>



