Pushing down a subquery relation's ppi_clauses, and more ...

Richard Guo Fri, 25 Jul 2025 20:09:32 -0700

For a subquery relation, currently we consider pushing down its
restriction clauses to become WHERE or HAVING quals of the subquery
itself.  This transformation can potentially avoid the need to
evaluate all subquery output rows before filtering them, which may
lead to a more efficient execution plan.  As an example, please
consider:


explain (costs off)
select * from t t1 join
  (select t2.a, t2.b, sum(t2.b) from t t2 group by t2.a, t2.b) s
  on s.a = 1;
             QUERY PLAN
-------------------------------------
 Nested Loop
   ->  Seq Scan on t t1
   ->  Materialize
         ->  HashAggregate
               Group Key: t2.b
               ->  Seq Scan on t t2
                     Filter: (a = 1)
(7 rows)

If the subquery is LATERAL, its ppi_clauses -- that is, join clauses
available from its required outer rels -- sort of behave like
restriction clauses.  As an example, please consider:

explain (costs off)
select * from t t1 join lateral
  (select t2.a, t2.b, sum(t2.b), t1.a x from t t2 group by t2.a, t2.b) s
  on s.a = t1.a;
             QUERY PLAN
-------------------------------------
 Nested Loop
   ->  Seq Scan on t t1
   ->  Subquery Scan on s
         Filter: (t1.a = s.a)
         ->  HashAggregate
               Group Key: t2.a, t2.b
               ->  Seq Scan on t t2
(7 rows)

Here, I'd like to discuss whether it's worthwhile to also consider
pushing down a subquery relation's ppi_clauses if the subquery is
LATERAL.

First, it's important to note that pushing down ppi_clauses doesn't
always result in a better execution plan.  While doing so can reduce
the amount of data processed in each aggregation invocation within the
subquery, it also means that the aggregation needs to be re-evaluated
for every outer tuple.  If t1 is very small and t2 is large, pushing
down ppi_clauses can be a win.  As t1 gets larger, this gets less
attractive, and eventually it will have a higher cost than the current
plan, where the aggregation is evaluated only once.

Therefore, if we decide to pursue this approach, we would need to
generate two paths: one with the ppi_clauses pushed down, and one
without, and then compare their costs.  A potential concern is that
this might require re-planning the subquery twice, which could
increase planning overhead.

On the other hand, once this optimization is implemented, in addition
to making lateral subqueries more efficient in some cases, it will
enable us to pull up more complicated sublinks.  To be more concrete,
consider query:

select * from foo where foo.a >
    (select avg(bar.a) from bar where foo.b = bar.b);

... it can be transformed into something like:

select * from foo join
    (select bar.b, avg(bar.a) as avg from bar group by bar.b) sub
    on foo.b = sub.b and foo.a > sub.avg;

Currently, the planner does not perform this transformation because
it's not always a win -- if foo is very small and bar is large, the
original form tends to be more efficient.  However, if we can push
down the subquery's join clauses into the subquery, the resulting
parameterized paths will be pretty nearly performance-equivalent to
the origin form.  We can then compare these parameterized paths with
the unparameterized paths that calculate the aggregation only once.
With this capability, it becomes safe to always pull up such sublinks.

In some articles, the PostgreSQL optimizer is noted to lack certain
important transformations.  For example, one recent article from
Microsoft states:

"
Comparison with Volcano/Cascades: In PostgreSQL the separation
of scan/join planning from query flattening and post scan/join phase
prevents the exploitation of important transformations such as the
reordering of GROUP BY and joins, and decorrelation of correlated
subqueries (see Sections 4.4 and 4.5). This results in a reduced search
space of plans and therefore can result in missing out plans with lower
cost.
"

I hope the transformation proposed here can help address part of the
gap related to decorrelation of correlated subqueries.

(BTW, for the "reordering of GROUP BY and joins" transformation, I
have another patch called "Eager Aggregation" that specifically
targets that optimization.)

There are still many details I haven't fully thought through.  Before
investing too much time in this idea, I'd appreciate some feedback to
make sure I'm not overlooking something obvious.

Thanks
Richard

Pushing down a subquery relation's ppi_clauses, and more ...

Reply via email to