Support run-time partition pruning for hash join

Richard Guo Sun, 20 Aug 2023 20:48:35 -0700

If we have a hash join with an Append node on the outer side, something
like


 Hash Join
   Hash Cond: (pt.a = t.a)
   ->  Append
         ->  Seq Scan on pt_p1 pt_1
         ->  Seq Scan on pt_p2 pt_2
         ->  Seq Scan on pt_p3 pt_3
   ->  Hash
         ->  Seq Scan on t

We can actually prune those subnodes of the Append that cannot possibly
contain any matching tuples from the other side of the join.  To do
that, when building the Hash table, for each row from the inner side we
can compute the minimum set of subnodes that can possibly match the join
condition.  When we have built the Hash table and start to execute the
Append node, we should have known which subnodes are survived and thus
can skip other subnodes.

This kind of partition pruning can be extended to happen across multiple
join levels.  For instance,

 Hash Join
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join
         Hash Cond: (pt.a = t1.a)
         ->  Append
               ->  Seq Scan on pt_p1 pt_1
               ->  Seq Scan on pt_p2 pt_2
               ->  Seq Scan on pt_p3 pt_3
         ->  Hash
               ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2

We can compute the matching subnodes of the Append when building Hash
table for 't1' according to the join condition 'pt.a = t1.a', and when
building Hash table for 't2' according to join condition 'pt.a = t2.a',
and the final surviving subnodes would be their intersection.

Greenplum [1] has implemented this kind of partition pruning as
'Partition Selector'.  Attached is a patch that refactores Greenplum's
implementation to make it work on PostgreSQL master.  Here are some
details about the patch.

During planning:

1. When creating a hash join plan in create_hashjoin_plan() we first
   collect information required to build PartitionPruneInfos at this
   join, which includes the join's RestrictInfos and the join's inner
   relids, and put this information in a stack.

2. When we call create_append_plan() for an appendrel, for each of the
   joins we check if join partition pruning is possible to take place
   for this appendrel, based on the information collected at that join,
   and if so build a PartitionPruneInfo and add it to the stack entry.

3. After finishing the outer side of the hash join, we should have built
   all the PartitionPruneInfos that can be used to perform join
   partition pruning at this join.  So we pop out the stack entry to get
   the PartitionPruneInfos and add them to Hash node.

During executing:

When building the hash table for a hash join, we perform the partition
prunning for each row according to each of the JoinPartitionPruneStates
at this join, and store each result in a special executor parameter to
make it available to Append nodes.  When executing an Append node, we
can directly use the pre-computed pruning results to skip subnodes that
cannot contain any matching rows.

Here is a query that shows the effect of the join partition prunning.

CREATE TABLE pt (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (0) TO (250);
CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (250) TO (500);
CREATE TABLE pt_p3 PARTITION OF pt FOR VALUES FROM (500) TO (600);
INSERT INTO pt SELECT i, i % 25, to_char(i, 'FM0000') FROM
generate_series(0, 599) i WHERE i % 2 = 0;

CREATE TABLE t1 (a int, b int);
INSERT INTO t1 values (10, 10);

CREATE TABLE t2 (a int, b int);
INSERT INTO t2 values (300, 300);

ANALYZE pt, t1, t2;

SET enable_nestloop TO off;

explain (analyze, costs off, summary off, timing off)
select * from pt join t1 on pt.a = t1.a right join t2 on pt.a = t2.a;
                         QUERY PLAN
------------------------------------------------------------
 Hash Right Join (actual rows=1 loops=1)
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join (actual rows=0 loops=1)
         Hash Cond: (pt.a = t1.a)
         ->  Append (actual rows=0 loops=1)
               ->  Seq Scan on pt_p1 pt_1 (never executed)
               ->  Seq Scan on pt_p2 pt_2 (never executed)
               ->  Seq Scan on pt_p3 pt_3 (never executed)
         ->  Hash (actual rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on t1 (actual rows=1 loops=1)
   ->  Hash (actual rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on t2 (actual rows=1 loops=1)
(14 rows)


There are several points that need more consideration.

1. All the join partition prunning decisions are made in createplan.c
   where the best path tree has been decided.  This is not great.  Maybe
   it's better to make it happen when we build up the path tree, so that
   we can take the partition prunning into consideration when estimating
   the costs.

2. In order to make the join partition prunning take effect, the patch
   hacks the empty-outer optimization in ExecHashJoinImpl().  Not sure
   if this is a good practice.

3. This patch does not support parallel hash join yet.  But it's not
   hard to add the support.

4. Is it possible and worthwhile to extend the join partition prunning
   mechanism to support nestloop and mergejoin also?

Any thoughts or comments?

[1] https://github.com/greenplum-db/gpdb

Thanks
Richard

v1-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data

Support run-time partition pruning for hash join

Reply via email to