If we have a hash join with an Append node on the outer side, something like
Hash Join Hash Cond: (pt.a = t.a) -> Append -> Seq Scan on pt_p1 pt_1 -> Seq Scan on pt_p2 pt_2 -> Seq Scan on pt_p3 pt_3 -> Hash -> Seq Scan on t We can actually prune those subnodes of the Append that cannot possibly contain any matching tuples from the other side of the join. To do that, when building the Hash table, for each row from the inner side we can compute the minimum set of subnodes that can possibly match the join condition. When we have built the Hash table and start to execute the Append node, we should have known which subnodes are survived and thus can skip other subnodes. This kind of partition pruning can be extended to happen across multiple join levels. For instance, Hash Join Hash Cond: (pt.a = t2.a) -> Hash Join Hash Cond: (pt.a = t1.a) -> Append -> Seq Scan on pt_p1 pt_1 -> Seq Scan on pt_p2 pt_2 -> Seq Scan on pt_p3 pt_3 -> Hash -> Seq Scan on t1 -> Hash -> Seq Scan on t2 We can compute the matching subnodes of the Append when building Hash table for 't1' according to the join condition 'pt.a = t1.a', and when building Hash table for 't2' according to join condition 'pt.a = t2.a', and the final surviving subnodes would be their intersection. Greenplum [1] has implemented this kind of partition pruning as 'Partition Selector'. Attached is a patch that refactores Greenplum's implementation to make it work on PostgreSQL master. Here are some details about the patch. During planning: 1. When creating a hash join plan in create_hashjoin_plan() we first collect information required to build PartitionPruneInfos at this join, which includes the join's RestrictInfos and the join's inner relids, and put this information in a stack. 2. When we call create_append_plan() for an appendrel, for each of the joins we check if join partition pruning is possible to take place for this appendrel, based on the information collected at that join, and if so build a PartitionPruneInfo and add it to the stack entry. 3. After finishing the outer side of the hash join, we should have built all the PartitionPruneInfos that can be used to perform join partition pruning at this join. So we pop out the stack entry to get the PartitionPruneInfos and add them to Hash node. During executing: When building the hash table for a hash join, we perform the partition prunning for each row according to each of the JoinPartitionPruneStates at this join, and store each result in a special executor parameter to make it available to Append nodes. When executing an Append node, we can directly use the pre-computed pruning results to skip subnodes that cannot contain any matching rows. Here is a query that shows the effect of the join partition prunning. CREATE TABLE pt (a int, b int, c varchar) PARTITION BY RANGE(a); CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (0) TO (250); CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (250) TO (500); CREATE TABLE pt_p3 PARTITION OF pt FOR VALUES FROM (500) TO (600); INSERT INTO pt SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 599) i WHERE i % 2 = 0; CREATE TABLE t1 (a int, b int); INSERT INTO t1 values (10, 10); CREATE TABLE t2 (a int, b int); INSERT INTO t2 values (300, 300); ANALYZE pt, t1, t2; SET enable_nestloop TO off; explain (analyze, costs off, summary off, timing off) select * from pt join t1 on pt.a = t1.a right join t2 on pt.a = t2.a; QUERY PLAN ------------------------------------------------------------ Hash Right Join (actual rows=1 loops=1) Hash Cond: (pt.a = t2.a) -> Hash Join (actual rows=0 loops=1) Hash Cond: (pt.a = t1.a) -> Append (actual rows=0 loops=1) -> Seq Scan on pt_p1 pt_1 (never executed) -> Seq Scan on pt_p2 pt_2 (never executed) -> Seq Scan on pt_p3 pt_3 (never executed) -> Hash (actual rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on t1 (actual rows=1 loops=1) -> Hash (actual rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on t2 (actual rows=1 loops=1) (14 rows) There are several points that need more consideration. 1. All the join partition prunning decisions are made in createplan.c where the best path tree has been decided. This is not great. Maybe it's better to make it happen when we build up the path tree, so that we can take the partition prunning into consideration when estimating the costs. 2. In order to make the join partition prunning take effect, the patch hacks the empty-outer optimization in ExecHashJoinImpl(). Not sure if this is a good practice. 3. This patch does not support parallel hash join yet. But it's not hard to add the support. 4. Is it possible and worthwhile to extend the join partition prunning mechanism to support nestloop and mergejoin also? Any thoughts or comments? [1] https://github.com/greenplum-db/gpdb Thanks Richard
v1-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data