Re: [HACKERS] advanced partition matching algorithm for partition-wise join

Ashutosh Bapat Tue, 03 Mar 2020 08:49:24 -0800

Hi Fujita-san,
I reviewed the patch. Except for the logic of matching the pairs of
partitions from already merged partitions, I think the code changes are
good. But there are several places where it needs further changes to
comments. The attached patch has those. I have described some of them below.
+ * We can not perform partition-wise join if either of the joining
+ * relations is not partitioned.

We are consistently using partitionwise instead of partition-wise.

+ /*
+ * See if the partition bounds for inputs are exactly the same, in
+ * which case we don't need to work hard: partitions with the same
+ * partition indexes will form join pairs, and the join rel will have
+ * the same partition bounds as inputs; otherwise try to merge the
+ * partition bounds along with generating join pairs.

Phrase "joining relations" is better than "inputs", IMO. Updated in the
attached patch.

+ /*
+ * If the partition bounds for the join rel are not merged ones,
+ * inputs are guaranteed to have the same partition bounds, so
+ * partitions with the same partition indexes will form join pairs;
+ * else let get_matching_part_pairs() do the work.
+ */
+ if (joinrel->merged)
+ {

This condition in the comment is opposite to the condition being checked in
code, so looks confusing. BTW this comment is also repeated around line
1405.
See attached patch for correction.

+ /*
+ * If this segment of the join is empty, it means that this segment

"partition of the join" looks consistent with other usages than "segment of
the
join". Modified in the attached patch.

+ /*
+ * Get a relids set of partition(s) involved in this join segment that
+ * are from the rel1 side.
+ */
+ child_relids1 = bms_intersect(child_joinrel->relids,
+  rel1->all_partrels);

The partition bounds are sorted by their values. Even for a list partitioned
table, the bounds are sorted by the least partition value. We do not allow
multiple paritions from one side to be joined with one partition on the
other
and vice versa. All this together means that the partitions of the join
relation are formed by joining partitions from either side in the order of
their bounds. This means that the matching pairs of partitions can be found
by
matching relids of partitions of join with those of the joining relation by
traversing partitions from all the three relations only once in the order
they
appears in partition bounds of corresponding relations. If we use this
algorithm, we don't need all_partrels to be collected and also don't need to
search base or join relation. That, I think, will reduce the time and space
complexity of this logic. Am I missing something? As a side effect it won't
require any special handling for base and join relation.

+ /*
+ * Get a child rel for rel1 with the relids.  Note that we should have
+ * the child rel even if rel1 is a join rel, because in that case the
+ * partitions specified in the relids would have matching/overlapping
+ * boundaries, so those partitions should be considered as ones to be
+ * joined even when planning partitionwise joins of rel1, meaning that
+ * the child rel would have been built by the time we get here.
+ */
+ if (rel1_is_simple)

This variable is used only in one place. So instead we should the expression
assigning the value to it. Changed in the attached patch.

- rel->nparts = 0;
+ rel->nparts = -1;

I think we need to add comments around various values that nparts can take.
How
about like something attached.

+ case PARTITION_STRATEGY_HASH:
+ merged_bounds = NULL;

I think, we need to explain why we aren't merging hash partition bounds.
AFAIU,
the reason is thus: When the modulo of used for partition mapping (i.e.
maximum
number of has partitions) is same, their partition bounds are same and do
not
need merging. If the maximum number of partitions is different for both the
joining relations, there's high probability that one partition on one side
will
join with multiple partitions on the other side. So exact partition bounds
match will work in most of the cases. The cases otherwise are not so common
to
spend the effort in coding and planning.

I have added this explanation in the patch. Don't know if it's there written
somewhere already.

+ if (part_index == -1)
+ return -1;
+ } while (is_dummy_partition(rel, part_index));

I understand why we are skipping NULL positions. I am not sure why are we
are
skipping over RelOptInfos which exist but are marked as dummy; we can still
create a join pair with those partitions.

+/*
+ * get_merged_range_bounds
+ * Given the bounds of range partitions to be join, determine the range

s/join/joined/

There are more changes to comments, where I thought that the comments are
required or existing comments need more clarification. Please review the
attached patch. This patch is created on top of
v32-0001-Improve-partition-matching-for-partitionwise-join.

On Mon, Feb 10, 2020 at 5:14 PM Etsuro Fujita <etsuro.fuj...@gmail.com>
wrote:

> On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro.fuj...@gmail.com>
> wrote:
> > On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dil...@enterprisedb.com>
> wrote:
> > > The patches apply and pass all tests.  A review of the patch vs.
> master looks reasonable.
>
> I've merged the patches.  Attached is a new version of the patch.
>
> > > The partition_join.sql test has multiple levels of partitioning, but
> when your patch extends that test with “advanced partition-wise join”, none
> of the tables for the new section have multiple levels.  I spent a little
> while reviewing the code and inventing multiple level partitioning tests
> for advanced partition-wise join and did not encounter any problems.  I
> don’t care whether you use this particular example, but do you want to have
> multiple level partitioning in the new test section?
> >
> > Yes, I do.
> >
> > > CREATE TABLE alpha (a double precision, b double precision) PARTITION
> BY RANGE (a);
> > > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM
> ('-Infinity') TO (0) PARTITION BY RANGE (b);
> > > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO
> ('Infinity') PARTITION BY RANGE (b);
> > > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity')
> TO ('NaN');
> > > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM
> ('-Infinity') TO (0);
> > > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0)
> TO ('Infinity');
> > > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM
> ('Infinity') TO ('NaN');
> > > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM
> ('-Infinity') TO (0);
> > > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0)
> TO ('Infinity');
> > > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM
> ('Infinity') TO ('NaN');
> > > INSERT INTO alpha (a, b)
> > >     (SELECT * FROM
> > >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8),
> ('Infinity'::float8)) a,
> > >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8),
> ('Infinity'::float8)) b
> > >     );
> > > ANALYZE alpha;
> > > ANALYZE alpha_neg;
> > > ANALYZE alpha_pos;
> > > ANALYZE alpha_nan;
> > > ANALYZE alpha_neg_neg;
> > > ANALYZE alpha_neg_pos;
> > > ANALYZE alpha_neg_nan;
> > > ANALYZE alpha_pos_neg;
> > > ANALYZE alpha_pos_pos;
> > > ANALYZE alpha_pos_nan;
> > > CREATE TABLE beta (a double precision, b double precision) PARTITION
> BY RANGE (a, b);
> > > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0,
> 0);
> > > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0,
> 5);
> > > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5,
> 5);
> > > INSERT INTO beta (a, b)
> > >     (SELECT * FROM
> > >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> > >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> > >     );
> > > ANALYZE beta;
> > > ANALYZE beta_lo;
> > > ANALYZE beta_me;
> > > ANALYZE beta_hi;
> > > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND
> alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1;
> > >                                   QUERY PLAN
> > >
> -------------------------------------------------------------------------------
> > >  Nested Loop  (cost=0.00..2.11 rows=1 width=32)
> > >    ->  Seq Scan on alpha_pos_pos alpha  (cost=0.00..1.06 rows=1
> width=16)
> > >          Filter: ((b = '1'::double precision) AND (a = '1'::double
> precision))
> > >    ->  Seq Scan on beta_hi beta  (cost=0.00..1.04 rows=1 width=16)
> > >          Filter: ((b = '1'::double precision) AND (a = '1'::double
> precision))
> > > (5 rows)
> >
> > Hmm, I'm not sure this is a good test case for that, because this
> > result would be due to partition pruning applied to each side of the
> > join before considering partition-wise join; you could get the same
> > result even with enable_partitionwise_join=off.  I think it's
> > important that the partition-wise join logic doesn't break this query,
> > though.
>
> I think this would be beyond the scope of the patch, so I added
> different test cases that I think would be better as ones for
> multi-level partitioning.
>
> Thanks!
>
> Best regards,
> Etsuro Fujita
>

-- 
--
Best Wishes,
Ashutosh Bapat

diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 77e6ff5376..e14a2e51c2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1378,7 +1378,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	Assert(joinrel->consider_partitionwise_join);
 
 	/*
-	 * We can not perform partition-wise join if either of the joining
+	 * We can not perform partitionwise join if either of the joining
 	 * relations is not partitioned.
 	 */
 	if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
@@ -1399,8 +1399,9 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	/*
 	 * If we don't have the partition bounds for the join rel yet, try to
-	 * create it along with pairs of partitions to be joined; else generate
-	 * those using the partitioning info for the join rel we already have.
+	 * compute those along with pairs of partitions to be joined; else generate
+	 * the pairs using the partitioning info of the join relation we already
+	 * have.
 	 */
 	if (joinrel->nparts == -1)
 	{
@@ -1412,16 +1413,17 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Assert(joinrel->part_rels == NULL);
 
 		/*
-		 * See if the partition bounds for inputs are exactly the same, in
-		 * which case we don't need to work hard: partitions with the same
-		 * partition indexes will form join pairs, and the join rel will have
-		 * the same partition bounds as inputs; otherwise try to merge the
-		 * partition bounds along with generating join pairs.
+		 * See if the partition bounds of the joining relations are exactly the
+		 * same, in which case we don't need to work hard: partitions with the
+		 * same partition indexes will form join pairs, and the join relation
+		 * will have the same partition bounds as the joining relations;
+		 * otherwise try to merge the partition bounds along with generating
+		 * join pairs.
 		 *
-		 * Even if one or both inputs have merged partition bounds, it'd be
-		 * possible for the partition bounds to be exactly the same, but it
-		 * seems unlikely to be worth the cycles to check; do this check only
-		 * if both inputs have non-merged partition bounds.
+		 * Even if one or both the joining relations have merged partition
+		 * bounds, it'd be possible for the partition bounds to be exactly the
+		 * same, but it seems unlikely to be worth the cycles to check; do this
+		 * check only if both inputs have non-merged partition bounds.
 		 */
 		if (!rel1->merged &&
 			!rel2->merged &&
@@ -1467,10 +1469,11 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Assert(joinrel->part_rels);
 
 		/*
-		 * If the partition bounds for the join rel are not merged ones,
-		 * inputs are guaranteed to have the same partition bounds, so
-		 * partitions with the same partition indexes will form join pairs;
-		 * else let get_matching_part_pairs() do the work.
+		 * If the partition bounds of the join relation were computed by
+		 * merging the bounds of the joining relations, generate the pairs of
+		 * joining partitions by matching their relids. Nothing to do otherwise
+		 * since the partitions at same cardinal positions form the joining
+		 * pairs.
 		 */
 		if (joinrel->merged)
 		{
@@ -1847,8 +1850,6 @@ get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
 						RelOptInfo *rel1, RelOptInfo *rel2,
 						List **parts1, List **parts2)
 {
-	bool		rel1_is_simple = IS_SIMPLE_REL(rel1);
-	bool		rel2_is_simple = IS_SIMPLE_REL(rel2);
 	int 		cnt_parts;
 
 	*parts1 = NIL;
@@ -1863,7 +1864,7 @@ get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
 		Relids		child_relids2;
 
 		/*
-		 * If this segment of the join is empty, it means that this segment
+		 * If the current partition of the join is empty, it means that this segment
 		 * was ignored when previously creating child-join paths for it in
 		 * try_partitionwise_join() as it would not contribute to the join
 		 * result, due to one or both inputs being empty; add NULL to each of
@@ -1893,7 +1894,7 @@ get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
 		 * joined even when planning partitionwise joins of rel1, meaning that
 		 * the child rel would have been built by the time we get here.
 		 */
-		if (rel1_is_simple)
+		if (IS_SIMPLE_REL(rel1))
 		{
 			int			varno = bms_singleton_member(child_relids1);
 
@@ -1914,7 +1915,7 @@ get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
 		/*
 		 * Get a child rel for rel2 with the relids.  See above comments.
 		 */
-		if (rel2_is_simple)
+		if (IS_SIMPLE_REL(rel2))
 		{
 			int			varno = bms_singleton_member(child_relids2);
 
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 3c7b6030c0..f639e2c69b 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -3156,6 +3156,16 @@ partition_bounds_merge(int partnatts,
 	switch (strategy)
 	{
 		case PARTITION_STRATEGY_HASH:
+			/*
+			 * When the modulo of used for partition mapping (i.e. maximum
+			 * number of has partitions) is same, their partition bounds are
+			 * same and do not need merging. If the maximum number of
+			 * partitions is different for both the joining relations, there's
+			 * high probability that one partition on one side will join with
+			 * multiple partitions on the other side. So exact partition bounds
+			 * match will work in most of the cases. The cases otherwise are
+			 * not so common to spend the effort in coding and planning.
+			 */
 			merged_bounds = NULL;
 
 			break;
@@ -3324,7 +3334,7 @@ compare_range_partitions(int partnatts, FmgrInfo *partsupfuncs,
 
 /*
  * get_merged_range_bounds
- *		Given the bounds of range partitions to be join, determine the range
+ *		Given the bounds of range partitions to be joined, determine the range
  *		bounds of the merged partition produced from the range partitions
  *
  * *merged_lb and *merged_ub are set to the lower and upper bounds of the
@@ -3534,7 +3544,7 @@ partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs,
 		 * the side which finishes earlier has an extra partition with lower
 		 * and upper bounds higher than any other partition of the unfinished
 		 * side. That way we advance the partitions on that side till all of
-		 * them are  exhausted.
+		 * them are exhausted.
 		 */
 		if (outer_part == -1)
 		{
@@ -4093,8 +4103,13 @@ map_and_merge_partitions(PartitionMap *outer_map, PartitionMap *inner_map,
 		if (!outer_merged && !inner_merged)
 		{
 			/*
-			 * Note that we will fix the larger index that have been added to
-			 * the merged_indexes list so far in fix_merged_indexes().
+			 * Both the inner and outer partitions have an empty partition on
+			 * the other side as their joining partner. But now that each of
+			 * them has found a non-empty joining partner we should re-map
+			 * those to a single partition in the join. We use lower of the
+			 * two indexes to avoid any holes being created by re-mapping.
+			 * Also, it keeps the partition index array in partition bounds
+			 * roughly sorted.
 			 */
 			if (outer_merged_index < inner_merged_index)
 			{
@@ -4165,6 +4180,12 @@ map_and_merge_partitions(PartitionMap *outer_map, PartitionMap *inner_map,
 /*
  * merge_partition_with_dummy
  *
+ * The caller thinks that the partition at the given index does not have a
+ * partition in the other relation or the joining partition is empty. In such a
+ * case we assign a temporary index (indicated by merged flag in the map) for
+ * the resulting partition in the join. In case the given partition finds a
+ * non-empty partner latter we will adjust the mapping again.
+ *
  * *next_index is incremented.
  */
 static int
@@ -4206,8 +4227,8 @@ process_outer_partition(PartitionMap *outer_map,
 
 	/*
 	 * If the inner side has the default partition, the outer partition has to
-	 * be joined with the default partition; try merging them.  Otherwise, we
-	 * should in an outer join, in which case the outer partition has to be
+	 * be joined with the default partition; try merging them.  Otherwise
+	 * it's an outer join, in which case the outer partition has to be
 	 * scanned all the way anyway; if the outer partition is already mapped to
 	 * a merged partition, get it, otherwise create a new merged partition by
 	 * merging the outer partition with a dummy partition.
@@ -4217,9 +4238,11 @@ process_outer_partition(PartitionMap *outer_map,
 		Assert(inner_default >= 0);
 
 		/*
-		 * If the outer side has the default partition as well, we need to
-		 * merge the default partitions (see merge_default_partitions()); give
-		 * up on it.
+		 * If the outer side has the default partition as well, the default
+		 * partition from inner side will have two matching partitions on the
+		 * outer side: the default partition on the outer side and the given
+		 * outer partition. Partitionwise join doesn't handle this scenario
+		 * yet.
 		 */
 		if (outer_has_default)
 			return -1;
@@ -4231,9 +4254,13 @@ process_outer_partition(PartitionMap *outer_map,
 			return -1;
 
 		/*
-		 * If this is a FULL join, the merged partition would act as the
-		 * default partition of the join; record the index in *default_index
-		 * if not done yet.
+		 * If this is a FULL join, both the sides act as outer side. Since the
+		 * inner partition is a default partition, it will have partition key
+		 * values which are not covered by any other partition. In join result
+		 * as well, the resulting partition will hold partition key values that
+		 * no other partition holds. Thus the merged partition would act as the
+		 * default partition of the join; record the index in *default_index if
+		 * not done yet.
 		 */
 		if (jointype == JOIN_FULL)
 		{
@@ -4293,8 +4320,11 @@ process_inner_partition(PartitionMap *outer_map,
 
 		/*
 		 * If the inner side has the default partition as well, we need to
-		 * merge the default partitions (see merge_default_partitions()); give
-		 * up on it.
+		 * merge the default partitions (see merge_default_partitions()). So
+		 * there will two inner partitions, given inner partition and the
+		 * default inner partition, that will map to the default outer
+		 * partition. Partitionwise join does not support this case, so give up
+		 * on it.
 		 */
 		if (inner_has_default)
 			return -1;
@@ -4590,6 +4620,13 @@ merge_default_partitions(PartitionMap *outer_map,
 	}
 	else
 	{
+		/*
+		 * We should have already given up if we found that both the inner and
+		 * outer relations have default partitions and either of them had a
+		 * partition without a matching non-default partition on the other
+		 * side. See process_outer_partition() and process_inner_partition()
+		 * for details.
+		 */
 		Assert(outer_has_default && inner_has_default);
 		Assert(outer_merged_index == -1);
 		Assert(inner_merged_index == -1);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 1545877d8c..bdf8ac4bce 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -720,7 +720,12 @@ typedef struct RelOptInfo
 
 	/* used for partitioned relations */
 	PartitionScheme part_scheme;	/* Partitioning scheme. */
-	int			nparts;			/* number of partitions */
+	int			nparts;			/* number of partitions.
+								 * 0 for a relation with no partitions,
+								 * > 0 indicates actual number of partitions
+								 * -1 for a relation whose number of partitions
+								 *    is not yet known.
+								 */
 	struct PartitionBoundInfoData *boundinfo;	/* Partition bounds */
 	bool		merged;			/* true if partition bounds were created by
 								 * partition_bounds_merge() */

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

Reply via email to