On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote: > Only inner join conditions have equivalence classes associated with > those. Outer join conditions create single element equivalence > classes. So, we can not associate equivalence classes as they are with > partition scheme. If we could do that, it makes life much easier since > checking whether equi-join between all partition keys exist, is simply > looking up equivalence classes that cover joining relations and find > em_member corresponding to partition keys.
OK. > It looks like we should only keep strategy, partnatts, partopfamily > and parttypcoll in PartitionScheme. A partition-wise join between two > relations would be possible if all those match. Yes, I think so. Conceivably you could even exclude partnatts and strategy, since there's nothing preventing a partitionwise join between a list-partitioned table and a range-partitioned table, or between a table range-partitioned on (a) and another range-partitioned on (a, b), but there is probably not much benefit in trying to cover such cases. I think it's reasonable to tell users that this is only going to work when the partitioning strategy is the same and the join conditions include all of the partitioning columns on both sides. > There's a relevant comment in 0006, build_joinrel_partition_info() > (probably that name needs to change, but I will do that once we have > settled on design) > + /* > + * Construct partition keys for the join. > + * > + * An INNER join between two partitioned relations is partition by key > + * expressions from both the relations. For tables A and B > partitioned by a and b > + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a > + * and B.b. > + * > + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with > + * B.b NULL. These rows may not fit the partitioning conditions imposed on > + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. > + * Strictly speaking, partition keys of an OUTER join should include > + * partition key expressions from the OUTER side only. Consider a join > like > + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not > + * include B.b as partition key expression for (AB), it prohibits us from > + * using partition-wise join when joining (AB) with C as there is no > + * equi-join between partition keys of joining relations. But two NULL > + * values are never equal and no two rows from mis-matching partitions can > + * join. Hence it's safe to include B.b as partition key expression for > + * (AB), even though rows in (AB) are not strictly partitioned by B.b. > + */ > > I think that also needs to be reviewed carefully. The following passage from src/backend/optimizer/README seems highly relevant: === The planner's treatment of outer join reordering is based on the following identities: 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) = (A innerjoin C on (Pac)) leftjoin B on (Pab) where Pac is a predicate referencing A and C, etc (in this case, clearly Pac cannot reference B, or the transformation is nonsensical). 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) = (A leftjoin C on (Pac)) leftjoin B on (Pab) 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) = A leftjoin (B leftjoin C on (Pbc)) on (Pab) Identity 3 only holds if predicate Pbc must fail for all-null B rows (that is, Pbc is strict for at least one column of B). If Pbc is not strict, the first form might produce some rows with nonnull C columns where the second form would make those entries null. === In other words, I think your statement that null is never equal to null is a bit imprecise. Somebody could certainly create an operator that is named "=" which returns true in that case, and then they could say, hey, two nulls are equal (when you use that operator). The argument needs to be made in terms of the formal properties of the operator. The relevant logic is in have_partkey_equi_join: + /* Skip clauses which are not equality conditions. */ + if (rinfo->hashjoinoperator == InvalidOid && !rinfo->mergeopfamilies) + continue; Actually, I think the hashjoinoperator test is formally and practically unnecessary here; lower down there is a test to see whether the partitioning scheme's operator family is a member of rinfo->mergeopfamilies, which will certainly fail if we got through this test with rinfo->mergeopfamilies == NIL just on the strength of rinfo->hashjoinoperator != InvalidOid. So you can just bail out if rinfo->mergeopfamilies == NIL. But the underlying point here is that the only thing you really know about the function is that it's got to be a strategy-3 operator in some btree opclass; if that guarantees strictness, then so be it -- but I wasn't able to find anything in the code or documentation off-hand that supports that contention, so we might need to think a bit more about why (or if) this is guaranteed to be true. > Partition-wise joins > may be happy including partition keys from all sides, but > partition-wise aggregates may not be, esp. when pushing complete > aggregation down to partitions. In that case, rows with NULL partition > key, which falls on nullable side of join, will be spread across > multiple partitions. Proabably, we should separate nullable and > non-nullable partition key expressions. I don't think I understand quite what you're getting at here. Can you spell this out in more detail? To push an aggregate down to partitions, you need the grouping key to match the applicable partition key, and the partition key shouldn't allow nulls in more than one place. Now I think your point may be that outer join semantics could let them creep in there, e.g. SELECT b.x, sum(a.y) FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a good test case for partitionwise aggregate. I'd be inclined to think that we should just give up on partitionwise aggregate in such cases; it's not worth trying to optimize such a weird query, at least IMHO. (Does this sort of case ever happen with joins? I think not, as long as the join operator is strict.) I spent some time thinking about this patch set today and I don't see that there's much point in committing any more of this to v10. I think that 0001 and 0002 are probably committable or very close at this point. However, 0001 is adding more complexity than I think is warranted until we're actually ready to commit the feature that uses it, and 0002 is so small that committing isn't really going to smooth future development much. 0003-0009 are essentially all one big patch that will have to be committed together. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers