On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kai...@heterodb.com> wrote: > We can consider the table join ptable X t1 above is equivalent to: > (ptable_p0 + ptable_p1 + ptable_p2) X t1 > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1) > It returns an equivalent result, however, rows are already reduced by HashJoin > in the individual leaf of Append, so CPU-cycles consumed by Append node can > be cheaper. > > On the other hands, it has a downside because t1 must be read 3 times and > hash table also must be built 3 times. It increases the expected cost, > so planner > may not choose the asymmetric partition-wise join plan.
What if you include the partition constraint as a filter on t1? So you get: ptable X t1 = (ptable_p0 X (σ hash(dist)%4=0 (t1))) + (ptable_p1 X (σ hash(dist)%4=1 (t1))) + (ptable_p2 X (σ hash(dist)%4=2 (t1))) + (ptable_p3 X (σ hash(dist)%4=3 (t1))) Pros: 1. The hash tables will not contain unnecessary junk. 2. You'll get the right answer if t1 is on the outer side of an outer join. 3. If this runs underneath a Parallel Append and t1 is big enough then workers will hopefully cooperate and do a synchronised scan of t1. 4. The filter might enable a selective and efficient plan like an index scan. Cons: 1. The filter might not enable a selective and efficient plan, and therefore cause extra work. (It's a little weird in this example because don't usually see hash functions in WHERE clauses, but that could just as easily be dist BETWEEN 1 AND 99 or any other partition constraint.) > One idea I have is, sibling HashJoin shares a hash table that was built once > by any of the sibling Hash plan. Right now, it is not implemented yet. Yeah, I've thought a little bit about that in the context of Parallel Repartition. I'm interested in combining intra-node partitioning (where a single plan node repartitions data among workers on the fly) with inter-node partitioning (like PWJ, where partitions are handled by different parts of the plan, considered at planning time); you finish up needing to have nodes in the plan that 'receive' tuples for each partition, to match up with the PWJ plan structure. That's not entirely unlike CTE references, and not entirely unlike your idea of somehow sharing the same hash table. I ran into a number of problems while thinking about that, which I should write about in another thread. -- Thomas Munro https://enterprisedb.com