On Fri, Aug 25, 2023 at 11:03 AM David Rowley <dgrowle...@gmail.com> wrote:
> I'd suggest writing some cost which costs an execution of run-time
> pruning. With LIST and RANGE you probably want something like
> cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
> for the binary search over the sorted datum array. For HASH
> partitions, something like cpu_operator_cost * npartcols once for each
> hashed tuple.
>
> You'll need to then come up with some counter costs to subtract from
> the Append/MergeAppend. This is tricky, as discussed. Just come up
> with something crude for now.
>
> To start with, it could just be as crude as:
>
> total_costs *= (Min(expected_outer_rows, n_append_subnodes) /
> n_append_subnodes);
>
> i.e assume that every outer joined row will require exactly 1 new
> partition up to the total number of partitions. That's pretty much
> worst-case, but it'll at least allow the optimisation to work for
> cases like where the hash table is expected to contain just a tiny
> number of rows (fewer than the number of partitions)
>
> To make it better, you might want to look at join selectivity
> estimation and see if you can find something there to influence
> something better.

I had a go at writing some costing code according to your suggestion;
that's compute_partprune_cost() in the v2 patch.

For the hash side, this function computes the pruning cost as
cpu_operator_cost * LOG2(nparts) * inner_rows for LIST and RANGE, and
cpu_operator_cost * nparts * inner_rows for HASH.

For the Append/MergeAppend side, it first estimates the size of the
outer side that matches, using the same idea we use to estimate the
joinrel size for JOIN_SEMI. It then assumes that each outer joined row
occupies one new partition (the worst case) and computes how much cost
partition pruning can save. If the cost saved on the Append/MergeAppend
side is larger than the pruning cost on the Hash side, we consider
partition pruning a win. (A standalone sketch of this comparison is
included at the end of this message.) Note that this costing logic runs
for each Append-Hash pair, so it copes with the case where we have
multiple join levels.

With this costing logic added, I repeated the worst-case performance
comparison from [1], and here is what I got.

tuples    unpatched    patched    change
10000     44.66        44.37      -0.006493506
20000     52.41        52.29      -0.002289639
30000     61.11        61.12      +0.000163639
40000     67.87        68.24      +0.005451599
50000     74.51        74.75      +0.003221044
60000     82.3         81.55      -0.009113001
70000     87.16        86.98      -0.002065168
80000     93.49        93.89      +0.004278532
90000     101.52       100.83     -0.00679669
100000    108.34       108.56     +0.002030644

So the costing logic successfully avoids performing partition pruning
in the worst case.

I also tested cases where partition pruning is possible, with different
sizes of the hash side.

tuples    unpatched    patched    change
100       36.86        2.4        -0.934888768
200       35.87        2.37       -0.933928074
300       35.95        2.55       -0.92906815
400       36.4         2.63       -0.927747253
500       36.39        2.85       -0.921681781
600       36.32        2.97       -0.918226872
700       36.6         3.23       -0.911748634
800       36.88        3.44       -0.906724512
900       37.02        3.46       -0.906537007
1000      37.25        37.21      -0.001073826

The first nine rows show that the costing logic allows partition
pruning to be performed, and the pruning turns out to be a big win. The
last row shows that partition pruning is disallowed by the costing
logic, because it estimates that no partition can be pruned (we have
1000 partitions in total). So the new costing logic is quite crude and
tends to be very conservative, but it helps avoid the large overhead in
the worst cases. I think this might be a good start toward pushing this
patch forward.

Any thoughts or comments?
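For anyone who wants to see the shape of this comparison without
reading the patch, here is a minimal standalone sketch. It is not code
from the v2 patch: the function names, the hard-coded cpu_operator_cost
value, and the estimates in main() are made up for illustration; it
only mirrors the per-tuple pruning formulas quoted above (LOG2(nparts)
for LIST/RANGE, npartcols for HASH) and the "cost saved vs. pruning
cost" decision described in this message.

#include <math.h>
#include <stdio.h>

#define CPU_OPERATOR_COST 0.0025    /* PostgreSQL's default cpu_operator_cost */

typedef enum
{
    PART_STRATEGY_LIST,
    PART_STRATEGY_RANGE,
    PART_STRATEGY_HASH
} PartStrategy;

/*
 * Pruning cost charged once per hashed (inner) tuple: a binary search
 * over the sorted datum array for LIST/RANGE, one hashing step per
 * partition key column for HASH.
 */
static double
pruning_cost(PartStrategy strategy, int nparts, int npartcols,
             double inner_rows)
{
    if (strategy == PART_STRATEGY_HASH)
        return CPU_OPERATOR_COST * npartcols * inner_rows;

    return CPU_OPERATOR_COST * log2((double) nparts) * inner_rows;
}

/*
 * Crude estimate of the Append/MergeAppend cost that pruning could save:
 * assume every matching outer row lands in its own partition (the worst
 * case), so at most Min(matched_outer_rows, nparts) partitions survive.
 */
static double
append_cost_saved(double append_total_cost, double matched_outer_rows,
                  int nparts)
{
    double      surviving = fmin(matched_outer_rows, (double) nparts);

    return append_total_cost * (1.0 - surviving / (double) nparts);
}

int
main(void)
{
    /* made-up planner estimates, just to exercise the comparison */
    int         nparts = 1000;
    double      inner_rows = 100.0;
    double      matched_outer_rows = 100.0; /* the JOIN_SEMI-style estimate */
    double      append_total_cost = 5000.0;

    double      cost = pruning_cost(PART_STRATEGY_LIST, nparts, 1, inner_rows);
    double      saved = append_cost_saved(append_total_cost,
                                          matched_outer_rows, nparts);

    printf("pruning cost = %.2f, append cost saved = %.2f\n", cost, saved);
    printf("prune at run time? %s\n", saved > cost ? "yes" : "no");
    return 0;
}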
[1] https://www.postgresql.org/message-id/CAMbWs49%2Bp6hBxXJHFiSwOtPCSkAHwhJj3hTpCR_pmMiUUVLZ1Q%40mail.gmail.com

Thanks
Richard
v2-0001-Support-run-time-partition-pruning-for-hash-join.patch