I was recently nudged to describe an optimisation of merge joins/sorts. Rather than describe that, I've looked at the general case.
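For concreteness: the examples below can be reproduced against a small test schema along these lines. I've reconstructed it from the EXPLAIN output further down, so the exact row counts and values are my guesses rather than the original test case.

  CREATE TABLE dim (col integer, other integer);
  CREATE TABLE fact (col integer);
  -- one dim row passes the qual "other = 1", one does not
  INSERT INTO dim VALUES (1, 1), (2, 2);
  -- every fact row joins to the dim row that has other = 1
  INSERT INTO fact SELECT 1 FROM generate_series(1, 100000);
  CREATE INDEX fact_col_idx ON fact (col);
  ANALYZE dim;
  ANALYZE fact;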
There are some additional implications we may make when joining tables... a particularly interesting one is that

  SELECT *
  FROM Fact JOIN Dim on Fact.col = Dim.col

can be rewritten as

  SELECT *
  FROM Fact JOIN Dim on Fact.col = Dim.col
  WHERE
   (Fact.col BETWEEN
     (SELECT min(col) FROM Dim) AND (SELECT max(col) FROM Dim))
  AND
   (Dim.col BETWEEN
     (SELECT min(col) FROM Fact) AND (SELECT max(col) FROM Fact))

which also works similarly for anti/semi-joins.

If we have indexes on Fact.col and Dim.col then these additional quals can be derived cheaply at run-time and could have an important effect on optimisation.

1) With various kinds of index, we would be able to use these implied quals to further restrict the scan. Perhaps that doesn't sound very interesting, but it is very important when solving an "outside-in" join on a star schema, such as...

  SELECT count(*)
  FROM Fact JOIN Dim on Fact.col = Dim.col
  WHERE Dim.other = 1;

since there is no qual that can be applied directly to the Fact table, causing us to scan the entire table. We can rewrite this query as

  SELECT count(*)
  FROM Fact JOIN Dim on Fact.col = Dim.col
  WHERE Dim.other = 1
  AND (Fact.col BETWEEN
        (SELECT min(col) FROM Dim WHERE Dim.other = 1)
    AND (SELECT max(col) FROM Dim WHERE Dim.other = 1));

Note that the implied qual on the Dim table has been dropped as uninteresting. This is because we can calculate the cost and potential benefit of applying the rewrite, allowing us to discard one or both implied clauses.

Note also that this has nothing to do with join order. This is solely about making inferences using the join quals between any two tables.

2) We calculate the join selectivity by comparing the MFVs of the join columns on the tables being joined. ISTM that we could use the min() and max() values to refine that selectivity estimate, which can often be wrong when based on the MFVs alone.

- - -

The current planner doesn't add these predicates automatically, but if it did, it would come up with the following slightly sub-optimal plan...

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND (Fact.col BETWEEN
      (SELECT min(col) FROM Dim WHERE Dim.other = 1)
  AND (SELECT max(col) FROM Dim WHERE Dim.other = 1));

 Aggregate  (cost=31.79..31.80 rows=1 width=0) (actual rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
           ->  Seq Scan on dim dim_1  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                 Filter: (other = 1)
                 Rows Removed by Filter: 1
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
           ->  Seq Scan on dim dim_2  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                 Filter: (other = 1)
                 Rows Removed by Filter: 1
   ->  Merge Join  (cost=1.33..29.09 rows=250 width=0) (actual rows=100000 loops=1)
         Merge Cond: (dim.col = fact.col)
         ->  Sort  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
               Sort Key: dim.col
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                     Filter: (other = 1)
                     Rows Removed by Filter: 1
         ->  Index Only Scan using fact_col_idx on fact  (cost=0.29..24.29 rows=500 width=4) (actual rows=100000 loops=1)
               Index Cond: ((col >= $0) AND (col <= $1))
               Heap Fetches: 100000

which is sub-optimal only because of the mis-estimation of the effect of the min() and max(), so we would still benefit from resolving the parameters to constants before proceeding with the main query.
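Until the planner can do that parameter-folding itself, the effect is easy to mimic by hand: fetch the boundary values in a separate step, then splice them back into the main query as literals. A sketch in psql, assuming the test schema above (the variable names lo and hi are mine; \gset stores the output columns into psql variables):

  -- step 1: resolve the boundaries to constants
  SELECT min(col) AS lo, max(col) AS hi
  FROM Dim WHERE Dim.other = 1 \gset

  -- step 2: the planner now sees plain constants, not $0/$1 parameters
  SELECT count(*)
  FROM Fact JOIN Dim on Fact.col = Dim.col
  WHERE Dim.other = 1
  AND Fact.col BETWEEN :lo AND :hi;

With the data above, this is exactly the "BETWEEN 1 AND 1" form whose plan is shown next.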
A better plan would be

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND Fact.col BETWEEN 1 AND 1;

 Aggregate  (cost=2944.04..2944.05 rows=1 width=0) (actual rows=1 loops=1)
   ->  Hash Join  (cost=1.04..2819.04 rows=50000 width=0) (actual rows=100000 loops=1)
         Hash Cond: (fact.col = dim.col)
         ->  Seq Scan on fact  (cost=0.00..1943.00 rows=100000 width=4) (actual rows=100000 loops=1)
               Filter: ((col >= 1) AND (col <= 1))
         ->  Hash  (cost=1.02..1.02 rows=1 width=4) (actual rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                     Filter: (other = 1)
                     Rows Removed by Filter: 1

but we can probably live with that, as we do with other dynamic index plans.

So when can we use this? The additional cost of adding this to the query is

* additional qual: 2 * cpu_operator_cost * rows
* getting values: 2 * the cost of getting one row from an index

Reduced-cost plans happen when the additional quals reduce the total cost of the plan/scan, because they significantly reduce the selectivity and possibly because they enable an index plan. We don't know the reduction that is possible, if any, so I suggest we look at the base rel plan near the bottom of set_plain_rel_pathlist(). If the cost of scanning the base rel is high in comparison with the cost of the additional quals, then we add them: if there turns out to be no benefit we lose only a small amount, yet if there is a benefit it could be substantial.

Proposal: in set_plain_rel_pathlist(), if the total cost is already greater than 100x the cost of adding the quals, then add the additional implied quals and look again for index/seq scan plans.

It is possible that this could be implemented for 9.5, if it is considered a good idea. Good?

- - -

The current proposal ends there, but there is a further optimisation that allows us to remove the join altogether if

* there is an FK between Fact and Dim
* the join is an anti or semi join
* there are no values of the join column on Dim between the min() and max() that are not returned from Dim

It seems a little less practical, as will become clear...

This optimisation allows

  SELECT count(*)
  FROM Fact JOIN Dim on Fact.col = Dim.col
  WHERE Dim.other = 1;

to be written as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact
WHERE
 (Fact.col BETWEEN
   (SELECT min(col) FROM Dim WHERE Dim.other = 1)
   AND (SELECT max(col) FROM Dim WHERE Dim.other = 1))
AND /* when */
 (0 = (SELECT count(*)
       FROM Dim
       WHERE (Dim.col BETWEEN
               (SELECT min(col) FROM Dim WHERE Dim.other = 1)
               AND (SELECT max(col) FROM Dim WHERE Dim.other = 1))
       AND NOT (Dim.other = 1)));

which looks unbelievably complex, but performs quite well.

For those with a long memory, you may recognise my Proof Join from 2007. Note that my Proof Join contained the (mistaken) requirement that the key values be contiguous, whereas in fact that doesn't matter, as shown. This has also been described as the BETWEEN join rewrite (Abadi et al 2008).

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services