Hi all, Monday RhodiumToad/Andrew Gierth and I tried to debug a plan of raptelan (CCed) getting rather strange plans. After trawling through some unrelated stuff we diagnosed that the problem were some rather strange estimates.
I managed to extract a reproducable, simple testcase. Our analysis was that a recent backported change causes the strange estimates: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f5d2fe3029b181fe773a02f1d4b34624c357634 The problem lies in eqjoinsel_semi's behaviour if it doesn't find MCVs and cannot rely on the ndistinct estimate of a lower node. If one of both sides of a semijoin doesn't have a sensible estimate it just assumes a selectivity of 0.5 which will often overestimate That change is pretty bad because - as seen in the example below - it leads to absurd rowcounts. The approach I quickly tried was to use the underlying relations rows as substitute ndistinct estimate. For those examples that seems to work rather well. Very likely its sensible though to check whether those values actually make sense before really using them ;) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index da638f8..7117978 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -2471,15 +2471,22 @@ eqjoinsel_semi(Oid operator, */ double nullfrac1 = stats1 ? stats1->stanullfrac : 0.0; - if (!isdefault1 && !isdefault2) + if (isdefault1 && isdefault2) { + selec = 0.5 * (1.0 - nullfrac1); + } + else{ + if(isdefault1) + nd1 = Max(nd1, vardata1->rel->rows * (1.0 - nullfrac1)); + + if(isdefault2) + nd2 = Max(nd2, vardata2->rel->rows); + if (nd1 <= nd2 || nd2 < 0) selec = 1.0 - nullfrac1; else selec = (nd2 / nd1) * (1.0 - nullfrac1); } - else - selec = 0.5 * (1.0 - nullfrac1); } if (have_mcvs1) Whats your opinion on this? Andres Testcase: /* test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT DISTINCT id FROM b WHERE id < 5000); QUERY PLAN -------------------------------------------------------------------------------------------- Hash Join (cost=315.13..2782.56 rows=50000 width=9) Hash Cond: (a.id = b.id) -> Seq Scan on a (cost=0.00..1540.00 rows=100000 width=9) -> Hash (cost=249.59..249.59 rows=5243 width=4) -> Unique (cost=0.00..197.16 rows=5243 width=4) -> Index Only Scan using b_pkey on b (cost=0.00..184.06 rows=5243 width=4) Index Cond: (id < 5000) (7 rows) Time: 4.016 ms test_raptelan=# EXPLAIN WITH foo AS (SELECT * FROM b WHERE id < 5000) SELECT * FROM a WHERE a.id IN (SELECT id FROM foo); QUERY PLAN ------------------------------------------------------------------------------ Nested Loop (cost=302.02..486.74 rows=50000 width=9) CTE foo -> Index Scan using b_pkey on b (cost=0.00..184.06 rows=5243 width=10) Index Cond: (id < 5000) -> HashAggregate (cost=117.97..119.97 rows=200 width=4) -> CTE Scan on foo (cost=0.00..104.86 rows=5243 width=4) -> Index Scan using a_pkey on a (cost=0.00..0.90 rows=1 width=9) Index Cond: (id = foo.id) (8 rows) Time: 2.636 ms test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT id FROM b WHERE id < 5000); QUERY PLAN -------------------------------------------------------------------------------------- Hash Semi Join (cost=249.59..2184.02 rows=5243 width=9) Hash Cond: (a.id = b.id) -> Seq Scan on a (cost=0.00..1540.00 rows=100000 width=9) -> Hash (cost=184.06..184.06 rows=5243 width=4) -> Index Only Scan using b_pkey on b (cost=0.00..184.06 rows=5243 width=4) Index Cond: (id < 5000) (6 rows) Time: 2.459 ms */ -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs