[BUGS] Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

Andres Freund Wed, 11 Jan 2012 12:11:08 -0800

Hi all,

Monday RhodiumToad/Andrew Gierth and I tried to debug a plan of raptelan 
(CCed) getting rather strange plans. After trawling through some unrelated 
stuff we diagnosed that the problem were some rather strange estimates.


I managed to extract a reproducable, simple testcase. Our analysis was that a 
recent backported change causes the strange estimates:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f5d2fe3029b181fe773a02f1d4b34624c357634

The problem lies in eqjoinsel_semi's behaviour if it doesn't find MCVs and 
cannot rely on the ndistinct estimate of a lower node. If one of both sides of 
a semijoin doesn't have a sensible estimate it just assumes a selectivity of 
0.5 which will often overestimate 
That change is pretty bad because - as seen in the example below - it leads to 
absurd rowcounts.

The approach I quickly tried was to use the underlying relations rows as 
substitute ndistinct estimate. For those examples that seems to work rather 
well. Very likely its sensible though to check whether those values actually 
make sense before really using them ;)

diff --git a/src/backend/utils/adt/selfuncs.c 
b/src/backend/utils/adt/selfuncs.c
index da638f8..7117978 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2471,15 +2471,22 @@ eqjoinsel_semi(Oid operator,
                 */
                double          nullfrac1 = stats1 ? stats1->stanullfrac : 
0.0;
 
-               if (!isdefault1 && !isdefault2)
+               if (isdefault1 && isdefault2)
                {
+                       selec = 0.5 * (1.0 - nullfrac1);
+               }
+               else{
+                       if(isdefault1)
+                               nd1 = Max(nd1, vardata1->rel->rows * (1.0 - 
nullfrac1));
+
+                       if(isdefault2)
+                               nd2 = Max(nd2, vardata2->rel->rows);
+
                        if (nd1 <= nd2 || nd2 < 0)
                                selec = 1.0 - nullfrac1;
                        else
                                selec = (nd2 / nd1) * (1.0 - nullfrac1);
                }
-               else
-                       selec = 0.5 * (1.0 - nullfrac1);
        }
 
        if (have_mcvs1)


Whats your opinion on this?

Andres

Testcase:

/*

test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT DISTINCT id FROM 
b WHERE id < 5000);
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Hash Join  (cost=315.13..2782.56 rows=50000 width=9)
   Hash Cond: (a.id = b.id)
   ->  Seq Scan on a  (cost=0.00..1540.00 rows=100000 width=9)
   ->  Hash  (cost=249.59..249.59 rows=5243 width=4)
         ->  Unique  (cost=0.00..197.16 rows=5243 width=4)
               ->  Index Only Scan using b_pkey on b  (cost=0.00..184.06 
rows=5243 width=4)
                     Index Cond: (id < 5000)
(7 rows)

Time: 4.016 ms

test_raptelan=# EXPLAIN WITH foo AS (SELECT * FROM b WHERE id < 5000) SELECT * 
FROM a WHERE a.id IN (SELECT id FROM foo);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=302.02..486.74 rows=50000 width=9)
   CTE foo
     ->  Index Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 width=10)
           Index Cond: (id < 5000)
   ->  HashAggregate  (cost=117.97..119.97 rows=200 width=4)
         ->  CTE Scan on foo  (cost=0.00..104.86 rows=5243 width=4)
   ->  Index Scan using a_pkey on a  (cost=0.00..0.90 rows=1 width=9)
         Index Cond: (id = foo.id)
(8 rows)

Time: 2.636 ms

test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT id FROM b WHERE 
id < 5000);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Hash Semi Join  (cost=249.59..2184.02 rows=5243 width=9)
   Hash Cond: (a.id = b.id)
   ->  Seq Scan on a  (cost=0.00..1540.00 rows=100000 width=9)
   ->  Hash  (cost=184.06..184.06 rows=5243 width=4)
         ->  Index Only Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 
width=4)
               Index Cond: (id < 5000)
(6 rows)

Time: 2.459 ms

*/

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

[BUGS] Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

Reply via email to