Hello,

I already posted this question to novice mail list and there is no answer yet. 
I've decided to post it again here.

Before posting the question here, I checked the mail list again for the same 
cases and found the message describing the case I started from:  
http://www.postgresql.org/message-id/51304b61.5030...@gmail.com
It looks like there is no answer for that case too.

I have a performance issue on using UNION ALL clause. Optimizer generates 
incorrect plan based on strange estimation of returned rows number. It suppose  
that plan will be correct if this estimation is done correctly.
The following example helps to reproduce the issue:

CREATE TABLE t1 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t1 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t1;
ANALYZE t1;

CREATE TABLE t2 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t2 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t2;
ANALYZE t2;

CREATE TABLE links (c1 INTEGER PRIMARY KEY, descr TEXT);
INSERT INTO links (c1, descr) SELECT b, '2' FROM generate_series(1, 100000) a 
(b);
REINDEX TABLE links;
ANALYZE links;

CREATE TEMP TABLE t3 (ref_id INTEGER);
INSERT INTO t3 (ref_id) VALUES (333333), (666666);
ANALYZE t3;

If I do the following:
EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM t1) t INNER JOIN t3 ON t3.ref_id = 
t.id INNER JOIN links l ON (t.c1 = l.c1);

QUERY PLAN:
Nested Loop  (cost=0.00..18.39 rows=1 width=18) (actual time=0.056..0.056 
rows=0 loops=1)
  ->  Nested Loop  (cost=0.00..17.80 rows=2 width=12) (actual time=0.030..0.047 
rows=2 loops=1)
        ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual 
time=0.007..0.008 rows=2 loops=1)
        ->  Index Scan using t1_pkey on t1  (cost=0.00..8.38 rows=1 width=8) 
(actual time=0.015..0.016 rows=1 loops=2)
              Index Cond: (id = t3.ref_id)
  ->  Index Scan using links_pkey on links l  (cost=0.00..0.28 rows=1 width=6) 
(actual time=0.004..0.004 rows=0 loops=2)
        Index Cond: (c1 = t1.c1)
"Total runtime: 0.118 ms"

It uses correctly index scan on "links" table and works normal.

If I do the following:
EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t 
INNER JOIN t3 ON t3.ref_id = t.id INNER JOIN links l ON (t.c1 = l.c1);

QUERY PLAN:
Hash Join  (cost=2693.00..3127.58 rows=20000 width=18) (actual 
time=47.158..47.158 rows=0 loops=1)
  Hash Cond: (t1.c1 = l.c1)
  ->  Nested Loop  (cost=0.00..34.58 rows=20000 width=12) (actual 
time=0.049..0.101 rows=4 loops=1)
        ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual 
time=0.010..0.011 rows=2 loops=1)
        ->  Append  (cost=0.00..16.76 rows=2 width=8) (actual time=0.022..0.038 
rows=2 loops=2)
              ->  Index Scan using t1_pkey on t1  (cost=0.00..8.38 rows=1 
width=8) (actual time=0.019..0.022 rows=1 loops=2)
                    Index Cond: (id = t3.ref_id)
              ->  Index Scan using t2_pkey on t2  (cost=0.00..8.38 rows=1 
width=8) (actual time=0.011..0.012 rows=1 loops=2)
                    Index Cond: (id = t3.ref_id)
  ->  Hash  (cost=1443.00..1443.00 rows=100000 width=6) (actual 
time=46.988..46.988 rows=100000 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 3711kB
        ->  Seq Scan on links l  (cost=0.00..1443.00 rows=100000 width=6) 
(actual time=0.015..17.443 rows=100000 loops=1)
"Total runtime: 47.246 ms"

It uses sequence scan on "links" table because of strange estimation on 
selection with using UNION ALL. It is very slow. I have more that million rows 
in each tables.

Strange estimation is shown bellow:
EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t 
INNER JOIN t3 ON t3.ref_id = t.id;

QUERY PLAN:
Nested Loop  (cost=0.00..34.58 rows=20000 width=12) (actual time=0.049..0.099 
rows=4 loops=1)
  ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual 
time=0.009..0.010 rows=2 loops=1)
  ->  Append  (cost=0.00..16.76 rows=2 width=8) (actual time=0.023..0.038 
rows=2 loops=2)
        ->  Index Scan using t1_pkey on t1  (cost=0.00..8.38 rows=1 width=8) 
(actual time=0.021..0.022 rows=1 loops=2)
              Index Cond: (id = t3.ref_id)
        ->  Index Scan using t2_pkey on t2  (cost=0.00..8.38 rows=1 width=8) 
(actual time=0.013..0.013 rows=1 loops=2)
              Index Cond: (id = t3.ref_id)
"Total runtime: 0.164 ms"

The strange estimation is seen at the first line:
Nested Loop  (cost=0.00..34.58 rows=20000 width=12) (actual time=0.049..0.099 
rows=4 loops=1)
At the second line we can see that only two rows will be selected from joined 
table and 4 rows from two tables appended with UNION ALL, how it estimates that 
it should handle 20000 rows for that query?
I suppose, that if it estimates 4 rows will be handled then it will use index 
scan on "links" table and will work fast.

With best regards, Aleksey Kuznetsov.

Reply via email to