Re: [HACKERS] between not propated into a simple equality join

David G. Johnston Mon, 09 May 2016 21:35:58 -0700

On Mon, May 9, 2016 at 8:53 AM, Benedikt Grundmann <
[email protected]> wrote:


> We just run into a very simple query that the planner does much worse on
> than we thought it would (in production the table in question is ~ 100
> GB).  It surprised us given the planner is generally quite good, so I
> thought I share our surprise
>
> Setup:
>
> postgres_prod@proddb_testing=# select version();[1]
>                                                     version
>
>
> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>  PostgreSQL 9.2.16 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
> 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit
> (1 row)
>
> Time: 69.246 ms
>
> postgres_prod@proddb_testing=# create table toy_data3 (the_date date, i
> int);
> CREATE TABLE
> Time: 67.096 ms
> postgres_prod@proddb_testing=# insert into toy_data3
>
>   (select current_date-(s.idx/1000), s.idx from generate_series(1,1000000)
> as s(idx));
> INSERT 0 1000000
> Time: 1617.483 ms
> postgres_prod@proddb_testing=# create index toy_data_date3 on
> toy_data3(the_date);
> CREATE INDEX
> Time: 660.166 ms
> postgres_prod@proddb_testing=# analyze toy_data3;
> ANALYZE
> Time: 294.984 ms
>
> The bad behavior:
>
> postgres_prod@proddb_testing=# explain analyze
>   select * from (
>    select td1.the_date, td1.i
>     from toy_data3 td1, toy_data3 td2  where td1.the_date = td2.the_date
> and td1.i = td2.i
>   ) foo
>   where the_date between current_date and current_date;
>                                                                    QUERY
> PLAN
>
> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>  Hash Join  (cost=55.49..21980.50 rows=1 width=8) (actual
> time=0.336..179.374 rows=999 loops=1)
>    Hash Cond: ((td2.the_date = td1.the_date) AND (td2.i = td1.i))
>    ->  Seq Scan on toy_data3 td2  (cost=0.00..14425.00 rows=1000000
> width=8) (actual time=0.007..72.510 rows=1000000 lo
>    ->  Hash  (cost=40.44..40.44 rows=1003 width=8) (actual
> time=0.321..0.321 rows=999 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 40kB
>          ->  Index Scan using toy_data_date3 on toy_data3 td1
>  (cost=0.01..40.44 rows=1003 width=8) (actual time=0.018.
>                Index Cond: ((the_date >= ('now'::cstring)::date) AND
> (the_date <= ('now'::cstring)::date))
>  Total runtime: 179.440 ms
> (8 rows)
>
> Time: 246.094 ms
>
> Notice the red.  Which is sad because one would like it to realize that it
> could propagate the index constraint onto td2.  That is on both sides of
> the join do the green.
>
>
FWIW

This is my plan result:
version
PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
4.8.2-19ubuntu1) 4.8.2, 64-bit
All default settings

using "BETWEEN"


 QUERY PLAN
Nested Loop  (cost=0.86..48.91 rows=1 width=8) (actual time=0.042..168.512
rows=999 loops=1)
  ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..8.46
rows=1 width=8) (actual time=0.022..1.388 rows=999 loops=1)
        Index Cond: ((the_date >= ('now'::cstring)::date) AND (the_date <=
('now'::cstring)::date))
  ->  Index Scan using toy_data_date3 on toy_data3 td2  (cost=0.42..40.44
rows=1 width=8) (actual time=0.078..0.160 rows=1 loops=999)
        Index Cond: (the_date = td1.the_date)
        Filter: (td1.i = i)
        Rows Removed by Filter: 998
Planning time: 0.353 ms
Execution time: 169.692 ms

using "="

QUERY PLAN
Hash Join  (cost=49.89..90.46 rows=1 width=8) (actual time=2.320..5.652
rows=999 loops=1)
  Hash Cond: (td1.i = td2.i)
  ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..37.37
rows=967 width=8) (actual time=0.014..1.168 rows=999 loops=1)
        Index Cond: (the_date = ('now'::cstring)::date)
  ->  Hash  (cost=37.37..37.37 rows=967 width=8) (actual time=2.292..2.292
rows=999 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 48kB
        ->  Index Scan using toy_data_date3 on toy_data3 td2
 (cost=0.43..37.37 rows=967 width=8) (actual time=0.008..1.183 rows=999
loops=1)
              Index Cond: (the_date = ('now'::cstring)::date)
Planning time: 0.326 ms
Execution time: 6.673 ms

I was hoping to be able to say more but alas cannot find the words.

I'm surprised by the estimate of 1 rows for the td1 index scan in my 9.5
query - and also why the 9.2 query would choose to sequential scan hash
join in favor of what seems to be a superior index scan nested loop on a
fraction of the table.

The fact that the between doesn't get transitively applied to td2 through
the td1=td2 condition, not as much...though whether the limitation is due
to theory or implementation I do not know.

I do suspect that at least part of the issue is that the computation of
"the_date" makes the two columns highly correlated while the planner
assumes independence.

David J.

Re: [HACKERS] between not propated into a simple equality join

Reply via email to