Triggered by Michael mentioning subqueries, I ended up trying

explain
  select d.short_id, mv.timestamp, mv.I64_01
  from device d, device_configuration dc, measurement_value mv
  where mv.device_id = d.short_id and dc.device_id = d.id and dc.latest = true and
        mv.timestamp > '2020-11-06'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn where tzn.name = dc.timezone) and
        mv.timestamp < '2020-11-07'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn where tzn.name = dc.timezone)

==>
Nested Loop  (cost=25.85..84540074.64 rows=16996885 width=20)
  ->  Seq Scan on device_configuration dc  (cost=0.00..470.01 rows=615 width=30)
        Filter: latest
  ->  Nested Loop  (cost=25.85..137027.83 rows=43494 width=36)
        ->  Index Scan using device_pkey on device d  (cost=0.28..7.23 rows=1 width=20)
              Index Cond: (id = dc.device_id)
        ->  Index Scan using measurement_values_pkey on measurement_value mv  (cost=25.58..136585.66 rows=43494 width=20)
              Index Cond: ((device_id = d.short_id) AND ("timestamp" > ('2020-11-06 00:00:00'::timestamp without time zone - (SubPlan 1))) AND ("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - (SubPlan 2))))
              SubPlan 1
                ->  Function Scan on pg_timezone_names  (cost=0.00..12.50 rows=5 width=16)
                      Filter: (name = dc.timezone)
              SubPlan 2
                ->  Function Scan on pg_timezone_names pg_timezone_names_1  (cost=0.00..12.50 rows=5 width=16)
                      Filter: (name = dc.timezone)


Now returns the 320K rows in less than 5 sec.

Until now I was convinced that correlated subqueries and joins are equivalent. I
guess I was wrong :). I do wonder how stable this plan will be, though.
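
If the plan does drift, I suppose the simplest check is to rerun the same
statement with the analyze/buffers options and compare the estimates against
the actual row counts (a minimal sketch, just the query from above):

explain (analyze, buffers)
  select d.short_id, mv.timestamp, mv.I64_01
  from device d, device_configuration dc, measurement_value mv
  where mv.device_id = d.short_id and dc.device_id = d.id and dc.latest = true and
        mv.timestamp > '2020-11-06'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn where tzn.name = dc.timezone) and
        mv.timestamp < '2020-11-07'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn where tzn.name = dc.timezone)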

Peter

> On 10 Nov 2020, at 09:06, Peter Coppens <peter.copp...@datylon.com> wrote:
> 
> Pavel
> 
> Tx for the tip. But given that the index is used when the I64_01 column is
> removed from the select list, I guess the cast is not likely to be the cause.
> 
> Like so
> 
> explain
>   select d.short_id, mv.timestamp --,mv.I64_01
>   from device d, device_configuration dc, measurement_value mv, pg_timezone_names tzn
>   where mv.device_id = d.short_id and dc.device_id = d.id and dc.latest = true and dc.timezone = tzn.name and
>         mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and
>         mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset
> 
> 
> ==>
> 
> Nested Loop  (cost=1.13..6217004.08 rows=60425437 width=12)
>   ->  Nested Loop  (cost=0.56..21334.84 rows=2186 width=20)
>         Join Filter: (dc.timezone = pg_timezone_names.name)
>         ->  Nested Loop  (cost=0.56..7497.34 rows=615 width=18)
>               ->  Index Scan using device_short_id_key on device d  (cost=0.28..2423.90 rows=683 width=20)
>               ->  Index Scan using device_configuration_device_latest_idx on device_configuration dc  (cost=0.28..7.42 rows=1 width=30)
>                     Index Cond: ((device_id = d.id) AND (latest = true))
>                     Filter: latest
>         ->  Function Scan on pg_timezone_names  (cost=0.00..10.00 rows=1000 width=48)
>   ->  Index Only Scan using measurement_values_pkey on measurement_value mv  (cost=0.57..2399.33 rows=43492 width=12)
>         Index Cond: ((device_id = d.short_id) AND ("timestamp" > ('2020-11-06 00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)) AND ("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)))
> 
> 
> Peter
> 
>> On 10 Nov 2020, at 08:25, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>> 
>> 
>> 
>> On Tue 10 Nov 2020 at 8:18, Peter Coppens <peter.copp...@datylon.com> wrote:
>> Michael
>> 
>> Many thanks for spending your time on this. Unfortunately your alternative
>> does not help (see execution plan below).
>> 
>> Still a sequential scan on the complete table. I have tried many
>> alternatives, and somehow whenever I add a column that is not in the index
>> (I64_01) the optimizer decides not to use the index. If I remove that
>> column, the index is used. I guess it estimates that the extra indirection
>> from index pages to the row pages is more costly than scanning the 168M
>> records. Pretty sure it's not, but I cannot explain that to the stubborn
>> thing :)
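>> 
>> (One thing I have not tried yet is a covering index, so an index-only scan
>> could return I64_01 without touching the heap. A minimal sketch, assuming
>> PostgreSQL 11+ for INCLUDE; the index name here is made up:
>> 
>> create index measurement_value_device_ts_i64_idx
>>     on measurement_value (device_id, "timestamp")
>>     include (I64_01);
>> 
>> Whether the planner would then prefer it over the seq scan is of course
>> another matter.)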
>> 
>> Btw, thanks for the >= tip (I was aware of it)
>> 
>> Wkr,
>> 
>> Peter
>> 
>> 
>> Hash Join  (cost=683.93..7270857.46 rows=458127 width=20)
>>   Hash Cond: (mv_inner.device_id = d.short_id)
>>   Join Filter: ((mv_inner."timestamp" > ('2020-11-06 00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)) AND (mv_inner."timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)))
>>   ->  Seq Scan on measurement_value mv_inner  (cost=0.00..7166797.33 rows=1287989 width=1006)
>>         Filter: (("timestamp" > '2020-11-05 00:00:00'::timestamp without time zone) AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without time zone))
>> 
>> When you see a cast in a filter, you should check that the types match. With
>> some exceptions, Postgres uses an index only when the filtered value has the
>> same type as the column.
>> 
>> Maybe there is an inconsistency between timestamp with time zone and
>> timestamp without time zone.
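>> 
>> A quick way to check the declared column type against the catalogs (a
>> minimal sketch; adjust the table and column names to yours):
>> 
>> select attname, atttypid::regtype
>>   from pg_attribute
>>  where attrelid = 'measurement_value'::regclass
>>    and attname = 'timestamp';
>> 
>> If that returns a type different from the one the literals are cast to,
>> that mismatch would explain the cast in the filter.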
>> 
>> Regards
>> 
>> Pavel
>> 
>> 
>>   ->  Hash  (cost=656.61..656.61 rows=2186 width=20)
>>         ->  Hash Join  (cost=77.87..656.61 rows=2186 width=20)
>>               Hash Cond: (dc.timezone = pg_timezone_names.name)
>>               ->  Hash Join  (cost=55.37..533.83 rows=615 width=18)
>>                     Hash Cond: (dc.device_id = d.id)
>>                     ->  Seq Scan on device_configuration dc  (cost=0.00..470.01 rows=615 width=30)
>>                           Filter: latest
>>                     ->  Hash  (cost=46.83..46.83 rows=683 width=20)
>>                           ->  Seq Scan on device d  (cost=0.00..46.83 rows=683 width=20)
>>               ->  Hash  (cost=10.00..10.00 rows=1000 width=48)
>>                     ->  Function Scan on pg_timezone_names  (cost=0.00..10.00 rows=1000 width=48)
>> 
>> 
>> 
>>> On 10 Nov 2020, at 01:15, Michael Lewis <mle...@entrata.com> wrote:
>>> 
>>> On Mon, Nov 9, 2020 at 1:11 PM Peter Coppens <peter.copp...@datylon.com> wrote:
>>> Adding tzn.utc_offset means the execution plan no longer considers using
>>> the index on the measurement_value table. Is there any way the SQL can be
>>> rewritten so that the index is used? Or any other solution so that the
>>> query with the timezone offset returns in a comparable time?
>>> 
>>> I am not aware of a best practice to handle this. Your where condition on
>>> mv.timestamp now depends on several joins to do filtering that used to be
>>> a static range, which could be scanned in the index as the first node of
>>> the plan. I have sometimes used a sub-query with a broader condition that
>>> allows the use of the index, and then fully reduced the set later.
>>> Something like this-
>>> 
>>> select d.short_id, mv.timestamp, mv.I64_01
>>>   from device d, device_configuration dc, (
>>>     select mv_inner.*
>>>       from measurement_value AS mv_inner
>>>       where mv_inner.timestamp > '2020-11-06'::timestamp - interval '1 day'
>>>         and mv_inner.timestamp < '2020-11-07'::timestamp + interval '1 day'
>>>       offset 0 /* to prevent in-lining the join to the outside set */
>>>   ) mv, pg_timezone_names tzn
>>>   where mv.device_id = d.short_id and dc.device_id = d.id and
>>>         dc.latest = true and dc.timezone = tzn.name and
>>>         mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and
>>>         mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset
>>> 
>>> By the way, it seems a little odd to be exclusive at both the start and
>>> the end. I'd usually expect timestamp >= start_date and timestamp <
>>> end_date + interval '1 day' to fully capture a 24 hour period. Right now,
>>> you are excluding any data that happens to have a timestamp value with
>>> .000000 seconds (midnight exactly).
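>>> 
>>> In other words, the usual half-open pattern (ignoring the utc_offset
>>> arithmetic for a moment, just to show the boundary handling):
>>> 
>>> where mv.timestamp >= '2020-11-06'::timestamp
>>>   and mv.timestamp <  '2020-11-06'::timestamp + interval '1 day'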
>> 
> 
