Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Kevin Grittner
Caio Casimiro  wrote:

> I have one query running at ~ 7 seconds and I would like to know
> if it's possible to make it run faster, since this query runs many
> times in my experiment.

>   Buffers: shared hit=2390 read=32778

> Total runtime: 24066.145 ms

> effective_cache_size = 2GB

> it seems the problem is with the 'tweet' table.

The EXPLAIN ANALYZE output shows it taking 24 seconds, 8.9 seconds
of which is in accessing the tweet_topic table and 15.1 seconds in
accessing the tweet table.  It looks like you have a painfully low
cache hit ratio.  The plan looks reasonable to me; it looks like
you need more RAM to cache data if you want better speed.
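
One way to check the cache hit ratio (a sketch using the standard
PostgreSQL statistics views; the table names are taken from this thread):

```sql
-- Database-wide buffer cache hit ratio since the last stats reset.
SELECT sum(blks_hit)::float8 / nullif(sum(blks_hit) + sum(blks_read), 0)
       AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Per-table heap hits vs. reads for the tables involved in the query.
SELECT relname, heap_blks_hit, heap_blks_read
FROM pg_statio_user_tables
WHERE relname IN ('tweet', 'tweet_topic');
```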

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Elliot

On 2013-11-04 13:56, Kevin Grittner wrote:



There's also an index scan that turns up 1.8 million rows, but only 
1,600 of them wind up in the final output. I'd start with restating the 
"user_id in (select followed_id ...)" as a join against the relationship 
table. The planner is filtering first on the tweet time, but that 
doesn't reduce the set of tweets down very well. Assuming that the user 
being looked up doesn't follow a large proportion of other users, I'd 
figure that reducing the set first by followed users should be quicker.
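
A sketch of that restatement, using the table and column names from the
original query (N, D1, and D2 are the placeholders from the thread):

```sql
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM relationship r
  JOIN tweet t ON t.user_id = r.followed_id   -- reduce by followed users first
  JOIN tweet_topic tt ON tt.tweet_id = t.id
WHERE r.follower_id = N
  AND t.creation_time BETWEEN 'D1' AND 'D2'
ORDER BY tt.tweet_id;
```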






Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Jeff Janes
On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro wrote:

> Hello all,
>
> I have one query running at ~ 7 seconds and I would like to know if it's
> possible to make it run faster, since this query runs many times in my
> experiment.
>


Do you mean you want it to be fast because it runs many times, or that you
want it to become fast after it runs many times (i.e. once the data is
fully cached)?  The plan you show takes 24 seconds, not 7 seconds.


>
> Basically the query return the topics of tweets published by users that
> the user N follows and that are published between D1 and D2.
>
> *Query*:
>
> SELECT tt.tweet_id, tt.topic, tt.topic_value
> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id =
> t.id
> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
> (SELECT followed_id FROM relationship WHERE follower_id = N)
> ORDER BY tt.tweet_id;
>


I don't know if this affects the plan at all, but it is silly to do a left
join to "tweet" when the WHERE clause has conditions that can't be
satisfied with a null row.  Also, you could try changing the IN-list to an
EXISTS subquery.
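
For example (a sketch with the same table names; whether the planner
actually picks a better plan would need to be verified with EXPLAIN
ANALYZE):

```sql
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
  JOIN tweet AS t ON tt.tweet_id = t.id  -- plain join; the WHERE clause rejects NULL-extended rows anyway
WHERE t.creation_time BETWEEN 'D1' AND 'D2'
  AND EXISTS (SELECT 1
              FROM relationship r
              WHERE r.follower_id = N
                AND r.followed_id = t.user_id)
ORDER BY tt.tweet_id;
```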

Are there patterns to D1 and D2 that could help the caching?  For
example, are they both usually in the just-recent past?


Indexes:
> "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
> "tweet_creation_time_index" btree (creation_time)
> "tweet_id_index" hash (id)
> "tweet_ios_index" btree (id, user_id, creation_time)
> "tweet_retweeted_idx" hash (retweeted)
> "tweet_user_id_creation_time_index" btree (creation_time, user_id)
> "tweet_user_id_index" hash (user_id)
>


Are all of those indexes important?  If your table is heavily
updated/inserted, which I assume it is, maintaining those indexes is going
to take up precious RAM that could probably be better used elsewhere.

Cheers,

Jeff


Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Caio Casimiro
Thank you very much for your answers guys!


On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes  wrote:

> On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro 
>  wrote:
>
>> Hello all,
>>
>> I have one query running at ~ 7 seconds and I would like to know if it's
>> possible to make it run faster, since this query runs many times in my
>> experiment.
>>
>
>
> Do you mean you want it to be fast because it runs many times, or that you
> want it to become fast after it runs many times (i.e. once the data is
> fully cached)?  The plan you show takes 24 seconds, not 7 seconds.
>

I want it to be fast because it runs many times. I have an experiment that
evaluates recommendation algorithms for a set of Twitter users. This query
returns recommendation candidates, so it is called many times for
different users and time intervals.


>
>
>>
>> Basically the query return the topics of tweets published by users that
>> the user N follows and that are published between D1 and D2.
>>
>> *Query*:
>>
>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id =
>> t.id
>> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
>> (SELECT followed_id FROM relationship WHERE follower_id = N)
>> ORDER BY tt.tweet_id;
>>
>
>
> I don't know if this affects the plan at all, but it is silly to do a left
> join to "tweet" when the WHERE clause has conditions that can't be
> satisfied with a null row.  Also, you could try changing the IN-list to an
> EXISTS subquery.
>

Sorry for my ignorance, but I don't understand the issue with the left
join. Could you explain more?


> Are there patterns to D1 and D2 that could help the caching?  For
> example, are they both usually in the just-recent past?
>
The only pattern is that it is always a one day interval, e.g. D1 =
'2013-05-01' and  D2 = '2013-05-02'.

>
>
> Indexes:
>> "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
>> "tweet_creation_time_index" btree (creation_time)
>> "tweet_id_index" hash (id)
>> "tweet_ios_index" btree (id, user_id, creation_time)
>> "tweet_retweeted_idx" hash (retweeted)
>> "tweet_user_id_creation_time_index" btree (creation_time, user_id)
>> "tweet_user_id_index" hash (user_id)
>>
>
>
> Are all of those indexes important?  If your table is heavily
> updated/inserted, which I assume it is, maintaining those indexes is going
> to take up precious RAM that could probably be better used elsewhere.
>

Probably not. But since this database is read-only, the number of indexes
grew along with my desperation. =)

>
> Cheers,
>
> Jeff
>

Thank you very much again!
Caio


Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Caio Casimiro
I should also say that table tweet has more than 400 million rows and
table tweet_topic has an estimated 800 million rows.

Thanks again,
Caio




Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Igor Neyman


From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Caio Casimiro
Sent: Monday, November 04, 2013 3:44 PM
To: Jeff Janes
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

...


Just try the following:

SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
  JOIN tweet AS t ON (tt.tweet_id = t.id
    AND t.creation_time BETWEEN 'D1' AND 'D2'
    AND t.user_id IN (SELECT followed_id FROM relationship
                      WHERE follower_id = N))
ORDER BY tt.tweet_id;

And see if it helps with performance.

Regards,
Igor Neyman





Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Caio Casimiro
Hi Neyman, thank you for your answer.

Unfortunately this query runs in almost the same time:

Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual
time=25820.291..25821.845 rows=1640 loops=1)
  Sort Key: tt.tweet_id
  Sort Method: quicksort  Memory: 97kB
  Buffers: shared hit=1849 read=32788
  ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual
time=486.839..25814.120 rows=1640 loops=1)
Buffers: shared hit=1849 read=32788
->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8)
(actual time=431.654..13209.159 rows=597 loops=1)
  Hash Cond: (t.user_id = relationship.followed_id)
  Buffers: shared hit=3 read=31870
  ->  Index Scan using tweet_creation_time_index on tweet t
 (cost=0.57..83308.25 rows=1781234 width=16) (actual
time=130.144..10037.764 rows=1759645 loops=1)
Index Cond: ((creation_time >= '2013-05-05
00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
00:00:00-03'::timestamp with time zone))
Buffers: shared hit=1 read=31867
  ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual
time=94.365..94.365 rows=106 loops=1)
Buckets: 1024  Batches: 1  Memory Usage: 3kB
Buffers: shared hit=2 read=3
->  Index Only Scan using relationship_id on
relationship  (cost=0.42..227.12 rows=154 width=8) (actual
time=74.540..94.101 rows=106 loops=1)
  Index Cond: (follower_id = 335093362)
  Heap Fetches: 0
  Buffers: shared hit=2 read=3
->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63
rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
  Recheck Cond: (tweet_id = t.id)
  Buffers: shared hit=1846 read=918
  ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78
rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
Index Cond: (tweet_id = t.id)
Buffers: shared hit=1763 read=632
Total runtime: 25823.386 ms

I have noticed that in both queries the index scan on
tweet_creation_time_index is very expensive. Is there anything I can do to
make the planner choose an index-only scan?

Thank you,
Caio



>


Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Elliot

On 2013-11-04 16:10, Caio Casimiro wrote:

I have noticed that in both queries the index scan on
tweet_creation_time_index is very expensive. Is there anything I can do
to make the planner choose an index-only scan?



Yes, because that part of the query is kicking back so many rows, many 
of which are totally unnecessary anyway - you're first getting all the 
tweets in a particular time range, then limiting them down to just users 
that are followed. Here's clarification on the approach I mentioned 
earlier. All you should really need are basic (btree) indexes on your 
different keys (tweet_topic.tweet_id, tweet.id, tweet.user_id, 
relationship.follower_id, relationship.followed_id). I also changed the 
left join to an inner join as somebody pointed out that your logic 
amounted to reducing the match to an inner join anyway.


SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
  JOIN tweet AS t
ON tt.tweet_id = t.id
  join relationship
on t.user_id = relationship.followed_id
WHERE creation_time BETWEEN 'D1' AND 'D2'
  AND relationship.follower_id = N
ORDER BY tt.tweet_id
;



Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Igor Neyman
From: Caio Casimiro [mailto:casimiro.lis...@gmail.com] 
Sent: Monday, November 04, 2013 4:10 PM
To: Igor Neyman
Cc: Jeff Janes; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

...

What is your hardware configuration, and Postgres config parameters modified 
from default values?

Regards,
Igor Neyman




Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Caio Casimiro
These are the parameters I have set in postgresql.conf:

work_mem = 128MB
shared_buffers = 1GB
maintenance_work_mem = 1536MB
fsync = off
synchronous_commit = off
effective_cache_size = 2GB

The hardware is a modest one:
CPU: Intel(R) Atom(TM) CPU  230   @ 1.60GHz
RAM: 2GB
HD: 1TB 7200 RPM (WDC WD10EZEX-00RKKA0)

This machine runs Slackware 14.0, dedicated to PostgreSQL.

Thank you,
Caio





Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Caio Casimiro
Hi Elliot, thank you for your answer.

I tried this query but it still suffers from the index scan on
tweet_creation_time_index:

"Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual
time=37560.938..37562.503 rows=1640 loops=1)"
"  Sort Key: tt.tweet_id"
"  Sort Method: quicksort  Memory: 97kB"
"  Buffers: shared hit=1849 read=32788"
"  ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20) (actual
time=19151.036..37555.227 rows=1640 loops=1)"
"Buffers: shared hit=1849 read=32788"
"->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8)
(actual time=19099.848..19127.606 rows=597 loops=1)"
"  Hash Cond: (relationship.followed_id = t.user_id)"
"  Buffers: shared hit=3 read=31870"
"  ->  Index Only Scan using relationship_id on relationship
 (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721 rows=106
loops=1)"
"Index Cond: (follower_id = 335093362)"
"Heap Fetches: 0"
"Buffers: shared hit=2 read=3"
"  ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16)
(actual time=19031.916..19031.916 rows=1759645 loops=1)"
"Buckets: 262144  Batches: 1  Memory Usage: 61863kB"
"Buffers: shared hit=1 read=31867"
"->  Index Scan using tweet_creation_time_index on
tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual
time=48.595..13759.768 rows=1759645 loops=1)"
"  Index Cond: ((creation_time >= '2013-05-05
00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
00:00:00-03'::timestamp with time zone))"
"  Buffers: shared hit=1 read=31867"
"->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63
rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"
"  Recheck Cond: (tweet_id = t.id)"
"  Buffers: shared hit=1846 read=918"
"  ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78
rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"
"Index Cond: (tweet_id = t.id)"
"Buffers: shared hit=1763 read=632"

You said that I would need B-tree indexes on the fields where I want the
planner to use an index-only scan, and I think I already have them on the
tweet table:

"tweet_ios_index" btree (id, user_id, creation_time)

Shouldn't tweet_ios_index be enough to make the scan over
tweet_creation_time_index an index-only scan? And, more importantly, would
it really be faster?

Thank you very much,
Caio



Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread desmodemone
Hello,
 I think you could try an index on the tweet table columns
"user_id, creation_time" [in this order, because the first column serves
the equality predicate and the second the range predicate; the existing
index tweet_user_id_creation_time_index does not help here because it has
the reverse order]. The hash join between relationship and tweet should
then become, in theory, a nested loop, so the filter relationship.followed_id
= t.user_id gets pushed into the new index's search condition together with
creation_time > .. and creation_time < ... . This way you avoid the random
I/O of scanning the 1759645 rows from tweet that the hash join currently
filters down to 1679.
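A minimal sketch of the suggested index, assuming the table and column names
from the plans above (the index name itself is hypothetical):

```sql
-- Equality column first, range column second, so the planner can seek
-- directly to one user's rows and range-scan creation_time within them.
CREATE INDEX tweet_user_creation_idx ON tweet (user_id, creation_time);

-- Refresh planner statistics so the new index is costed with current data.
ANALYZE tweet;
```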

I hope it works; if not, please attach the DDL of the table (with
constraints and indexes) so the problem is easier to understand.

Bye


2013/11/4 Caio Casimiro 

> Hi Elliot, thank you for your answer.
>
> I tried this query but it still suffers from the index scan on
> tweet_creation_time_index:
>
> "Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual
> time=37560.938..37562.503 rows=1640 loops=1)"
> "  Sort Key: tt.tweet_id"
> "  Sort Method: quicksort  Memory: 97kB"
> "  Buffers: shared hit=1849 read=32788"
> "  ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20)
> (actual time=19151.036..37555.227 rows=1640 loops=1)"
> "Buffers: shared hit=1849 read=32788"
> "->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8)
> (actual time=19099.848..19127.606 rows=597 loops=1)"
> "  Hash Cond: (relationship.followed_id = t.user_id)"
> "  Buffers: shared hit=3 read=31870"
> "  ->  Index Only Scan using relationship_id on relationship
>  (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721
> rows=106 loops=1)"
> "Index Cond: (follower_id = 335093362)"
> "Heap Fetches: 0"
> "Buffers: shared hit=2 read=3"
> "  ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16)
> (actual time=19031.916..19031.916 rows=1759645 loops=1)"
> "Buckets: 262144  Batches: 1  Memory Usage: 61863kB"
> "Buffers: shared hit=1 read=31867"
> "->  Index Scan using tweet_creation_time_index on
> tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual
> time=48.595..13759.768 rows=1759645 loops=1)"
> "  Index Cond: ((creation_time >= '2013-05-05
> 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
> 00:00:00-03'::timestamp with time zone))"
> "  Buffers: shared hit=1 read=31867"
> "->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63
> rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"
> "  Recheck Cond: (tweet_id = t.id)"
> "  Buffers: shared hit=1846 read=918"
> "  ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78
> rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"
> "Index Cond: (tweet_id = t.id)"
> "Buffers: shared hit=1763 read=632"
>
> You said that I would need B-Tree indexes on the fields where I want the
> planner to use an index-only scan, and I think I already have them on the
> tweet table:
>
> "tweet_ios_index" btree (id, user_id, creation_time)
>
> Shouldn't the tweet_ios_index be enough to make the scan over
> tweet_creation_time_index an index-only scan? And, more important, would
> it really be faster?
>
> Thank you very much,
> Caio
>
>
> On Mon, Nov 4, 2013 at 7:22 PM, Elliot  wrote:
>
>>  On 2013-11-04 16:10, Caio Casimiro wrote:
>>
>> Hi Neyman, thank you for your answer.
>>
>> Unfortunately this query runs almost at the same time:
>>
>>   Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual
>> time=25820.291..25821.845 rows=1640 loops=1)
>>   Sort Key: tt.tweet_id
>>   Sort Method: quicksort  Memory: 97kB
>>   Buffers: shared hit=1849 read=32788
>>   ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual
>> time=486.839..25814.120 rows=1640 loops=1)
>> Buffers: shared hit=1849 read=32788
>> ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8)
>> (actual time=431.654..13209.159 rows=597 loops=1)
>>   Hash Cond: (t.user_id = relationship.followed_id)
>>   Buffers: shared hit=3 read=31870
>>   ->  Index Scan using tweet_creation_time_index on tweet t
>>  (cost=0.57..83308.25 rows=1781234 width=16) (actual
>> time=130.144..10037.764 rows=1759645 loops=1)
>> Index Cond: ((creation_time >= '2013-05-05
>> 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
>> 00:00:00-03'::timestamp with time zone))
>> Buffers: shared hit=1 read=31867
>>   ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual
>> time=94.365..94.365 rows=106 loops=1)
>>
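A side note on the index-only scan question quoted above: PostgreSQL can
answer a query from an index alone only when the index used for the scan
contains every referenced column (and the visibility map is mostly set);
tweet_ios_index leads with id, so in practice it cannot drive a range scan on
creation_time. A hedged sketch of a covering index for this query shape (the
index name is hypothetical):

```sql
-- creation_time first so the range predicate can drive the scan;
-- user_id and id are included so the heap need not be visited,
-- allowing an index-only scan when the visibility map permits it.
CREATE INDEX tweet_ct_covering_idx ON tweet (creation_time, user_id, id);
```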

Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

2013-11-04 Thread Caio Casimiro
Hello, thank you for your answer. I will give it a try and then post the
results here.
In the original email I posted the output of \d+ tweet, which contains the
indexes and constraints.

Best regards,
Caio


On Mon, Nov 4, 2013 at 8:59 PM, desmodemone  wrote:

> Hello,
>  I think you could try an index on the tweet table columns
> "user_id, creation_time" [in this order, because the first column serves
> the equality predicate and the second the range predicate; the existing
> index tweet_user_id_creation_time_index does not help here because it has
> the reverse order]. The hash join between relationship and tweet should
> then become, in theory, a nested loop, so the filter relationship.followed_id
> = t.user_id gets pushed into the new index's search condition together with
> creation_time > .. and creation_time < ... . This way you avoid the random
> I/O of scanning the 1759645 rows from tweet that the hash join currently
> filters down to 1679.
>
> I hope it works; if not, please attach the DDL of the table (with
> constraints and indexes) so the problem is easier to understand.
>
> Bye
>
>
> 2013/11/4 Caio Casimiro 
>
>> Hi Elliot, thank you for your answer.
>>
>> I tried this query but it still suffers from the index scan on
>> tweet_creation_time_index:
>>
>> "Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual
>> time=37560.938..37562.503 rows=1640 loops=1)"
>>  "  Sort Key: tt.tweet_id"
>> "  Sort Method: quicksort  Memory: 97kB"
>> "  Buffers: shared hit=1849 read=32788"
>> "  ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20)
>> (actual time=19151.036..37555.227 rows=1640 loops=1)"
>> "Buffers: shared hit=1849 read=32788"
>> "->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8)
>> (actual time=19099.848..19127.606 rows=597 loops=1)"
>> "  Hash Cond: (relationship.followed_id = t.user_id)"
>> "  Buffers: shared hit=3 read=31870"
>> "  ->  Index Only Scan using relationship_id on relationship
>>  (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721
>> rows=106 loops=1)"
>>  "Index Cond: (follower_id = 335093362)"
>> "Heap Fetches: 0"
>> "Buffers: shared hit=2 read=3"
>> "  ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16)
>> (actual time=19031.916..19031.916 rows=1759645 loops=1)"
>> "Buckets: 262144  Batches: 1  Memory Usage: 61863kB"
>> "Buffers: shared hit=1 read=31867"
>> "->  Index Scan using tweet_creation_time_index on
>> tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual
>> time=48.595..13759.768 rows=1759645 loops=1)"
>>  "  Index Cond: ((creation_time >= '2013-05-05
>> 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
>> 00:00:00-03'::timestamp with time zone))"
>> "  Buffers: shared hit=1 read=31867"
>> "->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63
>> rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"
>>  "  Recheck Cond: (tweet_id = t.id)"
>> "  Buffers: shared hit=1846 read=918"
>> "  ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78
>> rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"
>>  "Index Cond: (tweet_id = t.id)"
>> "Buffers: shared hit=1763 read=632"
>>
>> You said that I would need B-Tree indexes on the fields where I want the
>> planner to use an index-only scan, and I think I already have them on the
>> tweet table:
>>
>> "tweet_ios_index" btree (id, user_id, creation_time)
>>
>> Shouldn't the tweet_ios_index be enough to make the scan over
>> tweet_creation_time_index an index-only scan? And, more important, would
>> it really be faster?
>>
>> Thank you very much,
>> Caio
>>
>>
>> On Mon, Nov 4, 2013 at 7:22 PM, Elliot wrote:
>>
>>>  On 2013-11-04 16:10, Caio Casimiro wrote:
>>>
>>> Hi Neyman, thank you for your answer.
>>>
>>> Unfortunately this query runs almost at the same time:
>>>
>>>   Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual
>>> time=25820.291..25821.845 rows=1640 loops=1)
>>>   Sort Key: tt.tweet_id
>>>   Sort Method: quicksort  Memory: 97kB
>>>   Buffers: shared hit=1849 read=32788
>>>   ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual
>>> time=486.839..25814.120 rows=1640 loops=1)
>>> Buffers: shared hit=1849 read=32788
>>> ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8)
>>> (actual time=431.654..13209.159 rows=597 loops=1)
>>>   Hash Cond: (t.user_id = relationship.followed_id)
>>>   Buffers: shared hit=3 read=31870
>>>   ->  Index Scan using tweet_creation_time_index on tweet t
>>>  (cost=0.57..83308.25 rows=1781234 width=16) (actual
>>> time=130.144..10037.764 rows=1759645 loop