[PERFORM] Index with all necessary columns - Postgres vs MSSQL

2012-02-01 Thread Gudmundur Johannesson
Hi,

I have a table in Postgres like:
CREATE TABLE test
(
  id integer,
  dtstamp timestamp without time zone,
  rating real
);
CREATE INDEX test_all
  ON test
  USING btree
  (id , dtstamp , rating);

My db has around 200M rows and I have reduced my test select statement down
to:
SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
AND dtstamp between cast('2011-10-19 08:00:00' as timestamp) and
cast('2011-10-19 16:00:00' as timestamp);

In Postgres this takes about 23 sec.
In MSSQL this takes about 1 sec.

MSSQL only accesses the index and does not access the table itself (it
uses an index-only scan).

Postgres has the following plan:
"Aggregate  (cost=130926.24..130926.25 rows=1 width=0)"
"  ->  Bitmap Heap Scan on test  (cost=1298.97..130832.92 rows=37330
width=0)"
"Recheck Cond: ((id = ANY
('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
AND (dtstamp >= '2011-10-19 08:00:00'::timestamp without time zone) AND
(dtstamp <= '2011-10-19 16:00:00'::timestamp without time zone))"
"->  Bitmap Index Scan on test_all  (cost=0.00..1289.64 rows=37330
width=0)"
"  Index Cond: ((id = ANY
('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
AND (dtstamp >= '2011-10-19 08:00:00'::timestamp without time zone) AND
(dtstamp <= '2011-10-19 16:00:00'::timestamp without time zone))"

The results are disappointing, since I want to switch to Postgres, but I
have not been able to get Postgres to use only the index :-(

Any hints that may lead me back on track?

Thanks,
   - Gummi


Re: [PERFORM] Index with all necessary columns - Postgres vs MSSQL

2012-02-01 Thread Merlin Moncure
On Wed, Feb 1, 2012 at 11:10 AM, Gudmundur Johannesson wrote:
> Hi,
>
> I have a table in Postgres like: [snip]
>
> In Postgres this takes about 23 sec.
> In MSSQL this takes about 1 sec.
> [snip]
>
> Any hints that may lead me back on track?

*) are the times in postgres stable across calls?
*) where is the 'id list' coming from?
*) how long does this query take?

SELECT count(1) FROM test WHERE id = 202 AND dtstamp between
'2011-10-19 08:00:00'::timestamp and '2011-10-19 16:00:00'::timestamp;

The feature you're looking for in postgres is called 'index-only
scans', and 9.2 will contain an implementation of that feature (see:
http://rhaas.blogspot.com/2011/10/index-only-scans-weve-got-em.html).
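
Once you're on a 9.2 build, a rough way to check (untested sketch): run
a VACUUM first so the visibility map is populated, then look for the new
plan node:

VACUUM test;  -- index-only scans rely on an up-to-date visibility map
EXPLAIN ANALYZE
SELECT count(1) FROM test
WHERE id IN (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
  AND dtstamp BETWEEN '2011-10-19 08:00:00' AND '2011-10-19 16:00:00';
-- you'd hope to see "Index Only Scan using test_all on test" here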

merlin



Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Alessandro Gagliardi
Final update on this thread: since it is only necessary for me to get a
rough ratio of the distribution (and not the absolute count), I refactored
the query to include a subquery that samples from the moments table
thus: SELECT moment_id, block_id FROM moments WHERE inserted BETWEEN
'yesterday' AND 'today' ORDER BY RANDOM() LIMIT 1; I also took
advantage of another table called blocks that happens to contain the
moment_type as well (thus making it so I don't need to reference pg_class).
The final query looks like:

SELECT moment_type, emotion, COUNT(feedback_id)
  FROM (SELECT moment_id, block_id
          FROM moments
         WHERE inserted BETWEEN 'yesterday' AND 'today'
         ORDER BY RANDOM() LIMIT 1) AS sample_moments
  JOIN blocks USING (block_id)
  JOIN emotions USING (moment_id)
 GROUP BY moment_type, emotion
 ORDER BY moment_type, emotion;

The explain is at http://explain.depesz.com/s/lYh

Interestingly, increasing the limit does not seem to increase the runtime
in a linear fashion. When I run it with a limit of 65536 I get a runtime
of 14991 ms. But if I run it with a limit of 70000 I get a runtime of 77744
ms. I assume that that's because I'm hitting a memory limit and paging out.
Is that right?

On Tue, Jan 31, 2012 at 3:43 PM, Alessandro Gagliardi wrote:

> I just got a pointer on presenting EXPLAIN ANALYZE in a more
> human-friendly fashion (thanks, Agent M!): http://explain.depesz.com/s/A9S
>
> From this it looks like the bottleneck happens when Postgres does an Index
> Scan using emotions_moment_id_idx on emotions before filtering on
> moments.inserted so I thought I'd try filtering on emotions.inserted
> instead but that only made it worse. At the same time, I noticed that "FROM
> pg_class, moments WHERE moments.tableoid = pg_class.oid" tends to run a bit
> faster than "FROM pg_class JOIN moments ON moments.tableoid =
> pg_class.oid". So I tried:
>
> SELECT relname, emotion, COUNT(feedback_id)
>   FROM pg_class, moments, emotions
>  WHERE moments.tableoid = pg_class.oid
>AND emotions.inserted > 'yesterday'
>AND moments.inserted BETWEEN 'yesterday' AND 'today'
>AND emotions.moment_id = moments.moment_id
>  GROUP BY relname, emotion
>  ORDER BY relname, emotion;
>
> That was a bit faster, but still very slow. Here's the EXPLAIN:
> http://explain.depesz.com/s/ZdF
>
> On Tue, Jan 31, 2012 at 2:53 PM, Alessandro Gagliardi wrote:
>
>> I changed the query a bit so the results would not change over the
>> course of the day to:
>>
>> SELECT relname, emotion, COUNT(feedback_id) FROM pg_class, moments
>> JOIN emotions USING (moment_id)
>> WHERE moments.inserted BETWEEN 'yesterday' AND 'today' AND
>> moments.tableoid = pg_class.oid
>> GROUP BY relname, emotion ORDER BY relname, emotion;
>>
>


Re: [PERFORM] Index with all necessary columns - Postgres vs MSSQL

2012-02-01 Thread Scott Marlowe
On Wed, Feb 1, 2012 at 10:10 AM, Gudmundur Johannesson wrote:
> Hi,
>
> I have a table in Postgres like: [snip]
>
> In Postgres this takes about 23 sec.
> In MSSQL this takes about 1 sec.
> [snip]
>
> Any hints that may lead me back on track?

As Merlin mentioned, postgres doesn't have "covering" indexes yet.  I
was wondering what explain ANALYZE of your query looks like, and what
version of pgsql you're running.  It might be that we can at least get
that 23 seconds down to something closer to 1 second rather than
waiting for pg 9.2 to get here.

First try individual indexes on the two fields, and also try a two-column
index on the two fields, both with id first and with date first (a rough
sketch follows below).  Use explain analyze to see if any of these does
better.  Also look at this wiki page and see if there's anything there
that helps: http://wiki.postgresql.org/wiki/SlowQueryQuestions  Especially
this part: http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
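
Untested sketch of what I mean (index names made up; the planner will just
pick whichever index looks cheapest, so test them one at a time):

CREATE INDEX test_id ON test (id);
CREATE INDEX test_dtstamp ON test (dtstamp);
CREATE INDEX test_id_dt ON test (id, dtstamp);
CREATE INDEX test_dt_id ON test (dtstamp, id);

EXPLAIN ANALYZE
SELECT count(1) FROM test
WHERE id IN (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
  AND dtstamp BETWEEN '2011-10-19 08:00:00' AND '2011-10-19 16:00:00';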



Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Scott Marlowe
On Wed, Feb 1, 2012 at 11:19 AM, Alessandro Gagliardi wrote:
> Final update on this thread: [snip]
>
> Interestingly, increasing the limit does not seem to increase the runtime in
> a linear fashion. When I run it with a limit of 65536 I get a runtime
> of 14991 ms. But if I run it with a limit of 70000 I get a runtime of 77744
> ms. I assume that that's because I'm hitting a memory limit and paging out.
> Is that right?

Hard to say.  More likely your query plan changes at that point.  Run
the queries with "explain analyze" in front of them to find out.
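
For instance (a sketch using your query; run it once just under the
threshold and once just over, then compare the two plans):

EXPLAIN ANALYZE
SELECT moment_type, emotion, COUNT(feedback_id)
  FROM (SELECT moment_id, block_id
          FROM moments
         WHERE inserted BETWEEN 'yesterday' AND 'today'
         ORDER BY RANDOM() LIMIT 70000) AS sample_moments
  JOIN blocks USING (block_id)
  JOIN emotions USING (moment_id)
 GROUP BY moment_type, emotion;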



Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Alessandro Gagliardi
LIMIT 65536: Total query runtime: 14846 ms -
http://explain.depesz.com/s/I3E
LIMIT 69632: Total query runtime: 80141 ms -
http://explain.depesz.com/s/9hp

So it looks like when the limit crosses a certain threshold (somewhere
north of 2^16), Postgres decides to do a Seq Scan instead of an Index Scan.
I've already lowered random_page_cost to 2. Maybe I should lower it to 1.5?
Actually 60K should be plenty for my purposes anyway.
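
As a sanity check that the plan flip itself is the culprit, I suppose I
could re-run the larger limit with seq scans disabled (untested, and for
diagnosis only, not production):

SET enable_seqscan = off;  -- steers the planner away from seq scans
EXPLAIN ANALYZE
SELECT moment_type, emotion, COUNT(feedback_id)
  FROM (SELECT moment_id, block_id
          FROM moments
         WHERE inserted BETWEEN 'yesterday' AND 'today'
         ORDER BY RANDOM() LIMIT 69632) AS sample_moments
  JOIN blocks USING (block_id)
  JOIN emotions USING (moment_id)
 GROUP BY moment_type, emotion;
RESET enable_seqscan;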

On Wed, Feb 1, 2012 at 10:35 AM, Scott Marlowe wrote:

> On Wed, Feb 1, 2012 at 11:19 AM, Alessandro Gagliardi wrote:
> > Interestingly, increasing the limit does not seem to increase the
> > runtime in a linear fashion. When I run it with a limit of 65536 I get
> > a runtime of 14991 ms. But if I run it with a limit of 70000 I get a
> > runtime of 77744 ms. I assume that that's because I'm hitting a memory
> > limit and paging out. Is that right?
>
> Hard to say.  More likely your query plan changes at that point.  Run
> the queries with "explain analyze" in front of them to find out.
>


Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Scott Marlowe
On Wed, Feb 1, 2012 at 11:48 AM, Alessandro Gagliardi wrote:
> LIMIT 65536: Total query runtime: 14846 ms -
> http://explain.depesz.com/s/I3E
> LIMIT 69632: Total query runtime: 80141 ms -
> http://explain.depesz.com/s/9hp
>
> So it looks like when the limit crosses a certain threshold (somewhere north
> of 2^16), Postgres decides to do a Seq Scan instead of an Index Scan.
> I've already lowered random_page_cost to 2. Maybe I should lower it to 1.5?
> Actually 60K should be plenty for my purposes anyway.

It's important to set random_page_cost according to more than just one
query, but yeah, at this point it's likely a good idea to set it closer
to 1.0.  You're on heroku, right?  If so, something in the 1.2 to 1.4
range is likely called for.

If you've got other queries you can test the change on, all the better;
a sketch of the mechanics follows below.
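
For example (untested; try it at the session level first and persist it
once you're happy -- "mydb" here is a made-up database name, and on heroku
ALTER DATABASE may not be available to you):

SET random_page_cost = 1.2;   -- affects only the current session
-- re-run the problem queries with EXPLAIN ANALYZE here
ALTER DATABASE mydb SET random_page_cost = 1.2;  -- persist for new sessions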



Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Bob Lunney
Possibly.  What does

  psql > show work_mem;

say?

Bob Lunney




Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Alessandro Gagliardi
On Wed, Feb 1, 2012 at 11:04 AM, Bob Lunney wrote:

> Possibly.  What does
>
>   psql > show work_mem;
>
> say?
>
100MB


Re: [PERFORM] Index with all necessary columns - Postgres vs MSSQL

2012-02-01 Thread Merlin Moncure
On Wed, Feb 1, 2012 at 12:50 PM, Gudmundur Johannesson wrote:
> Here are the answers to your questions:
> 1) I changed the select statement so I am referring to 1 day at a time.  In
> that case the response time is similar.  Basically, the data is not in cache
> when I do that and the response time is about 23 seconds.

What's the difference between the first and the second run time?
Note, if you are only interested in the date the dtStamp falls on, you
can exploit that in the index to knock 4 bytes off your index entry:

CREATE INDEX test_all
  ON test
  USING btree
  (id , (dtstamp::date) , rating);

and then use a similar expression to query it back out.
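
Something like this (untested) -- the WHERE clause has to use the same
expression as the index so the planner can match it:

SELECT count(1) FROM test
WHERE id IN (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
  AND dtstamp::date = '2011-10-19';  -- matches the whole day, not 08:00-16:00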

> 3) The query takes 23 sec vs 1 sec or lower in mssql.

I asked you to time a different query.  Look again (and I'd like to
see cached and uncached times).

> We never update/delete and therefore the data is always correct in the index
> (never dirty).  Therefore, Postgres could have used the data in it.
>
> I started adding columns into indexes in Oracle approx 15 years ago and
> it was a brilliant discovery.  This looks like a show stopper for me but I

I doubt covering indexes are going to make that query 23x faster.
However, I bet we can get something worked out.

merlin
