[PERFORM] Index with all necessary columns - Postgres vs MSSQL
Hi,

I have a table in Postgres like:

CREATE TABLE test
(
  id integer,
  dtstamp timestamp without time zone,
  rating real
);

CREATE INDEX test_all
  ON test
  USING btree (id, dtstamp, rating);

My db has around 200M rows and I have reduced my test select statement down to:

SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
  AND dtstamp between cast('2011-10-19 08:00:00' as timestamp)
                  and cast('2011-10-19 16:00:00' as timestamp)

In Postgres this takes about 23 sec.
In MSSQL this takes about 1 sec.

MSSQL only accesses the index and does not access the table itself (it uses an index-only scan).

Postgres has the following plan:

"Aggregate  (cost=130926.24..130926.25 rows=1 width=0)"
"  ->  Bitmap Heap Scan on test  (cost=1298.97..130832.92 rows=37330 width=0)"
"        Recheck Cond: ((id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND (dtstamp >= '2011-10-19 08:00:00'::timestamp without time zone) AND (dtstamp <= '2011-10-19 16:00:00'::timestamp without time zone))"
"        ->  Bitmap Index Scan on test_all  (cost=0.00..1289.64 rows=37330 width=0)"
"              Index Cond: ((id = ANY ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[])) AND (dtstamp >= '2011-10-19 08:00:00'::timestamp without time zone) AND (dtstamp <= '2011-10-19 16:00:00'::timestamp without time zone))"

The results are disappointing since I want to switch to Postgres, but I have not been able to force Postgres to use only the index :-(

Any hints that may lead me back on track?

Thanks,
 - Gummi
Re: [PERFORM] Index with all necessary columns - Postgres vs MSSQL
On Wed, Feb 1, 2012 at 11:10 AM, Gudmundur Johannesson wrote:
> I have a table in Postgres like:
> [...]
> My db has around 200M rows and I have reduced my test select statement down to:
>
> SELECT count(1) FROM test
> WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
>   AND dtstamp between cast('2011-10-19 08:00:00' as timestamp)
>                   and cast('2011-10-19 16:00:00' as timestamp)
>
> In Postgres this takes about 23 sec.
> In MSSQL this takes about 1 sec.
>
> MSSQL only accesses the index and does not access the table itself (it uses
> an index-only scan).
> [...]
> Any hints that may lead me back on track?

*) Are the times in postgres stable across calls?

*) Where is the 'id list' coming from?

*) How long does this query take?

SELECT count(1) FROM test
WHERE id = 202
  AND dtstamp between '2011-10-19 08:00:00'::timestamp
                  and '2011-10-19 16:00:00'::timestamp;

The feature you're looking for in postgres is called 'index only scans', and 9.2 will contain an implementation of that feature (see: http://rhaas.blogspot.com/2011/10/index-only-scans-weve-got-em.html).

merlin
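One way to separate cached from uncached timings, and to see how much of the work is heap access rather than index access, is EXPLAIN with the BUFFERS option (available from 9.0 on). A sketch using the query from the original post:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
  AND dtstamp between '2011-10-19 08:00:00'::timestamp
                  and '2011-10-19 16:00:00'::timestamp;

Running it twice in a row shows the cached case on the second run; the "Buffers: shared hit ... read ..." lines show how many pages came from the buffer cache versus disk.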
Re: [PERFORM] From Simple to Complex
Final update on this thread: since it is only necessary for me to get a rough ratio of the distribution (and not the absolute count), I refactored the query to include a subquery that samples from the moments table thus:

SELECT moment_id, block_id FROM moments
WHERE inserted BETWEEN 'yesterday' AND 'today'
ORDER BY RANDOM() LIMIT 1;

I also took advantage of another table called blocks that happens to contain the moment_type as well (thus making it so I don't need to reference pg_class). The final query looks like:

SELECT moment_type, emotion, COUNT(feedback_id)
  FROM (SELECT moment_id, block_id
          FROM moments
         WHERE inserted BETWEEN 'yesterday' AND 'today'
         ORDER BY RANDOM() LIMIT 1) AS sample_moments
  JOIN blocks USING (block_id)
  JOIN emotions USING (moment_id)
 GROUP BY moment_type, emotion
 ORDER BY moment_type, emotion

The explain is at http://explain.depesz.com/s/lYh

Interestingly, increasing the limit does not seem to increase the runtime in a linear fashion. When I run it with a limit of 6 I get a runtime of 14991 ms. But if I run it with a limit of 7 I get a runtime of 77744 ms. I assume that that's because I'm hitting a memory limit and paging out. Is that right?

On Tue, Jan 31, 2012 at 3:43 PM, Alessandro Gagliardi wrote:
> I just got a pointer on presenting EXPLAIN ANALYZE in a more human friendly
> fashion (thanks, Agent M!): http://explain.depesz.com/s/A9S
>
> From this it looks like the bottleneck happens when Postgres does an Index
> Scan using emotions_moment_id_idx on emotions before filtering on
> moments.inserted, so I thought I'd try filtering on emotions.inserted instead,
> but that only made it worse. At the same time, I noticed that "FROM pg_class,
> moments WHERE moments.tableoid = pg_class.oid" tends to run a bit faster than
> "FROM pg_class JOIN moments ON moments.tableoid = pg_class.oid". So I tried:
>
> SELECT relname, emotion, COUNT(feedback_id)
>   FROM pg_class, moments, emotions
>  WHERE moments.tableoid = pg_class.oid
>    AND emotions.inserted > 'yesterday'
>    AND moments.inserted BETWEEN 'yesterday' AND 'today'
>    AND emotions.moment_id = moments.moment_id
>  GROUP BY relname, emotion
>  ORDER BY relname, emotion;
>
> That was a bit faster, but still very slow. Here's the EXPLAIN:
> http://explain.depesz.com/s/ZdF
>
> On Tue, Jan 31, 2012 at 2:53 PM, Alessandro Gagliardi wrote:
>
>> I changed the query a bit so the results would not change over the course
>> of the day to:
>>
>> SELECT relname, emotion, COUNT(feedback_id) FROM pg_class, moments
>> JOIN emotions USING (moment_id)
>> WHERE moments.inserted BETWEEN 'yesterday' AND 'today' AND
>> moments.tableoid = pg_class.oid
>> GROUP BY relname, emotion ORDER BY relname, emotion;
Re: [PERFORM] Index with all necessary columns - Postgres vs MSSQL
On Wed, Feb 1, 2012 at 10:10 AM, Gudmundur Johannesson wrote:
> I have a table in Postgres like:
> [...]
> In Postgres this takes about 23 sec.
> In MSSQL this takes about 1 sec.
> [...]
> Any hints that may lead me back on track?

As Merlin mentioned, postgres doesn't have "covering" indexes yet. I was wondering what EXPLAIN ANALYZE of your query looks like, and what version of pgsql you're running. It might be that we can at least get that 23 seconds down to something closer to 1 second rather than waiting for pg 9.2 to get here.

First try individual indexes on the two fields, and also try a two column index on the two fields, both with id first and with date first. Use EXPLAIN ANALYZE to see if this does any better.

Also look at this wiki page and see if there's anything there that helps:
http://wiki.postgresql.org/wiki/SlowQueryQuestions

Especially this part:
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
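For reference, a sketch of the index variants being suggested (the index names here are illustrative, not from the thread):

-- single-column indexes on each field
CREATE INDEX test_id_idx      ON test (id);
CREATE INDEX test_dtstamp_idx ON test (dtstamp);

-- two-column indexes, trying both column orders
CREATE INDEX test_id_dtstamp_idx ON test (id, dtstamp);
CREATE INDEX test_dtstamp_id_idx ON test (dtstamp, id);

After each change, run the problem query under EXPLAIN ANALYZE and compare which index the planner picks and how long the scan takes.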
Re: [PERFORM] From Simple to Complex
On Wed, Feb 1, 2012 at 11:19 AM, Alessandro Gagliardi wrote:
> Interestingly, increasing the limit does not seem to increase the runtime in
> a linear fashion. When I run it with a limit of 6 I get a runtime of 14991 ms.
> But if I run it with a limit of 7 I get a runtime of 77744 ms. I assume that
> that's because I'm hitting a memory limit and paging out. Is that right?

Hard to say. More likely your query plan changes at that point. Run the queries with "explain analyze" in front of them to find out.
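That is, prepend EXPLAIN ANALYZE to the full statement at each of the two limits and compare the plans. A sketch at one of the limits reported later in the thread:

EXPLAIN ANALYZE
SELECT moment_type, emotion, COUNT(feedback_id)
  FROM (SELECT moment_id, block_id
          FROM moments
         WHERE inserted BETWEEN 'yesterday' AND 'today'
         ORDER BY RANDOM() LIMIT 65536) AS sample_moments
  JOIN blocks USING (block_id)
  JOIN emotions USING (moment_id)
 GROUP BY moment_type, emotion
 ORDER BY moment_type, emotion;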
Re: [PERFORM] From Simple to Complex
LIMIT 65536: Total query runtime: 14846 ms. - http://explain.depesz.com/s/I3E
LIMIT 69632: Total query runtime: 80141 ms. - http://explain.depesz.com/s/9hp

So it looks like when the limit crosses a certain threshold (somewhere north of 2^16), Postgres decides to do a Seq Scan instead of an Index Scan. I've already lowered random_page_cost to 2. Maybe I should lower it to 1.5? Actually, 60K should be plenty for my purposes anyway.

On Wed, Feb 1, 2012 at 10:35 AM, Scott Marlowe wrote:
> Hard to say. More likely your query plan changes at that point. Run the
> queries with "explain analyze" in front of them to find out.
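Before changing the setting for everyone, it can be tried for just the current session; a rough sketch:

-- experiment with the planner's random I/O cost estimate in this session only
SET random_page_cost = 1.5;
-- ... re-run the EXPLAIN ANALYZE at LIMIT 69632 and see whether the planner
-- switches back from the Seq Scan to the Index Scan ...
RESET random_page_cost;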
Re: [PERFORM] From Simple to Complex
On Wed, Feb 1, 2012 at 11:48 AM, Alessandro Gagliardi wrote:
> LIMIT 65536: Total query runtime: 14846 ms. - http://explain.depesz.com/s/I3E
> LIMIT 69632: Total query runtime: 80141 ms. - http://explain.depesz.com/s/9hp
>
> So it looks like when the limit crosses a certain threshold (somewhere north
> of 2^16), Postgres decides to do a Seq Scan instead of an Index Scan.
> I've already lowered random_page_cost to 2. Maybe I should lower it to 1.5?
> Actually 60K should be plenty for my purposes anyway.

It's important to set random_page_cost according to more than just one query, but yeah, at this point it's likely a good idea to set it closer to 1.0. You're on heroku, right? If so, something closer to 1.0 is likely called for: 1.2 to 1.4 or so. If you've got other queries you can test the change on, all the better.
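If the lower value holds up across the other queries, it can be made the default for the database rather than set per session. A sketch (the database name is a placeholder, and this assumes the connecting role has the privilege to ALTER the database):

-- applies to new connections to this database
ALTER DATABASE mydb SET random_page_cost = 1.2;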
Re: [PERFORM] From Simple to Complex
Possibly. What does

  psql> show work_mem;

say?

Bob Lunney

From: Alessandro Gagliardi
To: pgsql-performance@postgresql.org
Sent: Wednesday, February 1, 2012 12:19 PM
Subject: Re: [PERFORM] From Simple to Complex

Interestingly, increasing the limit does not seem to increase the runtime in a linear fashion. When I run it with a limit of 6 I get a runtime of 14991 ms. But if I run it with a limit of 7 I get a runtime of 77744 ms. I assume that that's because I'm hitting a memory limit and paging out. Is that right?
Re: [PERFORM] From Simple to Complex
On Wed, Feb 1, 2012 at 11:04 AM, Bob Lunney wrote:
> Possibly. What does
>
>   psql> show work_mem;
>
> say?

100MB
Re: [PERFORM] Index with all necessary columns - Postgres vs MSSQL
On Wed, Feb 1, 2012 at 12:50 PM, Gudmundur Johannesson wrote:
> Here are the answers to your questions:
> 1) I changed the select statement so I am referring to 1 day at a time. In
> that case the response time is similar. Basically, the data is not in cache
> when I do that and the response time is about 23 seconds.

What's the difference between the first and the second run time?

Note, if you are only interested in the date the dtStamp falls on, you can exploit that in the index to knock 4 bytes off your index entry:

CREATE INDEX test_all
  ON test
  USING btree (id, (dtstamp::date), rating);

and then use a similar expression to query it back out.

> 3) The query takes 23 sec vs 1 sec or lower in mssql.

I asked you to time a different query. Look again (and I'd like to see cached and uncached times).

> We never update/delete and therefore the data is always correct in the index
> (never dirty). Therefore, Postgres could have used the data in it.
>
> I started to add columns into indexes in Oracle approx 15 years ago and it
> was a brilliant discovery. This looks like a show stopper for me but I

I doubt covering indexes are going to make that query 23x faster. However, I bet we can get something worked out.

merlin
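A sketch of what "a similar expression" could look like on the query side (note it answers a whole-day question rather than the original 08:00-16:00 window, so it only applies if the time-of-day filter can be dropped):

SELECT count(1) FROM test
WHERE id in (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
  AND dtstamp::date = date '2011-10-19';

Because the query repeats the same dtstamp::date expression the index was built on, the planner can match it against the expression index.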