Re: [PERFORM] RAID arrays and performance

2008-09-19 Thread Matthew Wakeling

On Thu, 18 Sep 2008, Greg Smith wrote:
It's just that our system is doing a lot of bitmap index scans at the 
moment, and it'd help to be able to spread them across the 16 discs in the 
RAID array. It's the bottleneck in our system at the moment.


If you have some specific bitmap index scan test case suggestions you can 
pass along (either publicly or in private to me, I can probably help 
anonymize them), that's one of the things that has been holding this up.


Okay, here's a description of what we're doing. We are importing data from 
a large file (so we don't have a choice on the order that the data comes 
in). For each entry in the source, we have to look up the corresponding 
row in the database, and issue the correct "UPDATE" command according to 
what's in the database and what's in the data source. Long story - it 
isn't feasible to change the overall process.


In order to improve the performance, I made the system look ahead in the 
source, in groups of a thousand entries, so instead of running:


SELECT * FROM table WHERE field = 'something';

a thousand times, we now run:

SELECT * FROM table WHERE field IN ('something', 'something else'...);

with a thousand things in the IN. Very simple query. It does run faster 
than the individual queries, but it still takes quite a while. Here is an 
example query:


SELECT a1_.id AS a1_id, a1_.primaryIdentifier AS a2_ FROM Gene AS a1_ 
WHERE a1_.primaryIdentifier IN ('SPAC11D3.15', 'SPAC11D3.16c', 
'SPAC11D3.17', 'SPAC11D3.18c', 'SPAC15F9.01c', 'SPAC15F9.02', 'SPAC16.01', 
'SPAC18G6.01c', 'SPAC18G6.02c', 'SPAC18G6.04c', 'SPAC18G6.05c', 
'SPAC18G6.06', 'SPAC18G6.07c', 'SPAC18G6.09c', 'SPAC18G6.10', 
'SPAC18G6.11c', 'SPAC18G6.12c', 'SPAC18G6.13', 'SPAC18G6.14c', 
'SPAC18G6.15', 'SPAC1B9.01c', 'SPAC1D4.02c', 'SPAC1D4.03c', 'SPAC1D4.04', 
'SPAC1D4.05c', 'SPAC1D4.07c', 'SPAC1D4.08', 'SPAC1D4.09c', 'SPAC1D4.10', 
'SPAC1D4.11c', 'SPAC1F3.11', 'SPAC23A1.10', 'SPAC23E2.01', 'SPAC23E2.02', 
'SPAC23E2.03c', 'SPAC26A3.02', 'SPAC26A3.03c', 'SPAC26A3.05', 
'SPAC26A3.06', 'SPAC26A3.07c', 'SPAC26A3.08', 'SPAC26A3.09c', 
'SPAC26A3.10', 'SPAC26A3.11', 'SPAC26A3.14c', 'SPAC26A3.15c', 
'SPAC26A3.16', 'SPAC27F1.01c', 'SPAC27F1.03c', 'SPAC27F1.04c', 
'SPAC27F1.05c', 'SPAC27F1.06c', 'SPAC3H8.02', 'SPAC3H8.03', 'SPAC3H8.04', 
'SPAC3H8.05c', 'SPAC3H8.06', 'SPAC3H8.07c', 'SPAC3H8.08c', 'SPAC3H8.09c', 
'SPAC3H8.10', 'SPAC3H8.11', 'SPAC8E11.11', 'SPBC106.15', 'SPBC17G9.10', 
'SPBC24E9.15c', 'WBGene0969', 'WBGene3035', 'WBGene4095', 
'WBGene00016011', 'WBGene00018672', 'WBGene00018674', 'WBGene00018675', 
'WBGene00018676', 'WBGene00018959', 'WBGene00018960', 'WBGene00018961', 
'WBGene00023407') ORDER BY a1_.id LIMIT 2000;


And the corresponding EXPLAIN ANALYSE:

 Limit  (cost=331.28..331.47 rows=77 width=17) (actual time=121.973..122.501 
rows=78 loops=1)
   ->  Sort  (cost=331.28..331.47 rows=77 width=17) (actual 
time=121.968..122.152 rows=78 loops=1)
 Sort Key: id
 Sort Method:  quicksort  Memory: 29kB
 ->  Bitmap Heap Scan on gene a1_  (cost=174.24..328.87 rows=77 
width=17) (actual time=114.311..121.705 rows=78 loops=1)
   Recheck Cond: (primaryidentifier = ANY ('{SPAC11D3.15...
   ->  Bitmap Index Scan on gene__key_primaryidentifier  
(cost=0.00..174.22 rows=77 width=0) (actual time=44.434..44.434 rows=150 loops=1)
 Index Cond: (primaryidentifier = ANY 
('{SPAC11D3.15,SPAC11D3.16c...
 Total runtime: 122.733 ms
(9 rows)

Although it's probably in the cache, as it took 1073 ms the first time. 
The table has half a million rows, but tables all over the database are 
being accessed, so the cache is shared between several hundred million 
rows.


Postgres executes this query in two stages. First it does a trawl of the 
index (on field), and builds an bitmap. Then it fetches the pages 
according to the bitmap. I can see the second stage being quite easy to 
adapt for fadvise, but the first stage would be a little more tricky. Both 
stages are equally important, as they take a comparable amount of time.


We are running this database on a 16-spindle RAID array, so the benefits 
to our process of fully utilising them would be quite large. I'm 
considering if I can parallelise things a little though.


Alternately, if you'd like to join in on testing this all out more help would 
certainly be welcome.


How would you like me to help?

Matthew

--
What goes up must come down. Ask any system administrator.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID arrays and performance

2008-09-19 Thread Tom Lane
Matthew Wakeling <[EMAIL PROTECTED]> writes:
> In order to improve the performance, I made the system look ahead in the 
> source, in groups of a thousand entries, so instead of running:
> SELECT * FROM table WHERE field = 'something';
> a thousand times, we now run:
> SELECT * FROM table WHERE field IN ('something', 'something else'...);
> with a thousand things in the IN. Very simple query. It does run faster 
> than the individual queries, but it still takes quite a while. Here is an 
> example query:

Your example shows the IN-list as being sorted, but I wonder whether you
actually are sorting the items in practice?  If not, you might try that
to improve locality of access to the index.

Also, parsing/planning time could be part of your problem here with 1000
things to look at.  Can you adjust your client code to use a prepared
query?  I'd try
SELECT * FROM table WHERE field = ANY($1::text[])
(or whatever the field datatype actually is) and then push the list
over as a single parameter value using array syntax.  You might find
that it scales to much larger IN-lists that way.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID arrays and performance

2008-09-19 Thread Matthew Wakeling

On Fri, 19 Sep 2008, Tom Lane wrote:

Your example shows the IN-list as being sorted, but I wonder whether you
actually are sorting the items in practice?  If not, you might try that
to improve locality of access to the index.


Well, like I said, we generally don't have the luxury of dictating the 
order of entries in the data source. However, the IN list itself is sorted 
- more to do with making the logs readable and the tests reproducable than 
for performance.


However, I have been looking at changing the order of the input data. This 
particular data source is a 29GB xml file, and I wrote a quick program 
which sorts that by one key in 40 minutes, which will hopefully allow 
later data sources (which are easier to sort) to take advantage of spacial 
locality in the table. However, that key is not the same one as the one 
used in the query above, hence why I say we can't really dictate the order 
of the entries. There's another complication which I won't go into.



Also, parsing/planning time could be part of your problem here with 1000
things to look at.  Can you adjust your client code to use a prepared
query?  I'd try
SELECT * FROM table WHERE field = ANY($1::text[])
(or whatever the field datatype actually is) and then push the list
over as a single parameter value using array syntax.  You might find
that it scales to much larger IN-lists that way.


Yes, that is a useful suggestion. However, I am fairly clear that the 
system is disk seek-bound at the moment, so it probably wouldn't make a 
massive improvement. It would also unfortunately require changing a lot of 
our code. Worth doing at some point.


Matthew

--
"Interwoven alignment preambles are not allowed."
If you have been so devious as to get this message, you will understand
it, and you deserve no sympathy.  -- Knuth, in the TeXbook

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID arrays and performance

2008-09-19 Thread Mark Mielke

Matthew Wakeling <[EMAIL PROTECTED]> writes:
In order to improve the performance, I made the system look ahead in the 
source, in groups of a thousand entries, so instead of running:

SELECT * FROM table WHERE field = 'something';
a thousand times, we now run:
SELECT * FROM table WHERE field IN ('something', 'something else'...);
with a thousand things in the IN. Very simple query. It does run faster 
than the individual queries, but it still takes quite a while. Here is an 
example query:



Have you considered temporary tables? Use COPY to put everything you 
want to query into a temporary table, then SELECT to join the results, 
and pull all of the results, doing additional processing (UPDATE) as you 
pull?


Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>



Re: [PERFORM] why does this use the wrong index?

2008-09-19 Thread Jeff Davis
> So, What can I do to encourage Postgres to use the first index even when the
> date range is smaller.
> 

It looks like PostgreSQL is estimating the selectivity of your date
ranges poorly. In the second (bad) plan it estimates that the index scan
with the filter will return 1 row (and that's probably because it
estimates that the date range you specify will match only one row).

This leads PostgreSQL to choose the narrower index because, if the index
scan is only going to return one row anyway, it might as well scan the
smaller index.

What's the n_distinct for start_time?

=> select n_distinct from pg_stats where tablename='ad_log' and
attname='start_time';

If n_distinct is near -1, that would explain why it thinks that it will
only get one result.

Based on the difference between the good index scan (takes 0.056ms per
loop) and the bad index scan with the filter (311ms per loop), the
"player" condition must be very selective, but PostgreSQL doesn't care
because it already thinks that the date range is selective.

Regards,
Jeff Davis


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] why does this use the wrong index?

2008-09-19 Thread Jeff Davis
On Fri, 2008-09-19 at 11:25 -0700, Jeff Davis wrote:
> What's the n_distinct for start_time?

Actually, I take that back. Apparently, PostgreSQL can't change "x
BETWEEN y AND y" into "x=y", so PostgreSQL can't use n_distinct at all.

That's your problem. If it's one day only, change it to equality and it
should be fine.

Regards,
Jeff Davis


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Intel's X25-M SSD

2008-09-19 Thread Greg Smith

On Mon, 8 Sep 2008, Merlin Moncure wrote:


What's interesting about the X25 is that they managed to pull the
numbers they got out of a MLC flash product.  They managed this with a
DRAM buffer and the custom controller.


I finally found a good analysis of what's wrong with most of the cheap MLC 
drives:


http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=7

240ms random write latency...wow, no wonder I keep hearing so many reports 
of cheap SSD just performing miserably.  JMicron is one of those companies 
I really avoid, never seen a design from them that wasn't cheap junk. 
Shame their awful part is in so many of the MLC flash products.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance