Re: [PERFORM] RAID arrays and performance
On Thu, 18 Sep 2008, Greg Smith wrote: It's just that our system is doing a lot of bitmap index scans at the moment, and it'd help to be able to spread them across the 16 discs in the RAID array. It's the bottleneck in our system at the moment. If you have some specific bitmap index scan test case suggestions you can pass along (either publicly or in private to me, I can probably help anonymize them), that's one of the things that has been holding this up. Okay, here's a description of what we're doing. We are importing data from a large file (so we don't have a choice on the order that the data comes in). For each entry in the source, we have to look up the corresponding row in the database, and issue the correct "UPDATE" command according to what's in the database and what's in the data source. Long story - it isn't feasible to change the overall process. In order to improve the performance, I made the system look ahead in the source, in groups of a thousand entries, so instead of running: SELECT * FROM table WHERE field = 'something'; a thousand times, we now run: SELECT * FROM table WHERE field IN ('something', 'something else'...); with a thousand things in the IN. Very simple query. It does run faster than the individual queries, but it still takes quite a while. Here is an example query: SELECT a1_.id AS a1_id, a1_.primaryIdentifier AS a2_ FROM Gene AS a1_ WHERE a1_.primaryIdentifier IN ('SPAC11D3.15', 'SPAC11D3.16c', 'SPAC11D3.17', 'SPAC11D3.18c', 'SPAC15F9.01c', 'SPAC15F9.02', 'SPAC16.01', 'SPAC18G6.01c', 'SPAC18G6.02c', 'SPAC18G6.04c', 'SPAC18G6.05c', 'SPAC18G6.06', 'SPAC18G6.07c', 'SPAC18G6.09c', 'SPAC18G6.10', 'SPAC18G6.11c', 'SPAC18G6.12c', 'SPAC18G6.13', 'SPAC18G6.14c', 'SPAC18G6.15', 'SPAC1B9.01c', 'SPAC1D4.02c', 'SPAC1D4.03c', 'SPAC1D4.04', 'SPAC1D4.05c', 'SPAC1D4.07c', 'SPAC1D4.08', 'SPAC1D4.09c', 'SPAC1D4.10', 'SPAC1D4.11c', 'SPAC1F3.11', 'SPAC23A1.10', 'SPAC23E2.01', 'SPAC23E2.02', 'SPAC23E2.03c', 'SPAC26A3.02', 'SPAC26A3.03c', 'SPAC26A3.05', 'SPAC26A3.06', 'SPAC26A3.07c', 'SPAC26A3.08', 'SPAC26A3.09c', 'SPAC26A3.10', 'SPAC26A3.11', 'SPAC26A3.14c', 'SPAC26A3.15c', 'SPAC26A3.16', 'SPAC27F1.01c', 'SPAC27F1.03c', 'SPAC27F1.04c', 'SPAC27F1.05c', 'SPAC27F1.06c', 'SPAC3H8.02', 'SPAC3H8.03', 'SPAC3H8.04', 'SPAC3H8.05c', 'SPAC3H8.06', 'SPAC3H8.07c', 'SPAC3H8.08c', 'SPAC3H8.09c', 'SPAC3H8.10', 'SPAC3H8.11', 'SPAC8E11.11', 'SPBC106.15', 'SPBC17G9.10', 'SPBC24E9.15c', 'WBGene0969', 'WBGene3035', 'WBGene4095', 'WBGene00016011', 'WBGene00018672', 'WBGene00018674', 'WBGene00018675', 'WBGene00018676', 'WBGene00018959', 'WBGene00018960', 'WBGene00018961', 'WBGene00023407') ORDER BY a1_.id LIMIT 2000; And the corresponding EXPLAIN ANALYSE: Limit (cost=331.28..331.47 rows=77 width=17) (actual time=121.973..122.501 rows=78 loops=1) -> Sort (cost=331.28..331.47 rows=77 width=17) (actual time=121.968..122.152 rows=78 loops=1) Sort Key: id Sort Method: quicksort Memory: 29kB -> Bitmap Heap Scan on gene a1_ (cost=174.24..328.87 rows=77 width=17) (actual time=114.311..121.705 rows=78 loops=1) Recheck Cond: (primaryidentifier = ANY ('{SPAC11D3.15... -> Bitmap Index Scan on gene__key_primaryidentifier (cost=0.00..174.22 rows=77 width=0) (actual time=44.434..44.434 rows=150 loops=1) Index Cond: (primaryidentifier = ANY ('{SPAC11D3.15,SPAC11D3.16c... Total runtime: 122.733 ms (9 rows) Although it's probably in the cache, as it took 1073 ms the first time. The table has half a million rows, but tables all over the database are being accessed, so the cache is shared between several hundred million rows. Postgres executes this query in two stages. First it does a trawl of the index (on field), and builds an bitmap. Then it fetches the pages according to the bitmap. I can see the second stage being quite easy to adapt for fadvise, but the first stage would be a little more tricky. Both stages are equally important, as they take a comparable amount of time. We are running this database on a 16-spindle RAID array, so the benefits to our process of fully utilising them would be quite large. I'm considering if I can parallelise things a little though. Alternately, if you'd like to join in on testing this all out more help would certainly be welcome. How would you like me to help? Matthew -- What goes up must come down. Ask any system administrator. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] RAID arrays and performance
Matthew Wakeling <[EMAIL PROTECTED]> writes: > In order to improve the performance, I made the system look ahead in the > source, in groups of a thousand entries, so instead of running: > SELECT * FROM table WHERE field = 'something'; > a thousand times, we now run: > SELECT * FROM table WHERE field IN ('something', 'something else'...); > with a thousand things in the IN. Very simple query. It does run faster > than the individual queries, but it still takes quite a while. Here is an > example query: Your example shows the IN-list as being sorted, but I wonder whether you actually are sorting the items in practice? If not, you might try that to improve locality of access to the index. Also, parsing/planning time could be part of your problem here with 1000 things to look at. Can you adjust your client code to use a prepared query? I'd try SELECT * FROM table WHERE field = ANY($1::text[]) (or whatever the field datatype actually is) and then push the list over as a single parameter value using array syntax. You might find that it scales to much larger IN-lists that way. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] RAID arrays and performance
On Fri, 19 Sep 2008, Tom Lane wrote: Your example shows the IN-list as being sorted, but I wonder whether you actually are sorting the items in practice? If not, you might try that to improve locality of access to the index. Well, like I said, we generally don't have the luxury of dictating the order of entries in the data source. However, the IN list itself is sorted - more to do with making the logs readable and the tests reproducable than for performance. However, I have been looking at changing the order of the input data. This particular data source is a 29GB xml file, and I wrote a quick program which sorts that by one key in 40 minutes, which will hopefully allow later data sources (which are easier to sort) to take advantage of spacial locality in the table. However, that key is not the same one as the one used in the query above, hence why I say we can't really dictate the order of the entries. There's another complication which I won't go into. Also, parsing/planning time could be part of your problem here with 1000 things to look at. Can you adjust your client code to use a prepared query? I'd try SELECT * FROM table WHERE field = ANY($1::text[]) (or whatever the field datatype actually is) and then push the list over as a single parameter value using array syntax. You might find that it scales to much larger IN-lists that way. Yes, that is a useful suggestion. However, I am fairly clear that the system is disk seek-bound at the moment, so it probably wouldn't make a massive improvement. It would also unfortunately require changing a lot of our code. Worth doing at some point. Matthew -- "Interwoven alignment preambles are not allowed." If you have been so devious as to get this message, you will understand it, and you deserve no sympathy. -- Knuth, in the TeXbook -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] RAID arrays and performance
Matthew Wakeling <[EMAIL PROTECTED]> writes: In order to improve the performance, I made the system look ahead in the source, in groups of a thousand entries, so instead of running: SELECT * FROM table WHERE field = 'something'; a thousand times, we now run: SELECT * FROM table WHERE field IN ('something', 'something else'...); with a thousand things in the IN. Very simple query. It does run faster than the individual queries, but it still takes quite a while. Here is an example query: Have you considered temporary tables? Use COPY to put everything you want to query into a temporary table, then SELECT to join the results, and pull all of the results, doing additional processing (UPDATE) as you pull? Cheers, mark -- Mark Mielke <[EMAIL PROTECTED]>
Re: [PERFORM] why does this use the wrong index?
> So, What can I do to encourage Postgres to use the first index even when the > date range is smaller. > It looks like PostgreSQL is estimating the selectivity of your date ranges poorly. In the second (bad) plan it estimates that the index scan with the filter will return 1 row (and that's probably because it estimates that the date range you specify will match only one row). This leads PostgreSQL to choose the narrower index because, if the index scan is only going to return one row anyway, it might as well scan the smaller index. What's the n_distinct for start_time? => select n_distinct from pg_stats where tablename='ad_log' and attname='start_time'; If n_distinct is near -1, that would explain why it thinks that it will only get one result. Based on the difference between the good index scan (takes 0.056ms per loop) and the bad index scan with the filter (311ms per loop), the "player" condition must be very selective, but PostgreSQL doesn't care because it already thinks that the date range is selective. Regards, Jeff Davis -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] why does this use the wrong index?
On Fri, 2008-09-19 at 11:25 -0700, Jeff Davis wrote: > What's the n_distinct for start_time? Actually, I take that back. Apparently, PostgreSQL can't change "x BETWEEN y AND y" into "x=y", so PostgreSQL can't use n_distinct at all. That's your problem. If it's one day only, change it to equality and it should be fine. Regards, Jeff Davis -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Intel's X25-M SSD
On Mon, 8 Sep 2008, Merlin Moncure wrote: What's interesting about the X25 is that they managed to pull the numbers they got out of a MLC flash product. They managed this with a DRAM buffer and the custom controller. I finally found a good analysis of what's wrong with most of the cheap MLC drives: http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=7 240ms random write latency...wow, no wonder I keep hearing so many reports of cheap SSD just performing miserably. JMicron is one of those companies I really avoid, never seen a design from them that wasn't cheap junk. Shame their awful part is in so many of the MLC flash products. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance