[PERFORM] System overload / context switching / oom, 8.3
[dstat output omitted - only the column headers survived: total-cpu-usage (usr sys idl wai hiq siq), procs (run blk new), paging (in out), disk (read writ), system (int csw)]

SPECS:
PostgreSQL 8.3.9 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Installed from the debian etch-backports package.
Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686 GNU/Linux (Debian Etch)
8 GB RAM
4 Quad Core Intel(R) Xeon(R) CPU E5440 @ 2.83GHz stepping 06
L1 I cache: 32K, L1 D cache: 32K, L2 cache: 6144K
LSI Logic SAS based MegaRAID driver (battery backed/write cache enabled)
Dell PERC 6/i
8 SEAGATE Model: ST973451SS Rev: SM04 (72 GB) ANSI SCSI revision: 05

RAID Configuration:
sda RAID1  2 disks (with pg_xlog WAL files on its own partition)
sdb RAID10 6 disks (pg base dir only)

POSTGRES:
261 databases
238 active databases (w/connection processes)
863 connections to those 238 databases

postgresql.conf:
max_connections = 1100
shared_buffers = 800MB
max_prepared_transactions = 0
work_mem = 32MB
maintenance_work_mem = 64MB
max_fsm_pages = 330
max_fsm_relations = 1
vacuum_cost_delay = 50ms
bgwriter_delay = 150ms
bgwriter_lru_maxpages = 250
bgwriter_lru_multiplier = 2.5
wal_buffers = 8MB
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
effective_cache_size = 5000MB
default_statistics_target = 100
log_min_duration_statement = 1000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_temp_files = 0
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 0

Thanks for any ideas!

Rob

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] System overload / context switching / oom, 8.3
Scott Marlowe wrote:
> On Tue, Feb 2, 2010 at 12:11 PM, Rob wrote:
>> postgresql.conf:
>> max_connections = 1100
>> work_mem = 32MB
>
> 32MB * 1000 = 32,000MB... And that's if you max out connections and
> they each only do 1 sort. If you're running many queries that run > 1
> sorts it'll happen a lot sooner.
>
> Either drop max connections or work_mem is what I'd do to start with.
> If you have one or two reporting apps that need it higher, then set it
> higher for just those connections / users.

Thanks much. So does dropping work_mem to the default of 1MB sound good?

By moving databases around we're getting max_connections below 600 or 700.
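A minimal sketch of the per-user override Scott describes, for anyone reading along (the role and database names here are hypothetical, and the values are only illustrative):

  -- keep the server-wide default modest in postgresql.conf, e.g. work_mem = 1MB,
  -- then raise it only where big sorts are actually needed:
  ALTER ROLE reporting_user SET work_mem = '64MB';   -- applies to new sessions of this role
  ALTER DATABASE reports SET work_mem = '64MB';      -- or scope it to one database

  -- a single session can also raise it just for one heavy query:
  SET work_mem = '256MB';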
Re: [PERFORM] System overload / context switching / oom, 8.3
Kevin Grittner wrote:
> Rob wrote:
>> 8gb ram
>> ~240 active databases
>> 800+ db connections via tcp.
>
> 8 GB RAM divided by 800 DB connections is 10 MB per connection. You
> seriously need to find some way to use connection pooling. I'm not
> sure the best way to do that with 240 active databases.

By wrangling the applications, we've got the number of connections down to
530 and the number of active databases down to 186. The application's poor
connection management exacerbates the problem.

Thanks for the idea,
Rob
[PERFORM] Identical Queries
Question for anyone...

I tried posting to the bugs list, and they said this is a better question for
here. I have two queries. One runs in about 2 seconds. The other takes
upwards of 2 minutes. I have a temp table that is created with 2 columns.
This table is joined with the larger database of call detail records.
However, these 2 queries are handled very differently.

The queries:

First
calldetail=> EXPLAIN SELECT current.* FROM current JOIN anitmp ON current.destnum=anitmp.ani AND istf=true;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2026113.09 rows=500908 width=108)
   ->  Seq Scan on anitmp  (cost=0.00..33.62 rows=945 width=8)
         Filter: (istf = true)
   ->  Index Scan using i_destnum on current  (cost=0.00..2137.36 rows=531 width=108)
         Index Cond: (current.destnum = "outer".ani)
(5 rows)

Second
calldetail=> EXPLAIN SELECT current.* FROM current JOIN anitmp ON current.orignum=anitmp.ani AND istf=false;
                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=35.99..3402035.53 rows=5381529 width=108)
   Hash Cond: ("outer".orignum = "inner".ani)
   ->  Seq Scan on current  (cost=0.00..907191.05 rows=10170805 width=108)
   ->  Hash  (cost=33.62..33.62 rows=945 width=8)
         ->  Seq Scan on anitmp  (cost=0.00..33.62 rows=945 width=8)
               Filter: (istf = false)
(6 rows)

The tables:

             Table "public.current"
  Column  |            Type             | Modifiers
----------+-----------------------------+-----------
 datetime | timestamp without time zone |
 orignum  | bigint                      |
 destnum  | bigint                      |
 billto   | bigint                      |
 cost     | numeric(6,4)                |
 duration | numeric(8,1)                |
 origcity | character(12)               |
 destcity | character(12)               |
 file     | character varying(30)       |
 linenum  | integer                     |
 carrier  | character(1)                |
Indexes:
    "i_destnum" btree (destnum)
    "i_orignum" btree (orignum)

     Table "public.anitmp"
 Column |  Type   | Modifiers
--------+---------+-----------
 ani    | bigint  |
 istf   | boolean |

I was also asked to post the EXPLAIN ANALYZE for both:

calldetail=> EXPLAIN ANALYZE SELECT current.* FROM anitmp JOIN current ON istf=false AND current.orignum=anitmp.ani;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=35.99..3427123.39 rows=5421215 width=108) (actual time=1994.164..157443.544 rows=157 loops=1)
   Hash Cond: ("outer".orignum = "inner".ani)
   ->  Seq Scan on current  (cost=0.00..913881.09 rows=10245809 width=108) (actual time=710.986..137963.320 rows=10893541 loops=1)
   ->  Hash  (cost=33.62..33.62 rows=945 width=8) (actual time=10.948..10.948 rows=0 loops=1)
         ->  Seq Scan on anitmp  (cost=0.00..33.62 rows=945 width=8) (actual time=10.934..10.939 rows=2 loops=1)
               Filter: (istf = false)
 Total runtime: 157443.900 ms
(7 rows)

calldetail=> EXPLAIN ANALYZE SELECT current.* FROM current JOIN anitmp ON current.destnum=anitmp.ani AND istf=true;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2037526.69 rows=504602 width=108) (actual time=88.752..1050.295 rows=1445 loops=1)
   ->  Seq Scan on anitmp  (cost=0.00..33.62 rows=945 width=8) (actual time=8.189..8.202 rows=2 loops=1)
         Filter: (istf = true)
   ->  Index Scan using i_destnum on current  (cost=0.00..2149.40 rows=534 width=108) (actual time=62.365..517.454 rows=722 loops=2)
         Index Cond: (current.destnum = "outer".ani)
 Total runtime: 1052.862 ms
(6 rows)

Anyone have any ideas for me? I have indexes on each of the necessary columns.

Rob

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
Re: [PERFORM] Identical Queries
There are 4 entries (wanted to make the playing field level for this test).
There are 2 with true for istf and 2 with false.

Rob

Stephan Szabo wrote:
> On Thu, 1 Mar 2007, Rob Schall wrote:
>
>> Question for anyone...
>>
>> I tried posting to the bugs list, and they said this is a better question
>> for here.
>> I have two queries. One runs in about 2 seconds. The other takes upwards
>> of 2 minutes. I have a temp table that is created with 2 columns. This
>> table is joined with the larger database of call detail records.
>> However, these 2 queries are handled very differently.
>
> How many rows are there in anitmp and how many rows in anitmp have
> istf=true and how many have istf=false? If you don't currently analyze the
> temp table after adding the rows, you might find that doing an analyze
> helps, or at least makes the row estimates better.

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
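A hedged sketch of the analyze-after-load step Stephan suggests, using the table from this thread (the inserted values are placeholders, not real data):

  CREATE TEMP TABLE anitmp (ani bigint, istf boolean);
  INSERT INTO anitmp VALUES (5551234567, true), (5559876543, false);  -- load the real rows here
  ANALYZE anitmp;   -- gives the planner real row counts and the istf distribution
  EXPLAIN ANALYZE
  SELECT current.* FROM current JOIN anitmp ON current.destnum = anitmp.ani AND istf = true;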
[PERFORM] Building multiple indexes concurrently
Let's say I have a large table bigTable to which I would like to add two btree
indexes. Is there a more efficient way to create indexes than:

CREATE INDEX idx_foo on bigTable (foo);
CREATE INDEX idx_baz on bigTable (baz);

Or

CREATE INDEX CONCURRENTLY idx_foo on bigTable (foo);
CREATE INDEX CONCURRENTLY idx_baz on bigTable (baz);

Are there any particular performance optimizations that would be in play in
such a scenario?

At a minimum I assume that if both of the commands were started at about the
same time they would each scan the table in the same direction and whichever
creation was slower would benefit from most of the table data it needed being
prepopulated in shared buffers. Is this the case?

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Building multiple indexes concurrently
On Wed, Mar 17, 2010 at 7:30 AM, Tom Lane wrote:
> Greg Smith writes:
>> Rob Wultsch wrote:
>>> At a minimum I assume that if both of the commands were started at
>>> about the same time they would each scan the table in the same
>>> direction and whichever creation was slower would benefit from most of
>>> the table data it needed being prepopulated in shared buffers. Is this
>>> the case?
>
>> This might be optimistic;
>
> No, it's not optimistic in the least, at least not since we implemented
> synchronized seqscans (in 8.3 or thereabouts).
>
> regards, tom lane

Where can I find details about this in the documentation?

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
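For reference, the synchronized-scan behaviour Tom mentions is governed by a runtime parameter in 8.3 and later; a quick, hedged way to check it on a given server:

  SHOW synchronize_seqscans;        -- 'on' by default in 8.3+
  SET synchronize_seqscans = off;   -- can be toggled per session for testing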
Re: [PERFORM] Random Page Cost and Planner
On Tue, May 25, 2010 at 4:26 PM, David Jarvis wrote:
> shared_buffers = 1GB
> temp_buffers = 32MB
> work_mem = 32MB
> maintenance_work_mem = 64MB
> effective_cache_size = 256MB

Shouldn't effective_cache_size be significantly larger?

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
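As a side note, effective_cache_size is only a planner estimate of how much data the OS plus shared_buffers can cache - it allocates nothing - so it is cheap to try a larger value in a single session before editing postgresql.conf (the 6GB figure below is purely illustrative):

  SET effective_cache_size = '6GB';
  -- then re-run EXPLAIN on the queries in question and compare the plans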
Re: [PERFORM] performance of temporary vs. regular tables
On Fri, May 28, 2010 at 4:04 AM, Joachim Worringen wrote:
> On 05/26/2010 06:03 PM, Joachim Worringen wrote:
>> Am 25.05.2010 12:41, schrieb Andres Freund:
>>> On Tuesday 25 May 2010 11:00:24 Joachim Worringen wrote:
>>>> Thanks. So, the Write-Ahead-Logging (being used or not) does not matter?
>>>
>>> It does matter quite significantly in my experience. Both from an io
>>> and a cpu overhead perspective.
>>
>> O.k., looks as if I have to make my own experience... I'll let you know
>> if possible.
>
> As promised, I did a tiny benchmark - basically, 8 empty tables are filled
> with 100k rows each within 8 transactions (somewhat typically for my
> application). The test machine has 4 cores, 64G RAM and RAID1 10k drives
> for data.
>
> # INSERTs into a TEMPORARY table:
> [joac...@testsrv scaling]$ time pb query -d scaling_qry_1.xml
>
> real 3m18.242s
> user 1m59.074s
> sys 1m51.001s
>
> # INSERTs into a standard table:
> [joac...@testsrv scaling]$ time pb query -d scaling_qry_1.xml
>
> real 3m35.090s
> user 2m5.295s
> sys 2m2.307s
>
> Thus, there is a slight hit of about 10% (which may even be within
> measurement variations) - your mileage will vary.
>
> Joachim

I think it would be interesting to create a ram disk and insert into it. In
the MySQL community, even though MyISAM has fallen out of use, the Memory
table (based on MyISAM) is still somewhat used.

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
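For anyone wanting to repeat a comparison like Joachim's on their own hardware, a minimal hedged sketch (table names and row counts are hypothetical, not the original benchmark; temporary tables are session-private and skip WAL):

  CREATE TABLE load_regular (id integer, payload text);
  INSERT INTO load_regular
      SELECT g, 'some row data' FROM generate_series(1, 100000) g;

  CREATE TEMPORARY TABLE load_temp (id integer, payload text);
  INSERT INTO load_temp
      SELECT g, 'some row data' FROM generate_series(1, 100000) g;
  -- time the two INSERTs, e.g. with \timing in psql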
Re: [PERFORM] PostgreSQL as a local in-memory cache
On Fri, Jun 18, 2010 at 1:55 PM, Josh Berkus wrote:
>> It must be a setting, not a version.
>>
>> For instance suppose you have a session table for your website and a
>> users table.
>>
>> - Having ACID on the users table is of course a must ;
>> - for the sessions table you can drop the "D"
>
> You're trying to solve a different use-case than the one I am.
>
> Your use-case will be solved by global temporary tables. I suggest that
> you give Robert Haas some help & feedback on that.
>
> My use case is people using PostgreSQL as a cache, or relying entirely
> on replication for durability.
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com

Is he? Wouldn't a global temporary table have content that is not visible
between db connections? A db session may not be the same as a user session.

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] performance on new linux box
On Wed, Jul 7, 2010 at 4:06 PM, Ryan Wexler wrote: > Postgresql was previously running on a single cpu linux machine with 2 gigs > of memory and a single sata drive (v8.3). Basically a desktop with linux on > it. I experienced slow performance. > > So, I finally moved it to a real server. A dually zeon centos machine with > 6 gigs of memory and raid 10, postgres 8.4. But, I am now experiencing even > worse performance issues. > > My system is consistently highly transactional. However, there is also > regular complex queries and occasional bulk loads. > > On the new system the bulk loads are extremely slower than on the previous > machine and so are the more complex queries. The smaller transactional > queries seem comparable but i had expected an improvement. Performing a db > import via psql -d databas -f dbfile illustrates this problem. It takes 5 > hours to run this import. By contrast, if I perform this same exact import > on my crappy windows box with only 2 gigs of memory and default postgres > settings it takes 1 hour. Same deal with the old linux machine. How is > this possible? > > Here are some of my key config settings: > max_connections = 100 > shared_buffers = 768MB > effective_cache_size = 2560MB > work_mem = 16MB > maintenance_work_mem = 128MB > checkpoint_segments = 7 > checkpoint_timeout = 7min > checkpoint_completion_target = 0.5 > > I have tried varying the shared_buffers size from 128 all the way to 1500mbs > and got basically the same result. Is there a setting change I should be > considering? > > Does 8.4 have performance problems or is this unique to me? > > thanks > > I think the most likely explanation is that the crappy box lied about fsync'ing data and your server is not. Did you purchase a raid card with a bbu? If so, can you set the write cache policy to write-back? -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
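A hedged first step for the fsync question raised above, checked from psql (these are standard GUCs in 8.3/8.4). An honest fsync on a single disk with no battery-backed cache limits single-row commits to roughly the disk's rotation rate, which a small pgbench run will expose, so an absurdly high commit rate on the old box would support the theory that it was not really flushing:

  SHOW fsync;               -- must be on for crash safety
  SHOW synchronous_commit;  -- defaults to on in 8.4
  SHOW wal_sync_method;     -- how WAL flushes are issued to the OS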
Re: [PERFORM] now() gives same time within the session
On Mon, Jul 12, 2010 at 4:15 AM, A. Kretschmer wrote: > In response to atul.g...@globaldatapoint.com : >> Hi, >> >> >> >> I need to log the start and end time of the procedures in a table. But the >> start and end time are same. This is how I recreated the issue. >> >> >> >> create table test_time (time timestamp); >> >> delete from test_time; >> >> insert into test_time select now(); > > > Use timeofday() instead, now() returns the transaction starting time. Is this part of the SQL standard? -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
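Whether or not it is standard, the behaviour is easy to demonstrate: within a transaction now() stays frozen at the transaction start time, while clock_timestamp() (and the older timeofday(), which returns text) keep advancing. A small sketch:

  BEGIN;
  SELECT now() AS txn_start, clock_timestamp() AS wall_clock;
  SELECT pg_sleep(1);
  SELECT now() AS txn_start, clock_timestamp() AS wall_clock;  -- now() unchanged, clock_timestamp() ~1s later
  COMMIT;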
Re: [PERFORM] Using more tha one index per table
On Wed, Jul 21, 2010 at 12:53 AM, A. Kretschmer <andreas.kretsch...@schollglas.com> wrote:
> In response to Elias Ghanem:
>> Hi,
>> I have a question concerning the uses of indexes in Postgresql.
>> I read that in PG a query can not use more than one index per table: "a
>> query or data manipulation command can use at most one index per table".
>
> That's not true, but it's true for MySQL, afaik.
>
> Andreas
> --
> Andreas Kretschmer
> Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
> GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

That is not true either, though MySQL is less good at using bitmap'ed
indexes. 5.0 can use "merge indexes".

--
Rob Wultsch
wult...@gmail.com
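A minimal hedged illustration of PostgreSQL combining two single-column indexes on one table (table, column, and index names are hypothetical; whether the planner chooses this depends on the data and statistics):

  CREATE TABLE orders (customer_id integer, status text, placed_at timestamp);
  CREATE INDEX orders_customer_idx ON orders (customer_id);
  CREATE INDEX orders_status_idx ON orders (status);

  -- with suitable data, EXPLAIN can show both indexes ANDed together:
  --   Bitmap Heap Scan on orders
  --     ->  BitmapAnd
  --           ->  Bitmap Index Scan on orders_customer_idx
  --           ->  Bitmap Index Scan on orders_status_idx
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';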
Re: [PERFORM] Using more tha one index per table
On Thu, Jul 22, 2010 at 1:35 AM, Richard Huxton wrote:
> On 22/07/10 03:27, Greg Smith wrote:
>> Steve Atkins wrote:
>>> If http://postgresql.org/docs/9.0/* were to 302 redirect to
>>> http://postgresql.org/docs/current/* while 9.0 is the current release
>>> (and similarly for 9.1 and so on) I suspect we'd find many more links
>>> to current and fewer links to specific versions after a year or two.
>>
>> True, but this would leave people with no way to bookmark a permanent
>> link to whatever is the current version, which will represent a
>> regression for how some people want the site to work.
>
> Having a quick look at the website, a simple change might be to have a
> large "CURRENT MANUALS" link above all the versioned links. That should
> help substantially.
>
> --
>   Richard Huxton
>   Archonet Ltd
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

I suggested a few weeks ago adding a drop-down menu for other versions of the
manual for a page. I have not had time to write a patch, but I think it is
something that MySQL does better than pg. As an example, take a look at the
page on SELECT for MySQL: http://dev.mysql.com/doc/refman/5.1/en/select.html .
If you want an earlier or later version they are easily accessible via a link
on the left.

--
Rob Wultsch
wult...@gmail.com
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 8:37 AM, Greg Smith wrote: > Tom Lane wrote: >> >> You've got entirely too simplistic a view of what the "delta" might be, >> I fear. In particular there are various sorts of changes that involve >> inserting the data carried in the WAL record and shifting pre-existing >> data around to make room, or removing an item and moving remaining data >> around. If you try to replay that type of action against a torn page, >> you'll get corrupted results. >> > > I wasn't sure exactly how those were encoded, thanks for the clarification. > Given that, it seems to me there are only two situations where > full_page_writes is safe to turn off: > > 1) The operating system block size is exactly the same database block size, > and all writes are guaranteed to be atomic to that block size. > 2) You're using a form of journaled filesystem where data blocks are never > updated, they're always written elsewhere and the filesystem is redirected > to that new block once it's on disk. > > Looks to me like whether or not there's a non-volatile write cache sitting > in the middle, like a BBU protected RAID card, doesn't really make any > difference here then. > > I think that most people who have thought they were safe to turn off > full_page_writes in the past did so because they believed they were in > category (1) here. I've never advised anyone to do that, because it's so > difficult to validate the truth of. Just given that, I'd be tempted to join > in on suggesting this parameter just go away in the name of safety, except > that I think category (2) here is growing now. ZFS is the most obvious > example where the atomic write implementation seems to always make disabling > full_page_writes safe. > For the sake of argument, has PG considered using a double write buffer similar to InnodB? -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 10:28 AM, Kevin Grittner wrote: > Rob Wultsch wrote: > >> has PG considered using a double write buffer similar to InnodB? > > That seems inferior to the full_page_writes strategy, where you only > write a page twice the first time it is written after a checkpoint. > We're talking about when we might be able to write *less*, not more. > > -Kevin > By "write" do you mean number of writes, or the number of bytes of the writes? For number of writes, yes a double write buffer will lose. In terms of number of bytes, I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner wrote: > Rob Wultsch wrote: > >> I would think full_page_writes=off + double write buffer should be >> far superior, particularly given that the WAL is shipped over the >> network to slaves. > > For a reasonably brief description of InnoDB double write buffers, I > found this: > > http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ > > One big question before even considering this would by how to > determine whether a potentially torn page "is inconsistent". > Without a page CRC or some such mechanism, I don't see how this > technique is possible. > > Even if it's possible, it's far from clear to me that it would be an > improvement. The author estimates (apparently somewhat loosely) > that it's a 5% to 10% performance hit in InnoDB; I'm far from > certain that full_page_writes cost us that much. Does anyone have > benchmark numbers handy? > > -Kevin > Ignoring (briefly) the cost in terms of performance of the different system, not needing full_page_writes would make geographically dispersed replication possible for certain cases where it is not currently (or at least rather painful). -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 1:15 PM, Kevin Grittner wrote:
> Rob Wultsch wrote:
>> not needing full_page_writes would make geographically dispersed
>> replication possible for certain cases where it is not currently
>> (or at least rather painful).
>
> Do you have any hard numbers on WAL file size impact? How much does
> pglesslog help in a file-based WAL transmission environment? Should
> we be considering similar filtering for streaming replication?
>
> -Kevin

No, I am a DBA that mostly works on MySQL. I have had to deal with
(handwaving...) tangential issues recently. I really would like to work with
PG more and this seems like it would be a significant hindrance for certain
usage patterns. Lots of replication does not take place over gig...

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Tue, Oct 26, 2010 at 5:41 AM, Robert Haas wrote: > On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner > wrote: >> Rob Wultsch wrote: >> >>> I would think full_page_writes=off + double write buffer should be >>> far superior, particularly given that the WAL is shipped over the >>> network to slaves. >> >> For a reasonably brief description of InnoDB double write buffers, I >> found this: >> >> http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ >> >> One big question before even considering this would by how to >> determine whether a potentially torn page "is inconsistent". >> Without a page CRC or some such mechanism, I don't see how this >> technique is possible. > > There are two sides to this problem: figuring out when to write a page > to the double write buffer, and figuring out when to read it back from > the double write buffer. The first seems easy: we just do it whenever > we would XLOG a full page image. As to the second, when we write the > page out to the double write buffer, we could also write to the double > write buffer the LSN of the WAL record which depends on that full page > image. Then, at the start of recovery, we scan the double write > buffer and remember all those LSNs. When we reach one of them, we > replay the full page image. > > The good thing about this is that it would reduce WAL volume; the bad > thing about it is that it would probably mean doing two fsyncs where > we only now do one. > The double write buffer is one of the few areas where InnoDB does more IO (in the form of fsynch's) than PG. InnoDB also has fuzzy checkpoints (which help to keep dirty pages in memory longer), buffering of writing out changes to secondary indexes, and recently tunable page level compression. Given that InnoDB is not shipping its logs across the wire, I don't think many users would really care if it used the double writer or full page writes approach to the redo log (other than the fact that the log files would be bigger). PG on the other hand *is* pushing its logs over the wire... -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas wrote:
> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wrote:
>> The double write buffer is one of the few areas where InnoDB does more
>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>> checkpoints (which help to keep dirty pages in memory longer),
>> buffering of writing out changes to secondary indexes, and recently
>> tunable page level compression.
>
> Baron Schwartz was talking to me about this at Surge. I don't really
> understand how the fuzzy checkpoint stuff works, and I haven't been
> able to find a good description of it anywhere. How does it keep
> dirty pages in memory longer? Details on the other things you mention
> would be interesting to hear, too.

For checkpoint behavior:
http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false

I would think that best case behavior "sharp" checkpoints with a large
checkpoint_completion_target would have behavior similar to a fuzzy
checkpoint.

Insert buffering (for InnoDB 1.1+ this evidently also covers delete and purge
operations):
http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

For a recent ~800GB db I had to restore, the insert buffer saved 92% of the
io needed for secondary indexes.

Compression:
http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html

For many workloads 50% compression results in negligible impact to
performance. For certain workloads compression can help performance. Please
note that InnoDB also has a non-tunable TOAST-like feature.

>> Given that InnoDB is not shipping its logs across the wire, I don't
>> think many users would really care if it used the double writer or
>> full page writes approach to the redo log (other than the fact that
>> the log files would be bigger). PG on the other hand *is* pushing its
>> logs over the wire...
>
> So how is InnoDB doing replication? Is there a second log just for that?

The other log is the "binary log" and it is one of the biggest problems with
MySQL. Running MySQL in such a way that the binary log stays in sync with the
InnoDB redo log has a very significant impact on performance.
http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
(check out the pretty graph)

If you are going to West you should consider heading over to the Facebook
office on Tuesday as the MySQL team is having something of an open house:
http://www.facebook.com/event.php?eid=160712450628622

Mark Callaghan from the Facebook MySQL Engineering team (and several members
of their ops team, for that matter) understands InnoDB dramatically better
than I do.

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas wrote:
> On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch wrote:
>> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas wrote:
>>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wrote:
>>>> The double write buffer is one of the few areas where InnoDB does more
>>>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>>>> checkpoints (which help to keep dirty pages in memory longer),
>>>> buffering of writing out changes to secondary indexes, and recently
>>>> tunable page level compression.
>>>
>>> Baron Schwartz was talking to me about this at Surge. I don't really
>>> understand how the fuzzy checkpoint stuff works, and I haven't been
>>> able to find a good description of it anywhere. How does it keep
>>> dirty pages in memory longer? Details on the other things you mention
>>> would be interesting to hear, too.
>>
>> For checkpoint behavior:
>> http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false
>>
>> I would think that best case behavior "sharp" checkpoints with a large
>> checkpoint_completion_target would have behavior similar to a fuzzy
>> checkpoint.
>
> Well, under that definition of a fuzzy checkpoint, our checkpoints are
> fuzzy even with checkpoint_completion_target=0.
>
> What Baron seemed to be describing was a scheme whereby you could do
> what I might call partial checkpoints. IOW, you want to move the redo
> pointer without writing out ALL the dirty buffers in memory, so you
> write out the pages with the oldest LSNs and then move the redo
> pointer to the oldest LSN you have left. Except that doesn't quite
> work, because the page might have been dirtied at LSN X and then later
> updated again at LSN Y, and you still have to flush it to disk before
> moving the redo pointer to any value >X. So you work around that by
> maintaining a "first dirtied" LSN for each page as well as the current
> LSN.
>
> I'm not 100% sure that this is how it works or that it would work in
> PG, but even assuming that it is and does, I'm not sure what the
> benefit is over the checkpoint-spreading logic we have now. There
> might be some benefit in sorting the writes that we do, so that we can
> spread out the fsyncs. So, write all the blocks to a given file,
> fsync, and then repeat for each underlying data file that has at least
> one dirty block. But that's completely orthogonal to (and would
> actually be hindered by) the approach described in the preceding
> paragraph.

I wish I could answer your questions better. I am a power user that does not
fully understand InnoDB internals. There are not all that many folks that
have a very good understanding of InnoDB internals (given how well it works
there is not all that much need).

>> Insert buffering (for InnoDB 1.1+ this evidently also covers delete and
>> purge operations):
>> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html
>
> We do something a bit like this for GIST indices. It would be
> interesting to see if it also has a benefit for btree indices.
>
>> For a recent ~800GB db I had to restore, the insert buffer saved 92%
>> of io needed for secondary indexes.
>>
>> Compression:
>> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>>
>> For many workloads 50% compression results in negligible impact to
>> performance. For certain workloads compression can help performance.
>> Please note that InnoDB also has a non-tunable TOAST-like feature.
>
> Interesting. I am surprised this works well. It seems that this only
> works for pages that can be compressed by >=50%, which seems like it
> could result in a lot of CPU wasted on failed attempts to compress.

In my world, the spinning disk is almost always the bottleneck. Trading CPU
for IO is almost always a good deal for me.

>>>> Given that InnoDB is not shipping its logs across the wire, I don't
>>>> think many users would really care if it used the double writer or
>>>> full page writes approach to the redo log (other than the fact that
>>>> the log files would be bigger). PG on the other hand *is* pushing its
>>>> logs over the wire...
>>>
>>> So how is InnoDB doing replication? Is there a second log just for that?
>>
>> Th
[PERFORM] Group commit and commit delay/siblings
Manual:
http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
Recent discussion:
http://www.facebook.com/notes/mysql-at-facebook/group-commit-in-postgresql/465781235932

It is my understanding that group commit in PG works without commit_delay or
commit_siblings being enabled. For many people coming from other databases,
the existence of these GUCs seems to suggest that group commit does not work
without them being enabled.

Are these settings useful, and if so how should they be tuned? If they are
generally not useful, should these settings be removed?

--
Rob Wultsch
wult...@gmail.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
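For context, both parameters can be inspected from any psql session; a small sketch of checking the current values before experimenting:

  SHOW commit_delay;     -- microseconds to wait before flushing WAL after a commit; 0 disables it
  SHOW commit_siblings;  -- minimum number of other open transactions required before the delay is applied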
Re: [PERFORM] Group commit and commit delay/siblings
On Sun, Dec 5, 2010 at 7:30 PM, Jignesh Shah wrote: > The commit_siblings = 5 basically checks that it sleeps only when that > many backends are active. This I think is a very expensive check and I > would rather make commit_siblings=0 (which the current code does not > support.. it only supports minimum of 1) The check is expensive > irrespective of the settings .. But anyway here is the real kicker. > In all the tests I did with recent verions 8.4 and version 9.0 , it > seems that the default behavior handles the load well enough and one > does not have to use commit_delay at all. Since when the load is very > high all of them are basically in sync phase and the desired thing > happens anyway. > > Infact using commit_delay will actually add the cost of doing > commit_siblings check and can hurt the performance by increasing CPU > consumption.. Doing commit_siblings check for every transaction is a > killer since it does not return after meeting the minimum backends and > goes through every backend to calculate the total number before > comparing with the minimum. This is probably why most people see a > drop in performance when using commit_delay compared to the default. > > Anyway I would recommended right now to stick with the default and > not really use it. It does the sync absorbtion well if you have two > many users (though not perfect). Sounds like this setting should go away unless there is a very good reason to keep it. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows
On Fri, Dec 17, 2010 at 10:08 AM, Tom Polak wrote: > What kind of performance can I expect out of Postgres compare to MSSQL? You should take any generalizations with a grain of salt. I suggest that you do a POC. > Let's assume that Postgres is running on Cent OS x64 and MSSQL is running > on Windows 2008 x64, both are on identical hardware running RAID 5 (for > data redundancy/security), SAS drives 15k RPM, dual XEON Quad core CPUs, > 24 GB of RAM. RAID-5 = suckage for databases. Things to think about: How big is your data set and how big is your working set? Do you have a raid card? Is it properly configured? -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Why we don't want hints
On Thu, Feb 10, 2011 at 9:25 AM, Chris Browne wrote: > robertmh...@gmail.com (Robert Haas) writes: >> On Thu, Feb 10, 2011 at 11:45 AM, Kevin Grittner >> wrote: >>> Well, I'm comfortable digging in my heels against doing *lame* hints >>> just because "it's what all the other kids are doing," which I think >>> is the only thing which would have satisfied the OP on this thread. >>> From both on-list posts and ones exchanged off-list with me, it >>> seems he was stubbornly resistant to properly tuning the server to >>> see if any problems remained, or posting particular problems to see >>> how they would be most effectively handled in PostgreSQL. We >>> obviously can't be drawn into dumb approaches because of >>> ill-informed demands like that. >> >> Nor was I proposing any such thing. But that doesn't make "we don't >> want hints" an accurate statement. Despite the impression that OP >> went away with, the real situation is a lot more nuanced than that, >> and the statement on the Todo list gives the wrong impression, IMHO. > > I have added the following comment to the ToDo: > > We are not interested to implement hints in ways they are commonly > implemented on other databases, and proposals based on "because > they've got them" will not be welcomed. If you have an idea that > avoids the problems that have been observed with other hint systems, > that could lead to valuable discussion. > > That seems to me to characterize the nuance. Where exactly are the problems with other systems noted? Most other systems have this option so saying "They have problems" is a giant cop out. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] commit so slow program looks frozen
Merlin Moncure wrote:
> On 10/28/06, Simon Riggs <[EMAIL PROTECTED]> wrote:
>> On Thu, 2006-10-26 at 11:06 -0400, Merlin Moncure wrote:
>>> On 10/26/06, Carlo Stonebanks <[EMAIL PROTECTED]> wrote:
>>>> This is pretty interesting - where can I read more on this? Windows
>>>> isn't actually hanging, one single command line window is - from its
>>>> behaviour, it looks like the TCL postgresql package is waiting for
>>>> pg_exec to come back from the commit (I believe the commit has
>>>> actually gone through).
>>>>
>>>> It could even be that there's something wrong with the TCL package,
>>>> but from my understanding it is one of the most complete interfaces
>>>> out there - which is weird, because TCL seems to be the most
>>>> unpopular language in the community.
>>>
>>> when it happens, make sure to query pg_locks and see what is going on
>>> there. lock issues are not supposed to manifest on a commit, which
>>> releases locks, but you never know. There have been reports of
>>> inconsistent lock ups on windows (especially multi-processor) which
>>> you might be experiencing. Make sure you have the very latest version
>>> of pg 8.1.x. Also consider checking out 8.2 and see if you can
>>> reproduce the behavior there... this will require compiling postgresql.
>>
>> Merlin,
>>
>> Rumour has it you managed to get a BT from Windows. That sounds like it
>> would be very useful here.

Could it be there is a hangup in communication with the backend via the libpq
library?

I have a situation on Windows where psql seems to be hanging randomly AFTER
completing (or almost completing) a vacuum full analyze verbose.

I'm running the same databases on a single postgres instance on a Dell 4gb
RAM 2 processor xeon (hyper-threading turned off) running Debian GNU/Linux.
The windows system is an IBM 24gb RAM, 4 processor xeon (hyperthreading
turned off). No problems on the Dell, it runs pgbench faster than the windows
IBM system. The Dell Linux system zips through vacuumdb --all --analyze
--full --verbose with no problems. The windows machine is running 6 instances
of postgresql because of problems trying to load all of the databases into
one instance on windows.

The last output from psql is:

INFO:  free space map contains 474 pages in 163 relations
DETAIL:  A total of 2864 page slots are in use (including overhead).
2864 page slots are required to track all free space.
Current limits are:  42 page slots, 25000 relations, using 4154 KB.

(I've currently restarted postgresql with more reasonable fsm_page_slots and
fsm_relations.)

It appears that psql is hung in the call to WS2_32!select. The psql stack
trace looks like this:

ntdll!KiFastSystemCallRet
ntdll!NtWaitForSingleObject+0xc
mswsock!SockWaitForSingleObject+0x19d
mswsock!WSPSelect+0x380
WS2_32!select+0xb9
WARNING: Stack unwind information not available. Following frames may be wrong.
libpq!PQenv2encoding+0x1fb
libpq!PQenv2encoding+0x3a1
libpq!PQenv2encoding+0x408
libpq!PQgetResult+0x58
libpq!PQgetResult+0x188
psql+0x4c0f
psql+0x954d
psql+0x11e7
psql+0x1238
kernel32!IsProcessorFeaturePresent+0x9e

With more detail:

 # ChildEBP RetAddr  Args to Child
00 0022f768 7c822124 71b23a09 07a8 0001 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01 0022f76c 71b23a09 07a8 0001 0022f794 ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])
02 0022f7a8 71b23a52 07a8 0780 mswsock!SockWaitForSingleObject+0x19d (FPO: [Non-Fpo])
03 0022f898 71c0470c 0781 0022fc40 0022fb30 mswsock!WSPSelect+0x380 (FPO: [Non-Fpo])
04 0022f8e8 6310830b 0781 0022fc40 0022fb30 WS2_32!select+0xb9 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
05 0022fd68 631084b1 001d libpq!PQenv2encoding+0x1fb
06 0022fd88 63108518 0001 00614e70 libpq!PQenv2encoding+0x3a1
07 0022fda8 631060f8 0001 00614e70 libpq!PQenv2encoding+0x408
08 0022fdc8 63106228 00614e70 00613a71 00615188 libpq!PQgetResult+0x58
09 0022fde8 00404c0f 00614e70 00613a71 0041ac7a libpq!PQgetResult+0x188
0a 0022fe98 0040954d 00613a71 00423180 00423185 psql+0x4c0f
0b 0022ff78 004011e7 0006 00613b08 00612aa8 psql+0x954d
0c 0022ffb0 00401238 0001 0009 0022fff0 psql+0x11e7
0d 0022ffc0 77e523e5 7ffdc000 psql+0x1238
0e 0022fff0 00401220 78746341 kernel32!IsProcessorFeaturePresent+0x9e

The pg_locks table:

-[ RECORD 1 ]-+----------------
locktype      | relation
database      | 19553
relation      | 10342
page          |
tuple         |
transactionid |
classid       |
objid         |
objsubid      |
transaction   | 1998424
pid           | 576
mode          | AccessShareLock
granted       | t
-[ RECORD 2 ]-+----------------
locktype      | transactionid
database      |
relation      |
page          |
tuple         |
transactionid | 1998424
classid       |
objid         |
objsubid      |
transaction   | 1998424
pid           | 576
mode          | Exc
Re: [PERFORM] Quad processor options
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bjoern Metzdorf Sent: Tuesday, May 11, 2004 3:11 PM To: scott.marlowe Cc: [EMAIL PROTECTED]; Pgsql-Admin (E-mail) Subject: Re: [PERFORM] Quad processor options scott.marlowe wrote: >>Next drives I'll buy will certainly be 15k scsi drives. > > Better to buy more 10k drives than fewer 15k drives. Other than slightly > faster select times, the 15ks aren't really any faster. Good to know. I'll remember that. >>In peak times we can get up to 700-800 connections at the same time. >>There are quite some updates involved, without having exact numbers I'll >>think that we have about 70% selects and 30% updates/inserts. > > Wow, a lot of writes then. Yes, it certainly could also be only 15-20% updates/inserts, but this is also not negligible. > Sure, adaptec makes one, so does lsi megaraid. Dell resells both of > these, the PERC3DI and the PERC3DC are adaptec, then lsi in that order, I > believe. We run the lsi megaraid with 64 megs battery backed cache. The LSI sounds good. > Intel also makes one, but I've heard nothing about it. It could well be the ICP Vortex one, ICP was bought by Intel some time ago.. > I haven't directly tested anything but the adaptec and the lsi megaraid. > Here at work we've had massive issues trying to get the adaptec cards > configured and installed on, while the megaraid was a snap. Installed RH, > installed the dkms rpm, installed the dkms enabled megaraid driver and > rebooted. Literally, that's all it took. I didn't hear anything about dkms for debian, so I will be hand-patching as usual :) Regards, Bjoern - Personally I would stay away from anything intel over 2 processors. I have done some research and if memory serves it something like this. Intel's architecture makes each processor compete for bandwidth on the bus to the ram. Amd differs in that each proc has its own bus to the ram. Don't take this as god's honest fact but just keep it in mind when considering a Xeon solution, it may be worth your time to do some deeper research into this. There is some on this here http://www4.tomshardware.com/cpu/20030422/ Rob ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Configuring PostgreSQL to minimize impact of checkpoints
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of scott.marlowe Sent: Tuesday, May 11, 2004 2:23 PM To: Paul Tuckfield Cc: [EMAIL PROTECTED]; Matthew Nuzum; [EMAIL PROTECTED]; Rob Fielding Subject: Re: [PERFORM] Configuring PostgreSQL to minimize impact of checkpoints On Tue, 11 May 2004, Paul Tuckfield wrote: > If you are having a "write storm" or bursty writes that's burying > performance, a scsi raid controler with writeback cache will greatly > improve the situation, but I do believe they run around $1-2k. If > it's write specific problem, the cache matters more than the striping, > except to say that write specfic perf problems should avoid raid5 Actually, a single channel MegaRAID 320-1 (single channel ultra 320) is only $421 at http://www.siliconmechanics.com/c248/u320-scsi.php It works pretty well for me, having 6 months of a production server on one with zero hickups and very good performance. They have a dual channel intel card for only $503, but I'm not at all familiar with that card. The top of the line megaraid is the 320-4, which is only $1240, which ain't bad for a four channel RAID controller. Battery backed cache is an addon, but I think it's only about $80 or so. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly - If you don't mind slumming on ebay :-) keep an eye out for PERC III cards, they are dell branded LSI cards. Perc = Power Edge Raid Controller. There are models on there dual channel u320 and dell usually sells them with battery backed cache. That's how I have acquired all my high end raid cards. Rob ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Configuring PostgreSQL to minimize impact of checkpoints
Matthew Nuzum wrote: Jack Orenstein <[EMAIL PROTECTED]> writes: I'm looking at one case in which two successive transactions, each updating a handful of records, take 26 and 18 *seconds* (not msec) to complete. These transactions normally complete in under 30 msec. None of this is necessarily going to fix matters for an installation that has no spare I/O capacity, though. And from the numbers you're quoting I fear you may be in that category. "Buy faster disks" may be the only answer ... I had a computer once that had an out-of-the-box hard drive configuration that provided horrible disk performance. I found a tutorial at O'Reilly that explained how to use hdparm to dramatically speed up disk performance on Linux. I've noticed on other computers I've set up recently that hdparm seems to be used by default out of the box to give good performance. Maybe your computer is using all of it's I/O capacity because it's using PIO mode or some other non-optimal method of accessing the disk. There's certainly some scope there. I have an SGI Octane whos SCSI 2 disks were set-up by default with no write buffer and CTQ depth of zero :/ IDE drivers in Linux maybe not detecting your IDE chipset correctly and stepping down, however unlikely there maybe something odd going on but you could check hdparm out. Ensure correct cables too, and the aren't crushed or twisted too bad I digress... Assuming you're running with optimal schema and index design (ie you're not doing extra work unnecessarily), and your backend has better-then-default config options set-up (plenty of tips around here), then disk arrangement is critical to smoothing the ride. Taking things to a relative extreme, we implemented a set-up with issues similar sounding to yours. It was resolved by first optimising everything but hardware, then finally optimising hardware. This served us because it meant we squeezed as much out of the available hardware, before finally throwing more at it, getting us the best possible returns (plus further post optimisation on the new hardware). First tip would to take your pg_xlog and put it on another disk (and channel). Next if you're running a journalled fs, get that journal off onto another disk (and channel). Finally, get as many disks for the data store and spread the load across spindles. You're aiming here to distribute the contention and disk I/O more evenly to remove the congestion. sar and iostat help out as part of the analysis. You say you're using IDE, for which I'd highly recommend switching to SCSI and mutliple controllers because IDE isn't great for lots of other reasons. Obviously budgets count, and playing with SCSI certainly limits that. We took a total of 8 disks across 2 SCSI 160 channels and split up the drives into a number of software RAID arrays. RAID0 mirrors for the os, pg_xlog, data disk journal and swap and the rest became a RAID5 array for the data. You could instead implement your DATA disk as RAID1+0 if you wanted more perf at the cost of free space. Anyway, it's certainly not the fastest config out there, but it made all the difference to this particular application. Infact, we had so much free I/O we recently installed another app on there (based on mysql, sorry) which runs concurrently, and itself 4 times faster than it originally did... YMMV, just my 2p. -- Rob Fielding [EMAIL PROTECTED] www.dsvr.co.uk Development Designer Servers Ltd ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] OT: Help with performance problems
scott.marlowe wrote:
> On Fri, 23 Apr 2004, Chris Hoover wrote:
>> DB's on Powervaults 220S using raid 5 (over 6 disks)
>
> What controller is this, the adaptec? We've found it to be slower than
> the LSI megaraid based controller, but YMMV.

Wow, really? You got any more details of the chipset, mobo and kernel driver?

I've been taken to my wits' end wrestling with an LSI MegaRAID 320-1
controller on a supermicro board all weekend. I just couldn't get anything
more than 10MB/sec out of it with megaraid driver v1 OR v2 in Linux 2.4.26,
nor the version in 2.6.6-rc2.

After 2 days of humming the Adaptec mantra I gave in and switched the array
straight onto the onboard Adaptec 160 controller (same cable and everything).
Software RAID 5 gets me over 40MB/sec for a nominal cpu hit - more than 4
times what I could get out of the MegaRAID controller :( Even the 2nd SCSI-2
channel gets 40MB/sec max (pg_xlog :)

And HOW LONG does it take to detect drives during POST... oh, never mind ...
I really just wanna rant :)

There should be a free counseling service for enraged sysops.

--
Rob Fielding
[EMAIL PROTECTED]
www.dsvr.co.uk              Development            Designer Servers Ltd

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly
Re: [PERFORM] System overload / context switching / oom, 8.3
Andy Colson wrote:
>> work_mem = 32MB
>> maintenance_work_mem = 64MB
>
> if you have lots and lots of connections, you might need to cut these down?

Definitely, work_mem is the main focus. If I understand correctly, the 64MB
maintenance_work_mem is per vacuum task, and on this system there are 3
autovacuum workers. I was wondering whether, with this many databases, it
would make sense to decrease maintenance_work_mem significantly and start up
more autovacuum workers. Yes, we are also moving databases to other servers
in order to decrease the number of connections.

>> effective_cache_size = 5000MB
>
> I see you're running 32bit, but with bigmem support, but still, one process
> is limited to 4gig. You'd make better use of all that ram if you switched
> to 64bit. And this cache, I think, would be limited to 4gig.

All of the cache is being used because the operating system kernel is built
with the memory extensions to access outside the 32bit range. This is the
cache size reported by free(1). However, there may be advantages to switching
to 64bit.

> The oom-killer is kicking in, at some point, so somebody is using too much
> ram. There should be messages or logs or something, right? (I've never
> enabled the oom stuff so dont know much about it). But the log messages
> might be helpful. Also, do you know what the oom max memory usage is set
> to? You said: "oom_adj -17. vm_overcommit_memory set to 2, but at this
> time vm_overcommit_ratio was still at 50 (has since been changed to 90,
> should this be 100?)"

Oh man. I encourage everyone to find out what /proc/<pid>/oom_adj means. You
have to set this to keep the Linux "oom-killer" from doing a kill -9 on the
postgres postmaster. On Debian:

echo -17 >> /proc/$(cat /var/run/postgresql/8.3-main.pid)/oom_adj

This is my experience with oom-killer. After putting -17 into
/proc/<pid>/oom_adj, oom-killer seemed to kill one of the database connection
processes. Then the postmaster attempted to shut down all processes because
of possible shared memory corruption. The database then went into recovery
mode. After stopping the database some of the processes were stuck and could
not be killed. The operating system was rebooted and the database returned
with no data loss.

My earlier experience with oom-killer: if you don't have this setting in
oom_adj, then it seems likely (certain?) that oom-killer kills the postmaster
because of the algorithm oom-killer uses (called badness()) which adds child
process scores to their parent's scores. I don't know if sshd was killed but
I don't think anyone could log in to the OS. After rebooting there was a
segmentation violation when trying to start the postmaster. I don't think
that running pg_resetxlog with defaults is a good idea. My colleague who has
been investigating the crash believes that we could have probably eliminated
at least some of the data loss with more judicious use of pg_resetxlog.

There was a discussion on the postgres lists about somehow having the
postgres distribution include the functionality to set oom_adj on startup. To
my knowledge, that's not in 8.3, so I wrote a script and an init.d script to
do this on Debian systems.

As far as vm.overcommit_memory goes, there are three settings and most
recommend setting it to 2 for postgres. However, this does not turn off
oom-killer! You need to put -17 in /proc/<pid>/oom_adj whether you do
anything about vm.overcommit_memory or not. We had vm.overcommit_memory set
to 2 and oom-killer became active and killed the postmaster.

Kind of off-topic, but a Linux kernel parameter that's often not set on
database servers is elevator=deadline, which selects the io scheduling
algorithm. The algorithm can be viewed/set at runtime, for example for the
disk /dev/sdc in /sys/block/sdc/queue/scheduler.

Rob

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
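A small hedged sketch of the maintenance_work_mem idea above (values are illustrative; in 8.3 the worker count is autovacuum_max_workers in postgresql.conf, and each worker can use up to maintenance_work_mem):

  -- postgresql.conf (server-wide defaults):
  --   maintenance_work_mem = 32MB
  --   autovacuum_max_workers = 3

  -- a manual maintenance session can still claim more for itself:
  SET maintenance_work_mem = '256MB';
  VACUUM ANALYZE some_large_table;   -- table name is hypothetical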
[PERFORM] How to force Nested Loop plan?
I'm trying to understand how I can get the planner to always do the right thing with this query: EXPLAIN ANALYZE SELECT aa_t.min_date_time FROM aa_t , bb_t , cc_t WHERE bb_t.bb_id = aa_t.bb_id AND aa_t.realm_id = cc_t.realm_id AND aa_t.server_id = 21 ORDER BY aa_t.min_date_time desc LIMIT 1 OFFSET 674 ; There's a extreme elbow in the performance curve around the sum of LIMIT and OFFSET. The two plans follow. First for the query above: Limit (cost=21569.56..21601.56 rows=1 width=84) (actual time=59.60..59.69 rows=1 loops=1) -> Nested Loop (cost=0.00..110535.66 rows=3454 width=84) (actual time=0.19..59.20 rows=676 loops=1) -> Nested Loop (cost=0.00..93177.46 rows=3454 width=65) (actual time=0.14..44.41 rows=676 loops=1) -> Index Scan Backward using aa_t20 on aa_t (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1) Filter: (server_id = 21::numeric) -> Index Scan using cc_t1 on cc_t (cost=0.00..4.75 rows=1 width=19) (actual time=0.01..0.01 rows=1 loops=676) Index Cond: ("outer".realm_id = cc_t.realm_id) -> Index Scan using bb_t1 on bb_t (cost=0.00..5.01 rows=1 width=19) (actual time=0.02..0.02 rows=1 loops=676) Index Cond: (bb_t.bb_id = "outer".bb_id) Total runtime: 59.89 msec (10 rows) Setting OFFSET to 675 in the above query, results in this 100 times slower plan: Limit (cost=21614.48..21614.48 rows=1 width=84) (actual time=4762.39..4762.39 rows=1 loops=1) -> Sort (cost=21612.79..21621.42 rows=3454 width=84) (actual time=4761.45..4761.92 rows=677 loops=1) Sort Key: aa_t.min_date_time -> Merge Join (cost=21139.96..21409.80 rows=3454 width=84) (actual time=4399.80..4685.24 rows=41879 loops=1) Merge Cond: ("outer".bb_id = "inner".bb_id) -> Sort (cost=8079.83..8184.53 rows=41879 width=19) (actual time=936.99..967.37 rows=41879 loops=1) Sort Key: bb_t.bb_id -> Seq Scan on bb_t (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.06..729.60 rows=41879 loops=1) -> Sort (cost=13060.13..13068.76 rows=3454 width=65) (actual time=3462.76..3493.97 rows=41879 loops=1) Sort Key: aa_t.bb_id -> Merge Join (cost=12794.42..12857.14 rows=3454 width=65) (actual time=2923.62..3202.78 rows=41879 loops=1) Merge Cond: ("outer".realm_id = "inner".realm_id) -> Sort (cost=12762.78..12771.41 rows=3454 width=46) (actual time=2920.78..2950.87 rows=41879 loops=1) Sort Key: aa_t.realm_id -> Index Scan using aa_t5 on aa_t (cost=0.00..12559.79 rows=3454 width=46) (actual time=0.18..2589.22 rows=41879 loops=1) Index Cond: (server_id = 21::numeric) -> Sort (cost=31.64..32.78 rows=455 width=19) (actual time=2.54..33.12 rows=42163 loops=1) Sort Key: cc_t.realm_id -> Seq Scan on cc_t (cost=0.00..11.55 rows=455 width=19) (actual time=0.04..0.86 rows=455 loops=1) Total runtime: 4792.84 msec (20 rows) Twiddling effective_cache_size and random_page_cost allows for a large LIMIT+OFFSET number but not enough. These tests are made with 40 effective_cache_size and random_page_cost of 4. I can increase the LIMIT+OFFSET elbow to 1654 by changing the query thusly: < AND aa_t.server_id = 21 --- > AND aa_t.server_id IN (21, 0) The value 0 is an invalid server_id, so I know it won't be returned. However, I've got 41K rows that could be returned by this query and growing, and 1654 is obviously not enough. (aa is 690K rows, bb is 41K rows, and cc is 500 rows.) If I drop the ORDER BY, the query goes much faster, but the query is useless without the ORDER BY. I've figured out that the second plan is slow, because it is writing a huge result set to disk (+200MB). 
This doesn't make sense to me, since sort_mem is 32000. Is there a way to tell the optimizer to use Nested Loop plan always instead of the Merge/Join plan? Turning off enable_mergejoin is obviously not an option. Thanks, Rob ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
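For completeness, the merge join can be disabled for a single connection as a diagnostic, without touching the global config. This is only an experiment to compare plans, not a production fix, since (as noted above) turning enable_mergejoin off across the board isn't an option:

-- session-local experiment: steer the planner away from the merge join
SET enable_mergejoin = off;
EXPLAIN ANALYZE
SELECT aa_t.min_date_time
FROM aa_t, bb_t, cc_t
WHERE bb_t.bb_id = aa_t.bb_id
AND aa_t.realm_id = cc_t.realm_id
AND aa_t.server_id = 21
ORDER BY aa_t.min_date_time DESC
LIMIT 1 OFFSET 675;
RESET enable_mergejoin;   -- put the planner back to normal afterwards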
Re: [PERFORM] How to force Nested Loop plan?
Tom Lane writes:
> The reason the planner does not much like this plan is that it's
> estimating that quite a lot of rows will have to be hit in min_date_time
> order before it finds enough rows with server_id = 21. Thus the high
> cost estimate for the above step.

Thanks for the speedy and useful reply! More questions follow. :)

Very interesting. How does it know "quite a lot"? Is there something I can do to get the planner to analyze the data better?

> I suspect that the reason you like this plan is that there's actually
> substantial correlation between server_id and min_date_time, such that
> the required rows are found quickly. Before trying to force the planner
> into what you consider an optimal plan, you had better ask yourself
> whether you can expect that correlation to hold up in the future.
> If not, your plan could become pessimal pretty quickly.

The correlation holds. min_date_time increases over time as records are inserted. server_id is uniformly distributed over time. There's no randomness. There is at least one 21 record for every value of min_date_time. 21 is a special server_id containing aggregate (denormalized) data for the other servers. I thought about putting it in a separate table, but this would complicate the code as the data is identical to the non-aggregated case. Do you have any suggestions for organizing the data/query now that you know this?

> I'd suggest creating a double-column index:

Thanks. I'll try this. I'm a very big fan of declarative programming. However, there's a danger in declarative programming when the interpreter isn't smart enough. When I add this index, I will slow down inserts (about 20K/day) and increase data size (this is the second largest table in the database). Moreover, if the planner is improved, I should fix my code, delete the index, etc. Is there a way of giving the planner direct hints as in Oracle? They can be ignored when the optimizer is improved, just as "register" is ignored by C compilers nowadays.

Adding the extra index and ORDER BY is also not easy in our case. The query is dynamically generated. I can force the query ORDER BY to be whatever I want, but I would lose the ability to do interesting things, like the automatic generation of ORDER BY when someone clicks on a column header in the application. Indeed there are times when people want to sort on other columns in the query. I reduced the problem to the salient details for my post to this board. What if the ORDER BY was:

ORDER BY aa_t.server_id DESC, cc_t.name ASC

Would the planner do the right thing?

> PS: does server_id really need to be NUMERIC? Why not integer, or at
> worst bigint?

It is a NUMERIC(18). It could be a bigint. What would be the change in performance of this query if we changed it to bigint?

BTW, my customer is probably going to be switching to Oracle. This particular query has been one of the reasons. Maybe this change will help us stay with Postgres.

Thanks, Rob ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
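A minimal sketch of the double-column index being suggested, using the table and column names from the plans earlier in the thread (the index name here is made up). The ORDER BY has to mention both columns so the index can be walked backward:

CREATE INDEX aa_t_server_id_min_date_time ON aa_t (server_id, min_date_time);

EXPLAIN ANALYZE
SELECT aa_t.min_date_time
FROM aa_t, bb_t, cc_t
WHERE bb_t.bb_id = aa_t.bb_id
AND aa_t.realm_id = cc_t.realm_id
AND aa_t.server_id = 21
ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC
LIMIT 1 OFFSET 675;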
Re: [PERFORM] How to force Nested Loop plan?
attention to it before this. When we first set up Oracle, we got into all of its parameters pretty heavily. With Postgres, we just tried it and it worked. This is the first query where we ran out of ideas to try. BTW, everybody's help on this list is fantastic. Usually, I can find the answer to my question (and have been doing so for 3 years) on this list without asking. Thanks, Rob ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] How to force Nested Loop plan?
Ron Johnson writes:
> Dumb question: given your out-of-the-box satisfaction, could it be
> that postgresql.conf hasn't been tweaked?

Here are the modified values:

shared_buffers = 8000
wal_buffers = 80
sort_mem = 32000
effective_cache_size = 40
random_page_cost = 4
autocommit = false
timezone = UTC

I had run a test with effective_cache_size set to a high value to see what would happen. Also adjusted random_page_cost:

random_page_cost  effective_cache_size  elbow
4    4    675
.5   4    592
.1   4    392
4    100030

My conclusion is that random_page_cost should be left alone and a higher effective_cache_size is better.

BTW, the hardware is 2 x 2.4ghz Xeon, 1.2GB, SCSI (linux software raid) with 10K disks. This is close to the production box, although we are planning on adding more memory to production.

Rob ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
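For experiments like the elbow hunt above, both knobs can also be changed per session, which avoids editing postgresql.conf and restarting between runs. The values below are just placeholders taken from the table above; effective_cache_size is measured in 8k disk pages on this release:

SET random_page_cost = 0.5;          -- placeholder: one of the values from the table above
SET effective_cache_size = 50000;    -- placeholder value, in 8k disk pages
SHOW effective_cache_size;
-- then re-run the EXPLAIN ANALYZE from the first message and note where the elbow moves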
Re: [PERFORM] How to force Nested Loop plan?
Tom Lane writes: > Keep in mind though that you seem to be experimenting with a > fully-cached database; you may find that the planner's beliefs more > nearly approach reality when actual I/O has to occur. My hope is that the entire database should fit in memory. This may not be in the case right now with only 1GB, but it should be close. The pgsql/data/base/NNN directory is about 1.5GB on production. I'm pretty sure with constant vacuuming, we could keep that size down. A pgdump is about 60MB now, growing at about .5MB a day. > Another thing I'd be interested to know about is how closely the > physical order of the table entries correlates with min_date_time. Probably "pretty close". The primary key of aa_t is (bb_id, server_id), and bb_id is a sequence. aa_t is updated heavily on production, but these tests are on a fresh import so vacuuming and index order is not a factor. We do a reload every now and then to improve performance on production. min_date_time is highly correlated with bb_id, because both are increasing constantly. server_id is one of 16 values. > A high correlation reduces the actual cost of the indexscan (since > visiting the rows in index order becomes less of a random-access > proposition). We are aware that the planner doesn't model this effect > very well at present ... Oracle's optimizer is lacking here, too. The best optimizer I've seen was at Tandem, and even then hints were required. Are there plans for explicit hints to the planner? Thanks, Rob ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[PERFORM] reindex/vacuum locking/performance?
I've read some posts that says vacuum doesn't lock, but my experience today indicates the opposite. It seemed that "vacuum full analyze" was locked waiting and so were other postmaster processes. It appeared to be deadlock, because all were in "WAITING" state according to ps. I let this go for about a 1/2 hour, and then killed the vacuum at which point all other processes completed normally. The same thing seemed to be happening with reindex on a table. It seems that the reindex locks the table and some other resource which then causes deadlock with other active processes. Another issue seems to be performance. A reindex on some indexes is taking 12 minutes or so. Vacuum seems to be slow, too. Way longer than the time it takes to reimport the entire database (30 mins). In summary, I suspect that it is better from a UI perspective to bring down the app on Sat at 3 a.m and reimport with a fixed time period than to live through reindexing/vacuuming which may deadlock. Am I missing something? Thanks, Rob ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
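One workaround sketch for the reindex half of this, assuming the bloated index is an ordinary (non-constraint) index: build a replacement index under a new name and then drop the old one. CREATE INDEX takes a lock that blocks writers but still allows reads, whereas REINDEX holds an exclusive lock for its whole run. The index and table names here are hypothetical:

CREATE INDEX i_orders_date2 ON orders (order_date);
DROP INDEX i_orders_date;   -- brief exclusive lock, much shorter than a full REINDEX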
Re: [PERFORM] reindex/vacuum locking/performance?
> vacuum full does require exclusive lock, plain vacuum does not.

I think I need full, because there are updates on the table. As I understand it, an update in pg is an insert/delete, so it needs to be garbage collected.

> It's considerably more likely that the vacuum was waiting for an open
> client transaction (that had a read or write lock on some table) to
> finish than that there was an undetected deadlock. I suggest looking at
> your client code. Also, in 7.3 or later you could look at the pg_locks
> view to work out exactly who has the lock that's blocking vacuum.

My client code does a lot. I look at it more often than I'd like to. :-) I don't understand why the client transaction would block if vacuum was waiting. Does vacuum lock the table and then try to get some other "open transaction" resource? Free space? I guess I don't understand what other resources would be required of vacuum. The client transactions are short (< 1s). They don't deadlock normally; only with reindex and vacuum did I see this behavior.

> vacuum full is indeed slow. That's why we do not recommend it as a
> routine maintenance procedure. The better approach is to do plain
> vacuums often enough that you don't need vacuum full.

The description of vacuum full implies that it is required if the db is updated frequently. This db gets about 1 txn a second, possibly more at peak load.

> In pre-7.4
> releases you might need periodic reindexes too, depending on whether
> your usage patterns tickle the index-bloat problem.

7.3, and yes, we have date indexes as well as sequences for primary keys.

> But it is easily
> demonstrable that reindexing is cheaper than rebuilding the database.

IOW, vacuum+reindex is faster than dump+restore? I didn't see this, but then again, I had this locking problem, so the stats are distorted.

One other question: the reindex seems to lock the table for the entire process as opposed to freeing the lock between index rebuilds. It was hard to see, but it seemed like the clients were locked for the entire "reindex table bla" command.

Sorry for the lack of detail, but I didn't expect these issues so I wasn't keeping track of the system state as closely as I should have. Next time. :-)

Thanks, Rob ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
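A rough sketch of the pg_locks query Tom mentions, for the next time this comes up (column names are the 7.3 ones). Rows with granted = f are stuck waiting, and the pid holding a granted lock on the same relation is the blocker:

SELECT l.pid, c.relname, l.mode, l.granted
FROM pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
ORDER BY c.relname, l.granted;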
Re: [PERFORM] Speeding up Aggregates
Greg Stark writes: > Call it a wishlist bug. The problem is it would be a hard feature to > implement properly. And none of the people paid to work on postgres > by various companies seem to have this on their to-do lists. So > don't expect it in the near future. We are using Postgres heavily, and we should be able to find some time and/or funding to help. We're becoming more and more frustrated with the discontinuous behaviour of the planner. It seems every complex query we have these days needs some "hint" like "ORDER BY foo DESC LIMIT 1" to make it run on the order of seconds, not minutes. We usually figure out a way to write the query so the planner does the right thing, and pushes the discontinuity out far enough that the user doesn't see it. However, it takes a lot of work, and it seems to me that work would be put to better use improving the planner than improving our knowledge of how to get the planner to do the right thing by coding the SQL in some unusual way. Please allow me to go out on a limb here. I know that Tom is philosophically opposed to planner hints. However, we do have a problem that the planner is never going to be smart enough. This leaves the SQL coder the only option of collecting a bag of (non-portable) SQL tricks. I saw what the best SQL coders did at Tandem, and frankly, it's scary. Being at Tandem, the SQL coders also had the option (if they could argue their case strong enough) of adding a new rule to the optimizer. This doesn't solve the problem for outsiders no matter how good they are at SQL. Would it be possible to extend the planner with a pattern matching language? It would formalize what it is doing already, and would allow outsiders to teach the planner about idiosyncrasies without changing the SQL. It would be like a style sheet in Latex (or Scribe :-) if you are familiar with these typesetting languages. Comments? Rob ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
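For anyone reading along, here is the classic instance of the kind of "ORDER BY foo DESC LIMIT 1" rewrite being described; the table and column names are made up. In the 7.x releases under discussion, max() can't use an index, while the rewritten form can walk an index backward and stop after one row:

-- scans the whole table on these releases:
SELECT max(created_at) FROM hits;
-- equivalent, but can use an index on created_at:
SELECT created_at FROM hits ORDER BY created_at DESC LIMIT 1;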
[PERFORM] vacuum locking
It seems a simple "vacuum" (not full or analyze) slows down the database dramatically. I am running vacuum every 15 minutes, but it takes about 5 minutes to run even after a fresh import. Even with vacuuming every 15 minutes, I'm not sure vacuuming is working properly. There are a lot of updates. The slowest relation is the primary key index, which is composed of a sequence. I've appended a csv with the parsed output from vacuum. The page counts are growing way too fast imo. I believe this is caused by the updates, and index pages not getting re-used. The index values aren't changing, but other values in the table are. Any suggestions how to make vacuuming more effective and reducing the time it takes to vacuum? I'd settle for less frequent vacuuming or perhaps index rebuilding. The database can be re-imported in about an hour. Rob Spacing every 15 minutes Pages,Tuples,Deleted 7974,1029258,1536 7979,1025951,4336 7979,1026129,52 7979,1025618,686 7979,1025520,152 7980,1025583,28 7995,1028008,6 8004,1030016,14 8010,1026149,4965 8012,1026684,6 8014,1025910,960 8020,1026812,114 8027,1027642,50 8031,1027913,362 8040,1028368,784 8046,1028454,1143 8049,1029155,6 8053,1029980,10 8065,1031506,24 8084,1029134,4804 8098,1031004,346 8103,1029412,3044 8118,1029736,1872 8141,1031643,1704 8150,1032597,286 8152,1033222,6 8159,1029436,4845 8165,1029987,712 8170,1030229,268 8176,1029568,1632 8189,1030136,1540 8218,1030915,3963 8255,1033049,4598 8297,1036583,3866 8308,1031412,8640 8315,1031987,1058 8325,1033892,6 8334,1030589,4625 8350,1031709,1040 8400,1033071,5946 8426,1031555,8368 8434,1031638,2240 8436,1031703,872 8442,1031891,612 ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] vacuum locking
Shridhar Daithankar writes:
> You should try 7.4 beta and pg_autovacuum which is a contrib module
> in CVS tip.

It's on our todo list. :) How does pg_autovacuum differ from vacuumdb? I mean it seems to call the vacuum operation underneath just as vacuumdb does. I obviously didn't follow the logic as to how it gets there. :-)

> Make sure that you have FSM properly tuned. Bump it from defaults to
> suit your needs. I hope you have gone thr. this page for general
> purpose setting.

I didn't start vacuuming regularly until recently, so I didn't see this problem.

> Assuming those were incremental figures, largest you have is ~8000
> tuples per 15 minutes and 26 pages. I think with proper FSM/shared
> buffers/effective cache and a pg_autovacuum with 1 min. polling
> interval, you could end up in lot better shape.

Here are the numbers that are different. I'm using 7.3:

shared_buffers = 8000
sort_mem = 8000
vacuum_mem = 64000
effective_cache_size = 4

free says:

             total       used       free     shared    buffers     cached
Mem:       1030676    1005500      25176          0      85020     382280
-/+ buffers/cache:     538200     492476
Swap:      2096472     272820    1823652

It seems effective_cache_size is about right. vacuum_mem might be slowing down the system? But if I reduce it, won't vacuuming get slower? max_fsm_relations is probably too low (the default in my conf file says 100, probably needs to be 1000). Not sure how this affects disk usage.

Here's the summary for the two active tables during a vacuum interval with high activity. The other tables don't get much activity, and are much smaller. As you see, the 261 + 65 adds up to the bulk of the 5 minutes it takes to vacuum.

INFO: Removed 8368 tuples in 427 pages. CPU 0.06s/0.04u sec elapsed 1.54 sec.
INFO: Pages 24675: Changed 195, Empty 0; Tup 1031519: Vac 8368, Keep 254, UnUsed 1739. Total CPU 2.92s/2.58u sec elapsed 65.35 sec.
INFO: Removed 232 tuples in 108 pages. CPU 0.01s/0.02u sec elapsed 0.27 sec.
INFO: Pages 74836: Changed 157, Empty 0; Tup 4716475: Vac 232, Keep 11, UnUsed 641. Total CPU 10.19s/6.03u sec elapsed 261.44 sec.

How would vacuuming every minute finish in time? It isn't changing much in the second table, but it's taking 261 seconds to wade through 5m rows.

Assuming I vacuum every 15 minutes, it would seem like max_fsm_pages should be 1000, because that's about what was reclaimed. The default is 1. Do I need to change this? Sorry to be so dense, but I just don't know what the right values are.

Thanks muchly for the advice, Rob ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] vacuum locking
Manfred Koizar writes: > ISTM you are VACCUMing too aggressively. You are reclaiming less than > 1% and 0.005%, respectively, of tuples. I would increase FSM settings > to ca. 1000 fsm_relations, 10 fsm_pages and VACUUM *less* often, > say every two hours or so. I did this. We'll see how it goes. > ... or configure autovacuum to VACUUM a table when it has 10% dead > tuples. This solution doesn't really fix the fact that VACUUM consumes the disk while it is running. I want to avoid the erratic performance on my web server when VACUUM is running. mfg, Rob ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] vacuum locking
Josh Berkus writes: > Yes, but it will have less of an impact on the system while it's running. We'll find out. I lowered it to vacuum_mem to 32000. > What sort of disk array do you have? That seems like a lot of time > considering how little work VACUUM is doing. Vendor: DELL Model: PERCRAID Mirror Rev: V1.0 Type: Direct-AccessANSI SCSI revision: 02 Two 10K disks attached. Rob ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] vacuum locking
Vivek Khera writes: > AMI or Adaptec based? Adaptec, I think. AIC-7899 LVD SCSI is what dmidecode says, and Red Hat/Adaptec aacraid driver, Aug 18 2003 is what comes up when it boots. I haven't be able to use the aac utilities with this driver, however, so it's hard to interrogate the device. > If AMI, make sure it has write-back cache enabled (and you have > battery backup!), and disable the 'readahead' feature if you can. I can't do this so easily. It's at a colo, and it's production. I doubt this has anything to do with this problem, anyway. We're talking about hundreds of megabytes of data. > What's the disk utilization proir to running vacuum? If it is > hovering around 95% or more of capacity, of course you're gonna > overwhelm it. Here's the vmstat 5 at a random time: procs memoryswap io system cpu r b w swpd free buff cache si sobibo incs us sy id 0 0 0 272372 38416 78220 375048 0 3 2 00 0 2 2 0 0 0 0 272372 3 78320 375660 0 034 274 382 284 5 1 94 0 1 0 272372 23012 78372 375924 0 025 558 445 488 8 2 90 1 0 0 272368 22744 78472 376192 0 6 125 594 364 664 9 3 88 And here's it during vacuum: procs memoryswap io system cpu r b w swpd free buff cache si sobibo incs us sy id 1 2 1 277292 9620 72028 409664 46 32 4934 4812 1697 966 8 4 88 0 3 0 277272 9588 72096 412964 61 0 7303 2478 1391 976 3 3 94 2 2 0 277336 9644 72136 393264 1326 32 2827 2954 1693 1519 8 3 89 The pages are growing proportionately with the number of tuples, btw. Here's a vacuum snippet from a few days ago after a clean import, running every 15 minutes: INFO: Removed 2192 tuples in 275 pages. CPU 0.06s/0.01u sec elapsed 0.91 sec. INFO: Pages 24458: Changed 260, Empty 0; Tup 1029223: Vac 2192, Keep 3876, UnUsed 26. Total CPU 2.91s/2.22u sec elapsed 65.74 sec. And here's the latest today, running every 2 hours: INFO: Removed 28740 tuples in 1548 pages. CPU 0.08s/0.06u sec elapsed 3.73 sec. INFO: Pages 27277: Changed 367, Empty 0; Tup 1114178: Vac 28740, Keep 1502, UnUsed 10631. Total CPU 4.78s/4.09u sec elapsed 258.10 sec. The big tables/indexes are taking longer, but it's a big CPU/elapsed time savings to vacuum every two hours vs every 15 minutes. There's still the problem that when vacuum is running interactive performance drops dramatically. A query that takes a couple of seconds to run when the db isn't being vacuumed will take minutes when vacuum is running. It's tough for me to correlate exactly, but I suspect that while postgres is vacuuming an index or table, nothing else runs. In between relations, other stuff gets to run, and then vacuum hogs all the resources again. This could be for disk reasons or simply because postgres locks the index or table while it is being vacuumed. Either way, the behavior is unacceptable. Users shouldn't have to wait minutes while the database picks up after itself. The concept of vacuuming seems to be problematic. I'm not sure why the database simply can't garbage collect incrementally. AGC is very tricky, especially AGC that involves gigabytes of data on disk. Incremental garbage collection seems to be what other databases do, and it's been my experience that other databases don't have the type of unpredictable behavior I'm seeing with Postgres. I'd rather the database be a little bit slower on average than have to figure out the best time to inconvenience my users. Since my customer already has Oracle, we'll be running tests in the coming month(s :-) with Oracle to see how it performs under the same load and hardware. I'll keep this group posted. 
Rob ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] vacuum locking
Tom Lane writes:
> ... if all tuples are the same size, and if you never have any

Incorrect. If the tuples are smaller, Oracle does the right thing. If there's enough space in the page, it shifts the tuples to make room. That's what pctfree, pctused and pctincrease allow you to control. It's all in memory so it's fast, and I don't think it has to update any indices.

> transactions that touch enough tuples to overflow your undo segment

That's easily configured, and hasn't been a problem in the databases I've managed.

> (or even just sit there for a long time, preventing you from recycling

That's probably bad software or a batch system--which is tuned differently. Any OLTP system has to be able to partition its problems to keep transactions short and small. If it doesn't, it will not be usable.

> undo-log space; this is the dual of the VACUUM-can't-reclaim-dead-tuple
> problem). And a few other problems that any Oracle DBA can tell you
> about. I prefer our system.

Oracle seems to make the assumption that data changes, which is why it manages free space within each page as well as within free lists. The database will be bigger but you get much better performance on DML. It is very good at caching so reads are fast.

Postgres seems to make the assumption that updates and deletes are rare. A delete/insert policy for updates means that a highly indexed table requires lots of disk I/O when the update happens and the concomitant garbage collection when vacuum runs. But then MVCC makes the assumption that there's lots of DML. I don't understand the philosophical split here. I guess I don't understand what application profiles/statistics make you prefer Postgres' approach over Oracle's.

> The increased I/O activity is certainly to be expected, but what I find
> striking here is that you've got substantial swap activity in the second
> trace. What is causing that? Not VACUUM I don't think. It doesn't have
> any huge memory demand. But swapping out processes could account for
> the perceived slowdown in interactive response.

The box is a bit memory starved, and we'll be addressing that shortly. I don't think it accounts for 3 minute queries, but perhaps it might. vacuum_mem is 32mb, btw.

Rob ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[PERFORM] Use of multipart index with "IN"
I have a reporting system that does regular queries on a table with a multipart index. I am running version 7.3.4. Here is the table definition: Table "public.ds_rec_fld" Column | Type | Modifiers ---+-+--- dsid | character varying(20) | not null recid | integer | not null field_name| character varying(20) | not null option_tag| character varying(10) | not null option_value | integer | field_text| character varying(2000) | field_type_cd | character varying(8)| Indexes: ds_rf_ndx1 btree (recid, field_name, option_value) Normally queries are done using recid and field_name, so Postgresql returns rows very quickly as expected. Here is a sample explain analyze output for a typical query: db=> explain analyze db-> select field_name, option_tag from ds_rec_fld where recid = 3000 and field_name = 'Q3A1'; QUERY PLAN - Index Scan using ds_rf_ndx1 on ds_rec_fld (cost=0.00..163.09 rows=40 width=38) (actual time=0.06..0.07 rows=1 loops=1) Index Cond: ((recid = 3001) AND (field_name = 'Q3A1'::character varying)) Total runtime: 0.12 msec (3 rows) The problem comes in when we are selecting multiple field_name values in one query. The normal SQL syntax we have been using is like this: select field_name, option_tag from ds_rec_fld where recid = 3001 and field_name in ('Q3A1', 'Q3A9'); This is just a simplified example, at times there can be a lot of field_name values in one query in the "in" clause. Here postgresql refuses to use the full index, instead doing a filter based on part of the first recid part of index. Here is the explain analyze output: Index Scan using ds_rf_ndx1 on ds_rec_fld (cost=0.00..30425.51 rows=80 width=38) (actual time=0.18..1.08 rows=2 loops=1) Index Cond: (recid = 3001) Filter: ((field_name = 'Q3A1'::character varying) OR (field_name = 'Q3A9'::character varying)) Total runtime: 1.12 msec (4 rows) So, 10 times longer. This is an issue because at times we are iterating through thousands of recid values. I did a vacuum analyze, adjusted random_page_cost, etc. all to no avail. I also noticed that the problem goes away when I reformat the query like this: select field_name, option_tag from ds_rec_fld where (recid = 3001 and field_name = 'Q3A1') or (recid = 3001 and field_name = 'Q3A9') Here is the explain analyze output for this: Index Scan using ds_rf_ndx1, ds_rf_ndx1 on ds_rec_fld (cost=0.00..326.57 rows=80 width=38) (actual time=0.07..0.10 rows=2 loops=1) Index Cond: (((recid = 3001) AND (field_name = 'Q3A1'::character varying)) OR ((recid = 3001) AND (field_name = 'Q3A9'::character varying))) Total runtime: 0.16 msec (3 rows) Much better. So I have partially solved my own problem, but there are other places that this is not this simple to fix. Therefore, my question is, is there some way to force postgresql to use the full index and still stick with the shorter "field_name in ('...', '...')" syntax? If anyone has any thoughts please let me know. Also it strikes me that perhaps the optimizer could be tweaked to treat the first case like the second one. Thanks in advance, Rob __ Do you Yahoo!? The New Yahoo! Shopping - with improved product search http://shopping.yahoo.com ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
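One more variation that might be worth an EXPLAIN ANALYZE, sketched with the table and column names from the message above: put the wanted field names in a small temp table and join against it instead of using the IN list. Whether 7.3's planner turns this into repeated probes of ds_rf_ndx1 is something only a test would show:

CREATE TEMP TABLE wanted_fields (field_name varchar(20));
INSERT INTO wanted_fields VALUES ('Q3A1');
INSERT INTO wanted_fields VALUES ('Q3A9');

SELECT f.field_name, f.option_tag
FROM ds_rec_fld f
JOIN wanted_fields w ON w.field_name = f.field_name
WHERE f.recid = 3001;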
Re: [PERFORM] vacuum locking
Stephen writes: > I ran into the same problem with VACUUM on my Linux box. If you are running > Linux, take a look at "elvtune" or read this post: The default values were -r 64 -w 8192. The article said this was "optimal". I just futzed with different values anywere from -w 128 -r 128 to -r 16 -w 8192. None of these mattered much when vacuum is running. This is a RAID1 box with two disks. Even with vacuum and one other postmaster running, it's still got to get a lot of blocks through the I/O system. Rob ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] vacuum locking
Mario Weilguni writes: > of course both approaches have advantages, it simply depends on the usage > pattern. A case where oracle really rules over postgresql are m<-->n > connection tables where each record consist of two foreign keys, the > overwrite approach is a big win here. That's usually our case. My company almost always has "groupware" problems to solve. Every record has a "realm" (security) foreign key and typically another key. The infrastructure puts the security key on queries to avoid returning the wrong realm's data. Rob ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] vacuum locking
Vivek Khera writes:
> Also, how close are you to the capacity of your disk bandwidth? I
> don't see that in your numbers. I know in freebsd I can run "systat
> -vmstat" and it gives me a percentage of utilization that lets me know
> when I'm near the capacity.

The vacuum totally consumes the system. It's in a constant "D". As near as I can tell, it's hitting all blocks in the database. The problem is interactive performance when vacuum is in a D state. Even with just two processes doing "stuff" (vacuum and a select, let's say), the select is very slow. My understanding of the problem is that if a query hits the disk hard (and many of my queries do) and vacuum is hitting the disk hard, they contend for the same resource and nobody wins.

The query optimizer has lots of problems with my queries and ends up doing silly sorts. As a simple example, one query goes like this:

select avg(f1) from t1 group by f2;

This results in a plan like:

Aggregate (cost=171672.95..180304.41 rows=115086 width=32)
  -> Group (cost=171672.95..177427.26 rows=1150862 width=32)
       -> Sort (cost=171672.95..174550.10 rows=1150862 width=32)
            Sort Key: f2
            -> Seq Scan on t1 (cost=0.00..39773.62 rows=1150862 width=32)

This is of course stupid, because it sorts 1M rows, which probably means it has to hit disk (sort_mem can only be so large). Turns out there are only about 20 different values of f2, so it would be much better to aggregate without sorting. This is the type of query which runs while vacuum runs, and I'm sure the two are just plain incompatible. vacuum is read intensive and this query is write intensive.

Rob ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] vacuum locking
Greg Stark writes: > Note that pctfree/pctused are a big performance drain on the usual case. Try > setting them to 0/100 on a table that doesn't get updates (like a many-many > relation table) and see how much faster it is to insert and scan. Right. You can optimize each table independently. The "usual" case doesn't exist in most databases, I've found, which is why Oracle does better. > Judging by the number of FAQ lists out there that explain various quirks of > rollback segment configuration I wouldn't say it's so easily configured. Maybe we just got lucky. :-) > The biggest problem is on systems where there's a combination of both users. As is ours. > You need tremendous rollback segments to deal with the huge volume of oltp > transactions that can occur during a single DSS query. And the DSS query > performance is terrible as it has to check the rollback segments for a large > portion of the blocks it reads. The DSS issues only come into play I think if the queries are long. This is our problem. Postgres does a bad job with DSS, I believe. I mentioned the select avg(f1) from t1 group by f2 in another message. If it were optimized for "standard" SQL, such as, avg, sum, etc., I think it would do a lot better with DSS-type problems. Our problem seems to be that the DSS queries almost always hit disk to sort. > Arguably it's the other way around. Postgres's approach wins whenever most of > the tuples in a table have been updated, in that case it just has to scan the > whole table ignoring old records not visible to the transaction. Oracle has to > consult the rollback segment for any recently updated tuple. Oracle's wins in > the case where most of the tuples haven't changed so it can just scan the > table without consulting lots of rollback segments. I see what you're saying. I'm not a db expert, just a programmer trying to make his queries go faster, so I'll acknowledge that the design is theoretically better. In practice, I'm still stuck. As a simple example, this query select avg(f1) from t1 group by f2 Takes 33 seconds (see explain analyze in another note in this thread) to run on idle hardware with about 1GB available in the cache. It's clearly hitting disk to do the sort. Being a dumb programmer, I changed the query to: select f1 from t1; And wrote the rest in Perl. It takes 4 seconds to run. Why? The Perl doesn't sort to disk, it aggregates in memory. There are 18 rows returned. 
What I didn't mention is that I originally had: select avg(f1), t2.name from t1, t2 where t2.f2 = t1.f2 group by t2.name; Which is much worse: Aggregate (cost=161046.30..162130.42 rows=8673 width=222) (actual time=72069.10..87455.69 rows=18 loops=1) -> Group (cost=161046.30..161479.95 rows=86729 width=222) (actual time=71066.38..78108.17 rows=963660 loops=1) -> Sort (cost=161046.30..161263.13 rows=86729 width=222) (actual time=71066.36..72445.74 rows=963660 loops=1) Sort Key: t2.name -> Merge Join (cost=148030.15..153932.66 rows=86729 width=222) (actual time=19850.52..27266.40 rows=963660 loops=1) Merge Cond: ("outer".f2 = "inner".f2) -> Sort (cost=148028.59..150437.74 rows=963660 width=58) (actual time=19850.18..21750.12 rows=963660 loops=1) Sort Key: t1.f2 -> Seq Scan on t1 (cost=0.00..32479.60 rows=963660 width=58) (actual time=0.06...39 rows=963660 loops=1) -> Sort (cost=1.56..1.60 rows=18 width=164) (actual time=0.30..737.59 rows=931007 loops=1) Sort Key: t2.f2 -> Seq Scan on t2 (cost=0.00..1.18 rows=18 width=164) (actual time=0.05..0.08 rows=18 loops=1) Total runtime: 87550.31 msec Again, there are about 18 values of f2. The optimizer even knows this (it's a foreign key to t2.f2), but instead it does the query plan in exactly the wrong order. It hits disk probably 3 times as much as the simpler query judging by the amount of time this query takes (33 vs 88 secs). BTW, adding an index to t1.f2 has seriously negative effects on many other DSS queries. I'm still not sure that the sort problem is our only problem when vacuum runs. It's tough to pin down. We'll be adding more memory to see if that helps with the disk contention. Rob ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] vacuum locking
Greg Stark writes: > Sorry I was unclear. By "usual case" I meant reading, as opposed to updates. > The size of the on-disk representation turns out to be a major determinant in > a lot of database applications, since the dominant resource is i/o bandwidth. > Try doing a fresh import of a large table with pctfree 0 pctuse 100 and > compare how long a select takes on it compared to the original table. BTW, I greatly appreciate your support on this stuff. This list is a fantastic resource. I think we agree. The question is what is the workload. On tables without updates, postgres will be fast enough. However, postgres is slow on tables with updates afaict. I think of OLTP as a system with updates. One can do DSS on an OLTP database with Oracle, at least it seems to work for one of our projects. > FIrstly, that type of query will be faster in 7.4 due to implementing a new > method for doing groups called hash aggregates. We'll be trying it as soon as it is out. > Secondly you could try raising sort_mem. Postgres can't know how much memory > it really has before it swaps, so there's a parameter to tell it. And swapping > would be much worse than doing disk sorts. It is at 8000. This is probably as high as I can go with multiple postmasters. The sort area is shared in Oracle (I think :-) in the UGA. > You can raise sort_mem to tell it how much memory it's allowed to > use before it goes to disk sorts. You can even use ALTER SESSION to > raise it in a few DSS sessions but leave it low the many OLTP > sessions. If it's high in OLTP sessions then you could quickly hit > swap when they all happen to decide to use the maximum amount at the > same time. But then you don't want to be doing big sorts in OLTP > sessions anyways. This is a web app. I can't control what the user wants to do. Sometimes they update data, and other times they simply look at it. I didn't find ALTER SESSION for postgres (isn't that Oracle?), so I set sort_mem in the conf file to 512000, restarted postrgres. Reran the simpler query (no name) 3 times, and it was still 27 secs. > Unfortunately there's no way to tell how much memory it thinks it's > going to use. I used to use a script to monitor the pgsql_tmp > directory in the database to watch for usage. I don't have to. The queries that run slow are hitting disk. Anything that takes a minute has to be writing to disk. > Well, first of all it doesn't really because you said to group by t2.name not > f1. You might expect it to at least optimize something like this: I put f2 in the group by, and it doesn't matter. That's the point. It's the on-disk sort before the aggregate that's killing the query. > but even then I don't think it actually is capable of using foreign keys as a > hint like that. I don't think Oracle does either actually, but I'm not sure. I'll be finding out this week. > To convince it to do the right thing you would have to do either: > > SELECT a, t2.name > FROM (SELECT avg(f1),f2 FROM t1 GROUP BY f2) AS t1 > JOIN t2 USING (f2) > > Or use a subquery: > > SELECT a, (SELECT name FROM t2 WHERE t2.f2 = t1.f2) > FROM t1 > GROUP BY f2 This doesn't solve the problem. It's the GROUP BY that is doing the wrong thing. It's grouping, then aggregating. Rob ---(end of broadcast)--- TIP 8: explain analyze is your friend
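For the record, the per-session equivalent of Oracle's ALTER SESSION here is plain SET, so no conf edit or restart is needed; the value is just the one already tried above:

SET sort_mem = 512000;   -- in kB, lasts only for this connection
SHOW sort_mem;
RESET sort_mem;          -- back to the postgresql.conf value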
Re: [PERFORM] vacuum locking
Greg Stark writes: > I don't understand why you would expect overwriting to win here. > What types of updates do you do on these tables? These are statistics that we're adjusting. I think that's pretty normal stuff. The DSS component is the avg() of these numbers on particular groups. The groups are related to foreign keys to customers and other things. > Normally I found using update on such a table was too awkward to > contemplate so I just delete all the relation records that I'm > replacing for the key I'm working with and insert new ones. This > always works out to be cleaner code. In fact I usually leave such > tables with no UPDATE grants on them. In accounting apps, we do this, too. It's awkward with all the relationships to update all the records in the right order. But Oracle wins on delete/insert, too, because it reuses the tuples it already has in memory, and it can reuse the same foreign key index pages, too, since the values are usually the same. The difference between Oracle and postgres seems to be optimism. postgres assumes the transaction will fail and/or that a transaction will modify lots of data that is used by other queries going on in parallel. Oracle assumes that the transaction is going to be committed, and it might as well make the changes in place. > In that situation I would have actually expected Postgres to do as well as or > better than Oracle since that makes them both functionally > equivalent. I'll find out soon enough. :-) Rob ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] vacuum locking
Greg Stark writes:
> > > SELECT a, (SELECT name FROM t2 WHERE t2.f2 = t1.f2)
> > > FROM t1
> > > GROUP BY f2
> >
> > This doesn't solve the problem. It's the GROUP BY that is doing the
> > wrong thing. It's grouping, then aggregating.
>
> But at least in the form above it will consider using an index on f2, and it
> will consider using indexes on t1 and t2 to do the join.

There are 20 rows in t2, so an index actually slows down the join. I had to drop the index on t1.f2, because it was trying to use it instead of simply sorting 20 rows.

I've got preliminary results for a number of "hard" queries between oracle and postgres (seconds):

 PG  ORA
  0    5  q1
  1    0  q2
  0    5  q3
  2    1  q4
219    7  q5
217    5  q6
 79    2  q7
 31    1  q8

These are averages of 10 runs of each query. I didn't optimize pctfree, etc., but I did run analyze after the oracle import. One of the reasons postgres is faster on q1-4 is that postgres supports OFFSET/LIMIT, and oracle doesn't. q7 and q8 are the queries that I've referred to recently (avg of group by). q5 and q6 are too complex to discuss here, but the fundamental issue is the order in which postgres decides to do things.

The choice for me is clear: the developer time trying to figure out how to make the planner do the "obviously right thing" has been too high with postgres. These tests demonstrate to me that for even complex queries, oracle wins for our problem. It looks like we'll be migrating to oracle for this project from these preliminary results. It's not just the planner problems. The customer is more familiar with oracle, and the vacuum performance is another problem.

Rob ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] vacuum locking
Josh Berkus writes: > I hope that you'll stay current with PostgreSQL developments so that you can > do a similarly thourough evaluation for your next project. Oh, no worries. This project just happens to be a tough one. We're heavily invested in Postgres. Other projects we maintain that use Postgres are zoescore.com, colosla.org, and paintedsnapshot.com. I am currently working on a very large project where the customer is very committed to Postgres/open source. We're in discussions about what to do about the scalability problems we saw in the other project. You can help by addressing a dilema we (my new customer and I) see. I apologize for the length of what follows, but I'm trying to be as clear as possible about our situation. I have had a lot push back from the core Postgres folks on the idea of planner hints, which would go a long way to solve the performance problems we are seeing. I presented an alternative approach: have a "style sheet" (Scribe, LaTex) type of solution in the postmaster, which can be customized by end users. That got no response so I assume it wasn't in line with the "Postgres way" (more below). The vacuum problem is very serious for the problematic database to the point that one of my customer's customers said: However, I am having a hard time understanding why the system is so slow... from my perspective it seems like you have some fundamental database issues that need to be addressed. This is simply unacceptable, and that's why we're moving to Oracle. It's very bad for my business reputation. I don't have a ready solution to vacuuming, and none on the list have been effective. We'll be adding more memory, but it seems to be disk bandwidth problem. I run Oracle on much slower system, and I've never noticed problems of this kind, even when a database-wide validation is running. When vacuum is running, it's going through the entire database, and that pretty much trashes all other queries, especially DSS queries. As always it is just software, and there's got to be 80/20 solution. Our new project is large, high-profile, but not as data intensive as the problematic one. We are willing to commit significant funding and effort to make Postgres faster. We are "business value" driven. That means we solve problems practically instead of theoretically. This seems to be in conflict with "the Postgres way", which seems to be more theoretical. Our business situation comes ahead of theories. My customer (who monitors this list) and I believe that our changes would not be accepted back into the Postgres main branch. That presents us with a difficult situation, because we don't want to own a separate branch. (Xemacs helped push emacs, and maybe that's what has to happen here, yet it's not a pretty situation.) We'll be meeting next week to discuss the situation, and how we'll go forward. We have budget in 2003 to spend on this, but only if the situation can be resolved. Otherwise, we'll have to respect the data we are seeing, and think about our choice of technologies. Thanks for the feedback. Rob ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] vacuum locking
scott.marlowe writes:
> t2 was 'vacuum full'ed and analyzed, right? Just guessing.

Fresh import. I've been told this includes an ANALYZE.

Rob ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] vacuum locking
Tom Lane writes:
> Rob Nagler <[EMAIL PROTECTED]> writes:
> > q5 and q6 are too complex to discuss here,
>
> How do you expect us to get better if you don't show us the problems?

With all due respect and thanks for the massive amount of help, I have presented the problems. q5 and q6 are a subset of the following general problems:

* Multiple ORDER BY results in no index used.
  Solution: drop multiple ORDER BY, only use the first.
* Vacuum locks out interactive users.
  Solution: don't run vacuum full, and only run vacuum at night.
* Low cardinality index on a large table confuses the planner.
  Solution: drop the (foreign key) index, which hurts other performance.
* Grouped aggregates result in a disk sort.
  Solution: wait for 7.4 (may work), or write it in Perl (works today).
* Extreme non-linear performance (crossing a magic number in the optimizer drops performance three orders of magnitude).
  Solution: don't cross the magic number, or code it in Perl.

The general problem is that our system generates 90% of the SQL we need. There are many advantages to this, such as being able to add OFFSET/LIMIT support with a few lines of code in a matter of hours. Every time we have to custom code a query, or worse, code it in Perl, we lose many benefits. I understand the need to optimize queries, but my general experience with Oracle is that I don't have to do this very often. When the 80/20 rule inverts, there's something fundamentally wrong with the model. That's where we feel we're at. It's cost us a tremendous amount of money to deal with these query optimizations.

The solution is not to fix the queries, but to address the root causes. That's what my other note in this thread is about. I hope you understand the spirit of my suggestion, and work with us to find an acceptable approach to the general problems.

> BTW, have you tried any of this with a 7.4beta release?

I will, but for my other projects, not this one. I'll run this data, because it's a great test case. We have a business decision to make: devote more time to Postgres or go with Oracle. I spent less than a day getting the data into Oracle and creating the benchmark. The payoff is clear, now. The risk of 7.4 is still very high, because the vacuum problem still looms, and a simple "past performance is a good indicator of future performance" argument applies. Going forward, there's no choice. We've had to limit end-user functionality to get Postgres working as well as it does, and that's way below where Oracle is without those same limits and without any effort put into tuning.

Thanks again for all your support.

Rob ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Pg+Linux swap use
Not being one to hijack threads, but I haven't heard of this performance hit when using HT. I have what should by all rights be a pretty fast server: dual 2.4 Xeons with HT, a 205gb raid 5 array, and 1 gig of memory. And it is only 50% as fast as my old server, which had dual AMD MP 1400s with a 45gb raid 5 array and 1gb of ram. I have read everything I could find on Pg performance, tweaked all the variables that were suggested, and nothing. Which is why I subscribed to this list; I've just been lurking so far, but this caught my eye.

Rob

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Sullivan
Sent: Friday, October 31, 2003 8:36 AM
To: [EMAIL PROTECTED]
Subject: Re: [PERFORM] Pg+Linux swap use

On Fri, Oct 31, 2003 at 12:03:59PM -0200, alexandre :: aldeia digital wrote:
> Scott, Jeff and Shridhar:
>
> 1 GB RAM :)
>
> The stock kernels are not the same, HyperThreading enabled. 80

Some people have reported that things actually slow down with HT enabled. Have you tried turning it off?

A

--
Andrew Sullivan 204-4141 Yonge Street
Afilias Canada  Toronto, Ontario Canada
<[EMAIL PROTECTED]> M2P 2A8
+1 416 646 3304 x110

---(end of broadcast)--- TIP 8: explain analyze is your friend ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Pg+Linux swap use
For the record I am running on SuSE with a pretty much stock kernel. Not to sound naïve, but is turning of HT something done in the bios? Rob -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bill Moran Sent: Friday, October 31, 2003 9:56 AM To: [EMAIL PROTECTED] Subject: Re: [PERFORM] Pg+Linux swap use Just for an additional viewpoint. I'm finishing up a project based on FreeBSD and PostgreSQL. The target server is a Dual 2.4G Intel machine. I have tested the application with hyperthreading enabled and disabled. To all appearances, enabling hyperthreading makes the box act like a quad, with the expected increase in processing capability - _for_this_application_. I have also heard the claims and seen the tests that show hyperthreading occasionally decreasing performance. I think in the end, you just have to test your particular application to see how it reacts. Rob Sell wrote: > Not being one to hijack threads, but I haven't heard of this performance hit > when using HT, I have what should all rights be a pretty fast server, dual > 2.4 Xeons with HT 205gb raid 5 array, 1 gig of memory. And it is only 50% as > fast as my old server which was a dual AMD MP 1400's with a 45gb raid 5 > array and 1gb of ram. I have read everything I could find on Pg performance > tweaked all the variables that were suggested and nothing. Which is why I > subscribed to this list, just been lurking so far but this caught my eye. > > Rob > > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Sullivan > Sent: Friday, October 31, 2003 8:36 AM > To: [EMAIL PROTECTED] > Subject: Re: [PERFORM] Pg+Linux swap use > > On Fri, Oct 31, 2003 at 12:03:59PM -0200, alexandre :: aldeia digital wrote: > >>Scott, Jeff and Shridhar: >> >>1 GB RAM :) >> >>The stock kernels are not the same, HyperThreading enabled. 80 > > > Some people have reported that things actually slow down with HT > enabled. Have you tried turning it off? > > A > -- Bill Moran Potential Technologies http://www.potentialtech.com ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Where to start for performance problem?
The problem with giving suggestions about increasing performance is that one person's increase is another person's decrease. Having said that, there are a few general suggestions:

Set up some shared memory, about a tenth of your available RAM, and configure shared_buffers and max_connections correctly. I've used the following formula, ripped off the net from somewhere. It's not entirely accurate, as other settings steal a little shared memory, but it works for the most part:

((1024*RAM_SIZE) - (14.2 * max_connections) - 250) / 8.2

As I say, it should get you a good value; otherwise lower it bit by bit if you have trouble starting your db.

Increase effective_cache (50%-70% of avail ram) and sort_mem (about 1/20th of ram), and lower your random_page_cost to around 2 or less (as low as 0.3) if you have fast SCSI drives in a RAID10 set-up - this was a big speedup ;)

This might not be the answer, though. The values detailed above are for tuning an already stable setup. Perhaps you need to look at your system resource usage. If you're degrading performance over time, it sounds to me like you are slowly running out of memory and swap?

Generally if I take something over, I'll try and get it onto my terms. Have you tried importing the DB to a fresh installation, one where you know sensible defaults are set, so you aren't inheriting any cruft from the previous sysadmin? To be honest though, I've never run pg so that it actually shut down because it was running so badly - I just wouldn't think it would do that.

--
Rob Fielding [EMAIL PROTECTED] www.dsvr.co.uk Development Designer Servers Ltd

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] tuning questions
> I've tweaked shared buffers to 8192, pushed sort memory to 2048,
> vacuum memory to 8192, and effective cache size to 1.
> /proc/sys/kernel/shmmax is set to 16 and /proc/sys/fs/file-max is
> set to 65536. Ulimit -n 3192.

Your shared memory is too high, and not even being used effectively. Your other settings are too low. Ballpark guessing here, but I'd say first read (and understand) this: http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

Then make shared memory about 10-20% of available ram, and set:

((shmmax/1024) - ( 14.2 * max_connections ) - 250 ) / 8.2 = shared_buffers

Decrease random_page_cost to 0.3, whack up sort_mem by 16 times, set effective_cache_size to about 50% of RAM (depending on your other settings), and try that for starters.

--
Rob Fielding [EMAIL PROTECTED] www.dsvr.co.uk Development Designer Servers Ltd

---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
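To make the arithmetic concrete, here is the formula evaluated for a made-up box with shmmax at 128MB (134217728 bytes) and 64 connections; running it through SQL is just a convenient calculator:

SELECT round(((134217728 / 1024) - (14.2 * 64) - 250) / 8.2) AS shared_buffers;
-- roughly 15843 buffers of 8kB, i.e. most of the 128MB segment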
[PERFORM] Pgbench results
Greetings all, I'm wondering is there a website where people can submit their pgbench results along with their hardware and configuration's? If so where are they at? I have yet to find any. I think this could be a very useful tool not only for people looking at setting up a new server but for people trying to tune their db... Thanks Rob ---(end of broadcast)--- TIP 8: explain analyze is your friend
[PERFORM] WAL Optimisation - configuration and usage
48468 $ vmstat 5 procs memoryswap io system cpu r b w swpd free buff cache si sobibo incs us sy id 0 7 1 29588 10592 15700 809060 1 097750 103 13 9 79 3 8 0 29588 11680 15736 807620 0 0 3313 438 1838 3559 19 13 68 2 13 1 29588 12808 15404 800328 0 0 4470 445 1515 1752 7 7 86 0 9 1 29588 10992 15728 806476 0 0 2933 781 1246 2686 14 10 76 2 5 1 29588 11336 15956 807884 0 0 3354 662 1773 5211 27 17 57 4 5 0 29696 13072 16020 813872 0 24 4282 306 2632 7862 45 25 31 4 6 1 29696 10400 16116 815084 0 0 5086 314 2668 7893 47 26 27 9 2 1 29696 13060 16308 814232 27 0 3927 748 2586 7836 48 29 23 3 8 1 29696 10444 16232 812816 3 0 4015 433 2443 7180 47 28 25 8 4 0 29696 10904 16432 812488 0 0 4537 500 2616 8418 46 30 24 4 6 2 29696 11048 16320 810276 0 0 6076 569 1893 3919 20 14 66 0 5 0 29696 10480 16600 813788 0 0 4595 435 2400 6215 33 21 46 3 6 0 29696 10536 16376 812248 0 0 3802 504 2417 7921 43 25 32 1 6 1 29696 11236 16500 809636 0 0 3691 357 2171 5199 24 15 61 0 14 1 29696 10228 16036 801368 0 0 4038 561 1566 3288 16 12 72 Sorry it's so long but I thought some brief info would be better than not. Thanks for reading, -- Rob Fielding Development Designer Servers Ltd ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] WAL Optimisation - configuration and usage
Rod Taylor wrote:
> > random_page_cost = 0.5
>
> Try a value of 2 for a while.

OK thanks Richard and Rod. I've upped this to 2. I think I left this over from a previous play with settings on my IDE RAID 0 workstation. It seemed to have a good effect being set as a low float, so it stuck. I've set it to 2.

From another post off list, I've also bumped up

max_fsm_relations = 1000   # min 10, fsm
max_fsm_pages = 2          # min 1000, fs
vacuum_mem = 32768         # min 1024

as they did seem a little low. I'm hesitant to set them too high at this stage as I'd prefer to keep as much RAM available for runtime as possible.

I'm still hoping that perhaps the uber-pgadmin Mr Lane might reply about my WAL issue :) However, I'm getting the feeling now that the server is running with a much higher level of performance than it has been. Won't know until tomorrow though.

Cheers,

--
Rob Fielding Development Designer Servers Ltd

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] WAL Optimisation - configuration and usage
Further update to my WAL experimentation. pg_xlog files have increased to 81, and checking today up to 84. Currently nothing much going on with the server save a background process running a select every 30 seconds with almost no impact (according to IO from vmstat). This in itself is a good sign - an improvement on running last week, but I'd still like to get clarification on WAL file usage if possible. Log file tailing has nothing more interesting than a whole set of "recycled transaction log file" entries:
2004-03-01 16:01:55 DEBUG: recycled transaction log file 00710017
2004-03-01 16:07:01 DEBUG: recycled transaction log file 00710018
2004-03-01 16:17:14 DEBUG: recycled transaction log file 00710019
2004-03-01 16:22:20 DEBUG: recycled transaction log file 0071001A
2004-03-01 16:32:31 DEBUG: recycled transaction log file 0071001B
2004-03-01 16:37:36 DEBUG: recycled transaction log file 0071001C
2004-03-01 16:47:48 DEBUG: recycled transaction log file 0071001D
2004-03-01 16:52:54 DEBUG: recycled transaction log file 0071001E
2004-03-01 17:03:05 DEBUG: recycled transaction log file 0071001F
Looks kinda automated, but the times aren't quite even at around 6-10 minutes apart. cheers, -- Rob Fielding [EMAIL PROTECTED] www.dsvr.co.uk Development Designer Servers Ltd ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
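For reference, the documentation of that era says pg_xlog should normally settle at no more than 2 * checkpoint_segments + 1 segment files, and the "recycled" messages simply mean old segments are being renamed for reuse after a checkpoint rather than deleted. A purely illustrative check (the actual checkpoint_segments setting isn't quoted in this thread):

    max pg_xlog files ≈ 2 * checkpoint_segments + 1
    e.g. checkpoint_segments = 40  ->  2 * 40 + 1 = 81 files  (81 * 16 MB ≈ 1.3 GB)

which would line up with the 81-84 files reported above.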
Re: [PERFORM] Time to put theory to the test?
On Mon, Apr 25, 2011 at 12:30 PM, J Sisson wrote: > machines, and virtually every MySQL machine has required data cleanup > and table scans and tweaks to get it back to "production" status. Tip from someone that manages thousands of MySQL servers: Use InnoDB when using MySQL. Using a crash unsafe product will yield undesirable results when a server crashes. It is also faster for many use cases. InnoDB is crash safe. It is just that simple. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
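For what it's worth, on most MySQL versions deployed at that time the default engine was still MyISAM, so the usual first step is to make InnoDB the default so new tables don't silently end up crash-unsafe. A minimal my.cnf sketch (adjust for the specific MySQL version in use):

    [mysqld]
    default-storage-engine = InnoDB   # new tables use InnoDB unless ENGINE= says otherwise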
Re: [PERFORM] Hardware advice for scalable warehouse db
On Fri, Jul 15, 2011 at 11:49 AM, chris r. wrote: > Hi list, > > Thanks a lot for your very helpful feedback! > >> I've tested MD1000, MD1200, and MD1220 arrays before, and always gotten >> seriously good performance relative to the dollars spent > Great hint, but I'm afraid that's too expensive for us. But it's a great > way to scale over the years, I'll keep that in mind. > > I had a look at other server vendors who offer 4U servers with slots for > 16 disks for 4k in total (w/o disks), maybe that's an even > cheaper/better solution for us. If you had the choice between 16 x 2TB > SATA vs. a server with some SSDs for WAL/indexes and a SAN (with SATA > disk) for data, what would you choose performance-wise? > > Again, thanks so much for your help. > > Best, > Chris SATA drives can easily flip bits and postgres does not checksum data, so it will not automatically detect corruption for you. I would steer well clear of SATA unless you are going to be using a fs like ZFS which checksums data. I would hope that a SAN would detect this for you, but I have no idea. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
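As an aside on the ZFS route: silent corruption caught by its block checksums shows up as a non-zero CKSUM count in the pool status, and a scrub forces every block to be re-read and verified. Sketch only, with "tank" as a placeholder pool name:

    zpool scrub tank       # read every block and verify it against its checksum
    zpool status -v tank   # per-device READ / WRITE / CKSUM error counters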
Re: [PERFORM] TCP Overhead on Local Loopback
On Sun, Apr 1, 2012 at 1:24 PM, Ofer Israeli wrote: > Hi all, > > We are running performance tests using PG 8.3 on a Windows 2008 R2 machine > connecting locally over TCP. 8.3 will not be supported in under a year. Time to start testing upgrades. http://www.postgresql.org/support/versioning/ -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
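A rough sketch of what an upgrade test could look like with pg_upgrade, which accepts 8.3 as the old cluster (the paths and the target version below are placeholders; a pg_dumpall dump restored into the new cluster is the other common route):

    pg_upgrade --check \
        -b /path/to/8.3/bin  -B /path/to/9.1/bin \
        -d /path/to/8.3/data -D /path/to/9.1/data

--check only reports potential incompatibilities without modifying either cluster, so it can be run against a copy of the production data directory first.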
Re: [PERFORM] effective_cache_size on 32-bits postgres
On Mon, Mar 18, 2013 at 10:53 AM, Rodrigo Barboza wrote: > Hi guys, I am worried about the effective_cache_size. > I run a 32-bits postgres installation on a machine with 64 bits kernel. > Should I limit effective_cache_size to a maximum of 2.5gb? That variable refers to the fs cache, so 32-bit pg should not matter. Shared buffers and similar variables will be another matter. Why the heck are you running 32-bit pg on a 64-bit system? You are almost certainly doing it wrong. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
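To illustrate the distinction (the values below are made-up examples for an 8 GB machine, not recommendations): effective_cache_size is only a planner hint describing how much memory the OS is expected to use for file caching, and nothing is allocated for it, whereas shared_buffers is real shared memory inside the 32-bit process and is what the address-space limit actually constrains.

    # postgresql.conf -- illustrative values only
    effective_cache_size = 6GB    # planner estimate of the OS page cache; no allocation happens
    shared_buffers = 512MB        # real allocation; must stay well within the 32-bit address space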
Re: [PERFORM]
On Sun, Apr 21, 2013 at 5:46 AM, sunil virmani wrote: > My DB version is little old - 8.1.18. > Your db is exceptionally old and very much unsupported. Vacuum has massively improved since 8.1. See http://www.postgresql.org/support/versioning/ regarding supported versions.
[PERFORM] Deleting Rows From Large Tables
Hi All, We've got 3 quite large tables that due to an unexpected surge in usage (!) have grown to about 10GB each, with 72, 32 and 31 million rows in. I've been tasked with cleaning out about half of them; the problem I've got is that even deleting the first 1,000,000 rows seems to take an unreasonable amount of time. Unfortunately this is on quite an old server (Dell 2950 with a RAID-10 over 6 disks) running Postgres 8.4, which serves other things like our logging systems. If I run a sustained (more than about 5 minutes) delete it'll have a detrimental effect on the other services. I'm trying to batch up the deletes into small chunks of approximately 1 month of data; even this seems to take too long. I originally reduced this down to a single day's data and had the same problem. I can keep decreasing the size of the window I'm deleting but I feel I must be doing something either fundamentally wrong or over-complicating this enormously. I've switched over to retrieving a list of IDs to delete, storing them in temporary tables and deleting based on the primary keys on each of the tables with something similar to this:

BEGIN TRANSACTION;

CREATE TEMPORARY TABLE table_a_ids_to_delete (id INT);
CREATE TEMPORARY TABLE table_b_ids_to_delete (id INT);

INSERT INTO table_a_ids_to_delete
    SELECT id FROM table_a
    WHERE purchased = '-infinity' AND created_at < '2007-01-01T00:00:00';

INSERT INTO table_b_ids_to_delete
    SELECT table_b_id FROM table_a_table_b_xref
    INNER JOIN table_a_ids_to_delete
        ON (table_a_ids_to_delete.id = table_a_table_b_xref.table_a_id);

DELETE FROM table_a_table_b_xref USING table_a_ids_to_delete
    WHERE table_a_table_b_xref.table_a_id = table_a_ids_to_delete.id;

DELETE FROM table_b USING table_b_ids_to_delete
    WHERE table_b.id = table_b_ids_to_delete.id;

DELETE FROM table_a USING table_a_ids_to_delete
    WHERE table_a.id = table_a_ids_to_delete.id;

COMMIT;

There're indices on table_a on the queried columns, table_b's primary key is its id, and table_a_table_b_xref has an index on (table_a_id, table_b_id). There're FKs defined on the xref table, hence why I'm deleting from it first. Does anyone have any ideas as to what I can do to make the deletes any faster? I'm running out of ideas! Thanks in advance, -- Rob Emery -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
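One pattern that sometimes helps in this situation (a sketch only; the 10,000-row batch size is arbitrary and the column names are simply carried over from the message above): delete in small, separately committed batches so no single statement runs long enough to hurt the other services, with the driving script sleeping between batches.

    -- one batch; repeat (e.g. from a shell loop with a sleep) until it reports DELETE 0
    DELETE FROM table_a
    WHERE id IN (
        SELECT id
        FROM table_a
        WHERE purchased = '-infinity'
          AND created_at < '2007-01-01T00:00:00'
        LIMIT 10000
    );

The xref and table_b rows would need the same treatment (or ON DELETE CASCADE on the foreign keys), and keeping each batch in its own short transaction avoids holding locks for long and lets autovacuum start reclaiming space as you go.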
[PERFORM] Performant queries on table with many boolean columns
Hey all, New to the lists so please let me know if this isn't the right place for this question. I am trying to understand how to structure a table to allow for optimal performance on retrieval. The data will not change frequently so you can basically think of it as static and only concerned about optimizing reads from basic SELECT...WHERE queries.

The data:
- ~20 million records
- Each record has 1 id and ~100 boolean properties
- Each boolean property has ~85% of the records as true

The retrieval will always be something like "SELECT id FROM <table> WHERE <conditions>". <conditions> will be some arbitrary set of the ~100 boolean columns and you want the ids that match all of the conditions (true for each boolean column). Example: WHERE prop1 AND prop18 AND prop24

The obvious thing seems to make a table with ~100 columns, with 1 column for each boolean property. Though, what type of indexing strategy would one use on that table? Doesn't make sense to do BTREE. Is there a better way to structure it? Any and all advice/tips/questions appreciated! Thanks, Rob
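One alternative layout worth sketching (purely illustrative; the table and column names are invented, and the width is shrunk to bit(8) for readability where the real table would use bit(100)): pack all the flags into a single bit string and compare it against a mask built from the requested properties.

    -- hypothetical sketch, not the poster's schema
    CREATE TABLE props (
        id    int PRIMARY KEY,
        flags bit(8)            -- bit(100) in the real case; bit N = property N
    );

    -- "prop1 AND prop3" becomes a mask with those two bits set:
    SELECT id
    FROM props
    WHERE (flags & B'10100000') = B'10100000';

Whether this beats ~100 plain boolean columns is not a given, since neither layout is naturally btree-indexable for arbitrary combinations and both end up scanning, but it keeps rows small and the per-row test cheap.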
Re: [PERFORM] Performant queries on table with many boolean columns
Hey all, Lots of interesting suggestions! I'm loving it. Just came back to this a bit earlier today and made a sample table to see what non-index performance would be. Constructed data just like above (used 12M rows and 80% true for all 100 boolean columns) Here's an analyze for what I'd expect to be the types of queries that I'll be handling from the frontend. I would expect around 40-70 properties per query. Now I'm going to start experimenting with some ideas above and other tuning. This isn't as bad as I thought it would be, though would like to get this under 200ms. rimig=# explain analyze select count(*) from bloomtest where prop0 AND prop1 AND prop2 AND prop3 AND prop4 AND prop5 AND prop6 AND prop7 AND prop8 AND prop9 AND prop10 AND prop11 AND prop12 AND prop13 AND prop14 AND prop15 AND prop16 AND prop17 AND prop18 AND prop19 AND prop20 AND prop21 AND prop22 AND prop23 AND prop24 AND prop25 AND prop26 AND prop27 AND prop28 AND prop29 AND prop30 AND prop31 AND prop32 AND prop33 AND prop34 AND prop35 AND prop36 AND prop37 AND prop38 AND prop39 AND prop40 AND prop41 AND prop42 AND prop43 AND prop44 AND prop45 AND prop46 AND prop47 AND prop48 AND prop49 AND prop50 AND prop51 AND prop52 AND prop53 AND prop54 AND prop55 AND prop56 AND prop57 AND prop58 AND prop59 AND prop60 AND prop61 AND prop62 AND prop63 AND prop64; Aggregate (cost=351563.03..351563.04 rows=1 width=0) (actual time=2636.829..2636.829 rows=1 loops=1) -> Seq Scan on bloomtest (cost=0.00..351563.02 rows=3 width=0) (actual time=448.200..2636.811 rows=9 loops=1) Filter: (prop0 AND prop1 AND prop2 AND prop3 AND prop4 AND prop5 AND prop6 AND prop7 AND prop8 AND prop9 AND prop10 AND prop11 AND prop12 AND prop13 AND prop14 AND prop15 AND prop16 AND prop17 AND prop18 AND prop19 AND prop20 AND prop21 AND prop22 AND prop23 AND prop24 AND prop25 AND prop26 AND prop27 AND prop28 AND prop29 AND prop30 AND prop31 AND prop32 AND prop33 AND prop34 AND prop35 AND prop36 AND prop37 AND prop38 AND prop39 AND prop40 AND prop41 AND prop42 AND prop43 AND prop44 AND prop45 AND prop46 AND prop47 AND prop48 AND prop49 AND prop50 AND prop51 AND prop52 AND prop53 AND prop54 AND prop55 AND prop56 AND prop57 AND prop58 AND prop59 AND prop60 AND prop61 AND prop62 AND prop63 AND prop64) Rows Removed by Filter: 1191 Total runtime: 2636.874 ms On Thu, Apr 21, 2016 at 12:45 PM, Jeff Janes wrote: > On Wed, Apr 20, 2016 at 11:54 AM, Teodor Sigaev wrote: > >> > >> The obvious thing seems to make a table with ~100 columns, with 1 column > >> for each boolean property. Though, what type of indexing strategy would > >> one use on that table? Doesn't make sense to do BTREE. Is there a better > >> way to structure it? > >> > > looks like a deal for contrib/bloom index in upcoming 9.6 release > > Not without doing a custom compilation with an increased INDEX_MAX_KEYS: > > ERROR: cannot use more than 32 columns in an index > > But even so, I'm skeptical this would do better than a full scan. It > would be interesting to test that. > > Cheers, > > Jeff >
Re: [PERFORM] Performant queries on table with many boolean columns
.595..9359.566 rows=9 loops=1) Filter: (((bitstr)::"bit" & B'1'::"bit") = B'1'::"bit") Rows Removed by Filter: 1191 Total runtime: 9359.593 ms (4 rows) *Time: 9360.072 ms* On Thu, Apr 21, 2016 at 3:34 PM, Rob Imig wrote: > Hey all, > > Lots of interesting suggestions! I'm loving it. > > Just came back to this a bit earlier today and made a sample table to see > what non-index performance would be. Constructed data just like above (used > 12M rows and 80% true for all 100 boolean columns) > > Here's an analyze for what I'd expect to be the types of queries that I'll > be handling from the frontend. I would expect around 40-70 properties per > query. > > Now I'm going to start experimenting with some ideas above and other > tuning. This isn't as bad as I thought it would be, though would like to > get this under 200ms. > > rimig=# explain analyze select count(*) from bloomtest where prop0 AND > prop1 AND prop2 AND prop3 AND prop4 AND prop5 AND prop6 AND prop7 AND prop8 > AND prop9 AND prop10 AND prop11 AND prop12 AND prop13 AND prop14 AND prop15 > AND prop16 AND prop17 AND prop18 AND prop19 AND prop20 AND prop21 AND > prop22 AND prop23 AND prop24 AND prop25 AND prop26 AND prop27 AND prop28 > AND prop29 AND prop30 AND prop31 AND prop32 AND prop33 AND prop34 AND > prop35 AND prop36 AND prop37 AND prop38 AND prop39 AND prop40 AND prop41 > AND prop42 AND prop43 AND prop44 AND prop45 AND prop46 AND prop47 AND > prop48 AND prop49 AND prop50 AND prop51 AND prop52 AND prop53 AND prop54 > AND prop55 AND prop56 AND prop57 AND prop58 AND prop59 AND prop60 AND > prop61 AND prop62 AND prop63 AND prop64; > > Aggregate (cost=351563.03..351563.04 rows=1 width=0) (actual > time=2636.829..2636.829 rows=1 loops=1) > >-> Seq Scan on bloomtest (cost=0.00..351563.02 rows=3 width=0) > (actual time=448.200..2636.811 rows=9 loops=1) > > Filter: (prop0 AND prop1 AND prop2 AND prop3 AND prop4 AND prop5 > AND prop6 AND prop7 AND prop8 AND prop9 AND prop10 AND prop11 AND prop12 > AND prop13 AND prop14 AND prop15 AND prop16 AND prop17 AND prop18 AND > prop19 AND prop20 AND prop21 AND prop22 AND prop23 AND prop24 AND prop25 > AND prop26 AND prop27 AND prop28 AND prop29 AND prop30 AND prop31 AND > prop32 AND prop33 AND prop34 AND prop35 AND prop36 AND prop37 AND prop38 > AND prop39 AND prop40 AND prop41 AND prop42 AND prop43 AND prop44 AND > prop45 AND prop46 AND prop47 AND prop48 AND prop49 AND prop50 AND prop51 > AND prop52 AND prop53 AND prop54 AND prop55 AND prop56 AND prop57 AND > prop58 AND prop59 AND prop60 AND prop61 AND prop62 AND prop63 AND prop64) > > Rows Removed by Filter: 1191 > > Total runtime: 2636.874 ms > > On Thu, Apr 21, 2016 at 12:45 PM, Jeff Janes wrote: > >> On Wed, Apr 20, 2016 at 11:54 AM, Teodor Sigaev wrote: >> >> >> >> The obvious thing seems to make a table with ~100 columns, with 1 >> column >> >> for each boolean property. Though, what type of indexing strategy would >> >> one use on that table? Doesn't make sense to do BTREE. Is there a >> better >> >> way to structure it? >> >> >> > looks like a deal for contrib/bloom index in upcoming 9.6 release >> >> Not without doing a custom compilation with an increased INDEX_MAX_KEYS: >> >> ERROR: cannot use more than 32 columns in an index >> >> But even so, I'm skeptical this would do better than a full scan. It >> would be interesting to test that. >> >> Cheers, >> >> Jeff >> > >