Re: [PERFORM] PostgreSQL vs. MySQL
...and on Sat, Jul 05, 2003 at 12:24:18AM +0200, Bjoern Metzdorf used the keyboard:
> >> Afaik, your original posting said postgresql was 3 times slower than
> >> mysql and that you are going to leave this list now. This implied
> >> that you have made your decision between postgresql and mysql,
> >> taking mysql because it is faster.
> >
> > Well, that shows what you get for making implications. The client is
> > sticking with postgres and we are coding around the issue in other ways.
>
> As many other guys here pointed out, there are numerous ways to tune
> postgresql for maximum performance. If you are willing to share more
> information about your particular project, we might be able to help you
> out and optimize your application, without the need to code around the
> issue as much as you may be doing right now.
>
> Even if it is not possible for you to share enough information, there are
> a lot of places where you can read about performance tuning (if not in
> the docs then in the archives).

Also, I should think the clients would not be too offended if Brian posted
some hint about the actual quantity of data involved here, both the total
expected database size and some info about the estimated "working set"
size, such as a sum of sizes of tables most commonly used in JOIN queries
and the percentage of data being shuffled around in those. Are indexes
big? Are there any multicolumn indexes in use? Lots of sorting expected?
Lots of UPDATEs/INSERTs/DELETEs?

Also, it would be helpful to know just how normalized the database is, to
provide some advice about possible query optimization, which could again
prove helpful in speeding the machinery up.

Another useful piece of information would be the amount of memory consumed
by other applications vs. the amount of memory reserved by the OS for
cache, and the nature of those other applications running - are they big
cache consumers, such as Apache with static content and a large load would
be, or do they keep a low profile?

I think this would, in combination with the information already posted,
such as the amount of memory and I/O subsystem info, at least enable us to
advise about the recommended shared_buffers, effective_cache_size,
sort_mem, vacuum_mem, and others, without compromising the intellectual
property of Brian's clients.

> > over and out.

I CC'd this post over to you, Brian, 'cause this signoff made me rather
unsure as to whether or not you're still on the list. Hope you don't mind.

Sincerely,
-- 
Grega Bremec
System Administration & Development Support
grega.bremec-at-noviforum.si
http://najdi.si/
http://www.noviforum.si/

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
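As a rough illustration of the "working set" numbers asked for above: after a
VACUUM or ANALYZE, approximate table and index sizes can be read out of
pg_class, where relpages counts 8 kB pages. The relation names below are
placeholders, not anything from the thread.

    -- Approximate on-disk size of the relations most used in joins
    -- (relation names here are hypothetical examples).
    SELECT relname, relkind, relpages, relpages * 8 / 1024 AS approx_mb
      FROM pg_class
     WHERE relname IN ('orders', 'order_items', 'customers')
     ORDER BY relpages DESC;

Summing the rows that show up in the common JOIN queries gives a first guess
at the working-set size to compare against RAM.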
Re: [PERFORM] Strange result: UNIX vs. TCP/IP sockets
On Fri, Jul 04, 2003 at 05:47:27PM -0400, Tom Lane wrote:
> Andrew Sullivan <[EMAIL PROTECTED]> writes:
> > How could it be the transport affects the time for the query as
> > reported by the back end?
>
> How much data is being sent back by the query?

In this case, it's an all-aggregate query:

    select count(*), min(id) from sometable where owner = int4;

(Yeah, yeah, I know. I didn't write it.)

But it's the EXPLAIN ANALYSE that's reporting different times depending on
the transport. That's what I find so strange.

> Do you have SSL enabled? SSL encryption overhead is nontrivial,
> especially if any renegotiations happen.

No.

-- 
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                             Toronto, Ontario Canada
<[EMAIL PROTECTED]>                     M2P 2A8
                                        +1 416 646 3304 x110

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match
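For reference, the timings EXPLAIN ANALYSE prints are measured inside the
backend itself, which is why a transport-dependent difference is surprising.
A sketch of the query under discussion, where 42 merely stands in for the
integer constant obscured in the original post:

    EXPLAIN ANALYZE
    SELECT count(*), min(id)
      FROM sometable
     WHERE owner = 42;   -- 42 is a placeholder for the actual constant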
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> Sean Chittenden <[EMAIL PROTECTED]> writes:
> > Getting the planner to pick using the index to filter out data
> > inserted in the last 3 days over doing a seq scan... well, I don't
> > know how you could do that without changing the random_page_cost.
>
> This sounds a *whole* lot like a correlation issue. If the data in
> question were scattered randomly in the table, it's likely that an
> indexscan would be a loser. The recently-inserted data is probably
> clustered near the end of the table (especially if they're doing
> VACUUM FULL after data purges; are they?). But the planner's
> correlation stats are much too crude to recognize that situation, if
> the rest of the table is not well-ordered.

Data isn't scattered randomly from what I can tell and is basically
already clustered, just because the data is inserted linearly and based
off of time. I don't think they're doing a VACUUM FULL after a purge, but
I'll double check on that on Monday when they get in.

Is there an easy way of determining or setting a planner stat to suggest
that data is ordered around a column in a permanent way? CLUSTER has
always been a one-shot deal and its effects wear off quickly depending on
the way that data is inserted. It seems as though this would be a
circumstance in which preallocated disk space would be a win (that way
data wouldn't always be appended to the heap and could be inserted in
order, of most use for non-time related data: ex, some non-unique ID).

> If their typical process involves a periodic data purge and then a
> VACUUM FULL, it might be worth experimenting with doing a CLUSTER on
> the timestamp index instead of the VACUUM FULL. The CLUSTER would
> reclaim space as effectively as VACUUM FULL + REINDEX, and it would
> leave the table with an unmistakable 1.0 correlation ... which should
> tilt the planner towards an indexscan without needing a physically
> impossible random_page_cost to do it. I think CLUSTER would probably
> be a little slower than VACUUM FULL but it's hard to be sure without
> trying.

Hrm, I understand what clustering does, I'm just not convinced that it'll
"fix" this performance problem unless CLUSTER sets some kind of hint that
ANALYZE uses to modify the way in which it collects statistics. Like I
said, I'll let you know on Monday when they're back in the shop, but I'm
not holding my breath.

I know random_page_cost is set to something physically impossible, but in
terms of performance, it's always been the biggest win for me to set this
puppy quite low. Bug in the planner, or in the documentation surrounding
what this knob does, I'm not sure, but setting this to a low value
consistently yields good results for me. The faster the drive, the lower
the random_page_cost value. *shrug*

> That's one heck of a poor estimate for the number of rows returned.
>
> > -> Seq Scan on mss_fwevent (cost=0.00..223312.60 rows=168478 width=12)
> >    (actual time=24253.66..24319.87 rows=320 loops=1)

The stats for the columns are already set to 1000 to aid with this... I
don't know what else I can do here. Having the planner off by as much as
even half the actual size isn't uncommon in my experience.

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
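A minimal sketch of the experiment Tom suggests above, together with the
per-column statistics target that was raised to 1000. Only the table name
(mss_fwevent) comes from the EXPLAIN output; the index and column names are
assumptions for illustration.

    -- 7.3-style CLUSTER: rewrite the table in timestamp-index order,
    -- reclaiming space and leaving correlation = 1.0 on that column.
    CLUSTER mss_fwevent_evtime_idx ON mss_fwevent;   -- hypothetical index name

    -- the raised per-column statistics target mentioned in the thread:
    ALTER TABLE mss_fwevent ALTER COLUMN evtime      -- hypothetical column name
        SET STATISTICS 1000;
    ANALYZE mss_fwevent;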
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > # The default values for PostgreSQL are extremely conservative and
> > # are likely far from ideal for a site's needs. Included in this
> > # configuration, however, are _suggested_ values to help aid in
> > > #
>
> This sort of narrative belongs in the SGML docs, not in a CONF file.
> In fact, one could argue that we should take *all* commentary out of
> the CONF file in order to force people to read the docs.

The SGML docs aren't in the DBA's face and are way out of the way for DBAs
rolling out a new system or who are tuning the system. SGML == Developer,
conf == DBA.

> Database performance tuning will always be a "black art," as it
> necessitates a broad knowledge of PostgreSQL, OS architecture, and
> computer hardware. So I doubt that we can post docs that would allow
> any 10% time DBA to make PostgreSQL "fly", but hopefully over the next
> year we can make enough knowledge public to allow anyone to make
> PostgreSQL "sprint".

I'm highly resistant to/disappointed in this attitude and firmly believe
that there are well understood algorithms that DBAs use to diagnose and
solve performance problems. It's only a black art because it hasn't been
documented. Performance tuning isn't voodoo; it's adjusting constraints to
align with the execution of applications, and we know what the
applications do, therefore the database can mold to the applications'
needs. Some of those parameters are based on hardware constraints and
should be pooled and organized as such.

random_page_cost ==
avg cost of a random disk seek/read (eg: disk seek time) ==
constant integer for a given piece of hardware

There are other settings that are RAM based as well, which should be
formulaic and derived, though a formula hasn't been defined to date.

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
Sean,

> The SGML docs aren't in the DBA's face and are way out of the way for
> DBAs rolling out a new system or who are tuning the system. SGML ==
> Developer, conf == DBA.

That's exactly my point. We cannot provide enough documentation in the
CONF file without septupling its length. If we remove all commentary, and
instead provide a pointer to the documentation, more DBAs will read it.

> Some of those parameters are based on hardware constraints and should
> be pooled and organized as such.
>
> random_page_cost ==
> avg cost of a random disk seek/read (eg: disk seek time) ==
> constant integer for a given piece of hardware

But, you see, this is exactly what I'm talking about. random_page_cost
isn't static to a specific piece of hardware ... it depends as well on
what else is on the disk/array, concurrent disk activity, disk controller
settings, filesystem, OS, distribution of records and tables, and
arrangement of the partitions on disk.

One can certainly get a "good enough" value by benchmarking the disk's
random seek and calculating based on that ... but to get an "ideal" value
requires a long interactive session by someone with experience and
in-depth knowledge of the machine and database.

> There are other settings that are RAM based as well, which should be
> formulaic and derived, though a formula hasn't been defined to date.

You seem pretty passionate about this ... how about you help me and Kevin
define a benchmarking suite when I get back into the country (July 17)?
If we're going to define formulas, it requires that we have a
near-comprehensive and consistent test database and test battery that we
can run on a variety of machines and platforms.

-- 
-Josh Berkus

__AGLIO DATABASE SOLUTIONS___
                                        Josh Berkus
   Complete information technology      [EMAIL PROTECTED]
    and data management solutions       (415) 565-7293
   for law firms, small businesses       fax 621-2533
    and non-profit organizations.       San Francisco

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PERFORM] PostgreSQL vs. MySQL
On Friday, July 4, 2003, at 07:07 AM, Brian Tarbox wrote:

> We had about 40 tables in the db, with joined queries on about 8-12
> tables.

A while ago I tested a moderately complex schema on MySQL, Pg, and Oracle.
I usually heavily normalize schemas and then define views as a
denormalized API, which sends MySQL to the book of toys already. The views
more often than not would join anywhere from 6-12 tables, using plain (as
opposed to compound) foreign key to primary key straight joins.

I noticed that Pg was more than an order of magnitude slower than Oracle
for joins of more than 8 tables. I won't claim that none of this can have
been due to lack of tuning. My point is the following though. After I dug
in, it turned out that of the 4 secs Pg needed to execute the query, it
spent 3.9 secs in the planner. The execution plan Pg came up with was
pretty good - it just needed an extraordinary amount of time to arrive at
it, spoiling its own results.

Asking this list, I then learned how to tweak GEQO such that it would pick
up the planning and do it faster than it would otherwise. I was able to
get the planner time down to a quarter - still a multiple of the actual
execution time.

I was told on this list that query planning suffers from combinatorial
explosion very quickly - and I completely buy that. It's just - Oracle
planned the same query in a fraction of a second, using the cost-based
optimizer, on a slower machine. I've seen it plan 15-table joins in much
less than a second, and I have no idea how it would do that. In addition,
once you've prepared a query in Oracle, the execution plan is
pre-compiled.

If I were a CS student I'd offer myself to the hall of humiliation and set
out to write a fast query planner for Pg ...

        -hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match
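The GEQO tweak being described can be tried per session before committing it
to postgresql.conf. The threshold value and the view name below are purely
illustrative assumptions, not recommendations from the thread.

    -- Let the genetic (GEQO) planner take over at a lower join count,
    -- trading some plan quality for much shorter planning time on wide joins.
    SET geqo TO on;
    SET geqo_threshold TO 8;   -- plan FROM lists of 8 or more items with GEQO

    EXPLAIN ANALYZE SELECT * FROM some_wide_view;   -- hypothetical view name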
Re: [PERFORM] PostgreSQL vs. MySQL
Brian Tarbox wrote on Fri, 04.07.2003 at 15:27:
> I recently took a system from MySQL to Postgres. Same HW, SW, same
> data. The major operations were moderately complex queries (joins on 8
> tables). The results we got was that Postgres was fully 3 times slower
> than MySql.

For each and every query??

> We were on this list a fair bit looking for answers and tried all the
> standard answers.

Could you post the list of "standard answers" you tried?

> It was still much much much slower.

Was this with InnoDB? What kind of joins were they (i.e. "FROM a JOIN b ON
a.i = b.i" or "FROM a, b WHERE a.i = b.i")?

What was the ratio of planning time to actual execution time in pgsql?

Were the queries originally optimized for MySQL?

Hannu

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > The SGML docs aren't in the DBA's face and are way out of the way
> > for DBAs rolling out a new system or who are tuning the system.
> > SGML == Developer, conf == DBA.
>
> That's exactly my point. We cannot provide enough documentation in
> the CONF file without septupling its length. If we remove all
> commentary, and instead provide a pointer to the documentation, more
> DBAs will read it.

Which I don't think would happen, and why I think the terse bits that are
included are worth while. :)

> > Some of those parameters are based on hardware constraints and
> > should be pooled and organized as such.
> >
> > random_page_cost ==
> > avg cost of a random disk seek/read (eg: disk seek time) ==
> > constant integer for a given piece of hardware
>
> But, you see, this is exactly what I'm talking about.
> random_page_cost isn't static to a specific piece of hardware ... it
> depends as well on what else is on:

*) the disk/array

Translation: how fast data is accessed and over how many drives.

*) concurrent disk activity

A disk/database activity metric is different than the cost of a seek on
the platters. :) That PostgreSQL doesn't currently support such a disk
concurrency metric doesn't mean that its definition should get rolled into
a different number in an attempt to accommodate for the lack thereof.

*) disk controller settings

This class of settings falls into the same settings that affect random
seeks on the platters/disk array(s).

*) filesystem

Again, this influences avg seek time.

*) OS

Again, avg seek time.

*) distribution of records and tables

This has nothing to do with PostgreSQL's random_page_cost setting other
than that if data is fragmented on the platter, the disk is going to have
to do a lot of seeking. This is a stat that should get set by ANALYZE, not
by a human.

*) arrangement of the partitions on disk

Again, avg seek time.

> One can certainly get a "good enough" value by benchmarking the
> disk's random seek and calculating based on that ... but to get an
> "ideal" value requires a long interactive session by someone with
> experience and in-depth knowledge of the machine and database.

An "ideal" value isn't obtained via guess and check. Checking is only the
verification of some calculable set of settings, though right now those
calculated settings are guessed, unfortunately.

> > There are other settings that are RAM based as well, which should
> > be formulaic and derived, though a formula hasn't been defined to
> > date.
>
> You seem pretty passionate about this ... how about you help me and
> Kevin define a benchmarking suite when I get back into the country
> (July 17)? If we're going to define formulas, it requires that we
> have a near-comprehensive and consistent test database and test
> battery that we can run on a variety of machines and platforms.

Works for me, though a benchmark will be less valuable than adding a disk
concurrency stat, improving data trend/distribution analysis, and using
numbers that are concrete and obtainable through the OS kernel API or an
admin manually plunking numbers in. I'm still recovering from my move from
Cali to WA so with any luck, I'll be settled in by then.

-sc

-- 
Sean Chittenden

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your message
      can get through to the mailing list cleanly
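On the "distribution of records and tables" point: the ordering statistic
that ANALYZE already gathers is visible per column through the pg_stats
view. A quick way to inspect it, using the mss_fwevent table from earlier in
the thread (any table name works):

    ANALYZE mss_fwevent;
    SELECT attname, n_distinct, correlation
      FROM pg_stats
     WHERE tablename = 'mss_fwevent';

A correlation near 1.0 or -1.0 means the column's values are laid out in
nearly the same order as the rows on disk, which is exactly what the planner
looks at when weighing an index scan.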
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
Sean,

> > That's exactly my point. We cannot provide enough documentation in
> > the CONF file without septupling its length. If we remove all
> > commentary, and instead provide a pointer to the documentation, more
> > DBAs will read it.
>
> Which I don't think would happen, and why I think the terse bits that
> are included are worth while. :)

Depressingly enough, you are probably correct, unless we assemble a more
user-friendly "getting started" guide.

> *) concurrent disk activity
>
> A disk/database activity metric is different than the cost of a seek
> on the platters. :) That PostgreSQL doesn't currently support such a
> disk concurrency metric doesn't mean that its definition should get
> rolled into a different number in an attempt to accommodate for the
> lack thereof.

I was talking about concurrent activity by *other* applications. For
example, if a DBA has a java app that is accessing XML on the same array
as postgres 500 times/minute, then you'd need to adjust random_page_cost
upwards to allow for the resource contention.

> An "ideal" value isn't obtained via guess and check. Checking is only
> the verification of some calculable set of settings, though right now
> those calculated settings are guessed, unfortunately.
>
> Works for me, though a benchmark will be less valuable than adding a
> disk concurrency stat, improving data trend/distribution analysis, and
> using numbers that are concrete and obtainable through the OS kernel
> API or an admin manually plunking numbers in. I'm still recovering
> from my move from Cali to WA so with any luck, I'll be settled in by
> then.

The idea is that for a lot of statistics, we're only going to be able to
obtain valid numbers if you have something constant to check them against.

Talk to you later this month!

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
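The adjustment being described is a one-line knob; the value below is purely
an illustration of the direction (upwards under contention), not a
recommended number.

    SHOW random_page_cost;       -- the shipped default is 4
    SET random_page_cost TO 6;   -- illustrative only: nudged upwards when the
                                 -- array sees heavy concurrent traffic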
[PERFORM] Extreme high load averages
The only time that I have ever seen load averages of 30 or more under
OpenBSD is when one of my scripts goes wild. However, I can say that I am
also seeing these load averages under PostgreSQL 7.3.2 after a migration
to it from MySQL.

MySQL statistics:
  Uptime: 1055352   Threads: 178   Questions: 75161710   Slow queries: 46
  Opens: 1084   Flush tables: 1   Open tables: 206
  Queries per second avg: 71.220

The above are statistics from the older generation of scripts that made
use of MySQL, to give an idea of what's going on. That generation of
scripts would handle the referential integrity, since foreign key
constraints are not enforced under that system. However, the system
handled 250 concurrent users without a single problem, while under
Postgres, with new scripts using functions, referential integrity,
transactions and lighter code, the system starts to buckle at even less
than 70 users.

What I would like to know is: why? The kernel has been compiled to handle
the number of concurrent connections, and while the server may not be the
best, it should be able to handle the requests: PIII 1GHz, 1GB SDRAM,
2 IDE 20GB drives.

I have changed settings to take advantage of the memory, so the following
settings are of interest:

  shared_buffers = 16384
  wal_buffers = 256
  sort_mem = 16384
  vacuum_mem = 32768

Statistics gathering has now been disabled, and logging is done through
syslog. I do not expect those settings to cripple system performance,
however.

The scripts are heavy SELECTs with a fair dose of UPDATEs and INSERTs. To
get a concept of what these scripts do, you can look at Ethereal Realms
(http://www.ethereal-realms.org), which is running the PostgreSQL script
variants, or consider that this is a chat site.

Anyone have ideas? Is the use of connection pooling considered bad? Should
flush be run more than once a day? I have no intention of going back to
MySQL, and would like to make this new solution work.

    Martin Foster
    Creator/Designer Ethereal Realms
    [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
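For scale, not as a diagnosis: the posted settings translate roughly as
follows on a 1 GB machine, given that shared_buffers and wal_buffers count
8 kB pages while sort_mem and vacuum_mem are in kB, and sort_mem is granted
per sort operation, per backend.

  shared_buffers = 16384  ->  16384 x 8 kB  = 128 MB of shared buffer cache
  wal_buffers    =   256  ->    256 x 8 kB  =   2 MB of WAL buffers
  sort_mem       = 16384  ->  16 MB per sort, per backend (a few dozen
                              concurrent sorts can outrun 1 GB of RAM)
  vacuum_mem     = 32768  ->  32 MB while VACUUM is running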