[PERFORM] Parallel Vacuum
Folks, are there any constraints/problems/etc. with running several vacuum processes in parallel, each one vacuuming a different table? Example: vacuumdb -d db1 -t table1 & vacuumdb -d db1 -t table2 & vacuumdb -d db1 -t table3 & wait (sorry if this was already asked, but I did not find an explicit answer in the archives) Thanks for any input! Rgds, -Dimitri
Re: [PERFORM] Parallel Vacuum
On Thursday 22 March 2007 14:52, Alvaro Herrera wrote: > Dimitri wrote: > > Folks, > > > > are there any constraints/problems/etc. with running several vacuum processes in > > parallel, each one vacuuming a different table? > > No, no problem. Keep in mind that if one of them takes a very long > time, the others will not be able to remove dead tuples that were > killed while the long vacuum was running -- unless you are in 8.2. Yes, I'm using the latest 8.2.3 version. So, will they *really* process in parallel, or will they block each other step by step? Rgds, -Dimitri
Re: [PERFORM] Parallel Vacuum
On Thursday 22 March 2007 16:12, Alvaro Herrera wrote: > Dimitri wrote: > > [...] So, will they *really* process in parallel, or will they block each other step by step? > > They won't block. Wow! Excellent! :) So, in that case, why not add a 'parallel' option integrated directly into the 'vacuumdb' command? In my case I have several CPUs on the server and quite a powerful storage box which is not really busy with a single vacuum. So, my idea is quite simple - speed up vacuum with parallel execution (just an algorithm, with the biggest tables vacuumed first; a runnable sketch of this loop follows below):
-- PLL=parallel_degree
select tab_size, tabname, dbname from ... order by tab_size desc;
for each table, biggest first:
    vacuumdb -d $dbname -t $tabname > /tmp/vac.$dbname.$tabname.log 2>&1 &
    while (pgrep vacuumdb | wc -l) >= $PLL
        sleep 1
    end
end
wait
But of course it would be much cooler to have something like: vacuumdb -a -P parallel_degree What do you think? ;) Rgds, -Dimitri
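A runnable version of that driver loop might look like this (a minimal sketch: the database name, parallel degree and log paths are placeholders, and the table list is taken from pg_class ordered by size, so system catalogs are included too):

    #!/bin/sh
    # Hypothetical parallel-vacuum driver: biggest tables first,
    # never more than $PLL vacuumdb processes at a time.
    PLL=4
    DB=db1

    TABLES=`psql -At -d $DB -c "SELECT relname FROM pg_class WHERE relkind = 'r' ORDER BY relpages DESC"`

    for T in $TABLES; do
        # throttle until a slot is free
        while [ `pgrep -x vacuumdb | wc -l` -ge $PLL ]; do
            sleep 1
        done
        vacuumdb -d $DB -t "$T" > /tmp/vac.$DB.$T.log 2>&1 &
    done
    wait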
Re: [PERFORM] Parallel Vacuum
Mike, you're right as long as you're using a single disk :) Now, imagine you have more disks - the more I/O operations you can perform, the more CPU time you'll also need to process them :) until you fully use one CPU per 'vacuumdb' - and only then do you stop... Also, even when the CPU is not heavily used by vacuumdb, a single process is still not able to get the maximum performance out of the storage array, simply because you need several concurrent I/Os running in the system to reach maximum throughput. And even if the filesystem helps you here - it's not everything... The more concurrent writers you have, the higher the performance you reach (up to the real limit)... In my case I have a small storage array capable of more than 500MB/sec and say 5000 op/s. All my data are striped across all the array disks. A single 'vacuumdb' process here becomes more CPU-bound than I/O-bound, as it cannot fully load the storage array... So, the more vacuum processes I start in parallel, the faster I'll finish vacuuming the database. Best regards! -Dimitri On Thursday 22 March 2007 18:10, Michael Stone wrote: > On Thu, Mar 22, 2007 at 04:55:02PM +0100, Dimitri wrote: > >In my case I have several CPU on the server and quite powerful storage box > >which is not really busy with a single vacuum. So, my idea is quite simple > > - speed-up vacuum with parallel execution (just an algorithm): > > Vacuum is I/O intensive, not CPU intensive. Running more of them will > probably make things slower rather than faster, unless each thing you're > vacuuming has its own (separate) disks. The fact that your CPU isn't > pegged while vacuuming suggests that your disk is already your > bottleneck--and doing multiple sequential scans on the same disk will > definitely be slower than doing one. > > Mike Stone
Re: [PERFORM] Sunfire X4500 recommendations
On Friday 23 March 2007 03:20, Matt Smiley wrote: > My company is purchasing a Sunfire x4500 to run our most I/O-bound > databases, and I'd like to get some advice on configuration and tuning. > We're currently looking at: - Solaris 10 + zfs + RAID Z > - CentOS 4 + xfs + RAID 10 > - CentOS 4 + ext3 + RAID 10 > but we're open to other suggestions. > Matt, for Solaris + ZFS you may find answers to all your questions here: http://blogs.sun.com/roch/category/ZFS http://blogs.sun.com/realneel/entry/zfs_and_databases Remember to measure the log (WAL) activity and use a separate pool for the logs if needed. Also, RAID-Z is oriented more toward data protection than performance; RAID-10 should be a better choice... Rgds, -Dimitri
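To make the RAID-10-style recommendation concrete, the corresponding zpool layout is simply a stripe over mirror vdevs; a sketch with placeholder disk names (the commented line shows the RAID-Z2 alternative for comparison):

    # striped mirrors ("RAID-10" style): best random I/O, half the raw capacity
    zpool create dbpool mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0 mirror c0t2d0 c1t2d0

    # RAID-Z2 alternative: more usable space, fewer IOPS per vdev
    # zpool create dbpool raidz2 c0t0d0 c1t0d0 c0t1d0 c1t1d0 c0t2d0 c1t2d0

    # separate pool for the WAL if log activity warrants it
    zpool create logpool mirror c0t3d0 c1t3d0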
Re: [PERFORM] Parallel Vacuum
On Thursday 22 March 2007 19:46, Michael Stone wrote: > On Thu, Mar 22, 2007 at 07:24:38PM +0100, Dimitri wrote: > >you're right until you're using a single disk :) > >Now, imagine you have more disks > > I do have more disks. I maximize the I/O performance by dedicating > different sets of disks to different tables. YMMV. I do suggest watching > your I/O rates and wallclock time if you try this to see if your > aggregate is actually substantially faster than the single case. (I > assume that you haven't yet gotten far enough to actually do performance > testing.) You may also want to look into tuning your sequential I/O > performance. > > Mike Stone

Mike, especially for you :)

Parallel Vacuum Test
====================

- Database 'db_OBJ': PgSQL 8.2.3
  tables: object1, object2, ... object8 (all the same)
  volume: 10,000,000 rows in each table, 22GB in total

- Script Mono Vacuum
$ cat vac_mono.sh
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object1
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object2
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object3
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object4
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object5
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object6
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object7
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object8
$

- Script Parallel Vacuum
$ cat vac_pll.sh
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object1 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object2 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object3 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object4 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object5 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object6 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object7 &
/usr/local/pgsqlFF/bin/vacuumdb -p 5464 -d db_OBJ -t object8 &
wait
$

Test 1: Cold Clean database (already previously vacuumed)
=========================================================
Scenario:
- stop database
- flush FS cache (umount/mount)
- start database
- execute vacuum script

$ time sh vac_mono.sh
real 4m24.23s
user 0m0.00s
sys  0m0.01s

$ time sh vac_pll.sh
real 1m9.36s
user 0m0.00s
sys  0m0.01s

Test 2: Hot Dirty database (modified and not vacuumed)
======================================================
Scenario:
- stop database
- flush FS cache (umount/mount)
- start database
- execute 200,000 updates against each of the 8 object tables
- execute vacuum script

$ time sh vac_mono.sh
real 9m36.90s
user 0m0.00s
sys  0m0.01s

$ time sh vac_pll.sh
real 2m10.41s
user 0m0.00s
sys  0m0.02s

The x4 speed-up is obtained simply because a single vacuum process reaches at most 80MB/sec of throughput, while with 8 parallel vacuum processes I'm jumping to 360MB/sec... And speaking about sequential I/O: while you're doing reads, the file system may prefetch the incoming data, so that by the time you request the next read your data are already in the FS cache. However, the file system cannot 'pre-write' data for you - so having more concurrent writers helps a lot! (Of course, only if you have storage configured to handle concurrent I/O :)) Well, why all this stuff?... Let's imagine you need more performance and you buy a storage box that is 10x more powerful - will you still be able to saturate it with single-process I/O activity? No... :) To scale well you need to be able to split your work into several tasks executed in parallel. And personally, I'm very happy we can now do this with vacuum - one of the most critical parts of PostgreSQL... Best regards! -Dimitri
Re: [PERFORM] Sunfire X4500 recommendations
On Friday 23 March 2007 14:32, Matt Smiley wrote: > Thanks Dimitri! That was very educational material! I'm going to think > out loud here, so please correct me if you see any errors. Your mail is so long that I was unable to answer all the questions the same day :)) > > The section on tuning for OLTP transactions was interesting, although my > OLAP workload will be predominantly bulk I/O over large datasets of > mostly-sequential blocks. I suppose mostly READ operations, right? > > The NFS+ZFS section talked about the zil_disable control for making zfs > ignore commits/fsyncs. Given that Postgres' executor does single-threaded > synchronous I/O like the tar example, it seems like it might benefit > significantly from setting zil_disable=1, at least in the case of > frequently flushed/committed writes. However, zil_disable=1 sounds unsafe > for the datafiles' filesystem, and would probably only be acceptible for > the xlogs if they're stored on a separate filesystem and you're willing to > loose recently committed transactions. This sounds pretty similar to just > setting fsync=off in postgresql.conf, which is easier to change later, so > I'll skip the zil_disable control. Yes, you don't need it for PostgreSQL; it may be useful for other database vendors, but not here. > > The RAID-Z section was a little surprising. It made RAID-Z sound just like > RAID 50, in that you can customize the trade-off between iops versus usable > diskspace and fault-tolerance by adjusting the number/size of > parity-protected disk groups. The only difference I noticed was that > RAID-Z will apparently set the stripe size across vdevs (RAID-5s) to be as > close as possible to the filesystem's block size, to maximize the number of > disks involved in concurrently fetching each block. Does that sound about > right? Well, look at RAID-Z simply as a wide RAID solution. The more disks you have in your system, the higher the probability of losing 2 disks at the same time, and in that case a wide RAID-10 will simply lose you the whole data set (that is, if you lose both disks of the same mirror pair). So, RAID-Z brings you more safety, as you may use wider parity, but the price for it is I/O performance... > > So now I'm wondering what RAID-Z offers that RAID-50 doesn't. I came up > with 2 things: an alleged affinity for full-stripe writes and (under > RAID-Z2) the added fault-tolerance of RAID-6's 2nd parity bit (allowing 2 > disks to fail per zpool). It wasn't mentioned in this blog, but I've heard > that under certain circumstances, RAID-Z will magically decide to mirror a > block instead of calculating parity on it. I'm not sure how this would > happen, and I don't know the circumstances that would trigger this > behavior, but I think the goal (if it really happens) is to avoid the > performance penalty of having to read the rest of the stripe required to > calculate parity. As far as I know, this is only an issue affecting small > writes (e.g. single-row updates in an OLTP workload), but not large writes > (compared to the RAID's stripe size). Anyway, when I saw the filesystem's > intent log mentioned, I thought maybe the small writes are converted to > full-stripe writes by deferring their commit until a full stripe's worth of > data had been accumulated. Does that sound plausible? The problem here is that within the same workload you can do fewer I/O operations with RAID-Z than with RAID-10. So, whether your I/O block size is bigger or smaller, you'll still obtain lower throughput, no? :) > > Are there any other noteworthy perks to RAID-Z, rather than RAID-50? If > not, I'm inclined to go with your suggestion, Dimitri, and use zfs like > RAID-10 to stripe a zpool over a bunch of RAID-1 vdevs. Even though many > of our queries do mostly sequential I/O, getting higher seeks/second is > more important to us than the sacrificed diskspace. There is still one point to check: if you mostly READ from your database, RAID-Z will probably not be *too* bad and will give you more usable space. However, if you need to update your data or load frequently, RAID-10 will be better... > > For the record, those blogs also included a link to a very helpful ZFS Best > Practices Guide: > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide Oh yes, it's a constantly growing wiki, a good starting point for any Solaris question as well as performance points :) > > To sum up, so far the short list of tuning suggestions for ZFS includes: > - Use a separate zpool and filesystem for xlogs if your apps write often. > - Consider setting zil_disable=1 on the xlogs' dedicated filesystem. ZIL > is the intent log, and it sounds like disabling it may be like d
Re: [PERFORM] Shared buffers, db transactions commited, and write IO on Solaris
>> > Erik, > > using 'forcedirectio' simply brings your write operations to the > *real* volume - means while you need to write 10 bytes you'll write 10 > bytes (instead of UFS block size (8K)). So it explains me why your > write volume became slower. I meant 'lower' (not slower). Sorry, that's not true. Google "ufs forcedirectio", go to the first link and you will find: "forcedirectio The forcedirectio (read "force direct IO") UFS option causes data to be buffered in kernel address whenever data is transferred between user address space and the disk. In other words, it bypasses the file system cache. For certain types of applications -- primarily database systems -- this option can dramatically improve performance. In fact, some database experts have argued that a file using the forcedirectio option will outperform a raw partition, though this opinion seems fairly controversial. The forcedirectio improves file system performance by eliminating double buffering, providing a small, efficient code path for file system reads and writes and removing pressure on memory." Erik, please don't take me wrong, but reading Google (or better, the man pages) doesn't replace brains and basic practice... The Direct IO option is not a silver bullet which will solve all your problems (try to do a 'cp' on a filesystem mounted with 'forcedirectio', or keep your mailbox on it - you'll quickly understand the impact)... However, what this does mean is that writes will be at the actual filesystem block size and not the cache block size (8K v. 512K). While a UFS filesystem is mounted normally, it uses its own cache for all operations (read and write) and saves data modifications on a per-page basis, meaning: when a process writes 200 bytes there will be 200 bytes modified in the cache, then the whole page (8K) is written once the data is required to be flushed (and the WAL is written on each commit)... Now, mounted with the 'forcedirectio' option, UFS is free of the page size constraint and will write, like a raw device, exactly the demanded amount of data, meaning: when a process writes 200 bytes it'll write exactly 200 bytes to the disk. For the WAL it may be a real benefit, because you'll be able to perform more I/O operations/sec, which means more commits/sec. But at the same time it may dramatically slow down all your read operations (no more data prefetch nor dynamic cache)... The best solution is probably to separate the WAL from the data (BTW, it would be nice to have an option such as WAL_PATH in the conf file); it may be solved by simple use of a tablespace, or at least directory links, etc. But if your major activity is writing - it's probably already OK for you. However, to understand the TX number mystery I think the only possible solution is to reproduce a small live test: (I'm sure you're aware you can mount/unmount forcedirectio dynamically?) during a stable workload do: # mount -o remount,logging /path_to_your_filesystem and check if the I/O volume is increasing as well as the TX numbers, then come back: # mount -o remount,forcedirectio /path_to_your_filesystem and see if the I/O volume is decreasing as well as the TX numbers... Best regards! -Dimitri > > Now, why TX number is reduced - is a small mystery :) > > Options: > - you really do 10 times less commits, means you work 10 times > slower? ;) > what about users? how do you measure your work performance? We are an email marketing service provider with a web front end application. We measure work performance via web requests (counts, types, etc...), mailer activity and the resulting database activity. We are doing as much or more work now than previously, and faster. > > - TX reported in pg_* tables are not exact, but I don't believe > at all :) Even if they aren't exact, being off by a factor of 10 wouldn't be believable. The forcedirectio mount setting for ufs can definitely explain the drop in data written volume, but doesn't do much to explain the difference in xact commits. erik jones <[EMAIL PROTECTED]> software developer 615-296-0838 emma(r)
Re: [PERFORM] Shared buffers, db transactions commited, and write IO on Solaris
You are right in that the page size constraint is lifted in that directio cuts out the VM filesystem cache. However, the Solaris kernel still issues io ops in terms of its logical block size (which we have at the default 8K). It can issue io ops for fragments as small as 1/8th of the block size, but Postgres issues its io requests in terms of the block size which means that io ops from Postgres will be in 8K chunks which is exactly what we see when we look at our system io stats. In fact, if any io request is made that isn't a multiple of 512 bytes (the disk sector size), the file system switches back to the buffered io. Oh, yes, of course! You still need to respect the multiple-of-512-bytes block size on read and write - sorry, I was tired :) Then it seems to be true - the default XLOG block size is 8K, which means that for even the smallest auto-committed transaction we write 8K?... Is there any reason to use such a big default block size?... Perhaps it would be a good idea to make it an 'initdb' parameter, so that the value could be set per database server? Rgds, -Dimitri > > However, to understand TX number mystery I think the only possible > solution > is to reproduce a small live test: > > (I'm sure you're aware you can mount/unmount forcedirectio > dynamically?) > > during stable workload do: > > # mount -o remount,logging /path_to_your_filesystem > > and check if I/O volume is increasing as well TX numbers > than come back: > > # mount -o remount,forcedirectio /path_to_your_filesystem > > and see if I/O volume is decreasing as well TX numbers... That's an excellent idea and I'll run it by the rest of our team tomorrow. erik jones <[EMAIL PROTECTED]> software developer 615-296-0838 emma(r)
Re: [PERFORM] Shared buffers, db transactions commited, and write IO on Solaris
The problem is that while your goal is to commit as fast as possible, it's a pity to waste I/O speed just to keep a common block size... Let's say your transaction's modifications fit into 512 bytes - you'd be able to write many more 512-byte blocks per second than 8K blocks per second (for the same amount of data)... Even if we probably rewrite the same block several times with incoming transactions, it still costs traffic, and we will process more slowly even though the H/W can do better. I don't think that's good, no? ;) Rgds, -Dimitri On 3/30/07, Erik Jones <[EMAIL PROTECTED]> wrote: On Mar 30, 2007, at 8:14 AM, Dimitri wrote: >> [...] > Then it's seems to be true - default XLOG block size is 8K, means for > every even small auto-committed transaction we should write 8K?... Is > there any reason to use so big default block size?... > > Probably it may be a good idea to put it as 'initdb' parameter? and > have such value per database server? I believe it's because that is a pretty normal Unix kernal block size and you want the two to match. erik jones <[EMAIL PROTECTED]> software developer 615-296-0838 emma(r)
Re: [PERFORM] Shared buffers, db transactions commited, and write IO on Solaris
Folks, to close the topic of the "LOG block size=1K" idea - I took the time to test it (yes), and in the best case there is only a 15% gain compared to 8K - the storage protocol is quite heavy by itself, so sending less or more data within it doesn't reduce the service time much... Moreover, even this gain decreases quickly with a growing workload! So, yes, 8K is good enough and probably the optimal choice for the LOG (as well as the data) block size. Rgds, -Dimitri Well, to check whether there is a real potential gain, all we need is a small comparative test using PgSQL compiled with a LOG block size of, say, 1K, and direct IO enabled. Rgds, -Dimitri On 3/30/07, Kenneth Marshall <[EMAIL PROTECTED]> wrote: > On Fri, Mar 30, 2007 at 04:25:16PM +0200, Dimitri wrote: > > [...] > > With block sizes you are always trading off overhead versus space > efficiency. Most OS write only in 4k/8k to the underlying hardware > regardless of the size of the write you issue. Issuing 16 512byte > writes has much more overhead than 1 8k write. On the light transaction > end, there is no real benefit to a small write and it will slow > performance for high throughput environments. It would be better to, > and I think that someone is looking into, batching I/O. > > Ken
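For anyone wanting to reproduce this kind of test: the WAL block size is a compile-time constant, not an initdb or runtime setting, so a rebuild is needed. A rough sketch, assuming an 8.2-era source tree where the constant lives in src/include/pg_config_manual.h (verify the name and location in your own sources, and note that a cluster initialized with one WAL block size cannot be reused with another):

    # hypothetical rebuild with a 1K WAL block size
    cd postgresql-8.2.3
    # change "#define XLOG_BLCKSZ 8192" to "#define XLOG_BLCKSZ 1024"
    vi src/include/pg_config_manual.h
    ./configure --prefix=/usr/local/pgsql-xlog1k
    make && make install
    /usr/local/pgsql-xlog1k/bin/initdb -D /data/pgtest_xlog1k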
[PERFORM] Equivalents in PostgreSQL of MySQL's "ENGINE=MEMORY" "MAX_ROWS=1000"
Probably another helpful solution would be to implement: ALTER TABLE LOGGING OFF/ON; just to disable/enable the WAL? First, it would help in all cases of intensive data loading, where you currently slow down other sessions with the increased WAL activity. Then you'd have a way to implement MEMORY-like tables on a RAM-disk tablespace (well, you still need to take care of dropping them yourself :)) However, if we talk about the performance of a MEMORY table - it should be much better with Tom's solution of big temp buffers rather than a RAM disk... The strong point of a MEMORY table implementation is that it *knows* it sits in RAM! and that completely changes all the I/O logic... BTW, before NDB was bought by MySQL we ran a benchmark to reach the highest possible TPS numbers with it. We got 1,500,000 TPS(!) (yes, one and a half million per second!) - knowing that all current TPC records are measured in thousands of transactions per minute, you see the impact... And of course, for my own education, I tried to do the same with other database vendors, running only SELECT queries and placing the tablespaces on a RAM disk... After trying all possible combinations I was still *very* far away :)) MEMORY databases are something like a parallel world - very interesting, but very different :)) Rgds, -Dimitri On 4/3/07, A.M. <[EMAIL PROTECTED]> wrote: On Apr 3, 2007, at 16:00 , Alan Hodgson wrote: > On Tuesday 03 April 2007 12:47, "A.M." > <[EMAIL PROTECTED]> wrote: >> On Apr 3, 2007, at 15:39 , C. Bergström wrote: >> I would like to use transactional semantics over tables that can >> disappear whenever the server fails. memcached does not offer that. > > How would temporary tables? The only difference between temporary tables and standard tables is the WAL. Global temporary tables would be accessible by all sessions and would be truncated on postmaster start. For a further potential speed boost, global temp tables could be put in a ramdisk tablespace. Well, that's at least how I envision them. Cheers, M
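For illustration, this is roughly what the RAM-disk tablespace workaround mentioned above looks like today (a sketch with hypothetical paths and table names; a tmpfs/ramdisk mount stands in for "memory", and anything placed there must be recreated by hand after a crash or reboot):

    # a directory on a tmpfs / RAM-disk mount, owned by the postgres user
    mkdir /ramdisk/pg_ram_ts
    chown postgres /ramdisk/pg_ram_ts

    # tablespace plus a throw-away table living entirely in RAM
    psql -d db1 -c "CREATE TABLESPACE ram_ts LOCATION '/ramdisk/pg_ram_ts'"
    psql -d db1 -c "CREATE TABLE hot_cache (k int PRIMARY KEY, v text) TABLESPACE ram_ts"

    # the temp-buffer alternative discussed above (per session, so shown here in one call)
    psql -d db1 -c "SET temp_buffers = '256MB'; CREATE TEMP TABLE hot_tmp AS SELECT * FROM some_table"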
Re: [PERFORM] Equivalents in PostgreSQL of MySQL's "ENGINE=MEMORY" "MAX_ROWS=1000"
Wow, that's excellent! :)) Probably the next step is: ALTER TABLE CACHE ON/OFF; just to force keeping a given table in the cache. What do you think?... Rgds, -Dimitri On 4/5/07, Josh Berkus wrote: Dimitri, > Probably another helpful solution may be to implement: > >ALTER TABLE LOGGING OFF/ON; > > just to disable/enable WAL? Actually, a patch similar to this is currently in the queue for 8.3. -- Josh Berkus PostgreSQL @ Sun San Francisco
Re: [PERFORM] Bad RAID1 read performance
As there is no 'contiguous space' option on ext3/ext2 (or perhaps "-f fragment_size" may do the trick?) - I think that after some filesystem activity you simply lose contiguous space allocation, and the expected sequential reads may be transformed into random seeks over 'logically' sequential blocks... Rgds, -Dimitri On 5/30/07, Albert Cervera Areny <[EMAIL PROTECTED]> wrote: Hardware isn't very good I believe, and it's about 2-3 years old, but the RAID is Linux software, and though not very good the difference between reading and writing should probably be greater... (?) Would you set 512Kb readahead on both drives and RAID? I tried various configurations and none seemed to make a big difference. It seemed correct to me to set 512kb per drive and 1024kb for md0. On Wednesday, 30 May 2007 16:09, Luke Lonergan wrote: > This sounds like a bad RAID controller - are you using a built-in hardware > RAID? If so, you will likely want to use Linux software RAID instead. > > Also - you might want to try a 512KB readahead - I've found that is optimal > for RAID1 on some RAID controllers. > > - Luke > > On 5/30/07 2:35 AM, "Albert Cervera Areny" <[EMAIL PROTECTED]> wrote: > > Hi, > > after doing the "dd" tests for a server we have at work I obtained: > > Read: 47.20 Mb/s > > Write: 39.82 Mb/s > > Some days ago read performance was around 20Mb/s due to no readahead in > > md0 so I modified it using hdparm. However, it seems to me that being it > > a RAID1 read speed could be much better. These are SATA disks with 3Gb of > > RAM so I did 'time bash -c "dd if=/dev/zero of=bigfile bs=8k count=786432 > > && sync"'. File system is ext3 (I've read many times in the list that XFS > > is faster), but I don't want to change the file system right now. > > Modifing the readahead from the current 1024k to 2048k doesn't make any > > difference. Are there any other tweaks I can make? -- Albert Cervera Areny Dept. Informàtica Sedifa, S.L. Av. Can Bordoll, 149 08202 - Sabadell (Barcelona) Tel. 93 715 51 11 Fax. 93 715 51 12
[PERFORM] Some info to share: db_STRESS Benchmark results
Folks, I just wanted to share some benchmark results from one long performance study comparing MySQL, PostgreSQL and Oracle transaction throughput and engine scalability on a T2000 and a V890 (under Solaris). The Oracle results are removed (of course :), but the others are quite interesting... The findings are presented as they are, following the step-by-step learning and tuning curve :) So, you may find: - http://dimitrik.free.fr/db_STRESS.html - Benchmark kit description - http://dimitrik.free.fr/db_STRESS_BMK_Part1.html -- first main part - http://dimitrik.free.fr/db_STRESS_BMK_Part2_ZFS.html -- second part, including ZFS-specific tuning Tests were executed in Mar/Apr 2007 with the latest version at that time, 8.2.3. Due to limited spare time I was able to publish the results only now... Any comments are welcome! :) Best regards! -Dimitri
Re: [PERFORM] Some info to share: db_STRESS Benchmark results
Well, let's say I want to have compact graphs :) So, a few comments on the graphs: - Title: compact name of the test and execution conditions - X-axis: always represents the time scale - Y-axis: shows a value level (whatever it is) - Legend: gives you the value name and its metric (KB/s, op/s, TPS, etc.) TPS (transactions per second): - ALL-tps TR_all: all transactions (READ+WRITE) per second - ALL-tps TR_Read: READ-only tps level - ALL-tps TR_Write: WRITE-only tps level I must say I was more interested in tuning the databases than in documenting each of my steps... But well, without documenting there is no result :) Also, I did not initially intend to compare the databases (I don't know why, but it always starts a small war between DB vendors :)), and the results were so surprising that I just continued for as long as it was possible :)) Rgds, -Dimitri On 5/31/07, Alexander Staubo <[EMAIL PROTECTED]> wrote: On 5/31/07, Dimitri <[EMAIL PROTECTED]> wrote: > just wanted to share some benchmark results from one long performance > study comparing MySQL, PostgreSQL and Oracle transactions throughput > and engine scalability on T2000 and V890 (under Solaris). Interesting, if awfully cryptic. The lack of axis labels, the lack of axis normalization, and the fact that you put the graphs for different databases and parameters on separate pages makes it rather hard to compare the various results. Alexander.
Re: [PERFORM] Autodetect of software RAID1+0 fails
Craig, to make things work properly here you need to create a config file holding both the raid1 and raid0 information (/etc/mdadm/mdadm.conf). However, if your root filesystem gets corrupted, or you lose this file, or you move the disks somewhere else - you are back to the same initial issue :)) So, the solution I've found to work 100% in every case is: use mdadm to create the raid1 devices (as you do already) and then use LVM to create the raid0 volume on top of them - LVM writes its own labels on every MD device and will find its volume pieces automatically! I tested it through several crashes and was surprised by its robustness :)) (A sketch of this layout follows after Craig's message.) Rgds, -Dimitri On 6/1/07, Craig James <[EMAIL PROTECTED]> wrote: Apologies for a somewhat off-topic question, but... The Linux kernel doesn't properly detect my software RAID1+0 when I boot up. It detects the two RAID1 arrays, the partitions of which are marked properly. But it can't find the RAID0 on top of that, because there's no corresponding device to auto-detect. The result is that it creates /dev/md0 and /dev/md1 and assembles the RAID1 devices on bootup, but /dev/md2 isn't created, so the RAID0 can't be assembled at boot time. Here's what it looks like: $ cat /proc/mdstat Personalities : [raid0] [raid1] md2 : active raid0 md0[0] md1[1] 234436224 blocks 64k chunks md1 : active raid1 sde1[1] sdc1[2] 117218176 blocks [2/2] [UU] md0 : active raid1 sdd1[1] sdb1[0] 117218176 blocks [2/2] [UU] $ uname -r 2.6.12-1.1381_FC3 After a reboot, I always have to do this: mknod /dev/md2 b 9 2 mdadm --assemble /dev/md2 /dev/md0 /dev/md1 mount /dev/md2 What am I missing here? Thanks, Craig
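A minimal sketch of that mdadm + LVM layout (device names follow Craig's /proc/mdstat; the volume-group/LV names, stripe size and mount point are placeholders):

    # two RAID1 pairs, as before
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdd1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sde1

    # LVM stripe ("RAID0") across the two mirrors; the LVM labels written on
    # md0/md1 let the volume reassemble automatically at boot
    pvcreate /dev/md0 /dev/md1
    vgcreate pgvg /dev/md0 /dev/md1
    lvcreate -i 2 -I 64 -l 100%FREE -n pgdata pgvg    # 2 stripes, 64KB stripe size

    mkfs -t ext3 /dev/pgvg/pgdata
    mount /dev/pgvg/pgdata /postgres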
Re: [PERFORM] Data transfer very slow when connected via DSL
Hi Rainer, did you try to execute your query directly from 'psql'?... Why I'm asking: it seems to me your case probably just depends on network latency, and what I noticed during recent benchmarks with PostgreSQL is that a SELECT query becomes very traffic-hungry if you are using a CURSOR. The 'psql' program is implemented not to use a CURSOR by default, so it'll be easy to check whether you're hitting this issue just by executing your query remotely from 'psql'... Rgds, -Dimitri On 6/21/07, Rainer Bauer <[EMAIL PROTECTED]> wrote: Hello Tom, >I seem to recall that we've seen similar reports before, always >involving Windows :-(. Check whether you have any nonstandard >components hooking into the network stack on that machine. I just repeated the test by booting into "Safe Mode with Network Support", but the results are the same. So I don't think that's the cause. Apart from that, what response times could I expect? Rainer
Re: [PERFORM] Data transfer very slow when connected via DSL
Let's stay optimistic - at least now you know the main source of your problem! :)) Let's see now with a CURSOR... First, try this:
munnin=> \timing
munnin=> \set FETCH_COUNT 1
munnin=> select * from "tblItem";
What's the time you see here? (I think your application is working in this manner.) Now, change FETCH_COUNT to 10, then 50, then 100 - your query execution time should get better (at least I hope so :)) And if it's better - you simply need to modify your FETCH clause with an adapted "FORWARD #" value (the best example is the psql source code itself: see the ExecQueryUsingCursor function implementation in the file common.c)... Rgds, -Dimitri On 6/22/07, Rainer Bauer <[EMAIL PROTECTED]> wrote: Hello Dimitri, >but did you try to execute your query directly from 'psql' ?... munnin=>\timing munnin=>select * from "tblItem"; (50 rows) Time: 391,000 ms >Why I'm asking: seems to me your case is probably just network latency >dependent, and what I noticed during last benchmarks with PostgreSQL >the SELECT query become very traffic hungry if you are using CURSOR. >Program 'psql' is implemented to not use CURSOR by default, so it'll >be easy to check if you're meeting this issue or not just by executing >your query remotely from 'psql'... Yes, see also my other post. Unfortunatelly this means that using my program to connect via DSL to the Postgres database is not possible. Rainer
Re: [PERFORM] Data transfer very slow when connected via DSL
Tom, it seems to me the problem here is rather simple: the current issue depends completely on the low-level 'implementation' of the SELECT query in the application. In case it's implemented with "DECLARE ... CURSOR ..." and then "FETCH NEXT" by default (the most common case), it brings the application into a "ping-pong" condition with the database server: each next FETCH is possible only once the previous one has finished and the server has received feedback from the client with an explicit fetch-next order. In this condition the query response time becomes completely network-latency dependent:
- each packet send/receive has a significant cost
- you cannot reduce this cost, as you cannot group more data within a single packet, and you waste your traffic
- that's why TCP_NODELAY becomes so important here
- with 150ms network latency the cost is ~300ms per FETCH (15 sec(!) for 50 rows)
You may think that if you're working on a LAN and your network latency is 0.1ms you're not concerned by this issue - but in reality, yes, you're impacted! Each network card/driver has its own maximum packets/sec capability (independent of volume), and once you hit it your response time can only degrade with more concurrent sessions (even if your CPU usage is still low)... The solution here is simple:
- don't use a CURSOR in simple cases when you're just reading/printing a SELECT result
- in case it's too late to adapt your code, or you absolutely need a CURSOR for some reason: replace the default "FETCH" or "FETCH NEXT" with "FETCH 100" (100 rows will generally be enough); normally it'll work straight away (otherwise check that you're verifying the PQntuples() value correctly and looping to read all tuples)
To keep the default network workload more optimal, I think we need to make "FETCH N" more popular with developers and enable it (even hidden) by default in ODBC/JDBC and other generic modules... Rgds, -Dimitri On 6/22/07, Tom Lane <[EMAIL PROTECTED]> wrote: Rainer Bauer <[EMAIL PROTECTED]> writes: > Fetching the 50 rows takes 12 seconds (without logging 8 seconds) and > examining the log I found what I suspected: the performance is directly > related to the ping time to the server since fetching one tuple requires a > round trip to the server. Hm, but surely you can get it to fetch more than one row at once? This previous post says that someone else solved an ODBC performance problem with UseDeclareFetch=1: http://archives.postgresql.org/pgsql-odbc/2006-08/msg00014.php It's not immediately clear why pgAdmin would have the same issue, though, because AFAIK it doesn't rely on ODBC. I just finished looking through our archives for info about Windows-specific network performance problems. There are quite a few threads, but the ones that were solved seem not to bear on your problem (unless the one above does). I found one pretty interesting thread suggesting that the problem was buffer-size dependent: http://archives.postgresql.org/pgsql-performance/2006-12/msg00269.php but that tailed off with no clear resolution. I think we're going to have to get someone to watch the problem with a packet sniffer before we can get much further. regards, tom lane
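As a small illustration of the batched-cursor pattern described above (a sketch against Rainer's "tblItem" table; in a real application the same statements would be sent through libpq or the ODBC driver, repeating the FETCH until it returns no more rows):

    psql -d munnin <<'EOF'
    BEGIN;
    DECLARE cur_items CURSOR FOR SELECT * FROM "tblItem";
    FETCH FORWARD 100 FROM cur_items;   -- one network round trip per 100 rows
    FETCH FORWARD 100 FROM cur_items;   -- repeat until an empty result comes back
    CLOSE cur_items;
    COMMIT;
    EOF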
Re: [PERFORM] Data transfer very slow when connected via DSL
Rainer, but did you try the initial query with FETCH_COUNT equal to 100?... Rgds, -Dimitri On 6/22/07, Rainer Bauer <[EMAIL PROTECTED]> wrote: Hello Dimitri, >Rainer, but did you try initial query with FETCH_COUNT equal to 100?... Yes I tried it with different values and it's like you suspected:
FETCH_COUNT 1    Time: 8642,000 ms
FETCH_COUNT 5    Time: 2360,000 ms
FETCH_COUNT 10   Time: 1563,000 ms
FETCH_COUNT 25   Time: 1329,000 ms
FETCH_COUNT 50   Time: 1140,000 ms
FETCH_COUNT 100  Time: 969,000 ms
\unset FETCH_COUNT  Time: 390,000 ms
Rainer
Re: [PERFORM] Data transfer very slow when connected via DSL
Hello Rainer, initially I was surprised that you did not match the non-CURSOR time with FETCH 100, but after thinking about it a little the explanation is very simple - let's analyze what's going on in both cases:
Without CURSOR:
1.) app calls PQexec() with "Query" and waits for the result
2.) PG sends the result to the app; the data arrive grouped into the biggest possible packets, so the network latency is hidden by the huge amount of data per single send
With CURSOR and FETCH 100:
1.) app calls PQexec() with "BEGIN" and waits
2.) PG sends ok
3.) app calls PQexec() with "DECLARE cursor FOR Query" and waits
4.) PG sends ok
5.) app calls PQexec() with "FETCH 100" and waits
6.) PG sends the result of 100 rows to the app; the data arrive grouped into the biggest possible packets, so the network latency is hidden by the large data amount per single send
7.) no more data (as you have only 50 rows in the output), so the app calls PQexec() with "CLOSE cursor" and waits
8.) PG sends ok
9.) app calls PQexec() with "COMMIT" and waits
10.) PG sends ok
As you see, the difference is huge, and each step adds a network-latency delay. So, with "FETCH 100" we save only the cost of steps 5 and 6 (the default "FETCH 1" would loop here over all 50 rows, adding the 50x latency delay again). But we cannot avoid the cost of the other steps, as they need to be executed one by one to keep the execution logic and clean error handling... I hope it's clearer now, and at least there is a choice :)) Also, if your query result were 500 rows (for example), I think the difference between the non-CURSOR and "FETCH 500" executions would be less important... Rgds, -Dimitri On 6/22/07, Rainer Bauer <[EMAIL PROTECTED]> wrote: Hello Dimitri, >Rainer, but did you try initial query with FETCH_COUNT equal to 100?... Yes I tried it with different values and it's like you suspected: FETCH_COUNT 1 Time: 8642,000 ms [...] FETCH_COUNT 100 Time: 969,000 ms \unset FETCH_COUNT Time: 390,000 ms Rainer
Re: [PERFORM] Data transfer very slow when connected via DSL
Rainer, looking through the psqlODBC source code, it seems to work in a similar way and has an option "SQL_ROWSET_SIZE" to execute the FETCH query in the same way as "FETCH_COUNT" does in psql. Try setting it to 100 and let's see if it gets better... Rgds, -Dimitri On 6/22/07, Rainer Bauer <[EMAIL PROTECTED]> wrote: Hello Joshua, >That opens up some questions. What ODBC driver are you using (with exact >version please). psqlODBC 8.2.4.2 (build locally). I have restored the 8.2.4.0 from the official msi installer, but the results are the same. Rainer
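For reference, with psqlODBC this kind of batching is usually switched on through the DSN or connection string; a sketch (driver name, host and credentials are placeholders - the relevant options are UseDeclareFetch and Fetch, i.e. the number of rows pulled per round trip; worth double-checking against the installed driver version):

    Driver={PostgreSQL Unicode};Server=db.example.com;Port=5432;Database=munnin;Uid=user;Pwd=secret;UseDeclareFetch=1;Fetch=100;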
[PERFORM] Filesystem Direct I/O and WAL sync option
All, I'm very curious to know whether we may expect or guarantee any data consistency with WAL sync=off but with the file system mounted in Direct I/O mode (meaning every write() system call issued by PG really writes to disk before returning)... So may we expect data consistency:
- none?
- per checkpoint?
- full?...
Thanks a lot for any info! Rgds, -Dimitri
Re: [PERFORM] Filesystem Direct I/O and WAL sync option
Yes, the disk drives also have their caches disabled, or the cache sits on the controllers and is battery-protected (in the case of higher-end storage) - but is that enough to expect data consistency?... (I was surprised about the checkpoint sync - but doesn't it always call write() anyway? Because in that case it should work even without fsync)... On 7/3/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote: Dimitri wrote: > I'm very curious to know if we may expect or guarantee any data > consistency with WAL sync=OFF but using file system mounted in Direct > I/O mode (means every write() system call called by PG really writes > to disk before return)... You'd have to turn that mode on on the data drives as well to get consistency, because fsync=off disables checkpoint fsyncs of the data files as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PERFORM] Filesystem Direct I/O and WAL sync option
Yes Gregory, that's why I'm asking - because from 1800 transactions/sec I'm jumping to 2800 transactions/sec, and that's a more than significant performance increase :)) Rgds, -Dimitri On 7/4/07, Gregory Stark <[EMAIL PROTECTED]> wrote: "Dimitri" <[EMAIL PROTECTED]> writes: > Yes, disk drives are also having cache disabled or having cache on > controllers and battery protected (in case of more high-level > storage) - but is it enough to expect data consistency?... (I was > surprised about checkpoint sync, but does it always calls write() > anyway? because in this way it should work without fsync)... Well if everything is mounted in sync mode then I suppose you have the same guarantee as if fsync were called after every single write. If that's true then surely that's at least as good. I'm curious how it performs though. Actually it seems like in that configuration fsync should be basically zero-cost. In other words, you should be able to leave fsync=on and get the same performance (whatever that is) and not have to worry about any risks. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [PERFORM] Filesystem Direct I/O and WAL sync option
Gregory, thanks for the good questions! :)) They shed more light on my throughput here :)) The running OS is Solaris 9 (the customer is still not ready to upgrade to Solaris 10), and I think the main "sync" issue comes from the old UFS implementation... UFS mounted with the 'forcedirectio' option uses different "sync" logic and also accepts concurrent writes to the same file, which gives a higher performance level here. I did not expect such a big gain, so I had not thought to replay the same test with direct I/O on and fsync=on too. To my big surprise - it also reached 2800 tps, just as with fsync=off !!! So the initial question is no longer valid :)) Also, my tests are executed just to validate the server + storage capabilities, and honestly it's a real pity to see them used under an old Solaris version :)) but well, at least we know what kind of performance they may expect currently, and they can think about a migration before the end of this year... Seeing at least 10,000 random writes/sec on the storage sub-system during the live database test was very pleasant for the customer and made them feel comfortable about their production... Thanks a lot for all your help! Best regards! -Dimitri On 7/4/07, Gregory Stark <[EMAIL PROTECTED]> wrote: "Dimitri" <[EMAIL PROTECTED]> writes: > Yes Gregory, that's why I'm asking, because from 1800 transactions/sec > I'm jumping to 2800 transactions/sec! and it's more than important > performance level increase :)) wow. That's kind of suspicious though. Does the new configuration take advantage of the lack of the filesystem cache by increasing the size of shared_buffers? Even then I wouldn't expect such a big boost unless you got very lucky with the size of your working set compared to the two sizes of shared_buffers. It seems likely that somehow this change is not providing the same guarantees as fsync. Perhaps fsync is actually implementing IDE write barriers and the sync mode is just flushing buffers to the hard drive cache and then returning. What transaction rate do you get if you just have a single connection streaming inserts in autocommit mode? What kind of transaction rate do you get with both sync mode on and fsync=on in Postgres? And did you say this with a battery backed cache? In theory fsync=on/off and shouldn't make much difference at all with a battery backed cache. Stranger and stranger. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [PERFORM] Filesystem Direct I/O and WAL sync option
Yes, I tried all the WAL sync methods, but there was no difference... However, there was a huge difference when I ran the same tests under Solaris 10 - the 'fdatasync' option gave the best performance level. At the same time, direct I/O did not make any difference on Solaris 10 :) So the main rule is - there is no universal rule :) Just adapt your system options to your workload... Direct I/O will generally speed up write operations by avoiding the buffer-flushing overhead and by allowing concurrent writes (breaking the POSIX limitation of a single writer per given file at any time). But at the same time it may slow down your read operations, and you may need a 64-bit PG version to use a big cache and still keep the same performance level on SELECT queries. And again, there are other file systems, like QFS (for example), which may give you the best of both worlds: direct writes and buffered reads at the same time! etc. etc. etc. :) Rgds, -Dimitri On 7/9/07, Jonah H. Harris <[EMAIL PROTECTED]> wrote: On 7/9/07, Jim C. Nasby <[EMAIL PROTECTED]> wrote: > BTW, it might be worth trying the different wal_sync_methods. IIRC, > Jonah's seen some good results from open_datasync. On Linux, using ext3, reiser, or jfs, I've seen open_sync perform quite better than fsync/fdatasync in most of my tests. But, I haven't done significant testing with direct I/O lately. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation| fax: 732.331.1301 33 Wood Ave S, 3rd Floor| [EMAIL PROTECTED] Iselin, New Jersey 08830| http://www.enterprisedb.com/
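To make the knobs discussed in this thread concrete, here is roughly where they live (a sketch: device names and the mount point are placeholders, and as noted above the best wal_sync_method value is workload- and platform-specific, so re-test rather than copy):

    # postgresql.conf - pick one method and measure
    wal_sync_method = fdatasync      # alternatives: open_datasync, fsync, open_sync

    # /etc/vfstab - UFS filesystem holding the WAL, mounted with direct I/O
    # device to mount    device to fsck      mount point  FS   pass  boot  options
    /dev/dsk/c1t0d0s5  /dev/rdsk/c1t0d0s5  /pgwal  ufs  2  yes  forcedirectio,logging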
Re: [PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...
Luke, the ZFS tuning advice does not come from general suggestions, but from real practice... So:
- limiting the ARC is a MUST for the moment to keep your database running comfortably (especially a DWH!)
- the 8K recordsize is chosen so that ZFS reads exactly one page when PG asks to read one page - don't mix this up with prefetch! When a prefetch pattern is detected, ZFS will read the next blocks without any demand from PG; but otherwise, why read more pages each time PG is asking for only one?...
- prefetch is of course not needed for OLTP, but it helps on OLAP/DWH, agreed :)
Rgds, -Dimitri On 7/22/07, Luke Lonergan <[EMAIL PROTECTED]> wrote: > Josh, > > On 7/20/07 4:26 PM, "Josh Berkus" <[EMAIL PROTECTED]> wrote: > > > There are some specific tuning parameters you need for ZFS or performance > > is going to suck. > > > > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide > > (scroll down to "PostgreSQL") > > http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp > > http://bugs.opensolaris.org/view_bug.do?bug_id=6437054 > > > > You also don't say anything about what kind of workload you're running. > > > I think we're assuming that the workload is OLTP when putting these tuning > guidelines forward. Note that the ZFS tuning guidance referred to in this > bug article recommend "turning vdev prefetching off" for "random I/O > (databases)". This is exactly the opposite of what we should do for OLAP > workloads. > > Also, the lore that setting recordsize on ZFS is mandatory for good database > performance is similarly not appropriate for OLAP work. > > If the workload is OLAP / Data Warehousing, I'd suggest ignoring all of the > tuning information from Sun that refers generically to "database". The > untuned ZFS performance should be far better in those cases. Specifically, > these three should be ignored: > - (ignore this) limit ARC memory use > - (ignore this) set recordsize to 8K > - (ignore this) turn off vdev prefetch > > - Luke
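For reference, the two main tunables mentioned above look roughly like this on Solaris 10 (the dataset name and ARC cap are placeholders; the recordsize only applies to files written after the change, so it is best set before loading data):

    # match the ZFS recordsize to the PostgreSQL page size on the data filesystem
    zfs set recordsize=8K tank/pgdata

    # cap the ARC (example: 4GB) so it does not fight the PG buffers for memory -
    # this line goes into /etc/system and takes effect after a reboot
    set zfs:zfs_arc_max = 0x100000000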
Re: [PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...
Hi Luke, on the same page of the Solaris Internals wiki you may find links to the study with the db_STRESS benchmark (done on UFS and ZFS with PostgreSQL, MySQL and Oracle (well, the Oracle results are removed, but at least I may say it went through the same tuning as PgSQL)). Tests were done on Sol10u3 (and you may find all other platform details in the report document)... Also, while the block size adjustment is more or less transparent (don't read 32K if you need only 8K - with a huge data volume you'll simply waste your cache; and in case you're doing a full scan, let the prefetch algorithm work for you), the ARC (cache) limitation probably needs more light. Well, I cannot even say there is a problem with it - its implementation is just too aggressive :)) If all your running programs fit into 1GB of RAM, you may leave the ARC size at its default (it leaves 1GB of system RAM free). Otherwise, you should limit the ARC to keep your workload execution comfortable: the ARC allocates memory very quickly, and every time your program needs more RAM it enters into competition with the ARC... In my tests I observed short workload freezes during such periods, and I did not like it too much :)) especially with high connection numbers :)) Well, we could spend hours discussing this :) (sorry to be short, I have very limited mail access at the moment)... However, ZFS is improving all the time and works better and better with every Solaris release, so probably all the current tuning will be different or obsolete by the end of this year :)) BTW, I forgot to mention: you'll need Solaris 10u4, or at least 10u3 with all recent patches applied, to run the M8000 at full power. Best regards! -Dimitri On 7/30/07, Luke Lonergan <[EMAIL PROTECTED]> wrote: > Hi Dimitri, > > Can you post some experimental evidence that these settings matter? > > At this point we have several hundred terabytes of PG databases running on > ZFS, all of them setting speed records for data warehouses. > > We did testing on these settings last year on S10U2, perhaps things have > changed since then. > > - Luke > > Msg is shrt cuz m on ma treo > > [...]
Re: [PERFORM] optimize query with a maximum(date) extraction
BTW, would it help if you changed your index to "my_table( id, the_date )"? Rgds, -Dimitri On 9/5/07, JS Ubei <[EMAIL PROTECTED]> wrote: > Hi all, > > I need to improve a query like: > > SELECT id, min(the_date), max(the_date) FROM my_table GROUP BY id; > > Stupidly, I created a B-tree index on my_table(the_date), which is logically > not used in my query, because it isn't compared against a constant - isn't it? > > I know that I can't create a functional index with an aggregate function. > > What can I do? > > thanks, > > jsubei > > ---(end of broadcast)--- > TIP 5: don't forget to increase your free space map settings > ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
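For what it's worth, here is a sketch of that composite-index idea combined with a query form that can actually exploit it on 8.x: min()/max() with an equality condition on id can be turned into small index scans on my_table(id, the_date). The index name below is made up, and whether the planner really uses the index for each subquery should be verified with EXPLAIN ANALYZE rather than assumed.

CREATE INDEX my_table_id_date_idx ON my_table (id, the_date);
ANALYZE my_table;

SELECT t.id,
       (SELECT min(the_date) FROM my_table m WHERE m.id = t.id) AS min_date,
       (SELECT max(the_date) FROM my_table m WHERE m.id = t.id) AS max_date
  FROM (SELECT DISTINCT id FROM my_table) t;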
Re: [PERFORM] 8K recordsize bad on ZFS?
Josh, it'll be great if you explain how did you change the records size to 128K? - as this size is assigned on the file creation and cannot be changed later - I suppose that you made a backup of your data and then process a full restore.. is it so? Rgds, -Dimitri On 5/8/10, Josh Berkus wrote: > Jignesh, All: > > Most of our Solaris users have been, I think, following Jignesh's advice > from his benchmark tests to set ZFS page size to 8K for the data zpool. > However, I've discovered that this is sometimes a serious problem for > some hardware. > > For example, having the recordsize set to 8K on a Sun 4170 with 8 drives > recently gave me these appalling Bonnie++ results: > > Version 1.96 --Sequential Output-- --Sequential Input- > --Random- > Concurrency 4 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > --Seeks-- > MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP > /sec %CP > db111 24G 260044 33 62110 17 89914 15 > 1167 25 > Latency6549ms4882ms 3395ms > 107ms > > I know that's hard to read. What it's saying is: > > Seq Writes: 260mb/s combined > Seq Reads: 89mb/s combined > Read Latency: 3.3s > > Best guess is that this is a result of overloading the array/drives with > commands for all those small blocks; certainly the behavior observed > (stuttering I/O, latency) is in line with that issue. > > Anyway, since this is a DW-like workload, we just bumped the recordsize > up to 128K and the performance issues went away ... reads up over 300mb/s. > > -- > -- Josh Berkus > PostgreSQL Experts Inc. > http://www.pgexperts.com > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] 8K recordsize bad on ZFS?
As I said, the record size is applied on the file creation :-) so by copying your data from one directory to another one you've made the new record size applied on the newly created files :-) (equal to backup restore if there was not enough space).. Did you try to redo the same but still keeping record size equal 8K ? ;-) I think the problem you've observed is simply related to the copy-on-write nature of ZFS - if you bring any modification to the data your sequential order of pages was broken with a time, and finally the sequential read was transformed to the random access.. But once you've re-copied your files again - the right order was applied again. BTW, 8K is recommended for OLTP workloads, but for DW you may stay with 128K without problem. Rgds, -Dimitri On 5/10/10, Josh Berkus wrote: > On 5/9/10 1:45 AM, Dimitri wrote: >> Josh, >> >> it'll be great if you explain how did you change the records size to >> 128K? - as this size is assigned on the file creation and cannot be >> changed later - I suppose that you made a backup of your data and then >> process a full restore.. is it so? > > You can change the recordsize of the zpool dynamically, then simply copy > the data directory (with PostgreSQL shut down) to a new directory on > that zpool. This assumes that you have enough space on the zpool, of > course. > > We didn't test how it would work to let the files in the Postgres > instance get gradually replaced by "natural" updating. > > -- > -- Josh Berkus > PostgreSQL Experts Inc. > http://www.pgexperts.com > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Slow query with planner row strange estimation
It's probably one of the cases when having HINTS in PostgreSQL may be very helpful.. SELECT /*+ enable_nestloop=off */ ... FROM ... will just fix this query without impacting other queries and without adding any additional instructions into the application code.. So, why there is a such resistance to implement hints withing SQL queries in PG?.. Rgds, -Dimitri On 7/9/10, Robert Haas wrote: > On Fri, Jul 9, 2010 at 6:13 AM, damien hostin > wrote: >>> Have you tried running ANALYZE on the production server? >>> >>> You might also want to try ALTER TABLE ... SET STATISTICS to a large >>> value on some of the join columns involved in the query. >> >> Hello, >> >> Before comparing the test case on the two machines, I run analyse on the >> whole and look at pg_stats table to see if change occurs for the columns. >> but on the production server the stats never became as good as on the >> desktop computer. I set statistic at 1 on column used by the join, run >> analyse which take a 300 row sample then look at the stats. The stats >> are not as good as on the desktop. Row number is nearly the same but only >> 1 >> or 2 values are found. >> >> The data are not balanced the same way on the two computer : >> - Desktop is 12000 rows with 6000 implicated in the query (50%), >> - "Production" (actually a dev/test server) is 6 million rows with 6000 >> implicated in the query (0,1%). >> Columns used in the query are nullable, and in the 5994000 other rows that >> are not implicated in the query these columns are null. >> >> I don't know if the statistic target is a % or a number of value to >> obtain, > > It's a number of values to obtain. > >> but event set at max (1), it didn't managed to collect good stats (for >> this particular query). > > I think there's a cutoff where it won't collect values unless they > occur significantly more often than the average frequency. I wonder > if that might be biting you here: without the actual values in the MCV > table, the join selectivity estimates probably aren't too good. > >> As I don't know what more to do, my conclusion is that the data need to be >> better balanced to allow the analyse gather better stats. But if there is >> a >> way to improve the stats/query with this ugly balanced data, I'm open to >> it >> ! >> >> I hope that in real production, data will never be loaded this way. If >> this >> appened we will maybe set enable_nestloop to off, but I don't think it's a >> good solution, other query have a chance to get slower. > > Yeah, that usually works out poorly. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise Postgres Company > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
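Since PostgreSQL has no in-query hints, the closest thing to the /*+ enable_nestloop=off */ idea above is to scope the setting to a single transaction, so other queries are not affected. A minimal sketch (the problem query itself is not shown in the thread, so the comment is a placeholder):

BEGIN;
SET LOCAL enable_nestloop = off;   -- applies only inside this transaction
-- run the mis-estimated query here; the setting reverts at COMMIT/ROLLBACK
COMMIT;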
Re: [PERFORM] 32 vs 64 bit build on Solaris Sparc
Hi Joe, the general rule on Solaris SPARC is: - if you need to address a big size of memory (over 4G): compile in 64bit - otherwise: compile in 32bit ;-) It's true that 32bit code will run faster comparing to 64bit ont the 64bit SPARC - you'll operate with 2 times shorter addresses, and in some cases SPARC will be able to execute 2 operations in parallel on 32bit code, while it'll be still one operation on 64bit code.. - But it's all about the code, because once you start to do I/O requests all kind of optimization on the instructions will be lost due I/O latency ;-)) So, as usual, a real answer in each case may be obtained only by a real test.. Just test both versions and you'll see yourself what is a valid in your case :-)) Same problem regarding compilers: in some cases GCC4 will give a better result, in some cases Sun Studio will be better (there are many posts in blogs about optimal compiler options to use).. - don't hesitate to try and don't forget to share here with others :-)) Rgds, -Dimitri On 8/11/10, Joseph Conway wrote: > With a 16 CPU, 32 GB Solaris Sparc server, is there any conceivable > reason to use a 32 bit build rather than a 64 bit build? Apparently the > Sun PostgreSQL package includes a README that indicates you might want > to think twice about using 64 bit because it is slower -- this seems > like outdated advice, but I was looking for confirmation one way or the > other. > > Also semi-related question: when building from source, using gcc, > enabling debug (but *not* assert) is normally not much of a performance > hit. Is the same true if you build with the Sun CC? > > Thanks in advance for any thoughts/experiences. > > Joe > > > > -- > Joe Conway > credativ LLC: http://www.credativ.us > Linux, PostgreSQL, and general Open Source > Training, Service, Consulting, & 24x7 Support > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: Fwd: [PERFORM] Vacuum Full + Cluster + Vacuum full = non removable dead rows
So, does it mean that VACUUM will never clean dead rows if you have a non-stop transactional activity in your PG database???... (24/7 OLTP for ex.) Rgds, -Dimitri On 8/19/10, Kevin Grittner wrote: > Alexandre de Arruda Paes wrote: >> 2010/8/18 Tom Lane > >>> There's an open transaction somewhere that VACUUM is preserving >>> the tuples for. This transaction need not ever have touched the >>> table, or ever intend to touch the table --- but VACUUM cannot >>> know that, so it saves any tuples that the transaction might be >>> entitled to see if it looked. >>> >>> > carmen=# select * from vlocks where relname='tp93t'; select * >>> > from pg_stat_activity where usename='webpa'; >>> >>> You keep on showing us only subsets of pg_stat_activity :-( > >> select * from pg_stat_activity where usename='webpa'; > > You keep on showing us only subsets of pg_stat_activity :-( > > *ANY* open transaction, including "idle in transaction" including > transactions by other users in other databases will prevent vacuum > from cleaning up rows, for the reasons Tom already gave you. > > What do you get from?: > > select * from pg_stat_activity where current_query <> '' > order by xact_start limit 10; > > -Kevin > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: Fwd: [PERFORM] Vacuum Full + Cluster + Vacuum full = non removable dead rows
Great! - it's what I expected until now :-) but discussion in this thread put my mind in trouble :-)) So, the advice for Alexandre here is just to check the age of the oldest running transaction and the last time when the table in question was modified.. - if modification time is older than the oldest transaction = we have a problem in PG.. Otherwise it works as expected to match MVCC. Rgds, -Dimitri On 8/21/10, Scott Marlowe wrote: > No, it means it can't clean rows that are younger than the oldest > transaction currently in progress. if you started a transaction 5 > hours ago, then all the dead tuples created in the last 5 hours are > not recoverable. Dead tuples created before that transaction are > recoverable. If you run transactions for days or weeks, then you're > gonna have issues. > > On Sat, Aug 21, 2010 at 2:25 AM, Dimitri wrote: >> So, does it mean that VACUUM will never clean dead rows if you have a >> non-stop transactional activity in your PG database???... (24/7 OLTP >> for ex.) >> >> Rgds, >> -Dimitri >> >> >> On 8/19/10, Kevin Grittner wrote: >>> Alexandre de Arruda Paes wrote: >>>> 2010/8/18 Tom Lane >>> >>>>> There's an open transaction somewhere that VACUUM is preserving >>>>> the tuples for. This transaction need not ever have touched the >>>>> table, or ever intend to touch the table --- but VACUUM cannot >>>>> know that, so it saves any tuples that the transaction might be >>>>> entitled to see if it looked. >>>>> >>>>> > carmen=# select * from vlocks where relname='tp93t'; select * >>>>> > from pg_stat_activity where usename='webpa'; >>>>> >>>>> You keep on showing us only subsets of pg_stat_activity :-( >>> >>>> select * from pg_stat_activity where usename='webpa'; >>> >>> You keep on showing us only subsets of pg_stat_activity :-( >>> >>> *ANY* open transaction, including "idle in transaction" including >>> transactions by other users in other databases will prevent vacuum >>> from cleaning up rows, for the reasons Tom already gave you. >>> >>> What do you get from?: >>> >>> select * from pg_stat_activity where current_query <> '' >>> order by xact_start limit 10; >>> >>> -Kevin >>> >>> -- >>> Sent via pgsql-performance mailing list >>> (pgsql-performance@postgresql.org) >>> To make changes to your subscription: >>> http://www.postgresql.org/mailpref/pgsql-performance >>> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > > > > -- > To understand recursion, one must first understand recursion. > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
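A sketch of the kind of check being suggested here, written for the 8.x pg_stat_activity layout (the LIMIT is arbitrary); the oldest xact_start is what holds back VACUUM:

SELECT procpid, usename, datname,
       now() - xact_start AS xact_age,
       current_query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
 ORDER BY xact_start
 LIMIT 10;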
Re: Fwd: [PERFORM] Vacuum Full + Cluster + Vacuum full = non removable dead rows
The problem here is that we're trying to keep an image of a whole world for any transaction which is in most cases will need to get a look on few streets around.. ;-) I understand well that it's respecting the standard and so on, but the background problem that you may see your table bloated just because there is a long running transaction appeared in another database, and if it's maintained/used/etc by another team - the problem very quickly may become human rather technical :-)) So, why simply don't add a FORCE option to VACUUM?.. - In this case if one executes "VACUUM FORCE TABLE" will be just aware about what he's doing and be sure no one of the active transactions will be ever access this table. What do you think?.. ;-) Rgds, -Dimitri On 8/22/10, Robert Haas wrote: > On Sat, Aug 21, 2010 at 9:49 AM, Alexandre de Arruda Paes > wrote: >> Only for discussion: the CLUSTER command, in my little knowledge, is a >> intrusive command that's cannot recover the dead tuples too. >> >> Only TRUNCATE can do this job, but obviously is not applicable all the >> time. > > Either VACUUM or CLUSTER will recover *dead* tuples. What you can't > recover are tuples that are still visible to some running transaction. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise Postgres Company > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] How to achieve sustained disk performance of 1.25 GB write for 5 mins
You may also try the Sun's F5100 (flash storage array) - you may easily get 700 MB/s just with a single I/O stream (single process), so just with 2 streams you'll get your throughput.. - The array has 2TB total space and max throughput should be around 4GB/s.. Rgds, -Dimitri On 11/18/10, Greg Smith wrote: > Eric Comeau wrote: >> Ideally 1 large file, but it may have to be multiple. We find that if >> we send multiple files it just causes the disk to thrash more so we >> get better throughput by sending one large file. > > If it's really one disk, sure. The problem you're facing is that your > typical drive controller is going to top out at somewhere between 300 - > 500MB/s of sequential writes before it becomes the bottleneck. Above > somewhere between 6 and 10 drives attached to one controller on current > hardware, adding more to a RAID-0 volume only increases the ability to > handle seeks quickly. If you want to try and do this with traditional > hard drives, I'd guess you'd need 3 controllers with at least 4 > short-stroked drives attached to each to have any hope of hitting > 1.25GB/s. Once you do that, you'll run into CPU time as the next > bottleneck. At that point, you'll probably need one CPU per controller, > all writing out at once, to keep up with your target. > > The only popular hardware design that comes to mind aimed at this sort > of thing was Sun's "Thumper" design, most recently seen in the Sun Fire > X4540. That put 8 controllers with 6 disks attached to each, claiming > "demonstrated up to 2 GB/sec from disk to network". It will take a > design like that, running across multiple controllers, to get what > you're looking for on the disk side--presuming everything else keeps up. > > One of the big SSD-on-PCI-e designs mentioned here already may very well > end up being a better choice for you here though, as those aren't going > to require quite as much hardware all get wired up. > > -- > Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD > PostgreSQL Training, Services and Supportwww.2ndQuadrant.us > "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] PostgreSQL vs MySQL, and FreeBSD
Seems to me there is more thread model implementation problem on FreeBSD, and databases just reflecting it... Most of the test I done on Solaris show the same performance level on the same short READ-only queries for MySQL and PostgreSQL. And to be honest till the end, thread model should be far faster (context switching between threads is way faster vs processes), but - as I say usually - even a very good idea may be just wasted by a poor implementation... And in case of MySQL they have too much locking to manage concurrency between threads which kills all thread model benefits... Also, to compare apples to apples, they should run this test from remote client rather locally on the same host - however in this case the result for PostgreSQL will mostly depends on client implementation: if client implements reading via CURSOR (quite often), reading will generate 4x times more intensive network traffic than necessary and final PostgreSQL result will be worse... Reading this article I'm just happy for them to see progress done on FreeBSD :-) As well to demonstrate OS parallelism it's not so impressive to see 4CPU server results rather 8CPU or 32threaded Niagara... Don't know why they did not present similar performance graphs for these platform, strange no?... Rgds, -Dimitri On 11/9/07, Ron Mayer <[EMAIL PROTECTED]> wrote: > Bill Moran wrote: > > On Fri, 9 Nov 2007 11:11:18 -0500 (EST) > > Greg Smith <[EMAIL PROTECTED]> wrote: > >> On Fri, 9 Nov 2007, Sebastian Hennebrueder wrote: > >>> If the queries are complex, this is understable. > >> The queries used for this comparison are trivial. There's only one table > >> involved and there are no joins. It's testing very low-level aspects of > >> performance. > > > > Actually, what it's really showing is parallelism, and I've always > > expected PostgreSQL to come out on top in that arena. > > Isn't it showing Postgres winning even without parallelism. > > At 1 threads, Postgres looks like 800TPS where MysQL comes > in at about 600TPS on their Opteron charts. > > ---(end of broadcast)--- > TIP 9: In versions below 8.0, the planner will ignore your desire to >choose an index scan if your joining column's datatypes do not >match > ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[PERFORM] Any better plan for this query?..
Hi,

any idea if there is a more optimal execution plan possible for this query:

select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
       H.END_DATE as hend, H.NOTE as hnote
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '01'
 order by H.HORDER ;

EXPLAIN ANALYZE output on 8.4:

                                                              QUERY PLAN
 Sort  (cost=4549.75..4555.76 rows=2404 width=176) (actual time=1.341..1.343 rows=20 loops=1)
   Sort Key: h.horder
   Sort Method:  quicksort  Memory: 30kB
   ->  Hash Join  (cost=33.50..4414.75 rows=2404 width=176) (actual time=1.200..1.232 rows=20 loops=1)
         Hash Cond: (h.ref_stat = s.ref)
         ->  Index Scan using history_ref_idx on history h  (cost=0.00..4348.20 rows=2404 width=135) (actual time=0.042..0.052 rows=20 loops=1)
               Index Cond: (ref_object = '01'::bpchar)
         ->  Hash  (cost=21.00..21.00 rows=1000 width=45) (actual time=1.147..1.147 rows=1000 loops=1)
               ->  Seq Scan on stat s  (cost=0.00..21.00 rows=1000 width=45) (actual time=0.005..0.325 rows=1000 loops=1)
 Total runtime: 1.442 ms
(10 rows)

Table HISTORY contains 200M rows, only 20 needed
Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY values.

Table definitions:
"""""""""""""""""""""""""""""""""""""""""""""""""""
create table STAT
(
    REF         CHAR(3)    not null,
    NAME        CHAR(40)   not null,
    NUMB        INT        not null
);

create table HISTORY
(
    REF_OBJECT  CHAR(10)   not null,
    HORDER      INT        not null,
    REF_STAT    CHAR(3)    not null,
    BEGIN_DATE  CHAR(12)   not null,
    END_DATE    CHAR(12)           ,
    NOTE        CHAR(100)
);

create unique index stat_ref_idx on STAT( ref );
create index history_ref_idx on HISTORY( ref_object, horder );
"""""""""""""""""""""""""""""""""""""""""""""""""""

NOTE: The same query runs 2 times faster on MySQL.

Any idea?..

Rgds,
-Dimitri

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Craig, yes, you detailed very well the problem! :-) all those CHAR columns are so just due historical issues :-) as well they may contains anything else and not only numbers, that's why.. Also, all data inside are fixed, so VARCHAR will not save place, or what kind of performance issue may we expect with CHAR vs VARCHAR if all data have a fixed length?.. Any way to force nested loop without additional index?.. It's 2 times faster on InnoDB, and as it's just a SELECT query no need to go in transaction details :-) Rgds, -Dimitri On 5/6/09, Craig Ringer wrote: > Dimitri wrote: >> Hi, >> >> any idea if there is a more optimal execution plan possible for this >> query: >> >> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as >> hbeg, >> H.END_DATE as hend, H.NOTE as hnote >> from HISTORY H, STAT S >> where S.REF = H.REF_STAT >> and H.REF_OBJECT = '01' >> order by H.HORDER ; > > OK, so you're taking a simple: > > history INNER JOIN stat ON (stat.ref = history.ref_stat) > > then filtering for records with a particular value of history.ref_object > and finally performing a sort. > > If I'm reading it right, the plan below does a sequential scan on the > `stat' table. The stat table only has 1000 rows, so this isn't > necessarily an unreasonable choice even if there is an appropriate index > and even if not many of the rows will be needed. > > It then does an index scan of the history table looking for tuples with > ref_object = '01' (text match). It hash joins the hashed results > of the initial seq scan to the results of the index scan, and sorts the > result. > > To me, that looks pretty reasonable. You might be able to avoid the hash > join in favour of a nested loop scan of stat_ref_idx (looping over > records from history.ref_stat where ref_object = '001') by > providing a composite index on HISTORY(ref_stat, ref_object). I'm really > not too sure, though; plan optimization isn't my thing, I'm just seeing > if I can offer a few ideas. > >> Table definitions: > > While not strictly necessary, it's a *REALLY* good idea to define a > suitable PRIMARY KEY. > > Also, the `CHAR(n)' data type is evil. E.V.I.L. Use `varchar(n)' for > bounded-length values, or `text' for unbounded fields, unless you REALLY > want the crazy behaviour of `CHAR(n)'. > > I'm a little bit puzzled about why you seem to be doing lots of things > with integer values stored in text strings, but that probably doesn't > matter too much for the issue at hand. > >> NOTE: The same query runs 2 times faster on MySQL. > > With InnoDB tables and proper transactional safety? Or using scary > MyISAM tables and a "just pray" approach to data integrity? If you're > using MyISAM tables I'm not surprised; MySQL with MyISAM is stunningly > fast, but oh-my-god dangerous. > > -- > Craig Ringer > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
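For reference, a sketch of the composite index Craig mentions, written against the schema from the first message (the index name is invented; whether the planner then prefers a nested loop should be confirmed with EXPLAIN ANALYZE rather than assumed):

create index history_stat_obj_idx on HISTORY( ref_stat, ref_object );
analyze HISTORY;

explain analyze
select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT
   and H.REF_OBJECT = '01'
 order by H.HORDER;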
Re: [PERFORM] Any better plan for this query?..
Hi Heikki, I've already tried a target 1000 and the only thing it changes comparing to the current 100 (default) is instead of 2404 rows it says 240 rows, but the plan remaining the same.. Rgds, -Dimitri On 5/6/09, Heikki Linnakangas wrote: > Dimitri wrote: >> any idea if there is a more optimal execution plan possible for this >> query: >> >> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as >> hbeg, >> H.END_DATE as hend, H.NOTE as hnote >> from HISTORY H, STAT S >> where S.REF = H.REF_STAT >> and H.REF_OBJECT = '01' >> order by H.HORDER ; >> >> EXPLAIN ANALYZE output on 8.4: >>QUERY >> PLAN >> >> Sort (cost=4549.75..4555.76 rows=2404 width=176) (actual >> time=1.341..1.343 rows=20 loops=1) >>Sort Key: h.horder >>Sort Method: quicksort Memory: 30kB >>-> Hash Join (cost=33.50..4414.75 rows=2404 width=176) (actual >> time=1.200..1.232 rows=20 loops=1) >> Hash Cond: (h.ref_stat = s.ref) >> -> Index Scan using history_ref_idx on history h >> (cost=0.00..4348.20 rows=2404 width=135) (actual time=0.042..0.052 >> rows=20 loops=1) >>Index Cond: (ref_object = '01'::bpchar) >> -> Hash (cost=21.00..21.00 rows=1000 width=45) (actual >> time=1.147..1.147 rows=1000 loops=1) >>-> Seq Scan on stat s (cost=0.00..21.00 rows=1000 >> width=45) (actual time=0.005..0.325 rows=1000 loops=1) >> Total runtime: 1.442 ms >> (10 rows) >> >> Table HISTORY contains 200M rows, only 20 needed >> Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY >> values. > > The bad doesn't look too bad to me, although the planner is > over-estimating the number of matches in the history table (2404 vs 20). > That's a bit surprising given how simple the predicate is. Make sure > you've ANALYZEd the table. If that's not enough, you can try to increase > the statistics target for ref_object column, ie. ALTER TABLE history > ALTER COLUMN ref_object SET STATISTICS 500. That might give you a > different plan, maybe with a nested loop join instead of hash join, > which might be faster in this case. > > -- >Heikki Linnakangas >EnterpriseDB http://www.enterprisedb.com > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
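For anyone reproducing Heikki's suggestion end to end, a minimal sketch (500 is simply the value from his mail; pg_stats shows what the last ANALYZE actually collected for the column):

alter table HISTORY alter column REF_OBJECT set statistics 500;
analyze HISTORY;

select attname, n_distinct, correlation
  from pg_stats
 where tablename = 'history' and attname = 'ref_object';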
Re: [PERFORM] Any better plan for this query?..
Hi Chris, the only problem I see here is it's 2 times slower vs InnoDB, so before I'll say myself it's ok I want to be sure there is nothing else to do.. :-) Rgds, -Dimitri On 5/6/09, Chris wrote: > Dimitri wrote: >> Hi Craig, >> >> yes, you detailed very well the problem! :-) >> all those CHAR columns are so just due historical issues :-) as well >> they may contains anything else and not only numbers, that's why.. >> Also, all data inside are fixed, so VARCHAR will not save place, or >> what kind of performance issue may we expect with CHAR vs VARCHAR if >> all data have a fixed length?.. > > None in postgres, but the char/varchar thing may or may not bite you at > some point later - sounds like you have it covered though. > >> It's 2 times faster on InnoDB, and as it's just a SELECT query no need >> to go in transaction details :-) > > Total runtime: 1.442 ms > (10 rows) > > You posted a query that's taking 2/1000's of a second. I don't really > see a performance problem here :) > > -- > Postgresql & php tutorials > http://www.designmagick.com/ > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Richard, no, of course it's not based on explain :-) I ran several tests before and am now going in depth to understand whether anything is wrong. Because of this single-query time difference, InnoDB is reaching a 2-3 times better TPS level than PostgreSQL.. Rgds, -Dimitri On 5/6/09, Richard Huxton wrote: > Dimitri wrote: >> Hi Chris, >> >> the only problem I see here is it's 2 times slower vs InnoDB > > How do you know? This isn't just based on the explain values reported, > is it? > > -- >Richard Huxton >Archonet Ltd > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
I'll try to answer all mails at once :-)) - query is running fully in RAM, no I/O, no network, only CPU time - looping 100 times the same query gives 132ms total time (~1.32ms per query), while it's 44ms on InnoDB (~0.44ms per query) - disabling seq scan forcing a planner to use an index scan, and finally it worse as gives 1.53ms per query.. - prepare the query helps: prepare statement takes 16ms, but execute runs in 0.98ms = which make me think it's not only a planner overhead... And it's still 2 times lower vs 0.44ms. Also, generally prepare cannot be used in this test case as we suppose any query may be of any kind (even if it's not always true :-)) - char or varchar should be used here because the reference code is supposed to accept any characters (alphanumeric) - it also reminds me that probably there are some extra CPU time due locale setting - but all my "lc_*" variables are set to "C"... Rgds, -Dimitri On 5/6/09, Merlin Moncure wrote: > On Wed, May 6, 2009 at 7:46 AM, Merlin Moncure wrote: >> prepare history_stat(char(10) as > > typo: > prepare history_stat(char(10)) as > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
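Since prepared statements keep coming up in this thread, here is a sketch of the PREPARE/EXECUTE form of the test query (the statement name follows Merlin's example; parse and plan happen once at PREPARE time, EXECUTE only runs the stored plan):

prepare history_stat (char(10)) as
   select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as hbeg,
          H.END_DATE as hend, H.NOTE as hnote
     from HISTORY H, STAT S
    where S.REF = H.REF_STAT
      and H.REF_OBJECT = $1
    order by H.HORDER;

execute history_stat('01');
deallocate history_stat;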
Re: [PERFORM] Any better plan for this query?..
The story is simple: for the launching of MySQL 5.4 I've done a testing comparing available on that time variations of InnoDB engines, and at the end by curiosity started the same test with PostgreSQL 8.3.7 to see if MySQL performance level is more close to PostgreSQL now (PG was a strong true winner before). For my big surprise MySQL 5.4 outpassed 8.3.7... However, analyzing the PostgreSQL processing I got a feeling something goes wrong on PG side.. So, now I've installed both 8.3.7 and 8.4beta1 to see more in depth what's going on. Currently 8.4 performs much better than 8.3.7, but there is still a room for improvement if such a small query may go faster :-) Rgds, -Dimitri On 5/6/09, Albe Laurenz wrote: > Dimitri wrote: >> I've run several tests before and now going in depth to understand if >> there is nothing wrong. Due such a single query time difference InnoDB >> is doing 2-3 times better TPS level comparing to PostgreSQL.. > > Why don't you use MySQL then? > Or tune PostgreSQL? > > Yours, > Laurenz Albe > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Folks, first of all: - I used a fixed reference value just to simplify the case analyzing and isolate it as max as possible, of course during my tests all values are random :-) - final goal of the test is to analyze scalability, so yes, concurrent sessions with random keys are growing from 1 to 256 (I run it on 32cores server, no think time, just stressing), and the result is still not yet better comparing to InnoDB - I'm analyzing this query running in memory to understand what's blocking while all main bottlenecks are avoided (no I/O anymore nor network, etc.) - initial explain analyze and table details were posted in the first message Now, let's go more further: - so "as it" query execution took 1.50ms - after removing "order by" it took 1.19ms - select count(*) instead of columns and with removed "order by" took 0.98ms - execute of the same prepared "select count(*) ..." took 0.68ms So, where the time is going?... Rgds, -Dimitri On 5/6/09, Ries van Twisk wrote: > > On May 6, 2009, at 7:53 AM, Richard Huxton wrote: > >> Dimitri wrote: >>> I'll try to answer all mails at once :-)) >>> - query is running fully in RAM, no I/O, no network, only CPU time >>> - looping 100 times the same query gives 132ms total time (~1.32ms >>> per >>> query), while it's 44ms on InnoDB (~0.44ms per query) >> >> Well, assuming you're happy that PG is tuned reasonably for your >> machine and that MySQL's query cache isn't returning the results >> here it looks like MySQL is faster for this particular query. >> >> The only obvious place there could be a big gain is with the hashing >> algorithm. If you remove the ORDER BY and the query-time doesn't >> fall by much then it's the hash phase. >> >> The other thing to try is to alter the query to be a SELECT count(*) >> rather than returning rows - that will let you measure the time to >> transfer the result rows. >> >> -- >> Richard Huxton >> Archonet Ltd >> > > > Do you expect to run this query 100 times per second during your > application? > or is this just a test to see how fast the query is for optimalisation. > > I always get scared myself with such a test as 'runs out of memory', > reason > given is that usually this is not really the case in a production > environment. > > Try to make a little test case where you give the query random > parameters > so different result sets are returned. This will give you a better > idea on how > fast the query really is and might give you better comparison results. > > instead of count(*) I isusallt do explain analyze to see how fast > PostgreSQL handles to query. > > Ries > > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
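To make those measurement steps reproducible, here is a sketch of the three variants being timed (same literal as in the original post; running each under EXPLAIN ANALYZE, or with \timing in psql, gives the per-query numbers):

-- the query "as it" is
explain analyze
select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT and H.REF_OBJECT = '01'
 order by H.HORDER;

-- same join, sort step removed
explain analyze
select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT and H.REF_OBJECT = '01';

-- rows counted but not returned, so the result transfer cost disappears
explain analyze
select count(*)
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT and H.REF_OBJECT = '01';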
Re: [PERFORM] Any better plan for this query?..
I supposed in case with prepare and then execute a query optimizer is no more coming in play on "execute" phase, or did I miss something?.. Forget to say: query cache is disabled on MySQL side. Rgds, -Dimitri On 5/6/09, Craig Ringer wrote: > Dimitri wrote: >> Hi Chris, >> >> the only problem I see here is it's 2 times slower vs InnoDB, so >> before I'll say myself it's ok I want to be sure there is nothing else >> to do.. :-) > > Can the genetic query optimizer come into play on small queries? > > -- > Craig Ringer > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Ken, yes, I may do it, but I did not expect to come into profiling initially :-) I expected there is just something trivial within a plan that I just don't know.. :-) BTW, is there already an integrated profiled within a code? or do I need external tools?.. Rgds, -Dimitri On 5/6/09, Kenneth Marshall wrote: > On Wed, May 06, 2009 at 09:38:59AM +0200, Dimitri wrote: >> Hi, >> >> any idea if there is a more optimal execution plan possible for this >> query: >> >> select S.REF as stref, S.NAME as stnm, H.HORDER as hord, H.BEGIN_DATE as >> hbeg, >> H.END_DATE as hend, H.NOTE as hnote >> from HISTORY H, STAT S >> where S.REF = H.REF_STAT >> and H.REF_OBJECT = '01' >> order by H.HORDER ; >> >> EXPLAIN ANALYZE output on 8.4: >>QUERY >> PLAN >> >> Sort (cost=4549.75..4555.76 rows=2404 width=176) (actual >> time=1.341..1.343 rows=20 loops=1) >>Sort Key: h.horder >>Sort Method: quicksort Memory: 30kB >>-> Hash Join (cost=33.50..4414.75 rows=2404 width=176) (actual >> time=1.200..1.232 rows=20 loops=1) >> Hash Cond: (h.ref_stat = s.ref) >> -> Index Scan using history_ref_idx on history h >> (cost=0.00..4348.20 rows=2404 width=135) (actual time=0.042..0.052 >> rows=20 loops=1) >>Index Cond: (ref_object = '01'::bpchar) >> -> Hash (cost=21.00..21.00 rows=1000 width=45) (actual >> time=1.147..1.147 rows=1000 loops=1) >>-> Seq Scan on stat s (cost=0.00..21.00 rows=1000 >> width=45) (actual time=0.005..0.325 rows=1000 loops=1) >> Total runtime: 1.442 ms >> (10 rows) >> >> Table HISTORY contains 200M rows, only 20 needed >> Table STAT contains 1000 rows, only 20 needed to be joined to HISTORY >> values. >> >> Table definitions: >> """"""""""""""""""""""""""""""""""""""""""""""""""" >> create table STAT >> ( >> REF CHAR(3)not null, >> NAMECHAR(40) not null, >> NUMBINTnot null >> ); >> >> create table HISTORY >> ( >> REF_OBJECT CHAR(10) not null, >> HORDER INT not null, >> REF_STATCHAR(3) not null, >> BEGIN_DATE CHAR(12) not null, >> END_DATECHAR(12) , >> NOTECHAR(100) >> ); >> >> create unique index stat_ref_idx on STAT( ref ); >> create index history_ref_idx on HISTORY( ref_object, horder ); >> """"""""""""""""""""""""""""""""""""""""""""""""""" >> >> NOTE: The same query runs 2 times faster on MySQL. >> >> Any idea?.. >> >> Rgds, >> -Dimitri >> > Dimitri, > > Is there any chance of profiling the postgres backend to see > where the time is used? > > Just an idea, > Ken > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Simon, may you explain why REINDEX may help here?.. - database was just created, data loaded, and then indexes were created + analyzed.. What may change here after REINDEX?.. With hashjoin disabled was a good try! Running this query "as it" from 1.50ms we move to 0.84ms now, and the plan is here: QUERY PLAN -- Sort (cost=4562.83..4568.66 rows=2329 width=176) (actual time=0.225..0.229 rows=20 loops=1) Sort Key: h.horder Sort Method: quicksort Memory: 30kB -> Merge Join (cost=4345.89..4432.58 rows=2329 width=176) (actual time=0.056..0.205 rows=20 loops=1) Merge Cond: (s.ref = h.ref_stat) -> Index Scan using stat_ref_idx on stat s (cost=0.00..49.25 rows=1000 width=45) (actual time=0.012..0.079 rows=193 loops=1) -> Sort (cost=4345.89..4351.72 rows=2329 width=135) (actual time=0.041..0.043 rows=20 loops=1) Sort Key: h.ref_stat Sort Method: quicksort Memory: 30kB -> Index Scan using history_ref_idx on history h (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.013..0.024 rows=20 loops=1) Index Cond: (ref_object = '01'::bpchar) Total runtime: 0.261 ms (12 rows) Curiously planner expect to run it in 0.26ms Any idea why planner is not choosing this plan from the beginning?.. Any way to keep this plan without having a global or per sessions hashjoin disabled?.. Rgds, -Dimitri On 5/6/09, Simon Riggs wrote: > > On Wed, 2009-05-06 at 10:31 +0200, Dimitri wrote: > >> I've already tried a target 1000 and the only thing it changes >> comparing to the current 100 (default) is instead of 2404 rows it says >> 240 rows, but the plan remaining the same.. > > Try both of these things > * REINDEX on the index being used in the query, then re-EXPLAIN > * enable_hashjoin = off, then re-EXPLAIN > > You should first attempt to get the same plan, then confirm it really is > faster before we worry why the optimizer hadn't picked that plan. > > We already know that MySQL favors nested loop joins, so turning up a > plan that on this occasion is actually better that way is in no way > representative of general performance. Does MySQL support hash joins? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
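On the last question, keeping this plan without disabling hash joins globally or for the whole session: a transaction-scoped setting is one option worth trying (just a sketch; the setting reverts automatically at COMMIT or ROLLBACK, so concurrent sessions stay untouched):

begin;
set local enable_hashjoin = off;

select S.REF, S.NAME, H.HORDER, H.BEGIN_DATE, H.END_DATE, H.NOTE
  from HISTORY H, STAT S
 where S.REF = H.REF_STAT and H.REF_OBJECT = '01'
 order by H.HORDER;

commit;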
Re: [PERFORM] Any better plan for this query?..
The problem with "gprof" - it'll profile all stuff from the beginning to the end, and a lot of function executions unrelated to this query will be profiled... As well when you look on profiling technology - all such kind of solutions are based on the system clock frequency and have their limits on time resolution. On my system this limit is 0.5ms, and it's too big comparing to the query execution time :-) So, what I've done - I changed little bit a reference key criteria from = '01' to < '51', so instead of 20 rows I have 1000 rows on output now, it's still slower than InnoDB (12ms vs 9ms), but at least may be profiled (well, we also probably moving far from the problem as time may be spent mostly on the output traffic now, but I've tried anyway) - I've made a loop of 100 iterations of this query which is reading but not printing data. The total execution time of this loop is 1200ms, and curiously under profiling was not really changed. Profiler was able to catch 733ms of total execution time (if I understand well, all functions running faster than 0.5ms are remain un-profiled). The top profiler output is here: Excl. Incl. Name User CPU User CPU sec. sec. 0.733 0.733 0.103 0.103 memcpy 0.045 0.045 slot_deform_tuple 0.037 0.040 AllocSetAlloc 0.021 0.021 AllocSetFree 0.018 0.037 pfree 0.018 0.059 appendBinaryStringInfo 0.017 0.031 heap_fill_tuple 0.017 0.017 _ndoprnt 0.016 0.016 nocachegetattr 0.015 0.065 heap_form_minimal_tuple 0.015 0.382 ExecProcNode 0.015 0.015 strlen 0.014 0.037 ExecScanHashBucket 0.014 0.299 printtup 0.013 0.272 ExecHashJoin 0.011 0.011 enlargeStringInfo 0.011 0.086 index_getnext 0.010 0.010 hash_any 0.009 0.076 FunctionCall1 0.009 0.037 MemoryContextAlloc 0.008 0.008 LWLockAcquire 0.007 0.069 pq_sendcountedtext 0.007 0.035 ExecProject 0.007 0.127 ExecScan ... Curiously "memcpy" is in top. Don't know if it's impacted in many cases, but probably it make sense to see if it may be optimized, etc.. Rgds, -Dimitri On 5/7/09, Euler Taveira de Oliveira wrote: > Dimitri escreveu: >> BTW, is there already an integrated profiled within a code? or do I >> need external tools?.. >> > Postgres provides support for profiling. Add --enable-profiling flag. Use > gprof to get the profile. > > > -- > Euler Taveira de Oliveira > http://www.timbira.com/ > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
I've simply restarted a full test with hashjoin OFF. Until 32 concurrent users things are going well. Then since 32 users response time is jumping to 20ms, with 64 users it's higher again, and with 256 users reaching 700ms, so TPS is dropping from 5.000 to ~200.. With hashjoin ON it's not happening, and I'm reaching at least 11.000 TPS on fully busy 32 cores. I should not use prepare/execute as the test conditions should remain "generic". About scalability issue - there is one on 8.3.7, because on 32 cores with such kind of load it's using only 50% CPU and not outpassing 6.000 TPS, while 8.4 uses 90% CPU and reaching 11.000 TPS.. On the same time while I'm comparing 8.3 and 8.4 - the response time is 2 times lower in 8.4, and seems to me the main gain for 8.4 is here. I'll publish all details, just need a time :-) Rgds, -Dimitri On 5/7/09, Merlin Moncure wrote: > On Thu, May 7, 2009 at 4:20 AM, Dimitri wrote: >> Hi Simon, >> >> may you explain why REINDEX may help here?.. - database was just >> created, data loaded, and then indexes were created + analyzed.. What >> may change here after REINDEX?.. >> >> With hashjoin disabled was a good try! >> Running this query "as it" from 1.50ms we move to 0.84ms now, >> and the plan is here: >> >> QUERY >> PLAN >> -- >> Sort (cost=4562.83..4568.66 rows=2329 width=176) (actual >> time=0.225..0.229 rows=20 loops=1) >> Sort Key: h.horder >> Sort Method: quicksort Memory: 30kB >> -> Merge Join (cost=4345.89..4432.58 rows=2329 width=176) (actual >> time=0.056..0.205 rows=20 loops=1) >> Merge Cond: (s.ref = h.ref_stat) >> -> Index Scan using stat_ref_idx on stat s >> (cost=0.00..49.25 rows=1000 width=45) (actual time=0.012..0.079 >> rows=193 loops=1) >> -> Sort (cost=4345.89..4351.72 rows=2329 width=135) (actual >> time=0.041..0.043 rows=20 loops=1) >> Sort Key: h.ref_stat >> Sort Method: quicksort Memory: 30kB >> -> Index Scan using history_ref_idx on history h >> (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.013..0.024 >> rows=20 loops=1) >> Index Cond: (ref_object = '01'::bpchar) >> Total runtime: 0.261 ms >> (12 rows) >> >> Curiously planner expect to run it in 0.26ms >> >> Any idea why planner is not choosing this plan from the beginning?.. >> Any way to keep this plan without having a global or per sessions >> hashjoin disabled?.. > > can you work prepared statements into your app? turn off hash join, > prepare the query, then turn it back on. > > merlin > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Folks, it's completely crazy, but here is what I found:

- if the HISTORY table is analyzed with target 1000, my SELECT response time jumps to 3ms, and the max throughput is limited to 6.000 TPS (this is what happened to 8.3.7)

- if the HISTORY table is analyzed with target 5, my SELECT response time decreases to 1.2ms (!) and then my max TPS level is ~12.000! and CPU is used up to 95% even by 8.3.7 :-) and 8.4 performed better just because I left its analyze target at the default value of 100.

Can anyone explain why the analyze target may have such a huge negative secondary effect?..

Next point: SCALABILITY ISSUE

Now both 8.3.7 and 8.4 have similar performance levels, but 8.3.7 is always slightly better than 8.4. The problem I have:
- on 8 cores: ~5.000 TPS / 5.500 MAX
- on 16 cores: ~10.000 TPS / 11.000 MAX
- on 32 cores: ~10.500 TPS / 11.500 MAX

What else may limit concurrent SELECTs here?..

Yes, I forgot: MySQL is reaching 17.500 TPS here.

Rgds, -Dimitri On 5/7/09, Simon Riggs wrote: > > On Thu, 2009-05-07 at 20:36 +0200, Dimitri wrote: > >> I've simply restarted a full test with hashjoin OFF. Until 32 >> concurrent users things are going well. Then since 32 users response >> time is jumping to 20ms, with 64 users it's higher again, and with 256 >> users reaching 700ms, so TPS is dropping from 5.000 to ~200.. >> >> With hashjoin ON it's not happening, and I'm reaching at least 11.000 >> TPS on fully busy 32 cores. > > Much better to stick to the defaults. > > Sounds like a problem worth investigating further, but not pro bono. > >> About scalability issue - there is one on 8.3.7, because on 32 cores >> with such kind of load it's using only 50% CPU and not outpassing >> 6.000 TPS, while 8.4 uses 90% CPU and reaching 11.000 TPS.. > > Yeh, small changes make a big difference. Thanks for the info. > > How does MySQL perform? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
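For anyone who wants to reproduce the two configurations being compared, a sketch, assuming the target was changed through default_statistics_target rather than per column (ANALYZE uses the session value for columns without an explicit per-column setting; the per-column form is shown as an alternative):

-- target 1000: richer statistics, more planning work per query
set default_statistics_target = 1000;
analyze HISTORY;

-- target 5: cruder estimates, much cheaper planning
set default_statistics_target = 5;
analyze HISTORY;

-- per-column alternative, if only REF_OBJECT matters
alter table HISTORY alter column REF_OBJECT set statistics 5;
analyze HISTORY;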
[PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
Hi, what would you suggest as the most optimal postgresql.conf to keep writes as stable as possible?.. What I want is to avoid "throughput waves" - I want to keep my response times stable without any activity holes. I've tried reducing the checkpoint timeout from 5min to 30sec - it helped, and throughput is more stable now, but instead of big waves I now have short waves anyway.. What is the best combination of options here?.. Rgds, -Dimitri -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Tom, it was not deliberate :-) it just stayed that way after various probes of the query plan. Anyway, on 8.4 the default target is 100, and just by moving it to 5 I reached 10.500 TPS on 16 cores instead of 8.000 initially. And I suppose you have a good reason to keep it equal to 100 by default, don't you? ;-) And what about scalability on 32 cores?.. Any idea? Rgds, -Dimitri On 5/11/09, Tom Lane wrote: > Dimitri writes: >> Anyone may explain me why analyze target may have so huge negative >> secondary effect?.. > > If these are simple queries, maybe what you're looking at is the > increase in planning time caused by having to process 10x as much > statistical data. Cranking statistics_target to the max just because > you can is not necessarily a good strategy. > > regards, tom lane > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
Hi Kevin,

PostgreSQL: 8.3.7 & 8.4
Server: Sun M5000 32cores
OS: Solaris 10

current postgresql.conf:

#
max_connections = 2000                   # (change requires restart)
effective_cache_size = 48000MB
shared_buffers = 12000MB
temp_buffers = 200MB
work_mem = 100MB                         # min 64kB
maintenance_work_mem = 600MB             # min 1MB
max_fsm_pages = 2048000
fsync = on                               # turns forced synchronization on or off
synchronous_commit = off                 # immediate fsync at commit
wal_sync_method = fdatasync
wal_buffers = 2MB
wal_writer_delay = 400ms                 # 1-10000 milliseconds
checkpoint_segments = 128
checkpoint_timeout = 30s
archive_mode = off
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 4
autovacuum_naptime = 20                  # time between autovacuum runs
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.001
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
#

Rgds, -Dimitri On 5/11/09, Kevin Grittner wrote: > Dimitri wrote: > >> what may you suggest as the most optimal postgresql.conf to keep >> writing as stable as possible?.. >> >> What I want is to avoid "throughput waves" - I want to keep my >> response times stable without any activity holes. I've tried to >> reduce checkpoint timeout from 5min to 30sec - it helped, throughput >> is more stable now, but instead of big waves I have now short waves >> anyway.. >> >> What is the best options combination here?.. > > What version of PostgreSQL? What operating system? What hardware? > > The answers are going to depend on the answers to those questions. > > It would also be good to show all lines from postgresql.conf which are > not commented out. > > -Kevin > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
Thanks a lot, I'll try them all! Yes, I have a good external storage box with battery backed cache enabled. There are 64GB of RAM so I expected it'll help little bit to increase a buffer cache, but ok, will see if with 256MB it'll be better :-) What about "full_page_writes" ? seems it's "on" by default. Does it makes sense to put if off?.. Rgds, -Dimitri On 5/11/09, Kevin Grittner wrote: > Dimitri wrote: > >> PostgreSQL: 8.3.7 & 8.4 >> Server: Sun M5000 32cores >> OS: Solaris 10 > > Does that have a battery backed RAID controller? If so, is it > configured for write-back? These both help a lot with smoothing > checkpoint I/O gluts. > > We've minimized problems by making the background writer more > aggressive. 8.3 and later does a better job in general, but we've > still had to go with: > > bgwriter_lru_maxpages = 1000 > bgwriter_lru_multiplier = 4.0 > >> shared_buffers = 12000MB > > You might want to test with that set to something much lower, to see > what the checkpoint delays look like. We've found it best to use a > small (256MB) setting, and leave caching to the OS; in our > environment, it seems to do a better job of scheduling the disk I/O. > YMMV, of course. > > -Kevin > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
Hi Scott, good point - the current checkpoint completion target is a default 0.5, and it makes sense to set it to 0.8 to make writing more smooth, great! yes, data and xlog are separated, each one is sitting on an independent storage LUN RAID1, and storage box is enough performant Thank you! Rgds, -Dimitri On 5/11/09, Scott Marlowe wrote: > On Mon, May 11, 2009 at 10:31 AM, Dimitri wrote: >> Hi Kevin, >> >> PostgreSQL: 8.3.7 & 8.4 >> Server: Sun M5000 32cores >> OS: Solaris 10 >> >> current postgresql.conf: >> >> # >> max_connections = 2000 # (change requires restart) >> effective_cache_size = 48000MB >> shared_buffers = 12000MB >> temp_buffers = 200MB >> work_mem = 100MB# min 64kB >> maintenance_work_mem = 600MB# min 1MB >> >> max_fsm_pages = 2048000 >> fsync = on # turns forced synchronization on >> or off >> synchronous_commit = off# immediate fsync at commit >> wal_sync_method = fdatasync >> wal_buffers = 2MB >> wal_writer_delay = 400ms# 1-1 milliseconds >> >> checkpoint_segments = 128 >> checkpoint_timeout = 30s > > What's your checkpoint completion target set to? Crank that up a bit > ot 0.7, 0.8 etc and make the timeout more, not less. That should > help. > > Also, look into better hardware (RAID controller with battery backed > cache) and also putting pg_xlog on a separate RAID-1 set (or RAID-10 > set if you've got a lot of drives under the postgres data set). > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
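One way to check whether the new checkpoint settings really smooth out the write waves is to sample pg_stat_bgwriter between test runs (available since 8.3); a sketch:

select checkpoints_timed,    -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,      -- checkpoints forced by filling checkpoint_segments
       buffers_checkpoint,   -- buffers written during checkpoints
       buffers_clean,        -- buffers written by the background writer
       buffers_backend       -- buffers written directly by backends
  from pg_stat_bgwriter;

If checkpoints_req grows quickly relative to checkpoints_timed, checkpoint_segments is still forcing checkpoints earlier than the timeout.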
Re: [PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
OK, it'll be better to avoid such an "improvement" then :-) Performance - yes, but not at any price :-) Thank you! Rgds, -Dimitri On 5/11/09, Kevin Grittner wrote: > Dimitri wrote: > >> What about "full_page_writes" ? seems it's "on" by default. Does it >> makes sense to put if off?.. > > It would probably help with performance, but the description is a > little disconcerting in terms of crash recovery. We tried running > with it off for a while (a year or so back), but had problems with > corruption. I think the specific cause of that has since been fixed, > it's left us a bit leery of the option. > > Maybe someone else can speak to how safe (or not) the current > implementation of that option is. > > -Kevin > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Simon, it's still too early to speak about MySQL scalability... :-) It's only in the last few months that there is *no more* regression in MySQL performance when moving from 8 to 16 cores. But looking at how quickly it's progressing now, things may change very quickly :-) For the moment my tests give: - on 8 cores: 14.000 TPS - on 16 cores: 17.500 TPS - on 32 cores: 15.000 TPS (regression) Rgds, -Dimitri On 5/11/09, Simon Riggs wrote: > > On Mon, 2009-05-11 at 17:18 +0200, Dimitri wrote: > >> Yes, forget, MySQL is reaching 17.500 TPS here. > > Please share your measurements of MySQL scalability also. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Aidan, thanks a lot for this detailed summary! So, why don't I use PREPARE here: let's say I'm testing the worst stress case :-) Imagine you have thousands of different queries of this kind - you cannot prepare all of them! :-) Or maybe you prepare each one once, but as I showed earlier in this thread the PREPARE itself takes 16ms, so for a single execution there is no gain! :-) Stressing with this kind of short and simple query (and note they even contain joins - they could be made simpler still :-)) gives me a result that is guaranteed to show my worst case - then I know that if I have to deploy a bombarding OLTP-like application, my database engine will be able to sustain such a workload, and if I have performance problems they are inside the application! :-) (well, it's very simplistic, but it's not far from the truth :-)) Now, as you see from your explanation, Part #2 is the dominant one - so instead of blaming this query, why not implement a QUERY PLANNER CACHE??? - so that if any *similar* query is recognized by the parser we simply *reuse* the same plan?.. Rgds, -Dimitri On 5/11/09, Aidan Van Dyk wrote: > * Dimitri [090511 11:18]: >> Folks, it's completely crazy, but here is what I found: >> >> - if HISTORY table is analyzed with target 1000 my SELECT response >> time is jumping to 3ms, and the max throughput is limited to 6.000 TPS >> (it's what happenned to 8.3.7) >> >> -if HISTORY table is analyzed with target 5 - my SELECT response time >> is decreasing to 1.2ms (!) and then my max TPS level is ~12.000 ! >> and CPU is used up to 95% even by 8.3.7 :-) and 8.4 performed better >> just because I left its analyze target to default 100 value. >> >> Anyone may explain me why analyze target may have so huge negative >> secondary effect?.. > > It's actually pretty straight forward. > > The PostgreSQL query planner is a "smart planner". It takes into > consideration all the statistics available on the columns/tables, > expected outputs based on inputs, etc, to choose what it thinks will be > the best plan. The more data you have in statistics (the larger > statistics target you have), the more CPU time and longer it's going to > take to "plan" your queries. The tradeoff is hopefully better plans. > > But, in your scenario, where you are hitting the database with the > absolute worst possible way to use PostgreSQL, with small, repeated, > simple queries, you're not getting the advantage of "better" plans. In > your case, you're throwing absolutely simple queries at PG as fast as > you can, and for each query, PostgreSQL has to: > > 1) Parse the given "query string" > 2) Given the statistics available, plan the query and pick the best one > 3) Actually run the query. > > Part 2 is going to dominate the CPU time in your tests, more so the more > statistics it has to evaluate, and unless the data has to come from the > disks (i.e. not in shared buffers or cache) is thus going to dominate the > time before you get your results. More statistics means more time > needed to do the planning/picking of the query. > > If you were to use prepared statements, the cost of #1 and #2 is done > once, and then every time you throw a new execution of the query to > PostgreSQL, you get to just do #3, the easy quick part, especially for > small simple queries where all the data is in shared buffers or the cache. > > a. > > -- > Aidan Van Dyk Create like a god, > ai...@highrise.ca command like a king, > http://www.highrise.ca/ work like a slave. 
> -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
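To make Aidan's three steps concrete, here is a minimal sketch of paying the parse/plan cost (steps 1 and 2) once per session with PREPARE/EXECUTE; the column list and database name are assumptions, only the table and column names visible in the plans later in the thread are reused.

# Sketch only: parse + plan once, then execute repeatedly in the same session.
psql -d db_stress <<'EOF'
PREPARE get_history (char) AS
    SELECT h.horder, h.ref_stat, s.ref
      FROM history h
      JOIN stat s ON s.ref = h.ref_stat
     WHERE h.ref_object = $1
     ORDER BY h.horder;
-- steps 1 and 2 happened above, once; each EXECUTE below is only step 3
EXECUTE get_history('01');
EXECUTE get_history('01');
EOF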
Re: [PERFORM] Any better plan for this query?..
>> So, why I don't use prepare here: let's say I'm testing the worst >> stress case :-) Imagine you have thousands of such kind of queries - >> you cannot prepare all of them! :-) > > Thousands? Surely there'll be a dozen or three of most common queries, > to which you pass different parameters. You can prepare those. OK, and what if each client just connects to the database, executes each kind of query just *once* and then disconnects?.. - the cost of PREPARE will kill performance here if the statement is not reused at least 10 times within the same session. Well, I know, we can always do better, and even use stored procedures, etc. etc. > >> Now, as you see from your explanation, the Part #2 is the most >> dominant - so why instead to blame this query not to implement a QUERY >> PLANNER CACHE??? - in way if any *similar* query is recognized by >> parser we simply *reuse* the same plan?.. > > This has been discussed in the past, but it turns out that a real > implementation is a lot harder than it seems. OK. If I remember correctly, Oracle has one and it helps a lot, but it's certainly not easy to implement.. Rgds, -Dimitri -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Nice to know. But again, if this cache is kept only on the client side it'll be always lost on disconnect. And if clients are "short-lived" it'll not help. BTW, is there an option to say "do execution plan as simple as possible"? If you're sure about your data and your indexes - don't need to spend so much time. Rgds, -Dimitri On 5/12/09, Heikki Linnakangas wrote: > Dimitri wrote: >> Now, as you see from your explanation, the Part #2 is the most >> dominant - so why instead to blame this query not to implement a QUERY >> PLANNER CACHE??? - in way if any *similar* query is recognized by >> parser we simply *reuse* the same plan?.. > > At least in JDBC, there's several open source prepared statement cache > implementations out there that people use. I don't know about other > client libraries, but it certainly is possible to do in the client. > > -- >Heikki Linnakangas >EnterpriseDB http://www.enterprisedb.com > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
It's just one of the test conditions - "what if there 2000 users?" - I know I may use pgpool or others, but I also need to know the limits of the database engine itself.. For the moment I'm limiting to 256 concurrent sessions, but config params are kept like for 2000 :-) Rgds, -Dimitri On 5/12/09, Laurent Laborde wrote: > On Mon, May 11, 2009 at 6:31 PM, Dimitri wrote: >> Hi Kevin, >> >> PostgreSQL: 8.3.7 & 8.4 >> Server: Sun M5000 32cores >> OS: Solaris 10 >> >> current postgresql.conf: >> >> # >> max_connections = 2000 # (change requires restart) > > Are you sure about the 2000 connections ? > Why don't you use a pgbouncer or pgpool instead ? > > > -- > F4FQM > Kerunix Flan > Laurent Laborde > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
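As an illustration of Laurent's suggestion, a minimal pgbouncer setup in transaction-pooling mode could look roughly like the sketch below; the database name, addresses and pool sizes are assumptions, and the authentication settings are deliberately left out (see the pgbouncer documentation).

# Sketch only: accept up to 2000 client sessions but keep only ~64 backends busy.
cat > pgbouncer.ini <<'EOF'
[databases]
db_stress = host=127.0.0.1 port=5432 dbname=db_stress

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 64
; auth_type / auth_file omitted here - see the pgbouncer docs
EOF
pgbouncer -d pgbouncer.ini
# clients then connect to port 6432 instead of 5432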
Re: [PERFORM] Any better plan for this query?..
Folks, before you start to think "what a dumb guy doing a dumb thing" :-)) I'll explain you few details: it's for more than 10 years I'm using a db_STRESS kit (http://dimitrik.free.fr/db_STRESS.html) to check databases performance and scalability. Until now I was very happy with results it gave me as it stress very well each database engine internals an put on light some things I should probably skip on other workloads. What do you want, with a time the "fast" query executed before in 500ms now runs within 1-2ms - not only hardware was improved but also database engines increased their performance a lot! :-)) In 2007 I've published the first public results with PostgreSQL, and it was 2 times faster on that time comparing to MySQL (http://dimitrik.free.fr/db_STRESS_BMK_Part1.html) Last month for the launching of MySQL 5.4 I've done a long series of tests and at the end for my curiosity I've executed the same load against PostgreSQL 8.3.7 to see if MySQL is more close now. For my big surprise, MySQL was faster! As well observations on PG processing bring me a lot of questions - I supposed something was abnormal on PG side, but I did not have too much time to understand what it was exactly (http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5443) What I'm trying to do now is to understand what exactly is the problem. What I discovered so far with all your help: - the impact of a planner - the impact of the analyze target - the impact of prepare / execute - scalability limit on 32 cores I'll also try to adapt prepare/execute solution to see how much it improves performance and/or scalability. As well helping from the other thread I was able to improve a lot the TPS stability on read+write workload! :-) Any other comments are welcome! Rgds, -Dimitri On 5/12/09, Dimitri Fontaine wrote: > Hi, > > Dimitri writes: > >>>> So, why I don't use prepare here: let's say I'm testing the worst >>>> stress case :-) Imagine you have thousands of such kind of queries - >>>> you cannot prepare all of them! :-) >>> >>> Thousands? Surely there'll be a dozen or three of most common queries, >>> to which you pass different parameters. You can prepare thoseu >> >> Ok, and if each client just connect to the database, execute each kind >> of query just *once* and then disconnect?.. - cost of prepare will >> kill performance here if it's not reused at least 10 times within the >> same session. > > In a scenario which looks like this one, what I'm doing is using > pgbouncer transaction pooling. Now a new connection from client can be > served by an existing backend, which already has prepared your > statement. > > So you first SELECT name FROM pg_prepared_statements; to know if you > have to PREPARE or just EXECUTE, and you not only maintain much less > running backends, lower fork() calls, but also benefit fully from > preparing the statements even when you EXECUTE once per client > connection. > >> Well, I know, we always can do better, and even use stored procedures, >> etc. etc. > > Plain SQL stored procedure will prevent PostgreSQL to prepare your > queries, only PLpgSQL functions will force transparent plan caching. But > calling this PL will cost about 1ms per call in my tests, so it's not a > good solution. > > It's possible to go as far as providing your own PostgreSQL C module > where you PREPARE at _PG_init() time and EXECUTE in a SQL callable > function, coupled with pgbouncer it should max out the perfs. But maybe > you're not willing to go this far. 
> > Anyway, is hammering the server with always the same query your real > need or just a simplified test-case? If the former, you'll see there are > good ways to theorically obtain better perfs than what you're currently > reaching, if the latter I urge you to consider some better benchmarking > tools, such as playr or tsung. > > https://area51.myyearbook.com/trac.cgi/wiki/Playr > http://tsung.erlang-projects.org/ > http://pgfouine.projects.postgresql.org/tsung.html > http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php > > Regards, > -- > dim > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
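The pg_prepared_statements check Dimitri Fontaine describes can be illustrated with a short sketch; the statement name is hypothetical, and note that prepared statements are per backend, so the check only makes sense inside the session (or pooled backend) that will run the EXECUTE.

# Sketch only: within ONE session, see what this backend has already prepared.
psql -d db_stress <<'EOF'
SELECT name FROM pg_prepared_statements;
-- if 'get_history' is not in the list, the client issues its PREPARE once;
-- afterwards it only sends: EXECUTE get_history('01');
EOF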
Re: [PERFORM] Any better plan for this query?..
Wow, Simon! :-)) Yes, I work in the Sun Benchmark Center :-)) (I don't use my Sun email on public lists only to avoid spam), and the fact that I came here asking questions probably proves my intention to show PostgreSQL in its best light, no?.. - I never liked "not honest" comparisons :-)) Regarding your bet: from a very young age I learned one thing - take any two people who bet on anything and you'll find one idiot and one bastard :-)) The idiot, because he bets while lacking knowledge, and the bastard, because knowing the truth it's not honest to profit from idiots :-)) That's why I've never bet in my life, but in such situations I always tell this same story... Did you like it? ;-)) However, no problem to give credit to you as well as to the whole pg-perf list, as it provides very valuable help! :-)) Rgds, -Dimitri On 5/12/09, Simon Riggs wrote: > > On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote: > >> For my big surprise, MySQL was faster! > > Ours too. > > ** I bet you $1000 that I can improve the performance of your benchmark > results with PostgreSQL. You give me $1000 up-front and if I can't > improve your high end numbers I'll give you $2000 back. Either way, you > name me and link to me from your blog. Assuming you re-run the tests as > requested and give me reasonable access to info and measurements. ** > > I note your blog identifies you as a Sun employee. Is that correct? If > you do not give us the opportunity to improve upon the results then > reasonable observers might be persuaded you did not wish to show > PostgreSQL in its best light. You up for it? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
For the moment I'm even not considering any scalability issues on the Read+Write workload - it may always be related to the storage box, and storage latency or controller/cache efficiency may play a lot. As problem I'm considering a scalability issue on Read-Only workload - only selects, no disk access, and if on move from 8 to 16 cores we gain near 100%, on move from 16 to 32 cores it's only 10%... I think I have to replay Read-Only with prepare/execute and check how much it'll help (don't know if there are some internal locking used when a planner is involved).. And yes, I'll try to profile on 32 cores, it makes sense. Rgds, -Dimitri On 5/12/09, Heikki Linnakangas wrote: > Dimitri wrote: >> What I discovered so far with all your help: >> - the impact of a planner >> - the impact of the analyze target >> - the impact of prepare / execute >> - scalability limit on 32 cores > > You've received good advice on how to minimize the impact of the first > three points, and using those techniques should bring a benefit. But I'm > pretty surprised by the bad scalability you're seeing and no-one seems > to have a good idea on where that limit is coming from. At a quick > glance, I don't see any inherent bottlenecks in the schema and workload. > > If you could analyze where the bottleneck is with multiple cores, that > would be great. With something like oprofile, it should be possible to > figure out where the time is spent. > > My first guess would be the WALInsertLock: writing to WAL is protected > by that and it an become a bottleneck with lots of small > UPDATE/DELETE/INSERT transactions. But a profile would be required to > verify that. > > -- >Heikki Linnakangas >EnterpriseDB http://www.enterprisedb.com > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Wait, wait - currently I'm playing the "stress scenario", so there are only 256 sessions max, but think time is zero (full stress). The scenario with 1600 users is to test how solid the database is at simply holding a huge number of users, each doing only one transaction per second (a very low global TPS compared to what the database is able to do, but it tests how well its internals manage the user sessions). I did not plan to run the 1600-user test this time (it all depends on time :-)) So, do I need to increase WAL buffers for 256 users? My LOG and DATA have been placed on separate storage LUNs and controllers from the beginning. I've now changed checkpoint_completion_target from the default 0.5 to 0.8 - should I go up to 0.95?.. Also, to avoid TPS "waves" and bring stability to the Read+Write workload I followed the advice from a parallel thread: bgwriter_lru_maxpages = 1000 bgwriter_lru_multiplier = 4.0 shared_buffers = 1024MB I've also tried shared_buffers=256MB as advised, but then the Read-Only workload loses performance, as PG's own caching helps anyway. Also, checkpoint_timeout is 30s now, and of course a huge difference came from moving default_statistics_target to 5! - but this one I found myself :-)) Probably checkpoint_timeout can be bigger now with the current settings? - the goal here is to keep Read+Write TPS as stable as possible and also to avoid a long recovery in case of a system/database/other crash (in theory). Rgds, -Dimitri On 5/12/09, Simon Riggs wrote: > > On Tue, 2009-05-12 at 12:19 +0200, Dimitri wrote: > >> What I'm trying to do now is to understand what exactly is the >> problem. > > You're running with 1600 users, which is above the scalability limit > uncovered (by Sun...) during earlier benchmarking. The scalability > issues are understood but currently considered above the > reasonable-setting limit and so nobody has been inclined to improve > matters. > > You should use a connection concentrator to reduce the number of > sessions down to say 400. > > Your WAL buffers setting is also too low and you will be experiencing > contention on the WALWriteLock. Increase wal_buffers to about x8 where > you have it now. > > You can move pg_xlog to its own set of drives. > > Set checkpoint_completion_target to 0.95. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
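A minimal sketch of the WAL/checkpoint changes Simon suggests, assuming the wal_buffers = 2MB shown earlier in the thread (so x8 = 16MB); the data directory path and the checkpoint_timeout value are illustrative assumptions.

# Sketch only: larger WAL buffers, spread-out checkpoints.
PGDATA=/pgdata/8.4                    # assumption - adjust to your layout
cat >> $PGDATA/postgresql.conf <<'EOF'
wal_buffers = 16MB                    # was 2MB; x8 as suggested (needs restart)
checkpoint_completion_target = 0.95
checkpoint_timeout = 15min            # longer than 30s => fewer, smoother checkpoints
EOF
pg_ctl -D $PGDATA restart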
Re: [PERFORM] Any better plan for this query?..
Hi Stefan, sorry, I did not have a time to bring all details into the toolkit - but at least I published it instead to tell a "nice story" about :-) The client process is a binary compiled with libpq. Client is interpreting a scenario script and publish via SHM a time spent on each SQL request. I did not publish sources yet as it'll also require to explain how to compile them :-)) So for the moment it's shipped as a freeware, but with time everything will be available (BTW, you're the first who asking for sources (well, except IBM guys who asked to get it on POWER boxes, but it's another story :-)) What is good is each client is publishing *live* its internal stats an we're able to get live data and follow any kind of "waves" in performance. Each session is a single process, so there is no contention between clients as you may see on some other tools. The current scenario script contains 2 selects (representing a Read transaction) and delete/insert/update (representing Write transaction). According a start parameters each client executing a given number Reads per Write. It's connecting on the beginning and disconnecting at the end of the test. It's also possible to extend it to do other queries, or simply give to each client a different scenario script - what's important is to able to collect then its stats live to understand what's going wrong (if any).. I'm planning to extend it and give an easy way to run it against any database schema, it's only question of time.. Rgds, -Dimitri On 5/12/09, Stefan Kaltenbrunner wrote: > Dimitri wrote: >> Folks, before you start to think "what a dumb guy doing a dumb thing" :-)) >> I'll explain you few details: >> >> it's for more than 10 years I'm using a db_STRESS kit >> (http://dimitrik.free.fr/db_STRESS.html) to check databases >> performance and scalability. Until now I was very happy with results >> it gave me as it stress very well each database engine internals an >> put on light some things I should probably skip on other workloads. >> What do you want, with a time the "fast" query executed before in >> 500ms now runs within 1-2ms - not only hardware was improved but also >> database engines increased their performance a lot! :-)) > > I was attempting to look into that "benchmark" kit a bit but I find the > information on that page a bit lacking :( a few notices: > > * is the sourcecode for the benchmark actually available? the "kit" > seems to contain a few precompiled binaries and some source/headfiles > but there are no building instructions, no makefile or even a README > which makes it really hard to verify exactly what the benchmark is doing > or if the benchmark client might actually be the problem here. > > * there is very little information on how the toolkit talks to the > database - some of the binaries seem to contain a static copy of libpq > or such? > > * how many queries per session is the toolkit actually using - some > earlier comments seem to imply you are doing a connect/disconnect cycle > for every query ist that actually true? > > > Stefan > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Robert, what I'm testing now is 256 users max. The workload grows progressively from 1, 2, 4, 8 ... to 256 users. Of course the max throughput is reached at a number of users equal to 2 * the number of cores, but what's important for me here is that the database should continue to sustain the workload! - response time regresses, but the throughput should remain about the same. So, do I really need a pooler to keep 256 users working?? - I don't think so, but please correct me. BTW, I was not looking to put PostgreSQL in bad conditions - the test is the test, and as I said, 2 years ago PostgreSQL outperformed MySQL on the same test case, and nothing was done within the MySQL code to improve it explicitly for db_STRESS.. And I stay pretty honest when I'm testing something. Rgds, -Dimitri On 5/12/09, Robert Haas wrote: > On Tue, May 12, 2009 at 8:59 AM, Dimitri wrote: >> Wait wait, currently I'm playing the "stress scenario", so there are >> only 256 sessions max, but thing time is zero (full stress). Scenario >> with 1600 users is to test how database is solid just to keep a huge >> amount of users, but doing only one transaction per second (very low >> global TPS comparing to what database is able to do, but it's testing >> how well its internals working to manage the user sessions). > > Didn't we beat this to death in mid-March on this very same list? > Last time I think it was Jignesh Shah. AIUI, it's a well-known fact > that PostgreSQL doesn't do very well at this kind of workload unless > you use a connection pooler. > > *goes and checks the archives* Sure enough, 116 emails under the > subject line "Proposal of tunable fix for scalability of 8.4". > > So, if your goal is to find a scenario under which PostgreSQL performs > as badly as possible, congratulations - you've discovered the same > case that we already knew about. Obviously it would be nice to > improve it, but IIRC so far no one has had any very good ideas on how > to do that. If this example mimics a real-world workload that you > care about, and if using a connection pooler is just not a realistic > option in that scenario for whatever reason, then you'd be better off > working on how to fix it than on measuring it, because it seems to me > we already know it's got problems, per previous discussions. > > ...Robert > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the most optimal config parameters to keep stable write TPS ?..
Good point! I missed it.. - will 20MB be enough? Rgds, -Dimitri On 5/12/09, Julian v. Bock wrote: > Hi > >>>>>> "D" == Dimitri writes: > > D> current postgresql.conf: > > D> # > D> max_connections = 2000 # (change requires restart) > D> temp_buffers = 200MB > > temp_buffers are kept per connection and not freed until the session > ends. If you use some kind of connection pooling this can eat up a lot > of ram that could be used for caching instead. > > Regards, > Julian > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
On 5/12/09, Stefan Kaltenbrunner wrote: > Dimitri wrote: >> Hi Stefan, >> >> sorry, I did not have a time to bring all details into the toolkit - >> but at least I published it instead to tell a "nice story" about :-) > > fair point and appreciated. But it seems important that benchmarking > results can be verified by others as well... until now there were only people running Solaris or Linux :-)) > >> >> The client process is a binary compiled with libpq. Client is >> interpreting a scenario script and publish via SHM a time spent on >> each SQL request. I did not publish sources yet as it'll also require >> to explain how to compile them :-)) So for the moment it's shipped as >> a freeware, but with time everything will be available (BTW, you're >> the first who asking for sources (well, except IBM guys who asked to >> get it on POWER boxes, but it's another story :-)) > > well there is no licence tag(or a copyright notice) or anything als > associated with the download which makes it a bit harder than it really > needs to be. > The reason why I was actually looking for the source is that all my > available benchmark platforms are none of the ones you are providing > binaries for which kinda reduces its usefulness. > agree, will improve this point >> >> What is good is each client is publishing *live* its internal stats an >> we're able to get live data and follow any kind of "waves" in >> performance. Each session is a single process, so there is no >> contention between clients as you may see on some other tools. The >> current scenario script contains 2 selects (representing a Read >> transaction) and delete/insert/update (representing Write >> transaction). According a start parameters each client executing a >> given number Reads per Write. It's connecting on the beginning and >> disconnecting at the end of the test. > > well I have seen clients getting bottlenecked internally (like wasting > more time in getting rid/absorbing of the actual result than it took the > server to generate the answer...). > How sure are you that your "live publishing of data" does not affect the > benchmark results(because it kinda generates an artifical think time) > for example? On all my test tools client are publishing their data via shared memory segment (ISM), all they do is just *incrementing* their current stats values and continuing their processing. Another dedicated program should be executed to print these stats - it's connecting to the same SHM segment and printing a *difference* between values for the current and the next interval. Let me know if you need more details. > But what I get from your answer is that you are basically doing one > connect/disconnect per client and the testcase you are talking about has > 256 clients? Exactly, only one connect/disconnect per test, and number of clients is growing progressively from 1, 2, 4, 8, 16, .. to 256 Rgds, -Dimitri > > > Stefan > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
No, they keep connections till the end of the test. Rgds, -Dimitri On 5/12/09, Joshua D. Drake wrote: > On Tue, 2009-05-12 at 17:22 +0200, Dimitri wrote: >> Robert, what I'm testing now is 256 users max. The workload is growing >> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max >> throughput is reached on the number of users equal to 2 * number of >> cores, but what's important for me here - database should continue to >> keep the workload! - response time regressing, but the troughput >> should remain near the same. >> >> So, do I really need a pooler to keep 256 users working?? - I don't >> think so, but please, correct me. > > If they disconnect and reconnect yes. If they keep the connections live > then no. > > Joshua D. Drake > > -- > PostgreSQL - XMPP: jdr...@jabber.postgresql.org >Consulting, Development, Support, Training >503-667-4564 - http://www.commandprompt.com/ >The PostgreSQL Company, serving since 1997 > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
On 5/12/09, Robert Haas wrote: > On Tue, May 12, 2009 at 1:00 PM, Dimitri wrote: >> On MySQL there is no changes if I set the number of sessions in the >> config file to 400 or to 2000 - for 2000 it'll just allocate more >> memory. > > I don't care whether the setting affects the speed of MySQL. I want > to know if it affects the speed of PostgreSQL. The problem is that they both have a "max_connections" parameter, so since you asked about MySQL I answered for MySQL; I have not tested it yet for PostgreSQL - that will be in the next series.. > >> After latest fix with default_statistics_target=5, version 8.3.7 is >> running as fast as 8.4, even 8.4 is little little bit slower. >> >> I understand your position with a pooler, but I also want you think >> about idea that 128 cores system will become a commodity server very >> soon, and to use these cores on their full power you'll need a >> database engine capable to run 256 users without pooler, because a >> pooler will not help you here anymore.. > > So what? People with 128-core systems will not be running trivial > joins that return in 1-2ms and have one second think times between > them. And if they are, and if they have nothing better to do than > worry about whether MySQL can process those queries in 1/2000th of the > think time rather than 1/1000th of the think time, then they can use > MySQL. If we're going to worry about performance on 128-core system, > we would be much better advised to put our efforts into parallel query > execution than how many microseconds it takes to execute very simple > queries. Do you really think that nowadays, for example, a web forum application with PG as its backend will have queries slower than 1-2ms to print a thread message in your browser??? Or banking transactions?? > > Still, I have no problem with making PostgreSQL faster in the case > you're describing. I'm just not interested in doing it on my own time > for free. I am sure there are a number of people who read this list > regularly who would be willing to do it for money, though. Maybe even > me. :-) > > ...Robert > You don't need to believe me, but I am doing it for free - I still have my own work to finish in parallel :-)) And at the same time I don't see any other way to learn and improve my knowledge, but nobody is perfect :-)) Rgds, -Dimitri -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Scott, On 5/12/09, Scott Carey wrote: > Although nobody wants to support it, he should try the patch that Jignesh K. > Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it > makes 32 cores much faster, then we have a smoking gun. > > Although everyone here is talking about this as an 'unoptimal' solution, the > fact is there is no evidence that a connection pooler will fix the > scalability from 16 > 32 cores. > Certainly a connection pooler will help most results, but it may not fix the > scalability problem. > > A question for Dimitri: > What is the scalability from 16 > 32 cores at the 'peak' load that occurs > near 2x the CPU count? Is it also poor? If this is also poor, IMO the > community here should not be complaining about this unopimal case -- a > connection pooler at that stage does little and prepared statements will > increase throughput but not likely alter scalability. I'm attaching a small graph showing a TPS level on PG 8.4 depending on number of cores (X-axis is a number of concurrent users, Y-axis is the TPS number). As you may see TPS increase is near linear while moving from 8 to 16 cores, while on 32cores even it's growing slightly differently, what is unclear is why TPS level is staying limited to 11.000 TPS on 32cores. And it's pure read-only workload. > > If that result scales, then the short term answer is a connection pooler. > > In the tests that Jingesh ran -- making the ProcArrayLock faster helped the > case where connections = 2x the CPU core count quite a bit. > > The thread about the CPU scalability is "Proposal of tunable fix for > scalability of 8.4", originally posted by "Jignesh K. Shah" > , March 11 2009. > > It would be very useful to see results of this benchmark with: > 1. A Connection Pooler will not help, as each client is *not* disconnecting/reconnecting during the test, as well PG is keeping well even 256 users. And TPS limit is reached already on 64 users, don't think pooler will help here. > 2. Jignesh's patch I've already tested it and it did not help in my case because the real problem is elsewhere.. (however, I did not test it yet with my latest config params) > 3. Prepared statements > yes, I'm preparing this test. > #3 is important, because prepared statements are ideal for queries that > perform well with low statistics_targets, and not ideal for those that > require high statistics targets. Realistically, an app won't have more than > a couple dozen statement forms to prepare. Setting the default statistics > target to 5 is just a way to make some other query perform bad. Agree, but as you may have a different statistic target *per* table it should not be a problem. What is sure - all time spent on parse and planner will be removed here, and the final time should be a pure execution. Rgds, -Dimitri > > > On 5/12/09 10:53 AM, "Alvaro Herrera" wrote: > >> Andres Freund escribió: >> >>> Naturally it would still be nice to be good in this not optimal >>> workload... >> >> I find it hard to justify wasting our scarce development resources into >> optimizing such a contrived workload. >> >> -- >> Alvaro Herrera >> http://www.CommandPrompt.com/ >> The PostgreSQL Company - Command Prompt, Inc. 
>> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > > <> -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
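Dimitri's remark about per-table statistics can be made concrete: the statistics target can be lowered for just the column whose overly detailed statistics slow down planning, leaving the global default alone. A minimal sketch, where the choice of history.ref_object is an assumption based on the plans shown later in the thread:

# Sketch only: lower the per-column target instead of default_statistics_target.
psql -d db_stress <<'EOF'
ALTER TABLE history ALTER COLUMN ref_object SET STATISTICS 5;
ANALYZE history;      -- rebuild the statistics with the new target
EOF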
Re: [PERFORM] Any better plan for this query?..
I'm also confused, but it seems the discussion is producing other ideas as well :-) But yes, each client connects to the database server only *once*. To be precise, the test runs like this: - 1 client is started => 1 in total - sleep ... - 1 more client is started => 2 in total - sleep .. - 2 more clients are started => 4 in total - sleep .. ... ... ===> 256 in total - sleep ... - kill clients So I'm even able to monitor how each new client impacts all the others. The test kit is flexible enough to prepare any kind of stress situation (a sketch of this ramp-up loop is shown after this message). Rgds, -Dimitri On 5/12/09, Glenn Maynard wrote: > I'm sorry, but I'm confused. Everyone keeps talking about connection > pooling, but Dimitri has said repeatedly that each client makes a > single connection and then keeps it open until the end of the test, > not that it makes a single connection per SQL query. Connection > startup costs shouldn't be an issue. Am I missing something here? > test(N) starts N clients, each client creates a single connection and > hammers the server for a while on that connection. test(N) is run for > N=1,2,4,8...256. This seems like a very reasonable test scenario. > > -- > Glenn Maynard > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
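A rough sketch of such a ramp-up driver is below; db_stress_client and its options are hypothetical placeholders for the real kit, and the sleep intervals are arbitrary.

# Sketch only: progressive ramp-up 1, 2, 4, ... up to 256 concurrent clients.
TOTAL=0
for STEP in 1 1 2 4 8 16 32 64 128; do
    for i in $(seq 1 $STEP); do
        db_stress_client --scenario scenario.txt --db db_stress &
    done
    TOTAL=$((TOTAL + STEP))
    echo "clients running: $TOTAL"
    sleep 60                       # observe how the new clients impact the others
done
sleep 300                          # hold the full 256-client load for a while
pkill db_stress_client             # kill clients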
Re: [PERFORM] Any better plan for this query?..
On MySQL there is no changes if I set the number of sessions in the config file to 400 or to 2000 - for 2000 it'll just allocate more memory. After latest fix with default_statistics_target=5, version 8.3.7 is running as fast as 8.4, even 8.4 is little little bit slower. I understand your position with a pooler, but I also want you think about idea that 128 cores system will become a commodity server very soon, and to use these cores on their full power you'll need a database engine capable to run 256 users without pooler, because a pooler will not help you here anymore.. Rgds, -Dimitri On 5/12/09, Robert Haas wrote: > On Tue, May 12, 2009 at 11:22 AM, Dimitri wrote: >> Robert, what I'm testing now is 256 users max. The workload is growing >> progressively from 1, 2, 4, 8 ... to 256 users. Of course the Max >> throughput is reached on the number of users equal to 2 * number of >> cores, but what's important for me here - database should continue to >> keep the workload! - response time regressing, but the troughput >> should remain near the same. >> >> So, do I really need a pooler to keep 256 users working?? - I don't >> think so, but please, correct me. > > Not an expert on this, but there has been a lot of discussion of the > importance of connection pooling in this space. Is MySQL still faster > if you lower max_connections to a value that is closer to the number > of users, like 400 rather than 2000? > >> BTW, I did not look to put PostgreSQL in bad conditions - the test is >> the test, and as I said 2 years ago PostgreSQL outperformed MySQL on >> the same test case, and there was nothing done within MySQL code to >> improve it explicitly for db_STRESS.. And I'm staying pretty honest >> when I'm testing something. > > Yeah but it's not really clear what that something is. I believe you > said upthread that PG 8.4 beta 1 is faster than PG 8.3.7, but PG 8.4 > beta 1 is slower than MySQL 5.4 whereas PG 8.3.7 was faster than some > older version of MySQL. So PG got faster and MySQL got faster, but > they sped things up more than we did. If our performance were getting > WORSE, I'd be worried about that, but the fact that they were able to > make more improvement on this particular case than we were doesn't > excite me very much. Sure, I'd love it if PG were even faster than it > is, and if you have a suggested patch please send it in... or if you > want to profile it and send the results that would be great too. But > I guess my point is that the case of a very large number of > simultaneous users with pauses-for-thought between queries has already > been looked at in the very recent past in a way that's very similar to > what you are doing (and by someone who works at the same company you > do, no less!) so I'm not quite sure why we're rehashing the issue. > > ...Robert > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
The idea is good, but *only* pooling will be not enough. I mean if all what pooler is doing is only keeping no more than N backends working - it'll be not enough. You never know what exactly your query will do - if you choose your N value to be sure to not overload CPU and then some of your queries start to read from disk - you waste your idle CPU time because it was still possible to run other queries requiring CPU time rather I/O, etc... I wrote some ideas about an "ideal" solution here (just omit the word "mysql" - as it's a theory it's valable for any db engine): http://dimitrik.free.fr/db_STRESS_MySQL_540_and_others_Apr2009.html#note_5442 Rgds, -Dimitri On 5/13/09, Kevin Grittner wrote: > Glenn Maynard wrote: >> I'm sorry, but I'm confused. Everyone keeps talking about >> connection pooling, but Dimitri has said repeatedly that each client >> makes a single connection and then keeps it open until the end of >> the test, not that it makes a single connection per SQL query. >> Connection startup costs shouldn't be an issue. Am I missing >> something here? > > Quite aside from the overhead of spawning new processes, if you have > more active connections than you have resources for them to go after, > you just increase context switching and resource contention, both of > which have some cost, without any offsetting gains. That would tend > to explain why performance tapers off after a certain point. A > connection pool which queues requests prevents this degradation. > > It would be interesting, with each of the CPU counts, to profile > PostgreSQL at the peak of each curve to see where the time goes when > it is operating with an optimal poolsize. Tapering after that point > is rather uninteresting, and profiles would be less useful beyond that > point, as the noise from the context switching and resource contention > would make it harder to spot issues that really matter.. > > -Kevin > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Folks, sorry, I'm a bit overtaken by the events :-)) I've finished the tests with PREPARE/EXECUTE - it's much faster of course, and the max TPS is now 15.000 on 24 cores! - I've run various tests to see where the limiting bottleneck may be - it's more likely something timer- or interrupt-based, etc. Nothing special via DTrace, or maybe it will tell you more than it tells me, but for a 10sec period the wait time is quite small:

# lwlock_wait_8.4.d `pgrep -n postgres`

             Lock Id            Mode    Count
 FirstBufMappingLock       Exclusive        1
    FirstLockMgrLock       Exclusive        1
     BufFreelistLock       Exclusive        3
 FirstBufMappingLock          Shared        4
    FirstLockMgrLock          Shared        4

             Lock Id            Mode    Combined Time (ns)
    FirstLockMgrLock       Exclusive                803700
     BufFreelistLock       Exclusive               3001600
    FirstLockMgrLock          Shared               4586600
 FirstBufMappingLock       Exclusive               6283900
 FirstBufMappingLock          Shared              21792900

At the same time these lock waits appear only on 24 or 32 cores. I plan to replay this case on a bigger server (64 cores or more) - it will be much more evident whether the problem is in the locks. Currently I'm finishing my report with all the data you asked for (system graphs, pgsql, and other). I'll publish it on my web site and send you a link. Rgds, -Dimitri On 5/14/09, Simon Riggs wrote: > > On Tue, 2009-05-12 at 14:28 +0200, Dimitri wrote: > >> As problem I'm considering a scalability issue on Read-Only workload - >> only selects, no disk access, and if on move from 8 to 16 cores we >> gain near 100%, on move from 16 to 32 cores it's only 10%... > > Dimitri, > > Will you be re-running the Read-Only tests? > > Can you run the Dtrace script to assess LWlock contention during the > run? > > Would you re-run the tests with a patch? > > Thanks, > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
It's absolutely great! it'll not help here because a think time is 0. but for any kind of solution with a spooler it's a must to try! Rgds, -Dimitri On 5/13/09, Dimitri Fontaine wrote: > Hi, > > Le 13 mai 09 à 18:42, Scott Carey a écrit : >>> will not help, as each client is *not* disconnecting/reconnecting >>> during the test, as well PG is keeping well even 256 users. And TPS >>> limit is reached already on 64 users, don't think pooler will help >>> here. >> >> Actually, it might help a little. Postgres has a flaw that makes >> backends >> block on a lock briefly based on the number of total backends -- >> active or >> completely passive. Your tool has some (very small) user-side delay >> and a >> connection pooler would probably allow 64 of your users to >> efficiently 'fit' >> in 48 or so connection pooler slots. > > It seems you have think time, and I'm only insisting on what Scott > said, but having thinktime means a connection pool can help. Pgbouncer > is a good choice because it won't even attempt to parse the queries, > and it has a flexible configuration. > >>>> 3. Prepared statements >>> yes, I'm preparing this test. > > It's possible to use prepared statement and benefit from pgbouncer at > the same time, but up until now it requires the application to test > whether its statements are already prepared at connection time, > because the application is not controlling when pgbouncer is reusing > an existing backend or giving it a fresh one. > > As I think I need this solution too, I've coded a PG module to scratch > that itch this morning, and just published it (BSD licenced) on > pgfoundry: >http://preprepare.projects.postgresql.org/README.html >http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/ > > With this module and the proper pgbouncer setup (connect_query='SELECT > prepare_all();') the application has no more to special case the fresh- > backend-nothing-prepared case, it's all transparent, just replace your > SELECT query with its EXECUTE foo(x, y, z) counter part. > > I've took the approach to setup the prepared statements themselves > into a table with columns name and statement, this latter one > containing the full PREPARE SQL command. There's a custom variable > preprepare.relation that has to be your table name (shema qualified). > Each statement that you then put in there will get prepared when you > SELECT prepare_all(); > > Hope this helps, regards, > -- > dim -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Hi Scott, let me now finish my report and regroup all data together, and then we'll continue discussion as it'll come more in debug/profile phase.. - I'll be not polite from my part to send some tons of attachments to the mail list :-) Rgds, -Dimitri On 5/13/09, Scott Carey wrote: > > On 5/13/09 3:22 AM, "Dimitri" wrote: > >> Hi Scott, >> >> On 5/12/09, Scott Carey wrote: >>> Although nobody wants to support it, he should try the patch that Jignesh >>> K. >>> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it >>> makes 32 cores much faster, then we have a smoking gun. >>> >>> Although everyone here is talking about this as an 'unoptimal' solution, >>> the >>> fact is there is no evidence that a connection pooler will fix the >>> scalability from 16 > 32 cores. >>> Certainly a connection pooler will help most results, but it may not fix >>> the >>> scalability problem. >>> >>> A question for Dimitri: >>> What is the scalability from 16 > 32 cores at the 'peak' load that occurs >>> near 2x the CPU count? Is it also poor? If this is also poor, IMO the >>> community here should not be complaining about this unopimal case -- a >>> connection pooler at that stage does little and prepared statements will >>> increase throughput but not likely alter scalability. >> >> I'm attaching a small graph showing a TPS level on PG 8.4 depending on >> number of cores (X-axis is a number of concurrent users, Y-axis is the >> TPS number). As you may see TPS increase is near linear while moving >> from 8 to 16 cores, while on 32cores even it's growing slightly >> differently, what is unclear is why TPS level is staying limited to >> 11.000 TPS on 32cores. And it's pure read-only workload. >> > > Interesting. What hardware is this, btw? Looks like the 32 core system > probably has 2x the CPU and a bit less interconnect efficiency versus the 16 > core one (which would be typical). > Is the 16 core case the same, but with fewer cores per processor active? Or > fewer processors total? > Understanding the scaling difference may require a better understanding of > the other differences besides core count. > >>> >>> If that result scales, then the short term answer is a connection pooler. >>> >>> In the tests that Jingesh ran -- making the ProcArrayLock faster helped >>> the >>> case where connections = 2x the CPU core count quite a bit. >>> >>> The thread about the CPU scalability is "Proposal of tunable fix for >>> scalability of 8.4", originally posted by "Jignesh K. Shah" >>> , March 11 2009. >>> >>> It would be very useful to see results of this benchmark with: >>> 1. A Connection Pooler >> >> will not help, as each client is *not* disconnecting/reconnecting >> during the test, as well PG is keeping well even 256 users. And TPS >> limit is reached already on 64 users, don't think pooler will help >> here. >> > > Actually, it might help a little. Postgres has a flaw that makes backends > block on a lock briefly based on the number of total backends -- active or > completely passive. Your tool has some (very small) user-side delay and a > connection pooler would probably allow 64 of your users to efficiently 'fit' > in 48 or so connection pooler slots. > > It is not about connecting and disconnecting in this case, its about > minimizing Postgres' process count. If this does help, it would hint at > certain bottlenecks. If it doesn't it would point elsewhere (and quiet some > critics). 
> > However, its unrealistic for any process-per-connection system to have less > backends than about 2x the core count -- else any waiting on I/O or network > will just starve CPU. So this would just be done for research, not a real > answer to making it scale better. > > For those who say "but, what if its I/O bound! You don't need more > backends then!": Well you don't need more CPU either if you're I/O bound. > By definition, CPU scaling tests imply the I/O can keep up. > > >>> 2. Jignesh's patch >> >> I've already tested it and it did not help in my case because the real >> problem is elsewhere.. (however, I did not test it yet with my latest >> config params) >> > > Great to hear that! -- That means this case is probably not ProcArrayLock. > If its Solaris, could we get: > 1. What is the CPU stats when it is in the inefficient
Re: [PERFORM] Any better plan for this query?..
Folks, I've just published a full report including all results here: http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html >From my point of view it needs first to understand where the time is wasted on a single query (even when the statement is prepared it runs still slower comparing to MySQL). Then to investigate on scalability issue I think a bigger server will be needed here (I'm looking for 64cores at least :-)) If you have some other ideas or patches (like Simon) - don't hesitate to send them - once I'll get an access to the server again the available test time will be very limited.. Best regards! -Dimitri On 5/18/09, Simon Riggs wrote: > > On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote: > >> # lwlock_wait_8.4.d `pgrep -n postgres` > >>Lock IdMode Combined Time (ns) >> FirstLockMgrLock Exclusive 803700 >>BufFreelistLock Exclusive 3001600 >> FirstLockMgrLock Shared 4586600 >> FirstBufMappingLock Exclusive 6283900 >> FirstBufMappingLock Shared 21792900 > > I've published two patches to -Hackers to see if we can improve the read > only numbers on 32+ cores. > > Try shared_buffer_partitions = 256 > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
Thanks Dave for correction, but I'm also curious where the time is wasted in this case?.. 0.84ms is displayed by "psql" once the result output is printed, and I got similar time within my client (using libpq) which is not printing any output.. Rgds, -Dimitri On 5/18/09, Dave Dutcher wrote: > > What I don't understand is the part where you talking about disabling hash > joins: > >>* result: planner replaced hash join is replaced by merge join >>* execution time: 0.84ms ! >>* NOTE: curiously planner is expecting to execute this query in 0.29ms > - so it's supposed from its logic to be faster, so why this plan is not used > from the beginning???... >> >> Sort (cost=4562.83..4568.66 rows=2329 width=176) (actual > time=0.237..0.237 rows=20 loops=1) >> Sort Key: h.horder >> Sort Method: quicksort Memory: 30kB >> -> Merge Join (cost=4345.89..4432.58 rows=2329 width=176) > (actual time=0.065..0.216 rows=20 loops=1) >> Merge Cond: (s.ref = h.ref_stat) >> -> Index Scan using stat_ref_idx on stat s > (cost=0.00..49.25 rows=1000 width=45) (actual time=0.018..0.089 rows=193 > loops=1) >> -> Sort (cost=4345.89..4351.72 rows=2329 width=135) > (actual time=0.042..0.043 rows=20 loops=1) >> Sort Key: h.ref_stat >> Sort Method: quicksort Memory: 30kB >> -> Index Scan using history_ref_idx on history h > (cost=0.00..4215.64 rows=2329 width=135) (actual time=0.012..0.025 rows=20 > loops=1) >> Index Cond: (ref_object = '01'::bpchar) >> Total runtime: 0.288 ms >>(12 rows) > > The explain analyze ran the query in 0.288 ms. That is the actual time it > took to run the query on the server. It is not an estimate of the time. > You measured 0.84 ms to run the query, which seems to imply either a problem > in one of the timing methods or that 66% of your query execution time is > sending the results to the client. I'm curious how you did you execution > time measurements. > > Dave > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
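One way to see how much of the 0.84ms is client/protocol overhead rather than execution is to compare psql's \timing (full round trip) with the "Total runtime" reported by EXPLAIN ANALYZE (server-side only, plus some instrumentation overhead); a minimal sketch, with an assumed column list built from the names in the plan above:

# Sketch only: client-observed time vs. server-side execution time.
psql -d db_stress <<'EOF'
\timing on
SELECT h.horder, h.ref_stat, s.ref
  FROM history h JOIN stat s ON s.ref = h.ref_stat
 WHERE h.ref_object = '01' ORDER BY h.horder;     -- \timing shows the round trip
EXPLAIN ANALYZE
SELECT h.horder, h.ref_stat, s.ref
  FROM history h JOIN stat s ON s.ref = h.ref_stat
 WHERE h.ref_object = '01' ORDER BY h.horder;     -- "Total runtime" is server-side
EOF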
Re: [PERFORM] Any better plan for this query?..
On 5/18/09, Scott Carey wrote: > Great data Dimitri!' Thank you! :-) > > I see a few key trends in the poor scalability: > > The throughput scales roughly with %CPU fairly well. But CPU used doesn't > go past ~50% on the 32 core tests. This indicates lock contention. > You should not look on #1 STATs, but on #2 - they are all with the latest "fixes" - on all of them CPU is used well (90% in pic on 32cores). Also, keep in mind these cores are having 2 threads, and from Solaris point of view they are seen as CPU (so 64 CPU) and %busy is accounted as for 64 CPU > Other proof of lock contention are the mutex locks / sec graph which climbs exactly, except no locking was seen on processes while I tried to trace them.. What I think will be needed here is a global and corelated tracing of all PG processes - I did not expect to do it now, but next time > rapidly as the system gets more inefficient (along with context switches). > > Another trend is the system calls/sec which caps out with the test, at about > 400,000 per sec on the peak (non-prepared statement) result. Note that when > the buffer size is 256MB, the performance scales much worse and is slower. > And correlated with this the system calls/sec per transaction is more than > double, at slower throughput. of course, because even the data were cached by filesystem to get them you still need to call a read() system call.. > > Using the OS to cache pages is not as fast as pages in shared_buffers, by a > more significant amount with many cores and higher concurrency than in the > low concurrency case. exactly, it's what I also wanted to demonstrate because I often hear "PG is delegating caching to the filesystem" - and I don't think it's optimal :-) > > The system is largely lock limited in the poor scaling results. This holds > true with or without the use of prepared statements -- which help a some, > but not a lot and don't affect the scalability. we are agree here, but again - 20K mutex spins/sec is a quite low value, that's why I hope on the bigger server it'll be more clear where is a bottleneck :-) Rgds, -Dimitri > > > 4096MB shared buffers, 32 cores, 8.4, read only: > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html > > 256MB cache, 32 cores, 8.4, read-only: > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_52.html > > 4096MB shared buffs, 32 cores, 8.4, read only, prepared statements > http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_70.html > > On 5/18/09 11:00 AM, "Dimitri" wrote: > >> Folks, I've just published a full report including all results here: >> http://dimitrik.free.fr/db_STRESS_PostgreSQL_837_and_84_May2009.html >> >> From my point of view it needs first to understand where the time is >> wasted on a single query (even when the statement is prepared it runs >> still slower comparing to MySQL). >> >> Then to investigate on scalability issue I think a bigger server will >> be needed here (I'm looking for 64cores at least :-)) >> >> If you have some other ideas or patches (like Simon) - don't hesitate >> to send them - once I'll get an access to the server again the >> available test time will be very limited.. >> >> Best regards! 
>> -Dimitri >> >> >> On 5/18/09, Simon Riggs wrote: >>> >>> On Thu, 2009-05-14 at 20:25 +0200, Dimitri wrote: >>> >>>> # lwlock_wait_8.4.d `pgrep -n postgres`
>>>> Lock Id                 Mode        Combined Time (ns)
>>>> FirstLockMgrLock        Exclusive               803700
>>>> BufFreelistLock         Exclusive              3001600
>>>> FirstLockMgrLock        Shared                 4586600
>>>> FirstBufMappingLock     Exclusive              6283900
>>>> FirstBufMappingLock     Shared                21792900
>>> >>> I've published two patches to -Hackers to see if we can improve the read >>> only numbers on 32+ cores. >>> >>> Try shared_buffer_partitions = 256 >>> >>> -- >>> Simon Riggs www.2ndQuadrant.com >>> PostgreSQL Training, Services and Support >>> >>> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
On 5/18/09, Simon Riggs wrote: > > On Mon, 2009-05-18 at 20:00 +0200, Dimitri wrote: > >> >From my point of view it needs first to understand where the time is >> wasted on a single query (even when the statement is prepared it runs >> still slower comparing to MySQL). > > There is still a significant number of things to say about these numbers > and much tuning still to do, so I'm still confident of improving those > numbers if we needed to. > > In particular, running the tests repeatedly using > H.REF_OBJECT = '01' > rather than varying the value seems likely to benefit MySQL. The let me repeat again - the reference is *random*, the '01' value I've used just to show a query execution plan. also, what is important - the random ID is chosen in way that no one user use the same to avoid deadlocks previously seen with PostgreSQL (see the "Deadlock mystery" note 2 years ago http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) > distribution of values is clearly non-linear; while Postgres picks a > strange plan for that particular value, I would guess there are also > values for which the MySQL plan is sub-optimal. Depending upon the > distribution of selected data we might see the results go either way. > > What I find worrying is your result of a scalability wall for hash > joins. Is that a repeatable issue? I think yes (but of course I did not try to replay it several times) Rgds, -Dimitri > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
No, Tom, the query cache was off. I always explicitly turn it off on MySQL as it has scalability issues. Rgds, -Dimitri On 5/19/09, Tom Lane wrote: > Simon Riggs writes: >> In particular, running the tests repeatedly using >> H.REF_OBJECT = '01' >> rather than varying the value seems likely to benefit MySQL. > > ... mumble ... query cache? > > regards, tom lane > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
On 5/19/09, Scott Carey wrote: > > On 5/18/09 3:32 PM, "Dimitri" wrote: > >> On 5/18/09, Scott Carey wrote: >>> Great data Dimitri!' >> >> Thank you! :-) >> >>> >>> I see a few key trends in the poor scalability: >>> >>> The throughput scales roughly with %CPU fairly well. But CPU used >>> doesn't >>> go past ~50% on the 32 core tests. This indicates lock contention. >>> >> >> You should not look on #1 STATs, but on #2 - they are all with the >> latest "fixes" - on all of them CPU is used well (90% in pic on >> 32cores). >> Also, keep in mind these cores are having 2 threads, and from Solaris >> point of view they are seen as CPU (so 64 CPU) and %busy is accounted >> as for 64 CPU >> > > Well, if the CPU usage is actually higher, then it might not be lock waiting > -- it could be spin locks or context switches or cache coherency overhead. > Postgres may also not be very SMT friendly, at least on the hardware tested > here. do you mean SMP or CMT? ;-) however both should work well with PostgreSQL. I also think about CPU affinity - probably it may help to avoid CPU cache misses - but makes sense mostly if pooler will be added as a part of PG. > > (what was the context switch rate? I didn't see that in the data, just > mutex spins). increasing with a load, as this ex.: http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwitch_100 > > The scalability curve is definitely showing something. Prepared statements > were tried, as were most of the other suggestions other than one: > > What happens if the queries are more complicated (say, they take 15ms server > side with a more complicated plan required)? That is a harder question to > answer What I observed is: if planner takes more long time (like initially with 8.3.7 and analyze target 1000) the scalability problem is appearing more strange - http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you see CPU even not used more than 60% , and as you may see spin locks are lowering - CPUs are not spinning for locks, there is something else.. I'm supposing a problem of some kind of synchronization - background processes are not waking up on time or something like this... Then, if more time spent on the query execution itself and not planner: - if it'll be I/O time - I/O will hide everything else until you increase a storage performance and/or add more RAM, but then you come back to the initial issue :-) - if it'll be a CPU time it may be interesting! :-) Rgds, -Dimitri -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
On 5/19/09, Simon Riggs wrote: > > On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote: >> > >> > In particular, running the tests repeatedly using >> >H.REF_OBJECT = '01' >> > rather than varying the value seems likely to benefit MySQL. The >> >> let me repeat again - the reference is *random*, >> the '01' value I've used just to show a query execution >> plan. >> >> also, what is important - the random ID is chosen in way that no one >> user use the same to avoid deadlocks previously seen with PostgreSQL >> (see the "Deadlock mystery" note 2 years ago >> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) > > OK, didn't pick up on that. > > (Like Tom, I was thinking query cache) > > Can you comment on the distribution of values for that column? If you > are picking randomly, this implies distribution is uniform and so I am > surprised we are mis-estimating the selectivity. yes, the distribution of reference values is uniform between '01' to '001000' (10M), only one OBJECT row by one reference, and only 20 rows with the same reference in HISTORY table. > >> I think yes (but of course I did not try to replay it several times) > > If you could that would be appreciated. We don't want to go chasing > after something that is not repeatable. I'll retry and let you know. Rgds, -Dimitri -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
I may confirm the issue with hash join - it's repeating both with prepared and not prepared statements - it's curious because initially the response time is lowering near ~1ms (the lowest seen until now) and then once workload growing to 16 sessions it's jumping to 2.5ms, then with 32 sessions it's 18ms, etc.. I've retested on 24 isolated cores, so any external secondary effects are avoided. Rgds, -Dimitri On 5/19/09, Dimitri wrote: > On 5/19/09, Simon Riggs wrote: >> >> On Tue, 2009-05-19 at 00:33 +0200, Dimitri wrote: >>> > >>> > In particular, running the tests repeatedly using >>> > H.REF_OBJECT = '01' >>> > rather than varying the value seems likely to benefit MySQL. The >>> >>> let me repeat again - the reference is *random*, >>> the '01' value I've used just to show a query execution >>> plan. >>> >>> also, what is important - the random ID is chosen in way that no one >>> user use the same to avoid deadlocks previously seen with PostgreSQL >>> (see the "Deadlock mystery" note 2 years ago >>> http://dimitrik.free.fr/db_STRESS_BMK_Part1.html#note_4355 ) >> >> OK, didn't pick up on that. >> >> (Like Tom, I was thinking query cache) >> >> Can you comment on the distribution of values for that column? If you >> are picking randomly, this implies distribution is uniform and so I am >> surprised we are mis-estimating the selectivity. > > yes, the distribution of reference values is uniform between > '01' to '001000' (10M), only one OBJECT row by one > reference, and only 20 rows with the same reference in HISTORY table. > >> >>> I think yes (but of course I did not try to replay it several times) >> >> If you could that would be appreciated. We don't want to go chasing >> after something that is not repeatable. > > I'll retry and let you know. > > Rgds, > -Dimitri > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
The response time is not degrading progressively, it simply jumps; it looks like from 16 sessions onwards there is some sort of serialization happening somewhere.. Also, at 16 sessions the throughput in TPS is nearly the same as at 8 (the response time is only twice as big for the moment), but at 32 it drops dramatically.. Rgds, -Dimitri On 5/19/09, Simon Riggs wrote: > > On Tue, 2009-05-19 at 14:00 +0200, Dimitri wrote: > >> I may confirm the issue with hash join - it's repeating both with >> prepared and not prepared statements - it's curious because initially >> the response time is lowering near ~1ms (the lowest seen until now) >> and then once workload growing to 16 sessions it's jumping to 2.5ms, >> then with 32 sessions it's 18ms, etc.. > > Is it just bad all the time, or does it get worse over time? > > Do you get the same behaviour as 32 sessions if you run 16 sessions for > twice as long? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any better plan for this query?..
On 5/19/09, Merlin Moncure wrote: > On Mon, May 18, 2009 at 6:32 PM, Dimitri wrote: >> Thanks Dave for the correction, but I'm also curious where the time is >> wasted in this case?.. >> >> 0.84ms is displayed by "psql" once the result output is printed, and I >> got similar time within my client (using libpq) which is not printing >> any output.. > > Using libpq? What is the exact method you are using to execute > queries...PQexec? exactly > If you are preparing queries against libpq, the > best way to execute queries is via PQexecPrepared. the query is prepared *once* via PQexec, then the loop executes it with "execute" via PQexec. Why would PQexecPrepared be better in my case?.. > Also, it's > interesting to see if you can get any benefit from asynchronous > queries (PQsendQueryPrepared), but this might involve more changes to your > application than you are willing to make. > > Another note: I would like to point out again that there are possible > negative side effects in using char(n) vs. varchar(n) that IIRC do not > exist in mysql. When you repeat your test I strongly advise switching > to varchar. if it's true in any case, why not just replace the CHAR implementation with VARCHAR directly within the PG code?.. > > Another question: how exactly are you connecting to the database? > local machine? if so, domain socket or tcp/ip? local TCP/IP, same as MySQL > What are you doing > with the results...immediately discarding? on the libpq side they are immediately discarded once all rows are fetched > > One last thing: when you get access to the server, can you run a > custom format query test from pgbench and compare the results to your > test similarly configured (same number of backends, etc) in terms of > tps? I'll try Rgds, -Dimitri -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
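As a rough sketch of the PQprepare/PQexecPrepared path Merlin is suggesting (as opposed to preparing via SQL PREPARE and looping EXECUTE through PQexec): the statement name, connection string, loop count and parameter format below are hypothetical, loosely modeled on the benchmark query against the history table, and error checking is omitted:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");        /* placeholder connection string */
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* prepare once; NULL parameter types let the server infer them */
    PGresult *res = PQprepare(conn, "hist_q",
        "SELECT * FROM history WHERE ref_object = $1", 1, NULL);
    PQclear(res);

    const char *params[1];
    char ref[16];

    for (int i = 0; i < 1000; i++) {
        snprintf(ref, sizeof(ref), "%010d", i);        /* random reference in the real test */
        params[0] = ref;

        /* parameters are sent separately from the query text, as text (last arg 0) */
        res = PQexecPrepared(conn, "hist_q", 1, params, NULL, NULL, 0);
        PQclear(res);                                  /* results discarded, as in the test */
    }

    PQfinish(conn);
    return 0;
}

Besides skipping the parse step on each execution, this keeps the parameter values out of the SQL string entirely, which is the point Merlin develops in the follow-up message.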
Re: [PERFORM] Any better plan for this query?..
On 5/19/09, Scott Carey wrote: > > On 5/19/09 3:46 AM, "Dimitri" wrote: > >> On 5/19/09, Scott Carey wrote: >>> >>> On 5/18/09 3:32 PM, "Dimitri" wrote: >>> >>>> On 5/18/09, Scott Carey wrote: >>>>> Great data Dimitri!' >>>> >>>> Thank you! :-) >>>> >>>>> >>>>> I see a few key trends in the poor scalability: >>>>> >>>>> The throughput scales roughly with %CPU fairly well. But CPU used >>>>> doesn't >>>>> go past ~50% on the 32 core tests. This indicates lock contention. >>>>> >>>> >>>> You should not look on #1 STATs, but on #2 - they are all with the >>>> latest "fixes" - on all of them CPU is used well (90% in pic on >>>> 32cores). >>>> Also, keep in mind these cores are having 2 threads, and from Solaris >>>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted >>>> as for 64 CPU >>>> >>> >>> Well, if the CPU usage is actually higher, then it might not be lock >>> waiting >>> -- it could be spin locks or context switches or cache coherency >>> overhead. >>> Postgres may also not be very SMT friendly, at least on the hardware >>> tested >>> here. >> >> do you mean SMP or CMT? ;-) >> however both should work well with PostgreSQL. I also think about CPU >> affinity - probably it may help to avoid CPU cache misses - but makes >> sense mostly if pooler will be added as a part of PG. > > Symmetric Multi Threading (HyperThreading in Intels marketing terms, other > marketing terms for Sun or IBM). One CPU core that can handle more than one > concurrently executing thread. > Technically, 'SMT' allows instructions in flight from multiple threads at > once in a superscalar Cpu core while some implementations differ and might > technically CMT (one thread or the other, but can switch fast, or a > non-superscalar core). > > For many implementations of 'multiple threads on one CPU core' many of the > processor resources are reduced per thread when it is active -- caches get > split, instruction re-order buffers are split, etc. That is rather hardware > implementation dependant. > > For Intel's SMT (and other similar), spin-locks hurt scalability if they > aren't using new special instructions for the spin to yield pipeline slots > to the other thread. > > Generally, code that stresses common processor resources more than CPU > execution will scale poorly with SMT/CMT etc. All application are scaling well anyway, except if you have any kind of lock contention inside of the application itself or meet any kind of system resource become hot. But well, here we may spend days to discuss :-) > > So I'm not sure about the Postgres details, but the general case of an > application that doesn't benefit from these technologies exists, and there > is a non-zero chance that Postgres has some characteristics of such an app. > >>> >>> (what was the context switch rate? I didn't see that in the data, just >>> mutex spins). >> >> increasing with a load, as this ex.: >> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwit >> ch_100 >> > > Well, on most systems over 100K context switches/sec is a lot. And those > reach 18 /sec. > However, this is 'only' 10 context switches per transaction and less than > 20% system CPU, so maybe those numbers aren't quite as big as they seem. > > Out of curiosity, what was the context switch rate for MySql at its peak > throughput? 
the main MySQL problem is a mutex locking like here: http://dimitrik.free.fr/Report_20090504/5465_dim_STAT_31.html#bmk_SpinMtx_31 so you have to limit a number of active threads to lower this contention (similar to pooler idea folks told here) and the context switch is even higher (~200K/sec) >> >>> >>> The scalability curve is definitely showing something. Prepared >>> statements >>> were tried, as were most of the other suggestions other than one: >>> >>> What happens if the queries are more complicated (say, they take 15ms >>> server >>> side with a more complicated plan required)? That is a harder question >>> to >>> answer >> >> What I observed is: if planner takes more long time (like initially >> with 8.3.7 and analyze target 1000) the scalability problem is >> appearing more strange - >> http://dimitrik.free.fr/Report_2009
Re: [PERFORM] Any better plan for this query?..
On 5/19/09, Merlin Moncure wrote: > On Tue, May 19, 2009 at 11:53 AM, Dimitri wrote: >> On 5/19/09, Merlin Moncure wrote: >>> On Mon, May 18, 2009 at 6:32 PM, Dimitri wrote: >>>> Thanks Dave for the correction, but I'm also curious where the time is >>>> wasted in this case?.. >>>> >>>> 0.84ms is displayed by "psql" once the result output is printed, and I >>>> got similar time within my client (using libpq) which is not printing >>>> any output.. >>> >>> Using libpq? What is the exact method you are using to execute >>> queries...PQexec? >> >> exactly >> >>> If you are preparing queries against libpq, the >>> best way to execute queries is via PQexecPrepared. >> >> the query is prepared *once* via PQexec, >> then the loop executes it with "execute" via PQexec. >> Why would PQexecPrepared be better in my case?.. > > It can be better or worse (usually better). the parameters are > separated from the query string. Regardless of performance, the > parametrized interfaces are superior for any queries taking arguments > and should be used when possible. you're probably right, but I don't like it either when the solution becomes so complicated - PG has such an elegant way to execute a prepared query! > >>> Another note: I would like to point out again that there are possible >>> negative side effects in using char(n) vs. varchar(n) that IIRC do not >>> exist in mysql. When you repeat your test I strongly advise switching >>> to varchar. >> >> if it's true in any case, why not just replace the CHAR implementation with >> VARCHAR directly within the PG code?.. > > First, let me explain the difference. char(n) is padded out to 'n' on > disk and when returned. despite this, the length is still stored so > there is no real advantage to using the char(n) type except that the > returned string is of a guaranteed length. mysql, at least the > particular version and storage engine that I am logged into right now, > does not do this for char(n). In other words, select cast('abc' as > char(50)) returns a string of 50 chars on pgsql and 3 chars on mysql. > I will leave it as an exercise to the reader to figure out who is > following the standard. pg's handling of the situation is not > necessarily optimal, but we just tell everyone to quit using the 'char(n)' > type. > > Unless for example your 'NOTE' column is mostly full or mostly null, > your query is not fair because postgres has to both store and return a > proportionally greater amount of data. This makes the comparison > hardly apples to apples. This stuff counts when we are measuring at > microsecond level. Good point! I can only confirm that, at least at the beginning, all fields are fully filled within the database. I will test both engines with VARCHAR next time to be sure it's not an issue. > >>> Another question: how exactly are you connecting to the database? >>> local machine? if so, domain socket or tcp/ip? >> >> local TCP/IP, same as MySQL > > would be curious to see if you get different results from domain socket. at least for PG there was no difference, if I remember well. However, earlier when I tested over a real network I ended up completely changing my code to reduce network traffic (initially I had used cursors), and in the end PG traffic was lower than or similar to MySQL's, which was interesting stuff too :-) Rgds, -Dimitri > > merlin > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
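To illustrate the char(n) padding Merlin describes from the client side, a small sketch (placeholder connection string) using PQgetlength() to look at the length of the value actually returned for char(50) versus varchar(50):

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");    /* placeholder connection string */
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    PGresult *res = PQexec(conn,
        "SELECT CAST('abc' AS char(50)), CAST('abc' AS varchar(50))");

    /* PQgetlength reports the size of the field value shipped to the client */
    printf("char(50):    %d bytes\n", PQgetlength(res, 0, 0));
    printf("varchar(50): %d bytes\n", PQgetlength(res, 0, 1));

    PQclear(res);
    PQfinish(conn);
    return 0;
}

On PostgreSQL the first value comes back space padded to 50 bytes and the second as 3 bytes, which is the extra data volume Merlin is warning about when a benchmark mixes the two types.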
Re: [PERFORM] Scalability in postgres
A few weeks ago I tested a customer application on 16 cores with Oracle: - 20,000 sessions in total - 70,000 queries/sec without any problem on a mid-range Sun box + Solaris 10.. Rgds, -Dimitri On 6/3/09, Kevin Grittner wrote: > James Mansion wrote: > >> I'm sure most of us evaluating Postgres from a background in Sybase >> or SQLServer would regard 5000 connections as no big deal. > > Sure, but the architecture of those products is based around all the > work being done by "engines" which try to establish affinity to > different CPUs, and loop through the various tasks to be done. You > don't get a context switch storm because you normally have the number > of engines set at or below the number of CPUs. The down side is that > they spend a lot of time spinning around queue access to see if > anything has become available to do -- which causes them not to play > nice with other processes on the same box. > > If you do connection pooling and queue requests, you get the best of > both worlds. If that could be built into PostgreSQL, it would > probably reduce the number of posts requesting support for bad > configurations, and help with benchmarks which don't use proper > connection pooling for the product; but it would actually not add any > capability which isn't there if you do your homework. > > -Kevin > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Scalability in postgres
Just to say that you don't need a mega server to keep thousands of connections with Oracle, it's just trivial, nor the CPU affinity and other stuff you may or may not need with Sybase :-) Regarding PostgreSQL, I think it can only benefit from having an integrated connection pooler, as it'll make all populations happy anyway: - those who don't like the idea may always disable it :-) - those who have a lot of, but mostly inactive, sessions will be happy to simplify session pooling - those who are really seeking the most optimal workload on their servers will be happy twice: if there are any PG scalability limits, an integrated pooler will in most cases be more performant than an external one; if there are no PG scalability limits - it'll still help to size PG most optimally according to HW or OS capacities.. Rgds, -Dimitri On 6/3/09, Kevin Grittner wrote: > Dimitri wrote: >> A few weeks ago I tested a customer application on 16 cores with Oracle: >> - 20,000 sessions in total >> - 70,000 queries/sec >> >> without any problem on a mid-range Sun box + Solaris 10.. > > I'm not sure what point you are trying to make. Could you elaborate? > > (If it's that Oracle doesn't need an external connection pool, then > are you advocating that PostgreSQL include that in the base product?) > > -Kevin > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] VX_CONCURRENT flag on vxfs( 5.1 or later) for performance for postgresql?
You should rather consider VxFS tuning - it has auto-discovery of DIRECT I/O according to the block size. Just change this setting to 8K or 16-32K depending on your workload - then all I/O operations with a bigger block size will be executed in DIRECT mode and bypass the FS cache (which is logical, as usually it'll correspond to a full scan or a seq scan of some data), while I/O requests with smaller blocks will remain cached, which is very useful as it'll mainly cache random I/O (mainly index access).. With such tuning I've got over 35% performance improvement compared to any other state (fully DIRECT or fully cached). Rgds, -Dimitri On 5/5/11, Robert Haas wrote: > On Sat, Apr 30, 2011 at 4:51 AM, Hsien-Wen Chu > wrote: >> since the block size is 8k by default, and it consists of many >> tuples/lines; as I understand it, if any tuple/line is changed (maybe >> by update, insert, delete), the block will be marked as a dirty block, and >> then it will be flushed to disk by the bgwriter. > > True... > >> so my question is whether the data block (8k) is aligned with the file >> system block? if it is aligned with the file system block, what's the >> potential issue that makes it not safe for direct I/O? (please assume >> vxfs, vxvm and the disk sector are aligned). please correct me if anything is >> incorrect. > > It's not about safety - it's about performance. On a machine with > 64GB of RAM, a typical setting for shared_buffers is 8GB. If you > start reading blocks into the PostgreSQL cache - or writing them out > of the cache - in a way that bypasses the filesystem cache, you're > going to have only 8GB of cache, instead of some much larger amount. > More cache = better performance. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] EXPLAIN ANALYZE on 8.2
Hi list, On Friday 15 December 2006 at 11:50, Martijn van Oosterhout wrote: > BTW, doing gettimeofday() without kernel entry is not really possible. > You could use the cycle counter but it has the problem that if you have > multiple CPUs you need to calibrate the result. If the CPU goes to > sleep, there is no way for the userspace process to know. Only the > kernel has all the relevant information about what "time" is to get a > reasonable result. I remember having played with the Intel RDTSC (time stamp counter) for some timing measurements, but I just read from several sources (including Linux kernel hackers considering its usage for the gettimeofday() implementation) that the TSC is not an accurate way to get elapsed time information. Maybe some methods other than gettimeofday() are available (Lamport timestamps, as PGDG may have to consider a distributed-processing-ready EXPLAIN ANALYZE at some point in the future), cheaper and accurate? After all, the discussion, as far as I understand it, is about having an accurate measure of the duration of events; knowing when in the day they occurred does not seem to be the point. My 2¢, hoping this could be somehow helpful, -- Dimitri Fontaine http://www.dalibo.com/
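For what it's worth, a rough sketch of the two timing sources being compared: gettimeofday() (a kernel entry, but a calibrated wall clock) versus reading the raw x86 cycle counter (cheap, but per-CPU and frequency dependent, hence the accuracy concerns above). This is x86/GCC-specific inline assembly, and converting cycles to time still needs exactly the calibration the kernel folks worry about:

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

/* read the time stamp counter (x86, GCC-style inline asm) */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    struct timeval t0, t1;
    uint64_t c0, c1;

    gettimeofday(&t0, NULL);
    c0 = read_tsc();

    /* ... the event being measured goes here ... */

    c1 = read_tsc();
    gettimeofday(&t1, NULL);

    long usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("gettimeofday: %ld us, tsc: %llu cycles\n",
           usec, (unsigned long long)(c1 - c0));
    return 0;
}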
Re: [PERFORM] Scaling concerns
On Saturday 16 December 2006 at 18:32, Steinar H. Gunderson wrote: > Instead, try real queries on real data, > and post here if some are too slow for you. To quickly find the subset of slow queries on your production system, you can use the pgfouine tool: http://pgfouine.projects.postgresql.org/ If you then want to measure PostgreSQL performance with different settings and compare them, consider using the tsung tool (and maybe the tsung-ploter companion tool, to graph several benchmarks onto the same charts for comparison purposes): http://pgfouine.projects.postgresql.org/tsung.html http://tsung.erlang-projects.org/ http://debian.dalibo.org/unstable/ This latter link also contains a .tar.gz archive of tsung-ploter in case you're not running a Debian system. Dependencies are python and matplotlib. Regards, -- Dimitri Fontaine http://www.dalibo.com/
Re: [PERFORM] Performance of PostgreSQL on Windows vs Linux
On Thursday 4 January 2007 at 00:18, Magnus Hagander wrote: > But to get a good answer on whether the difference is > significant enough to matter, you really need to run some kind of simple > benchmark on *your* workload. To easily stress test a couple of servers and compare results on *your* workload, please consider using both pgfouine[1,2] and tsung[3]. The companion tool tsung-ploter[4] (for plotting several results on a common graph, hence with common scales) may also be useful. [1]: http://pgfouine.projects.postgresql.org/ [2]: http://pgfouine.projects.postgresql.org/tsung.html [3]: http://tsung.erlang-projects.org/ [4]: http://debian.dalibo.org/unstable/tsung-ploter_0.1-1.tar.gz Regards, -- Dimitri Fontaine http://www.dalibo.com/