On Sun, May 29, 2011 at 7:04 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> On 05/29/2011 03:11 PM, Jeff Janes wrote:
>>
>> If you use "pgbench -S -M prepared" at a scale where all data fits in
>> memory, most of what you are benchmarking is network/IPC chatter, and
>> table locking.
>
> If you profile it, you'll find a large amount of the time is actually spent
> doing more mundane things, like memory allocation.  The network and locking
> issues are really not the bottleneck at all in a surprising number of these
> cases.
I wouldn't expect IPC chatter to show up in profiling, because it costs wall
time, but not CPU time.  The time spent might be attributed to the kernel, or
to pgbench, or to nothing at all.

As part of the "Eviscerating the parser" discussion, I made a hack that had
exec_simple_query do nothing but return a dummy completionTag, so there was
no parsing, planning, or execution.  Under this mode, I got 44,758 TPS, or
22.3 microseconds/transaction, which should represent the cost of IPC chatter
and pgbench overhead.

The breakdown I get, in microseconds per item, is:

53.70  cost of a select and related overhead via -S -M prepared, of which:
  22.34  cost of IPC and pgbench round trip, estimated via the discussion above
  16.91  cost of the actual execution of the select statement, estimated via
         the newly suggested -P mode
  --------
  14.45  residual cost (the remainder), covering table locking, transaction
         begin and end, plus measurement errors

Because all my tests were single-client, the cost of locking would be much
lower than it would be in contended cases.  However, I wouldn't trust
profiling to accurately reflect the locking time anyway, for the same reason I
don't trust it for IPC chatter: wall time is consumed but not spent on the
CPU, so it is not counted by profiling.

As you note, memory allocation accounts for much of the profiled time.
However, memory allocation is a low-level operation which is always in support
of some higher-level purpose, such as parsing, execution, or taking locks.  My
top-down approach attempts to assign these bottom-level costs to the proper
higher-level purpose.

> Your patch isn't really dependent on your being right about the
> cause here, which means this doesn't impact your submissions any.  Just
> wanted to clarify that what people expect are slowing things down in this
> situation and what actually shows up when you profile are usually quite
> different.
>
> I'm not sure whether this feature makes sense to add to pgbench, but it's
> interesting to have it around for developer testing.

Yes, this is what I thought the opinion might be.  But there is no repository
of such "useful for developer testing" patches, other than this mailing list.
That being the case, would it even be worthwhile to fix it up more and submit
it to a commitfest?

>> some numbers for single client runs on 64-bit AMD Opteron Linux:
>> 12,567 sps under -S
>> 19,646 sps under -S -M prepared
>> 58,165 sps under -P
>
> 10,000 is too big of a burst to run at once.  The specific thing I'm
> concerned about is what happens if you try this mode when using "-T" to
> enforce a runtime limit, and your SELECT rate isn't high.  If you're only
> doing 100 SELECTs/second because your scale is big enough to be seek bound,
> you could overrun by nearly two minutes.

OK.  I wouldn't expect someone to want to use -P under scales that cause that
to happen, but perhaps it should deal with it more gracefully.  I picked
10,000 just because it is obviously large enough for any hardware I have, or
will have for the foreseeable future.
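Roughly the kind of thing I have in mind for dealing with it more gracefully
is to size each burst from the time remaining and the select rate observed so
far, rather than always firing off the full batch.  A minimal stand-alone
sketch of just that idea (not the actual patch; run_burst() and all the names
are made up, and it only simulates the seek-bound 100 selects/second case):

#include <stdio.h>
#include <time.h>

/*
 * Stand-in for sending one burst of n selects to the server; here it just
 * sleeps to simulate a seek-bound ~100 selects/second.
 */
static void
run_burst(int n)
{
    struct timespec t;

    t.tv_sec = n / 100;
    t.tv_nsec = (n % 100) * 10000000L;
    nanosleep(&t, NULL);
}

static double
elapsed_secs(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) + (now.tv_nsec - start->tv_nsec) / 1e9;
}

int
main(void)
{
    const int   full_batch = 10000;    /* the -P batch size */
    const double time_limit = 5.0;     /* the -T limit, in seconds */
    struct timespec start;
    long        done = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;)
    {
        double  elapsed = elapsed_secs(&start);
        double  remaining = time_limit - elapsed;
        int     burst = full_batch;

        if (remaining <= 0)
            break;              /* never start a burst after the deadline */

        if (done == 0)
            burst = 10;         /* small calibration burst; no rate estimate yet */
        else
        {
            /* cap the burst so it should finish right around the deadline */
            double  rate = done / elapsed;              /* observed selects/sec */
            int     cap = (int) (rate * remaining) + 1;

            if (cap < burst)
                burst = cap;
        }

        run_burst(burst);
        done += burst;
    }
    printf("ran %ld selects in %.1f seconds\n", done, elapsed_secs(&start));
    return 0;
}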
> I think this is just a matter of turning the optimization around a bit.
> Rather than starting with a large batch size and presuming that's ideal, in
> this case a different approach is really needed.  You want the smallest
> batch size that gives you a large win here.  My guess is that most of the
> gain here comes from increasing batch size to something in the 10 to 100
> range; jumping to 10K is probably overkill.  Could you try some smaller
> numbers and see where the big increases start falling off at?

I've now used a variety of sizes (powers of 2 up to 8192, plus 10,000), and
the results fit a linear equation very well: 17.3 usec per inner select plus
59.0 usec per outer invocation.  So at a loop size of 512, the looping
overhead is 59.0/512 = 0.115 usec out of a total of about 17.4 usec per
select, or about 0.7%.  So that should be large enough.

Thanks for the other suggestions.

Cheers,

Jeff
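P.S.  For anyone who wants to play with the fit: the model is just
time per select = inner + outer/batch, with inner = 17.3 usec and outer =
59.0 usec from the numbers above.  A trivial snippet to see where the looping
overhead falls off (the 512 row reproduces the ~0.7% figure):

#include <stdio.h>

int
main(void)
{
    const double inner = 17.3;      /* usec per inner select, from the fit */
    const double outer = 59.0;      /* usec per outer invocation, from the fit */
    const int   sizes[] = {16, 64, 256, 512, 2048, 8192};
    const int   nsizes = (int) (sizeof(sizes) / sizeof(sizes[0]));

    for (int i = 0; i < nsizes; i++)
    {
        double  per_select = inner + outer / sizes[i];

        printf("batch %5d: %5.2f usec/select, overhead %.2f%%\n",
               sizes[i], per_select, 100.0 * (outer / sizes[i]) / per_select);
    }
    return 0;
}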