For the last week or so, in between various other tasks, I've been trying to understand why performance drops off when you run pgbench -n -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client counts. The answer, in a word, is SIGetDataEntries(). I believe we need to bite the bullet and rewrite this using a lock-free algorithm, with memory barriers on processors with weak memory ordering. Perhaps there is another way to do it, but nothing I've tried has really worked so far, and I've tried quite a few things. Here's the data.
On unpatched master, performance scales pretty much linearly out to 32 clients. As you add more clients, it drops off:

[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)

[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)

[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)

[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)

With the lazy vxid locks patch, all of those numbers get better by a few percentage points (the more clients, the more points), except at 80 clients, where the performance is sometimes better and sometimes worse. These are 5-minute runs:

[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)

I can't explain in detail why there is so much variation between the 5-minute runs, or why some runs perform worse than without the lazy vxid locks, but I can say that applying the first of the two attached patches (sinval-per-backend-mutex.patch) fixes it. The patch gets rid of SInvalReadLock and instead gives each backend its own spinlock. SICleanupQueue() must therefore get somewhat fuzzy answers, but it shouldn't matter. The only real problem is if you need to do the super-duper-cleanup where you subtract a constant from all the values in unison. I just got rid of that completely by widening the counters to 64 bits.

Assuming I haven't botched the implementation, this is clearly a win. There is still some performance drop-off going from 32 clients to 80 clients, but it is much less:

[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)

Profiling this combination of patches reveals that there is still some pretty ugly spinlock contention on sinval's msgnumLock. And it occurs to me that on x86, we really don't need this lock ... or SInvalReadLock ... or a per-backend mutex. The whole of SIGetDataEntries() can pretty easily be made lock-free. The only real changes that seem to be needed are (1) to use a 64-bit counter, so that you never need to decrement, and (2) to recheck resetState after reading the entries from the queue, to see if we got reset while we were reading those entries. Since x86 guarantees that writes will become visible in the order they are executed, we only need to make sure that the compiler doesn't rearrange things. As long as we first read maxMsgNum and then read the messages, we can't read garbage. As long as we read the messages before we check resetState, we will be sure to notice if we got reset before we read all the messages (which is the only way that we can have read garbage messages).
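To make that concrete, here's roughly the shape of the read path I have in mind. This is just a sketch, not the patch itself: the structs below are pared-down stand-ins for the real sinval structures, the helper names (SIGetDataEntries_sketch, pg_compiler_barrier) are local to the sketch, and the barrier macro is GCC-flavored. On x86 the compiler barriers are all we need; weak-memory machines would need real read barriers in the same spots.

#include <stdbool.h>
#include <stdint.h>

#define MAXNUMMESSAGES 4096     /* ring buffer size, a power of 2 */

/* stand-in for SharedInvalidationMessage */
typedef struct
{
    int16_t     id;
} SharedInvalidationMessage;

/* stand-in for the shared sinval segment */
typedef struct
{
    volatile uint64_t maxMsgNum;    /* 64-bit: never needs to be decremented */
    SharedInvalidationMessage buffer[MAXNUMMESSAGES];
} SISeg;

/* stand-in for this backend's ProcState entry */
typedef struct
{
    volatile uint64_t nextMsgNum;   /* also 64-bit */
    volatile bool resetState;
} ProcState;

/* On x86 we only need to keep the compiler from reordering. */
#define pg_compiler_barrier()   __asm__ __volatile__("" ::: "memory")

/*
 * Copy pending messages into *data.  Returns the number of messages
 * copied, or -1 if we were reset and must rebuild our caches instead.
 */
static int
SIGetDataEntries_sketch(SISeg *segP, ProcState *stateP,
                        SharedInvalidationMessage *data, int datasize)
{
    uint64_t    max;
    int         n = 0;

    /* (1) Read maxMsgNum first ... */
    max = segP->maxMsgNum;
    pg_compiler_barrier();

    /* (2) ... then read the messages up to that point. */
    while (stateP->nextMsgNum < max && n < datasize)
    {
        data[n++] = segP->buffer[stateP->nextMsgNum % MAXNUMMESSAGES];
        stateP->nextMsgNum++;
    }

    /*
     * (3) Only after reading the messages do we check resetState.  A writer
     * that overwrote entries out from under us must have marked us reset
     * first, so this check catches any garbage we might have copied.
     */
    pg_compiler_barrier();
    if (stateP->resetState)
        return -1;

    return n;
}

The write side needs the mirror-image ordering, of course: store the message before advancing maxMsgNum, and mark a backend as reset before its unread messages can be overwritten.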
Now this implementation (sinval-unlocked.patch) is obviously not a serious proposal as it stands, because on processors with weak memory ordering it will presumably blow up all over the place. But here are the results on x86:

[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)

Now, that's actually *faster* than the 32-client numbers above. It turns out that, with this approach, there's essentially no degradation between 32 clients and 80 clients. It appears that even one spinlock acquisition in SIGetDataEntries() is too many. Currently, we have *three*: one to get SInvalReadLock, one to get msgnumLock, and one to release SInvalReadLock (LWLock acquire and release each take the lock's internal spinlock).

For contrast, on a two-core MacBook Pro, I can't measure any difference at all between lazy-vxid, lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked. The last might even be a touch slower, although there's enough noise that it's hard to tell for sure, and the effect is very small if there is one. But on the 32-core machine, the differences are dramatic.

Thoughts? Comments? Ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
sinval-per-backend-mutex.patch
Description: Binary data
sinval-unlocked.patch
Description: Binary data