I've long harbored a suspicion, based on some testing I did on my home machine, that WALInsertLock is a big performance bottleneck. But I just did some benchmarking that doesn't entirely support that contention. This is on Nate Boley's 32-core machine, with the following settings:
max_connections = 100
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

I did 5-minute pgbench runs with unlogged tables and with permanent tables, restarting the database server and reinitializing the tables between each run. The number at the beginning of each line is the number of clients, while the p/u suffix indicates whether permanent or unlogged tables were used:

01p tps = 628.626815 (including connections establishing)
01p tps = 635.394288 (including connections establishing)
01p tps = 634.972789 (including connections establishing)
08p tps = 3342.787325 (including connections establishing)
08p tps = 3883.876813 (including connections establishing)
08p tps = 3941.253567 (including connections establishing)
32p tps = 5597.472192 (including connections establishing)
32p tps = 5738.139895 (including connections establishing)
32p tps = 5794.490934 (including connections establishing)
80p tps = 4499.685286 (including connections establishing)
80p tps = 4917.060441 (including connections establishing)
80p tps = 5050.931933 (including connections establishing)
01u tps = 672.469142 (including connections establishing)
01u tps = 671.256686 (including connections establishing)
01u tps = 670.421003 (including connections establishing)
08u tps = 4087.749529 (including connections establishing)
08u tps = 3797.750851 (including connections establishing)
08u tps = 4181.393560 (including connections establishing)
32u tps = 8956.346905 (including connections establishing)
32u tps = 8898.442517 (including connections establishing)
32u tps = 8971.591569 (including connections establishing)
80u tps = 7287.550952 (including connections establishing)
80u tps = 7266.816989 (including connections establishing)
80u tps = 7255.968109 (including connections establishing)

The speed-up from using unlogged tables was not as large as I expected. Turning off synchronous_commit removes commit rate as the bottleneck, and I think perhaps the main benefit of unlogged tables in that case is the avoidance of I/O - and apparently this machine has enough I/O bandwidth, and enough memory generally, that that's not an issue here. With either type of table, the 8-client results are about 6.1 times the single-client results - not great, but not terrible either. With 32 clients, unlogged tables scale noticeably better than permanent ones - 13.4x vs. 9.1x - but even 13.4x is a long way from linear. vmstat reveals that CPU usage is substantially less than 100%.

After some investigation, I found that using unlogged tables wasn't actually getting rid of all the write-ahead logging: the commit records were still being issued. So I hacked up RecordTransactionCommit() not to emit transaction commit records in that case. That doesn't completely eliminate the WAL activity, because it still emits records for zeroing new SLRU pages for CLOG, but it takes a big chunk out of it. The upshot is that this improved both raw performance and scalability, though not dramatically.
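For anyone who wants the shape of the hack without opening the attached patch, here's a minimal standalone sketch of the decision it makes at commit time. The type and function names are placeholders for illustration, not the real xact.c symbols, and the real RecordTransactionCommit() of course does a good deal more:

/*
 * Standalone sketch (not the attached patch) of the experiment described
 * above: when a transaction generated no WAL of its own -- the unlogged-table
 * case -- skip emitting the commit record as well.  Placeholder names only.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct TxnState
{
    bool        has_xid;    /* was a permanent XID assigned? */
    bool        wrote_wal;  /* did the transaction insert any WAL records? */
} TxnState;

static void
record_transaction_commit(const TxnState *txn)
{
    if (txn->has_xid && txn->wrote_wal)
    {
        /* Normal path: emit the commit record, then write/flush WAL
         * (asynchronously here, since synchronous_commit = off). */
        printf("emit commit record and queue WAL flush\n");
    }
    else if (txn->has_xid)
    {
        /*
         * Hacked path: no commit record.  CLOG must still be updated, which
         * is why some WAL remains -- zeroing new CLOG pages is still logged.
         */
        printf("skip commit record; update CLOG only\n");
    }
    else
    {
        /* Read-only transaction: nothing to record at all. */
        printf("nothing to do\n");
    }
}

int
main(void)
{
    TxnState unlogged_write = { .has_xid = true, .wrote_wal = false };
    TxnState logged_write   = { .has_xid = true, .wrote_wal = true };

    record_transaction_commit(&unlogged_write);
    record_transaction_commit(&logged_write);
    return 0;
}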
Unlogged table results, with this change:

01h tps = 708.189337 (including connections establishing)
01h tps = 704.030175 (including connections establishing)
01h tps = 701.644199 (including connections establishing)
08h tps = 5196.615955 (including connections establishing)
08h tps = 5126.162200 (including connections establishing)
08h tps = 5067.568727 (including connections establishing)
32h tps = 10661.275636 (including connections establishing)
32h tps = 10621.085226 (including connections establishing)
32h tps = 10575.267197 (including connections establishing)
80h tps = 7557.965666 (including connections establishing)
80h tps = 7545.697547 (including connections establishing)
80h tps = 7556.379921 (including connections establishing)

Now the 32-client numbers have risen to 15.1x the single-client numbers, but that's still not great. What does this mean? Well, XLogInsert strikes me as an awfully complex piece of code to be running with a hot LWLock held in exclusive mode. But even so, I think this certainly means that WALInsertLock, at least on this workload, is not the whole problem: in this test, I'm eliminating not only the overhead of inserting the WAL, but also the overhead of writing it, flushing it, and generating it. So there is something other than WAL insertion which is taking a big bite out of performance here.

As to what that something might be, I reran this last test with LWLOCK_STATS enabled, and here are the locks with the most blocking:

lwlock 310: shacq 96846 exacq 108433 blk 16275
lwlock 3: shacq 64 exacq 2628381 blk 36027
lwlock 7: shacq 0 exacq 2628615 blk 85220
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651

During this 5-minute test run, LWLock acquisitions blocked 6180335 times in total. As you can see from the above, ProcArrayLock (lwlock 4) accounts for 54% of that blocking, and CLogControlLock (lwlock 11) for another 34%. lwlock 7 is WALInsertLock, which manages to account for more than 1% of the blocking despite the fact that WAL has been largely eliminated in this test; lwlock 3 is XidGenLock. I dunno what lwlock 310 is, but it doesn't matter. After that there is a long tail of other stuff with gradually decreasing amounts of blocking.

Experience with the read scalability stuff has also taught me to look at which LWLocks have the most shared acquisitions, as that can cause spinlock and cache line contention. The top few culprits are:

lwlock 504: shacq 5315030 exacq 0 blk 0
lwlock 45: shacq 5967317 exacq 13284 blk 1722
lwlock 39: shacq 8219988 exacq 13342 blk 2291
lwlock 5: shacq 26306020 exacq 0 blk 0
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423

In all, there were 238777533 shared LWLock acquisitions during this test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon to be dealt with, as discussed elsewhere on-list), and then it gets down into the lock manager locks and a few others. (The arithmetic behind these percentages is spelled out in the small program at the end of this message.)

I'm mulling over what to do about all of this, but thought I'd share the raw numbers first, in case anyone's interested.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
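For completeness, here's the quick arithmetic behind the percentages quoted above, as a tiny standalone C program (not PostgreSQL code). The lock-number-to-name mapping is the one described in the text; everything else is just the counters from this run:

/*
 * Recompute each lock's share of total blocking and of total shared
 * acquisitions from the LWLOCK_STATS counters pasted above.
 */
#include <stdio.h>

struct share
{
    const char *name;
    long        count;
};

static void
print_shares(const char *what, long total, const struct share *s, int n)
{
    printf("%s (total %ld):\n", what, total);
    for (int i = 0; i < n; i++)
        printf("  %-24s %10ld  %5.1f%%\n",
               s[i].name, s[i].count, 100.0 * s[i].count / total);
}

int
main(void)
{
    /* blk counts from the first list */
    struct share blocked[] = {
        { "ProcArrayLock (4)",     3356651 },
        { "CLogControlLock (11)",  2119423 },
        { "WALInsertLock (7)",       85220 },
        { "XidGenLock (3)",          36027 },
        { "lwlock 310",              16275 },
    };
    /* shacq counts from the second list */
    struct share shared[] = {
        { "CLogControlLock (11)", 84913908 },
        { "ProcArrayLock (4)",    28667307 },
        { "SInvalReadLock (5)",   26306020 },
    };

    print_shares("LWLock blocking", 6180335, blocked, 5);
    print_shares("Shared acquisitions", 238777533, shared, 3);
    return 0;
}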
Attachment: skip-commit-record.patch