A few weeks ago I posted some performance results showing that increasing NUM_CLOG_BUFFERS was improving pgbench performance.
http://archives.postgresql.org/pgsql-hackers/2011-12/msg00095.php

I spent some time today looking at this in a bit more detail. Somewhat obviously in retrospect, it turns out that the problem becomes more severe the longer you run the test. CLOG lookups are induced when we go to update a row that we've previously updated. When the test first starts, just after pgbench -i, all the rows are hinted and, even if they weren't, they all have the same XID, so there's no problem. But as the fraction of rows that have been updated increases, it becomes progressively more likely that the next update will hit a row that's already been updated. Initially, that's OK, because we can keep all the CLOG pages of interest in the 8 available buffers. But once we've eaten through enough XIDs - specifically, 8 buffers * 8192 bytes/buffer * 4 XIDs/byte = 256K XIDs - we can't keep all the necessary pages in memory at the same time, and so we have to keep replacing CLOG pages. This effect is not difficult to see even on my 2-core laptop, although I'm not sure whether it causes any material performance degradation there.

If you have enough concurrent tasks, a probably-more-serious form of starvation can occur. As SlruSelectLRUPage notes:

    /*
     * We need to wait for I/O.  Normal case is that it's dirty and we
     * must initiate a write, but it's possible that the page is already
     * write-busy, or in the worst case still read-busy.  In those cases
     * we wait for the existing I/O to complete.
     */

On Nate Boley's 32-core box, after running pgbench for a few minutes, that "in the worst case" scenario starts happening quite regularly, apparently because the number of backends that simultaneously wish to read different CLOG pages exceeds the number of available buffers into which those pages can be read. The ninth and following backends to come along have to wait until the least-recently-used page is no longer read-busy before starting their reads. (A rough sketch of the logic involved is at the end of this mail.)

So, what do we do about this? The obvious answer is "increase NUM_CLOG_BUFFERS", and I'm not sure that's a bad idea: 64kB is a pretty small cache on anything other than an embedded system, these days. We could either increase the hard-coded value or make it configurable - but it would have to be PGC_POSTMASTER, since there's no way to allocate more shared memory later on. The downsides of this approach are:

1. If we make it configurable, nobody will have a clue what value to set.

2. If we just make it bigger, people laboring under the default 32MB shared memory limit will conceivably suffer even more than they do now if they just initdb and go.

A more radical approach would be to try to merge the buffer arenas for the various SLRUs, either with each other or with shared_buffers, which would presumably allow a lot more flexibility to ratchet the number of CLOG buffers up or down depending on overall memory pressure. Merging the buffer arenas into shared_buffers seems like the most flexible solution, but it also seems like a big, complex, error-prone behavior change, because the SLRU machinery does things quite differently from shared_buffers: we look up buffers with a linear array search rather than a hash table probe; we have only a per-SLRU lock and a per-page lock, rather than separate mapping locks, content locks, io-in-progress locks, and pins; and while the main buffer manager is content with some loosey-goosey approximation of recency, the SLRU code makes a fervent attempt at strict LRU (slightly compromised for the sake of reduced locking in SimpleLruReadPage_ReadOnly).

Any thoughts on what makes most sense here?
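To make the 256K arithmetic above concrete, here's a throwaway C program that works it out. The values are hardcoded copies of NUM_CLOG_BUFFERS (clog.h), the default BLCKSZ, and CLOG_XACTS_PER_BYTE (clog.c) - an illustration only, so it won't track the real definitions:

    #include <stdio.h>

    int
    main(void)
    {
        long    num_clog_buffers = 8;   /* NUM_CLOG_BUFFERS, clog.h */
        long    blcksz = 8192;          /* BLCKSZ, default build */
        long    xacts_per_byte = 4;     /* CLOG_XACTS_PER_BYTE: 2 status
                                         * bits per transaction */

        /* XIDs whose status the buffer pool can hold simultaneously */
        long    xids = num_clog_buffers * blcksz * xacts_per_byte;

        printf("CLOG buffers cover %ld XIDs (%ldK)\n", xids, xids / 1024);
        return 0;
    }

That prints 262144 XIDs, i.e. 256K; on a busy pgbench run that's not many transactions.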
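For anyone who hasn't read slru.c lately, the page-replacement loop containing the comment quoted above is roughly the following shape. This is a simplified paraphrase, not the real code - the actual function also handles I/O errors and the recency-counter bookkeeping, and find_lru_slot() here stands in for the inline scan it really does:

    /* Rough paraphrase of SlruSelectLRUPage() in slru.c. */
    static int
    slru_select_lru_page(SlruCtl ctl, int pageno)
    {
        SlruShared  shared = ctl->shared;

        for (;;)
        {
            int     slotno;

            /* If the page already has a buffer, just return its slot. */
            for (slotno = 0; slotno < shared->num_slots; slotno++)
                if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
                    shared->page_number[slotno] == pageno)
                    return slotno;

            /* Not in memory: pick a victim, strict-LRU style. */
            slotno = find_lru_slot(shared);     /* stand-in helper */

            /* An empty, or clean and idle, slot can be recycled now. */
            if (shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
                (shared->page_status[slotno] == SLRU_PAGE_VALID &&
                 !shared->page_dirty[slotno]))
                return slotno;

            /*
             * Otherwise we must wait for I/O.  If the victim is merely
             * dirty, start a write ourselves; if it's already write-busy
             * or (the "worst case") still read-busy, wait for the
             * existing I/O to finish.  Then loop and retry, since the
             * state may have changed while we slept.
             */
            if (shared->page_status[slotno] == SLRU_PAGE_VALID)
                SlruInternalWritePage(ctl, slotno, NULL);
            else
                SimpleLruWaitIO(ctl, slotno);
        }
    }

The linear scan under a single per-SLRU lock is also a fair picture of why bolting the SLRUs onto the shared_buffers machinery, with its hash-table lookups and per-buffer locking, would be invasive.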
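And if we did make it a GUC, the guc.c side would be the easy part; a hypothetical entry for the ConfigureNamesInt table might look like the fragment below. The variable name, boot value, and limits are invented for illustration - the real work is getting clog.c and the shared-memory sizing to read the variable instead of the #define:

    /* Hypothetical addition to the ConfigureNamesInt table in guc.c. */
    {
        {"clog_buffers", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Sets the number of shared memory CLOG buffers."),
            NULL
        },
        &num_clog_buffers,          /* invented variable name */
        8, 4, 1024,                 /* boot value, min, max: made up */
        NULL, NULL, NULL
    },

Which just brings us back to downside #1: what would anyone set it to?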
I find it fairly tempting to just crank up NUM_CLOG_BUFFERS and call it good, but the siren song of refactoring is whispering in my other ear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company