I think that the BufFreelistLock can be a contention bottleneck on a system with a lot of CPUs that do a lot of shared-buffer allocations which can be fulfilled by the OS buffer cache. That is, read-mostly queries where the working data set fits in RAM, but not in shared_buffers. (You can always increase shared_buffers, but that leads to other problems, and who wants to spend their time micromanaging the size of shared_buffers as workloads slowly change?)
I can't prove it is a contention bottleneck without first solving the putative problem and timing the difference, but it is the dominant blocking lock showing up under LWLOCK_STATS for one benchmark I've done using 8 CPUs. So I have two questions:

1) Would it be useful for BufFreelistLock to be partitioned, like BufMappingLock, or via some kind of clever "virtual partitioning" that could get the same benefit by another means? I don't know whether both the linked list and the clock sweep would have to be partitioned, or if some other arrangement could be made.

2) Could BufFreelistLock simply go away, by reducing it from a lwlock to a spinlock? Or at least in the most common paths?

For doing away with it, I think that any manipulation of the freelist is short enough (just a few instructions) that it could be done under a spinlock. If you somehow obtained a buffer that is pinned or has a nonzero usage_count, you would have to retake the spinlock to look at the new head of the list, but the comments in StrategyGetBuffer suggest that that should be rare or impossible.

For the clock sweep algorithm, I think you could access nextVictimBuffer without any type of locking. If a non-atomic increment causes an occasional buffer to be skipped or examined twice, that doesn't seem like a correctness problem. When nextVictimBuffer gets reset to zero and completePasses gets incremented, that would probably need to be protected, to prevent a double-increment of completePasses from throwing off the background writer's usage estimates. But again, a spinlock should be enough for that, and it shouldn't occur all that often.

If potentially inaccurate non-atomic increments of numBufferAllocs are a problem, it could be incremented under the same spinlock used to protect the test firstFreeBuffer >= 0 that determines whether the freelist is empty.

Doing away with the lock without some form of partitioning might just move the contention to the buffer-header spinlocks. But if most of the processes entering the code at about the same time perceive each other's increments to nextVictimBuffer, they would all start out offset from each other and shouldn't collide too badly.

Does any of this sound like it might be fruitful to look into?
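To make the clock-sweep part concrete, here is a rough standalone toy model of what I have in mind (not a patch against bufmgr, just an illustration). It uses a pthread spinlock instead of the backend's slock_t; only the field names (nextVictimBuffer, completePasses, numBufferAllocs) and NBuffers are borrowed from the real BufferStrategyControl, while ToyStrategyControl and sweep_next_victim are made-up names. The clock hand is advanced with no lock at all; only the wraparound and the completePasses bump are taken under the spinlock:

/*
 * Toy standalone model (not PostgreSQL code) of the scheme sketched
 * above: the clock hand (nextVictimBuffer) is advanced with no lock,
 * and only the wraparound / completePasses update is done under a
 * short spinlock.  Field names mirror BufferStrategyControl; everything
 * else is invented for illustration.
 *
 * Build: cc -O2 -pthread sweep_sketch.c -o sweep_sketch
 */
#include <pthread.h>
#include <stdio.h>

#define NBuffers 1024

typedef struct
{
    pthread_spinlock_t lock;        /* protects wraparound + completePasses */
    volatile int  nextVictimBuffer;
    volatile long completePasses;
    volatile long numBufferAllocs;
} ToyStrategyControl;

static ToyStrategyControl StrategyControl;

/* Advance the clock hand and return the buffer id to examine next. */
static int
sweep_next_victim(void)
{
    /*
     * Unlocked, non-atomic increment: an occasional lost or duplicated
     * increment only means a buffer gets skipped or looked at twice,
     * which is not a correctness problem.
     */
    int         victim = StrategyControl.nextVictimBuffer++;

    if (victim >= NBuffers)
    {
        /*
         * Wraparound is the one place we take the spinlock, so that
         * completePasses cannot be double-incremented and throw off
         * the background writer's estimates.
         */
        pthread_spin_lock(&StrategyControl.lock);
        if (StrategyControl.nextVictimBuffer >= NBuffers)
        {
            StrategyControl.nextVictimBuffer = 0;
            StrategyControl.completePasses++;
        }
        pthread_spin_unlock(&StrategyControl.lock);
        victim %= NBuffers;
    }
    return victim;
}

int
main(void)
{
    pthread_spin_init(&StrategyControl.lock, PTHREAD_PROCESS_PRIVATE);

    for (int i = 0; i < 5000; i++)
        (void) sweep_next_victim();

    printf("completePasses = %ld, nextVictimBuffer = %d\n",
           StrategyControl.completePasses,
           StrategyControl.nextVictimBuffer);
    return 0;
}

numBufferAllocs and the freelist manipulation itself would be handled analogously, under the same short spinlock that guards the firstFreeBuffer test.

Cheers,

Jeff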