Jignesh Shah's scalability testing on Solaris has revealed further tuning opportunities surrounding the start and end of a transaction. That tuning should be especially important, since async commit is likely to allow much higher transaction rates than were previously possible.

There is strong contention on the ProcArrayLock in Exclusive mode, with the top path being CommitTransaction(). This becomes clear as the number of connections increases, but it seems likely that the contention can arise in a range of other circumstances. My thoughts on the causes of this contention are that the following 3 tasks contend with each other in the following way:

CommitTransaction(): takes ProcArrayLock Exclusive
    but only needs access to one ProcArray element

waits for

GetSnapshotData(): ProcArrayLock Shared
    ReadNewTransactionId(): XidGenLock Shared

which waits for

GetNewTransactionId()
    takes XidGenLock Exclusive
    ExtendCLOG(): takes CLogControlLock Exclusive, WALInsertLock Exclusive
        two possible places where I/O is required
    ExtendSubtrans(): takes SubtransControlLock
        one possible place where I/O is required
    Avoids lock on ProcArrayLock: atomically updates one ProcArray element

or more simply:

CommitTransaction() -- i.e. once per transaction
waits for
GetSnapshotData() -- i.e. once per SQL statement
which waits for
GetNewTransactionId() -- i.e. once per transaction

This gives some goals for scalability improvements and some proposals. (1) and (2) are proposals for 8.3 tuning, the others are directions for further research.

Goal: Reduce total time that GetSnapshotData() waits for GetNewTransactionId()

1. Increase size of Clog-specific BLCKSZ
Clog currently uses BLCKSZ to define the size of clog buffers. This can be changed to use CLOG_BLCKSZ, which would then be set to 32768. This will naturally increase the amount of memory allocated to the clog, so if we do this we need not raise CLOG_BUFFERS above 8 (as was previously suggested, with successful results). This will also reduce the number of ExtendCLOG() calls, which will probably reduce the overall contention as well.

2. Perform ExtendCLOG() as a background activity
A background process can look at the next transaction id once each cycle without holding any lock.
If the xid is almost at the point where a new clog page would be allocated, then the background process will allocate one before the new page is strictly required. Doing this as a background task would mean that we do not need to hold the XidGenLock in exclusive mode while we do this, which means that GetSnapshotData() and CommitTransaction() would also be less likely to block. Also, if any clog writes need to be performed when the page is moved forwards, these would also be performed in the background.

3. Consider whether ProcArrayLock should use a new queued-shared lock mode that puts a maximum wait time on ExclusiveLock requests. It would be fairly hard to implement this well as a timer, but it might be possible to place a limit on queue length, i.e. allow Share locks to be granted immediately if a Shared holder already exists, but only if no more than N exclusive mode requests are queued. This might prevent the worst cases of exclusive lock starvation.

4. Since shared locks are currently queued behind exclusive requests when they cannot be immediately satisfied, it might be worth reconsidering the way LWLockRelease works also. When we wake up the queue we only wake the Shared requests that are adjacent to the head of the queue. Instead we could wake *all* waiting Shared requestors.

e.g. with a lock queue like this:

(HEAD) S<-S<-X<-S<-X<-S<-X<-S

Currently we would wake the 1st and 2nd waiters only. If we were to wake the 4th, 6th and 8th waiters also, then the queue would reduce in length very quickly, if we assume generally uniform service times. (If the head of the queue is X, then we wake only that one process and I'm not proposing we change that.)

That would mean queue jumping, right? Well, that's what already happens in other circumstances, so there cannot be anything intrinsically wrong with allowing it. The only question is: would it help? We need not wake the whole queue; there may be some generally more beneficial heuristic.

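The effect of the two wake-up policies on the example queue can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the LWLockRelease code: the function names are hypothetical, positions are counted from 1 within the queue as it stands at each release, every woken group is assumed to finish before the next release, and no new requests arrive. In the example queue the Shared waiters sit at positions 1, 2, 4, 6 and 8.

```python
def wake_adjacent(queue):
    """Current behaviour as described above: wake an Exclusive head alone,
    or only the run of Shared waiters adjacent to the head.
    Returns (1-based positions woken, remaining queue)."""
    if not queue:
        return [], []
    if queue[0] == "X":
        return [1], queue[1:]
    n = 0
    while n < len(queue) and queue[n] == "S":
        n += 1
    return list(range(1, n + 1)), queue[n:]


def wake_all_shared(queue):
    """Proposed variant: if the head is Shared, wake every Shared waiter.
    An Exclusive head is still woken alone."""
    if not queue:
        return [], []
    if queue[0] == "X":
        return [1], queue[1:]
    woken = [i + 1 for i, mode in enumerate(queue) if mode == "S"]
    remaining = [mode for mode in queue if mode == "X"]
    return woken, remaining


def releases_to_drain(queue, policy):
    """Count how many lock releases it takes to empty the queue."""
    rounds = 0
    while queue:
        _, queue = policy(queue)
        rounds += 1
    return rounds


queue = list("SSXSXSXS")  # (HEAD) S<-S<-X<-S<-X<-S<-X<-S
print(wake_adjacent(queue)[0])
print(wake_all_shared(queue)[0])
print(releases_to_drain(queue, wake_adjacent))
print(releases_to_drain(queue, wake_all_shared))
```

Under these assumptions the last Exclusive waiter is reached after fewer releases with the second policy, which is the effect item 4 is after. Whether that holds with real arrival patterns is exactly the open question.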
The reason for considering this is not to speed up Shared requests but to reduce the queue length and thus the waiting time for the Exclusive requestors. Each time a Shared request is dequeued, we effectively re-enable queue jumping, so a Shared request arriving at that point will actually jump ahead of Shared requests that were unlucky enough to arrive while an Exclusive lock was held. Worse than that, the new incoming Shared requests exacerbate the starvation, so the more non-adjacent groups of Shared lock requests there are in the queue, the worse the starvation of the exclusive requestors becomes. We are effectively randomly starving some shared locks as well as exclusive locks in the current scheme, based upon the state of the lock when they make their request. The situation is worst when the lock is heavily contended and the workload has a 50/50 mix of shared/exclusive requests, e.g. serializable transactions or transactions with lots of subtransactions.

Goal: Reduce the total time that CommitTransaction() waits for GetSnapshotData()

5. Reduce the time that GetSnapshotData() holds ProcArrayLock. To do this, we split the ProcArrayLock into multiple partitions (as suggested by Alvaro). There are comments in GetNewTransactionId() about having one spinlock per ProcArray entry. This would be too many, and we could reduce contention by having one lock for each N ProcArray entries. Since we don't see too much contention with 100 users (default) it would seem sensible to make N ~ 120. Striped or contiguous? If we stripe the lock partitions then we will need multiple partitions however many users we have connected, whereas using contiguous ranges would allow one lock for low numbers of users and yet enough locks for higher numbers of users.

6. Reduce the number of times ProcArrayLock is acquired in Exclusive mode.
To do this, optimise group commit so that all of the actions for multiple transactions are executed together: flushing WAL, updating CLOG and updating ProcArray, whenever it is appropriate to do so. There's no point in having a group commit facility that optimises just one of those contention points when all 3 need to be considered. That needs to be done as part of a general overhaul of group commit. This would include making TransactionLogMultiUpdate() take CLogControlLock once for each page that it needs to access, which would also reduce contention from TransactionIdCommitTree().

(1) and (2) can be patched fairly easily for 8.3. I have a prototype patch for (1) on the shelf already from 6 months ago.

(3), (4) and (5) seem like changes that would require significant testing time to ensure we did it correctly, even though the patches might be fairly small. I'm thinking this is probably an 8.4 change, but I think I can get test versions out fairly quickly.

(6) seems definitely an 8.4 change.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com