Good summary.  I agree that checkpoint should look as much like a
normal Proc as possible.

> At the just-past OSDN database conference, Bruce and I were annoyed
> by some benchmark results showing that Postgres performed poorly on
> an 8-way SMP machine.  Based on past discussion, it seems likely that
> the culprit is the known inefficiency in our spinlock implementation.
> After chewing on it for awhile, we came up with an idea for a
> solution.
>
> The following proposal should improve performance substantially when
> there is contention for a lock, but it creates no portability risks,
> because it uses the same system facilities (TAS and SysV semaphores)
> that we have always relied on.  Also, I think it'd be fairly easy to
> implement --- I could probably get it done in a day.
>
> Comments anyone?
>
> 			regards, tom lane
>
>
> Plan:
>
> Replace most uses of spinlocks with "lightweight locks" (LW locks)
> implemented by a new lock manager.  The principal remaining use of
> true spinlocks (TAS locks) will be to provide mutual exclusion of
> access to LW lock structures.  Therefore, we can assume that
> spinlocks are never held for more than a few dozen instructions ---
> and never across a kernel call.
>
> It's pretty easy to rejigger the spinlock code to work well when the
> lock is never held for long.  We just need to change the spinlock
> retry code so that it does a tight spin (continuous retry) for a few
> dozen cycles --- ideally, the total delay should be some small
> multiple of the max expected lock hold time.  If the lock still has
> not been acquired, yield the CPU via a select() call (10 msec minimum
> delay) and repeat.  Although this looks inefficient, it doesn't
> matter on a uniprocessor, because we expect that backends will only
> rarely be interrupted while holding the lock, so in practice a held
> lock will seldom be encountered.  On SMP machines the tight spin will
> win, since the lock will normally become available before we give up
> and yield the CPU.
>
> Desired properties of the LW lock manager include:
> * very fast fall-through when there is no contention for the lock
> * waiting procs do not spin
> * support for both exclusive and shared (read-only) lock modes
> * locks are granted to waiters in arrival order (no starvation)
> * a small lock structure, to allow many LW locks to exist
>
> Proposed contents of the LW lock structure:
>
>     spinlock mutex (protects LW lock state and PROC queue links)
>     count of exclusive holders (always 0 or 1)
>     count of shared holders (0 .. MaxBackends)
>     queue head pointer (NULL or ptr to PROC object)
>     queue tail pointer (could do without this to save space)
>
> If a backend sees it must wait to acquire the lock, it adds its PROC
> struct to the end of the queue, releases the spinlock mutex, and then
> sleeps by P'ing its per-backend wait semaphore.  A backend releasing
> the lock will check to see if any waiter should be granted the lock.
> If so, it will update the lock state, release the spinlock mutex, and
> finally V the wait semaphores of any backends that it decided should
> be released (having removed them from the lock's queue while still
> holding the spinlock mutex).  Notice that no kernel calls need be
> done while holding the spinlock.  Since the wait semaphore will
> remember a V occurring before a P, there's no problem if the releaser
> is fast enough to release the waiter before the waiter reaches its P
> operation.
>
> We will need to add a few fields to PROC structures:
> * a flag showing whether the PROC is waiting for an LW lock, and if
>   so whether it waits for read or write access
> * an additional PROC queue link field
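
To make the spin-then-yield idea concrete, here is a minimal sketch of
the revised retry loop.  It assumes a TAS() that returns nonzero when
the lock is already held; slock_t and the spin count are placeholders
to be tuned per platform, not actual source:

    #include <sys/time.h>
    #include <unistd.h>

    typedef int slock_t;                    /* stand-in for the platform type */
    extern int TAS(volatile slock_t *lock); /* nonzero => lock was busy */

    #define SPINS_PER_DELAY 100             /* placeholder: "a few dozen" retries */

    void
    s_lock_sketch(volatile slock_t *lock)
    {
        int     spins = 0;

        while (TAS(lock))                   /* tight spin while lock is busy */
        {
            if (++spins >= SPINS_PER_DELAY)
            {
                struct timeval delay;

                delay.tv_sec = 0;
                delay.tv_usec = 10000;      /* 10 msec: yield the CPU */
                select(0, NULL, NULL, NULL, &delay);
                spins = 0;                  /* then spin tightly again */
            }
        }
        /* fall out of the loop once TAS succeeds */
    }

And here is a rough sketch of the LW lock structure plus the
exclusive-mode acquire/release protocol as I read the proposal.  All
names (LWLock, SpinLockAcquire, PGSemaphoreLock, the lw* PROC fields)
are illustrative stand-ins, not actual source:

    typedef void *PGSemaphore;                          /* stand-in type */
    extern void PGSemaphoreLock(PGSemaphore sema);      /* P */
    extern void PGSemaphoreUnlock(PGSemaphore sema);    /* V */
    extern void SpinLockAcquire(volatile slock_t *lock);
    extern void SpinLockRelease(volatile slock_t *lock);

    typedef struct PROC PROC;
    struct PROC
    {
        PGSemaphore sem;            /* per-backend wait semaphore */
        int         lwWaiting;      /* true if waiting for an LW lock */
        int         lwExclusive;    /* true if the wait is for write access */
        PROC       *lwWaitLink;     /* next waiter in some LW lock's queue */
    };

    extern PROC *MyProc;

    typedef struct LWLock
    {
        slock_t     mutex;          /* protects LW lock state and queue links */
        int         exclusive;      /* count of exclusive holders (0 or 1) */
        int         shared;         /* count of shared holders (0..MaxBackends) */
        PROC       *head;           /* front of wait queue, or NULL */
        PROC       *tail;           /* end of wait queue (valid if head != NULL) */
    } LWLock;

    void
    LWLockAcquireExclusive(LWLock *lock)
    {
        SpinLockAcquire(&lock->mutex);
        if (lock->exclusive == 0 && lock->shared == 0)
        {
            lock->exclusive = 1;            /* no contention: fast fall-through */
            SpinLockRelease(&lock->mutex);
            return;
        }

        /* Must wait: add self to the queue tail, then sleep on own semaphore. */
        MyProc->lwWaiting = 1;
        MyProc->lwExclusive = 1;
        MyProc->lwWaitLink = NULL;
        if (lock->head == NULL)
            lock->head = MyProc;
        else
            lock->tail->lwWaitLink = MyProc;
        lock->tail = MyProc;
        SpinLockRelease(&lock->mutex);      /* no kernel call under the spinlock */

        /*
         * P.  The releaser updates the lock state on our behalf before
         * V'ing us, so there is nothing more to do here.  (A refined wait
         * loop, handling the shared-semaphore complication, is sketched
         * further down.)
         */
        PGSemaphoreLock(MyProc->sem);
    }

    void
    LWLockReleaseExclusive(LWLock *lock)
    {
        PROC       *waiter;

        SpinLockAcquire(&lock->mutex);
        lock->exclusive = 0;
        waiter = lock->head;                /* grant in arrival order */
        if (waiter != NULL)
        {
            lock->head = waiter->lwWaitLink;
            if (waiter->lwExclusive)
                lock->exclusive = 1;        /* ownership passes directly */
            else
                lock->shared++;             /* real code would keep waking
                                             * consecutive shared waiters */
            waiter->lwWaiting = 0;
        }
        SpinLockRelease(&lock->mutex);
        if (waiter != NULL)
            PGSemaphoreUnlock(waiter->sem); /* V only after dropping the spinlock */
    }

The point to notice is the property claimed above: the only kernel
calls (the semaphore P and V) happen after the spinlock mutex has been
released.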

> We can't reuse the existing queue link field, because it is possible
> for a PROC to be waiting for both a heavyweight lock and a
> lightweight one --- this will occur when HandleDeadLock or
> LockWaitCancel tries to acquire the LockMgr module's lightweight lock
> (formerly a spinlock).
>
> It might seem that we also need to create a second wait semaphore per
> backend: one to wait on HW locks, and one to wait on LW locks.  But I
> believe we can get away with just one, by recognizing that a wait for
> an LW lock can never be interrupted by a wait for a HW lock, only
> vice versa.  After being awoken (V'd), the LW lock manager must check
> to see if it was actually granted the lock (easiest way: look at our
> own PROC struct to see if the LW lock wait flag has been cleared).
> If not, the V must have been to grant us a HW lock --- but we still
> have to sleep to get the LW lock.  So remember that this happened,
> then loop back and P again.  When we finally get the LW lock, if
> there was an extra P operation, then V the semaphore once before
> returning.  This will allow ProcSleep to exit the wait for the HW
> lock when we return to it.
>
> Fine points:
>
> While waiting for an LW lock, we need to show in our PROC struct
> whether we are waiting for read or write access.  But we don't need
> to remember this after getting the lock; once we know we have the
> lock, it's easy to see by inspecting the lock whether we hold read or
> write access.
>
> ProcStructLock cannot be replaced by an LW lock, since a backend
> cannot use an LW lock until it has obtained a PROC struct and a
> semaphore, both of which are protected by this lock.  It seems okay
> to use a plain spinlock for this purpose.  NOTE: it's okay for
> SInvalLock to be an LW lock, as long as the LW manager does not
> depend on accessing the SI array of PROC objects, but only chains
> through the PROCs themselves.
>
> Another tricky point is that some of the setup code executed by the
> postmaster may try to grab/release LW locks.  Here, we can probably
> allow a special case for MyProc=NULL.  It's likely that we should
> never see a block under these circumstances anyway, so finding
> MyProc=NULL when we need to block may just be a fatal error
> condition.
>
> A nastier case is checkpoint processes; these expect to grab BufMgr
> and WAL locks.  Perhaps it's okay for them to do plain sleeps between
> attempts to grab the locks?  This says that the MyProc=NULL case
> should release the spinlock mutex, sleep 10 msec, and try again,
> rather than raising an error or assuming there will be no conflict.
> Are there any cases where this represents a horrid performance loss?
> Checkpoint itself seems noncritical.
>
> The alternative is for checkpoint to be allowed to create a PROC
> struct (but not to enter it in the SI list) so it can participate
> normally in LW lock operations.  That seems a good idea anyway,
> actually, so that the PROC struct's facility for releasing held LW
> locks at elog time will work inside the checkpointer.  (But that
> means we need an extra sema too?  Okay, but we don't want an extra
> would-be backend to obtain that sema and perhaps cause a checkpoint
> proc to fail.  So we must allocate the PROC and sema for the
> checkpoint process separately from those reserved for backends.)
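
The one-semaphore trick might look like this in code.  extraWaits is a
made-up name, and the primitives are the same stand-ins as in the
sketch above; this loop would replace the bare PGSemaphoreLock() call
in LWLockAcquireExclusive():

    int     extraWaits = 0;

    for (;;)
    {
        PGSemaphoreLock(MyProc->sem);       /* P */
        if (!MyProc->lwWaiting)
            break;                          /* wait flag cleared: the releaser
                                             * granted us the LW lock */
        extraWaits++;                       /* that V was a HW lock grant;
                                             * keep sleeping for the LW lock */
    }

    while (extraWaits-- > 0)
        PGSemaphoreUnlock(MyProc->sem);     /* re-issue the absorbed V's so
                                             * ProcSleep can exit its HW wait */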