Does the original version of my stress test not repro the problem on 9.2?
My primary concern with the fix is that I simply didn't understand what
might happen after a failed lock attempt calls CleanUpLock(), freeing the
PROCLOCK but leaving some LOCALLOCK still pointing at it.  As long as
"nLocks == 0" is guaranteed, I guess we are OK, although LockRelease()
will elog a WARNING and LockReleaseAll() believes "/* ... we must've run
out of shared memory ... */".
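To make that concern concrete, here is a minimal stand-alone sketch of the
scenario.  The structs are simplified stand-ins, not the real lock.c
definitions, and malloc/free stands in for shared-memory reclamation:

/* Hypothetical, simplified illustration -- NOT the real lock.c types. */
#include <stdio.h>
#include <stdlib.h>

typedef struct ProcLock { int granted; } ProcLock;

typedef struct LocalLock
{
    ProcLock   *proclock;   /* may dangle after a failed acquire */
    int         nLocks;     /* 0 => no lock actually held locally */
} LocalLock;

/* Release path, mirroring the invariant the real LockRelease() relies on:
 * consult nLocks before ever dereferencing the shared-memory pointer. */
static void release(LocalLock *ll)
{
    if (ll->nLocks == 0)
    {
        /* lock.c would elog(WARNING, ...) here */
        fprintf(stderr, "WARNING: lock is not held\n");
        return;
    }
    ll->proclock->granted = 0;  /* safe only because nLocks > 0 */
    ll->nLocks--;
}

int main(void)
{
    ProcLock   *pl = malloc(sizeof(ProcLock));
    LocalLock   ll = {pl, 0};   /* failed acquire: pointer kept, nLocks == 0 */

    free(pl);                   /* CleanUpLock() reclaims the shared entry */

    /* The stale ll.proclock is never dereferenced, because the release
     * path bails out on nLocks == 0 first.  If nLocks could somehow be
     * > 0 here, we would scribble on freed (or reused) memory. */
    release(&ll);
    return 0;
}

That is the whole reason I keep asking whether "nLocks == 0" really holds
on every failure path.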
Why does LockAcquireExtended() test for "nLocks == 0" in the "if
(dontWait)" block before calling RemoveLocalLock()?  Can nLocks be
anything other than 0 if we've just freed the PROCLOCK in that block of
code?  If nLocks is > 0 AND still pointing at a freed PROCLOCK, can't we
still have a problem?  Given that "dontWait == true" seems to be
associated with DDL and other rare operations, it might be hard to
stress-test this case.

On Tue, Nov 26, 2013 at 5:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Daniel Wood <dw...@salesforce.com> writes:
> > Sorry but I don't know how to respond to an old thread I found on
> > postgresql.org:
> > http://www.postgresql.org/message-id/20766.1357318...@sss.pgh.pa.us
>
> > I presume this is still an open issue.  While working on a new feature I
> > wrote a stress test for it.  After fixing my bugs, I couldn't get rid of:
> > ERROR:  lock RowExclusiveLock on object 16384/16393/0 is already held
> > In addition, with asserts enabled in lock.c I would see Asserts in
> > UnGrantLock, LockRelease, LockReleaseAll, etc.  In one run everything
> > hung waiting on SQL locks.
>
> > The asserts or hang occurs within seconds of running the stress test.
> > Before I get into the details I want to see if this is something already
> > being worked on.  I can repro it easily in vanilla 9.3.
>
> Dan sent me this test case off-list.  Here's a somewhat cleaned-up
> version:
>
> 1. Create a table:
>    create table lock_bug(f1 int, f2 int);
>
> 2. Compile the attached lockbug.c.
>
> 3. Run the attached lockbug.sh, with adjustment of parameters to taste.
>
> This spits up in a remarkable variety of ways in 9.3 and HEAD,
> especially if you have assertions on.
>
> After some investigation I found the cause: LockRelease() expects that
> if a LOCALLOCK object represents a valid lock (nLocks > 0), then
> either its lock and proclock fields point to associated shared-memory
> entries, or they're NULL.  However, it's possible for that to not be
> true, because if we fail to acquire a lock, the code leaves around a
> LOCALLOCK object pointing at the shared objects we couldn't get lock
> on.  *These objects are subject to reclamation, because no lock is
> actually held*.  So if we make another attempt to get the same lock
> later in the same transaction, LockAcquire finds and re-uses that
> LOCALLOCK object.  Pre fast-path locks, it would always recompute the
> lock and proclock pointers, so the fact that they might have been stale
> wasn't a problem.  But the fast-path patch added a code path in which
> we could succeed in acquiring a fast-path lock, and then exit without
> having done anything with the pointer fields.  Now we have something
> that looks like a valid locallock but could be pointing at entirely
> unrelated shared objects.  This leads to all sorts of fun at release
> time, with the symptoms depending on exactly what those shared
> hashtable entries are being used for at the instant we stomp on them.
>
> So the fix is pretty simple:
>
> diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
> index f8dc951..37c605f 100644
> *************** LockAcquireExtended(const LOCKTAG *lockt
> *** 837,842 ****
> --- 844,851 ----
>           LWLockRelease(MyProc->backendLock);
>           if (acquired)
>           {
> +             locallock->lock = NULL;
> +             locallock->proclock = NULL;
>               GrantLockLocal(locallock, owner);
>               return LOCKACQUIRE_OK;
>           }
>
> although this needs to be backed up with a lot more comments of course.
>
> When I showed this to Dan, he objected that it'd be better if the
> failing lock operation had cleaned up the LOCALLOCK instead.  That
> would be a significantly bigger change though, and I think it'd net out
> being less robust.  The data structure was designed to not need such
> cleanup, because the more stuff you have to do to clean up after a
> failure, the more likely it is that you'll have additional problems
> during the cleanup, leaving you hosed.  In particular, we'd have to be
> certain that we could remove the useless shared objects during the
> cleanup step, since once the LOCALLOCK is gone there is nothing else
> that will remind us to garbage-collect them at end of transaction.
>
> BTW, although I'm certain that 9.2 has this bug as well, this test
> case fails to show a problem in that branch.  I've not looked into
> why not --- it's probably a timing issue or something.
>
> Comments?
>
>                         regards, tom lane
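For reference, here is how I'd imagine the patched fast-path branch
reading once those comments are filled in.  The comment wording is my
paraphrase of the explanation above, and the surrounding lines are
abridged from 9.3-era lock.c, so treat this as a sketch rather than the
final committed form:

        LWLockRelease(MyProc->backendLock);
        if (acquired)
        {
            /*
             * We got the lock via the fast path, so no shared-memory
             * LOCK/PROCLOCK entries are associated with it.  But if a
             * previous attempt to acquire this same lock failed, the
             * LOCALLOCK may still point at shared entries that were
             * reclaimed when that attempt was abandoned.  Clear the
             * pointers so that LockRelease/LockReleaseAll don't chase
             * stale pointers into whatever now occupies those
             * hashtable slots.
             */
            locallock->lock = NULL;
            locallock->proclock = NULL;
            GrantLockLocal(locallock, owner);
            return LOCKACQUIRE_OK;
        }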