Does the original version of my stress test not repro the problem on 9.2?

My primary concern with the fix is that I simply didn't understand what
might happen after a failed lock attempt called CleanUpLock, freeing the
PROCLOCK but leaving some LOCALLOCK still pointing at it.  As long as
"nLocks == 0" is guaranteed, I guess we are OK, although LockRelease will
elog a WARNING and LockReleaseAll believes "/* ... we must've run out of
shared memory ... */".
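
To check my understanding of the hazard, here is a tiny standalone
sketch (my own simplified stand-ins -- SharedEntry, LocalEntry and
release_local() are made up, not the real lock.c structures) of why the
stale pointers are only harmless while nLocks == 0:

#include <stdio.h>
#include <stdlib.h>

typedef struct SharedEntry { int granted; } SharedEntry; /* stand-in for LOCK/PROCLOCK */

typedef struct LocalEntry
{
    SharedEntry *lock;      /* may dangle after a failed acquire */
    SharedEntry *proclock;  /* ditto */
    int          nLocks;    /* local hold count */
} LocalEntry;

static void
release_local(LocalEntry *ll)
{
    if (ll->nLocks == 0)
    {
        /* analogous to the WARNING LockRelease emits */
        fprintf(stderr, "WARNING: lock is not held\n");
        return;
    }
    /* only a held entry ever dereferences the shared pointers */
    ll->lock->granted--;
    ll->nLocks--;
}

int
main(void)
{
    SharedEntry *shared = malloc(sizeof(SharedEntry));
    LocalEntry   ll = {shared, shared, 0};

    free(shared);           /* analogous to CleanUpLock freeing the shared entries */
    release_local(&ll);     /* harmless only because nLocks == 0 */
    return 0;
}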

Why does LockAcquireExtended() test for "nLocks == 0" in the "if
(dontWait)" block before calling RemoveLocalLock()?  Can nLocks be
anything other than 0 if we've just freed the PROCLOCK in this block of
code?  If nLocks is > 0 and the LOCALLOCK is still pointing at a freed
PROCLOCK, can't we still have a problem?  Given that "dontWait == true"
seems to be associated with DDL and other rare operations, it might be
hard to stress-test this case.  The path I mean is sketched below.
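
For reference, the shape of the code path I'm asking about is roughly
this (paraphrased from memory, not a verbatim excerpt of lock.c):

    if (dontWait)
    {
        /*
         * ... back out the shared-table bookkeeping, possibly freeing
         * the PROCLOCK, then release the partition lock ...
         */
        if (locallock->nLocks == 0)
            RemoveLocalLock(locallock);
        /*
         * If we could ever get here with nLocks > 0, the LOCALLOCK
         * would survive pointing at shared entries this path may just
         * have freed.
         */
        return LOCKACQUIRE_NOT_AVAIL;
    }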



On Tue, Nov 26, 2013 at 5:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Daniel Wood <dw...@salesforce.com> writes:
> > Sorry but I don't know how to respond to an old thread I found on
> > postgresql.org:
> > http://www.postgresql.org/message-id/20766.1357318...@sss.pgh.pa.us
>
> > I presume this is still an open issue.  While working on a new feature I
> > wrote a stress test for it.  After fixing my bugs, I couldn't get rid of:
> > ERROR:  lock RowExclusiveLock on object 16384/16393/0 is already held
> > In addition, with asserts enabled in lock.c I would see Asserts in
> > UnGrantLock, LockRelease, LockReleaseAll, etc.  In one run everything
> hung
> > waiting on SQL locks.
>
> > The asserts or hang occurs within seconds of running the stress test.
> > Before I get into the details I want to see if this is something already
> > being worked on.  I can repro it easily in vanilla 9.3.
>
> Dan sent me this test case off-list.  Here's a somewhat cleaned-up
> version:
>
> 1. Create a table:
>    create table lock_bug(f1 int, f2 int);
>
> 2. Compile the attached lockbug.c.
>
> 3. Run the attached lockbug.sh, with adjustment of parameters to taste.
>
> This spits up in a remarkable variety of ways in 9.3 and HEAD,
> especially if you have assertions on.
>
> After some investigation I found the cause: LockRelease() expects that
> if a LOCALLOCK object represents a valid lock (nLocks > 0), then
> either its lock and proclock fields point to associated shared-memory
> entries, or they're NULL.  However, it's possible for that to not be
> true, because if we fail to acquire a lock, the code leaves around a
> LOCALLOCK object pointing at the shared objects we couldn't get lock
> on.  *These objects are subject to reclamation, because no lock is
> actually held*.  So if we make another attempt to get the same lock
> later in the same transaction, LockAcquire finds and re-uses that
> LOCALLOCK object.  Pre fast-path locks, it would always recompute the
> lock and proclock pointers, so the fact that they might have been stale
> wasn't a problem.  But the fast-path patch added a code path in which
> we could succeed in acquiring a fast-path lock, and then exit without
> having done anything with the pointer fields.  Now we have something
> that looks like a valid locallock but could be pointing at entirely
> unrelated shared objects.  This leads to all sorts of fun at release
> time, with the symptoms depending on exactly what those shared
> hashtable entries are being used for at the instant we stomp on them.
>
> So the fix is pretty simple:
>
> diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
> index f8dc951..37c605f 100644
> *************** LockAcquireExtended(const LOCKTAG *lockt
> *** 837,842 ****
> --- 844,851 ----
>                 LWLockRelease(MyProc->backendLock);
>                 if (acquired)
>                 {
> +                       locallock->lock = NULL;
> +                       locallock->proclock = NULL;
>                         GrantLockLocal(locallock, owner);
>                         return LOCKACQUIRE_OK;
>                 }
>
> although this needs to be backed up with a lot more comments of course.
>
> When I showed this to Dan, he objected that it'd be better if
> the failing lock operation had cleaned up the LOCALLOCK instead.
> That would be a significantly bigger change though, and I think
> it'd net out being less robust.  The data structure was designed
> to not need such cleanup, because the more stuff you have to do
> to clean up after a failure, the more likely it is that you'll
> have additional problems during the cleanup, leaving you hosed.
> In particular, we'd have to be certain that we could remove the
> useless shared objects during the cleanup step, since once the
> LOCALLOCK is gone there is nothing else that will remind us to
> garbage-collect them at end of transaction.
>
> BTW, although I'm certain that 9.2 has this bug as well, this
> test case fails to show a problem in that branch.  I've not
> looked into why not --- it's probably a timing issue or something.
>
> Comments?
>
>                         regards, tom lane
>
>
