On 03/02/17 19:38, Fujii Masao wrote: > On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fu...@gmail.com> wrote: >> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI >> <horiguchi.kyot...@lab.ntt.co.jp> wrote: >>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fu...@gmail.com> >>> wrote in >>> <CAHGQGwHqQVHmQ7wM=elnnp1_oy-gvssacajxwje4nc2twsq...@mail.gmail.com> >>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier >>>> <michael.paqu...@gmail.com> wrote: >>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>>>>> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes: >>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that >>>>>>> some functions called from there can throw exceptions. >>>>>> >>>>>> Yes, but all LWLocks should be released by normal error recovery. >>>>>> It should not be necessary for this code to clean that up by hand. >>>>>> If it were necessary, there would be TRY-CATCH around every single >>>>>> LWLockAcquire in the backend, and we'd have an unreadable and >>>>>> unmaintainable system. Please don't add a TRY-CATCH unless it's >>>>>> *necessary* -- and you haven't explained why this one is. >>>> >>>> Yes. >>> >>> Thank you for the suggestion. I minunderstood that. >>> >>>>> Putting hands into the code and at the problem, I can see that >>>>> dropping a subscription on a node makes it unresponsive in case of a >>>>> stop. And that's just because calls to LWLockRelease are missing as in >>>>> the patch attached. A try/catch problem should not be necessary. >>>> >>>> Thanks for the patch! >>>> >>>> With the patch, LogicalRepLauncherLock is released at the end of >>>> DropSubscription(). But ISTM that the lock should be released just after >>>> logicalrep_worker_stop() and there is no need to protect the removal of >>>> replication slot with the lock. >>> >>> That's true. logicalrep_worker_stop returns after confirmig that >>> worker->proc is cleard, so no false relaunch cannot be caused. >>> After all, logicalrep_worker_stop is surrounded by >>> LWLockAcquire/Relase pair. So it can be moved into the funciton >>> and make the lock secrion to be more narrower. > > If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be > removed and the comment for logicalrep_worker_stop() should be updated. > > Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock > while holding LogicalRepLauncherLock. OTOH, with your approach, > logicalrep_worker_stop() takes LogicalRepLauncherLock while holding > LogicalRepWorkerLock. > > Therefore I pushed the simple patch which adds LWLockRelease() just after > logicalrep_worker_stop(). > > Another problem that I found while reading the code is that the launcher can > start up the worker with the subscription that DROP SUBSCRIPTION just removed. > That is, DROP SUBSCRIPTION removes the target entry from pg_subscription, > but the launcher can see it and start new worker until the transaction for > DROP has been committed. >
That was the reason why DropSubscription didn't release the lock in the first place. It was supposed to be released at the end of the transaction though. > To fix this issue, I think that DROP SUBSCRIPTION should take > AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock, > so that the launcher cannot see the entry to be being removed. > The whole point of having LogicalRepLauncherLock is to avoid having to do this, so if we do this we could probably get rid of it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers