On Tue, 2 Apr 2019 at 21:34, Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> > On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> > > I managed to get a recovery conflict by:
> > > 1. Setting hot_standby_feedback to off
> > > 2. Creating a logical replication slot on standby
> > > 3. Creating a table on master, and inserting some data
> > > 4. Running: VACUUM FULL;
> > >
> > > This gives WARNING messages in the standby log file:
> > > 2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
> > > catalog xmin 474 conflicts with removed xid 477
> > > 2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
> > > for Heap2/CLEAN: remxid 477
> > >
> > > But I did not add such a testcase to the test file, because with the
> > > current patch, it does not do anything with the slot; it just keeps
> > > emitting WARNINGs in the log file, so we can't test this scenario as of
> > > now using the tap test.
> >
> > I am going ahead with the drop-the-slot way of handling the recovery
> > conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> > slot. It seems the required locks are already in place inside the for
> > loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> > ReplicationSlotDropPtr() when the slot xmin conflict is found.
>
> Cool.
>
> > As explained above, the only way I could reproduce the conflict is by
> > turning hot_standby_feedback off on the standby, creating and inserting
> > into a table on the master, and then running VACUUM FULL. But after
> > doing this, I am not able to verify whether the slot is dropped, because
> > on the standby, any simple psql command thereafter waits on a lock
> > acquired on a system catalog, e.g. pg_authid. Working on it.
>
> I think that indicates a bug somewhere. If replay progressed, it should
> have killed the slot, and continued replaying past the VACUUM
> FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
> compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
> at a backtrace of the startup process.
Oops, it was my own change that caused the hang. Sorry for the noise.

With wal_debug turned on, I found that after replaying the LOCK record for
the catalog pg_authid, the startup process was not releasing the lock
because it had actually got stuck inside ReplicationSlotDropPtr() itself:
in ResolveRecoveryConflictWithSlots(), ReplicationSlotControlLock was
already held in shared mode while iterating through the slots, and
ReplicationSlotDropPtr() then tries to take the same lock in exclusive
mode to clear slot->in_use, leading to a deadlock.

I fixed that by releasing the shared lock before calling
ReplicationSlotDropPtr(), and then restarting the scan of the slots from
the beginning, since the lock was released in between.
ReplicationSlotCleanup() does a similar thing.

Attached is a rebased version of your patch
logical-decoding-on-standby.patch. This v2 version also has the above
changes. It also includes the tap test file, which is still a WIP, mainly
because I have yet to add the recovery conflict handling scenarios.

I see that you have already committed the
move-latestRemovedXid-computation-for-nbtree-xlog related changes.
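For reference, the conflict-resolution loop now follows roughly the
pattern below. This is a simplified sketch rather than the exact patch
code: the function signature and the conflict test are approximations, it
assumes the function lives in slot.c next to ReplicationSlotDropPtr(),
and it glosses over details such as dealing with a backend that is
actively using the slot.

static void
ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
{
	int			i;

restart:
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		TransactionId slot_xmin;

		if (!s->in_use)
			continue;

		SpinLockAcquire(&s->mutex);
		slot_xmin = s->data.catalog_xmin;
		SpinLockRelease(&s->mutex);

		/* Does the removed xid conflict with this slot's catalog_xmin? */
		if (TransactionIdIsValid(slot_xmin) &&
			TransactionIdPrecedesOrEquals(slot_xmin, xid))
		{
			/*
			 * ReplicationSlotDropPtr() takes ReplicationSlotControlLock
			 * exclusively to clear in_use, so release our shared lock
			 * first; otherwise we deadlock against ourselves.
			 */
			LWLockRelease(ReplicationSlotControlLock);

			/* (real code must also handle a backend using the slot) */
			ReplicationSlotDropPtr(s);

			/* The slot array may have changed under us; scan again. */
			goto restart;
		}
	}
	LWLockRelease(ReplicationSlotControlLock);
}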
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

logical-decoding-on-standby_v2.patch