Hi,

On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> > I managed to get a recovery conflict by:
> > 1. Setting hot_standby_feedback to off on the standby
> > 2. Creating a logical replication slot on the standby
> > 3. Creating a table on the master and inserting some data
> > 4. Running: VACUUM FULL;
> >
> > This produces WARNING messages in the standby log file:
> >
> > 2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
> > catalog xmin 474 conflicts with removed xid 477
> > 2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
> > for Heap2/CLEAN: remxid 477
> >
> > But I did not add such a test case to the test file, because with the
> > current patch nothing is done with the slot; it just keeps emitting
> > WARNINGs in the log file, so we can't test this scenario with the TAP
> > test as of now.
>
> I am going ahead with the drop-the-slot way of handling the recovery
> conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> slot. It seems the required locks are already in place inside the for
> loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> ReplicationSlotDropPtr() when the slot xmin conflict is found.
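To make the discussed approach concrete, here is a minimal sketch (an
illustration based on the description above, not the actual patch): the
signature of ResolveRecoveryConflictWithSlots() is a guess, handling of
currently-active slots is glossed over, and ReplicationSlotDropPtr(),
which is currently static in slot.c, is assumed to be callable here. One
subtlety: ReplicationSlotDropPtr() takes ReplicationSlotControlLock in
exclusive mode itself, so the shared lock used for the scan has to be
released before dropping the slot, and the scan restarted afterwards.

    /*
     * Hypothetical sketch, not the actual patch: drop any slot whose
     * catalog_xmin precedes the xid whose catalog tuples were removed.
     * Would plausibly live in replication/slot.c, since it relies on
     * ReplicationSlotDropPtr().
     */
    static void
    ResolveRecoveryConflictWithSlots(TransactionId removed_xid)
    {
        int         i;

    restart:
        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
            TransactionId catalog_xmin;
            NameData    slotname;

            if (!s->in_use)
                continue;

            SpinLockAcquire(&s->mutex);
            catalog_xmin = s->data.catalog_xmin;
            slotname = s->data.name;
            SpinLockRelease(&s->mutex);

            /* Conflict if catalog rows the slot still needs were removed. */
            if (TransactionIdIsValid(catalog_xmin) &&
                TransactionIdPrecedesOrEquals(catalog_xmin, removed_xid))
            {
                /*
                 * ReplicationSlotDropPtr() retakes ReplicationSlotControlLock
                 * in exclusive mode, so release our shared lock first.
                 */
                LWLockRelease(ReplicationSlotControlLock);

                ereport(WARNING,
                        (errmsg("dropping slot \"%s\" conflicting with removed xid %u",
                                NameStr(slotname), removed_xid)));
                ReplicationSlotDropPtr(s);

                /* The slot array may have changed under us; rescan. */
                goto restart;
            }
        }
        LWLockRelease(ReplicationSlotControlLock);
    }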
Cool.

> As explained above, the only way I could reproduce the conflict is by
> turning hot_standby_feedback off on the standby, creating and inserting
> into a table on the master, and then running VACUUM FULL. But after
> doing this, I am not able to verify whether the slot is dropped,
> because on the standby any simple psql command thereafter waits on a
> lock on a system catalog (e.g. pg_authid). Working on it.

I think that indicates a bug somewhere. If replay progressed, it should
have killed the slot and continued replaying past the VACUUM FULL.
Those symptoms suggest replay is stuck somewhere. I suggest
a) compiling with WAL_DEBUG enabled and turning on wal_debug=1, and
b) looking at a backtrace of the startup process.

Greetings,

Andres Freund
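For reference, WAL_DEBUG is a compile-time symbol in
src/include/pg_config_manual.h, and the wal_debug GUC only exists in
builds where it is defined. One way to enable it in a source build:

    /* in src/include/pg_config_manual.h, uncomment the existing line: */
    #define WAL_DEBUG

After rebuilding, set wal_debug = on on the standby (e.g. in
postgresql.conf), which makes the startup process log each record as it
is replayed. For the backtrace, attach a debugger to the startup
process, e.g. gdb -p <startup pid> followed by bt, which should show
where replay is blocked.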