On Tue, 2 Apr 2019 at 21:34, Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> > On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> > > I managed to get a recovery conflict by:
> > > 1. Setting hot_standby_feedback to off
> > > 2. Creating a logical replication slot on standby
> > > 3. Creating a table on master, and inserting some data
> > > 4. Running: VACUUM FULL;
> > >
> > > This gives WARNING messages in the standby log file:
> > > 2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
> > > catalog xmin 474 conflicts with removed xid 477
> > > 2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
> > > for Heap2/CLEAN: remxid 477
> > >
> > > But I did not add such a testcase to the test file, because with the
> > > current patch, it does not do anything with the slot; it just keeps
> > > emitting WARNINGs in the log file, so we can't test this scenario as of
> > > now using the tap test.
> >
> > I am going ahead with the drop-the-slot way of handling the recovery
> > conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> > slot. It seems the required locks are already in place inside the for
> > loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> > ReplicationSlotDropPtr() when the slot xmin conflict is found.
>
> Cool.
>
> > As explained above, the only way I could reproduce the conflict is by
> > turning hot_standby_feedback off on the standby, creating and inserting
> > into a table on the master, and then running VACUUM FULL. But after
> > doing this, I am not able to verify whether the slot is dropped, because
> > on the standby, any simple psql command thereafter waits on a lock
> > acquired on a system catalog, e.g. pg_authid. Working on it.
>
> I think that indicates a bug somewhere. If replay progressed, it should
> have killed the slot, and continued replaying past the VACUUM
> FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
> compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
> at a backtrace of the startup process.
Oops, it was my own change that caused the hang. Sorry for the noise.

With wal_debug turned on, I found that after replaying the LOCK record for
the catalog pg_authid, the startup process was not releasing the lock
because it had actually got stuck inside ReplicationSlotDropPtr() itself:
in ResolveRecoveryConflictWithSlots(), ReplicationSlotControlLock was
already held in shared mode while iterating through the slots, and
ReplicationSlotDropPtr() then tries to take the same lock in exclusive
mode to clear slot->in_use, leading to a deadlock.

I fixed that by releasing the shared lock before calling
ReplicationSlotDropPtr(), and then restarting the scan of the slots from
the beginning, since the lock was released in between.
ReplicationSlotCleanup() does a similar thing.

Attached is a rebased version of your patch
logical-decoding-on-standby.patch. This v2 version also has the above
changes. It also includes the tap test file, which is still a WIP, mainly
because I have yet to add the recovery conflict handling scenarios.

I see that you have already committed the
move-latestRemovedXid-computation-for-nbtree-xlog related changes.
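For reference, the conflict-resolution loop now follows roughly the
pattern below. This is a simplified sketch rather than the exact patch
code: the function signature and the conflict test are approximations, it
assumes the function lives in slot.c next to ReplicationSlotDropPtr(),
and it glosses over details such as dealing with a backend that is
actively using the slot.

static void
ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
{
	int			i;

restart:
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		TransactionId slot_xmin;

		if (!s->in_use)
			continue;

		SpinLockAcquire(&s->mutex);
		slot_xmin = s->data.catalog_xmin;
		SpinLockRelease(&s->mutex);

		/* Does the removed xid conflict with this slot's catalog_xmin? */
		if (TransactionIdIsValid(slot_xmin) &&
			TransactionIdPrecedesOrEquals(slot_xmin, xid))
		{
			/*
			 * ReplicationSlotDropPtr() takes ReplicationSlotControlLock
			 * exclusively to clear in_use, so release our shared lock
			 * first; otherwise we deadlock against ourselves.
			 */
			LWLockRelease(ReplicationSlotControlLock);

			/* (real code must also handle a backend using the slot) */
			ReplicationSlotDropPtr(s);

			/* The slot array may have changed under us; scan again. */
			goto restart;
		}
	}
	LWLockRelease(ReplicationSlotControlLock);
}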
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

logical-decoding-on-standby_v2.patch