On Fri, Jul 12, 2024 at 4:54 AM Euler Taveira <eu...@eulerto.com> wrote:
>
> On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
>
> May I ask you to look at another failure of the test that occurred today [1]?
>
>
> Thanks for the report!
>
> You are observing the same issue that Amit explained in [1]. The
> pg_create_logical_replication_slot returns the EndRecPtr (see
> slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr
> points to the next record and it is a future position for an idle server.
> That's why the recovery takes some time to finish: it is waiting for
> activity to advance the LSN position. Since you modified
> LOG_SNAPSHOT_INTERVAL_MS to create additional WAL records soon, the
> EndRecPtr position is reached rapidly and the recovery ends quickly.
>

If the recovery ends quickly (which is expected due to the reduced
LOG_SNAPSHOT_INTERVAL_MS), then why do we see "error: recovery timed
out"?

> Hayato proposes a patch [2] to create an additional WAL record that has
> the same effect as your little hack: increase the LSN position so that the
> recovery finishes soon. I don't like the solution although it seems simple
> to implement. As Amit said, if we know the ReadRecPtr, we could use it as
> the consistent LSN. The problem is that it is used by logical decoding but
> it is not exposed. [reading the code...] When the logical replication slot
> is created, restart_lsn points to the lastReplayedEndRecPtr (see
> ReplicationSlotReserveWal()) that is the last record replayed.
>

'lastReplayedEndRecPtr' should be the value of restart_lsn on a
standby (when RecoveryInProgress is true), but here we are creating
slots on the publisher/primary, so shouldn't restart_lsn point to the
"latest WAL insert pointer"?
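
For reference, the EndRecPtr versus ReadRecPtr distinction discussed above
can be illustrated with a toy Python model. This is only a sketch, not
PostgreSQL code: WalRecord and target_reached are hypothetical names, the
LSNs are made-up integers, and the stop rule (the target is reached once a
record starting at or after it has been replayed) is a simplifying
assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    start: int  # LSN where the record begins (ReadRecPtr once it is read)
    end: int    # LSN just past the record (EndRecPtr once it is read)


def target_reached(wal, target_lsn):
    """Toy rule (assumption): recovery reaches the target once some
    replayed record starts at or after target_lsn."""
    return any(rec.start >= target_lsn for rec in wal)


# An idle server: two records, nothing written after the second one.
wal = [WalRecord(100, 140), WalRecord(140, 200)]
last = wal[-1]
read_rec_ptr = last.start  # start of the last record read
end_rec_ptr = last.end     # where the *next* record would begin

# Using EndRecPtr as the target: no record exists there yet, so we wait.
idle_end = target_reached(wal, end_rec_ptr)

# Using ReadRecPtr as the target: the record already exists.
idle_read = target_reached(wal, read_rec_ptr)

# Once any activity writes one more record, EndRecPtr becomes reachable.
wal_after = wal + [WalRecord(200, 230)]
active_end = target_reached(wal_after, end_rec_ptr)
```

In this model, idle_end is False while idle_read and active_end are True,
which matches the explanation that an idle server has to wait for new WAL
before an EndRecPtr target can be reached.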

-- 
With Regards,
Amit Kapila.

