On Fri, Jul 12, 2024 at 4:54 AM Euler Taveira <eu...@eulerto.com> wrote:
>
> On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
> > May I ask you to look at another failure of the test occurred today [1]?
> >
> Thanks for the report!
>
> You are observing the same issue that Amit explained in [1]. The
> pg_create_logical_replication_slot returns the EndRecPtr (see
> slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr points
> to the next record and it is a future position for an idle server. That's why
> the recovery takes some time to finish because it is waiting for an activity to
> increase the LSN position. Since you modified LOG_SNAPSHOT_INTERVAL_MS to create
> additional WAL records soon, the EndRecPtr position is reached rapidly and the
> recovery ends quickly.
>
If the recovery ends quickly (which is expected due to the reduced
LOG_SNAPSHOT_INTERVAL_MS), then why do we see "error: recovery timed out"?

> Hayato proposes a patch [2] to create an additional WAL record that has the same
> effect from you little hack: increase the LSN position to allow the recovery
> finishes soon. I don't like the solution although it seems simple to implement.
> As Amit said if we know the ReadRecPtr, we could use it as consistent LSN. The
> problem is that it is used by logical decoding but it is not exposed. [reading
> the code...] When the logical replication slot is created, restart_lsn points to
> the lastReplayedEndRecPtr (see ReplicationSlotReserveWal()) that is the last
> record replayed.
>

'lastReplayedEndRecPtr' should be the value of restart_lsn on a standby
(when RecoveryInProgress is true), but here we are creating slots on the
publisher/primary, so shouldn't restart_lsn point to the "latest WAL insert
pointer"?

--
With Regards,
Amit Kapila.