On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <to...@vondra.me> wrote:
> > > A candidate_restart_lsn and candidate_restart_valid pair just tells
> > > that we may set slot's data.restart_lsn to candidate_restart_lsn when
> > > the downstream confirms an LSN >= candidate_restart_valid. That pair
> > > can never get inaccurate. It may get stale but never inaccurate. So
> > > wiping those fields from ReplicationSlot is unnecessary.
> >
>
> Isn't this issue a proof that those fields *can* get inaccurate? Or what
> do you mean by "stale but not inaccurate"?

While processing a running transactions WAL record RT1, the WAL sender
learns that the slot's data.restart_lsn can be set to
candidate_restart_lsn1 once confirmed_flush_lsn moves past
candidate_restart_valid1. In that sense the pair is accurate irrespective
of when those candidates were generated. But when the next running
transactions record RT2 is processed, a new pair candidate_restart_lsn2
and candidate_restart_valid2 is produced (candidate_restart_valid2 >
candidate_restart_valid1). Thus, if confirmed_flush_lsn >=
candidate_restart_valid2, it is better to set data.restart_lsn =
candidate_restart_lsn2 instead of candidate_restart_lsn1. In that sense
the first pair becomes stale. If we maintained all such pairs in a sorted
list, we could set data.restart_lsn to the latest applicable
candidate_restart_lsn for a received confirmed_flush_lsn and then remove
all the pairs up to and including the one just applied. But we don't
maintain such a list, hence the business of resetting
candidate_restart_lsn and candidate_restart_valid to indicate that we
have consumed the previous pair and are ready for a new one. We don't
keep overwriting candidate_restart_lsn and candidate_restart_valid while
a pair is pending, to avoid chasing a moving target and never getting to
update data.restart_lsn.
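To make that concrete, here is a rough standalone sketch of that
bookkeeping as I understand it. It is only a toy model of the idea, not
the actual code: the ToySlot struct and the toy_* functions are made-up
stand-ins, the names only loosely mirror the ReplicationSlot fields, and
locking, the xmin candidates and everything else are left out.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;
    #define InvalidXLogRecPtr ((XLogRecPtr) 0)

    typedef struct ToySlot
    {
        XLogRecPtr  data_restart_lsn;         /* slot->data.restart_lsn */
        XLogRecPtr  candidate_restart_valid;  /* pair usable once confirmed */
        XLogRecPtr  candidate_restart_lsn;    /* restart_lsn to install */
    } ToySlot;

    /* Decoded a running transactions record at current_lsn. */
    static void
    toy_increase_restart_decoding(ToySlot *slot, XLogRecPtr current_lsn,
                                  XLogRecPtr restart_lsn)
    {
        /* capture a new pair only if the previous one has been consumed */
        if (slot->candidate_restart_valid == InvalidXLogRecPtr)
        {
            slot->candidate_restart_valid = current_lsn;
            slot->candidate_restart_lsn = restart_lsn;
        }
    }

    /* Downstream confirmed it has flushed everything up to confirmed_lsn. */
    static void
    toy_confirm_received_location(ToySlot *slot, XLogRecPtr confirmed_lsn)
    {
        if (slot->candidate_restart_valid != InvalidXLogRecPtr &&
            confirmed_lsn >= slot->candidate_restart_valid)
        {
            slot->data_restart_lsn = slot->candidate_restart_lsn;

            /* mark the pair consumed, so a newer pair can be captured */
            slot->candidate_restart_valid = InvalidXLogRecPtr;
            slot->candidate_restart_lsn = InvalidXLogRecPtr;
        }
    }

    int
    main(void)
    {
        ToySlot slot = {100, InvalidXLogRecPtr, InvalidXLogRecPtr};

        toy_increase_restart_decoding(&slot, 500, 300); /* RT1: pair (300, 500) */
        toy_increase_restart_decoding(&slot, 900, 700); /* RT2: ignored, RT1 pending */
        toy_confirm_received_location(&slot, 600);      /* 600 >= 500: install 300 */

        printf("restart_lsn = %llu\n",
               (unsigned long long) slot.data_restart_lsn); /* prints 300 */
        return 0;
    }

Running this installs restart_lsn = 300; the RT2 pair is never captured
because the RT1 pair is still pending, which is the "don't chase a moving
target" behaviour described above.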
> > What should ideally happen is we should ignore candidates produced by
> > older running transactions WAL records after WAL sender restart. This
> > is inline with what logical replication does with transactions
> > committed before slot's confirmed_flush_lsn - those are ignored. But
> > the criteria for ignoring running transactions records is slightly
> > different from that for transactions. If we ignore
> > candidate_restart_lsn which has candidate_restart_valid <=
> > confirmed_flush_lsn, we might lose some opportunity to advance
> > data.restart_lsn. Instead we should ignore any candidate_restart_lsn
> > <= data.restart_lsn especially before WAL sender finds first change to
> > send downstream. We can do that in SnapBuildProcessRunningXacts() by
> > accessing MyReplicationSlot, taking lock on it and then comparing
> > data.restart_lsn with txn->restart_decoding_lsn before calling
> > LogicalIncreaseRestartDecodingForSlot(). But then
> > LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
> > after applying your patch 0004. The only downside of 0004 is that the
> > logic to ignore candidates produced by a running transactions record
> > is not clearly visible in SnapBuildProcessRunningXacts(). For a
> > transaction which is ignored the logic to ignore the transaction is
> > visible in DecodeCommit() or DecodeAbort() - where people are likely
> > to look for that logic. We may add a comment to that effect in
> > SnapBuildProcessRunningXacts().
>
> I have thought about just doing something like:
>
>     slot->data.restart_lsn = Max(slot->data.restart_lsn, new_lsn);
>
> and similar for the other LSN fields. And it did resolve the issue at
> hand, of course. But it seems sloppy, and I'm worried it might easily
> mask actual issues in other cases.

Agreed.

> I'm still of the opinion that (with the exception of a reconnect), these
> places should not need to deal with values that go backwards. It should
> work just fine without the Max(), and we should add Asserts() to check
> that it's always a higher LSN.

Agreed.

> I haven't thought about making SnapBuildProcessRunningXacts() more
> complex to consider this stuff. But I doubt we'd like to be accessing
> slots from that code - it has nothing to do with slots. If anything,
> tweaking LogicalIncreaseRestartDecodingForSlot() seems more appropriate,

Agree.

> but I'm still wondering if the current coding was intentional and we're
> just missing why it was written like this.

Interestingly, the asymmetry between the two functions was added in the
same commit b89e151054a05f0f6d356ca52e3b725dd0505e53. I doubt it was
intentional; the comments at both places say the same thing. This problem
is probably as old as that commit.

> For the reconnect, I think it's a bit as if the primary restarted right
> before the reconnect. That could happen anyway, and we need to handle
> that correctly - if not, we have yet another issue, IMHO. And with the
> restart it's the same as writing the slot to disk and reading it back,
> which also doesn't retain most of the fields. So it seems cleaner to do
> the same thing and just reset the various fields.
>
> There's also the question of backpatching - the simpler the better, and
> this I think just resetting the fields wins in this regard. The main
> question is whether it's correct - I think it is. I'm not too worried
> about efficiency very much, on the grounds that this should not matter
> very often (only after unexpected restart). Correctness > efficiency.

If the slot's restart_lsn is very old before the disconnect, we lose an
opportunity to update restart_lsn and thus release some resources
earlier. However, that opportunity lasts only a short while: on a fast
enough machine data.restart_lsn will be updated anyway after processing
all the running transactions WAL records. So I am also of the opinion
that the efficiency argument won't stand here. But I doubt that "not
resetting" is wrong. It looks more intentional to me than the asymmetry
between LogicalIncreaseRestartDecodingForSlot() and
LogicalIncreaseXminForSlot(). This is our chance to settle that asymmetry
forever :D.

I don't have a strong argument against resetting the candidate fields in
the slot at SlotRelease. However, resetting effective_xmin and
effective_catalog_xmin does not look good: the old values may already
have been used to advance the xmin horizon. I have not read the code
closely enough to know whether the on-disk xmin and effective_xmin are
always in sync when computing the xmin horizon.

--
Best Wishes,
Ashutosh Bapat